Advanced Computing in the Age of AI | Thursday, July 4, 2024

Matillion Bringing AI to Data Pipelines 

Data engineers historically have toiled away in the virtual basement, doing the dirty work of spinning raw data into something usable by data scientists and analysts. The advent of generative AI is changing the nature of the data engineer’s job, as well as the data she works with–and ETL software developer Matillion is right there in the thick of the change.

Matillion built its ETL/ELT business during the last tectonic shift in the big data industry: the move from on-prem analytics to running big data warehouses in the cloud. It takes expertise and knowledge to extract, transform, and load business data into cloud data warehouses like Amazon Redshift, and the folks at Matillion found ways to automate much of the drudgery through abundant connectors and low-code/no-code interfaces for building data pipelines.

Now we’re 18 months into the generative AI revolution, and the big data industry finds itself once again being rocked by seismic waves. Large language models (LLMs) are giving companies compelling new ways of serving customers, with text serving as both the interface and an actionable new data source.

But LLMs and the coterie of tools and techniques that surround them–vector databases, retrieval augmented generation (RAG), prompt engineering–are also enabling companies to do old things in new ways through copilots and autonomous agents. One of the older things that GenAI has targeted for a facelift is ETL/ELT, and Matillion is at the front of that transformation.

Matillion’s AI Strategy

Like many other data tool makers, Matillion has developed an AI strategy for adapting its business and tools to the GenAI revolution.


Copilots help with coding work (Phonlamai Photo/Shutterstock)

On the one hand, the company is updating its existing tools to enable data engineers to work with unstructured data (mostly text) that is the feedstock for GenAI applications. To that end, it’s adapted its software to work with the new data pipelines being built for GenAI applications. That includes connecting into various vector databases and RAG tools, such as LangChain, that developers are using to build GenAI applications, according to Ciaran Dynes, Matillion’s chief product officer.

“There’s a skill in building that. It doesn’t come cheap,” Dynes tells Datanami. “A lot of what we’ll see in Matillion is plain old ETL pipelines–prepping the data, cutting out all the junk, the non-printable characters in PDF, stripping out all the headers and footers. If you send those to an LLM, I’m afraid you’re paying for every single token.”
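In spirit, the cleanup Dynes describes is a plain pre-processing pass: strip non-printable characters left over from PDF extraction and drop known header/footer lines before any text reaches the LLM. A minimal Python sketch (the function name and the `boilerplate_lines` set are illustrative assumptions, not Matillion's actual implementation):

```python
def clean_for_llm(raw_text: str, boilerplate_lines: set[str]) -> str:
    """Strip non-printable characters and known header/footer lines
    before sending text to an LLM, so every token paid for is useful.

    `boilerplate_lines` is a hypothetical set of exact header/footer
    strings previously identified in the document (e.g. by a PDF parser).
    """
    cleaned = []
    for line in raw_text.splitlines():
        # Drop non-printable characters, a common artifact of PDF extraction.
        line = "".join(ch for ch in line if ch.isprintable()).strip()
        # Skip empty lines and repeated headers/footers.
        if not line or line in boilerplate_lines:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


# Example: a page header, a bell character, and a page footer are removed.
raw = "ACME Corp Annual Report\nRevenue grew 12%\x07 in Q3.\nPage 1"
print(clean_for_llm(raw, {"ACME Corp Annual Report", "Page 1"}))
```

Every character removed here is a token the LLM never has to be paid to read, which is the economic point Dynes is making.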

Matillion is also adopting GenAI technology to improve the workflow in its own products. Earlier this year, the company unveiled Matillion Copilot, which allows data engineers to use natural language commands to transform and prepare data.

The copilot, which will soon be in preview, gives engineers another option for building ETL/ELT pipelines alongside the existing low-code/no-code interface and drag-and-drop environment.

According to Dynes, the copilot works with Matillion’s Data Pipelining Language, or DPL, to convert natural language requests into data transformation scripts written in SQL, Python, dbt, LangChain, or other tools. In the right hands, Matillion Copilot can enable data analysts to build data transformation pipelines.

“A copilot will definitely help the business analyst be faster, cheaper, better, as opposed to needing, or always needing, the data engineer to fix the data for them,” Dynes said.

Creating AI Pipelines

Matillion developed its ETL/ELT chops working primarily with structured data. But GenAI works predominantly on unstructured data, including text and images, and that changes the nature of the new data pipelines that are being created.

For instance, mapping a particular data source to the appropriate table in the destination isn’t always straightforward, as there can be variations in the semantic meanings of data values that machines have a hard time picking up. This is where Matillion has focused much of its energy in creating Copilot.

In Dynes’ demo, viewer ratings of movies are loaded into a vector database in preparation for use in a prompt to an LLM. The trouble starts immediately with the word “movies.” What does that mean? Does it include “film”? What about “ratings”? Is that the same as “quality”?

“You can send in information called user context and you can teach a large language model, for the purpose of movie rating, ‘movie’ and ‘film’ are interchangeable words,” Dynes said. “What does quality mean? You look within the database, and maybe it doesn’t have the thing called ‘quality,’ but maybe it has ‘user score.’ To you and me, oh, that’s quality, but how does the machine know that quality and user score are interchangeable?”

To alleviate these challenges, Matillion gives users the ability to set rules within Copilot that link certain concepts together. As the user works in the copilot to fine-tune the data that will be used in the prompt, she’s able to see the results in a visual sample at the bottom of the screen. If the data transformation looks good, she can move on to the next thing. If there’s something off, she keeps iterating until it’s right.
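At its simplest, a rule of that kind is a lookup from a user-facing term to a canonical concept. The sketch below assumes a flat dictionary of rules; the rule names and mapping structure are hypothetical, standing in for whatever form Matillion's Copilot rules actually take:

```python
# Hypothetical rule table linking interchangeable terms, in the spirit
# of the user-context rules Dynes describes ("movie" == "film",
# "quality" == "user_score").
SYNONYM_RULES = {
    "film": "movie",
    "quality": "user_score",
}


def normalize_term(term: str, rules: dict[str, str] = SYNONYM_RULES) -> str:
    """Map a user-facing term onto its canonical concept name.

    Terms with no rule pass through unchanged, so new vocabulary
    degrades gracefully rather than failing.
    """
    key = term.lower()
    return rules.get(key, key)


print(normalize_term("Film"))     # canonicalized to "movie"
print(normalize_term("quality"))  # canonicalized to "user_score"
```

Applying such rules before the prompt is assembled means the LLM never has to guess that two column names mean the same thing.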

Ultimately, Matillion’s goal is to leverage AI to lower the barrier to entry for data transformation work, thereby allowing data analysts to develop their own data pipelines. That will leave data engineers free to tackle more difficult tasks, such as building new AI pipelines between unstructured data sources, vector databases, and LLMs.

“The hardest thing is basically teaching the data engineers the new practice called prompt engineering. It is different,” he said. “AI pipelines are not [traditional ETL]. It’s unstructured data, and the way that you work with using this natural language prompt is actually a real skill.”

Hallucinations are a concern. So is the tendency of LLMs to go into “Chatty Kathy” mode. Getting data engineers to prompt the LLMs, which are probabilistic entities, to give them more deterministic output requires some targeted teaching.

“If you do not tell the model to say ‘answer yes or no only,’ it will give you a big blob of text. ‘Well, I don’t know. Do you really like Martin Scorsese movies?’ It will just tell you a bunch of garbage,” Dynes said. “I don’t want to get all that stuff! If I don’t have a yes/no answer or a number, I can’t do analytics on it.”

Matillion Copilot is slated to be released later this year. The company is currently accepting applications to join the preview.

Related Items:

Matillion Looks to Unlock Data for AI

Matillion Debuts Data Integration Service on K8S

Matillion Unveils Streaming CDC in the Cloud
