October 31, 2022
Written by
Laurel Orr
Avanika Narayan

Data wrangling with foundation models

State-of-the-art data wrangling is not self-service, leading organizations to use legacy, slow-moving technology. Using cutting-edge foundation models, Numbers Station lowers the barrier to state-of-the-art data wrangling, accelerating existing wrangling processes and enabling data analysts to derive new insights from their toughest data.

For data to be leveraged in business analyses, it must first be transformed from raw data into usable data, a process called data wrangling or data preparation. It consists of all the mundane tasks that data workers manually perform to make their raw, messy data usable, such as data structuring, cleaning, deduplication, or enrichment. This process is a fundamental step required before any valuable data analysis can be done. However, it remains a major challenge for organizations dealing with large volumes of data, and can take up to 80% of their time.

Existing data wrangling software solutions broadly fall into two main categories: self-service solutions and custom in-house built solutions. Self-service solutions are usually easy to use via an intuitive user interface and integrate nicely with enterprise data software tools. However, these solutions are built using legacy technology that has limited capabilities. As a result, they are usually restricted to a narrow set of data sources and tasks, and are difficult or impossible to customize for complex enterprise-specific tasks (e.g. structuring unstructured data, transforming data according to organization-specific standards, or matching tables with different schemas).

On the other hand, organizations may decide to invest in custom in-house built wrangling solutions to prepare their toughest data. These are usually based on intelligent AI models that yield state-of-the-art quality and capabilities (e.g. automation of complex, organization-specific wrangling tasks). However, these solutions can take weeks or months to implement. First, they need to be engineered by technical experts, which sets a high barrier to entry for non-technical data workers. Then, these models need to be trained with manually labeled data, which can take days to months to acquire, slowing down processes. Finally, models tend to be specialized for a specific data type or task, leading to disparate systems that engineers must maintain simultaneously for quality control. Consequently, it can take several months for data workers to implement custom wrangling pipelines before getting any results.

In this blog post, we describe our research effort in collaboration with the Stanford AI Lab, where we pioneered the use of cutting-edge Foundation Models (FMs) on data wrangling problems. These FMs are generative models that have billions of parameters, take text as input, and generate text as output. They can be adapted to a wide variety of downstream tasks by simply changing the input text (i.e., the prompt). Unlike traditional AI models that suffer from a high barrier to entry, FMs can be used by any data worker via their natural language interface, without any custom pipeline code. Additionally, these models can be used out-of-the-box with limited to no labeled data, reducing time to value by orders of magnitude compared to traditional AI solutions. Finally, the same model can be used on a wide variety of tasks, alleviating the need to maintain complex, hand-engineered pipelines. Because of all these properties, our goal in this work was to understand whether FMs could lower the barrier to entry to state-of-the-art data wrangling.

To use a foundation model for structured tabular data tasks, we have to address two main technical questions. First, how do we feed tabular data to these models, which are trained and designed for unstructured text? And second, how do we cast the data tasks as natural language generation tasks? In other words, how do we craft good prompts for inherently structured tasks?

To answer the first question, we followed the approach proposed by Li et al. for tabular data serialization. Concretely, for a table with column names attr_1, …, attr_m and for a table entry with values val_1, …, val_m, we serialize the data as a string representation as follows:

Note that, unlike Li et al., we do not use the [CLS] or [SEP] tokens, which are specific to the tokenizers of masked language models (e.g. BERT).
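To make the serialization step concrete, here is a minimal Python sketch. The exact string layout is an assumption on our part for illustration: each attribute is written as `attr: val` and the pairs are joined by single spaces, with no special tokens. The function name `serialize_row` and the handling of missing values are hypothetical choices, not a published API.

```python
from typing import Mapping, Optional


def serialize_row(row: Mapping[str, Optional[str]]) -> str:
    """Flatten one table entry into a single 'attr: val' string.

    Assumed layout (illustrative): "attr_1: val_1 ... attr_m: val_m".
    No [CLS] or [SEP] tokens are added.
    """
    parts = []
    for attr, val in row.items():
        # Assumption: a missing value is rendered as an empty string,
        # leaving just "attr:" in the output.
        text = "" if val is None else str(val)
        parts.append(f"{attr}: {text}".rstrip())
    return " ".join(parts)


if __name__ == "__main__":
    entry = {"title": "Apple MacBook Air 13-inch", "price": "999.00"}
    print(serialize_row(entry))
```

Because the output is plain text, the same function works for any table schema; only the column names and values change.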

We then proposed to cast a variety of structured data tasks as generation tasks. Concretely, we construct natural language question answering templates for each task. For instance, the template for a product entity matching task is:

In our paper, we proposed prompt templates for other data wrangling tasks as well, such as data imputation, error detection, and schema matching. At inference time, we complete the above template with the serialized tabular data for each table entry. An example prompt to evaluate whether two Apple MacBook products are the same based on their title and price attributes is shown in Figure 1. The prompt is then fed into the model, which produces a yes/no answer that is used as the final result.
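The inference flow described here (fill a template with two serialized entries, send it to the model, read back yes or no) could be sketched as follows. The template wording, the `attr: val` serialization, and all function names are illustrative assumptions. The model is passed in as a plain callable, `complete`, so that no specific model API is implied; `fake_model` is a stand-in used only to make the example run.

```python
from typing import Callable, Mapping

Row = Mapping[str, str]

# Hypothetical template wording, written for illustration only.
MATCH_TEMPLATE = (
    "Product A is {a}. Product B is {b}. "
    "Are Product A and Product B the same? Yes or No?"
)


def serialize_row(row: Row) -> str:
    # Assumed 'attr: val' layout, pairs joined by spaces.
    return " ".join(f"{attr}: {val}" for attr, val in row.items())


def build_match_prompt(a: Row, b: Row) -> str:
    """Complete the question template with two serialized table entries."""
    return MATCH_TEMPLATE.format(a=serialize_row(a), b=serialize_row(b))


def parse_answer(generated: str) -> bool:
    """Map the model's free-text output to a boolean match decision."""
    return generated.strip().lower().startswith("yes")


def is_match(a: Row, b: Row, complete: Callable[[str], str]) -> bool:
    # `complete` is any function that maps a prompt to generated text.
    return parse_answer(complete(build_match_prompt(a, b)))


def fake_model(prompt: str) -> str:
    # Stand-in for a real foundation model call; always answers yes.
    return " Yes"


if __name__ == "__main__":
    a = {"title": "Apple MacBook Air 13-inch", "price": "999.00"}
    b = {"title": "MacBook Air 13in (Apple)", "price": "999"}
    print(build_match_prompt(a, b))
    print(is_match(a, b, fake_model))
```

Keeping prompt construction and answer parsing separate from the model call means the same two functions can be reused with any text-generation backend.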

The above figure illustrates how to use an FM in the zero-shot setting, i.e. with no demonstrations provided to the model. It is also possible to improve the performance of FMs using few-shot demonstrations, i.e. by including a few examples of desired input-output pairs in the prompt template. In our work, we also study the benefits of adding a few demonstrations to the prompt on these data tasks.
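The difference between the zero-shot and few-shot settings comes down to what is placed in front of the final question. The sketch below shows one way to assemble such a prompt; the question wording, the blank-line separator between demonstrations, and the names `render_question` and `build_few_shot_prompt` are assumptions made for illustration.

```python
from typing import Mapping, Sequence, Tuple

Row = Mapping[str, str]
Demo = Tuple[Row, Row, bool]  # (entry A, entry B, whether they match)


def serialize_row(row: Row) -> str:
    # Assumed 'attr: val' layout, pairs joined by spaces.
    return " ".join(f"{attr}: {val}" for attr, val in row.items())


def render_question(a: Row, b: Row) -> str:
    # Hypothetical question wording for a product matching task.
    return (
        f"Product A is {serialize_row(a)}. "
        f"Product B is {serialize_row(b)}. "
        "Are Product A and Product B the same?"
    )


def build_few_shot_prompt(demos: Sequence[Demo], a: Row, b: Row) -> str:
    """Prepend answered demonstrations to the unanswered query.

    With an empty `demos` sequence this reduces to the zero-shot prompt.
    """
    blocks = [
        f"{render_question(da, db)} {'Yes' if label else 'No'}"
        for da, db, label in demos
    ]
    # The query comes last and is left unanswered for the model to complete.
    blocks.append(render_question(a, b))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        ({"title": "iPhone 13", "price": "799"},
         {"title": "Apple iPhone 13", "price": "799.00"}, True),
        ({"title": "iPhone 13", "price": "799"},
         {"title": "iPhone 13 case", "price": "19"}, False),
    ]
    query_a = {"title": "Apple MacBook Air 13-inch", "price": "999.00"}
    query_b = {"title": "MacBook Pro 16-inch", "price": "2499.00"}
    print(build_few_shot_prompt(demos, query_a, query_b))
```

Which demonstrations to include is a separate design choice; this sketch simply takes them as given.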

We evaluated the out-of-the-box performance of off-the-shelf foundation models (more specifically, GPT-3) on challenging data wrangling tasks, including data imputation (filling missing entries in a table), data matching (identifying similar records across different structured sources), and error detection (detecting erroneous entries in a table). We found that even with no demonstrations, these models achieve reasonable quality on multiple challenging data wrangling benchmarks. Additionally, with only 10 manually chosen demonstrations, GPT-3 was able to match or outperform the existing state-of-the-art on 7 out of 10 benchmark datasets.

These results support the claim that FMs can be applied to data wrangling tasks, unlocking new opportunities to bring state-of-the-art, automated data wrangling to the self-service analytics world. However, there are many technical challenges to using very large foundation models on real enterprise data wrangling problems. One of the biggest is the size of these models: GPT-3 and other very large FMs require a very large amount of compute, which raises scalability and cost issues. There are also other challenges beyond size that need to be addressed to make these models usable in the enterprise, such as integration with data software, model quality control, and adaptability to domain-specific tasks. At Numbers Station, we are building enterprise-ready FMs that are compute-efficient and achieve state-of-the-art quality on these data wrangling tasks. If you are interested in learning more about this, please contact us or join the Numbers Station waitlist.