We are thrilled to introduce NSQL-Llama-2-7b, a SQL generation foundation model (FM) built on top of Meta's Llama 2. Through extensive training on NSText2SQL data, NSQL-Llama-2-7b achieves up to a 15.5-point execution accuracy improvement over our previous NSQL 6b model. Further, we applied NSQL-Llama-2-7b to one of our customers' production workloads and measured a 7.8-point improvement over GPT-4. We are excited about the progress this model represents: another significant milestone in our mission to empower every enterprise to develop custom analytics copilots that precisely cater to their unique needs.
Open-source FMs for Analytics
At Numbers Station, our goal is to bring the power of FMs to the modern data stack and democratize access to insights by providing enterprises with models they can fully own and control. In line with this vision, just under a month ago we launched the initial generation of NSQL models and their corresponding training data. This first step marked the beginning of our efforts to improve the SQL generation performance of open-source models, which enterprises can use commercially and develop further. We are thrilled that this release has already garnered more than 4,000 downloads on HuggingFace, a testament to the strong interest from the community.
Within days of our NSQL release, Meta introduced Llama 2, a new state-of-the-art open-source foundation model that quickly gained recognition as one of the best in the open-source landscape today. Its outstanding performance across diverse benchmarks, combined with a suite of model sizes far smaller than giants like GPT-3, opens up remarkable possibilities in the field of AI. Moreover, the model's permissive license empowers enterprises to leverage its potential and build their own private models. Thanks to Meta's efforts, Llama 2 is now freely accessible for both open-source and commercial use, ushering in a new era of AI innovation and fostering collaborative development within the community.
Building on the momentum of both NSQL and Llama 2, we are excited to unveil NSQL-Llama-2-7b. To build it, we fine-tuned the Llama-2-7b model on our publicly released NSText2SQL dataset, following the same training process described in our previous blog post. In the remainder of this blog, we describe NSQL-Llama-2-7b in more detail by presenting experimental results and a brief technical analysis.
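To give a rough sense of what inference looks like, the sketch below assembles a text-to-SQL prompt in the general style used by our NSQL models: schema DDL, an instruction comment, the question, and a trailing `SELECT` cue for the model to complete. The helper name, exact template wording, and toy schema here are illustrative assumptions, not copied from the release.

```python
def build_prompt(schema_ddl: str, question: str) -> str:
    """Assemble a text-to-SQL prompt: schema DDL, an instruction,
    the question as a SQL comment, and a trailing 'SELECT' cue
    that the model is expected to complete."""
    return (
        f"{schema_ddl}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n\n"
        f"-- {question}\n\n"
        "SELECT"
    )

# Hypothetical single-table schema for illustration.
schema = "CREATE TABLE stadium (id INT, name TEXT, capacity INT);"
prompt = build_prompt(schema, "What is the maximum capacity of all stadiums?")
print(prompt)
```

The generated completion (e.g. ` MAX(capacity) FROM stadium`) is then appended to the trailing `SELECT` to form the final query.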
The results of our experiments are summarized in Table 1. Through our fine-tuning efforts, NSQL-Llama-2-7b reaches an execution accuracy of 75 on the Spider benchmark, a substantial 46-point improvement over the vanilla Llama-2-7b model. Moreover, compared to our previous NSQL 6b model, NSQL-Llama-2-7b delivers a large 11.4-point execution accuracy boost on Spider, along with a noteworthy 2.2-point execution accuracy advantage over ChatGPT on the same benchmark. This advancement goes a long way toward bridging the performance gap between open-source large language models and closed OpenAI models. For the GeoQuery benchmark, we observe similar execution accuracy[1] but measure a substantial 15.2-point improvement in exact match accuracy compared to NSQL 6b.
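For readers unfamiliar with the metric: execution accuracy counts a prediction as correct when the predicted and gold queries return the same result set when run against the database. A minimal, order-insensitive sketch of that check is below; the toy schema and data are our own invention, not drawn from the benchmark.

```python
import sqlite3

def execution_match(db: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """A prediction counts as correct when both queries return the same
    multiset of rows (order-insensitive, as in Spider-style evaluation)."""
    gold = sorted(db.execute(gold_sql).fetchall())
    pred = sorted(db.execute(pred_sql).fetchall())
    return gold == pred

# Tiny in-memory database for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE city (name TEXT, population INT)")
db.executemany("INSERT INTO city VALUES (?, ?)",
               [("Oslo", 700000), ("Bergen", 280000)])

# Different surface forms, same result set -> counts as correct.
print(execution_match(
    db,
    "SELECT name FROM city WHERE population > 500000",
    "SELECT name FROM city WHERE population >= 700000",
))  # -> True
```

Exact match accuracy, by contrast, compares the SQL strings (or their canonicalized components) rather than their results, which is why the two metrics can diverge, as they do on GeoQuery.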
In addition to academic benchmarks, we also evaluated NSQL-Llama-2-7b on a customer's production workload. This workload consists of 30 hand-crafted SQL queries that the customer runs in production daily, together with their corresponding natural language descriptions. Each query embeds essential business logic, ranging from single-table queries to complex joins over multiple tables. In Figure 1, we show that on this workload NSQL-Llama-2-7b significantly boosts execution accuracy, surpassing our previous NSQL 6b model by 15.5 points and outperforming GPT-4 by 7.8 points. We found that this is because NSQL-Llama-2-7b excels at handling the complex logic and longer contexts common in production workloads. This important customer win underscores the potential of NSQL-Llama-2-7b for wider adoption across enterprises.
Thanks to its SQL-specific training, the NSQL-Llama-2-7b model exhibits a stronger grasp of SQL semantics and improved handling of long contexts, which is particularly useful for complex, multi-table join queries.
- Figure 2 provides a detailed view of these improvements, showcasing a 43% improvement for "Join" queries and a 54% improvement for "Nested" queries in the Spider benchmark.
- Table 3 provides concrete examples (examples 1, 2, 3, and 4) that showcase the NSQL-Llama-2-7b model's new capabilities. In example 1, NSQL-Llama-2-7b generates a ratio-type query, a capability absent in our NSQL 6b model. In example 2, the model comprehends complex question logic and generates the correct SQL query, fixing a failure mode of NSQL 6b, which often overlooked vital filters in similar cases. Examples 3 and 4 demonstrate NSQL-Llama-2-7b's ability to choose the right join strategy among three tables and to deduce the correct query logic, overcoming NSQL 6b's difficulty in identifying the proper join pattern or the critical GROUP BY logic during query generation.
- When applied to the customer workload detailed above, NSQL-Llama-2-7b achieved a 60% improvement over our previous NSQL 6b model on questions requiring a large context (1,000 tokens or more in our evaluation setting). This demonstrates the model's superior handling of long-context scenarios, setting it apart from previous iterations of NSQL models.
Nonetheless, while NSQL-Llama-2-7b has made significant strides, we have also come across certain challenges. Specifically, the model encounters difficulties with calculations involving date-related columns, an important query type in business analytics for trend analysis (see example 5 in Table 2). To overcome this limitation, we plan to improve the model's performance by including more date-related query examples in the fine-tuning dataset.
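To make this failure mode concrete, a typical trend-analysis query of this kind groups rows by a derived month column; in SQLite that routing goes through `strftime`, precisely the sort of date manipulation the model currently struggles to generate. The table and values below are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, amount REAL, created_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2023-01-05"),
    (2, 20.0, "2023-01-20"),
    (3, 30.0, "2023-02-02"),
])

# Month-over-month revenue: the GROUP BY key must be derived from the
# date column via strftime, rather than grouped on the raw column.
rows = db.execute(
    "SELECT strftime('%Y-%m', created_at) AS month, SUM(amount) "
    "FROM orders GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # -> [('2023-01', 30.0), ('2023-02', 30.0)]
```

Generating the `strftime('%Y-%m', ...)` expression (instead of grouping on the raw date column) is the step where date-related queries tend to go wrong.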
Furthermore, as emphasized in our previous blog, there are instances where NSQL models require domain-specific information to generate precise queries (illustrated by examples 6 and 7 in Table 2). To address this and deepen the model's understanding of domain knowledge and business semantics, we offer fine-tuning on domain-specific data to our customers, a critical step in surfacing organizational insights in each distinct domain.
With the release of the NSQL-Llama-2-7b model, we invite data practitioners and AI enthusiasts to join us in shaping the future of AI-driven analytics. You can access NSQL-Llama-2-7b, along with weights, by visiting Hugging Face here or by trying out one of our example notebooks here. Together, we can continue to bridge the divide between cutting-edge AI capabilities and accessible, enterprise-owned models, fostering innovation and driving the data ecosystem forward. At Numbers Station, we are dedicated to driving positive change in the AI landscape, and we look forward to the impact of this next step in our journey.
If you are interested in bringing the power of NSQL-Llama-2-7b and related analytics FMs to your enterprise in a personalized manner, see more at https://www.numbersstation.ai. Join us on this exciting journey of transforming the world of data analytics and decision-making.
1. Execution accuracy did not improve because of case sensitivity in the emitted SQL (e.g., the model emits "California" where the database stores "california"). We plan to address this in a future release. Regardless, exact match accuracy provides a more accurate representation of the model's overall improvement.
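The failure mode in this footnote is easy to reproduce: SQLite compares TEXT values with the case-sensitive BINARY collation by default, so a capitalization mismatch between the emitted literal and the stored value empties the result set even when the query is otherwise correct. The schema and values below are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stores (id INT, state TEXT)")
db.execute("INSERT INTO stores VALUES (1, 'california')")

# '=' on TEXT uses the case-sensitive BINARY collation by default,
# so the capitalized literal matches no rows.
exact = db.execute(
    "SELECT COUNT(*) FROM stores WHERE state = 'california'").fetchone()[0]
wrong_case = db.execute(
    "SELECT COUNT(*) FROM stores WHERE state = 'California'").fetchone()[0]
print(exact, wrong_case)  # -> 1 0
```

Declaring the column with `COLLATE NOCASE` (or normalizing values at load time) would make both literals match, which is one way such an issue can be sidestepped at the database level.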