We are thrilled to introduce NSQL-Llama-2-7b, a SQL generation foundation model (FM) built on top of Meta's Llama 2. Through extensive training on NSText2SQL data, NSQL-Llama-2-7b achieves up to a 15.5-point execution accuracy improvement over our previous NSQL 6b model. Further, we applied NSQL-Llama-2-7b to one of our customers' production workloads and measured a 7.8-point improvement over GPT-4. We are excited about the progress this model represents: another significant milestone in our mission to empower every enterprise to develop custom analytics copilots that precisely cater to their unique needs.
Open-source FMs for Analytics
At Numbers Station, our goal is to bring the power of FMs to the modern data stack and democratize access to insights by providing enterprises with models they can fully own and control. In line with this vision, just under a month ago we launched the initial generation of NSQL models and their corresponding training data. This first step marked the beginning of our efforts to improve the SQL generation performance of open-source models, which enterprises can use commercially and develop further. We are thrilled that this release has already garnered more than 4,000 downloads on HuggingFace, a testament to the strong interest from the community.
Within days of our NSQL release, Meta introduced Llama 2, a new state-of-the-art open-source foundation model that quickly gained recognition as one of the best in the open-source landscape today. Its outstanding performance across diverse benchmarks, combined with a suite of model sizes far smaller than giants like GPT-3, opens up remarkable possibilities in the field of AI. Moreover, the model's permissive license empowers enterprises to leverage its potential and build their own private models. Thanks to Meta's efforts, Llama 2 is now freely accessible for both open-source and commercial use, ushering in a new era of AI innovation and fostering collaborative development within the community.
Building on the momentum of both NSQL and Llama 2, we are excited to unveil NSQL-Llama-2-7b. To build it, we fine-tuned the Llama-2-7b model on our publicly released NSText2SQL dataset, following the same training process described in our previous blog post. In the remainder of this blog, we describe NSQL-Llama-2-7b in more detail by presenting experimental results and a brief technical analysis.
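To give a rough sense of what inference looks like, the sketch below assembles a text-to-SQL prompt in the general style used by our NSQL models: schema DDL, an instruction comment, the question, and a trailing `SELECT` cue for the model to complete. The helper name, exact template wording, and toy schema here are illustrative assumptions, not copied from the release.

```python
def build_prompt(schema_ddl: str, question: str) -> str:
    """Assemble a text-to-SQL prompt: schema DDL, an instruction,
    the question as a SQL comment, and a trailing 'SELECT' cue
    that the model is expected to complete."""
    return (
        f"{schema_ddl}\n\n"
        "-- Using valid SQLite, answer the following questions "
        "for the tables provided above.\n\n"
        f"-- {question}\n\n"
        "SELECT"
    )

# Hypothetical single-table schema for illustration.
schema = "CREATE TABLE stadium (id INT, name TEXT, capacity INT);"
prompt = build_prompt(schema, "What is the maximum capacity of all stadiums?")
print(prompt)
```

The generated completion (e.g. ` MAX(capacity) FROM stadium`) is then appended to the trailing `SELECT` to form the final query.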
The results of our experiments are summarized in Table 1. Through our fine-tuning efforts, NSQL-Llama-2-7b reaches an execution accuracy of 75 on the Spider benchmark, a substantial 46-point improvement over the vanilla Llama-2-7b model. Moreover, compared to our previous NSQL 6b model, NSQL-Llama-2-7b delivers a large 11.4-point execution accuracy boost on Spider, along with a noteworthy 2.2-point execution accuracy advantage over ChatGPT on the same benchmark. This advancement goes a long way toward bridging the performance gap between open-source large language models and closed OpenAI models. For the GeoQuery benchmark, we observe similar execution accuracy[1] but measure a substantial 15.2-point improvement in exact match accuracy compared to NSQL 6b.
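For readers unfamiliar with the metric: execution accuracy counts a prediction as correct when the predicted and gold queries return the same result set when run against the database. A minimal, order-insensitive sketch of that check is below; the toy schema and data are our own invention, not drawn from the benchmark.

```python
import sqlite3

def execution_match(db: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """A prediction counts as correct when both queries return the same
    multiset of rows (order-insensitive, as in Spider-style evaluation)."""
    gold = sorted(db.execute(gold_sql).fetchall())
    pred = sorted(db.execute(pred_sql).fetchall())
    return gold == pred

# Tiny in-memory database for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE city (name TEXT, population INT)")
db.executemany("INSERT INTO city VALUES (?, ?)",
               [("Oslo", 700000), ("Bergen", 280000)])

# Different surface forms, same result set -> counts as correct.
print(execution_match(
    db,
    "SELECT name FROM city WHERE population > 500000",
    "SELECT name FROM city WHERE population >= 700000",
))  # -> True
```

Exact match accuracy, by contrast, compares the SQL strings (or their canonicalized components) rather than their results, which is why the two metrics can diverge, as they do on GeoQuery.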
In addition to academic benchmarks, we also evaluated NSQL-Llama-2-7b on a customer's production workload. This workload consists of 30 hand-crafted SQL queries that the customer runs in production daily, together with their corresponding natural language descriptions. Each query embeds essential business logic, ranging from single-table queries to complex joins over multiple tables. In Figure 1, we show that on this workload NSQL-Llama-2-7b significantly boosts execution accuracy, surpassing our previous NSQL 6b model by 15.5 points and outperforming GPT-4 by 7.8 points. We found that this is because NSQL-Llama-2-7b excels at handling the complex logic and longer contexts common in production workloads. This important customer win underscores the potential of NSQL-Llama-2-7b for wider adoption across enterprises.
Thanks to its SQL-specific training, the NSQL-Llama-2-7b model exhibits a stronger grasp of SQL semantics and improved handling of long contexts, which is particularly useful for complex, multi-table join queries.
- Figure 2 provides a detailed view of these improvements, showcasing a 43% improvement for "Join" queries and a 54% improvement for "Nested" queries in the Spider benchmark.
- Table 3 provides concrete examples (examples 1, 2, 3, and 4) that showcase the NSQL-Llama-2-7b model's new capabilities. In example 1, NSQL-Llama-2-7b generates a ratio-type query, a capability absent in our NSQL 6b model. In example 2, the model comprehends complex question logic and generates the correct SQL query, fixing a failure mode of NSQL 6b, which often overlooked vital filters in similar cases. Examples 3 and 4 demonstrate NSQL-Llama-2-7b's ability to choose the right join strategy among three tables and to deduce the correct query logic, overcoming NSQL 6b's difficulty in identifying the proper join pattern or the critical GROUP BY logic during query generation.
- When applied to the customer workload detailed above, NSQL-Llama-2-7b achieved a 60% improvement over our previous NSQL 6b model on questions requiring a large context (1,000 tokens or more in our evaluation setting). This demonstrates the model's superior handling of long-context scenarios, setting it apart from previous iterations of NSQL models.
Nonetheless, while NSQL-Llama-2-7b has made significant strides, we have also come across certain challenges. Specifically, the model encounters difficulties with calculations involving date-related columns, an important query type in business analytics for trend analysis (see example 5 in Table 2). To overcome this limitation, we plan to improve the model's performance by including more date-related query examples in the fine-tuning dataset.
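To make this failure mode concrete, a typical trend-analysis query of this kind groups rows by a derived month column; in SQLite that routing goes through `strftime`, precisely the sort of date manipulation the model currently struggles to generate. The table and values below are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INT, amount REAL, created_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2023-01-05"),
    (2, 20.0, "2023-01-20"),
    (3, 30.0, "2023-02-02"),
])

# Month-over-month revenue: the GROUP BY key must be derived from the
# date column via strftime, rather than grouped on the raw column.
rows = db.execute(
    "SELECT strftime('%Y-%m', created_at) AS month, SUM(amount) "
    "FROM orders GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # -> [('2023-01', 30.0), ('2023-02', 30.0)]
```

Generating the `strftime('%Y-%m', ...)` expression (instead of grouping on the raw date column) is the step where date-related queries tend to go wrong.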
Furthermore, as emphasized in our previous blog, there are instances where NSQL models require domain-specific information to generate precise queries (illustrated by examples 6 and 7 in Table 2). To address this and deepen the model's understanding of domain knowledge and business semantics, we offer fine-tuning on domain-specific data to our customers, a critical step in surfacing organizational insights in each distinct domain.
With the release of the NSQL-Llama-2-7b model, we invite data practitioners and AI enthusiasts to join us in shaping the future of AI-driven analytics. You can access NSQL-Llama-2-7b, along with weights, by visiting Hugging Face here or by trying out one of our example notebooks here. Together, we can continue to bridge the divide between cutting-edge AI capabilities and accessible, enterprise-owned models, fostering innovation and driving the data ecosystem forward. At Numbers Station, we are dedicated to driving positive change in the AI landscape, and we look forward to the impact of this next step in our journey.
If you are interested in bringing the power of NSQL-Llama-2-7b and related analytics FMs to your enterprise in a personalized manner, see more at https://www.numbersstation.ai. Join us on this exciting journey of transforming the world of data analytics and decision-making.
1. Execution accuracy did not improve because of case sensitivity in the emitted SQL (e.g., the model emits "California" where the database stores "california"). We plan to address this in a future release. Regardless, exact match accuracy provides a more accurate representation of the model's overall improvement.
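The failure mode in this footnote is easy to reproduce: SQLite compares TEXT values with the case-sensitive BINARY collation by default, so a capitalization mismatch between the emitted literal and the stored value empties the result set even when the query is otherwise correct. The schema and values below are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stores (id INT, state TEXT)")
db.execute("INSERT INTO stores VALUES (1, 'california')")

# '=' on TEXT uses the case-sensitive BINARY collation by default,
# so the capitalized literal matches no rows.
exact = db.execute(
    "SELECT COUNT(*) FROM stores WHERE state = 'california'").fetchone()[0]
wrong_case = db.execute(
    "SELECT COUNT(*) FROM stores WHERE state = 'California'").fetchone()[0]
print(exact, wrong_case)  # -> 1 0
```

Declaring the column with `COLLATE NOCASE` (or normalizing values at load time) would make both literals match, which is one way such an issue can be sidestepped at the database level.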