Skip to content

Curated tutorials and resources for Large Language Models, Text2SQL, and more.

License

Notifications You must be signed in to change notification settings

warrentak18/Awesome-Text2SQL

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

57 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Awesome Text2SQL🎉🎉🎉

GitHub Repo stars GitHub Repo forks Awesome License: MIT last commit

English | 中文版

Curated tutorials and resources for Large Language Models, Text2SQL, and more.

🌱 How to Contribute

We warmly welcome contributions from everyone, whether you've found a typo, a bug, have a suggestion, or want to share a resource related to LLM+Text2SQL. For detailed guidelines on how to contribute, please see our CONTRIBUTING.md file.

🔔 Leaderboard

WikiSQL Spider
Exact Match(EM)
Spider
Exact Execution(EX)
BIRD
Valid Efficiency Score (VES)
BIRD
Execution Accuracy (EX)
🏆1 93.0
(2021/05-SeaD+Execution-Guided Decoding)
74.0
(2022/09-Graphix-3B + PICARD)
86.6
(2023/08-DAIL-SQL + GPT-4 + Self-Consistency)
64.22
(2023/10-SFT CodeS-15B)
60.37
(2023/10-SFT CodeS-15B)
🥈2 92.7
(2021/03-SDSQL+Execution-Guided Decoding)
73.9
(2022/09-CatSQL + GraPPa)
86.2
(2023/08-DAIL-SQL + GPT-4)
63.62
(2023/10-SFT CodeS-7B)
59.25
(2023/10-SFT CodeS-7B)
🥉3 92.5
(2020/11-IE-SQL+Execution-Guided Decoding)
73.1
(2022/09-SHiP + PICARD)
85.3
(2023/04-DIN-SQL + GPT-4)
60.77
(2023/07-GPT-4)
55.90
(2023/08-DIN-SQL + GPT-4)
4 92.2
(2020/03-HydraNet+Execution-Guided Decoding)
72.9
(2022/05-G³R + LGESQL + ELECTRA)
83.9
(2023/07-Hindsight Chain of Thought with GPT-4)
59.44
(2023/08-DIN-SQL + GPT-4)
54.89
(2023/07-GPT-4)
5 91.9
(2020/12-BRIDGE+Execution-Guided Decoding)
72.4
(2022/08-RESDSQL+T5-1.1-lm100k-xl)
82.3
(2023/06-C3 + ChatGPT + Zero-Shot)
56.99
(2023/10-SFT CodeS-15B)
52.15
(2023/10-SFT CodeS-15B)
6 91.8
(2019/08-X-SQL+Execution-Guided Decoding)
72.4
(2022/05-T5-SR)
80.8
(2023/07-Hindsight Chain of Thought with GPT-4 and Instructions)
56.56
(2023/03-ChatGPT + CoT)
50.25
(2023/10-SFT CodeS-7B)
7 91.4
(2021/03-SDSQL)
72.2
(2022/12-N-best List Rerankers + PICARD)
79.9
(2023/02-RESDSQL-3B + NatSQ)
54.84
(2023/10-SFT CodeS-7B)
49.02
(2023/07-Claude-2)
8 91.1
(2020/12-BRIDGE)
72.1
(2021/09-S²SQL + ELECTRA )
78.5
(2022/11-SeaD + PQL)
51.40
(2023/03-ChatGPT)
40.08
(2023/03-ChatGPT + CoT)
9 91.0
(2021/04-Text2SQLGen + EG)
72.0
(2023/02-RESDSQL-3B + NatSQL)
78.2
(2023/04-DIN-SQL + CodeX)
49.69
(2023/03-ChatGPT + CoT)
39.30
(2023/03-ChatGPT)
10 90.5
(2020/11-SeqGenSQL+EG)
72.0
(2021/06-LGESQL + ELECTRA )
78.0
(2023/08-T5-3B+NatSQL+Token Preprocessing)
41.60
(2023/02-Codex)
36.47
(2023/02-Codex)

📜 Contents

👋 Introduction

  • Text-to-SQL (or Text2SQL), as the name implies, is to convert text into SQL. A more academic definition is to convert natural language problems in the database field into structured query languages ​​that can be executed in relational databases. Therefore, Text-to-SQL can also be abbreviated as NL2SQL.
    • Input: natural language questions, such as "Query the relevant information of the table t_user, and the results are sorted in descending order by id, and only the first 10 data are kept."
    • Output: SQL, such as "SELECT * FROM t_user ORDER BY id DESC LIMIT 10"

📖 Survey

  • (2023-International Conference on Very Large Data Bases, VLDB, CCF-A)A survey on deep learning approaches for text-to-SQL [paper]
  • (2022-IEEE Transactions on Knowledge and Data Engineering, TKDE, CCF-A) A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions [paper]
  • (2022-International Conference on Computational Linguistics, COLOING, CCF-B) Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect [paper]
  • (2022-arXiv)Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey [paper]

💬 Classic Model

  • (2023-arXiv, None) Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [paper] [code]

  • (2023-AAAI 2023, CCF-A) RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL [paper] [code]

  • (2023-arXiv, None) Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs [paper] [code]

  • (2023-arXiv, None) DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [paper] [code]

  • (2023-arXiv, None) A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability [paper] [code]

  • (2023-ICLR, CCF-A) Binding Language Models in Symbolic Languages [paper] [code]

  • (2023-SIGMOD, CCF-A) Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning [paper] [code]

  • (2023-ICASSP, CCF-B) T5-SR: A Unified Seq-to-Seq Decoding Strategy for Semantic Parsing [paper]

  • (2022-ACL, CCF-A) S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [paper]

  • (2022-NAACL, CCF-B) SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising [paper]

  • (2022-EMNLP, CCF-B) STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [paper] [code]

  • (2022-EMNLP, CCF-B) RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL [paper] [code]

  • (2022-EMNLP, CCF-B) CQR-SQL: Conversational Question Reformulation Enhanced Context-Dependent Text-to-SQL Parsers [paper]

  • (2022-ACL, CCF-A) HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing [paper]

  • (2022-arXiv, None) Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [paper]

  • (2021-ACL, CCF-A) Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL [paper]

  • (2021-arXiv, None) Pay More Attention to History: A Context Modelling Strategy for Conversational Text-to-SQL [paper] [code]

  • (2021-ICLR, CCF-A) SCORE: Pre-training for Context Representation in Conversational Semantic Parsing [paper]

  • (2021-DASFAA, CCF-B) An Interactive NL2SQL Approach with Reuse Strategy [paper]

  • (2021-NAACL, CCF-B) Structure-Grounded Pretraining for Text-to-SQL [paper]

  • (2021-EMNLP, CCF-B) PICARD:Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models [paper] [code]

  • (2021-ICLR, CCF-A) GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [paper] [code]

  • (2021-ACL, CCF-A) LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations [paper] [code]

  • (2020-EMNLP, CCF-B) Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [paper] [code]

  • (2020-ACL, CCF-A) TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [paper] [code]

  • (2020-ACL, CCF-A) RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers [paper] [code]

  • (2020-EMNLP, CCF-B) Mention Extraction and Linking for SQL Query Generation [paper]

  • (2020-EMNLP, CCF-B) IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation [paper] [code]

  • (2020-arXiv, None) Hybrid Ranking Network for Text-to-SQL [paper] [code]

  • (2019-arXiv, None) X-SQL: reinforce schema representation with context [paper]

  • (2019-EMNLP, CCF-B) Global Reasoning over Database Structures for Text-to-SQL Parsing [paper] [code]

  • (2019-EMNLP, CCF-B) Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions [paper] [code]

  • (2019-ACL, CCF-A) Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing [paper] [code]

  • (2019-ACL, CCF-A) Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation [paper] [code]

  • (2018-EMNLP, CCF-B) SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-DomainText-to-SQL Task [paper] [code]

  • (2018-NAACL, CCF-B) TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation [paper] [code]

  • (2017-arXiv, None) SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning [paper] [code]

🔥 Base Model

  • Llama [paper] [code] [model]

    • 2023/02, Meta AI proposes the open source LLM Llama, which has four scales: 7b, 13b, 33b, and 65b.
  • ChatGLM [paper] [code] [model]

    • 2023/03, Tsinghua University proposes the open bilingual language model ChatGLM, based on General Language Model framework, with the specification of 7b.
  • Alpaca [paper] [code] [model]

    • 2023/03, Stanford University proposes Alpaca, an open source LLM fine-tuned based on the Llama 7b model. There are 1 specification of 7b, and the training is simpler and cheaper.
  • Vicuna [paper] [code] [model]

    • 2023/03, UC Berkeley University, CMU and Stanford University propose Vicuna, an open souce LLM based on the Llama model, has two specifications: 7b and 13b.
  • WizardLM [paper] [code] [model]

    • 2023/04, Peking University and Microsoft propose WizardLM, a LLM of evolutionary instructions, with three specifications of 7b, 13b, and 30b. 2023/06, They propose WizardMath, a LLM in the field of mathematics. 2023/08, They propose WizardCoder, a LLM in the field of code.
  • Falcon [paper] [code] [model]

    • 2023/06, United Arab Emirates proposes Falcon, an open source LLM trained solely on refinedweb datasets, with four parameter specifications of 1b, 7b, 40b and 180b. It is worth noting that the performance on model 40B exceeds that of 65B LLaMA.
  • ChatGLM2[paper] [code] [model]

    • 2023/06, Tsinghua University proposes the second-generation version of ChatGLM,with the specification of 7b, which has stronger performance, longer context, more efficient inference and more open license.
  • Baichuan-7b [code] [model]

    • 2023/06, Baichuan Intelligent Technology proposes the Baichuan-7B, an open-source, large-scale pre-trained language model based on Transformer architecture, which contains 7 billion parameters and trained on approximately 1.2 trillion tokens. It supports both Chinese and English languages with a context window length of 4096.
  • Baichuan-13b [code] [model]

    • 2023/07, Baichuan Intelligent Technology proposes the Baichuan-13B, an open-source, commercially available large-scale language model, following Baichuan-7B, which has two versions: pre-training (Baichuan-13B-Base) and alignment (Baichuan-13B-Chat).
  • InternLM [paper] [code] [model]

    • 2023/07, Shanghai AI Laboratory and SenseTime propose the InternLM, which has open-sourced a 7b and 20b parameter base models and chat models tailored for practical scenarios and the training system.
  • Llama 2 [paper] [code] [model]

    • 2023/07, Meta AI proposes the second-generation Llama series open-source LLM Llama 2. Compared with Llama 1, the training data is 40% more, and the context length is doubled. The model has four specifications: 7b, 13b, 34b, and 70b, but 34b is not open source.
  • Code LLama [paper] [code] [model]

    • 2023/08, Meta AI proposes Code LLama, based on Llama 2. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. There are foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models, with 7B, 13B and 34B parameters each.
  • Qwen [paper] [code] [model]

    • 2023/08, Alibaba Cloud proposes the 7b-parameter version of the large language model series Qwen-7B (abbr. Tongyi Qianwen), is pretrained on a large volume of data, including web texts, books, codes, etc, which has open sourced two models with Qwen-7B and Qwen-7B-Chat. 2023/09, Alibaba Cloud updated the Qwen-7B and Qwen-7B-Chat and open sourced Qwen-14B and Qwen-14B-Chat.
  • Baichuan 2 [code] [model]

    • 2023/09, Baichuan Intelligent Technology proposes the new generation of open-source large language models Baichuan 2, trained on a high-quality corpus with 2.6 trillion tokens, which has base and chat versions for 7B and 13B, and a 4bits quantized version for the chat model.
  • Phi-1.5 [paper] [model]

    • 2023/09, Microsoft Research proposes the open source language model phi-1.5, a Transformer with 1.3 billion parameters, which was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, phi-1.5 demonstrates a nearly state-of-the-art performance among models with less than 10 billion parameters.

💡 Fine-tuning

  • P-Tuning [paper] [code]

    • 2021/03, Tsinghua University and others propose P-Tuning, a fine-tuning method for LLM, which uses trainable continuous prompt word embeddings to reduce the cost of fine-tuning.
  • LoRA [paper] [code]

    • 2021/06, Microsoft proposes the Low-Rank Adaptation method for fine-tuning LLM by freezing the pre-training weights.
  • P-Tuning V2 [paper] [code]

    • 2021/10, Tsinghua University proposes P-Tuning V2, an improved version of P-Tuning with better performance.
  • RLHF [paper] [code]

    • 2022/12, OpenAI uses the RLHF (Reinforcement Learning from Human Feedback) method to train ChatGPT, and uses human feedback signals to directly optimize the language model, with excellent performance.
  • RRHF [paper] [code]

    • 2023/04, Alibaba proposes a novel learning paradigm called RRHF(Rank Responses to Align Language Models with Human Feedback without tears), which can be tuned as easily as fine-tuning and achieve a similar performance as PPO in HH dataset.
  • QLoRA [paper] [code]

    • 2023/05, Washington University proposes the qlora method, based on the frozen 4bit quantization model, combined with LoRA method training, which further reduces the cost of fine-tuning.
  • RLTF [paper] [code]

    • 2023/07, Tencent proposes RLTF(Reinforcement Learning from Unit Test Feedback), a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs.
  • RRTF [paper]

    • 2023/07, Huawei proposes RRTF(Rank Responses to align Test&Teacher Feedback). Compared with RLHF, RRHF can efficiently align the output probabilities of a language model with human preferences, with only 1-2 models required during the tuning period, and it is simpler than PPO in terms of implementation, hyperparameter tuning, and training.
  • RLAIF [paper]

    • 2023/09, Google proposes RLAIF (RL from AI Feedback), a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans. They find that the RLHF and RLAIF methods achieve the similar results on the task of summarization.

💪 Dataset

  • WikiSQL [paper] [code] [dataset]

    • 2017/09, Salesforce proposes a large Text-to-SQL dataset WikiSQL, the data comes from Wikipedia, which belongs to a single domain, contains 80,654 natural language questions, and 77,840 SQL statements. The form of SQL statements is relatively simple, and does not include sorting, grouping, and subqueries and other complex operations.
  • Spider [paper] [code] [dataset]

    • 2018/09, Yale University proposes the Text-to-SQL dataset Spider with multiple databases, multiple tables, and single-round query. It is also recognized as the most difficult large-scale cross-domain evaluation list in the industry. It contains 10,181 natural language questions and 5,693 SQL statements. Involving more than 200 databases in 138 different fields, the difficulty level is divided into: easy, medium, difficult, and extremely difficult.
  • SParC [paper] [code] [dataset]

    • 2019/06, Yale University proposes a large dataset SParC for complex, cross-domain, and context-dependent(multi-turn) semantic parsing and text-to-SQL task, which consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries annotated by 14 Yale students), obtained from user interactions with 200 complex databases over 138 domains.
  • CSpider [paper] [code] [dataset]

    • 2019/09, Westlake University propposes a large Chinese dataset CSpider for complex and cross-domain semantic parsing and text-to-SQL task, translated from Spider by 2 NLP researchers and 1 computer science student, which consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
  • CoSQL [paper] [code] [dataset]

    • 2019/09, Yale University and Salesforce Research propose a cross-domain database CoSQL, which consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains.
  • TableQA [paper] [dataset]

    • 2020/06, Zhuiyi Technology proposes a large-scale cross-domain Natural Language to SQL dataset TableQA in Chinese language consisting 64,891 questions and 20,311 unique SQL queries on over 6,000 tables.
  • DuSQL [paper] [dataset]

    • 2020/11, Baidu proposes a larges-scale and pragmatic Chinese dataset DuSQL for the cross-domain text-toSQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs.
  • CHASE [paper] [code] [dataset]

    • 2021/08, Xi'an Jiaotong University and Microsoft propose the first cross-domain, multi-round Text-to-SQL Chinese dataset, which contains a list of 5459 multi-round questions and 17940 <query, SQL> binary groups.
  • BIRD-SQL [paper] [code] [dataset]

    • 2023/05, the University of Hong Kong and Alibaba propose a large-scale cross-domain dataset BIRD, which contains over 12,751 unique question-SQL pairs, 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education, etc.

🌈 Evaluation Index

  • Execution Accuracy (EX) [paper]

    • Definition: Calculate the proportion of the correct number of SQL execution results in the data set, and the result may be overestimated.
  • Exact Match (EM) [paper]

    • Definition: Calculate the matching degree between the SQL generated by the model and the marked SQL, and the result may be underestimated.

🔧 Practice Project

  • DB-GPT-Hub GitHub Repo stars last commit

    • The eosphoros organization proposes an open source project focusing on Text-to-SQL fine-tuning based on LLM, including large-scale model download, dataset preprocessing, fine-tuning technologies such as LoRA and QLoRA, model prediction, model evaluation and other steps.
  • sqlcoder GitHub Repo stars last commit

    • The Defog organization proposes an advanced Text-to-SQL LLM, which has outstanding performance and is better than GPT3.5, wizardcoder and starcoder, etc., second only to GPT4.
  • modal_finetune_sql GitHub Repo stars last commit

    • This project is based on the LLaMa 2 7b model for Text-to-SQL fine-tuning, which includes a complete training, fine-tuning, and evaluation process.
  • LLaMA-Efficient-Tuning GitHub Repo stars last commit

    • Easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Chat

🤝 Friendship Links

  • eosphoros GitHub Repo stars last commit

    • They are a team of technology enthusiasts from internet companies and NLP graduate students who are passionate about open source projects. Their focus is on developing solutions that protect the privacy and security of databases and large language models. Their aim is to ensure that the abilities of these models remain absolutely private, secure, and under control.
  • Awesome-AIGC-Tutorials GitHub Repo stars last commit

    • Awesome AIGC Tutorials houses a curated collection of tutorials and resources spanning across Large Language Models, AI Painting, and related fields. Discover in-depth insights and knowledge catered for both beginners and advanced AI enthusiasts.

    Star History Chart

About

Curated tutorials and resources for Large Language Models, Text2SQL, and more.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published