This project benchmarks LlamaIndex's performance on complex Text-to-SQL queries across multiple domains, and measures how each iteration of LLM improves its Text-to-SQL capability.
- Download the benchmark dataset; the download link is in the left-side bar under the "Get Started" section. Unzip the file after downloading.
- Use `sample_benchmark.py` to sample the benchmark dataset so we don't spend too much money when testing. Skip this step when running the complete benchmark. A sketch of the sampling idea follows the command below.
```bash
python sample_benchmark.py --input <benchmark path> --output spider-0_001 --sample-factor 0.001
# A smaller benchmark with 1/1000 of the examples is saved in directory spider-0_001,
# which we use as our benchmark for testing purposes.
```
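Conceptually, sampling just draws a random subset of each split and copies the databases over unchanged. The sketch below is a hypothetical simplification of what `sample_benchmark.py` might do, assuming the standard Spider file layout (`train_spider.json`, `dev.json`, `tables.json`, and a `database/` directory); the real script may handle more files and edge cases.

```python
# Hypothetical sketch of benchmark sampling; the actual sample_benchmark.py may differ.
# Assumes the standard Spider layout, where train_spider.json and dev.json hold lists
# of {"question", "query", "db_id", ...} records.
import json
import random
import shutil
from pathlib import Path

def sample_split(src: Path, dst: Path, filename: str, factor: float) -> None:
    examples = json.loads((src / filename).read_text())
    k = max(1, int(len(examples) * factor))  # keep at least one example per split
    (dst / filename).write_text(json.dumps(random.sample(examples, k), indent=2))

def sample_benchmark(input_dir: str, output_dir: str, factor: float) -> None:
    src, dst = Path(input_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for split in ("train_spider.json", "dev.json"):
        sample_split(src, dst, split, factor)
    # Copy schema metadata and the SQLite databases unchanged so sampled queries still run.
    shutil.copy(src / "tables.json", dst / "tables.json")
    shutil.copytree(src / "database", dst / "database", dirs_exist_ok=True)

if __name__ == "__main__":
    sample_benchmark("spider", "spider-0_001", 0.001)
```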
- Use `generate_sql.py` to generate the predicted SQL queries given the input benchmark. A sketch of the generation step follows the command below.
```bash
python generate_sql.py --input spider-0_001 --output spider-0_001-pred --model gpt-3.5-turbo
# Predicted SQLs are saved in the output directory.
```
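At its core, this step prompts an LLM with the natural-language question plus the target database schema and records the SQL it returns. The sketch below illustrates that idea using the OpenAI client directly; the actual `generate_sql.py` presumably goes through LlamaIndex, and the prompt wording and the `schema_of` helper here are assumptions for illustration.

```python
# Minimal sketch of the Text-to-SQL generation step (illustrative only; the real
# generate_sql.py presumably drives this through LlamaIndex rather than raw OpenAI calls).
import sqlite3

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def schema_of(db_path: str) -> str:
    """Return a database's CREATE TABLE statements to use as schema context."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    return "\n".join(sql for (sql,) in rows)

def predict_sql(question: str, db_path: str, model: str = "gpt-3.5-turbo") -> str:
    prompt = (
        "Given the database schema below, write a single SQLite query that answers "
        f"the question.\n\nSchema:\n{schema_of(db_path)}\n\nQuestion: {question}\nSQL:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output makes runs comparable
    )
    return response.choices[0].message.content.strip()
```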
- Use `evaluate.sh` to evaluate the predictions. The script will download the Spider evaluation code and use it to generate performance reports, which are saved in the same directory as the predicted SQL queries. See here to understand the evaluation metrics. A sketch of the underlying evaluation call follows the command below.
```bash
./evaluate.sh spider-0_001 spider-0_001-pred
```
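Under the hood, the exact-match and execution evaluation comes down to running the downloaded Spider `evaluation.py` over a gold file and a prediction file. The Python sketch below shows one plausible invocation; the file paths (`spider-evaluation/`, `dev_gold.sql`, `dev_pred.sql`) are assumptions about where `evaluate.sh` places things, and the flags should be checked against the evaluation code it actually downloads.

```python
# Hypothetical invocation of the Spider evaluation script from Python instead of
# evaluate.sh. Flag names follow the public Spider repository, but the paths below
# are assumptions and should be adjusted to the local layout.
import subprocess

subprocess.run(
    [
        "python", "spider-evaluation/evaluation.py",
        "--gold", "spider-0_001/dev_gold.sql",       # gold SQL, one query per line
        "--pred", "spider-0_001-pred/dev_pred.sql",  # predicted SQL, same order
        "--db", "spider-0_001/database",             # directory of SQLite databases
        "--table", "spider-0_001/tables.json",       # schema definitions
        "--etype", "all",                            # report exact match and execution accuracy
    ],
    check=True,
)
```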
- New! Use `evaluate.py` to evaluate the generated SQL queries against the gold SQL queries by matching the natural-language answers generated from their respective execution outputs. We call this metric Answer Accuracy. A sketch of the idea follows this step.
```bash
python evaluate.py --spider-dir spider-0_001 --predict-dir spider-0_001-pred \
    --model gpt-3.5-turbo
```
This will produce two JSON files, `train_eval.json` and `dev_eval.json`, with the evaluation results in the `--predict-dir` directory.
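In outline, Answer Accuracy executes both the gold and the predicted SQL against the database, asks an LLM to phrase each execution result as a natural-language answer, and then asks it whether the two answers agree. The sketch below is a hypothetical rendering of that idea, not the actual `evaluate.py` logic; the prompts and helper names are made up for illustration.

```python
# Hypothetical sketch of the Answer Accuracy idea (the actual evaluate.py may differ).
import sqlite3

from openai import OpenAI

client = OpenAI()

def run_sql(db_path: str, sql: str) -> str:
    """Execute a query and return its rows as text; treat failures as an error answer."""
    try:
        with sqlite3.connect(db_path) as conn:
            return str(conn.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return f"<error: {exc}>"

def to_answer(question: str, result: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the LLM to phrase a raw execution result as a natural-language answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nQuery result: {result}\n"
                       "State the answer to the question in one sentence.",
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def answers_match(question: str, gold_sql: str, pred_sql: str, db_path: str) -> bool:
    """Judge whether the predicted SQL yields the same natural-language answer as the gold SQL."""
    gold = to_answer(question, run_sql(db_path, gold_sql))
    pred = to_answer(question, run_sql(db_path, pred_sql))
    verdict = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nAnswer A: {gold}\nAnswer B: {pred}\n"
                       "Do A and B convey the same answer? Reply yes or no.",
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().lower().startswith("yes")
```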
Results below are based on 96 examples (86 train + 10 dev) sampled from the Spider benchmark.
| Model | Answer Accuracy |
|---|---|
| code-davinci-002 | 0.7917 |
| text-davinci-003 | 0.8854 |
| gpt-3.5-turbo | 0.8542 |
| gpt-4 | 0.8958 |
- Auto-course-correction when encountering SQL errors, using a LangChain agent.
- Use the training set to generate in-context learning examples.