To assess an LLM's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and general-purpose work such as file handling and web scraping, we have created and open-sourced a benchmark designed specifically to evaluate these capabilities.
The metrics are divided into two parts: code executability and code correctness.
- Code executability: whether the LLM-generated code can be executed.
- Code correctness: whether the LLM-generated code produces the correct result when it runs.

For code correctness, we evaluate the accuracy of the code execution results in two domains: `Math` and `Visualization`.
For code executability, we calculate the executable rate of the generated code on `General problem-solving` tasks.
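
As a rough illustration of how the two metrics differ, the sketch below computes an executable rate and an accuracy from a handful of made-up case records. The `CaseResult` layout, the function names, and the choice to leave unjudged cases out of the accuracy denominator are assumptions made for illustration; the benchmark's own scoring code may aggregate differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical record layout)."""
    task: str                       # e.g. "gsm8k", "visualization", "general"
    executed_ok: bool               # assumed meaning: the generated code ran
    correct: Optional[bool] = None  # judged correctness; None if not judged


def executable_rate(results: Iterable[CaseResult]) -> float:
    """Percentage of cases whose generated code executed."""
    results = list(results)
    if not results:
        return 0.0
    return 100.0 * sum(r.executed_ok for r in results) / len(results)


def accuracy(results: Iterable[CaseResult]) -> float:
    """Percentage of judged cases whose execution result was correct."""
    judged = [r for r in results if r.correct is not None]
    if not judged:
        return 0.0
    return 100.0 * sum(bool(r.correct) for r in judged) / len(judged)


# Made-up records, not benchmark data.
general_cases = [
    CaseResult("general", executed_ok=True),
    CaseResult("general", executed_ok=False),
    CaseResult("general", executed_ok=True),
    CaseResult("general", executed_ok=True),
]
math_cases = [
    CaseResult("gsm8k", executed_ok=True, correct=True),
    CaseResult("gsm8k", executed_ok=True, correct=False),
]

print(f"General executable rate: {executable_rate(general_cases):.1f}%")  # 75.0%
print(f"Math accuracy: {accuracy(math_cases):.1f}%")                      # 50.0%
```

The executable rate only asks whether the code ran, so it can be high even when accuracy is low.
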
- Qwen-7B-Chat refers to the version updated after September 25, 2023.
- The code correctness judger model for `Visualization` changed from `Qwen-vl-chat` to `gpt-4-vision-preview` in version 20231206.

In-house Code Interpreter Benchmark (Version 20231206). The Math and Visualization columns report the accuracy of code execution results (%); the General column reports the executable rate of code (%).

Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
---|---|---|---|---|
GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
InternLM-20B-Chat | 34.6 | 10.7 | 24.1 | 65.5 |
ChatGLM3-6B | 54.2 | 4.8 | 15.2 | 62.1 |
Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |

We also provide results obtained with `Qwen-vl-plus` as the code correctness judger model for the `Visualization` task, for reference.

Code correctness judger model = `Qwen-vl-plus`. Values are the accuracy of code execution results (%).

Model | Visualization-Hard↑ | Visualization-Easy↑ |
---|---|---|
LLaMA2-13B-Chat | 2.4 | 17.7 |
CodeLLaMA-13B-Instruct | 17.9 | 34.2 |
InternLM-20B-Chat | 9.5 | 31.7 |
ChatGLM3-6B | 10.7 | 29.1 |
Qwen-1.8B-Chat | 32.1 | 32.9 |
Qwen-7B-Chat | 26.2 | 39.2 |
Qwen-14B-Chat | 36.9 | 41.8 |
Qwen-72B-Chat | 38.1 | 38.0 |

Clone the repository and install the benchmark dependencies:

    git clone https://github.com/QwenLM/Qwen-Agent.git
    cd Qwen-Agent/benchmark
    pip install -r requirements.txt

Download the evaluation data (run from the `benchmark` directory):

    wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
    unzip benchmark_code_interpreter_data.zip
    mkdir eval_data
    mv eval_code_interpreter_v1.jsonl eval_data/
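
Before running the benchmark, it can help to confirm that the data file unpacked correctly. The sketch below assumes only that `eval_code_interpreter_v1.jsonl` is a JSON Lines file (one JSON value per line, as the extension suggests); it makes no assumption about field names and simply reports which top-level keys occur. `summarize_jsonl` is a hypothetical helper, not part of the benchmark.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file."""
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, key_counts


if __name__ == "__main__":
    data_file = Path("eval_data/eval_code_interpreter_v1.jsonl")
    if data_file.exists():
        count, keys = summarize_jsonl(data_file)
        print(f"{count} records")
        for key, n in keys.most_common():
            print(f"  {key}: present in {n} records")
    else:
        print(f"{data_file} not found; run the download steps first")
```
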

To reproduce the full benchmark results, run the following script:

    python inference_and_execute.py --model {model_name}

`{model_name}` can be one of:
- qwen-1.8b-chat
- qwen-7b-chat
- qwen-14b-chat
- qwen-72b-chat
- llama-2-7b-chat
- llama-2-13b-chat
- codellama-7b-instruct
- codellama-13b-instruct
- internlm-7b-chat-1.1
- internlm-20b-chat

The benchmark runs the test cases and generates the performance results, which are saved in the `output_data` directory.

Notes:
Install the `simhei.ttf` font so that matplotlib displays text properly when evaluating the visualization task. Obtain `simhei.ttf` (it can be found on any Windows PC), place it in the current directory, and run the following code snippet:

    import glob
    import os
    import shutil
    import matplotlib

    # Copy simhei.ttf into the fonts/ttf folder next to matplotlib's matplotlibrc.
    target_font_path = os.path.join(
        os.path.abspath(
            os.path.join(matplotlib.matplotlib_fname(), os.path.pardir)),
        'fonts', 'ttf', 'simhei.ttf')
    shutil.copy('simhei.ttf', target_font_path)

    # Delete the cached font list so matplotlib rebuilds it and finds the new font.
    font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
    for cache_file in glob.glob(font_list_cache):
        os.remove(cache_file)

To measure the code executable rate, run:

    python inference_and_execute.py --task {task_name} --model {model_name}

`{task_name}`:
- `general`: General problem-solving task

To measure code correctness, run:

    python inference_and_execute.py --task {task_name} --model {model_name}

`{task_name}`:
- `visualization`: Visualization task
- `gsm8k`: Math task
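
To run several models or tasks in sequence, the commands above can be driven from a small wrapper. The sketch below uses only the `--model` and `--task` flags shown in this document; `build_command`, `run_all`, and the `dry_run` switch are hypothetical helpers, not part of the benchmark. It assumes it is started from the `benchmark` directory.

```python
import subprocess
import sys

# A subset of the model and task names listed in this document.
MODELS = ["qwen-1.8b-chat", "qwen-7b-chat"]
TASKS = ["general", "visualization", "gsm8k"]


def build_command(model, task=None):
    """Build the argument list for one benchmark run."""
    cmd = [sys.executable, "inference_and_execute.py", "--model", model]
    if task is not None:
        cmd += ["--task", task]
    return cmd


def run_all(models, tasks, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in sequence."""
    for model in models:
        for task in tasks:
            cmd = build_command(model, task)
            print(" ".join(cmd))
            if not dry_run:
                # check=True stops the loop at the first failing run.
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_all(MODELS, TASKS, dry_run=True)
```

Passing the command as a list rather than a shell string avoids quoting problems if a model name ever contains unusual characters.
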

The `inference_and_execute.py` file accepts the following configurable options:
- `--model`: The model to test, which can be one of `qwen-72b-chat`, `qwen-14b-chat`, `qwen-7b-chat`, `qwen-1.8b-chat`, `llama-2-7b-chat`, `llama-2-13b-chat`, `codellama-7b-instruct`, `codellama-13b-instruct`, `internlm-7b-chat-1.1`, `internlm-20b-chat`.
- `--task`: The test task, which can be one of `all`, `visualization`, `general`, `gsm8k`.
- `--output-path`: The path for saving the evaluation results.
- `--input-path`: The path where the evaluation data is placed.
- `--output-fname`: The file name for the evaluation results.
- `--input-fname`: The file name of the evaluation data.
- `--force`: Force generation, overwriting the cached results.
- `--eval-only`: Only calculate evaluation metrics, without re-running inference.
- `--eval-code-exec-only`: Only evaluate the code executable rate.
- `--gen-exec-only`: Only generate and execute code, without calculating evaluation metrics.
- `--gen-only`: Only generate code, without executing it or calculating evaluation metrics.
- `--vis-judger`: The model that judges result correctness for the `Visualization` task, which can be one of `gpt-4-vision-preview`, `qwen-vl-chat`, `qwen-vl-plus`. It is set to `gpt-4-vision-preview` by default in version 20231206; `Qwen-vl-chat` has been deprecated.
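
The options above map naturally onto an `argparse` parser. The sketch below is not the script's actual source; it only mirrors the documented flags and choices. Treating the five switches as boolean `store_true` flags, and every default except `--vis-judger` (for example `--task all`, `--input-path eval_data`), are assumptions inferred from the surrounding text.

```python
import argparse

MODELS = [
    "qwen-72b-chat", "qwen-14b-chat", "qwen-7b-chat", "qwen-1.8b-chat",
    "llama-2-7b-chat", "llama-2-13b-chat",
    "codellama-7b-instruct", "codellama-13b-instruct",
    "internlm-7b-chat-1.1", "internlm-20b-chat",
]
TASKS = ["all", "visualization", "general", "gsm8k"]
JUDGERS = ["gpt-4-vision-preview", "qwen-vl-chat", "qwen-vl-plus"]


def build_parser():
    """Parser mirroring the documented options (defaults mostly assumed)."""
    p = argparse.ArgumentParser(description="Sketch of the documented options")
    p.add_argument("--model", choices=MODELS, required=True)
    p.add_argument("--task", choices=TASKS, default="all")       # assumed default
    p.add_argument("--output-path", default="output_data")       # assumed default
    p.add_argument("--input-path", default="eval_data")          # assumed default
    p.add_argument("--output-fname", default=None)               # default unknown
    p.add_argument("--input-fname",
                   default="eval_code_interpreter_v1.jsonl")     # assumed default
    # Assumed to be simple on/off switches.
    for flag in ("--force", "--eval-only", "--eval-code-exec-only",
                 "--gen-exec-only", "--gen-only"):
        p.add_argument(flag, action="store_true")
    # Documented default for version 20231206.
    p.add_argument("--vis-judger", choices=JUDGERS,
                   default="gpt-4-vision-preview")
    return p


args = build_parser().parse_args(
    ["--model", "qwen-7b-chat", "--task", "gsm8k", "--eval-only"])
print(args.model, args.task, args.eval_only, args.vis_judger)
# prints: qwen-7b-chat gsm8k True gpt-4-vision-preview
```

Restricting `--model`, `--task`, and `--vis-judger` with `choices` makes a mistyped name fail immediately instead of partway through a long run.
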