
- Setup
- Generating New Completions
- Converting to AlpacaEval Format
- Running the Leaderboard Evaluation
- Viewing Results
- Troubleshooting
- Results
## Setup

- Ensure you have Python 3.8+ installed.
- Create and activate a virtual environment:

  ```bash
  python -m venv venv

  # On Windows
  .\venv\Scripts\Activate.ps1

  # On Linux/Mac
  source venv/bin/activate
  ```

- Install the required packages:

  ```bash
  pip install -e .
  pip install alpaca_eval
  ```

- Set up your OpenAI API key in a `.env` file:

  ```text
  OPENAI_API_KEY=your_api_key_here
  ```
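To confirm the key is actually picked up, here is a minimal sketch using the python-dotenv package (an assumption — if the project loads the key differently, export `OPENAI_API_KEY` in your shell instead):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read .env from the current working directory into the process environment.
load_dotenv()

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY not found -- check your .env file")
print("OpenAI key loaded:", api_key[:8] + "...")
```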
## Generating New Completions

To generate new completions with different sampling strategies:

- Prepare your prompts in a JSONL file with the following format:

  ```json
  {"prompt": "Write a short story about...", "source": "creative_writing"}
  ```

- Use the generation script with your desired sampling parameters:

  ```bash
  python generate_completions.py \
    --model "openchat-3.5-0106" \
    --temperature 1.5 \
    --min_p 0.1 \
    --input_file "data/prompts.jsonl" \
    --output_file "data/outputs/7b_temp150_minp_10.jsonl"
  ```
Available sampling parameters:

- `--temperature`: Controls randomness (e.g., 0.8, 1.0, 1.5)
- `--min_p`: Minimum probability threshold (e.g., 0.02, 0.05, 0.1)
- `--top_p`: Nucleus sampling threshold (e.g., 0.9, 0.95, 0.98)
- `--tfs`: Tail-free sampling threshold
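To make the `min_p` and `top_p` thresholds concrete, here is a minimal sketch of how these filters act on a next-token distribution (an illustration of the general technique, not the exact implementation in `generate_completions.py`):

```python
import numpy as np

def filter_min_p(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Zero out tokens whose probability is below min_p * max(probs)."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()  # renormalize over the surviving tokens

def filter_top_p(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    n_keep = np.searchsorted(cumulative, top_p) + 1   # tokens inside the nucleus
    kept = np.zeros_like(probs)
    kept[order[:n_keep]] = probs[order[:n_keep]]
    return kept / kept.sum()

probs = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(filter_min_p(probs, 0.1))  # drops tokens below 0.1 * 0.55 = 0.055
print(filter_top_p(probs, 0.9))  # keeps the head of the distribution (3 tokens)
```

Note how min-p scales with the model's confidence: when the top token is very likely, the cutoff is high; when the distribution is flat, more tokens survive.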
The output file will be saved in JSONL format with the model outputs.
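As a concrete reference for both file formats, a short sketch that writes a prompts file and reads generated completions back (the field names in the output file are an assumption — inspect your generated files to confirm):

```python
import json

# Write prompts in the input format shown above.
prompts = [
    {"prompt": "Write a short story about...", "source": "creative_writing"},
]
with open("data/prompts.jsonl", "w", encoding="utf-8") as f:
    for record in prompts:
        f.write(json.dumps(record) + "\n")

# Read generated completions back: one JSON object per line.
with open("data/outputs/7b_temp150_minp_10.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record))  # inspect which fields the generator wrote
```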
## Converting to AlpacaEval Format

After generating completions, you need to convert them to the AlpacaEval format:

- For individual files:

  ```bash
  python -m quest.to_alpaca_eval \
    --input_path "data/outputs/7b_temp150_minp_10.jsonl" \
    --output_path "data/outputs/aeval_7b_temp150_minp_10.json"
  ```

- For multiple files, you can use the conversion script:

  ```bash
  python convert_all.py
  ```

  This will:

  - Convert all specified JSONL files to the AlpacaEval JSON format
  - Create a combined file (`aeval_all.json`) with all model outputs
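The conversion itself is small if you want to script it yourself. A sketch, assuming each JSONL record carries `prompt` and `output` fields and that AlpacaEval expects a JSON list of `instruction`/`output`/`generator` dicts:

```python
import json
from pathlib import Path

def to_alpaca_eval(input_path: str, output_path: str, generator: str) -> None:
    """Convert one JSONL file of completions to an AlpacaEval-style JSON list."""
    records = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            records.append({
                "instruction": row["prompt"],  # assumed input field name
                "output": row["output"],       # assumed output field name
                "generator": generator,        # label shown on the leaderboard
            })
    Path(output_path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )

to_alpaca_eval(
    "data/outputs/7b_temp150_minp_10.jsonl",
    "data/outputs/aeval_7b_temp150_minp_10.json",
    generator="temp150_minp_10",
)
```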
## Running the Leaderboard Evaluation

To evaluate the model outputs and create a leaderboard:

```bash
python -m alpaca_eval.main make_leaderboard \
  --all_model_outputs data/outputs/aeval_all.json \
  --reference_outputs data/outputs/aeval_7b_temp100.json \
  --annotators_config alpaca_eval_cot_gpt4_turbo_fn \
  --leaderboard-path leaderboard.csv
```

This command:

- Takes all model outputs from `aeval_all.json`
- Uses outputs from `aeval_7b_temp100.json` as the reference
- Uses GPT-4 Turbo with chain-of-thought prompting for evaluation
- Saves the results to `leaderboard.csv`
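Because alpaca_eval exposes its CLI through fire, the same evaluation can also be launched from Python. A sketch, under the assumption that the keyword names mirror the CLI flags above:

```python
from alpaca_eval.main import make_leaderboard

# Keyword names assumed to mirror the CLI flags; check your installed
# version's signature if this raises a TypeError.
make_leaderboard(
    all_model_outputs="data/outputs/aeval_all.json",
    reference_outputs="data/outputs/aeval_7b_temp100.json",
    annotators_config="alpaca_eval_cot_gpt4_turbo_fn",
    leaderboard_path="leaderboard.csv",
)
```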
Alternatively, you can use the provided shell script:

```bash
# On Linux/Mac
./eval.sh
```

```powershell
# On Windows PowerShell
python -m alpaca_eval.main make_leaderboard --all_model_outputs data/outputs/aeval_all.json --reference_outputs data/outputs/aeval_7b_temp100.json --annotators_config alpaca_eval_cot_gpt4_turbo_fn --leaderboard-path leaderboard.csv
```
## Viewing Results

To view the results in a more readable format:

```bash
python display_results.py
```
This will display a sorted table of results with:
- Model configuration
- Win rate
- Length-controlled win rate
- Average output length
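If you want to inspect the raw CSV yourself, a minimal stand-in with pandas (the column names are assumptions about alpaca_eval's output, not a description of `display_results.py` itself):

```python
import pandas as pd

# NOTE: column names are assumptions about alpaca_eval's CSV output;
# run print(df.columns) first if this raises a KeyError.
df = pd.read_csv("leaderboard.csv", index_col=0)
view = df[["win_rate", "length_controlled_winrate", "avg_length"]]
print(view.sort_values("length_controlled_winrate", ascending=False)
          .round(2)
          .to_string())
```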
## Troubleshooting

If you encounter errors related to the OpenAI API when running AlpacaEval, you may need to fix compatibility issues between the AlpacaEval code and newer versions of the OpenAI API:

- **Schema Validation Error**: With OpenAI API version 1.65.3+, you might see this error:

  ```text
  openai.BadRequestError: Error code: 400 - {'error': {'message': "Invalid schema for function 'make_partial_leaderboard': In context=(), 'required' is required to be supplied and to be an array including every key in properties. Missing 'concise_explanation'."}}
  ```
- **Fix for Schema Validation Error**: Edit the AlpacaEval config files to add the missing field to the `required` array:

  - Modify `venv/Lib/site-packages/alpaca_eval/evaluators_configs/alpaca_eval_cot_gpt4_turbo_fn/configs.yaml`:

    ```yaml
    # Change this line:
    required: [ "ordered_models" ]
    # To:
    required: [ "ordered_models", "concise_explanation" ]
    ```

  - Also modify `venv/Lib/site-packages/alpaca_eval/evaluators_configs/alpaca_eval_gpt4_turbo_fn/configs.yaml` in the same way.
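  To avoid hand-editing both files, a small helper can apply the substitution for you (a sketch assuming a Windows venv layout and that the `required` line matches the snippet above verbatim; on Linux/Mac the path is `venv/lib/python3.x/site-packages/...`):

  ```python
  from pathlib import Path

  # Paths assume a Windows venv layout; adjust for Linux/Mac.
  CONFIGS = [
      "venv/Lib/site-packages/alpaca_eval/evaluators_configs/alpaca_eval_cot_gpt4_turbo_fn/configs.yaml",
      "venv/Lib/site-packages/alpaca_eval/evaluators_configs/alpaca_eval_gpt4_turbo_fn/configs.yaml",
  ]

  for cfg in CONFIGS:
      path = Path(cfg)
      text = path.read_text(encoding="utf-8")
      patched = text.replace(
          'required: [ "ordered_models" ]',
          'required: [ "ordered_models", "concise_explanation" ]',
      )
      if patched != text:
          path.write_text(patched, encoding="utf-8")
          print(f"patched {path}")
      else:
          print(f"no change needed in {path}")
  ```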
- **PowerShell Command Syntax**: When running commands in PowerShell, use semicolons (`;`) instead of `&&` to chain commands:

  ```powershell
  # Incorrect:
  cd quest && python -m alpaca_eval.main make_leaderboard ...

  # Correct:
  cd quest; python -m alpaca_eval.main make_leaderboard ...
  ```
- **Leaderboard Display Error**: You might encounter an error with the `print_leaderboard()` function:

  ```text
  TypeError: print_leaderboard() got an unexpected keyword argument 'leaderboard_mode'
  ```

  This is just a display issue and doesn't affect the evaluation results. The `leaderboard.csv` file will still be created correctly.
- **Checking Results**: If the evaluation completes but you're not sure whether it worked, check that the leaderboard file exists:

  ```powershell
  # On Windows
  dir leaderboard.csv
  ```

  On Linux/Mac, use `ls leaderboard.csv` instead.
## Results

| Model | Win Rate | LC Win Rate | Avg Length |
|---|---|---|---|
| temp150_minp_10 | 56.54% | 58.12% | 1852 |
| temp150_minp_15 | 53.37% | 56.73% | 1816 |
| temp150_minp_20 | 53.38% | 55.45% | 1835 |
| quad_20_100 | 52.80% | 55.43% | 1821 |
| temp100_minp_05 | 52.01% | 55.07% | 1808 |
| temp200_minp_20 | 53.08% | 54.82% | 1861 |
| temp80_topp_98 | 51.29% | 54.65% | 1810 |
| dynatemp_50_150_75_minp_05 | 51.58% | 54.42% | 1807 |
| dynatemp_50_200_100_minp_10 | 51.87% | 54.33% | 1825 |
| temp150_minp_10_seed1337 | 52.86% | 53.84% | 1856 |
| temp170_minp_15 | 52.65% | 53.75% | 1855 |
| temp120_minp_10 | 51.36% | 53.75% | 1829 |
| quad_15_100 | 51.65% | 53.70% | 1843 |
| tfs_95 | 50.79% | 53.49% | 1802 |
| tfs_98 | 50.72% | 53.39% | 1807 |
| temp100_minp_10 | 50.14% | 53.24% | 1793 |
| temp100_topp_98 | 50.43% | 53.00% | 1834 |
| temp100_topp_90 | 50.07% | 52.57% | 1815 |
| temp80 | 49.28% | 52.40% | 1797 |
| temp100_topp_95 | 50.22% | 51.80% | 1835 |
| temp100_minp_02 | 50.43% | 51.62% | 1853 |
| temp80_minp_02 | 48.85% | 51.46% | 1802 |
| temp80_minp_05 | 47.84% | 50.99% | 1808 |
| temp80_topp_95 | 48.78% | 50.76% | 1793 |
| temp100 | 50.00% | 50.00% | 1902 |
| dynatemp_100_250_100_minp_10 | 50.86% | 50.00% | 2227 |
| quad_25_100 | 47.85% | 49.94% | 1807 |
| temp150_tfs_95 | 51.08% | 49.94% | 1969 |
| greedy | 46.64% | 49.90% | 1765 |
| temp150_minp_05 | 48.57% | 48.13% | 1919 |
| temp150_topp_80 | 20.00% | 43.05% | 3576 |
| temp150_minp_02 | 44.83% | 42.09% | 2149 |
| dynatemp_50_150_100 | 35.37% | 34.94% | 2764 |
| mirostat_40_10 | 16.69% | 16.04% | 1848 |
| mirostat_50_10 | 16.40% | 16.04% | 1822 |
| mirostat_60_10 | 15.62% | 14.97% | 1838 |
| temp150_topp_98 | 0.00% | 0.02% | 4136 |
| temp150_topp_95 | 0.00% | 0.00% | 4943 |
| temp150_topp_90 | 0.00% | 0.00% | 9204 |