This repo is an open effort to instruction-tune popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:
- Code for finetuning language models with the latest techniques and instruction datasets in a unified format.
- Code for running standard evaluation on a range of benchmarks targeting different capabilities of these language models.
- Checkpoints or other useful artifacts that we build in our exploration.
Please see our first paper How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources for more thoughts behind this project and our initial findings.
- [2023-08-18] Added support for ToxiGen/TruthfulQA evaluation. Check `scripts/eval/` for examples of running them.
- [2023-08-08] Supported several new instruction datasets, including LIMA / WizardLM / Open-Orca. See the preparation script for details. Performance hasn't been evaluated yet.
- [2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
- [2023-06-29] Added licensing info for our released models.
- [2023-06-09] Released Tülu (a suite of LLaMa models fully finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [Links].
- [2023-06-09] Initial release of the codebase containing the training and evaluation code for our arXiv paper.
To run training, evaluation, or inference for our finetuned models, you need to install the required packages by running the following command (after installing PyTorch):
pip install -r requirements.txt
If you just want the dependencies for the weight diff script, use:
pip install -r weight-diff-requirements.txt
We include a collection of representative instruction datasets in our exploration and are adding new ones to our list. We unify them into the same chat format. To download and prepare these datasets, simply run the following command:
./scripts/prepare_train_data.sh
Please check these datasets for licenses and restrictions around their use!
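For reference, here is a minimal sketch of what a prepared training instance might look like, assuming the unified chat format stores each example as a JSON line with a list of role-tagged messages (the field names here are an assumption; inspect the prepared files to confirm the exact schema):

```python
import json

# Hypothetical example in the unified chat format; the field names are an
# assumption -- check the prepared .jsonl files for the exact schema.
example = {
    "dataset": "dolly",
    "id": "dolly_0",
    "messages": [
        {"role": "user", "content": "Give me three tips for writing clear bug reports."},
        {"role": "assistant", "content": "1. Describe the expected and actual behavior. ..."},
    ],
}

# Training files are JSON Lines: one example per line.
with open("example_train_data.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```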
Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjustments for different tokenizers, etc. Some models may require additional requests to download. E.g., for LLaMa 1 and 2, please consult the Hugging Face documentation for requesting access and converting them to a huggingface-compatible format.
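For example, once a model is in Hugging Face format (the local path below is hypothetical), it can be loaded with the standard transformers API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to a converted LLaMa checkpoint; any
# huggingface-compatible causal LM identifier works the same way.
model_name_or_path = "/path/to/converted/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
```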
You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):
./scripts/finetune_with_accelerate.sh
Make sure to adjust `model_name_or_path`, `tokenizer_name`, `train_file`, and `output_dir` to your models / data / setting. By default, this uses `deepspeed` with `accelerate`.
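Before launching a long run, it can be worth sanity-checking the values you plug into the script. The sketch below (all paths are hypothetical placeholders) just confirms that the tokenizer loads and the training file parses as JSON Lines:

```python
import json
from transformers import AutoTokenizer

# Hypothetical placeholders -- use the same values you set in the script.
tokenizer_name = "/path/to/converted/llama-7b"
train_file = "data/processed/my_train_data.jsonl"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
with open(train_file) as f:
    num_examples = sum(1 for line in f if json.loads(line))
print(f"Tokenizer {tokenizer.__class__.__name__} loaded; {num_examples} training examples parsed.")
```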
We provide a number of model checkpoints that we trained. You can find them on Hugging Face here. Here are some quick links to the checkpoints that are finetuned from LLaMa 1:
| Datasets ↓ Model Sizes → | 7B | 13B | 30B | 65B |
|---|---|---|---|---|
| SuperNI | link | link | | |
| CoT | link | link | | |
| Flan V2 | link | link | | |
| Dolly | link | link | | |
| Open Assistant 1 | link | link | | |
| ShareGPT | link | link | link | link |
| Self-instruct (original) | link | link | | |
| Unnatural Instructions | link | link | | |
| Alpaca | link | link | | |
| Code-Alpaca | link | link | | |
| GPT4-Alpaca | link | link | | |
| Baize | link | link | | |
| Human-Mix | link | link | link | link |
| Tulu | link | link | link | link |
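To try one of these checkpoints, load it like any other Hugging Face causal LM and prompt it in the chat format used for training. In the sketch below, the model path and the `<|user|>` / `<|assistant|>` prompt markers are assumptions; check the model card for the exact name and template, and note that diffed checkpoints must be recovered first (see below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: a local path to a full checkpoint, or a Hugging Face repo id.
model_name_or_path = "/path/to/tulu-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed chat template -- verify against the model card.
prompt = "<|user|>\nWhat is instruction tuning?\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```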
We also trained Pythia and OPT models on the Tulu mixture (aka the Human+GPT mixture), and they are available here:
Some of the checkpoints are released as weight diffs against the base model (mostly for LLaMa 1). We use a slightly modified form of the Alpaca weight diff script, which runs the same way.
To merge a model:
- Download the relevant LLaMa model and convert it to Hugging Face format (see above).
- Download our repository and install the right dependencies (see above).
- Download the model diff you want.
- Run the command below:
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
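Optionally, you can do a quick check that the recovered checkpoint loads cleanly (the path below is a placeholder standing in for the ${output_path} used in the command above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder for the ${output_path} passed to the recover command above.
output_path = "/path/to/recovered/tulu-7b"

tokenizer = AutoTokenizer.from_pretrained(output_path)
model = AutoModelForCausalLM.from_pretrained(output_path)
print(f"Recovered model with {model.num_parameters():,} parameters; vocab size {len(tokenizer)}.")
```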
We provide the scripts for running evaluation of Huggingface/OpenAI models on a list of standard benchmarks targeting the core capabilities of large language models. These benchmarks include:
We are working on including more promising benchmarks into this list. Please stay tuned!
You can use the following script to download all the evaluation data:
./scripts/prepare_eval_data.sh
Evaluation scripts for different datasets are put under `./scripts`. For example, you can use the following command to run the MMLU evaluation script:
./scripts/eval/mmlu.sh
We support using GPT4 to evaluate the quality of model responses, following the GPT4 evaluation protocol proposed in AlpacaFarm. To run this AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:
python eval/alpaca_farm_eval.py --model <model> --batch_size 8
Please check the script itself for more details!
We will release our human evaluation interface and data soon!
This codebase is licensed under Apache 2.0 as given in LICENSE.
The license we use for the models released (along with the base model licenses) can be found in `model_licenses/tulu_license.txt` - just replace `<MODELNAME>` with the actual model name (i.e., the name on HuggingFace).
If you used this repository or our models, please cite our work:
@misc{wang2023far,
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
}