Skip to content

yifan1130/open-instruct

Repository files navigation

Training Open Instruction-Following Language Models

This repo serves as an open effort on instruction-tuning popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:

  1. Code for finetuning language models with latest techniques and instruction datasets in a unified format.
  2. Code for running standard evaluation on a range of benchmarks, targeting for differnt capabilities of these language models.
  3. Checkpoints or other useful artifacts that we build in our exploration.

Please see our first paper How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources for more thoughts behind this project and our initial findings.

Tülu (a hybrid camel) represents a suite of LLaMa models that we built by fully-finetuning them on a strong mix of datasets.

News

  • [2023-09-17] Supported LoRA and QLoRA finetuning. See here for more details.
  • [2023-08-18] Added support for ToxiGen/TrutufulQA evaluation. Check our scripts/eval/ for examples of running them.
  • [2023-08-08] Supported several new instruction dataset, including LIMA / WizardLM / Open-Orca. See the preparation script for details. Performance hasn't been evaluated yet.
  • [2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
  • [2023-06-29] Added licensing info for our released models.
  • [2023-06-09] Released Tülu (a suite of LLaMa models fully-finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [Links].
  • [2023-06-09] Initial release of the codebase containing the training and evaluation code for our arxiv paper.

Setup

To run training, evaluation, or inference for our finetuned models, you need to install the required packages by running the following command (after installing pytorch):

pip install -r requirements.txt

If you just want the dependencies for the weight diff script, use:

pip install -r weight-diff-requirements.txt

Training

Dataset preparation

We include a collection of representative instruction datasets in our exploration and are adding new ones to our list. We unify them into the same chatting format. To download and prepare these datasets, simply run the following command:

./scripts/prepare_train_data.sh

Please check these datasets for licenses and restrictions around their use!

Model preparation

Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjusting for different tokenizers etc. Some models may require addtional requests to download. E.g., for LLaMa 1 and 2, please consult the Hugging Face documentation for requesting access and converting them to a huggingface-compatible format.

Finetuning

You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):

./scripts/finetune_with_accelerate.sh

Make sure to adjust model_name_or_path, tokenizer_name, train_file, and output_dir to your models / data / setting. By default, this uses deepspeed with accelerate.

Parameter-Efficient Finetuning

We support LoRA finetuning, wherein only a small number of parameters are updated, resulting in faster and cheaper training. For even more efficiency, we also support QLoRA finetuning, wherein the non-trained (underlying) model parameters are quantised during 4-bit training. This means you can train a 70b Llama model on a single 80GB A100! Please refer to the respective papers for more details.

Please also note you cannot currently run QLoRA with model parallelism - only data-parallel training is supported, so you cannot train a model that does not fit on one GPU. For LoRA, you can use deepspeed + zero-3 to achieve model parallelism (and FSDP is not currently supported).

Please see ./scripts/finetune_lora_with_accelerate.sh and ./scripts/finetune_qlora_with_accelerate.sh for example hyperparameters. We found a larger rank (e.g. 256) and higher learning rate (e.g. 2e-4) worked best. Additionally, we found that QLoRA tended to always achieve similar results to LoRA, while LoRA itself sometimes fell behind full-finetuning, especially in long, complex generation tasks. However, for most purposes, LoRA training essentially matches full-finetuning performance. Curiously, we found that merging QLoRA modules back into the non-quantised model tended to result in slightly better performance.

Released Checkpoints

We provide a number of model checkpoints that we trained. You can find them on Hugging Face here. Here are some quick links to the checkpoints that are finetuned from LLaMa 1:

Datasets ↓ Model Sizes → 7B 13B 30B 65B
SuperNI link link
CoT link link
Flan V2 link link
Dolly link link
Open Assistant 1 link link
ShareGPT link link link link
Self-instruct (original) link link
Unnatural Instructions link link
Alpaca link link
Code-Alpaca link link
GPT4-Alpaca link link
Baize link link
Human-Mix link link link link
Tulu link link link link

We also trained Pythia and OPT models on the Tulu mixture (aka the Human+GPT mixture), and they are available here:

Weight diff script

Some of the checkpoints are released as weight diffs to the base model (mostly for LLaMa 1). We use a slightly modified form of the Alpaca weight diff script, which runs the same.

To merge a model:

  1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
  2. Download our repository and install the right dependencies (see above).
  3. Download the model diff you want.
  4. Run the command below:
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}

Evaluation

Benchmark-based eval

We provide the scripts for running evaluation of Huggingface/OpenAI models on a list of standard benchmarks targeting for the core capabilities of large language models. These benchmakrs include:

We are working on including more promising benchmarks into this list. Please stay tuned!

You can use the following script to download all the evaluation data:

./scripts/prepare_eval_data.sh

Evaluation scripts for different datasets are put under ./scripts. For example, you can use the following command to run the MMLU evaluation script:

./scripts/eval/mmlu.sh

Model-based eval

We support using GPT4 to evaluate the quality of model's response following the GPT4 evaluation protocol proposed in AlpacaFarm. To run this AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:

python eval/alpaca_farm_eval.py --model <model> --batch_size 8

Please check the script for more details on the script itself!

Human evaluation

We will release our human evaluation interface and data soon!

Licensing

This codebase is licensed under Apache 2.0 as given in LICENSE.

The license we use for the models released (along with the base model licenses) can be found in model_licenses/tulu_license.txt - just replace <MODELNAME> with the actual model name (i.e., the name on HuggingFace).

Citation

If you used this repository or our models, please cite our work:

@misc{wang2023far,
   title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources}, 
   author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
   year={2023},
   eprint={2306.04751},
   archivePrefix={arXiv},
   primaryClass={cs.CL}
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 94.7%
  • Cuda 3.5%
  • Shell 0.8%
  • C++ 0.3%
  • HTML 0.3%
  • JavaScript 0.2%
  • Other 0.2%