Merge remote-tracking branch 'upstream/main' into fix-logprobs
OyvindTafjord committed Sep 22, 2023
2 parents ad2385f + 7040019 commit 0ddab2b
Showing 68 changed files with 3,979 additions and 4,371 deletions.
137 changes: 137 additions & 0 deletions .gitignore
@@ -0,0 +1,137 @@
results
models
wandb
data/*
# !data/processed
output/
beaker_configs/auto_created

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
14 changes: 6 additions & 8 deletions Dockerfile
@@ -1,11 +1,7 @@
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
# This dockerfile is forked from ai2/cuda11.8-cudnn8-dev-ubuntu20.04
FROM gcr.io/ai2-beaker-core/public/cjvktq5s0r0fr8pb7470:latest

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV CUDA_HOME=/usr/local/cuda/

RUN apt-get -y update
RUN apt-get -y install git vim jq curl wget
RUN apt update && apt install -y openjdk-8-jre-headless

RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
RUN apt-get -y install git-lfs
@@ -14,7 +10,9 @@ WORKDIR /stage/

COPY requirements.txt .
RUN pip install --upgrade pip setuptools wheel
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip install packaging
RUN pip install flash-attn --no-build-isolation
RUN pip install -r requirements.txt

COPY open_instruct open_instruct
@@ -24,4 +22,4 @@ COPY scripts scripts
RUN chmod +x scripts/*

# for interactive session
RUN chmod -R 777 /stage/
RUN chmod -R 777 /stage/
142 changes: 85 additions & 57 deletions README.md
@@ -1,24 +1,30 @@
# Training Open Instruction-following Language Models
# Training Open Instruction-Following Language Models

This is the repository for the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
](https://arxiv.org/abs/2306.04751).
This repo serves as an open effort on instruction-tuning popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:

We explore instruction-tuning popular base models on publicly available datasets. This repository contains:
1. Training code used for training all models.
2. Evaluation code for the evaluation done in the paper.
3. Script for merging and creating model diffs.
1. Code for finetuning language models with the latest techniques and instruction datasets in a unified format.
2. Code for running standard evaluation on a range of benchmarks, targeting different capabilities of these language models.
3. Checkpoints or other useful artifacts that we build in our exploration.

As part of this work we introduce Tülu, a suite of LLaMa models fully-finetuned on a strong mix of datasets!
Please see our first paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751) for more thoughts behind this project and our initial findings.

<p align="center">
<img src="images/tulu_logo.png" width="200" />
<p align="center" width="100%">
<img src="images/tulu_logo.png" alt="Tülu (a hybrid camel) represents a suite of LLaMa models that we built by fully-finetuning them on a strong mix of datasets." style="width: 20%; min-width: 200px; display: block; margin: auto;">
</p>

**Tülu 65B is the strongest model we built and it is available [here](https://huggingface.co/allenai/tulu-65b)** - see below for how to make use of this model yourself!
## News

- [2023-09-17] Supported [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314) finetuning. See [here](#parameter-efficient-finetuning) for more details.
- [2023-08-18] Added support for [ToxiGen](https://github.com/microsoft/TOXIGEN)/[TruthfulQA](https://github.com/sylinrl/TruthfulQA) evaluation. Check our `scripts/eval/` for examples of running them.
- [2023-08-08] Supported several new instruction datasets, including [LIMA](https://huggingface.co/datasets/GAIR/lima) / [WizardLM](https://github.com/nlpxucan/WizardLM) / [Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca). See the [preparation script](./scripts/prepare_train_data.sh) for details. Performance hasn't been evaluated yet.
- [2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
- [2023-06-29] Added [licensing info](#licensing) for our released models.
- [2023-06-09] Released Tülu (a suite of LLaMa models fully-finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [[Links]](#released-checkpoints).
- [2023-06-09] Initial release of the codebase containing the training and evaluation code for our [arxiv paper](https://arxiv.org/abs/2306.04751).

## Setup

You can install the required packages by running the following command (after installing pytorch):
To run training, evaluation, or inference for our finetuned models, you need to install the required packages by running the following command (after installing pytorch):

```bash
pip install -r requirements.txt
@@ -29,53 +35,45 @@ If you just want the dependencies for the weight diff script, use:
pip install -r weight-diff-requirements.txt
```

### Model preparation

To get LLaMa checkpoints, please acquire them via Meta [here](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) and consult [the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llama) for converting them to a huggingface-compatible format.

Generally, most huggingface-compatible models should work fine, potentially with some adjusting for different tokenizers etc.


## Weight Diff Script

We use a slightly modified form of the [Alpaca weight diff script](https://github.com/tatsu-lab/stanford_alpaca/blob/main/weight_diff.py), which runs the same.

To merge a model:
1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
2. Download our repository and install the right dependencies (see above).
3. Download the model diff you want.
4. Run the command below:

```bash
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
```

## Training

### Dataset Preparation
### Dataset preparation

To download and prepare the instruction datasets we explore, use:
We include a collection of representative instruction datasets in our exploration and are adding new ones to the list. We unify them into the same chat format. To download and prepare these datasets, simply run the following command:

```bash
./scripts/prepare_train_data.sh
```

Please check these datasets for licenses and restrictions around their use!

### Model preparation

Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjustments for different tokenizers etc. Some models may require additional requests to download. E.g., for LLaMa 1 and 2, please consult [the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llama) for requesting access and converting them to a huggingface-compatible format.
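
If you need to perform the conversion yourself, the conversion script bundled with `transformers` can be used. A minimal sketch, following the Hugging Face LLaMa documentation (paths and model size are placeholders; the script location may differ across `transformers` versions):

```bash
# Convert raw LLaMa weights downloaded from Meta into a Hugging Face-compatible checkpoint.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights \
    --model_size 7B \
    --output_dir /path/to/hf_llama_models/7B
```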

### Finetuning
To run instruction tuning, you can use the following command:

You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):

```bash
./scripts/finetune_with_accelerate.sh
```

Adjust `model_name_or_path`, `tokenizer_name`, `train_file`, and `output_dir` to your models / data / setting. By default, this uses `deepspeed` with `accelerate`.
Make sure to adjust `model_name_or_path`, `tokenizer_name`, `train_file`, and `output_dir` to your models / data / setting. By default, this uses `deepspeed` with `accelerate`.
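
For reference, a minimal sketch of the kind of command such a script runs is shown below. The entry-point module name and the DeepSpeed config path are assumptions for illustration; consult `./scripts/finetune_with_accelerate.sh` for the actual invocation.

```bash
# Sketch only: launch the finetuning entry point via accelerate + deepspeed.
# The script path (open_instruct/finetune.py) and the config file are illustrative placeholders.
accelerate launch \
    --num_processes 4 \
    --use_deepspeed \
    --deepspeed_config_file <path_to_deepspeed_config> \
    open_instruct/finetune.py \
    --model_name_or_path <hf_model_name_or_path> \
    --tokenizer_name <hf_tokenizer_name_or_path> \
    --train_file <path_to_train_data.jsonl> \
    --output_dir output/my_finetuned_model/
```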

### Parameter-Efficient Finetuning

We support [LoRA](https://arxiv.org/abs/2106.09685) finetuning, wherein only a small number of parameters are updated, resulting in faster and cheaper training. For even more efficiency, we also support [QLoRA](https://arxiv.org/abs/2305.14314) finetuning, wherein the non-trained (underlying) model parameters are quantised to 4 bits during training. This means you can train a 70b Llama model on a single 80GB A100! Please refer to the respective papers for more details.

## Model Checkpoints
Please also note you cannot currently run QLoRA with model parallelism - only data-parallel training is supported, so you cannot train a model that does not fit on one GPU. For LoRA, you can use deepspeed + zero-3 to achieve model parallelism (and FSDP is not currently supported).

We provide a number of model checkpoints as diffs. You can find them on Hugging Face [here](https://huggingface.co/models?other=arxiv:2306.04751). They are also all here:
Please see `./scripts/finetune_lora_with_accelerate.sh` and `./scripts/finetune_qlora_with_accelerate.sh` for example hyperparameters. We found a larger rank (e.g. 256) and a higher learning rate (e.g. 2e-4) worked best. Additionally, we found that QLoRA tended to achieve results similar to LoRA, while LoRA itself sometimes fell behind full-finetuning, especially in long, complex generation tasks. However, for most purposes, LoRA training essentially matches full-finetuning performance. Curiously, we found that merging QLoRA modules back into the non-quantised model tended to result in slightly better performance.
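
As a concrete starting point, the two example scripts above can be launched directly; a sketch (both are assumed to wrap an `accelerate` launch similar to the full-finetuning script):

```bash
# LoRA: base model kept in 16-bit, only low-rank adapter weights are trained
# (e.g. rank 256 and learning rate 2e-4, as noted above).
./scripts/finetune_lora_with_accelerate.sh

# QLoRA: base model weights quantised to 4 bits, LoRA adapters trained on top.
./scripts/finetune_qlora_with_accelerate.sh
```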

| **Model** | **7B** | **13B** | **30B** | **65B** |
## Released Checkpoints

We provide a number of model checkpoints that we trained. You can find them on Hugging Face [here](https://huggingface.co/models?other=arxiv:2306.04751). Here are some quick links to the checkpoints that are finetuned from LLaMa 1:

| **Datasets ↓ Model Sizes →** | **7B** | **13B** | **30B** | **65B** |
|--------------------------|--------------------------------------------------------------------------------|---------------------------------------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------------------------------|
| SuperNI | [link](https://huggingface.co/allenai/open-instruct-sni-7b) | [link](https://huggingface.co/allenai/open-instruct-sni-13b) | | |
| CoT | [link](https://huggingface.co/allenai/open-instruct-cot-7b) | [link](https://huggingface.co/allenai/open-instruct-cot-13b) | | |
@@ -93,12 +91,41 @@ We provide a number of model checkpoints as diffs. You can find them on Hugging
| **Tulu** | [link](https://huggingface.co/allenai/tulu-7b) | [link](https://huggingface.co/allenai/tulu-13b) | [link](https://huggingface.co/allenai/tulu-30b) | [link](https://huggingface.co/allenai/tulu-65b) |

We also trained Pythia and OPT models on the Tulu mixture (aka the Human+GPT mixture), and they are available here:

- [Pythia 6.9B Tulu](https://huggingface.co/allenai/open-instruct-pythia-6.9b-tulu)
- [OPT 6.7B Tulu](https://huggingface.co/allenai/open-instruct-opt-6.7b-tulu)
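
To fetch one of these checkpoints locally, one option is to clone its Hugging Face repository with `git-lfs` (a sketch; you can also load the models directly by name with `transformers`):

```bash
# Requires git-lfs (installed in the provided Dockerfile); downloads the full Tulu 7B checkpoint.
git lfs install
git clone https://huggingface.co/allenai/tulu-7b
```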


### Weight diff script

Some of the checkpoints are released as weight diffs to the base model (mostly for LLaMa 1). We use a slightly modified form of the [Alpaca weight diff script](https://github.com/tatsu-lab/stanford_alpaca/blob/main/weight_diff.py), which runs in the same way.

To merge a model:
1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
2. Download our repository and install the right dependencies (see above).
3. Download the model diff you want.
4. Run the command below:

```bash
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
```

## Evaluation

First, run the following script to download all the evaluation datasets:
### Benchmark-based eval

We provide scripts for running evaluation of Huggingface/OpenAI models on a list of standard benchmarks targeting the core capabilities of large language models. These benchmarks include:

- [MMLU](https://github.com/hendrycks/test)
- [Grade School Math (GSM)](https://github.com/openai/grade-school-math)
- [Big-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main)
- [TydiQA](https://github.com/google-research-datasets/tydiqa)
- [Codex HumanEval](https://github.com/openai/human-eval/tree/master)
- [ToxiGen](https://github.com/microsoft/TOXIGEN)

We are working on adding more promising benchmarks to this list. Please stay tuned!

You can use the following script to download all the evaluation data:

```bash
./scripts/prepare_eval_data.sh
@@ -110,36 +137,37 @@ Evaluation scripts for different datasets are put under `./scripts`. For example
./scripts/eval/mmlu.sh
```

### AlpacaFarm
### Model-based eval

We support using GPT-4 to evaluate the quality of a model's responses, following the GPT-4 evaluation protocol proposed in [AlpacaFarm](https://arxiv.org/abs/2305.14387). To run this AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:

To run AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:
```bash
python eval/alpaca_farm_eval.py --model <model> --batch_size 8
```

Please check the script itself for more details!

### Human Evaluation Interface
### Human evaluation

Coming soon!
We will release our human evaluation interface and data soon!

### Licensing
## Licensing

This repo is licensed under Apache 2.0 as given in `LICENSE`.
This codebase is licensed under Apache 2.0 as given in [LICENSE](./LICENSE).

The license we use for the models released (along with the base model licenses) can be found in `model_licenses/tulu_license.txt` - just replace `<MODELNAME>` with the actual model name (i.e., the name on HuggingFace).
The license we use for the models released (along with the base model licenses) can be found in [model_licenses/tulu_license.txt](./model_licenses/tulu_license.txt) - just replace `<MODELNAME>` with the actual model name (i.e., the name on HuggingFace).

# Citation
## Citation

If you used this repository or our models, please cite our work:
```

```bibtex
@misc{wang2023far,
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
