Merge remote-tracking branch 'upstream/main' into fix-logprobs
OyvindTafjord committed Sep 22, 2023
2 parents ad2385f + 7040019 commit 0ddab2b
Showing 68 changed files with 3,979 additions and 4,371 deletions.
137 changes: 137 additions & 0 deletions .gitignore
@@ -0,0 +1,137 @@
results
models
wandb
data/*
# !data/processed
output/
beaker_configs/auto_created

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
14 changes: 6 additions & 8 deletions Dockerfile
@@ -1,11 +1,7 @@
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
# This dockerfile is forked from ai2/cuda11.8-cudnn8-dev-ubuntu20.04
FROM gcr.io/ai2-beaker-core/public/cjvktq5s0r0fr8pb7470:latest

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV CUDA_HOME=/usr/local/cuda/

RUN apt-get -y update
RUN apt-get -y install git vim jq curl wget
RUN apt update && apt install -y openjdk-8-jre-headless

RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
RUN apt-get -y install git-lfs
@@ -14,7 +10,9 @@ WORKDIR /stage/

COPY requirements.txt .
RUN pip install --upgrade pip setuptools wheel
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
RUN pip install packaging
RUN pip install flash-attn --no-build-isolation
RUN pip install -r requirements.txt

COPY open_instruct open_instruct
@@ -24,4 +22,4 @@ COPY scripts scripts
RUN chmod +x scripts/*

# for interactive session
RUN chmod -R 777 /stage/
RUN chmod -R 777 /stage/
142 changes: 85 additions & 57 deletions README.md
@@ -1,24 +1,30 @@
# Training Open Instruction-following Language Models
# Training Open Instruction-Following Language Models

This is the repository for the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
](https://arxiv.org/abs/2306.04751).
This repo serves as an open effort on instruction-tuning popular pretrained language models on publicly available datasets. We release this repo and will keep updating it with:

We explore instruction-tuning popular base models on publicly available datasets. This repository contains:
1. Training code used for training all models.
2. Evaluation code for the evaluation done in the paper.
3. Script for merging and creating model diffs.
1. Code for finetuning language models with the latest techniques and instruction datasets in a unified format.
2. Code for running standard evaluation on a range of benchmarks, targeting different capabilities of these language models.
3. Checkpoints or other useful artifacts that we build in our exploration.

As part of this work we introduce Tülu, a suite of LLaMa models fully-finetuned on a strong mix of datasets!
Please see our first paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751) for more thoughts behind this project and our initial findings.

<p align="center">
<img src="images/tulu_logo.png" width="200" />
<p align="center" width="100%">
<img src="images/tulu_logo.png" alt="Tülu (a hybrid camel) represents a suite of LLaMa models that we built by fully-finetuning them on a strong mix of datasets." style="width: 20%; min-width: 200px; display: block; margin: auto;">
</p>

**Tülu 65B is the strongest model we built and it is available [here](https://huggingface.co/allenai/tulu-65b)** - see below for how to make use of this model yourself!
## News

- [2023-09-17] Supported [LoRA](https://arxiv.org/abs/2106.09685) and [QLoRA](https://arxiv.org/abs/2305.14314) finetuning. See [here](#parameter-efficient-finetuning) for more details.
- [2023-08-18] Added support for [ToxiGen](https://github.com/microsoft/TOXIGEN)/[TruthfulQA](https://github.com/sylinrl/TruthfulQA) evaluation. Check our `scripts/eval/` for examples of running them.
- [2023-08-08] Supported several new instruction datasets, including [LIMA](https://huggingface.co/datasets/GAIR/lima) / [WizardLM](https://github.com/nlpxucan/WizardLM) / [Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca). See the [preparation script](./scripts/prepare_train_data.sh) for details. Performance hasn't been evaluated yet.
- [2023-08-06] Supported LLaMa 2 finetuning and FlashAttention-2 by bumping the version of transformers and many other dependencies.
- [2023-06-29] Added [licensing info](#licensing) for our released models.
- [2023-06-09] Released Tülu (a suite of LLaMa models fully-finetuned on a strong mix of datasets) and many other checkpoints on HuggingFace [[Links]](#released-checkpoints).
- [2023-06-09] Initial release of the codebase containing the training and evaluation code for our [arxiv paper](https://arxiv.org/abs/2306.04751).

## Setup

You can install the required packages by running the following command (after installing pytorch):
To run training, evaluation, or inference for our finetuned models, you need to install the required packages by running the following command (after installing pytorch):

```bash
pip install -r requirements.txt
@@ -29,53 +35,45 @@ If you just want the dependencies for the weight diff script, use:
pip install -r weight-diff-requirements.txt
```

### Model preparation

To get LLaMa checkpoints, please acquire them via Meta [here](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) and consult [the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llama) for converting them to a huggingface-compatible format.

Generally, most huggingface-compatible models should work fine, potentially with some adjusting for different tokenizers etc.


## Weight Diff Script

We use a slightly modified form of the [Alpaca weight diff script](https://github.com/tatsu-lab/stanford_alpaca/blob/main/weight_diff.py), which runs the same.

To merge a model:
1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
2. Download our repository and install the right dependencies (see above).
3. Download the model diff you want.
4. Run the command below:

```bash
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
```

## Training

### Dataset Preparation
### Dataset preparation

To download and prepare the instruction datasets we explore, use:
We include a collection of representative instruction datasets in our exploration and are adding new ones to the list. We unify them into the same chat format. To download and prepare these datasets, simply run the following command:

```bash
./scripts/prepare_train_data.sh
```

Please check these datasets for licenses and restrictions around their use!

### Model preparation

Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjustments for different tokenizers etc. Some models may require additional requests to download. E.g., for LLaMa 1 and 2, please consult [the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/llama) for requesting access and converting them to a huggingface-compatible format.
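
If you need to perform the conversion yourself, the conversion script bundled with `transformers` can be used. A minimal sketch, following the Hugging Face LLaMa documentation (paths and model size are placeholders; the script location may differ across `transformers` versions):

```bash
# Convert raw LLaMa weights downloaded from Meta into a Hugging Face-compatible checkpoint.
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights \
    --model_size 7B \
    --output_dir /path/to/hf_llama_models/7B
```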

### Finetuning
To run instruction tuning, you can use the following command:

You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):

```bash
./scripts/finetune_with_accelerate.sh
```

Adjust `model_name_or_path`, `tokenizer_name`, `train_file`, and `output_dir` to your models / data / setting. By default, this uses `deepspeed` with `accelerate`.
Make sure to adjust `model_name_or_path`, `tokenizer_name`, `train_file`, and `output_dir` to your models / data / setting. By default, this uses `deepspeed` with `accelerate`.
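
For reference, a minimal sketch of the kind of command such a script runs is shown below. The entry-point module name and the DeepSpeed config path are assumptions for illustration; consult `./scripts/finetune_with_accelerate.sh` for the actual invocation.

```bash
# Sketch only: launch the finetuning entry point via accelerate + deepspeed.
# The script path (open_instruct/finetune.py) and the config file are illustrative placeholders.
accelerate launch \
    --num_processes 4 \
    --use_deepspeed \
    --deepspeed_config_file <path_to_deepspeed_config> \
    open_instruct/finetune.py \
    --model_name_or_path <hf_model_name_or_path> \
    --tokenizer_name <hf_tokenizer_name_or_path> \
    --train_file <path_to_train_data.jsonl> \
    --output_dir output/my_finetuned_model/
```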

### Parameter-Efficient Finetuning

We support [LoRA](https://arxiv.org/abs/2106.09685) finetuning, wherein only a small number of parameters are updated, resulting in faster and cheaper training. For even more efficiency, we also support [QLoRA](https://arxiv.org/abs/2305.14314) finetuning, wherein the non-trained (underlying) model parameters are quantised to 4 bits during training. This means you can train a 70b Llama model on a single 80GB A100! Please refer to the respective papers for more details.

## Model Checkpoints
Please also note you cannot currently run QLoRA with model parallelism - only data-parallel training is supported, so you cannot train a model that does not fit on one GPU. For LoRA, you can use deepspeed + zero-3 to achieve model parallelism (and FSDP is not currently supported).

We provide a number of model checkpoints as diffs. You can find them on Hugging Face [here](https://huggingface.co/models?other=arxiv:2306.04751). They are also all here:
Please see `./scripts/finetune_lora_with_accelerate.sh` and `./scripts/finetune_qlora_with_accelerate.sh` for example hyperparameters. We found a larger rank (e.g. 256) and a higher learning rate (e.g. 2e-4) worked best. Additionally, we found that QLoRA tended to achieve results similar to LoRA, while LoRA itself sometimes fell behind full-finetuning, especially in long, complex generation tasks. However, for most purposes, LoRA training essentially matches full-finetuning performance. Curiously, we found that merging QLoRA modules back into the non-quantised model tended to result in slightly better performance.
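
As a concrete starting point, the two example scripts above can be launched directly; a sketch (both are assumed to wrap an `accelerate` launch similar to the full-finetuning script):

```bash
# LoRA: base model kept in 16-bit, only low-rank adapter weights are trained
# (e.g. rank 256 and learning rate 2e-4, as noted above).
./scripts/finetune_lora_with_accelerate.sh

# QLoRA: base model weights quantised to 4 bits, LoRA adapters trained on top.
./scripts/finetune_qlora_with_accelerate.sh
```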

| **Model** | **7B** | **13B** | **30B** | **65B** |
## Released Checkpoints

We provide a number of model checkpoints that we trained. You can find them on Hugging Face [here](https://huggingface.co/models?other=arxiv:2306.04751). Here are some quick links to the checkpoints that are finetuned from LLaMa 1:

| **Datasets ↓ Model Sizes →** | **7B** | **13B** | **30B** | **65B** |
|--------------------------|--------------------------------------------------------------------------------|---------------------------------------------------------------------------------|--------------------------------------------------------------------|--------------------------------------------------------------------|
| SuperNI | [link](https://huggingface.co/allenai/open-instruct-sni-7b) | [link](https://huggingface.co/allenai/open-instruct-sni-13b) | | |
| CoT | [link](https://huggingface.co/allenai/open-instruct-cot-7b) | [link](https://huggingface.co/allenai/open-instruct-cot-13b) | | |
@@ -93,12 +91,41 @@ We provide a number of model checkpoints as diffs. You can find them on Hugging
| **Tulu** | [link](https://huggingface.co/allenai/tulu-7b) | [link](https://huggingface.co/allenai/tulu-13b) | [link](https://huggingface.co/allenai/tulu-30b) | [link](https://huggingface.co/allenai/tulu-65b) |

We also trained Pythia and OPT models on the Tulu mixture (aka the Human+GPT mixture), and they are available here:

- [Pythia 6.9B Tulu](https://huggingface.co/allenai/open-instruct-pythia-6.9b-tulu)
- [OPT 6.7B Tulu](https://huggingface.co/allenai/open-instruct-opt-6.7b-tulu)
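
To fetch one of these checkpoints locally, one option is to clone its Hugging Face repository with `git-lfs` (a sketch; you can also load the models directly by name with `transformers`):

```bash
# Requires git-lfs (installed in the provided Dockerfile); downloads the full Tulu 7B checkpoint.
git lfs install
git clone https://huggingface.co/allenai/tulu-7b
```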


### Weight diff script

Some of the checkpoints are released as weight diffs to the base model (mostly for LLaMa 1). We use a slightly modified form of the [Alpaca weight diff script](https://github.com/tatsu-lab/stanford_alpaca/blob/main/weight_diff.py), which runs in the same way.

To merge a model:
1. Download the relevant LLaMa model and convert it to Hugging Face format (see above).
2. Download our repository and install the right dependencies (see above).
3. Download the model diff you want.
4. Run the command below:

```bash
python scripts/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
```

## Evaluation

First, run the following script to download all the evaluation datasets:
### Benchmark-based eval

We provide scripts for running evaluation of Huggingface/OpenAI models on a list of standard benchmarks targeting the core capabilities of large language models. These benchmarks include:

- [MMLU](https://github.com/hendrycks/test)
- [Grade School Math (GSM)](https://github.com/openai/grade-school-math)
- [Big-Bench Hard (BBH)](https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main)
- [TydiQA](https://github.com/google-research-datasets/tydiqa)
- [Codex HumanEval](https://github.com/openai/human-eval/tree/master)
- [ToxiGen](https://github.com/microsoft/TOXIGEN)

We are working on adding more promising benchmarks to this list. Please stay tuned!

You can use the following script to download all the evaluation data:

```bash
./scripts/prepare_eval_data.sh
@@ -110,36 +137,37 @@ Evaluation scripts for different datasets are put under `./scripts`. For example
./scripts/eval/mmlu.sh
```

### AlpacaFarm
### Model-based eval

We support using GPT-4 to evaluate the quality of a model's responses, following the GPT-4 evaluation protocol proposed in [AlpacaFarm](https://arxiv.org/abs/2305.14387). To run this AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:

To run AlpacaFarm eval, please make sure you install our fork of AlpacaFarm (https://github.com/hamishivi/alpaca_farm) and use the following script:
```bash
python eval/alpaca_farm_eval.py --model <model> --batch_size 8
```

Please check the script itself for more details!

### Human Evaluation Interface
### Human evaluation

Coming soon!
We will release our human evaluation interface and data soon!

### Licensing
## Licensing

This repo is licensed under Apache 2.0 as given in `LICENSE`.
This codebase is licensed under Apache 2.0 as given in [LICENSE](./LICENSE).

The license we use for the models released (along with the base model licenses) can be found in `model_licenses/tulu_license.txt` - just replace `<MODELNAME>` with the actual model name (i.e., the name on HuggingFace).
The license we use for the models released (along with the base model licenses) can be found in [model_licenses/tulu_license.txt](./model_licenses/tulu_license.txt) - just replace `<MODELNAME>` with the actual model name (i.e., the name on HuggingFace).

# Citation
## Citation

If you used this repository or our models, please cite our work:
```

```bibtex
@misc{wang2023far,
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
