NOM-796 Clean up Trainer Code (nomic-ai#1)
* feat: forward single gpu working for clip

* fix: wandb logging, multigpu single node working

* fix: learning rate scheduler bug

accelerate used to step the scheduler once per GPU; now we don't have to account for that!

* fix: model + logit saving

* chore: multiprocess/datasets broken

* fix: datasets version

* fix: working run

* chore: add dev reqs

* feat: deepspeed at least runs

* fix: eval loop and saving working, set seed correctly

* docs: update readme

* docs: add citation

* fix: scheduler, saving for deepspeed

* docs: update readme with new command

* chore: spelling

* fix: autocast

* feat: glue trainer

* fix: remove old files, update config

* nit: prints

* style: black isort

* docs: add arxiv link
zanussbaum authored Feb 5, 2024
1 parent b0e94ca commit 564ee7a
Showing 30 changed files with 1,405 additions and 4,159 deletions.
26 changes: 23 additions & 3 deletions README.md
@@ -12,7 +12,11 @@
- Support for training on multiple GPUs
- [GradCache](https://github.com/luyug/GradCache) support for training with large batch sizes in constrained memory environments
- Huggingface Support for easy loading of common models (Pythia/GPTNeoX, BERT, etc.)
- Masked Lanugage Modeling (MLM) Pretraining
- Masked Language Modeling (MLM) Pretraining

## Research

* [Nomic Embed: Training a Reproducible Long Context Text Embedder](https://arxiv.org/abs/2402.01613) by Zach Nussbaum, Jack Morris, Andrei Mulyar, and Brandon Duderstadt

## Getting Started and Requirements

@@ -120,7 +124,7 @@ To train your own BERT from scratch (with all the optimizations) run

```bash
cd src/contrastors
accelerate launch --num_processes=8 --num_machines=1 --mixed_precision=bf16 --use_deepspeed --deepspeed_config_file=configs/deepspeed/ds_config.json train_mlm.py --config=configs/train/mlm.yaml
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config_file=configs/deepspeed/ds_config.json --dtype=bf16
```

### Contrastive Pretraining and Finetuning
@@ -129,7 +133,7 @@ To launch an experiment run

```bash
cd src/contrastors
accelerate launch --num_processes=8 --num_machines=1 --mixed_precision=bf16 train_text_text.py --config=configs/train/contrastive_pretrain.yaml
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16
```

This will train a bert model on all ~200M examples. To change the dataset, you can modify `data_args.input_shards`.
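
For reference, the data block in the training YAML looks like the sketch below (key names and values taken from `configs/train/contrastive_finetune.yaml` as changed in this commit; note the block is named `contrastive_data_args` in the updated configs, other fields are omitted, and the shards path is the value to swap):

```yaml
# data block from the updated training configs; point input_shards at your own shards YAML
contrastive_data_args:
  input_shards: "configs/data/finetune_triplets.yaml"  # swap this path to change the dataset
  workers: 8
  batch_size: 256
```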
@@ -165,3 +169,19 @@ This project and models are licensed under the [Apache 2.0 License](LICENSE).

We thank Tri Dao for his work on Flash Attention and the custom kernels that make this project possible, the [OpenCLIP](https://github.com/mlfoundations/open_clip) team for their
great repository, on which much of this work is based, and the Huggingface team for their great work on the transformers library.


## Citation

If you find the model, dataset, or training code useful, please cite our work:

```bibtex
@misc{nussbaum2024nomic,
title={Nomic Embed: Training a Reproducible Long Context Text Embedder},
author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
year={2024},
eprint={2402.01613},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
datasets
datasets>=2.16.0
nomic>3.0.0
webdataset
s3fs>=2023.10.0
5 changes: 4 additions & 1 deletion setup.py
@@ -33,7 +33,10 @@
package_dir={'': 'src'},
packages=find_packages(where='contrastors'),
install_requires=requirements,
extras_require={"eval": ["openai", "tiktoken", "mteb[beir]"]},
extras_require={
"eval": ["openai", "tiktoken", "mteb[beir]", "multiprocess==0.70.15"],
"dev": ["pytest", "black", "isort"],
},
include_package_data=True,
python_requires='>=3.7',
)
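
With the new extras declared above, the optional dependencies can be installed via standard pip extras syntax (the editable install below is an assumed workflow, not documented in this commit):

```bash
# install contrastors in editable mode with the eval and dev extras added in this commit
pip install -e ".[eval,dev]"
```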
104 changes: 84 additions & 20 deletions src/contrastors/config.py
@@ -1,28 +1,27 @@
from typing import Any, Dict, Optional, Tuple, Union

from pydantic import BaseModel, validator

from contrastors.dataset.constants import OPENAI_IMAGE_DATASET_MEAN, OPENAI_IMAGE_DATASET_STD
from pydantic import BaseModel, root_validator, validator


class TrainArgs(BaseModel):
num_epochs: int
learning_rate: float
weight_decay: float
eps: Optional[float]
eps: Optional[float] = 1e-8
warmup_steps: Optional[int]
warmup_pct: Optional[float]
cooldown_steps: Optional[int]
checkpoint: Optional[str]
wandb: bool
wandb_project_name: str
wandb_entity: str
wandb_run_name: Optional[str]
log_grads_every: int
log_lr_every: int
save_every: Optional[int]
eval_every: Optional[int]
eval_steps: Optional[int]
eval_strategy: Optional[str]
output_dir: Optional[str]
gradient_checkpointing: Optional[bool]
gradient_accumulation_steps: Optional[int] = 1
# if using deepspeed, this will be ignored
schedule_type: str
@@ -32,45 +31,86 @@ class TrainArgs(BaseModel):
loss_fn: Optional[str]
grad_cache: Optional[bool]
chunk_size: Optional[int]
logit_scale: Optional[float] = 1 / 0.07
clamp_logits: Optional[bool] = True
logit_max: Optional[float] = 100.0
add_l2_loss: Optional[bool] = False

class Config:
validate_assignment = True

@validator('logit_scale')
def set_logit_scale(cls, scale):
return scale or 1 / 0.07

@validator('logit_max')
def set_logit_max(cls, max):
return max or 100.0

@validator("eval_strategy")
def validate_eval_strategy(cls, strategy):
if strategy not in ["steps", "epochs"]:
raise ValueError(f"Eval strategy {strategy} not found in eval strategy registry")
return strategy

@root_validator
def validate_steps_set(cls, values):
# validate that eval_steps is set if eval_strategy is set to steps
eval_steps, eval_strategy = values.get("eval_steps"), values.get("eval_strategy")
if eval_strategy == "steps" and eval_steps is None:
raise ValueError("Eval steps must be set if eval strategy is set to steps")

return values


class DataArgs(BaseModel):
input_shards: Optional[str]
tokenized_dataset: Optional[str]
task_name: Optional[Optional[str]]
image_text_shards: Optional[str]
shuffle: bool
workers: int
batch_size: int
seed: int
train_num_samples: Optional[int]
shuffle: bool
val_pct: Optional[float] = None


class MLMDataArgs(DataArgs):
tokenized_dataset: Optional[str]
mlm_prob: Optional[float]
task_name: Optional[str]
val_mlm_prob: Optional[float]
val_pct: Optional[float]

@root_validator
def validate_data(cls, values):
tokenized, task_name = values.get("tokenized_dataset"), values.get("task_name")
if tokenized is None and task_name is None:
raise ValueError("Either tokenized dataset or task name must be set")
return values

@root_validator
def validate_mlm(cls, values):
tokenized, mlm_prob, val_prob = (
values.get("tokenized_dataset"),
values.get("mlm_prob"),
values.get("val_mlm_prob"),
)
# validate mlm_prob if tokenized is set
if tokenized is not None and mlm_prob is None:
raise ValueError("MLM probability must be set if tokenized dataset is set")
if tokenized is not None and val_prob is None:
raise ValueError("Validation MLM probability must be set if tokenized dataset is set")
if mlm_prob is not None and (mlm_prob < 0 or mlm_prob > 1):
raise ValueError("MLM probability must be between 0 and 1")
if val_prob is not None and (val_prob < 0 or val_prob > 1):
raise ValueError("Validation MLM probability must be between 0 and 1")
return values


class ContrastiveDataArgs(DataArgs):
input_shards: str
download: Optional[bool] = False
process_one_shard: Optional[bool] = False
streaming: Optional[bool] = True
weighted_sampling: Optional[bool] = False
verbose: Optional[bool] = False
imagenet_val_path: Optional[str] = None


class ModelArgs(BaseModel):
model_type: str
logit_scale: Optional[float] = 1 / 0.07
trainable_logit_scale: Optional[bool] = False
seq_len: Optional[int]
rotary_emb_fraction: Optional[float]
pad_vocab_to_multiple_of: Optional[int]
@@ -89,9 +129,33 @@ class ModelArgs(BaseModel):
attn_pdrop: Optional[float] = 0.0
projection_dim: Optional[int] = None
freeze: Optional[bool] = False
gradient_checkpointing: Optional[bool] = False

@validator('logit_scale')
def set_logit_scale(cls, scale):
return scale or 1 / 0.07

@validator('model_type')
def validate_model_type(cls, model_type):
if model_type not in ["encoder", "mlm", "glue"]:
raise ValueError(f"Model type {model_type} not found in model registry")
return model_type


class Config(BaseModel):
train_args: TrainArgs
data_args: DataArgs
mlm_data_args: Optional[MLMDataArgs]
contrastive_data_args: Optional[ContrastiveDataArgs]
model_args: Optional[ModelArgs]
deepspeed: Optional[bool] = False
deepspeed_config: Optional[dict] = None

@root_validator
def check_args(cls, values):
mlm, contrastive = values.get("mlm_data_args"), values.get("contrastive_data_args")

# Exactly one of mlm_data_args / contrastive_data_args must be set
if (mlm is None and contrastive is None) or (mlm is not None and contrastive is not None):
raise ValueError('Either mlm_data_args or contrastive_data_args must be set, but not both')

return values
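
For readers unfamiliar with the pattern, here is a self-contained sketch of the mutual-exclusivity check used in `Config.check_args` above (toy field names and pydantic v1 API; not the repo's actual models):

```python
# Toy sketch of the root_validator pattern from config.py: exactly one of two
# optional blocks must be provided. Field names here are illustrative only.
from typing import Optional

from pydantic import BaseModel, ValidationError, root_validator


class ToyConfig(BaseModel):
    mlm_block: Optional[dict] = None
    contrastive_block: Optional[dict] = None

    @root_validator
    def exactly_one_block(cls, values):
        mlm, contrastive = values.get("mlm_block"), values.get("contrastive_block")
        if (mlm is None) == (contrastive is None):
            raise ValueError("set exactly one of mlm_block / contrastive_block")
        return values


ToyConfig(mlm_block={"mlm_prob": 0.3})  # valid: exactly one block set
try:
    ToyConfig()  # invalid: neither block set
except ValidationError as err:
    print(err)
```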
2 changes: 1 addition & 1 deletion src/contrastors/configs/deepspeed/ds_config.json
@@ -11,7 +11,7 @@
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"bf16": {
"enabled": "auto"
"enabled": "true"
},
"gradient_clipping": 0.0,
"zero_optimization": {
8 changes: 5 additions & 3 deletions src/contrastors/configs/train/contrastive_finetune.yaml
@@ -20,11 +20,13 @@ train_args:
grad_cache: false
loss_fn: "clip"
use_fp8: false
logit_scale: 50
clamp_logits: false
logit_max: 100

model_args:
model_type: "encoder"
logit_scale: 50
trainable_logit_scale: false
seq_len: 512
pooling: "mean"
encoder: true
@@ -35,7 +37,7 @@ model_args:
pretrained: null


data_args:
contrastive_data_args:
input_shards: "configs/data/finetune_triplets.yaml"
workers: 8
batch_size: 256
@@ -44,4 +46,4 @@ data_args:
download: true
streaming: true
weighted_sampling: false
verbose: false
verbose: true
6 changes: 4 additions & 2 deletions src/contrastors/configs/train/contrastive_pretrain.yaml
@@ -21,11 +21,13 @@ train_args:
grad_cache: true
loss_fn: "clip"
use_fp8: false
logit_scale: 50
clamp_logits: false
logit_max: 100

model_args:
logit_scale: 50
trainable_logit_scale: false
model_type: "encoder"
seq_len: 2048
rotary_emb_fraction: 0.0
pad_vocab_to_multiple_of: 64
@@ -40,7 +42,7 @@ model_args:
mlp_fc1_bias: false
mlp_fc2_bias: false

data_args:
contrastive_data_args:
input_shards: "configs/data/contrastive_pretrain.yaml"
workers: 0
batch_size: 16384
37 changes: 37 additions & 0 deletions src/contrastors/configs/train/glue.yaml
@@ -0,0 +1,37 @@
train_args:
num_epochs: 10
learning_rate: 3.0e-5
adam_beta1: 0.9
adam_beta2: 0.98
weight_decay: 1.0e-6
eps: 1e-6
max_grad_norm: 0.0
schedule_type: "linear"

warmup_steps: null
warmup_pct: 0.06
cooldown_steps: null
checkpoint: null

wandb: true
wandb_project_name: "bert"
wandb_entity: "gpt4all"
wandb_run_name: "glue-trainer-test"

log_grads_every: 100
log_lr_every: 10
save_every: -1
eval_strategy: "epochs"

model_args:
model_type: "glue"
seq_len: 128
tokenizer_name: "bert-base-uncased"
pretrained: "ckpts/mlm-trainer/epoch_0_model"

mlm_data_args:
task_name: "cola"
workers: -1
batch_size: 16
seed: 42
shuffle: true
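
The exact launch command for the GLUE trainer isn't shown in this diff; assuming it follows the same `train.py --config` pattern as the README commands above, a run might look like:

```bash
# assumed invocation, mirroring the README's train.py usage; adjust GPU count as needed
cd src/contrastors
torchrun --nproc-per-node=1 train.py --config=configs/train/glue.yaml --dtype=bf16
```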