Commit 0e3f2c9

update dpo (LianjiaTech#528)
Co-authored-by: tianxiaoyu011 <[email protected]>
Emperorizzis and tianxiaoyu011 authored Sep 27, 2023
1 parent f4a43fd commit 0e3f2c9
Showing 4 changed files with 622 additions and 9 deletions.
58 changes: 49 additions & 9 deletions train/README_RLHF.md

# RLHF Training Pipeline

## I. PPO

### 1. Reward Model

#### 1.1 Data Preparation

```jsonl
{"chosen": xxx, "rejected": xxx}
```

Note:
The xxx text must already include the prompt that distinguishes the human from the bot, e.g. `Human: \n{text}\n\nAssistant: \n{text}`
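
An illustrative reward-model record (hypothetical contents, shown only to make the format concrete) might look like:

```jsonl
{"chosen": "Human: \nWhat is the chemical formula of water?\n\nAssistant: \nThe chemical formula of water is H2O: two hydrogen atoms and one oxygen atom per molecule.", "rejected": "Human: \nWhat is the chemical formula of water?\n\nAssistant: \nI am not sure."}
```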

#### 1.2 Training

```bash
bash scripts/run_rm.sh
```

- load_in_4bit: whether to load the model in 4-bit

Note:

- DeepSpeed stage 3 runs directly
- DeepSpeed stage 3 + LoRA is not supported; DeepSpeed stage 1/2 + LoRA works
- load_in_8bit and load_in_4bit cannot be combined with DeepSpeed, but they can be combined with LoRA. To use them, change "distributed_type" from "DEEPSPEED" to "MULTI_GPU" in `configs/accelerate_config_rm.yaml`

#### TODO

- [ ] DeepSpeed stage 3 + LoRA support

### 2. PPO

#### 2.1 Data Preparation

```jsonl
{"text": xxx}
```

Note: the xxx text must already include the prompt that distinguishes the human from the bot, e.g. `Human: \n{text}\n\nAssistant: \n`
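
For instance, a single prompt record (hypothetical content, shown only to illustrate the format) might be:

```jsonl
{"text": "Human: \nWhat is the chemical formula of water?\n\nAssistant: \n"}
```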

#### 2.2 Training

```bash
bash scripts/run_ppo.sh
```

- data_epochs: number of epochs to train over the prompt data

Note:

- DeepSpeed ZeRO stage 3 is supported; the model, ref_model, and reward_model are all sharded
- DeepSpeed ZeRO stage 3 + LoRA is supported
- $batch\_size == mini\_batch\_size * gradient\_accumulation\_steps$
- The dataset must be larger than `num_processes * batch_size`, otherwise some processes receive no data and the run fails; the `Train dataset length` value in the output shows the dataset size after length filtering

#### TODO

- [ ] The effective batch size per training step is `num_processes * batch_size`, but each process draws its `mini_batch` only from its own `batch` rather than from the global pool of `num_processes * batch_size` samples, so the `mini_batch` sampled on each GPU is not fully random and never contains samples from other processes' `batch`es
- [ ] gradient checkpointing
- [ ] resume from checkpoint



## II. DPO

### 2.1 Data Preparation

Format:

```jsonl
{"chosen":xxx, "reject":xxx, "prompt":xxx}
```

A sample record:

```jsonl
{"chosen": "水的化学式是H2O。这意味着每个水分子由两个氢原子(H)和一个氧原子(O)组成。在这个结构中,氢原子和氧原子通过共价键相连。", "rejected": "H2O.", "prompt": "Human: \n水的化学式是什么?\n\nAssistant: \n"}
```

### 2.2 Training

First, replace the "..." placeholders in the "run_dpo.sh" script under "train/scripts" with the desired parameter values.

Then run:

```bash
cd train/scripts
bash run_dpo.sh
```



51 changes: 51 additions & 0 deletions train/configs/deepspeed_config_stage3_dpo.json
{
"bfloat16": {
"enabled": true
},
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"weight_decay": "auto",
"betas": "auto",
"eps": "auto",
"torch_adam": true,
"adam_w_mode": true
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e12,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 1e5,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
61 changes: 61 additions & 0 deletions train/scripts/run_dpo.sh
#! /bin/bash

dataset_name=...
model_name=...
torch_dtype=bfloat16
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
num_train_epochs=3
save_total_limit=1
learning_rate=...
weight_decay=0.0001
warmup_ratio=0.03
eval_and_save_ratio_per_epoch=0.1
max_length=...
max_prompt_length=...

model_name_or_path=/.../${model_name}
train_file=/.../${dataset_name}/${dataset_name}.train.json
validation_file=/.../${dataset_name}/${dataset_name}.dev.json

output_model_name=${model_name}_${dataset_name}_${learning_rate}_epoch${num_train_epochs}_${torch_dtype}
output_dir=/.../${output_model_name}

logging_dir=/.../${output_model_name}

# here we recommend using configs/deepspeed_config_stage3_dpo.json
deepspeed_config=...

torchrun --nnodes=1 --nproc_per_node=8 ../src/dpo_trainer.py \
--ddp_timeout 50000 \
--model_name_or_path ${model_name_or_path} \
--torch_dtype ${torch_dtype} \
--bf16 True \
--trust_remote_code True \
--load_best_model_at_end True \
--prediction_loss_only False \
--deepspeed ${deepspeed_config} \
--train_file ${train_file} \
--validation_file ${validation_file} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--per_device_eval_batch_size ${per_device_eval_batch_size} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--num_train_epochs ${num_train_epochs} \
--max_length ${max_length} \
--max_prompt_length ${max_prompt_length} \
--save_total_limit ${save_total_limit} \
--save_strategy "steps" \
--evaluation_strategy "steps" \
--metric_for_best_model "rewards/accuracies" \
--learning_rate ${learning_rate} \
--weight_decay ${weight_decay} \
--warmup_ratio ${warmup_ratio} \
--eval_and_save_ratio_per_epoch ${eval_and_save_ratio_per_epoch} \
--lr_scheduler_type "cosine" \
--logging_steps 3 \
--seed 3407 \
--gradient_checkpointing True \
--output_dir ${output_dir} \
--report_to "tensorboard" \
--logging_dir ${logging_dir}