GPU devices: A10, 3090, V100, A100 are all acceptable. For GPUs with memory <=24G, at least a dual-card environment is required. Since human alignment training loads two models on one card, it occupies more memory than fine-tuning due to an additional inference model's memory consumption.
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]
# Environment alignment (usually not necessary. If you encounter errors, you can run the following code, the repository uses the latest environment for testing)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
The following shell script runs a human alignment training. First, you need to switch to the runtime directory:
cd examples/pytorch/llm
Run the following command:
# Experimental environment: 4*A100
# Memory usage: 4 * 20G, dual-card device_map * 2ddp
nproc_per_node=2
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
llm_dpo.py \
--model_type yi-6b-chat \
--ref_model_type yi-6b-chat \
--model_revision master \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output \
--dataset hh-rlhf-cn-harmless-base-cn \
--train_dataset_sample -1 \
--truncation_strategy truncation_left \
--val_dataset_sample 2000 \
--num_train_epochs 3 \
--max_length 1024 \
--max_prompt_length 512 \
--check_dataset_strategy none \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 2000 \
--save_steps 2000 \
--save_total_limit 2 \
--logging_steps 10 \
The sh script can be viewed here.
# The following script needs to be executed in this directory
cd examples/pytorch/llm
Tips:
- We default to setting
--gradient_checkpointing true
during training to save memory, which will slightly reduce training speed. - If you are using older GPUs such as V100, you need to set
--dtype AUTO
or--dtype fp16
, because they do not support bf16. - If your machine has high-performance graphics cards like A100 and you are using the qwen series models, we recommend installing flash-attn, which will speed up training and inference as well as reduce memory usage (A10, 3090, V100, etc. graphics cards do not support training with flash-attn). Models that support flash-attn can be viewed in LLM Supported Models
- If you need to train offline, please use
--model_id_or_path <model_dir>
and set--check_model_is_latest false
. For specific parameter meanings, please see Command Line Arguments. - If you want to push weights to the ModelScope Hub during training, you need to set
--push_to_hub true
.
# dpo training for mistral-7b max_length=1024, bs=1
# Recommended experimental environment: V100, A10, 3090, 2 cards, 4 cards or 8 cards
bash scripts/dpo/lora_ddp_mp/dpo.sh
bash scripts/dpo/lora_ddp_mp/infer.sh
Since DPO training will result in a complete model or adapter weights, the steps for LoRA merging and inference are the same as for fine-tuning, so please refer to the corresponding steps in the Fine-tuning Documentation.