This is the repository for the Alpaca-CoT project, which aims to build an instruction fine-tuning (IFT) platform with an extensive collection of instruction datasets (especially CoT datasets) and a unified interface for various large language models (LLMs) and parameter-efficient methods. We are continuously expanding our instruction-tuning data collection and integrating more LLMs and more parameter-efficient methods. In addition, we have created a new branch tabular_llm to build Tabular LLMs for solving table intelligence tasks.

You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that have not yet been collected. We will uniformly format them, train the Alpaca model (and other LLMs in the future) with these datasets, open-source the model checkpoints, and conduct extensive empirical studies. We hope our project can make a modest contribution to the open-source progress of large language models and lower the entry barrier for NLP researchers.

You can also choose to join our group chat (WeChat) to communicate with more people who share the same interests. The number of group members is currently too large to join directly via the group QR code; you need to contact us first to be added.
⚠ If you want to use methods other than LoRA, please install the edited version in our project: `pip install -e ./peft`.
- 🚀12.8: LLM `InternLM` is merged.
- 🚀8.16: `4bit quantization` is available for `lora`, `qlora` and `adalora`.
- 🚀8.16: Parameter-efficient methods `QLoRA`, `Sequential adapter` and `Parallel adapter` are merged.
- 🚀7.24: LLM `ChatGLM v2` is merged.
- 🚀7.20: LLM `Baichuan` is merged.
- 6.25: Added model evaluation code, including Belle and MMCU.
- 5.20: Fixed a bug in model saving and added wandb support.
- 5.15: Added more datasets, such as `GPT4Tools`, `Auto-CoT` and `pCLUE`.
- 🚀5.5: Created a new branch `tabular_llm` to build Tabular LLMs. We collect instruction fine-tuning data for table-related tasks (e.g., table question answering) and use them to fine-tune LLMs in this repo.
- 🚀5.4: All parameter-efficient methods in PEFT (e.g., p-tuning) are merged and can be set directly by hyperparameters.
- 🚀5.4: LLM `MOSS` is merged.
- 4.21: Datasets `GAOKAO`, `camel`, `FLAN-Muffin` and `COIG` are collected and formatted.
- 4.15: Datasets `webGPT`, `dolly`, `baize`, `hh-rlhf` and `OIG (part)` are collected and formatted.
- 4.12: Now you can try Alpaca-CoT on Google Colab.
- 4.11: Added function `multi-turn conversation` by @paulcx.
- 4.9: Datasets `firefly`, `instruct` and `Code Alpaca` are collected and formatted, and can be found here.
- 4.7: Added functions `Parameter merging`, `Local chatting`, `Batch predicting` and `Web service building` by @weberr.
- 4.4: Datasets `GPTeacher`, `Guanaco`, `HC3`, `prosocial-dialog`, `belle-chat&belle-math`, `xP3` and `natural-instructions` are collected and formatted.
- 4.3: The Chinese CoT dataset `CoT_CN_data.json` can be found here.
LLaMA [1] is a great work that demonstrates amazing zero-shot and few-shot abilities. It significantly reduces the cost of training, fine-tuning, and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B. Recently, to boost the instruction-following ability of LLaMA, Stanford Alpaca [2] fine-tuned LLaMA-7B on 52K instruction-following data generated by the Self-Instruct [3] technique. However, the current LLM research community still faces three challenges: 1. even LLaMA-7B still places high demands on computing resources; 2. there are few open-source datasets for instruction fine-tuning; and 3. there is a lack of empirical study on how various types of instructions affect model abilities, such as the ability to respond to Chinese instructions and CoT reasoning ability.
To this end, we propose this project, which leverages various subsequently proposed improvements and has the following advantages:

- This repo contains code, modified from here and here, which can cheaply and efficiently fine-tune LLaMA using low-rank adaptation (LoRA) [4], PEFT and bitsandbytes (without performance degradation compared to Stanford Alpaca). The `7b`, `13b` and `30b` versions of the LLaMA models can easily be trained on a single 80G A100.
- The models released in this repo significantly improve CoT (reasoning) capability.
- The models released in this repo significantly improve the ability to follow Chinese instructions.
- This repo contains a continuously growing collection of instruction-finetuning datasets, which currently includes English, Chinese and CoT instructions. A collection of checkpoints trained with various instruction datasets is also provided.
- This repo integrates multiple LLMs and unifies their interfaces, which can easily be switched through hyperparameters. Currently it includes LLaMA, ChatGLM [5], Bloom [6] and MOSS, and more will be added in the future, so that researchers can easily invoke and compare different LLMs.
- This repo integrates multiple parameter-efficient methods and unifies their interfaces, which can easily be switched through hyperparameters. Currently it includes LoRA, P-tuning [5], adalora and prefix tuning, and more will be added in the future, so that researchers can easily invoke and compare different parameter-efficient methods.
- This repo contains extensive empirical studies and qualitative analysis, which may provide valuable findings and promote future exploration of LLMs.
To the best of our knowledge, this work is the first to study CoT reasoning based on LLaMA and Alpaca. Therefore, we abbreviate our work as Alpaca-CoT.
The relative sizes of the collected datasets are shown in the figure below:
Referring to this (@yaodongC), we label each collected dataset according to the following rules:

(Lang) Language tags:
- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages

(Task) Task tags:
- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks

(Gen) Generation method:
- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Datasets containing both human- and machine-generated data
- COL: [Collection of Datasets] Datasets made from a collection of other datasets
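For illustration only, this labeling scheme can be carried as a simple mapping in code. The dictionary below is a hypothetical helper, not a file shipped in this repo; the example labels match the table that follows:

```python
# Illustrative only: dataset -> (Lang, Task, Gen) labels under the scheme above.
DATASET_TAGS = {
    "Chain-of-Thought": ("EN/CN", "MT", "HG"),
    "GPT4all":          ("EN",    "MT", "COL"),
    "Guanaco":          ("ML",    "MT", "SI"),
    "HC3":              ("EN/CN", "TS", "MIX"),
}

def describe(name: str) -> str:
    lang, task, gen = DATASET_TAGS[name]
    return f"{name}: language={lang}, task={task}, generation={gen}"

print(describe("Guanaco"))  # Guanaco: language=ML, task=MT, generation=SI
```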
Dataset | Num | Lang | Task | Gen | Type | Source | URL |
---|---|---|---|---|---|---|---|
Chain-of-Thought | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data | download |
GPT4all | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | download |
GPTeacher | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download |
Guanaco | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download |
HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download |
Alpaca | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download |
Natural Instructions | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection | download |
belle_cn | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download |
instinwild | 52191 | EN/CN | MT | SI | generation, open-qa, brainstorming | text-davinci-003 | download |
prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + humans feedback manually | download |
finance_en | 68912 | EN | TS | COL | financial related qa | GPT3.5 | download |
xP3 | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 languages & 16 NLP tasks | human annotated datasets collection | download |
firefly | 1649398 | CN | MT | COL | 23 nlp tasks | human annotated datasets collection | download |
instruct | 888969 | EN | MT | COL | augmented GPT4All, Alpaca, open-source Meta datasets | augmentation performed using the advanced NLP tools provided by AllenAI | download |
Code Alpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download |
Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download |
webGPT | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3; each instruction has two outputs, select the better one | download |
dolly 2.0 | 15015 | EN | TS | HG | closed QA, summarization, etc., with Wikipedia as references | human annotated | download |
baize | 653699 | EN | MT | COL | a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions | human annotated datasets collection | download |
hh-rlhf | 284517 | EN | TS | MIX | dialogue | dialog between human and RLHF models | download |
OIG (part) | 49237 | EN | MT | COL | created from various tasks, such as question answering | using data augmentation, human annotated datasets collection | download |
GAOKAO | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and open-ended questions from examinations | human annotated | download |
camel | 760620 | EN | MT | SI | role-play conversations in AI society, code, math, physics, chemistry, biology | GPT-3.5-turbo | download |
FLAN-Muffin | 1764800 | EN | MT | COL | 60 nlp tasks | human annotated datasets collection | download |
COIG (FlagInstruct) | 298428 | CN | MT | COL | collection of exam, translation and human value alignment instructions, plus counterfactual correction multi-round chat | using automatic tools and manual verification | download |
GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | GPT-3.5-turbo | download |
ShareChat | 1663241 | EN | MT | MIX | general instruct | crowdsourcing to collect conversations between people and ChatGPT (ShareGPT) | download |
Auto-CoT | 5816 | EN | MT | COL | arithmetic, commonsense, symbolic and other logical reasoning tasks | human annotated datasets collection | download |
MOSS | 1583595 | EN/CN | TS | SI | general instruct | text-davinci-003 | download |
ultrachat | 28247446 | EN | | | questions about the world, writing and creation, assistance on existing materials | two separate gpt-3.5-turbo | download |
Chinese-medical | 792099 | CN | TS | COL | questions about medical advice | crawl | download |
CSL | 396206 | CN | MT | COL | paper text generation, keyword extraction, text summarization and text classification | crawl | download |
pCLUE | 1200705 | CN | MT | COL | general instruct | | download |
news_commentary | 252776 | CN | TS | COL | translation | | download |
StackLLaMA | todo | EN | | | | | |
You can download all the formatted data here. Then you should put them in the data folder.
You can download all checkpoints trained on various types of instruction data from here. Then, after setting `LoRA_WEIGHTS` (in `generate.py`) to the local path, you can run model inference directly.
All the data in our collection is formatted into the same template, where each sample is as follows:
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
Note that for CoT datasets, we first use the template provided by FLAN to change the original datasets into various Chain-of-Thought forms, and then convert them into the above format. The formatting script can be found here.
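For illustration, here is a minimal sketch of that conversion step. The input field names `question`, `chain_of_thought` and `answer` are assumptions for this example; the actual FLAN templates and the repo's script may differ:

```python
import json

def to_unified_format(cot_sample: dict) -> dict:
    """Convert one CoT-style sample into the instruction/input/output template above.
    The input field names are assumed for illustration."""
    return {
        "instruction": cot_sample["question"],
        "input": "",  # the full question already sits in the instruction here
        "output": f"{cot_sample['chain_of_thought']} So the answer is {cot_sample['answer']}.",
    }

raw = [{"question": "If there are 3 cars and each has 4 wheels, how many wheels are there?",
        "chain_of_thought": "Each car has 4 wheels and there are 3 cars, so 3 * 4 = 12.",
        "answer": "12"}]
formatted = [to_unified_format(s) for s in raw]
with open("CoT_data_formatted.json", "w", encoding="utf-8") as f:  # placeholder file name
    json.dump(formatted, f, ensure_ascii=False, indent=2)
```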
pip install -r requirements.txt
Note: make sure python>=3.9 when fine-tuning ChatGLM.
PEFT
- If you want to use methods other than LoRA, please install the edited version in our project: `pip install -e ./peft`
To enable researchers to conduct systematic IFT research on LLMs, we collected different types of instruction data, integrated multiple LLMs, and unified their interfaces, so that the desired combination can be easily customized:
- `--model_type`: Set the LLM you want to use. Currently [llama, chatglm, bloom, moss] are supported. The latter two have strong Chinese capabilities, and more LLMs will be integrated in the future.
- `--peft_type`: Set the PEFT method you want to use. Currently [lora, adalora, prefix tuning, p-tuning, prompt] are supported.
- `--data`: Set the data type used for IFT to flexibly tailor the desired instruction-following ability. For example, for strong reasoning ability set "alpaca-cot", for strong Chinese ability set "belle1.5m", for coding and story-generation ability set "gpt4all", and for financial-related response ability set "finance".
- `--model_name_or_path`: Set this to load different versions of the target LLM weights corresponding to `--model_type`. For example, to load the 13b version of the llama weights, you can set decapoda-research/llama-13b-hf.
Single GPU
- For LLaMA
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
--data alpaca-belle-cot --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
Note: for multiple datasets, you can pass them to `--data` like `--data ./data/alpaca.json ./data/finance.json <path2yourdata_1>`
- For ChatGLM
python3 uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
--learning_rate 2e-5 --epochs 1
Note that `load_in_8bit` is not yet suitable for ChatGLM, so the batch_size must be smaller than for the other models.
- For Bloom
python3 uniform_finetune.py --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
- For MOSS
python3 uniform_finetune.py --model_type moss --model_name_or_path fnlp/moss-moon-003-sft \
--data alpaca --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 1 \
--learning_rate 3e-4 --epochs 3
- For InternLM
python3 uniform_finetune.py --model_type internlm --model_name_or_path internlm/internlm-7b \
--data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
--lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
--compute_dtype="fp32"
Note that you can also pass a local path (where the LLM weights are saved) to `--model_name_or_path`, and the data type `--data` can be freely set according to your interests.
Multiple GPUs
torchrun --nnodes 1 --nproc_per_node $ngpu uniform_finetune.py $args --data $data
- For LLaMA
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
--model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
--data alpaca-belle-cot --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
- For ChatGLM
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
--learning_rate 2e-5 --epochs 1
Note that `load_in_8bit` is not yet suitable for ChatGLM, so the batch_size must be smaller than for the other models.
- For Bloom
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
uniform_finetune.py --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
- For InternLM
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
uniform_finetune.py --model_type internlm --model_name_or_path internlm/internlm-7b \
--data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
--lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
--compute_dtype="fp32"
python3 generate.py --data alpaca-belle-cot --model_type llama
python3 generate.py --data alpaca-belle-cot --model_type chatglm
python3 generate.py --data alpaca-belle-cot --model_type bloom
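For reference, loading a base model together with saved LoRA weights for inference roughly amounts to the following minimal sketch using the standard PEFT API. The checkpoint path is a placeholder, and generate.py itself may differ in details:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "decapoda-research/llama-7b-hf"    # base weights, downloaded from Hugging Face
lora_weights = "./saved_models/llama-7b"  # placeholder: your LoRA checkpoint path

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, lora_weights)  # attach the LoRA adapters
model.eval()

inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```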
More details of instruction finetuning and inference can be found here, from which our code was modified. Note that the folder `saved-xxx7b` is the save path of the LoRA weights, while the LLaMA weights are automatically downloaded from Hugging Face.
top_p=0.9, #Moderately increase the probability threshold of nucleus sampling to increase the quantity of candidate tokens and increase generation diversity.
temperature=1.0, #The previous low temperature parameter could lead to a severe polarization in the probability distribution of generated words, which degenerates the generation strategy into greedy decoding.
do_sample=True, #do_sample is False by default. Setting it to True switches the decoding strategy from greedy decoding to (beam-search) multinomial sampling.
no_repeat_ngram_size=6, #Configure the probability of the next repeating n-gram to 0, to ensure that there are no n-grams appearing twice. This setting is an empirical preliminary exploration.
repetition_penalty=1.8, #For words that have appeared before, in the subsequent prediction process, we reduce the probability of their reoccurrence by introducing the repetition_penalty parameter. This setting is an empirical preliminary exploration.
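Taken together, these settings would be passed to Hugging Face's `generate` as in the sketch below. This is not the repo's exact code; the model name, prompt, and `max_new_tokens` cap are assumptions for the demo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf", device_map="auto")
inputs = tokenizer("Tell me about alpacas.", return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    top_p=0.9,               # wider nucleus -> more candidate tokens, more diversity
    temperature=1.0,         # avoid the polarized distribution a low temperature causes
    do_sample=True,          # switch from greedy decoding to multinomial sampling
    no_repeat_ngram_size=6,  # forbid any 6-gram from appearing twice
    repetition_penalty=1.8,  # down-weight tokens that have already appeared
    max_new_tokens=256,      # assumption: an explicit length cap for the demo
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```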
python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy
python3 server.py --model_type chatglm --size 6b --lora_dir xxx
python3 web.py --model_type chatglm --size 6b --lora_dir xxx
Note: The following experimental results are all from the paper An Empirical Study of Instruction-tuning Large Language Models in Chinese.
The paper selects two evaluation benchmarks, Belle-eval and MMCU, to comprehensively evaluate the capabilities of Chinese LLMs.
Belle-eval is constructed by self-instruct with ChatGPT and has 1,000 diverse instructions that involve 10 categories covering common NLP tasks (e.g., QA) and challenging tasks (e.g., code and math). We use ChatGPT to rate the model responses based on the golden answers. This benchmark is considered to be an assessment of AGI (instruction-following) capability.
MMCU is a collection of Chinese multiple-choice questions in four professional disciplines: medicine, law, psychology and education (e.g., the Gaokao examination). It allows LLMs to take human-society exams in a multiple-choice manner, making it suitable for evaluating the breadth and depth of LLMs' knowledge across multiple disciplines.
Data statistics of Belle-eval and MMCU are shown in the table above.
We conduct experiments to study the three main factors in instruction-tuning LLMs: LLM bases, Parameter-efficient Methods, Chinese Instruction Datasets.
For open LLMs, we test existing LLMs and LLMs fine-tuned with LoRA on Alpaca-GPT4 on Belle-eval and MMCU, respectively.
Table 2 shows the scores of open LLMs on Belle-eval. Table 3 shows the accuracy of LLMs on MMCU. They fine-tune all the open LLMs with the same parameter-efficient method LoRA and the same instruction dataset Alpaca-GPT4.
Experimental Results:
-
Evaluation of Existing LLMs
Performance on Belle-eval
(1) For base LLMs, Bloom performs the best.
(2) For sft LLMs, ChatGLM outperforms others by large margins, thanks to the fact that it is trained with the most Chinese tokens and HFRL.
(3) The Open QA, Math, CloseQA and Extract categories are still very challenging for existing open LLMs.
(4) Vicuna and moss-sft have clear improvements compared to their bases, LLaMA and moss-base, respectively.
(5) In contrast, the performance of sft models, Bloomz and Bloomz-mt, is reduced compared to the base model Bloom, because they tend to generate a shorter response.
Performance on MMCU
(1) All base LLMs perform poorly, because before fine-tuning it is difficult for them to generate content in the specified format, e.g., outputting option numbers.
(2) All sft LLMs outperform their corresponding base LLMs. In particular, Bloomz performs the best (even beating ChatGLM) because it can generate option numbers directly as required without producing other irrelevant content, which is also due to the data characteristics of its supervised fine-tuning dataset xP3.
(3) Among the four disciplines, law is the most challenging for LLMs.
The performance results of LLMs after instruction-tuning on Alpaca-GPT4-zh are shown in Figure 1.
-
Instruction-tuning Different LLMs
(1) On Belle-eval, the performance improvement of sft LLMs brought by instruction-tuning is not as significant as that of base LLMs, except for sft Bloomz and Bloomz-mt.
(2) Vicuna and ChatGLM encounter performance drops after instruction-tuning, because Vicuna is trained from real human-ChatGPT conversations, with better quality than Alpaca-GPT4. ChatGLM adopts HFRL, which may be no longer suitable for further instruction-tuning.
(3) On MMCU, most LLMs achieve performance boosts after instruction-tuning, with the exception of Bloomz and Bloomz-mt, whose performance unexpectedly drops significantly.
(4) After instruction-tuning, Bloom has significant improvements and performs well on both benchmarks. Although ChatGLM beats Bloom consistently, it suffers performance drop during instruction-tuning. Therefore, among all open LLMs, Bloom is most suitable as a foundation model in the subsequent experiments for Chinese instruction-tuning exploration.
For parameter-efficient methods other than LoRA, the paper collects a range of parameter-efficient methods to instruction-tune Bloom on the Alpaca-GPT4 dataset.
Experimental Results:
-
Comparison of Parameter-efficient Methods
(1) SadapterH performs the best among all parameter-efficient methods, which can be used as an alternative to LoRA.
(2) P-tuning and prompt-tuning underperform others by large margins, indicating that only adding trainable layers in the embedding layer is not enough to support LLMs for generation tasks.
(3) Although AdaLoRA is an improvement over LoRA, its performance drops clearly, possibly because LoRA's trainable parameters for LLMs are not suitable for further reduction.
(4) Comparing the upper and lower parts, it can be seen that increasing the number of trainable parameters for sequential adapters (i.e., SadapterP and SadapterH) does not bring gains, while the opposite phenomenon is observed for parallel adapters (i.e., P-adapter).
-
Training Loss
(1) Prompt-tuning and P-tuning converge the slowest and have the highest losses after convergence. This shows that embedding-only adapters are not suitable for instruction-tuning LLMs.
(2) The initial loss of AdaLoRA is very high because it requires simultaneous learning of parameter budget allocation, which makes the model unable to fit the training data well.
(3) The other methods can quickly converge on training data and fit it well.
For the impact of various types of Chinese instruction datasets, authors gather popular open Chinese instructions (as shown in Table 5) to fine-tune Bloom with LoRA.
Table 6 and Table 7 show the results of fine-tuning Bloom on different instruction datasets.
Experimental Results:
-
Performance on Belle-eval
(1) The instruction data constructed by ChatGPT (e.g., using self-instruct methods or collecting real human-ChatGPT conversations) consistently enhances the instruction-following ability, with 3.1 ∼ 11-point score increases.
(2) Among these datasets, Belle has the best performance due to the largest amount of instruction data. However, the performance of models trained on moss-sft-data, containing more data built in a similar way, is unsatisfactory.
(3) The performance brought by the Alpaca-GPT4 instructions is the second best, with only 49K samples being comparable to the 1.54M-sample Belle.
(4) Instinwild brings the least performance gains among them because the seed instructions it crawls from Tweet ("in wild") are not as comprehensive as those (like Alpaca) carefully designed by humans.
(5) These ChatGPT-based data mainly have a significant improvement effect on open generation tasks such as Brain Storm and Generation, while there is a significant decrease in tasks that require high reading comprehension skills, such as Close QA and Extract.
(6) These instruction datasets damage the model's instruction-following ability, because the form and intent of each NLP or examination dataset are unitary, which makes them easy to overfit.
(7) Among them, COIG-trans performs the best because it involves over 2000 different tasks with a wide variety of task instructions. In contrast, xP3 and COIG-ccmc have the worst negative impact on model performance. Both of them only cover a few types of tasks (translation and QA for the former, counterfactual correction conversations for the latter), which hardly cover the popular instructions and tasks for humans.
-
Performance on MMCU
(1) Instruction-tuning on each dataset can always result in performance improvement.
(2) Among the ChatGPT-based data shown in the upper part, ShareGPT-zh underperforms others by large margins. This may be due to the fact that real users rarely ask multiple choice questions about academic topics.
(3) Among the dataset-collection data shown in the lower part, HC3 and COIG-ccmc result in the lowest accuracy, because HC3 has only 13K unique questions and the task format of COIG-ccmc differs significantly from MMCU.
(4) COIG-exam brings the greatest accuracy improvement, benefiting from the similar task format as MMCU.
Four Other Factors: CoT, Expansion of Chinese Vocabulary, Language of Prompts and Human-value Alignment
For CoT, authors compare the performance before and after adding CoT data during instruction-tuning.
Experiment Settings:
The authors collect 9 CoT datasets and their prompts from FLAN, and then translate them into Chinese using Google Translate. They compare the performance before and after adding CoT data during instruction-tuning.
The way of simply adding CoT data is denoted as "Alpaca-GPT4+CoT". In addition, adding the sentence "先思考，再决定" ("think first, then decide" in Chinese) at the end of each instruction induces the model to respond to instructions based on CoT; this variant is labeled "Alpaca-GPT4+CoT*".
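A minimal sketch of building the "Alpaca-GPT4+CoT*" variant from the formatted data (the file names are placeholders, not the paper's actual paths):

```python
import json

COT_TRIGGER = "先思考，再决定"  # "think first, then decide"

with open("alpaca_gpt4_plus_cot.json", encoding="utf-8") as f:  # placeholder input
    samples = json.load(f)

for sample in samples:
    # Append the trigger sentence to every instruction to induce CoT-style answers.
    sample["instruction"] = sample["instruction"].rstrip() + COT_TRIGGER

with open("alpaca_gpt4_plus_cot_star.json", "w", encoding="utf-8") as f:  # placeholder output
    json.dump(samples, f, ensure_ascii=False, indent=2)
```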
Experimental Results:
-
"Alpaca-GPT4+CoT" outperforms "Alpaca-GPT4" in Code and Math tasks that require strong reasoning ability. Besides, there is also a significant improvement in the MMCU Education task.
-
As shown in the line "Alpaca-GPT4+CoT*", the simple sentence can further improve performance on the reasoning tasks Code and Education, while Math performance is slightly inferior to "Alpaca-GPT4+CoT". This may require further exploration of more robust prompts.
For expansion of Chinese vocabulary, authors test the influence of the number of Chinese tokens in the tokenizer’s vocabulary on LLMs’ ability to express Chinese. For example, if a Chinese character is in the vocabulary, it can be represented by a single token, otherwise it may require multiple tokens to represent it.
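This effect can be observed directly with the tokenizers of the two backbones. A sketch only: the model names follow this README, exact token counts depend on tokenizer versions, and the sentence is arbitrary:

```python
from transformers import AutoTokenizer

SENTENCE = "今天天气很好"  # an arbitrary Chinese sentence
for name in ["decapoda-research/llama-7b-hf", "bigscience/bloomz-7b1-mt"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    ids = tokenizer(SENTENCE, add_special_tokens=False)["input_ids"]
    # LLaMA's 32K vocabulary typically splits one character into several tokens,
    # while Bloom's 250K vocabulary usually covers it with a single token.
    print(f"{name}: {len(ids)} tokens")
```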
Experiment Settings: The authors mainly conduct experiments on LLaMA, whose SentencePiece vocabulary (32K) covers fewer Chinese characters than Bloom's (250K).
Experimental Results:
-
Pre-training on a larger Chinese corpus with the expanded Chinese vocabulary is consistently helpful for instruction-following ability.
-
And counterintuitively, "llama-voc-pre-l" (100B) is inferior to "llama-voc-pre" (20B) on MMCU, which shows that pre-training on more data may not necessarily lead to higher performance for academic exams.
For the language of prompts, the authors test whether Chinese prompts are suitable for instruction fine-tuning.
Figure 4 shows the results of using Chinese and English prompts based on LLaMA and Bloom. When instruction-tuning LLaMA, using Chinese prompts can improve the performance on both benchmarks compared to English prompts, while the opposite phenomenon can be observed on Bloom.
Experimental Results:
-
For models with weaker Chinese abilities(e.g., LLaMA), using Chinese prompts can effectively help respond in Chinese.
-
For models with good Chinese abilities (e.g., Bloom), using prompts in English (the language they are better at) can better guide the model to understand the process of fine-tuning with instructions.
To avoid LLMs generating toxic content, aligning them with human values is a crucial issue. We add human-value alignment data built by COIG into instruction-tuning to explore its impact.
Figure 5 compares the results of instruction-tuning with and without human-value alignment.
Experimental Results: The human-value alignment results in a slight performance drop. How to balance the harmlessness and performance of LLMs is a research direction worth exploring in the future.
Note: The figure below shows the statistics of the datasets collected as of March 26, and is displayed only as the motivation for data collection. More datasets have since been collected, such as financial-related instruction datasets.
The current collection of instruction-finetuning datasets consists mainly of three parts:
- `alpaca_data_cleaned.json`: about 52K English instruction-following training samples.
- `CoT_data.json`: 9 CoT datasets involving about 75K samples (published by FLAN [7]).
- `belle_data_cn.json`: about 0.5M Chinese instruction-following training samples (published by BELLE [8]).
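Since all three files share the template above, concatenating them into one finetuning set is straightforward. A minimal sketch (the output file name is a placeholder):

```python
import json

parts = ["alpaca_data_cleaned.json", "CoT_data.json", "belle_data_cn.json"]
merged = []
for path in parts:
    with open(path, encoding="utf-8") as f:
        merged.extend(json.load(f))  # every file shares the instruction/input/output template

with open("alpaca_cot_full.json", "w", encoding="utf-8") as f:  # placeholder output name
    json.dump(merged, f, ensure_ascii=False, indent=2)
print(f"{len(merged)} training samples")
```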
"w/o CoT" and "w/o CN" denote models that exclude CoT data and Chinese instructions from their instruction finetuning data, respectively.
The above table shows two examples (involving numerical calculations) that require a certain amount of reasoning ability to respond correctly.
As shown in the middle column, Ours w/o CoT
fails to generate the correct response, which shows that once the finetuning data does not contain CoT data, the model's reasoning ability decreases significantly. This further demonstrates that CoT data is essential for LLMs.
The above table shows two examples that require the ability to respond to Chinese instructions.
As shown in the right column, either the generated content of Ours w/o CN
is unreasonable, or the Chinese instructions are answered in English by Ours w/o CN
. This shows that removing Chinese data during finetuning will cause the model to be unable to handle Chinese instructions, and further demonstrates the need to collect Chinese instruction finetuning data.
The above table shows a relatively difficult example, which requires both a certain accumulation of knowledge of Chinese history and a logical and complete ability to state historical events. As shown in this table, Ours w/o CN
can only generate a short and erroneous response because, due to the lack of Chinese finetuning data, the corresponding knowledge of Chinese history is naturally lacking. Although Ours w/o CoT
lists some relevant Chinese historical events, its logic of expression is self-contradictory, which is caused by the lack of CoT data.
In summary, the models finetuned from our complete dataset (English, Chinese, and CoT instruction data) can significantly improve model reasoning and Chinese instruction following abilities.
Samples in each odd-numbered row do not apply the CoT prompt, such as "step-by-step reasoning." Both `Ours(w/CoT)` and Alpaca are based on LLaMA-7B, and the only difference between the two is that the instruction-finetuning data of `Ours(w/CoT)` includes extra CoT data compared to that of Alpaca.
From the above table, we find that:
- `Ours(w/CoT)` always generates the correct rationale before the answer, while Alpaca fails to generate any reasonable rationale, as shown in the first 4 examples (commonsense questions). This shows that using CoT data for finetuning can significantly improve reasoning ability.
- For `Ours(w/CoT)`, the CoT prompt (e.g., concatenating 'step-by-step' with the input question) has little effect on easy examples (e.g., commonsense questions) and an important effect on challenging questions (e.g., questions requiring reasoning, like the last four examples).
- For Alpaca, the CoT prompt always has little or even negative impact. For the last two examples, after adding the CoT prompt, Alpaca changes the correct generated answer to a wrong one. This may be due to the inconsistency between the input forms of finetuning and inference.
Quantitative comparison of responses to Chinese instructions.
Our model is finetuned from a 7B LLaMA on 52K English instructions and 0.5M Chinese instructions. Stanford Alpaca (our reimplementation) is finetuned from a 7B LLaMA on 52K English instructions. BELLE is finetuned from a 7B BLOOM on 2B Chinese instructions.
From the above table, several observations can be found:
- Compared to Alpaca, `ours (w/ CN)` has a stronger ability to understand Chinese instructions. For the first example, Alpaca fails to distinguish between the `instruction` part and the `input` part, while ours does.
- Chinese instruction finetuning data can significantly enhance the ability to interact in Chinese. For the second example, `ours (w/ CN)` not only provides the correct code, but also provides the corresponding Chinese annotation, while Alpaca does not. In addition, as shown in examples 3-5, Alpaca can only respond to Chinese instructions with an English response.
- Compared to BELLE, the performance of `ours (w/ CN)` on instructions requiring an open response (as shown in the last two examples) still needs to be improved. BELLE's outstanding performance on such instructions is due to: 1. its BLOOM backbone model encountering much more multilingual data during pre-training; 2. its Chinese instruction finetuning data being larger than ours, i.e., 2M vs 0.5M.
Quantitative comparison of responses to English instructions. The purpose of this subsection is to explore whether finetuning on Chinese instructions has a negative impact on Alpaca.
From the above table, we find that:
- Finetuning with Chinese instruction data does not weaken the original English instruction-following ability; on the contrary, there is also a certain enhancement in generating better responses to English instructions. The responses of `ours (w/ CN)` show more detail than those of Alpaca; e.g., for the third example, `ours (w/ CN)` lists three more provinces than Alpaca.
Please cite this repo if you use the data collection, code, or experimental findings in this repo.
@misc{si2023empirical,
title={An Empirical Study of Instruction-tuning Large Language Models in Chinese},
author={Qingyi Si and Tong Wang and Zheng Lin and Xu Zhang and Yanan Cao and Weiping Wang},
year={2023},
eprint={2310.07328},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
For data and models, please cite the sources of the original data, the parameter-efficient methods, and the LLMs.
We would especially like to thank the APUS AilMe Lab for sponsoring 8 A100 GPUs for the experiments.
(back to top)