
中文 | English

Alpaca-CoT

Alpaca-CoT: An Instruction Fine-Tuning Platform with a Unified Interface for Instruction Collection, Parameter-efficient Methods, and Large Language Models

License · PyTorch · Data · Model · Wandb · Colab

This is the repository for the Alpaca-CoT project, which aims to build an instruction fine-tuning (IFT) platform with an extensive collection of instruction datasets (especially CoT datasets) and a unified interface for various large language models and parameter-efficient methods. We are continuously expanding our instruction-tuning data collection and integrating more LLMs and more parameter-efficient methods. In addition, we have created a new branch, tabular_llm, to build tabular LLMs for solving table-intelligence tasks.

You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that we have not yet collected. We will format them uniformly, train the Alpaca model (and other LLMs in the future) with these datasets, open-source the model checkpoints, and conduct extensive empirical studies. We hope our project can make a modest contribution to the open-source progress of large language models and lower the entry barrier for NLP researchers.

You can also join our group chat (WeChat) to communicate with more people who share the same interests. The group currently has too many members to join directly via the group QR code; you need to contact us first to be added.

News

  • ⚠ If you want to use methods other than LoRA, please install the edited version in our project: pip install -e ./peft

  • 🚀 12.8: The LLM InternLM is merged.

  • 🚀 8.16: 4-bit quantization is available for lora, qlora and adalora.

  • 🚀 8.16: The parameter-efficient methods Qlora, Sequential adapter and Parallel adapter are merged.

  • 🚀 7.24: The LLM ChatGLM v2 is merged.

  • 🚀 7.20: The LLM Baichuan is merged.

  • 6.25: Added model evaluation code, including Belle and MMCU.

- More:

  • 5.20: Fixed a bug in model saving and added wandb support.
  • 5.15: Added more datasets, such as GPT4Tools, Auto-CoT and pCLUE.
  • 🚀 5.5: Created a new branch tabular_llm to build tabular LLMs. We collect instruction fine-tuning data for table-related tasks (e.g., table question answering) and use it to fine-tune LLMs in this repo.
  • 🚀 5.4: All parameter-efficient methods in PEFT (e.g., p-tuning) are merged and can be set directly via hyperparameters.
  • 🚀 5.4: The LLM MOSS is merged.
  • 4.21: Datasets GAOKAO, camel, FLAN-Muffin and COIG are collected and formatted.
  • 4.15: Datasets webGPT, dolly, baize, hh-rlhf and OIG (part) are collected and formatted.
  • 4.12: You can now try Alpaca-CoT on Google Colab.
  • 4.11: The multi-turn conversation function was added by @paulcx.
  • 4.9: Datasets firefly, instruct and Code Alpaca are collected and formatted, and can be found here.
  • 4.7: Functions Parameter merging, Local chatting, Batch predicting and Web service building were added by @weberr.
  • 4.4: Datasets GPTeacher, Guanaco, HC3, prosocial-dialog, belle-chat&belle-math, xP3 and natural-instructions are collected and formatted.
  • 4.3: The Chinese CoT dataset CoT_CN_data.json can be found here.

Overview


LLaMA [1] is a great work that demonstrates amazing zero-shot and few-shot abilities. It significantly reduces the cost of training, fine-tuning and using competitive large language models: LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is competitive with PaLM-540B. Recently, to improve LLaMA's instruction-following ability, Stanford Alpaca [2] fine-tuned LLaMA-7B on 52K instruction-following data generated by the Self-Instruct [3] technique. However, the LLM research community still faces three challenges: 1. even LLaMA-7B still has high demands on computational resources; 2. there are few open-source datasets for instruction fine-tuning; and 3. there is a lack of empirical study on how various types of instructions affect model abilities such as responding to Chinese instructions and CoT reasoning.

To this end, we propose this project, which leverages various subsequently proposed improvements and has the following advantages:

    1. This repo contains code modified from here and here, which can cheaply and efficiently fine-tune LLaMA using low-rank adaptation (LoRA), PEFT and bitsandbytes (without performance degradation compared with Stanford Alpaca). The 7b, 13b and 30b versions of the LLaMA models can be easily trained on a single 80G A100.
    2. The models released in this repo significantly improve CoT (reasoning) ability.
    3. The models released in this repo significantly improve the ability to follow Chinese instructions.
    4. This repo contains a continuously growing collection of instruction fine-tuning datasets, currently including English, Chinese and CoT instructions, as well as a collection of checkpoints trained with various instruction datasets.
    5. This repo integrates multiple LLMs and unifies their interfaces, so that they can be switched easily via a hyperparameter. Currently it includes LLaMA, ChatGLM [5], Bloom [6] and MOSS, and more will be added, so that researchers can easily invoke and compare different LLMs.
    6. This repo integrates multiple parameter-efficient methods and unifies their interfaces, so that they can be switched easily via a hyperparameter. Currently it includes LoRA, P-tuning [5], AdaLoRA and prefix tuning, and more will be added, so that researchers can easily invoke and compare different parameter-efficient methods.
    7. This repo contains extensive empirical studies and qualitative analyses, which may provide valuable findings and promote future exploration of LLMs.

To the best of our knowledge, this work is the first to study CoT reasoning based on LLaMA and Alpaca. Hence, we abbreviate our work as Alpaca-CoT.

Data Collection

The relative sizes of the collected datasets are shown in the figure below:

(figure: relative sizes of the collected datasets)

Referring to this (@yaodongC), we label each collected dataset according to the following rules:

(Lang) Language tags:

  • EN: instruction datasets in English
  • CN: instruction datasets in Chinese
  • ML: [Multi-lingual] instruction datasets in multiple languages

(Task) Task tags:

  • MT: [Multi-task] datasets containing multiple tasks
  • TS: [Task-specific] datasets tailored for specific tasks

(Gen) Generation method:

  • HG: [Human Generated Dataset] datasets created by humans
  • SI: [Self-Instruct] datasets generated with self-instruct methods
  • MIX: [Mixed Dataset] datasets containing both human- and machine-generated data
  • COL: [Collection of Datasets] datasets assembled from other datasets

Statistics

| Dataset | Nums | Lang | Task | Gen | Type | Src | URL |
|---|---|---|---|---|---|---|---|
| Chain-of-Thought | 74771 | EN/CN | MT | HG | instruct with CoT reasoning | annotating CoT on existing data | download |
| GPT4all | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo | download |
| GPTeacher | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer | download |
| Guanaco | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 | download |
| HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT | download |
| alpaca | 52002 | EN | MT | SI | general instruct | text-davinci-003 | download |
| Natural Instructions | 5040134 | ML | MT | COL | diverse NLP tasks | human-annotated datasets collection | download |
| belle_cn | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | text-davinci-003 | download |
| instinwild | 52191 | EN/CN | MT | SI | generation, open-QA, brainstorming | text-davinci-003 | download |
| prosocial-dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions + human feedback, manually | download |
| finance_en | 68912 | EN | TS | COL | finance-related QA | GPT-3.5 | download |
| xP3 | 78883588 | ML | MT | COL | a collection of prompts & datasets across 46 languages & 16 NLP tasks | human-annotated datasets collection | download |
| firefly | 1649398 | CN | MT | COL | 23 NLP tasks | human-annotated datasets collection | download |
| instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca, and open-source Meta datasets | augmented with the advanced NLP tools provided by AllenAI | download |
| Code Alpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download |
| Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instruct | generated by GPT-4 using Alpaca | download |
| webGPT | 18994 | EN | TS | MIX | information retrieval (IR) QA | fine-tuned GPT-3; each instruction has two outputs, the better one is selected | download |
| dolly 2.0 | 15015 | EN | TS | HG | closed QA, summarization, etc., with Wikipedia as references | human-annotated | download |
| baize | 653699 | EN | MT | COL | a collection of Alpaca, Quora, StackOverFlow and MedQuAD questions | human-annotated datasets collection | download |
| hh-rlhf | 284517 | EN | TS | MIX | dialogue | dialogues between humans and RLHF models | download |
| OIG (part) | 49237 | EN | MT | COL | created from various tasks, such as question answering | using data augmentation and human-annotated dataset collection | download |
| GAOKAO | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and open-ended questions from examinations | human-annotated | download |
| camel | 760620 | EN | MT | SI | role-playing conversations in AI society, code, math, physics, chemistry, biology | GPT-3.5-turbo | download |
| FLAN-Muffin | 1764800 | EN | MT | COL | 60 NLP tasks | human-annotated datasets collection | download |
| COIG (FlagInstruct) | 298428 | CN | MT | COL | exam, translation, human-value-alignment instructions and counterfactual-correction multi-round chat | using automatic tools and manual verification | download |
| GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | GPT-3.5-turbo | download |
| ShareChat | 1663241 | EN | MT | MIX | general instruct | crowdsourced conversations between people and ChatGPT (ShareGPT) | download |
| Auto-CoT | 5816 | EN | MT | COL | arithmetic, commonsense, symbolic and other logical reasoning tasks | human-annotated datasets collection | download |
| MOSS | 1583595 | EN/CN | TS | SI | general instruct | text-davinci-003 | download |
| ultrachat | 28247446 | EN | | | questions about the world, writing and creation, assistance on existing materials | two separate gpt-3.5-turbo | download |
| Chinese-medical | 792099 | CN | TS | COL | questions about medical advice | crawled | download |
| CSL | 396206 | CN | MT | COL | paper text generation, keyword extraction, text summarization and text classification | crawled | download |
| pCLUE | 1200705 | CN | MT | COL | general instruct | | download |
| news_commentary | 252776 | CN | TS | COL | translation | | download |
| StackLLaMA | todo | EN | | | | | |

Download

You can download all the formatted data here. You should then put it in the data folder.

You can download all the checkpoints trained on the various types of instruction data from here. Then, after setting LoRA_WEIGHTS (in generate.py) to the local path, you can directly run model inference.

Data Formatting

All the data we collect is formatted into the same template, where each sample is as follows:

[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]

Note that for CoT datasets, we first use the templates provided by FLAN to convert the original datasets into various Chain-of-Thought forms, and then transform them into the above format. The formatting script can be found here.
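For instance, a single CoT-style sample after formatting might look like the following (the content is illustrative, not taken from the released data):

[
{"instruction": "Nancy has 3 apples and buys 2 more. How many apples does she have now? Let's think step by step.",
"input": "",
"output": "Nancy starts with 3 apples. Buying 2 more gives 3 + 2 = 5. So she has 5 apples."}
]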

Multi-interface Unified Platform

Setup

pip install -r requirements.txt

Note: please make sure python>=3.9 when fine-tuning ChatGLM.

PEFT

  • If you want to use methods other than LoRA, please install the edited version in our project:
pip install -e ./peft

Instruction Fine-tuning

To enable researchers to conduct systematic IFT research on LLMs, we collect different types of instruction data, integrate multiple LLMs, and unify the interfaces, so that the desired collocation can be easily customized (a combined example follows the list below):

  • --model_type: set the LLM you want to use. Currently [llama, chatglm, bloom, moss] are supported. The latter two have strong Chinese abilities, and more LLMs will be integrated in the future.
  • --peft_type: set the PEFT method you want to use. Currently [lora, adalora, prefix tuning, p-tuning, prompt] are supported.
  • --data: set the data type used for IFT, to flexibly tailor the desired instruction-following ability. For example, for strong reasoning ability, set "alpaca-cot"; for strong Chinese ability, set "belle1.5m"; for coding and story-generation ability, set "gpt4all"; for finance-related response ability, set "finance".
  • --model_name_or_path: set to load different versions of the model weights for the target LLM --model_type. For example, to load the 13b version of the llama weights, you can set decapoda-research/llama-13b-hf.
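For example, a combined run that fine-tunes ChatGLM with AdaLoRA on Belle data might look like the following (a sketch only; paths and hyperparameters should be adapted to your setup):

python3 uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --peft_type adalora --data belle1.5m --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 2 --learning_rate 2e-5 --epochs 1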

Single GPU

  • For LLaMA
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1

Note: For multiple datasets, you can use --data like --data ./data/alpaca.json ./data/finance.json <path2yourdata_1>

  • For ChatGLM
python3 uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so the batch_size must be smaller than for the others.

  • For Bloom
python3 uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • For MOSS
python3 uniform_finetune.py   --model_type moss --model_name_or_path fnlp/moss-moon-003-sft  \
    --data alpaca --lora_target_modules q_proj v_proj --per_gpu_train_batch_size 1 \
    --learning_rate 3e-4 --epochs 3
  • For InternLM
python3 uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Note that you can also pass a local path (where the LLM weights are saved) to --model_name_or_path, and the data type --data can be freely set according to your interests.
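For instance, a run with locally saved weights and a custom data file might look like this (a sketch; /path/to/llama-7b-hf is a placeholder for your local directory):

python3 uniform_finetune.py --model_type llama --model_name_or_path /path/to/llama-7b-hf \
    --data ./data/alpaca.json --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1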

Multiple GPUs

torchrun --nnodes 1 --nproc_per_node $ngpu uniform_finetune.py $args --data $data 
  • For LLaMA
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
    --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • For ChatGLM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so the batch_size must be smaller than for the others.

  • For Bloom
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
  • For InternLM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type internlm --model_name_or_path internlm/internlm-7b \
    --data alpaca --lora_target_modules q_proj v_proj --lora_r 32 --lora_alpha 32 \
    --lora_dropout 0.1 --per_gpu_train_batch_size 1 --learning_rate 2e-5 --epochs 1 \
    --compute_dtype="fp32"

Inference

python3 generate.py  --data alpaca-belle-cot --model_type llama

python3 generate.py --data alpaca-belle-cot --model_type chatglm

python3 generate.py --data alpaca-belle-cot --model_type bloom


More details of instruction fine-tuning and inference can be found here (which we modified from). Note that the folders saved-xxx7b are the save paths for the LoRA weights, and the LLaMA weights are automatically downloaded from Hugging Face.

Explanation of Inference Hyperparameters

top_p=0.9,              # Moderately increase the probability threshold of nucleus sampling to enlarge the set of candidate tokens and increase generation diversity.

temperature=1.0,        # A low temperature would severely polarize the probability distribution of generated words, degenerating the generation strategy into greedy decoding.

do_sample=True,         # do_sample is set to False by default; setting it to True switches generation to a multinomial sampling decoding strategy.

no_repeat_ngram_size=6, # Set the probability of any repeating 6-gram to 0, ensuring that no 6-gram appears twice. This setting is an empirical preliminary exploration.

repetition_penalty=1.8, # Reduce the probability of regenerating words that have already appeared, via the repetition_penalty parameter. This setting is an empirical preliminary exploration.
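Taken together, these settings map directly onto the Hugging Face transformers generate() API. Below is a minimal sketch (not the repo's exact generate.py; the model id and prompt are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")  # placeholder model id
model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

inputs = tokenizer("What is 27 + 15? Let's think step by step.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,            # enable sampling instead of greedy decoding
    top_p=0.9,                 # nucleus sampling threshold
    temperature=1.0,           # keep the distribution unsharpened
    no_repeat_ngram_size=6,    # forbid any 6-gram from appearing twice
    repetition_penalty=1.8,    # penalize tokens that have already appeared
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))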


Parameter Merging

python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy
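After merging, the directory given as --merged_dir should be loadable as a standalone model. A minimal sketch, assuming a standard Hugging Face layout ("yyy" is the placeholder from the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yyy")  # the --merged_dir path
model = AutoModelForCausalLM.from_pretrained("yyy")

This is convenient when you want to serve the fine-tuned model without keeping the adapter weights separate.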

Local Chatting

python3 server.py --model_type chatglm --size 6b --lora_dir xxx

Web Service Building

python3 web.py --model_type chatglm --size 6b --lora_dir xxx

Empirical Study of Instruction-Tuning Open LLMs in Chinese (as of June 25)

Note: The following experimental results are all from the paper An Empirical Study of Instruction-tuning Large Language Models in Chinese.

1. Benchmarks

The paper selects two evaluation benchmarks, Belle-eval and MMCU, to comprehensively evaluate the Chinese abilities of LLMs.

Belle-eval is constructed by self-instruct with ChatGPT. It has 1,000 diverse instructions covering 10 categories, from common NLP tasks (e.g., QA) to challenging tasks (e.g., code and math). We use ChatGPT to rate the model responses against the golden answers. This benchmark is regarded as an assessment of AGI (instruction-following) capability.

MMCU is a collection of Chinese multiple-choice questions in four professional disciplines: medicine, law, psychology and education (e.g., the Gaokao examination). It lets LLMs take human-society exams in a multiple-choice manner, making it suitable for evaluating the breadth and depth of LLMs' knowledge across multiple disciplines.

Data statistics of Belle-eval and MMCU are shown in the table above.

2. Main Factors

We conduct experiments to study three main factors in instruction-tuning LLMs: LLM bases, parameter-efficient methods, and Chinese instruction datasets.

2.1 LLM Bases

For open LLMs, we test existing LLMs and LLMs fine-tuned with LoRA on Alpaca-GPT4 on Belle-eval and MMCU, respectively.

Table 2 shows the scores of open LLMs on Belle-eval. Table 3 shows the accuracy of LLMs on MMCU. All the open LLMs are fine-tuned with the same parameter-efficient method (LoRA) and the same instruction dataset (Alpaca-GPT4).

Experimental Results:

  1. Evaluation of Existing LLMs

    Performance on Belle-eval

    (1) For base LLMs, Bloom performs the best.

    (2) For sft LLMs, ChatGLM outperforms others by large margins, thanks to the fact that it is trained with the most Chinese tokens and HFRL.

    (3) The Open QA, Math, CloseQA and Extract categories are still very challenging for existing open LLMs.

    (4) Vicuna and moss-sft have clear improvements compared to their bases, LLaMA and moss-base, respectively.

    (5) In contrast, the performance of sft models, Bloomz and Bloomz-mt, is reduced compared to the base model Bloom, because they tend to generate a shorter response.

    Performance on MMCU

    (1) All base LLMs perform poorly, because before fine-tuning they can hardly generate content in the specified format, e.g., outputting option numbers.

    (2) All sft LLMs outperform their corresponding base LLMs. In particular, Bloomz performs the best (even beating ChatGLM) because it can directly generate option numbers as required without producing other irrelevant content, which is due to the data characteristics of its supervised fine-tuning dataset xP3.

    (3) Among the four disciplines, law is the most challenging for LLMs.

The performance results of LLMs after instruction-tuning on Alpaca-GPT4-zh are shown in Figure 1.

  2. Instruction-tuning Different LLMs

    (1) On Belle-eval, the performance improvement of sft LLMs brought by instruction-tuning is not as significant as that of base LLMs, except for sft Bloomz and Bloomz-mt.

    (2) Vicuna and ChatGLM encounter performance drops after instruction-tuning, because Vicuna is trained from real human-ChatGPT conversations, with better quality than Alpaca-GPT4, and ChatGLM adopts HFRL, which may make it no longer suitable for further instruction-tuning.

    (3) On MMCU, most LLMs achieve performance boosts after instruction-tuning, with the exception of Bloomz and Bloomz-mt, which have unexpectedly significantly decreased performance.

    (4) After instruction-tuning, Bloom shows significant improvements and performs well on both benchmarks. Although ChatGLM beats Bloom consistently, it suffers a performance drop during instruction-tuning. Therefore, among all open LLMs, Bloom is the most suitable foundation model in the subsequent experiments on Chinese instruction-tuning exploration.

2.2 Parameter-efficient Methods

For parameter-efficient methods other than LoRA, the paper collects a range of parameter-efficient methods to instruction-tune Bloom on the Alpaca-GPT4 dataset.

Experimental Results:

  1. Comparison of Parameter-efficient Methods

    (1) SadapterH performs the best among all parameter-efficient methods, which can be used as an alternative to LoRA.

    (2) P-tuning and prompt-tuning underperform others by large margins, indicating that only adding trainable layers in the embedding layer is not enough to support LLMs on generation tasks.

    (3) Although AdaLoRA is an improvement of LoRA, its performance has a clear drop, possibly because LoRA's trainable parameters for LLMs are not suitable for further reduction.

    (4) Comparing the upper and lower parts, it can be seen that increasing the number of trainable parameters for sequential adapters (i.e., SadapterP and SadapterH) does not bring gains, while the opposite phenomenon is observed for parallel adapters (i.e., P-adapter).

  2. Training Loss

    (1) Prompt-tuning and P-tuning converge the slowest and have the highest losses after convergence. This shows that embedding-only adapters are not suitable for instruction-tuning LLMs.

    (2) The initial loss of AdaLoRA is very high because it requires simultaneous learning of parameter budget allocation, which makes the model unable to fit the training data well.

    (3) The other methods can quickly converge on training data and fit it well.

2.3 Chinese Instruction Datasets

For the impact of various types of Chinese instruction datasets, the authors gather popular open Chinese instruction datasets (as shown in Table 5) to fine-tune Bloom with LoRA.

Table 6 and Table 7 show the results of fine-tuning Bloom on the different instruction datasets.

Experimental Results:

  1. Performance on Belle-eval

    (1) The instruction data constructed by ChatGPT (e.g., using self-instruct methods or collecting real human-ChatGPT conversations) consistently enhances the instruction-following ability, with score increases of 3.1 to 11 points.

    (2) Among these datasets, Belle yields the best performance because it has the largest amount of instruction data. However, the performance of the model trained on moss-sft-data, which contains even more data built in a similar way, is unsatisfactory.

    (3) The performance brought by the Alpaca-GPT4 instructions is the second best, with only 49K instructions achieving results comparable to Belle's 1.54M.

    (4) Instinwild brings the smallest performance gains among them, because the seed instructions it crawls from tweets ("in the wild") are not as comprehensive as those carefully designed by humans (as in Alpaca).

    (5) These ChatGPT-based data mainly bring significant improvements on open generation tasks such as Brain Storm and Generation, while causing a significant decrease on tasks that require high reading-comprehension skills, such as Close QA and Extract.

    (6) The dataset-collection instructions (e.g., NLP task collections and examination datasets) damage the model's instruction-following ability, because the form and intent of each such dataset are uniform and thus easily overfitted.

    (7) Among them, COIG-trans performs the best because it involves over 2000 different tasks with a wide variety of task instructions. In contrast, xP3 and COIG-ccmc have the worst negative impact on model performance. Both of them only cover a few types of tasks (translation and QA for the former, counterfactual correction conversations for the latter), which hardly cover the popular instructions and tasks for humans.

  2. Performance on MMCU

    (1) Instruction-tuning on each dataset can always result in performance improvement.

    (2) Among the ChatGPT-based data shown in the upper part, ShareGPT-zh underperforms others by large margins. This may be due to the fact that real users rarely ask multiple choice questions about academic topics.

    (3) Among the dataset-collection data shown in the lower part, HC3 and COIG-ccmc result in the lowest accuracy, because HC3 has only 13K unique questions and the task format of COIG-ccmc differs significantly from MMCU.

    (4) COIG-exam brings the greatest accuracy improvement, benefiting from the similar task format as MMCU.

3. Other Factors

Four Other Factors: CoT, Expansion of Chinese Vocabulary, Language of Prompts and Human-value Alignment

3.1 CoT

For CoT, the authors compare the performance before and after adding CoT data during instruction-tuning.

Experiment Settings:

The authors collect 9 CoT datasets and their prompts from FLAN, and then translate them into Chinese using Google Translate.

The setting that directly adds CoT data is denoted "Alpaca-GPT4+CoT". In addition, a sentence "先思考,再决定" ("think step by step" in Chinese) is appended to the end of each instruction to induce the model to respond based on CoT; this setting is denoted "Alpaca-GPT4+CoT*". A sketch of this construction follows.
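The sketch below illustrates the "Alpaca-GPT4+CoT*" construction; the field names follow the template from the Data Formatting section, and the file names are hypothetical:

import json

COT_SUFFIX = "先思考,再决定"  # "think step by step"

with open("alpaca_gpt4.json", encoding="utf-8") as f:  # hypothetical input file
    samples = json.load(f)

for sample in samples:
    # Append the CoT-inducing sentence to every instruction ("Alpaca-GPT4+CoT*")
    sample["instruction"] = sample["instruction"].rstrip() + COT_SUFFIX

with open("alpaca_gpt4_cot_star.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)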

Experimental Results:

  1. "Alpaca-GPT4+CoT" outperforms "Alpaca-GPT4" in Code and Math tasks that require strong reasoning ability. Besides, there is also a significant improvement in the MMCU Education task.

  2. As shown in the "Alpaca-GPT4+CoT*" line, the simple appended sentence can further improve performance on the reasoning tasks Code and Education, while the Math performance is slightly inferior to "Alpaca-GPT4+CoT". This may require further exploration of more robust prompts.

3.2 Expansion of Chinese Vocabulary

For the expansion of Chinese vocabulary, the authors test the influence of the number of Chinese tokens in the tokenizer's vocabulary on an LLM's ability to express Chinese. For example, if a Chinese character is in the vocabulary, it can be represented by a single token; otherwise, it may require multiple tokens.

Experiment Settings: The authors mainly conduct experiments on LLaMA, whose SentencePiece vocabulary (32K) covers far fewer Chinese characters than Bloom's (250K). The tokenization difference can be checked directly, as in the sketch below.
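A minimal sketch comparing how many tokens each tokenizer needs for the same Chinese text (the model ids are illustrative; any LLaMA/Bloom checkpoints would do):

from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")  # placeholder id
bloom_tok = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1-mt")       # placeholder id

text = "指令微调"  # "instruction fine-tuning"
print(len(llama_tok.tokenize(text)))  # more pieces: most characters fall back to bytes/subwords
print(len(bloom_tok.tokenize(text)))  # fewer tokens: Bloom's larger vocabulary covers these characters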

Experimental Results:

  1. Pre-training on a larger Chinese corpus with an expanded Chinese vocabulary is consistently helpful for instruction-following ability.

  2. Counterintuitively, "llama-voc-pre-l" (100B) is inferior to "llama-voc-pre" (20B) on MMCU, which shows that pre-training on more data does not necessarily lead to higher performance on academic exams.

3.3 Language of Prompts

For the language of prompts, the authors test whether Chinese or English prompts are more suitable for instruction fine-tuning.

Figure 4 shows the results of using Chinese and English prompts based on LLaMA and Bloom. When instruction-tuning LLaMA, using Chinese prompts can improve the performance on both benchmarks compared to English prompts, while the opposite phenomenon can be observed on Bloom.

Experimental Results:

  1. For models with weaker Chinese abilities (e.g., LLaMA), using Chinese prompts can effectively help the model respond in Chinese.

  2. For models with good Chinese abilities (e.g., Bloom), using prompts in English (the language they are better at) can better guide the model to understand the process of fine-tuning with instructions.

3.4 Human-value Alignment

To avoid LLMs generating toxic content, aligning them with human values is a crucial issue. We add human-value alignment data built by COIG into instruction-tuning to explore its impact.

Figure 5 compares the results of instruction-tuning with and without human-value alignment.

Experimental Results: The human-value alignment results in a slight performance drop. How to balance the harmlessness and performance of LLMs is a research direction worth exploring in the future.

Quantitative Analysis

Note: The figure below shows the statistics of the collected datasets as of March 26, and is only displayed as a motivation for the data collection. More datasets have since been collected, e.g., finance-related instruction datasets.

(figure: data collection statistics)

The current collection of instruction-finetuning datasets consists mainly of three parts:

  • alpaca_data_cleaned.json: about 52K English instruction-following training samples.
  • CoT_data.json: 9 CoT datasets involving about 75k samples. (published by FLAN[7])
  • belle_data_cn.json: about 0.5M Chinese instruction-following training samples. (published by BELLE [8])

Ablation of CoT and Chinese Instructions

ablation-cot "w/o CoT" and "w/o CN" denote models that exclude CoT data and Chinese instructions from their instruction finetuning data, respectively.

The above table shows two examples (involving numerical calculations) that require a certain amount of reasoning ability to respond correctly. As shown in the middle column, Ours w/o CoT fails to generate the correct response, which shows that once the finetuning data does not contain CoT data, the model's reasoning ability decreases significantly. This further demonstrates that CoT data is essential for LLMs.


The above table shows two examples that require the ability to respond to Chinese instructions. As shown in the right column, the content generated by Ours w/o CN is either unreasonable, or answers the Chinese instructions in English. This shows that removing Chinese data during finetuning makes the model unable to handle Chinese instructions, and further demonstrates the need to collect Chinese instruction-finetuning data.


The above table shows a relatively difficult example, which requires both an accumulated knowledge of Chinese history and the ability to state historical events logically and completely. As shown in this table, Ours w/o CN can only generate a short and erroneous response, because the lack of Chinese finetuning data naturally means the corresponding knowledge of Chinese history is missing. Although Ours w/o CoT lists some relevant Chinese historical events, its logic of expression is self-contradictory, which is caused by the lack of CoT data.

In summary, the models finetuned from our complete dataset (English, Chinese, and CoT instruction data) can significantly improve model reasoning and Chinese instruction following abilities.

The Effect of CoT Data

Samples in odd-numbered rows do not apply the CoT prompt (e.g., "step-by-step reasoning"). Both Ours (w/ CoT) and Alpaca are based on LLaMA-7B, and the only difference between the two is that the instruction-finetuning data of Ours (w/ CoT) includes extra CoT data compared with that of Alpaca.

From the above table, we find that:

  • Ours(w/CoT) always generates the correct rationale before the answer, while Alpaca fails to generate any reasonable rationale, as shown in the first 4 examples (commonsense questions). This shows that using CoT data for finetuning can significantly improve reasoning ability.
  • For Ours(w/CoT), the CoT prompt (e.g., concatenate 'step-by-step' with the input question) has little effect on easy examples (e.g., commonsense questions) and has an important effect on challenging questions (e.g., questions requiring reasoning, like the last four examples).
  • For Alpaca, the CoT prompt usually has little effect or even a negative impact. For the last two examples, after adding the CoT prompt, Alpaca changes the correct generated answer to a wrong one. This may be due to the inconsistency between the input forms of finetuning and inference.

The Effect of Chinese Instruction Data

Quantitative comparison of responses to Chinese instructions.

Our model is finetuned from a 7B LLaMA on 52K English instructions and 0.5M Chinese instructions. Stanford Alpaca (our reimplementation) is finetuned from a 7B LLaMA on 52K English instructions. BELLE is finetuned from a 7B BLOOM on 2M Chinese instructions.

From the above table, several observations can be found:

  • Compared to Alpaca, ours (w/ CN) has a stronger ability to understand Chinese instructions. For the first example, Alpaca fails to distinguish between the instruction part and input part, while we do.
  • Chinese instruction-finetuning data can significantly enhance the ability to interact in Chinese. For the second example, ours (w/ CN) not only provides the correct code, but also the corresponding Chinese annotations, while Alpaca does not. In addition, as shown in examples 3-5, Alpaca can only respond to Chinese instructions with an English response.
  • Compared to BELLE, ours (w/ CN) still needs to improve on instructions requiring an open response (as shown in the last two examples). BELLE's outstanding performance on such instructions is due to: 1. its BLOOM backbone encountering much more multilingual data during pre-training; 2. its Chinese instruction-finetuning data being larger than ours, i.e., 2M vs. 0.5M.

Quantitative comparison of responses to English instructions. The purpose of this subsection is to explore whether finetuning on Chinese instructions has a negative impact on Alpaca.

From the above table, we find that:

  • Finetuning with Chinese instruction data does not weaken the original English instruction-following ability; on the contrary, there is also a certain enhancement in generating better responses to English instructions. The responses of ours (w/ CN) show more detail than those of Alpaca; e.g., for the third example, ours (w/ CN) lists three more provinces than Alpaca does.

Citation

Please cite this repository if you use its data collection, code, or experimental findings.

@misc{si2023empirical,
      title={An Empirical Study of Instruction-tuning Large Language Models in Chinese}, 
      author={Qingyi Si and Tong Wang and Zheng Lin and Xu Zhang and Yanan Cao and Weiping Wang},
      year={2023},
      eprint={2310.07328},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

For data and models, please also cite the sources of the original data, parameter-efficient methods, and LLMs.

Special thanks to the APUS AilMe Lab for sponsoring 8 A100 GPUs for our experiments.


Thanks to our contributors
