
Causal language modeling

[[open-in-colab]]

There are two types of language modeling, causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications like choosing your own text adventure, or an intelligent coding assistant like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
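
To make "can only attend to tokens on the left" concrete, here is a minimal sketch (illustrative only, not part of this guide's recipe) of the causal attention mask such models apply internally. Position i may attend only to positions 0 through i:

>>> import torch

>>> seq_len = 5
>>> # Lower-triangular mask: True marks positions a token is allowed to attend to.
>>> torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
tensor([[ True, False, False, False, False],
        [ True,  True, False, False, False],
        [ True,  True,  True, False, False],
        [ True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True]])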

This guide will show you how to:

  1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

You can finetune other architectures for causal language modeling following the same steps in this guide. Choose one of the following architectures:

BART, BERT, Bert Generation, BigBird, BigBird-Pegasus, BioGpt, Blenderbot, BlenderbotSmall, BLOOM, CamemBERT, CodeLlama, CodeGen, CPM-Ant, CTRL, Data2VecText, ELECTRA, ERNIE, Falcon, GIT, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT NeoX Japanese, GPT-J, LLaMA, Marian, mBART, MEGA, Megatron-BERT, MPT, MusicGen, MVP, OpenLlama, OpenAI GPT, OPT, Pegasus, Persimmon, PLBart, ProphetNet, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, RWKV, Speech2Text2, Transformer-XL, TrOCR, XGLM, XLM, XLM-ProphetNet, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the [~datasets.Dataset.train_test_split] method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the next word is the label.
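
As a minimal illustration of why the next word is the label (the token ids below are made up for this sketch): you pass the model labels identical to input_ids, and it shifts them internally so that position i is scored against the token at position i + 1:

>>> input_ids = [464, 2068, 7586, 21831, 18045]  # made-up token ids
>>> labels = input_ids.copy()  # exactly what gets passed to the model as labels
>>> # Conceptually, the loss pairs each token with the one that follows it:
>>> list(zip(input_ids[:-1], labels[1:]))
[(464, 2068), (2068, 7586), (7586, 21831), (21831, 18045)]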

Preprocess

The next step is to load a DistilGPT2 tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the [~datasets.Dataset.flatten] method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [~datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.

>>> block_size = 128


>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder; we could add padding instead of dropping it. You can customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split into chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
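
As an optional sanity check (not part of the original recipe), you can confirm that every example is now exactly block_size tokens long and that the labels are a copy of the input ids:

>>> example = lm_dataset["train"][0]
>>> len(example["input_ids"])
128
>>> example["labels"] == example["input_ids"]
True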

Now create a batch of examples using [DataCollatorForLanguageModeling]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Use the end-of-sequence token as the padding token and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
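
If you want to see what the collator does, here is an optional inspection with made-up token ids (a sketch, not part of the original guide). With mlm=False, the collator pads input_ids with the eos token (id 50256 for the GPT-2 tokenizer) and copies them into labels, replacing padded positions with -100 so the loss ignores them:

>>> features = [{"input_ids": [10, 20, 30]}, {"input_ids": [40, 50]}]  # toy ids
>>> batch = data_collator(features)
>>> batch["input_ids"]
tensor([[   10,    20,    30],
        [   40,    50, 50256]])
>>> batch["labels"]
tensor([[  10,   20,   30],
        [  40,   50, -100]])

Note that because the pad token is the eos token here, genuine eos tokens in your data are masked out of the loss as well.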

Train

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial!

You're ready to start training your model now! Load DistilGPT2 with [AutoModelForCausalLM]:

>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")

At this point, only three steps remain:

  1. Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True pushes this model to the Hub (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to [Trainer] along with the model, datasets, and data collator.
  3. Call [~Trainer.train] to finetune your model.

>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_clm-model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, use the [~transformers.Trainer.evaluate] method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61

Then share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:

>>> trainer.push_to_hub()

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilGPT2 with [TFAutoModelForCausalLM]:

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Convert your datasets to the tf.data.Dataset format with [~transformers.TFPreTrainedModel.prepare_tf_dataset]:

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

Before you start training, set up a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the [~transformers.PushToHubCallback]:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from:

>>> prompt = "Somatic hypermutation allows the immune system to"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for text generation with your model, and pass your text to it:

>>> from transformers import pipeline

>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]

Tokenize the text and return the input_ids as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids

Use the [~transformers.generation_utils.GenerationMixin.generate] method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]

Tokenize the text and return the input_ids as TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids

Use the [~transformers.generation_tf_utils.TFGenerationMixin.generate] method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']