[GPT] GPT model with sharding and model parallel. (PaddlePaddle#339)

ZHUI authored May 14, 2021
1 parent 2e6f0b1 commit 669b3ae
Showing 42 changed files with 3,421 additions and 620 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -85,13 +85,13 @@ wordemb.cosine_sim("艺术", "火车")
### Load high-quality Chinese pre-trained models with one line

```python
from paddlenlp.transformers import ErnieModel, BertModel, RobertaModel, ElectraModel, GPT2ForPretraining
from paddlenlp.transformers import ErnieModel, BertModel, RobertaModel, ElectraModel, GPTForPretraining

ernie = ErnieModel.from_pretrained('ernie-1.0')
bert = BertModel.from_pretrained('bert-wwm-chinese')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt2 = GPT2ForPretraining.from_pretrained('gpt2-base-cn')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
```

### Convenient text feature extraction
4 changes: 2 additions & 2 deletions README_en.md
@@ -81,13 +81,13 @@ For more TokenEmbedding usage, please refer to [Embedding API](./docs/embeddings
### Rich Chinese Pre-trained Models

```python
from paddlenlp.transformers import ErnieModel, BertModel, RobertaModel, ElectraModel, GPT2ForPretraining
from paddlenlp.transformers import ErnieModel, BertModel, RobertaModel, ElectraModel, GPTForPretraining

ernie = ErnieModel.from_pretrained('ernie-1.0')
bert = BertModel.from_pretrained('bert-wwm-chinese')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt2 = GPT2ForPretraining.from_pretrained('gpt2-base-cn')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')
```

For more pretrained model selection, please refer to [Transformer API](./docs/transformers.md)
2 changes: 1 addition & 1 deletion docs/model_zoo.md
@@ -30,7 +30,7 @@ PaddleNLP provides a rich set of model architectures, including classic RNN-family models,
| [ERNIE-Tiny](../examples/text_classification/pretrained_models) | Baidu's in-house compact ERNIE architecture: a shallow Transformer with widened hidden layers and a Chinese subword-level vocabulary, combined with distillation, yields an 8.35% improvement over the pre-BERT SOTA and a 4.3x speedup. |
| [ERNIE-GEN](../examples/text_generation/ernie-gen) | [ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation](https://arxiv.org/abs/2001.11314) ERNIE-GEN is a generative pre-trained model released by Baidu. It mitigates the exposure bias between training and inference via Global-Attention, uses a Multi-Flow Attention mechanism to model global and contextual interactions separately, and generates span by span to improve semantic coherence. |
| [ERNIESage](../examples/text_graph/erniesage)| ERNIESage (ERNIE SAmple aggreGatE) uses a graph to model the connections between each node and its neighbors, packs a node and its neighbors into a correlated sample that is fed to ERNIE, and uses ERNIE as the aggregator to capture the semantic relations between the node and its neighbors, ultimately strengthening the semantic representation of nodes in the graph.|
| [GPT-2](../examples/language_model/gpt2) |[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) |
| [GPT-2](../examples/language_model/gpt) |[Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf) |
| [ELECTRA](../examples/language_model/electra/) | [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555) |
| [XLNet](../examples/language_model/xlnet/) | [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) |
| [RoBERTa](../examples/text_classification/pretrained_models) | [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) |
6 changes: 3 additions & 3 deletions docs/model_zoo/transformers.md
@@ -13,7 +13,7 @@
|[ERNIE](https://arxiv.org/abs/1904.09223)|ErnieTokenizer<br>ErnieTinyTokenizer|ErnieModel<br> ErnieForQuestionAnswering<br> ErnieForSequenceClassification<br> ErnieForTokenClassification | `ernie-1.0`<br> `ernie-tiny`<br> `ernie-2.0-en`<br> `ernie-2.0-large-en`|
|[ERNIE-GEN](https://arxiv.org/abs/2001.11314)|ErnieTokenizer| ErnieForGeneration|`ernie-gen-base-en`<br>`ernie-gen-large-en`<br>`ernie-gen-large-en-430g`|
| ERNIE-CTM | ErnieCtmTokenizer | ErnieCtmModel<br> ErnieCtmWordtagModel | `ernie-ctm` |
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPT2Tokenizer<br> GPT2ChineseTokenizer| GPT2ForGreedyGeneration| `gpt2-base-cn` <br> `gpt2-medium-en`|
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[RoBERTa](https://arxiv.org/abs/1907.11692)|RobertaTokenizer| RobertaModel<br>RobertaForQuestionAnswering<br>RobertaForSequenceClassification<br>RobertaForTokenClassification| `roberta-wwm-ext`<br> `roberta-wwm-ext-large`<br> `rbt3`<br> `rbtl3`|
| [BigBird](https://arxiv.org/abs/2007.14062) | BigBirdTokenizer | BigBirdModel<br> BigBirdForSequenceClassification<br> BigBirdForPretraining | `bigbird-base-uncased` |
|[ELECTRA](https://arxiv.org/abs/2003.10555) | ElectraTokenizer| ElectraModel<br>ElectraForSequenceClassification<br>ElectraForTokenClassification<br>|`electra-small`<br> `electra-base`<br> `electra-large`<br> `chinese-electra-small`<br> `chinese-electra-base`<br>|
@@ -25,7 +25,7 @@

| Chinese pre-trained models include: |
|---|
|`bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny`,<br> `gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3`,<br> `chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid`, <br>`chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge` |
|`bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny`,<br> `gpt-cpm-large-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3`,<br> `chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid`, <br>`chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge` |


## How to use the pre-trained models
@@ -75,7 +75,7 @@ for input_ids, token_type_ids, labels in train_dataloader:
|Text classification<br>SequenceClassification |BertForSequenceClassification <br> ErnieForSequenceClassification <br> RobertaForSequenceClassification <br> ElectraForSequenceClassification <br> XLNetForSequenceClassification | [See the table above](#Transformer预训练模型汇总)|
|Sequence labeling<br>TokenClassification|BertForTokenClassification <br> ErnieForTokenClassification <br> RobertaForTokenClassification <br> ElectraForTokenClassification <br> XLNetForTokenClassification |[See the table above](#Transformer预训练模型汇总)|
|Question answering<br>QuestionAnswering|BertForQuestionAnswering <br> ErnieForQuestionAnswering <br> RobertaForQuestionAnswering|[See the table above](#Transformer预训练模型汇总)|
|Text generation<br>TextGeneration | ErnieForGeneration <br> GPT2ForGreedyGeneration |[See the table above](#Transformer预训练模型汇总)|
|Text generation<br>TextGeneration | ErnieForGeneration <br> GPTForGreedyGeneration |[See the table above](#Transformer预训练模型汇总)|
|Machine translation<br>MachineTranslation| TransformerModel |[See the table above](#Transformer预训练模型汇总)|

You can switch between the models in the table to handle the same type of task. For example, for the text classification task in [How to use the pre-trained models](#预训练模型使用方法), you can swap `BertForSequenceClassification` for `ErnieForSequenceClassification` to find a better-suited pre-trained model (see the sketch below).
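
As a rough illustration of that swap (not part of the original docs): the model names come from the tables above, and `num_classes=2` is an assumed, illustrative value for a binary classification task.

```python
# Hypothetical sketch: swapping the backbone while keeping the downstream code unchanged.
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer

# Before: BERT backbone
# model = BertForSequenceClassification.from_pretrained('bert-wwm-chinese', num_classes=2)
# tokenizer = BertTokenizer.from_pretrained('bert-wwm-chinese')

# After: ERNIE backbone, same task, same training loop
model = ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
```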
6 changes: 3 additions & 3 deletions docs/transformers.md
@@ -12,14 +12,14 @@
| [BERT](https://arxiv.org/abs/1810.04805) | BertTokenizer|BertModel<br> BertForQuestionAnswering<br> BertForSequenceClassification<br>BertForTokenClassification| `bert-base-uncased`<br> `bert-large-uncased` <br>`bert-base-multilingual-uncased` <br>`bert-base-cased`<br> `bert-base-chinese`<br> `bert-base-multilingual-cased`<br> `bert-large-cased`<br> `bert-wwm-chinese`<br> `bert-wwm-ext-chinese` |
|[ERNIE](https://arxiv.org/abs/1904.09223)|ErnieTokenizer<br>ErnieTinyTokenizer|ErnieModel<br> ErnieForQuestionAnswering<br> ErnieForSequenceClassification<br> ErnieForTokenClassification | `ernie-1.0`<br> `ernie-tiny`<br> `ernie-2.0-en`<br> `ernie-2.0-large-en`|
|[ERNIE-GEN](https://arxiv.org/abs/2001.11314)|ErnieTokenizer| ErnieForGeneration|`ernie-gen-base-en`<br>`ernie-gen-large-en`<br>`ernie-gen-large-en-430g`|
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPT2Tokenizer<br> GPT2ChineseTokenizer| GPT2ForGreedyGeneration| `gpt2-base-cn` <br> `gpt2-medium-en`|
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPTTokenizer<br> GPTChineseTokenizer| GPTForGreedyGeneration| `gpt-cpm-large-cn` <br> `gpt2-medium-en`|
|[RoBERTa](https://arxiv.org/abs/1907.11692)|RobertaTokenizer| RobertaModel<br>RobertaForQuestionAnswering<br>RobertaForSequenceClassification<br>RobertaForTokenClassification| `roberta-wwm-ext`<br> `roberta-wwm-ext-large`<br> `rbt3`<br> `rbtl3`|
|[ELECTRA](https://arxiv.org/abs/2003.10555) | ElectraTokenizer| ElectraModel<br>ElectraForSequenceClassification<br>ElectraForTokenClassification<br>|`electra-small`<br> `electra-base`<br> `electra-large`<br> `chinese-electra-small`<br> `chinese-electra-base`<br>|
|[XLNet](https://arxiv.org/abs/1906.08237)| XLNetTokenizer| XLNetModel<br> XLNetForSequenceClassification<br> XLNetForTokenClassification |`xlnet-base-cased`<br> `xlnet-large-cased`<br> `chinese-xlnet-base`<br> `chinese-xlnet-mid`<br> `chinese-xlnet-large`|
|[UnifiedTransformer](https://arxiv.org/abs/2006.16779)| UnifiedTransformerTokenizer| UnifiedTransformerModel<br> UnifiedTransformerLMHeadModel |`unified_transformer-12L-cn`<br> `unified_transformer-12L-cn-luge` |
|[Transformer](https://arxiv.org/abs/1706.03762) |- | TransformerModel | - |

**NOTE**: The Chinese pre-trained models are `bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge`
**NOTE**: The Chinese pre-trained models are `bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt-cpm-large-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge`

## How to use the pre-trained models

@@ -79,7 +79,7 @@ for input_ids, token_type_ids, labels in train_data_loader():
|Text classification<br>SequenceClassification |BertForSequenceClassification <br> ErnieForSequenceClassification <br> RobertaForSequenceClassification <br> ElectraForSequenceClassification <br> XLNetForSequenceClassification | [Text classification](../examples/text_classification/pretrained_models/), [Reading comprehension](../examples/machine_reading_comprehension/DuReader-yesno/)| [See the table above](#Transformer预训练模型汇总)|
|Sequence labeling<br>TokenClassification|BertForTokenClassification <br> ErnieForTokenClassification <br> RobertaForTokenClassification <br> ElectraForTokenClassification <br> XLNetForTokenClassification | [Named entity recognition](../examples/information_extraction/msra_ner/)|[See the table above](#Transformer预训练模型汇总)|
|Question answering<br>QuestionAnswering|BertForQuestionAnswering <br> ErnieForQuestionAnswering <br> RobertaForQuestionAnswering| [Reading comprehension](../examples/machine_reading_comprehension/SQuAD/)|[See the table above](#Transformer预训练模型汇总)|
|Text generation<br>TextGeneration | ErnieForGeneration <br> GPT2ForGreedyGeneration |[Text generation](../examples/text_generation/ernie-gen)|[See the table above](#Transformer预训练模型汇总)|
|Text generation<br>TextGeneration | ErnieForGeneration <br> GPTForGreedyGeneration |[Text generation](../examples/text_generation/ernie-gen)|[See the table above](#Transformer预训练模型汇总)|
|Machine translation<br>MachineTranslation| TransformerModel | [Machine translation](../examples/machine_translation/transformer/)|[See the table above](#Transformer预训练模型汇总)|

You can switch between the models in the table to handle the same type of task. For example, for the text classification task in [How to use the pre-trained models](#预训练模型使用方法), you can swap `BertForSequenceClassification` for `ErnieForSequenceClassification` to find a better-suited pre-trained model.
@@ -8,15 +8,18 @@
```text
.
├── args.py # training argument configuration
├── data.py # data processing
├── create_pretraining_data.py # data preprocessing script
├── dataset.py # data processing
├── decompress.sh # dataset decompression script
├── generate_sample.py # text generation demo
├── deploy/ # inference scripts for model deployment
├── export_model.py # script to export the model for inference deployment
├── predict.py # text generation demo
├── lr.py # learning-rate scheduling
├── process_data.py # data preprocessing script
├── README.md # documentation
├── run_pretrain.py # pre-training entry point
├── run_eval.py # evaluation entry point
└── scripts # training scripts
├── run_pretrain.py # pre-training entry point
├── run_pretrain_static.py # hybrid-parallel pre-training script
└── scripts/ # training scripts
```

## Quick start
@@ -49,7 +52,7 @@ bash decompress.sh
To speed up training, we convert the text data into the corresponding token ids before training and save them in npz format:

```shell
python process_data.py --input_path raw_data \
python create_pretraining_data.py --input_path raw_data \
--model_name gpt2-medium-en \
--append_eod \
--workers 8
@@ -58,7 +61,7 @@
Running the command produces a `raw_data_ids.npz` file. To make it easy to run and test the model, this project also provides a preprocessed 300M training sample:

```shell
wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt2/train.data.json_ids.npz
wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/train.data.json_ids.npz
```

Put all the preprocessed npz files into a single folder for training:
@@ -74,8 +77,8 @@ mv train.data.json_ids.npz data

```shell
CUDA_VISIBLE_DEVICES=0 python run_pretrain.py \
--model_type gpt2 \
--model_name_or_path gpt2-small-en \
--model_type gpt \
--model_name_or_path gpt2-en \
--input_dir "./data"\
--output_dir "output"\
--weight_decay 0.01\
@@ -84,7 +87,7 @@
--save_steps 100000\
--decay_steps 320000\
--warmup_rate 0.01\
--batch_size 8\
--batch_size 4\
--device gpu
```

@@ -108,8 +111,8 @@
```shell
unset CUDA_VISIBLE_DEVICES
python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_pretrain.py \
--model_type gpt2 \
--model_name_or_path gpt2-small-en \
--model_type gpt \
--model_name_or_path gpt2-en \
--input_dir "./data"\
--output_dir "output"\
--weight_decay 0.01\
@@ -118,7 +121,7 @@
--save_steps 100000\
--decay_steps 320000\
--warmup_rate 0.01\
--batch_size 8\
--batch_size 4\
--device gpu
```

@@ -148,7 +151,7 @@ python run_eval.py --model_name gpt2-medium-en \
--device gpu
```
The parameters are as follows:
- `model_name` Name of the model to use, e.g. gpt2-samll-en.
- `model_name` Name of the model to use, e.g. gpt2-medium-en.
- `eval_path` Path to the evaluation dataset.
- `init_checkpoint_path` Path to the model checkpoint.
- `batch_size` Batch size.
@@ -180,6 +183,46 @@ python generate_sample.py
对影成三人。
```

## Export the model for inference

Below is a simple example that shows how to export a pre-trained model into parameters ready for inference deployment.

Export the Chinese model:
```shell
python export_model.py --model_type=gpt-cn \
--model_path=gpt-cpm-large-cn \
--output_path=./infer_model/model
```
The exported files can be found in the `infer_model` directory.

For the exported model, we provide a Python inference script that calls the inference library to run prediction on a simple example (a minimal sketch of the underlying Paddle Inference API follows the command below).
```shell
python deploy/python/inference.py --model_type gpt-cn \
--model_path ./infer_model/model
```
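
The repository's actual `deploy/python/inference.py` is not expanded in this diff; the following is only a minimal sketch of the Paddle Inference API such a script typically builds on. The file names under `infer_model/` and the dummy input are assumptions based on the `--output_path` used above.

```python
# Hypothetical sketch of loading the exported model with the Paddle Inference API.
import numpy as np
from paddle.inference import Config, create_predictor

# Exported program/params files; names are assumptions derived from --output_path above.
config = Config("./infer_model/model.pdmodel", "./infer_model/model.pdiparams")
config.enable_use_gpu(100, 0)          # 100 MB initial GPU memory pool, device id 0
predictor = create_predictor(config)

# Feed a dummy batch of token ids.
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
ids = np.array([[1, 2, 3, 4]], dtype="int64")
input_handle.copy_from_cpu(ids)

predictor.run()

output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
print(output_handle.copy_to_cpu())     # generated ids / logits, depending on the export
```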


Export the English model:
```shell
python export_model.py --model_type=gpt \
--model_path=gpt2-medium-en \
--output_path=./infer_model/model
python deploy/python/inference.py --model_type gpt \
--model_path ./infer_model/model
```

The prediction results are printed to the screen.

## PaddlePaddle 4D hybrid parallel training

PaddlePaddle's 4D hybrid parallelism combines sharding, model parallelism, pipeline parallelism, and data parallelism, making it feasible to train models at the scale of hundreds of billions of parameters. This example provides GPT pre-training based on PaddlePaddle's latest hybrid parallel strategies. Run the following script to start pre-training:
```shell
sh scripts/run_static.sh
```
You can flexibly adjust the parallel strategy to match your hardware and pick the combination that best fits your model; a rough configuration sketch follows below. For more examples of hybrid parallel strategies, see the [PaddlePaddle 4D hybrid parallel training guide](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/collective/collective_mp/hybrid_parallelism.html).
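
The actual strategy lives in `scripts/run_static.sh` and `run_pretrain_static.py`, which are not expanded in this diff. Purely as an illustration, a 4D setup with `paddle.distributed.fleet` in static-graph mode might be configured along these lines; all degrees and config keys are assumptions to adapt to your own cluster.

```python
# Hypothetical sketch of a 4D hybrid-parallel configuration with paddle.distributed.fleet
# (static graph). Degrees and config keys are illustrative only; see scripts/run_static.sh
# and run_pretrain_static.py for the settings actually used in this example.
import paddle
import paddle.distributed.fleet as fleet

paddle.enable_static()

strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {
    "sharding_degree": 2,        # shard optimizer states/gradients across 2 ranks
    "mp_degree": 2,              # tensor (model) parallel degree
    "pp_degree": 2,              # pipeline parallel degree
    "dp_degree": 2,              # data parallel degree -> 2*2*2*2 = 16 GPUs in total
    "segment_broadcast_MB": 32,
}
strategy.pipeline = True
strategy.pipeline_configs = {"accumulate_steps": 8, "micro_batch_size": 2}

fleet.init(is_collective=True, strategy=strategy)

# Build the GPT program as usual, then wrap the optimizer so Fleet applies the strategy:
# optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)
# optimizer.minimize(loss)
```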

## References
- [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473)