Fix dead link for whole repo
ZeyuChen committed Mar 10, 2021
1 parent b6e5d37 commit e14231a
Showing 16 changed files with 77 additions and 49 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -17,7 +17,7 @@ PaddleNLP 2.0 offers a **multi-scenario model zoo** and a **simple and easy-to-use end-to-end API**
## Features

- **Multi-scenario model zoo**
- - PaddleNLP integrates many mainstream model architectures such as RNNs and Transformers, covering fundamental NLP techniques from [word embeddings](./exmaples/word_embedding/), [lexical analysis](./examples/lexical_analysis/), [named entity recognition](./examples/information_extraction/msra_ner/), and [semantic representation](./examples/language_model/), to core NLP techniques such as [text classification](./examples/text_classification/), [text matching](./examples/text_matching/), [text generation](./examples/text_generation/), [text graph learning](./examples/text_graph/erniesage/), and [information extraction](./examples/information_extraction). It also provides core components and pretrained models for system applications such as [machine translation](./examples/machine_translation/), [general dialogue](./examples/dialogue/), and [reading comprehension](./exampels/machine_reading_comprehension/). For more details, see the [PaddleNLP examples](./examples/).
+ - PaddleNLP integrates many mainstream model architectures such as RNNs and Transformers, covering fundamental NLP techniques from [word embeddings](./examples/word_embedding/), [lexical analysis](./examples/lexical_analysis/), [named entity recognition](./examples/information_extraction/msra_ner/), and [semantic representation](./examples/language_model/), to core NLP techniques such as [text classification](./examples/text_classification/), [text matching](./examples/text_matching/), [text generation](./examples/text_generation/), [text graph learning](./examples/text_graph/erniesage/), and [information extraction](./examples/information_extraction). It also provides core components and pretrained models for system applications such as [machine translation](./examples/machine_translation/), [general dialogue](./examples/dialogue/), and [reading comprehension](./examples/machine_reading_comprehension/). For more details, see the [PaddleNLP examples](./examples/).


- **Simple and easy-to-use end-to-end API**
@@ -45,9 +45,9 @@ pip install paddlenlp\>=2.0.0rc
### Quick Dataset Loading

```diff
- from paddlenlp.datasets import ChnSentiCorp
+ from paddlenlp.datasets import load_dataset

- train_ds, dev_ds, test_ds = ChnSentiCorp.get_datasets(['train', 'dev', 'test'])
+ train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
```

See the [dataset documentation](./docs/datasets.md) for more datasets.
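As a usage sketch (not part of this commit), the object returned by the new `load_dataset` API can be inspected and transformed directly; the `"text"`/`"label"` field names below follow the chnsenticorp schema and are an assumption here:

```python
# A minimal usage sketch, assuming PaddleNLP 2.0's load_dataset API; the
# "text" and "label" field names follow the chnsenticorp schema (an
# assumption, not content from this commit).
from paddlenlp.datasets import load_dataset

train_ds = load_dataset("chnsenticorp", splits="train")
print(len(train_ds))  # number of training examples
print(train_ds[0])    # a dict such as {"text": "...", "label": 1}

# map() applies a transform to every example.
def strip_text(example):
    example["text"] = example["text"].strip()
    return example

train_ds.map(strip_text)
```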
44 changes: 31 additions & 13 deletions README_en.md
@@ -42,9 +42,9 @@ pip install paddlenlp>=2.0.0rc
### Quick Dataset Loading

```diff
- from paddlenlp.datasets import ChnSentiCorp
+ from paddlenlp.datasets import load_dataset

- train_ds, test_ds = ChnSentiCorp.get_datasets(['train','test'])
+ train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
```

### Chinese Text Embedding Loading
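As an illustrative sketch only (the body of this section is collapsed above), loading a Chinese word embedding with `paddlenlp.embeddings.TokenEmbedding` might look like the following; the embedding name and the `search`/`cosine_sim` calls are assumptions here, not content from this commit:

```python
# An illustrative sketch, assuming PaddleNLP's TokenEmbedding API; the
# embedding name below is one of the bundled Chinese vocabularies (an
# assumption here).
from paddlenlp.embeddings import TokenEmbedding

token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")

vector = token_embedding.search("中国")                  # embedding vector(s) for the word
similarity = token_embedding.cosine_sim("中国", "美国")  # cosine similarity of two words
print(vector.shape, similarity)
```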
@@ -75,19 +75,35 @@ gpt2 = GPT2ForPretraining.from_pretrained('gpt2-base-cn')

For more pretrained models, please refer to [Pretrained-Models](./docs/transformers.md).

### Easy Text Feature Extraction

```python
import paddle
from paddlenlp.transformers import ErnieTokenizer, ErnieModel

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
model = ErnieModel.from_pretrained('ernie-1.0')

text = tokenizer('自然语言处理')
# Call the model directly rather than invoking .forward(); ErnieModel
# returns (sequence_output, pooled_output) in that order.
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
```
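Here `sequence_output` carries one vector per input token, while `pooled_output` is a single sentence-level vector (naming per the PaddleNLP 2.0 API); the former suits token-level tasks such as sequence labeling, the latter sentence-level tasks such as classification.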

## Model Zoo and Applications

- - [Word Embedding](./examples/word_embedding/README.md)
- - [Lexical Analysis](./examples/lexical_analysis/README.md)
- - [Language Model](./examples/language_model)
- - [Text Classification](./examples/text_classification/README.md)
- - [Text Generation](./examples/text_generation/README.md)
- - [Semantic Matching](./examples/text_matching/README.md)
- - [Named Entity Recognition](./examples/named_entity_recognition/README.md)
- - [Text Graph](./examples/text_graph/README.md)
- - [General Dialogue](./examples/dialogue)
- - [Machine Translation](./exmaples/machine_translation)
- - [Question Answering](./exmaples/machine_reading_comprehension)
+ For an introduction to the model zoo, please refer to [PaddleNLP Model Zoo](./docs/model_zoo.md). For application scenarios, please refer to [PaddleNLP Examples](./examples/).

+ - [Word Embedding](./examples/word_embedding/)
+ - [Lexical Analysis](./examples/lexical_analysis/)
+ - [Named Entity Recognition](./examples/information_extraction/msra_ner/)
+ - [Language Model](./examples/language_model/)
+ - [Text Classification](./examples/text_classification/)
+ - [Text Generation](./examples/text_generation/)
+ - [Semantic Matching](./examples/text_matching/)
+ - [Text Graph](./examples/text_graph/erniesage/)
+ - [Information Extraction](./examples/information_extraction/)
+ - [General Dialogue](./examples/dialogue/)
+ - [Machine Translation](./examples/machine_translation/)
+ - [Machine Reading Comprehension](./examples/machine_reading_comprehension/)

## Advanced Application

@@ -113,6 +129,8 @@ Please refer to our official AI Studio account for more interactive tutorials: [
* [Waybill Information Extraction with BiGRU-CRF Model](https://aistudio.baidu.com/aistudio/projectdetail/1317771) shows how to use a BiGRU-CRF model to complete an information extraction task.

* [Waybill Information Extraction with ERNIE](https://aistudio.baidu.com/aistudio/projectdetail/1329361) shows how to use ERNIE, the Chinese pre-trained model, to improve information extraction performance.

* [Use TCN Model to predict COVID-19 confirmed cases](https://aistudio.baidu.com/aistudio/projectdetail/1290873)


## Community
4 changes: 2 additions & 2 deletions docs/datasets.md
@@ -34,7 +34,7 @@ PaddleNLP provides
| ---- | --------- | ------ |
| [Conll05](https://www.cs.upc.edu/~srlconll/spec.html) | Semantic role labeling dataset | `paddle.text.datasets.Conll05st`|
| [MSRA_NER](https://github.com/lemonhu/NER-BERT-pytorch/tree/master/data/msra) | MSRA named entity recognition dataset | `paddlenlp.datasets.MSRA_NER`|
- | [Express_Ner](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Waybill named entity recognition dataset | [express_ner](../examples/named_entity_recognition/express_ner/data)|
+ | [ExpressNer](https://aistudio.baidu.com/aistudio/projectdetail/131360?channelType=0&channel=-1) | Waybill information extraction dataset | [waybill_ie](../examples/information_extraction/waybill_ie/data/)|
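As a hedged sketch, the MSRA_NER entry above can also be loaded through the unified `load_dataset` API introduced in PaddleNLP 2.0; the `"msra_ner"` name string and the example field names are assumptions here:

```python
# A hedged sketch: loading MSRA_NER through PaddleNLP 2.0's unified API.
# The "msra_ner" name string and the example field names are assumptions.
from paddlenlp.datasets import load_dataset

train_ds, test_ds = load_dataset("msra_ner", splits=["train", "test"])
print(train_ds[0])  # e.g. {"tokens": [...], "labels": [...]}
```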

## Machine Translation

@@ -53,7 +53,7 @@ PaddleNLP provides

| Dataset | Description | Usage |
| ---- | --------- | ------ |
- | [CSSE COVID-19](https://github.com/CSSEGISandData/COVID-19) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](../examples/time_series)|
+ | [CSSE COVID-19](https://github.com/CSSEGISandData/COVID-19) | COVID-19 case data from the Johns Hopkins University Center for Systems Science and Engineering | [time_series](../examples/time_series/tcn)|
| [UCIHousing](https://archive.ics.uci.edu/ml/datasets/Housing) | Boston housing price prediction dataset | `paddle.text.datasets.UCIHousing`|

## Corpora
2 changes: 1 addition & 1 deletion docs/model_zoo.md
@@ -25,7 +25,7 @@ PaddleNLP provides a rich set of model architectures, including classic RNN-style models,
| ------ | ------ |
| [Transformer](../examples/machine_translation/transformer/) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
| [Transformer-XL](../examples/language_model/transformer-xl/) | [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) |
- | [BERT](../examples/language_model/bert/) |[BERT(Bidirectional Encoder Representation from Transformers)](./examples/language_model/bert) |
+ | [BERT](../examples/language_model/bert/) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
| [ERNIE](../examples/text_classification/pretrained_models) | [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) |
| [ERNIE-Tiny](../examples/text_classification/pretrained_models) | Baidu's compact ERNIE architecture: a shallow Transformer with widened hidden layers and a Chinese subword-granularity vocabulary, combined with distillation, improves over the pre-BERT SOTA by 8.35% while running 4.3x faster. |
| [ERNIE-GEN](../examples/text_generation/ernie-gen) | [ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation](https://arxiv.org/abs/2001.11314) ERNIE-GEN is a generative pre-training model released by Baidu. It alleviates the exposure bias between training and inference via global attention, uses a multi-flow attention mechanism to model global and context information separately, and improves semantic coherence through span-by-span generation. |
8 changes: 5 additions & 3 deletions docs/transformers.md
@@ -65,15 +65,17 @@ for input_ids, token_type_ids, labels in train_dataloader:
|Task|Model|Application Scenario|Pretrained Weights|
|---|---|---|---|
|Text classification<br>SequenceClassification |BertForSequenceClassification <br> ErnieForSequenceClassification <br> RobertaForSequenceClassification <br> ElectraForSequenceClassification <br> XLNetForSequenceClassification | [Text classification](../examples/text_classification/pretrained_models/), [Reading comprehension](../examples/machine_reading_comprehension/DuReader-yesno/)| [See table above](#Transformer预训练模型汇总)|
- |Sequence labeling<br>TokenClassification|BertForTokenClassification <br> ErnieForTokenClassification <br> RobertaForTokenClassification <br> ElectraForTokenClassification <br> XLNetForTokenClassification | [Named entity recognition](../examples/named_entity_recognition/)|[See table above](#Transformer预训练模型汇总)|
+ |Sequence labeling<br>TokenClassification|BertForTokenClassification <br> ErnieForTokenClassification <br> RobertaForTokenClassification <br> ElectraForTokenClassification <br> XLNetForTokenClassification | [Named entity recognition](../examples/information_extraction/msra_ner/)|[See table above](#Transformer预训练模型汇总)|
|Question answering<br>QuestionAnswering|BertForQuestionAnswering <br> ErnieForQuestionAnswering <br> RobertaForQuestionAnswering| [Reading comprehension](../examples/machine_reading_comprehension/SQuAD/)|[See table above](#Transformer预训练模型汇总)|
|Text generation<br>TextGeneration | ErnieForGeneration <br> GPT2ForGreedyGeneration |[Text generation](../examples/text_generation/ernie-gen)|[See table above](#Transformer预训练模型汇总)|
|Machine translation<br>MachineTranslation| TransformerModel | [Machine translation](../examples/machine_translation/transformer/)|[See table above](#Transformer预训练模型汇总)|

Users can switch among the models in the table to handle the same type of task. For example, for the text classification task in [How to use pretrained models](#预训练模型使用方法), `BertForSequenceClassification` can be replaced with `ErnieForSequenceClassification` to find a better-suited pretrained model, as the sketch below shows.
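A minimal sketch of that swap, assuming the PaddleNLP 2.0 `from_pretrained` interface; the `num_classes=2` argument and the pretrained weight names are assumptions here:

```python
# A minimal sketch of swapping the backbone for the same classification task,
# assuming PaddleNLP 2.0's from_pretrained interface; num_classes=2 and the
# pretrained weight names are assumptions here.
from paddlenlp.transformers import (
    BertForSequenceClassification,
    BertTokenizer,
    ErnieForSequenceClassification,
    ErnieTokenizer,
)

# A BERT-based classifier and its matching tokenizer ...
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# ... becomes an ERNIE-based classifier with a two-line change.
model = ErnieForSequenceClassification.from_pretrained("ernie-1.0", num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
```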

- ## References:
- - Some of the Chinese pretrained models come from: https://github.com/ymcui/Chinese-BERT-wwm
+ ## Reference

+ - [ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)
+ - [ymcui/Chinese-XLNet](https://github.com/ymcui/Chinese-XLNet)
+ - Sun, Yu, et al. "ERNIE: Enhanced Representation through Knowledge Integration." arXiv preprint arXiv:1904.09223 (2019).
+ - Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).
+ - Cui, Yiming, et al. "Pre-training with Whole Word Masking for Chinese BERT." arXiv preprint arXiv:1906.08101 (2019).
2 changes: 1 addition & 1 deletion examples/README.md
@@ -1,6 +1,6 @@
# PaddleNLP Examples

- [**PaddleNLP**](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP) is an open-source project of natural language processing (NLP) tools, algorithms, models, and data built on the PaddlePaddle deep learning framework. Baidu's more than a decade of deep experience in NLP gives PaddleNLP a strong engine. PaddleNLP provides a fairly rich model zoo that covers mainstream NLP tasks; because the model zoo builds on PaddleNLP's basic NLP tools, such as dataset processing and high-level APIs, its algorithms are concise and easy to follow.
+ [**PaddleNLP**](https://github.com/PaddlePaddle/PaddleNLP) is an open-source project of natural language processing (NLP) tools, algorithms, models, and data built on the PaddlePaddle deep learning framework. Baidu's more than a decade of deep experience in NLP gives PaddleNLP a strong engine. PaddleNLP provides a fairly rich model zoo that covers mainstream NLP tasks; because the model zoo builds on PaddleNLP's basic NLP tools, such as dataset processing and high-level APIs, its algorithms are concise and easy to follow.

The following details the tasks supported by PaddleNLP, covering three areas: [**fundamental NLP techniques**](#nlp基础技术), [**core NLP techniques**](#nlp核心技术), and [**NLP system applications**](#nlp系统应用). As NLP sequence modeling matures, we also provide further application scenarios built on it, such as [protein secondary structure prediction](#蛋白质二级结构预测-protein-secondary-structure-prediction), along with an advanced [model compression](#模型压缩-model-compression) example.

13 changes: 5 additions & 8 deletions examples/information_extraction/msra_ner/README.md
@@ -17,7 +17,7 @@ The MSRA-NER dataset bundled with PaddleNLP adjusts the file format: each

- Python >= 3.6
- paddlepaddle >= 2.0.0; for installation, see the [quick install guide](https://www.paddlepaddle.org.cn/install/quick)
- - paddlenlp >= 2.0.0rc4; install with `pip install paddlenlp\>=2.0.0rc4`
+ - paddlenlp >= 2.0.0rc10; install with `pip install paddlenlp\>=2.0.0rc10`

### 2.2 Model Training

@@ -37,7 +37,7 @@ python -u ./train.py \
```

The parameters are as follows:
- - `model_name_or_path`: Specifies a model of a particular configuration, with its corresponding pretrained weights and the tokenizer used during pretraining. All models in [PaadleNLP transformer-style pretrained models](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/docs/transformers.md) except ernie-gen are supported. To use a non-BERT-family model, modify the script to import the corresponding Task and Tokenizer. If the model files are stored locally, a directory path can also be given here.
+ - `model_name_or_path`: Specifies a model of a particular configuration, with its corresponding pretrained weights and the tokenizer used during pretraining. All models in the [PaddleNLP Transformer API](../../../docs/transformers.md) except ernie-gen are supported. To use a non-BERT-family model, modify the script to import the corresponding Task and Tokenizer. If the model files are stored locally, a directory path can also be given here.
- `max_seq_length`: The maximum sentence length; sequences longer than this are truncated.
- `batch_size`: The number of samples **per card** in each iteration.
- `learning_rate`: The base learning rate, which is multiplied by the value produced by the learning rate scheduler to give the current learning rate.
@@ -100,11 +100,8 @@ python -u ./predict.py \

## 5. Using Other Pretrained Models

- This project supports all models in [PaadleNLP transformer-style pretrained models](../../docs/transformers.md) except ernie-gen. To use a non-BERT-family model, modify the script to import the corresponding Task and Tokenizer. For example, to use an ERNIE-family model, consult [PaadleNLP transformer-style pretrained models](../../docs/transformers.md) and add the following code:
- ```python
- from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer
- ```
+ Please refer to the [Transformer API documentation](../../../docs/transformers.md) for more pretrained models supported by PaddleNLP; simply change the `--model_name_or_path` argument to compare the performance of other pretrained models.

- ## References
+ ## Reference

- [The third international Chinese language processing bakeoff: Word segmentation and named entity recognition](https://faculty.washington.edu/levow/papers/sighan06.pdf)
+ - [The third international Chinese language processing bakeoff: Word segmentation and named entity recognition](https://faculty.washington.edu/levow/papers/sighan06.pdf)
5 changes: 5 additions & 0 deletions examples/language_model/elmo/README.md
@@ -1,6 +1,7 @@
# ELMo

## Model Overview

ELMo (Embeddings from Language Models) is one of the important general-purpose semantic representation models. Using a bidirectional LSTM as its basic network component and language modeling as its training objective, it learns general semantic representations through pretraining. ELMo can capture complex features such as syntax and semantics, as well as how a word's meaning varies across contexts. Transferring ELMo representations as features to downstream NLP tasks significantly improves model performance on tasks such as question answering, textual entailment, and sentiment analysis. For details of the ELMo model, see the [paper](https://arxiv.org/abs/1802.05365).

This project is an open-source implementation of ELMo on Paddle, pretrained on the 1 Billion Word Language Model Benchmark, with a simple downstream task included as an example program.
@@ -117,3 +118,7 @@ python example.py --init_from_ckpt='./checkpoints/10000'
```

**NOTE:** The `trainable` argument at model construction time controls whether ELMo participates in downstream training. A pretrained ELMo can also be used on its own as a text encoder: given input text, it outputs a vector for each word. For the details of plugging ELMo into downstream tasks, see the examples `example_of_using_ELMo_as_finetune()` and `example_of_using_ELMo_as_embedder()` in `example.py`.

## Reference

- [Deep contextualized word representations](https://arxiv.org/abs/1802.05365)
9 changes: 7 additions & 2 deletions examples/language_model/xlnet/README.md
@@ -2,7 +2,7 @@

## Model Overview

- [XLNet](https://arxiv.org/abs/1906.08237) (XLNet: Generalized Autoregressive Pretraining for Language Understanding) is an unsupervised autoregressive pretrained language model. Unlike traditional unidirectional autoregressive models, XLNet models language by maximizing the expected likelihood over all permutations of the input sequence, which lets it attend to context on both sides. In addition, XLNet integrates the [Transformer-XL](https://arxiv.org/abs/1901.02860) model during pretraining; Transformer-XL's segment recurrence mechanism and relative positional encoding allow XLNet to accept longer input sequences, giving it excellent performance on language tasks over long text.
+ [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) is an unsupervised autoregressive pretrained language model. Unlike traditional unidirectional autoregressive models, XLNet models language by maximizing the expected likelihood over all permutations of the input sequence, which lets it attend to context on both sides. In addition, XLNet integrates the [Transformer-XL](https://arxiv.org/abs/1901.02860) model during pretraining; Transformer-XL's segment recurrence mechanism and relative positional encoding allow XLNet to accept longer input sequences, giving it excellent performance on language tasks over long text.

This project is an open-source implementation of XLNet on Paddle 2.0 and includes fine-tuning code for the [GLUE benchmark tasks](https://gluebenchmark.com/tasks).

@@ -72,4 +72,9 @@ python -m paddle.distributed.launch ./run_glue.py \
| STS-B | Pearson/Spearman corr | 86.243/85.973 |
| QQP | Accuracy/F1 | 90.838/87.644 |
| MNLI | Matched acc/Mismatched acc | 87.468/86.859 |
| RTE | Accuracy | 70.036 |

## Reference

- [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)
- [zihangdai/xlnet](https://github.com/zihangdai/xlnet)
2 changes: 1 addition & 1 deletion examples/model_compression/ofa/README.md
@@ -46,7 +46,7 @@ python -u ./run_glue.py \
--output_dir ./tmp/$TASK_NAME/ \
--n_gpu 1 \
```
- For the detailed meaning of each parameter, see the [README.md](../../glue)
+ For the detailed meaning of each parameter, see the [README.md](../../benchmark/glue/README.md)
The fine-tuning results on the dev set are shown in the Result column of the compression results table.

### Environment Setup
