Skip to content

Latest commit

 

History

History
226 lines (155 loc) · 9.89 KB

README_en.md

File metadata and controls

226 lines (155 loc) · 9.89 KB

English | 简体中文


PyPI - PaddleNLP Version PyPI - Python Version PyPI Status python version support os GitHub

News

  • [2021-10-12] PaddleNLP 2.1 has been officially relealsed! 🎉 For more information please refer to Release Note.

Introduction

PaddleNLP is a powerful NLP library with Awesome pre-trained Transformer models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications.

  • Easy-to-Use API

    • The API is fully integrated with PaddlePaddle 2.0 high-level API system. It minimizes the number of user actions required for common use cases like data loading, text pre-processing, awesome transfomer models, and fast inference, which enables developer to deal with text problems more productively.
  • Wide-range NLP Task Support

    • PaddleNLP support NLP task from research to industrial applications, including Lexical Analysis, Text Classification, Text Matching, Text Generation, Information Extraction, Machine Translation, General Dialogue and Question Answering etc.
  • High Performance Distributed Training

    • We provide an industrial level training pipeline for super large-scale Transformer model based on Auto Mixed Precision and Fleet distributed training API by PaddlePaddle, which can support customized model pre-training efficiently.

Installation

Prerequisites

  • python >= 3.6
  • paddlepaddle >= 2.1

More information about PaddlePaddle installation please refer to PaddlePaddle's Website.

Python pip Installation

pip install --upgrade paddlenlp

Easy-to-use API

Taskflow:Off-the-shelf Industial NLP Pre-built Task

Taskflow aims to provide off-the-shelf NLP pre-built task covering NLU and NLG scenario, in the meanwhile with extreamly fast infernece satisfying industrial applications.

from paddlenlp import Taskflow

# Chinese Word Segmentation
seg = Taskflow("word_segmentation")
seg("第十四届全运会在西安举办")
>>> ['第十四届', '全运会', '在', '西安', '举办']

# POS Tagging
tag = Taskflow("pos_tagging")
tag("第十四届全运会在西安举办")
>>> [('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')]

# Named Entity Recognition
ner = Taskflow("ner")
ner("《孤女》是2010年九州出版社出版的小说,作者是余兼羽")
>>> [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), (',', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')]

# Dependency Parsing
ddp = Taskflow("dependency_parsing")
ddp("百度是一家高科技公司")
>>> [{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': ['2', '0', '5', '5', '2'], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}]

# Sentiment Analysis
senta = Taskflow("sentiment_analysis")
senta("怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片")
>>> [{'text': '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片', 'label': 'negative', 'score': 0.6691398620605469}]

For more usage please refer to Taskflow Docs

Transformer API: Awesome Pre-trained Model Ecosystem

We provide 22 network architectures and over 90 pretrained models. Not only includes all the SOTA model like ERNIE, PLATO and SKEP released by Baidu, but also integrates most of the high quality Chinese pretrained model developed by other organizations. We welcome all developers to contribute your Transformer models to PaddleNLP! 🤗

from paddlenlp.transformers import *

ernie = ErnieModel.from_pretrained('ernie-1.0')
ernie_gram = ErnieGramModel.from_pretrained('ernie-gram-zh')
bert = BertModel.from_pretrained('bert-wwm-chinese')
albert = AlbertModel.from_pretrained('albert-chinese-tiny')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')

PaddleNLP also provides unified API experience for NLP task like semantic representation, text classification, sentence matching, sequence labeling, question answering, etc.

import paddle
from paddlenlp.transformers import ErnieTokenizer, ErnieModel

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
text = tokenizer('natural language understanding')

# Semantic Representation
model = ErnieModel.from_pretrained('ernie-1.0')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text Classificaiton and Matching
model = ErnieForSequenceClassification.from_pretrained('ernie-1.0')
# Sequence Labeling
model = ErnieForTokenClassification.from_pretrained('ernie-1.0')
# Question Answering
model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0')

For more pretrained model usage, please refer to Transformer API

Dataset API: Abundant Dataset Integration and Quick Loading

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

For more dataset API usage please refer to Dataset API.

Embedding API: Quick Loading for Word Embedding

from paddlenlp.embeddings import TokenEmbedding

wordemb = TokenEmbedding("fasttext.wiki-news.target.word-word.dim300.en")
wordemb.cosine_sim("king", "queen")
>>> 0.77053076
wordemb.cosine_sim("apple", "rail")
>>> 0.29207364

For more TokenEmbedding usage, please refer to Embedding API

More API Usage

Please find more API Reference from our readthedocs.

Wide-range NLP Task Support

PaddleNLP provides rich application examples covering mainstream NLP task to help developers accelerate problem solving.

NLP Basic Technique

NLP System

NLP Extented Applications

Tutorials

Please refer to our official AI Studio account for more interactive tutorials: PaddleNLP on AI Studio

Community

Special Interest Group (SIG)

Welcome to join PaddleNLP SIG for contribution, eg. Dataset, Models and Toolkit.

Slack

To connect with other users and contributors, welcome to join our Slack channel.

QQ

Join our QQ Technical Group for technical exchange right now! ⬇️

ChangeLog

For more details about our release, please refer to ChangeLog

License

PaddleNLP is provided under the Apache-2.0 License.