English | 简体中文



Introduction

PaddleNLP aims to accelerate NLP applications through a powerful model zoo, easy-to-use APIs, and high-performance distributed training. It also serves as the NLP best practice for the PaddlePaddle 2.0 API system.

Features

  • Powerful Model Zoo for Rich Scenarios

    • Our Model Zoo covers mainstream NLP applications, including Lexical Analysis, Text Classification, Text Generation, Text Matching, Text Graph, Information Extraction, Machine Translation, General Dialogue and Question Answering etc.
  • Easy-to-Use and End-to-End API

    • The API is fully integrated with PaddlePaddle 2.0 high-level API system. It minimizes the number of user actions required for common use cases like data loading, text pre-processing, training and evaluation, which enables you to deal with text problems more productively.
  • High Performance and Distributed Training

    • We provide a highly optimized distributed training implementation for BERT with the Fleet API, together with a mixed precision training strategy based on PaddlePaddle 2.0, which can fully utilize GPU clusters for large-scale model pre-training.

Installation

Prerequisites

  • python >= 3.6
  • paddlepaddle >= 2.0.1

For more information about PaddlePaddle installation, please refer to PaddlePaddle Install.
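The prerequisites above can be checked before installing. The following is a minimal sketch that uses only the standard library. The helpers `parse_version` and `meets_minimum` are hypothetical names for illustration and are not part of PaddleNLP, and the simple parser assumes plain dotted numeric versions such as `2.0.1`.

```python
import importlib.util
import itertools
import sys


def parse_version(text):
    """Turn a dotted version such as '2.0.1' into a tuple of ints: (2, 0, 1).

    Only the leading digits of each dot-separated piece are kept, so a
    suffix such as 'rc1' is ignored. This is a simplification.
    """
    parts = []
    for piece in text.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Prerequisite 1: python >= 3.6
python_ok = sys.version_info >= (3, 6)
print("Python >= 3.6:", python_ok)

# Prerequisite 2: paddlepaddle >= 2.0.1 (the package is imported as `paddle`)
paddle_installed = importlib.util.find_spec("paddle") is not None
print("paddle importable:", paddle_installed)
if paddle_installed:
    import paddle
    print("paddlepaddle >= 2.0.1:", meets_minimum(paddle.__version__, "2.0.1"))
```

If either check fails, upgrade Python or install PaddlePaddle first, then install PaddleNLP as described below.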

PIP Installation

pip install --upgrade paddlenlp -i https://pypi.org/simple

Install from Source

pip install --upgrade git+https://github.com/PaddlePaddle/PaddleNLP.git

pip install --upgrade git+https://gitee.com/PaddlePaddle/PaddleNLP.git

Quick Start

Quick Dataset Loading

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

For more dataset API usage, please refer to Dataset API.

Pre-trained Text Embedding Loading

from paddlenlp.embeddings import TokenEmbedding

wordemb = TokenEmbedding("fasttext.wiki-news.target.word-word.dim300.en")
wordemb.cosine_sim("king", "queen")
# 0.77053076
wordemb.cosine_sim("apple", "rail")
# 0.29207364

For more TokenEmbedding usage, please refer to Embedding API.
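To make the scores returned by `cosine_sim` easier to interpret, here is a minimal sketch of the cosine similarity formula in plain Python. It is not PaddleNLP's implementation, and the three vectors are made-up toy values rather than real fastText embeddings; the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (assumed values, for illustration only).
king = [0.8, 0.6, 0.0]
queen = [0.6, 0.8, 0.0]
rail = [0.0, 0.0, 1.0]

print(cosine_similarity(king, queen))  # close to 1: vectors point in similar directions
print(cosine_similarity(king, rail))   # 0: vectors are orthogonal
```

A score near 1 means the two word vectors point in nearly the same direction, while a score near 0 means they are unrelated in the embedding space.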

Rich Chinese Pre-trained Models

from paddlenlp.transformers import ErnieModel, BertModel, RobertaModel, ElectraModel, GPTForPretraining

ernie = ErnieModel.from_pretrained('ernie-1.0')
bert = BertModel.from_pretrained('bert-wwm-chinese')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')

For more pre-trained model selection, please refer to Transformer API.

Extract Feature Through Pre-trained Model

import paddle
from paddlenlp.transformers import ErnieTokenizer, ErnieModel

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
model = ErnieModel.from_pretrained('ernie-1.0')

text = tokenizer('自然语言处理')
# ErnieModel returns the per-token sequence output first, then the pooled output.
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))

Model Zoo and Applications

For an introduction to the model zoo, please refer to PaddleNLP Model Zoo. For application scenarios, please refer to PaddleNLP Examples.

Advanced Application

API Usage

Tutorials

Please refer to our official AI Studio account for more interactive tutorials: PaddleNLP on AI Studio

Community

Special Interest Group(SIG)

You are welcome to join the PaddleNLP SIG and contribute datasets, models, and toolkits.

Slack

To connect with other users and contributors, you are welcome to join our Slack channel.

QQ

Join our QQ Technical Group for technical exchange right now! ⬇️

License

PaddleNLP is provided under the Apache-2.0 License.