PCPM

Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice, with training scripts.

With rapid progress in NLP, it is becoming easier to bootstrap a machine learning project involving text. Instead of starting from base code, one can now start from a base pretrained model and reach SOTA performance within a few iterations. This repository is built on the view that pretrained models minimize collective human effort and the cost of resources, thus accelerating development in the field.

The models listed are curated for either PyTorch or TensorFlow because of their wide usage.

Contents

Text ML

Language Models

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| XLNet | https://github.com/zihangdai/xlnet/#released-models | BooksCorpus + English Wikipedia + Giga5 + ClueWeb 2012-B + Common Crawl | https://github.com/zihangdai/xlnet/ |
| Transformer-XL | https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models | enwik8, lm1b, wt103, text8 | https://github.com/kimiyoung/transformer-xl |
| GPT-2 | https://github.com/openai/gpt-2/blob/master/download_model.py | WebText | https://github.com/nshepperd/gpt-2/ |

BERT

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| BERT | https://github.com/google-research/bert/ | BooksCorpus + English Wikipedia | https://github.com/google-research/bert/ (TensorFlow), https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch) |
| MT-DNN | https://mrc.blob.core.windows.net/mt-dnn-model/mt_dnn_base.pt (via https://github.com/namisan/mt-dnn/blob/master/download.sh) | GLUE | https://github.com/namisan/mt-dnn |
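For reference, here is a minimal sketch of loading the BERT base checkpoint through the pytorch-pretrained-BERT package listed above. The checkpoint name `bert-base-uncased` is one of the published models; the example sentence is illustrative, and the API may differ in newer releases of the package (it was later renamed `transformers`).

```python
# Minimal sketch: load pretrained BERT with pytorch-pretrained-BERT (PyTorch).
# "bert-base-uncased" is a published checkpoint; the sentence is illustrative.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = tokenizer.tokenize("[CLS] pretrained models save compute [SEP]")
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled = model(ids)  # per-layer hidden states + pooled [CLS] vector

print(encoded_layers[-1].shape)  # last layer: (1, sequence_length, 768) for the base model
```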

Sentiment

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| MT-DNN Sentiment | https://drive.google.com/open?id=1-ld8_WpdQVDjPeYhb3AK8XYLGlZEbs-l | SST | https://github.com/namisan/mt-dnn |

Reading Comprehension

SQuAD 1.1

| Rank | Name | Link | Training script |
| --- | --- | --- | --- |
| 49 | BiDAF | https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz | https://github.com/allenai/allennlp |
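As a sketch of how the BiDAF archive above can be used, AllenNLP exposes a Predictor API that loads a model archive directly from a URL. This assumes an AllenNLP 0.x release; the passage and question below are made up.

```python
# Sketch: answer a question with the pretrained BiDAF SQuAD model via AllenNLP.
# Assumes an allennlp 0.x release; the passage/question are illustrative.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz"
)
result = predictor.predict(
    passage="PCPM collects links to pretrained models in NLP and speech.",
    question="What does PCPM collect?",
)
print(result["best_span_str"])  # predicted answer span
```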

Speech to Text

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| ESPnet | https://github.com/espnet/espnet#asr-results | LibriSpeech, Aishell, HKUST, TEDLIUM2 | https://github.com/espnet/espnet |
| Deepspeech2 (PyTorch) | SeanNaren/deepspeech.pytorch#299 (comment) | LibriSpeech | https://github.com/SeanNaren/deepspeech.pytorch |
| DeepSpeech | https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model | Mozilla Common Voice, LibriSpeech, Fisher, Switchboard | https://github.com/mozilla/DeepSpeech |
| speech-to-text-wavenet | https://github.com/buriburisuri/speech-to-text-wavenet#pre-trained-models | VCTK | https://github.com/buriburisuri/speech-to-text-wavenet |
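As an illustration of running one of these models, below is a hedged sketch of Mozilla DeepSpeech inference with the `deepspeech` pip package. It assumes a recent release in which `Model()` takes only the exported model path (older releases used a longer constructor), and the file names are placeholders.

```python
# Sketch: transcribe a 16 kHz mono WAV with Mozilla DeepSpeech.
# Assumes a recent `deepspeech` release where Model() takes just the model path;
# "output_graph.pbmm" and "audio_16k_mono.wav" are placeholder file names.
import wave
import numpy as np
from deepspeech import Model

model = Model("output_graph.pbmm")  # released acoustic model

with wave.open("audio_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))  # transcribed text
```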

Datasets

Datasets referenced in this document

Language Model data

Common Crawl

http://commoncrawl.org/

enwik8

English Wikipedia XML dump, first 100 MB (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

text8

Cleaned English Wikipedia text (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

lm1b

1 Billion Word Language Model Benchmark https://www.statmt.org/lm-benchmark/

wt103

WikiText-103 https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/

webtext

The original dataset was not released by the authors. An open-source collection is available at https://skylion007.github.io/OpenWebTextCorpus/

English Wikipedia

https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

BooksCorpus

https://yknzhu.wixsite.com/mbweb https://github.com/soskek/bookcorpus

Sentiment

SST

Stanford Sentiment Treebank https://nlp.stanford.edu/sentiment/index.html. One of the GLUE tasks.

GLUE

GLUE is a collection of resources for benchmarking natural language understanding systems. https://gluebenchmark.com/ It contains datasets on natural language inference, sentiment classification, paraphrase detection, similarity matching, and linguistic acceptability.

Speech to text data

Fisher

https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf

LibriSpeech

http://www.danielpovey.com/files/2015_icassp_librispeech.pdf

Switchboard

https://ieeexplore.ieee.org/document/225858/

Mozilla Common Voice

https://github.com/mozilla/voice-web

VCTK

https://datashare.is.ed.ac.uk/handle/10283/2651

Hall of Shame

High-quality research that doesn't include pretrained models for public use.

Non-English

Other Collections

AllenNLP

Built on PyTorch, AllenNLP has produced SOTA models and open-sourced them. https://github.com/allenai/allennlp/blob/master/MODELS.md

They have a neat interactive demo for various tasks at https://demo.allennlp.org/

GluonNLP

Based on MXNet, this library has an extensive list of pretrained models for various NLP tasks. http://gluon-nlp.mxnet.io/master/index.html#model-zoo
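For example, a single call pulls a pretrained checkpoint plus its vocabulary from the GluonNLP model zoo. The model and dataset names below are examples of published checkpoints; the exact names may vary across GluonNLP versions.

```python
# Sketch: fetch a pretrained BERT base checkpoint from the GluonNLP model zoo.
# Model/dataset names are examples; see the model zoo page for the current list.
import gluonnlp as nlp

model, vocab = nlp.model.get_model(
    "bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=True,
)
print(model)
```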
