PCPM

Presenting Collection of Pretrained Models. Links to pretrained models in NLP and voice, with training scripts.

With rapid progress in NLP, it is becoming easier to bootstrap a machine learning project involving text. Instead of starting from base code, one can now start from a base pretrained model and reach SOTA performance within a few iterations. This repository is built on the view that pretrained models minimize collective human effort and the cost of resources, thus accelerating development in the field.

The models listed are curated for either PyTorch or TensorFlow because of their wide usage.

Contents

Text ML

Language Models

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| XLNet | https://github.com/zihangdai/xlnet/#released-models | BooksCorpus + English Wikipedia + Giga5 + ClueWeb 2012-B + Common Crawl | https://github.com/zihangdai/xlnet/ |
| Transformer-XL | https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models | enwik8, lm1b, wt103, text8 | https://github.com/kimiyoung/transformer-xl |
| GPT-2 | https://github.com/openai/gpt-2/blob/master/download_model.py | WebText | https://github.com/nshepperd/gpt-2/ |

BERT

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| BERT | https://github.com/google-research/bert/ | BooksCorpus + English Wikipedia | https://github.com/google-research/bert/ (TensorFlow), https://github.com/huggingface/pytorch-pretrained-BERT (PyTorch) |
| MT-DNN | https://mrc.blob.core.windows.net/mt-dnn-model/mt_dnn_base.pt (via https://github.com/namisan/mt-dnn/blob/master/download.sh) | GLUE | https://github.com/namisan/mt-dnn |
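For reference, here is a minimal sketch of loading the BERT base checkpoint through the pytorch-pretrained-BERT package listed above. The checkpoint name `bert-base-uncased` is one of the published models; the example sentence is illustrative, and the API may differ in newer releases of the package (it was later renamed `transformers`).

```python
# Minimal sketch: load pretrained BERT with pytorch-pretrained-BERT (PyTorch).
# "bert-base-uncased" is a published checkpoint; the sentence is illustrative.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = tokenizer.tokenize("[CLS] pretrained models save compute [SEP]")
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled = model(ids)  # per-layer hidden states + pooled [CLS] vector

print(encoded_layers[-1].shape)  # last layer: (1, sequence_length, 768) for the base model
```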

Sentiment

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| MT-DNN Sentiment | https://drive.google.com/open?id=1-ld8_WpdQVDjPeYhb3AK8XYLGlZEbs-l | SST | https://github.com/namisan/mt-dnn |

Reading Comprehension

SQuAD 1.1

| Rank | Name | Link | Training script |
| --- | --- | --- | --- |
| 49 | BiDAF | https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz | https://github.com/allenai/allennlp |
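As a sketch of how the BiDAF archive above can be used, AllenNLP exposes a Predictor API that loads a model archive directly from a URL. This assumes an AllenNLP 0.x release; the passage and question below are made up.

```python
# Sketch: answer a question with the pretrained BiDAF SQuAD model via AllenNLP.
# Assumes an allennlp 0.x release; the passage/question are illustrative.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz"
)
result = predictor.predict(
    passage="PCPM collects links to pretrained models in NLP and speech.",
    question="What does PCPM collect?",
)
print(result["best_span_str"])  # predicted answer span
```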

Speech to Text

| Name | Link | Trained On | Training script |
| --- | --- | --- | --- |
| ESPnet | https://github.com/espnet/espnet#asr-results | LibriSpeech, Aishell, HKUST, TEDLIUM2 | https://github.com/espnet/espnet |
| Deepspeech2 (PyTorch) | SeanNaren/deepspeech.pytorch#299 (comment) | LibriSpeech | https://github.com/SeanNaren/deepspeech.pytorch |
| DeepSpeech | https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model | Mozilla Common Voice, LibriSpeech, Fisher, Switchboard | https://github.com/mozilla/DeepSpeech |
| speech-to-text-wavenet | https://github.com/buriburisuri/speech-to-text-wavenet#pre-trained-models | VCTK | https://github.com/buriburisuri/speech-to-text-wavenet |
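As an illustration of running one of these models, below is a hedged sketch of Mozilla DeepSpeech inference with the `deepspeech` pip package. It assumes a recent release in which `Model()` takes only the exported model path (older releases used a longer constructor), and the file names are placeholders.

```python
# Sketch: transcribe a 16 kHz mono WAV with Mozilla DeepSpeech.
# Assumes a recent `deepspeech` release where Model() takes just the model path;
# "output_graph.pbmm" and "audio_16k_mono.wav" are placeholder file names.
import wave
import numpy as np
from deepspeech import Model

model = Model("output_graph.pbmm")  # released acoustic model

with wave.open("audio_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))  # transcribed text
```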

Datasets

Datasets referenced in this document

Language Model data

Common Crawl

http://commoncrawl.org/

enwik8

English Wikipedia XML dump, first 100 MB (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

text8

Cleaned English Wikipedia text (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

lm1b

1 Billion Word Language Model Benchmark https://www.statmt.org/lm-benchmark/

wt103

WikiText-103 https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/

webtext

The original dataset was not released by the authors. An open-source collection is available at https://skylion007.github.io/OpenWebTextCorpus/

English Wikipedia

https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

BooksCorpus

https://yknzhu.wixsite.com/mbweb https://github.com/soskek/bookcorpus

Sentiment

SST

Stanford Sentiment Treebank https://nlp.stanford.edu/sentiment/index.html. One of the GLUE tasks.

GLUE

GLUE is a collection of resources for benchmarking natural language understanding systems. https://gluebenchmark.com/ It contains datasets on natural language inference, sentiment classification, paraphrase detection, similarity matching, and linguistic acceptability.

Speech to text data

Fisher

https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf

LibriSpeech

http://www.danielpovey.com/files/2015_icassp_librispeech.pdf

Switchboard

https://ieeexplore.ieee.org/document/225858/

Mozilla Common Voice

https://github.com/mozilla/voice-web

VCTK

https://datashare.is.ed.ac.uk/handle/10283/2651

Hall of Shame

High-quality research that doesn't include pretrained models for public use.

Non-English

Other Collections

AllenNLP

Built on PyTorch, AllenNLP has produced SOTA models and open-sourced them. https://github.com/allenai/allennlp/blob/master/MODELS.md

They have a neat interactive demo for various tasks at https://demo.allennlp.org/

GluonNLP

Based on MXNet, this library has an extensive list of pretrained models for various NLP tasks. http://gluon-nlp.mxnet.io/master/index.html#model-zoo
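For example, a single call pulls a pretrained checkpoint plus its vocabulary from the GluonNLP model zoo. The model and dataset names below are examples of published checkpoints; the exact names may vary across GluonNLP versions.

```python
# Sketch: fetch a pretrained BERT base checkpoint from the GluonNLP model zoo.
# Model/dataset names are examples; see the model zoo page for the current list.
import gluonnlp as nlp

model, vocab = nlp.model.get_model(
    "bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=True,
)
print(model)
```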
