gpt2-ml

GPT2 for Multiple Languages


中文说明 (Chinese) | English

  • Simplified GPT-2 training scripts (based on Grover, supporting TPUs)
  • Ported BERT tokenizer, multilingual corpus compatible
  • 1.5B GPT-2 pretrained Chinese model (~15 GB corpus, 100k steps)
  • Batteries-included Colab demo
  • 1.5B GPT-2 pretrained Chinese model (~50 GB corpus, 1M steps)
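To make a BERT-style tokenizer work on Chinese text, BERT's basic tokenizer first splits CJK ideographs into individual characters before WordPiece runs. A minimal, self-contained sketch of that CJK-splitting step (a simplified illustration, not this repository's actual tokenizer code; the Unicode ranges checked are a subset of the full CJK blocks):

```python
def split_cjk(text):
    """Insert spaces around CJK ideographs, in the spirit of BERT's
    BasicTokenizer, so each Chinese character becomes its own token
    before WordPiece is applied."""
    out = []
    for ch in text:
        cp = ord(ch)
        # Simplified check: CJK Unified Ideographs and Extension A only.
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            out.append(" ")
            out.append(ch)
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out).split()

# Latin text stays whole; each Chinese character is isolated.
print(split_cjk("GPT2模型训练"))  # ['GPT2', '模', '型', '训', '练']
```

Character-level splitting is what lets one vocabulary cover both space-delimited languages and unsegmented ones like Chinese.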

Pretrained Model

1.5B GPT2 pretrained Chinese model [Google Drive]

Corpus from THUCNews and nlp_chinese_corpus

Trained for 100k steps on a Cloud TPU v3-256 Pod

*Training loss curve*

Google Colab

With just 3 clicks (not counting the Colab authorization process), the 1.5B pretrained Chinese model demo is ready to go:

[Colab Notebook]
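Demos like this typically generate text by repeatedly applying top-k sampling to the model's output logits. A generic, self-contained sketch of a single top-k sampling step (an illustration of the technique, not the repository's actual implementation):

```python
import math
import random

def top_k_sample(logits, k=40, temperature=1.0, rng=random):
    """Pick one token id: keep the k highest logits, apply a
    temperature-scaled softmax over them, then sample."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax over the kept logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample from the renormalized distribution.
    r = rng.random()
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]

# With k=2 only the two strongest candidates (ids 1 and 3) can be drawn.
print(top_k_sample([0.1, 5.0, -1.0, 2.0], k=2))
```

Lower temperatures sharpen the distribution toward the argmax; smaller k trades diversity for coherence.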

Train

Disclaimer

The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.

Citing

```bibtex
@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
```

Reference

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)