GPT2 for Multiple Languages, including pretrained models (multilingual GPT2 support; 1.5-billion-parameter Chinese pretrained model).

GPT2 for Multiple Languages


  • Simplified GPT2 training scripts (based on Grover, supporting TPUs)
  • Ported BERT tokenizer, compatible with multilingual corpora
  • 1.5B-parameter GPT2 pretrained Chinese model (~15G corpus, 100k steps)
  • Batteries-included Colab demo
  • 1.5B-parameter GPT2 pretrained Chinese model (~30G corpus, 220k steps)

Pretrained Model

Size | Language | Corpus | Vocab | Link | SHA256
1.5B parameters | Chinese | ~30G | CLUE (8021 tokens) | Google Drive | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482
1.5B parameters | Chinese | ~15G | BERT (21128 tokens) | Google Drive | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e

Corpus from THUCNews and nlp_chinese_corpus

Trained for 220k steps on a Cloud TPU v3-256 Pod

(Figure: training loss curve)

Google Colab

With just two clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

[Colab Notebook]

Train

Warning: the following commands will incur significant costs on Google Cloud!

Initial setup

Set the zone to europe-west4-a (the only one with big TPUs)

gcloud config set compute/zone europe-west4-a
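If a default project is not already configured, it can be set the same way (an optional step, not part of the original instructions; <PROJECT-ID> is the GCP project used in the commands below):

gcloud config set project <PROJECT-ID>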

Create a Storage Bucket

gsutil mb -p <PROJECT-ID> -c standard gs://<BUCKET-NAME>
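Optionally, the bucket can be pinned to the same region as the TPU to avoid cross-region traffic; the -l flag is not part of the original instructions but is standard gsutil:

gsutil mb -p <PROJECT-ID> -c standard -l europe-west4 gs://<BUCKET-NAME>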

Upload the dataset to the Bucket. Run the following command on your local computer

gsutil cp -r ./dataset gs://<BUCKET-NAME>/
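To verify the upload, a recursive listing works:

gsutil ls -r gs://<BUCKET-NAME>/dataset/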

Create a TPU and VM. Here we request the small v2-8 TPU as a safeguard; you will need a bigger TPU to train the models (see the example below).

ctpu up --tf-version=2.1 --project=<PROJECT-ID> --tpu-size=v2-8 --preemptible
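For the actual training run you would request a larger TPU instead, for example a v3-256 slice as used for the published model (exact sizes depend on your quota and zone availability):

ctpu up --tf-version=2.1 --project=<PROJECT-ID> --tpu-size=v3-256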

Once logged in to the new VM, install TensorFlow 1.15.2:

pip3 install tensorflow==1.15.2

Clone the gpt2-ml repository from DeepESP:

git clone https://github.com/DeepESP/gpt2-ml.git

Edit the training script and set OUTPUT_DIR (gs://<BUCKET-NAME>/output) and input_file (gs://<BUCKET-NAME>/dataset):

cd gpt2-ml/train/
nano train_tpu_adafactor.sh
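After editing, the two values should point at your bucket. The exact variable names and format depend on the script, but roughly the edited lines should look like:

OUTPUT_DIR=gs://<BUCKET-NAME>/output
input_file=gs://<BUCKET-NAME>/dataset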

Training

Run the training script

bash train_tpu_adafactor.sh
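Training runs for a long time and an SSH disconnect would normally kill it. One option (not part of the original instructions) is to run the script detached and follow the log:

nohup bash train_tpu_adafactor.sh > train.log 2>&1 &
tail -f train.log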

Monitoring Training

From your local computer

tensorboard --logdir gs://<BUCKET-NAME>/output --reload_interval 60

Metrics to monitor: loss, accuracy, and perplexity. Accuracy should get close to 40%.

Trick for faster fine-tuning: keep the same embedding vectors for tokens shared between Spanish and English.

Get checkpoint

To download the checkpoints, run the following command on your local computer:

gsutil cp -r gs://<BUCKET-NAME>/output ./

This will download all checkpoints; you might want to download only the last one.
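To fetch only the latest one, list the checkpoint files and copy just those with the highest step number (the step number below is only an example):

mkdir -p output
gsutil ls gs://<BUCKET-NAME>/output/model.ckpt-*
gsutil cp "gs://<BUCKET-NAME>/output/model.ckpt-220000.*" ./output/
gsutil cp gs://<BUCKET-NAME>/output/checkpoint ./output/

Once everything is downloaded, remember to shut down the TPU and VM (for example with ctpu delete) so they stop accruing charges.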

Disclaimer

The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.

Citation

@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DeepESP/gpt2-ml}},
}

Reference

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)
