AraGPT2

You can find more information in our paper AraGPT2

The code in this repository was used to train all GPT2 variants. The code supports training and fine-tuning GPT2 on GPUs and TPUs via the TPUEstimator API.

GPT2-base and medium use the code from the gpt2 folder and can train models from the minimaxir/gpt-2-simple repository. These models were trained using the lamb optimizer, follow the same architecture as gpt2, and are fully compatible with the transformers library.

GPT2-large and GPT2-mega were trained using the imcaspar/gpt2-ml library, and follow the grover architecture. You can use the pytorch classes found in grover/modeling_gpt2.py as a direct replacement for classes in the transformers library (it should support version v4.x from transformers). Both models were trained using the adafactor optimizer, since the adam and lamb optimizers use too much memory, causing the model to not even fit 1 batch on a TPU core.
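The memory argument can be made concrete with some rough arithmetic. The sketch below is only an estimate under stated assumptions: values are stored in fp32 (4 bytes), adam and lamb each keep two moment values per parameter, and adafactor keeps one second-moment value per row and per column of a weight matrix instead of one per entry. The 1536 x 6144 matrix shape assumes the usual GPT2 feed-forward width of 4x the embedding size, which this README does not state.

```python
GB = 1e9  # decimal gigabytes


def state_bytes(num_params, slots_per_param, bytes_per_value=4):
    """Bytes needed to hold `slots_per_param` fp32 values for every parameter."""
    return num_params * slots_per_param * bytes_per_value


def factored_slots(rows, cols):
    """Second-moment values a factored optimizer keeps for a rows x cols matrix."""
    return rows + cols


mega_params = 1.46e9  # AraGPT2-mega, from the Model Sizes table

weights_gb = state_bytes(mega_params, 1) / GB     # the weights themselves
adam_state_gb = state_bytes(mega_params, 2) / GB  # two moment slots (assumption)
total_gb = weights_gb + adam_state_gb             # before gradients and activations

# One feed-forward matrix of the mega model (shape is an assumption, see above).
full_second_moment = 1536 * 6144                  # one value per entry
factored_second_moment = factored_slots(1536, 6144)

print(f"weights: {weights_gb:.2f} GB, adam/lamb state: {adam_state_gb:.2f} GB")
print(f"second-moment values for one matrix: {full_second_moment} vs {factored_second_moment}")
```

Under these assumptions the weights plus the adam or lamb state already come to roughly 17.5 GB before gradients and activations are counted, which is consistent with the statement that a single batch does not fit on one TPU core.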

AraGPT2 is trained on the same large Arabic Dataset as AraBERTv2.

Usage

Testing the model using `transformers`:

from transformers import GPT2TokenizerFast, pipeline

# Import exactly one GPT2LMHeadModel; a second import would shadow the first.
# For base and medium:
from transformers import GPT2LMHeadModel
# For large and mega, use this import instead:
# from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel

from arabert.preprocess import ArabertPreprocessor

MODEL_NAME = 'aragpt2-base'
arabert_prep = ArabertPreprocessor(model_name=MODEL_NAME, keep_emojis=True)

model = GPT2LMHeadModel.from_pretrained(MODEL_NAME)
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

text = ""  # put your Arabic prompt here
text_clean = arabert_prep.preprocess(text)

# Feel free to try different decoding settings
generation_pipeline(
    text_clean,
    pad_token_id=tokenizer.eos_token_id,
    num_beams=10,
    max_length=200,
    top_p=0.9,
    repetition_penalty=3.0,
    no_repeat_ngram_size=3)[0]['generated_text']
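Because base/medium and large/mega need different `GPT2LMHeadModel` classes, one way to avoid importing the wrong one is to pick the class from the model name. This is a minimal sketch, not part of the repository: the helper names and the lowercase short names `aragpt2-large` and `aragpt2-mega` are assumptions modelled on the `aragpt2-base` name used above.

```python
# Short names assumed to need the grover classes (hypothetical naming).
GROVER_MODELS = {"aragpt2-large", "aragpt2-mega"}


def uses_grover_classes(model_name):
    """True if the model name (with or without an org prefix) is large or mega."""
    short_name = model_name.rsplit("/", 1)[-1].lower()
    return short_name in GROVER_MODELS


def load_lm_head_class(model_name):
    """Return the GPT2LMHeadModel class matching the model size."""
    # Imports are done lazily so only the needed package has to be installed.
    if uses_grover_classes(model_name):
        from arabert.aragpt2.grover.modeling_gpt2 import GPT2LMHeadModel
    else:
        from transformers import GPT2LMHeadModel
    return GPT2LMHeadModel
```

With this helper, `load_lm_head_class(MODEL_NAME).from_pretrained(MODEL_NAME)` would replace the direct `GPT2LMHeadModel.from_pretrained(MODEL_NAME)` call.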

Finetuning using `transformers`:

Follow the guide linked here

Finetuning using our code with TF 1.15.4:

Create the Training TFRecords:

python create_pretraining_data.py \
 --input_file=<RAW TEXT FILE with documents/articles separated by an empty line> \
 --output_file=<OUTPUT TFRecord> \
 --tokenizer_dir=<Directory with the GPT2 Tokenizer files>
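The `--input_file` is expected to hold documents separated by an empty line. The sketch below shows one way to produce such a file from a list of strings. The function name, the toy documents, and the choice to drop blank lines inside a document (so they are not mistaken for separators) are assumptions for illustration, not part of the repository.

```python
import tempfile
from pathlib import Path


def write_documents(documents, path):
    """Write documents to `path` with one empty line between consecutive documents."""
    cleaned = []
    for doc in documents:
        # Remove blank lines inside a document so they cannot act as separators.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if lines:
            cleaned.append("\n".join(lines))
    Path(path).write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Usage with two toy documents (placeholder text, not real training data).
docs = ["First article, line one.\n\nFirst article, line two.", "Second article."]
out_path = Path(tempfile.mkdtemp()) / "raw_text.txt"
n_written = write_documents(docs, out_path)
```

The resulting file can then be passed as `--input_file` to `create_pretraining_data.py`.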

Finetuning:

python3 run_pretraining.py \
--input_file="gs://<GS_BUCKET>/pretraining_data/*" \
--output_dir="gs://<GS_BUCKET>/pretraining_model/" \
--config_file="config/small_hparams.json" \
--batch_size=128 \
--eval_batch_size=8 \
--num_train_steps= \
--num_warmup_steps= \
--learning_rate= \
--save_checkpoints_steps= \
--max_seq_length=1024 \
--max_eval_steps= \
--optimizer="lamb" \
--iterations_per_loop=5000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<TPU NAME> \
--do_train=True \
--do_eval=False

Model Sizes

| Model | Optimizer | Context size | Embedding size | Num of heads | Num of layers | Model size / Num of params |
|---|---|---|---|---|---|---|
| AraGPT2-base | `lamb` | 1024 | 768 | 12 | 12 | 527MB / 135M |
| AraGPT2-medium | `lamb` | 1024 | 1024 | 16 | 24 | 1.4GB / 369M |
| AraGPT2-large | `adafactor` | 1024 | 1280 | 20 | 36 | 2.98GB / 792M |
| AraGPT2-mega | `adafactor` | 1024 | 1536 | 24 | 48 | 5.5GB / 1.46B |
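The parameter counts in the table can be roughly reproduced from the layer count and embedding size. The sketch below uses the standard GPT2 layout (attention, a 4x-wide feed-forward block, two layer norms per layer, tied input and output embeddings). The vocabulary size of 64000 is an assumption that this README does not state, and the grover architecture used for large and mega may differ in details, so treat the numbers as estimates.

```python
def gpt2_param_count(n_layer, d_model, n_ctx=1024, vocab_size=64000):
    """Approximate parameter count for a standard GPT2-style model.

    vocab_size=64000 is an assumption, not a value taken from this README.
    The number of attention heads does not change the count.
    """
    # Per layer: qkv (3d^2 + 3d), attention output (d^2 + d),
    # feed-forward (8d^2 + 5d), two layer norms (4d).
    per_layer = 12 * d_model**2 + 13 * d_model
    embeddings = (vocab_size + n_ctx) * d_model  # token + position embeddings
    final_layer_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_layer_norm


# (num of layers, embedding size) taken from the table above.
configs = {
    "AraGPT2-base": (12, 768),
    "AraGPT2-medium": (24, 1024),
    "AraGPT2-large": (36, 1280),
    "AraGPT2-mega": (48, 1536),
}
estimates = {name: gpt2_param_count(layers, width) for name, (layers, width) in configs.items()}

for name, count in estimates.items():
    print(f"{name}: {count / 1e6:.1f}M parameters")
```

Under these assumptions each estimate lands within about 1% of the count reported in the table.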

Compute

| Model | Hardware | Num of examples (seq len = 1024) | Batch size | Num of steps | Time (in days) |
|---|---|---|---|---|---|
| AraGPT2-base | TPUv3-128 | 9.7M | 1792 | 125K | 1.5 |
| AraGPT2-medium | TPUv3-8 | 9.7M | 80 | 1M | 15 |
| AraGPT2-large | TPUv3-128 | 9.7M | 256 | 220K | 3 |
| AraGPT2-mega | TPUv3-128 | 9.7M | 256 | 780K | 9 |
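The table implies how many passes over the data each model made. The sketch below computes that as batch size times steps divided by the number of examples. It assumes the listed batch size is the global batch size and that every step consumes one full batch, neither of which the README states explicitly.

```python
NUM_EXAMPLES = 9_700_000  # sequences of length 1024, from the table above


def epochs_seen(batch_size, num_steps, num_examples=NUM_EXAMPLES):
    """Approximate number of passes over the training set."""
    return batch_size * num_steps / num_examples


# (batch size, number of steps) taken from the table above.
runs = {
    "AraGPT2-base": (1792, 125_000),
    "AraGPT2-medium": (80, 1_000_000),
    "AraGPT2-large": (256, 220_000),
    "AraGPT2-mega": (256, 780_000),
}
epochs = {name: epochs_seen(batch, steps) for name, (batch, steps) in runs.items()}

for name, value in epochs.items():
    print(f"{name}: about {value:.1f} epochs")
```

The same formula, rearranged, gives a starting point for `--num_train_steps` when fine-tuning: steps = target epochs x number of examples / batch size.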

Results

The results shown in the table below are the perplexity values on Wikipedia articles that are not in the training data.

| Model | PPL |
|---|---|
| AraGPT2-base | 55.8 |
| AraGPT2-medium | 45.7 |
| AraGPT2-large | 36.6 |
| AraGPT2-mega | 29.8 |
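For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below shows that definition on toy numbers. It is not the evaluation script used for the table, and the README does not say whether the reported values are normalized per token or per word.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that gives every token probability 1/4
# is as uncertain as a uniform choice among 4 options.
toy_log_probs = [math.log(0.25)] * 3
toy_ppl = perplexity(toy_log_probs)
```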

Disclaimer

The text generated by AraGPT2 is produced automatically by a neural network model trained on a large amount of text, and it does not represent the authors' or their institutes' official attitudes and preferences. The text generated by AraGPT2 should only be used for research and scientific purposes. If it infringes on your rights and interests or violates social morality, please do not propagate it.

If you used this model, please cite us as:

@misc{antoun2020aragpt2,
      title={AraGPT2: Pre-Trained Transformer for Arabic Language Generation},
      author={Wissam Antoun and Fady Baly and Hazem Hajj},
      year={2020},
      eprint={2012.15520},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgments

Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs; we couldn't have done it without this program. Thanks also to the AUB MIND Lab members for the continuous support, and to Yakshof and Assafir for data and storage access. Another thanks to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.

Contacts

Wissam Antoun: Linkedin | Twitter | Github | [email protected] | [email protected]
