Data Directory

Datasets for test purposes.

Files for model training/testing are created in two steps. First, raw texts are tokenized into words; this is called pre-tokenization. Then subword segmentation is applied to the pre-tokenized texts.

  • pretok contains the pre-tokenized files (see the sketch after this list).
    • *.en files are tokenized with the Moses tokenizer.
    • *.ja files are tokenized with the MeCab tokenizer.
  • *.en and *.ja in the root directory are the files after subword segmentation with SentencePiece.
  • detok contains the raw (untokenized) texts.
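
A minimal Python sketch of the pre-tokenization step, using the sacremoses and mecab-python3 packages as stand-ins for the command-line Moses and MeCab tokenizers; the file names are hypothetical:

```python
# Minimal pre-tokenization sketch; file names are hypothetical and the
# sacremoses / mecab-python3 packages stand in for the CLI tokenizers.
from sacremoses import MosesTokenizer
import MeCab

moses = MosesTokenizer(lang="en")
wakati = MeCab.Tagger("-Owakati")  # whitespace-separated word output

# English: Moses word tokenization.
with open("detok/train.en") as src, open("pretok/train.en", "w") as out:
    for line in src:
        out.write(" ".join(moses.tokenize(line.strip())) + "\n")

# Japanese: MeCab word segmentation.
with open("detok/train.ja") as src, open("pretok/train.ja", "w") as out:
    for line in src:
        out.write(wakati.parse(line.strip()).strip() + "\n")
```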

SentencePiece is trained separately for English and Japanese. *.model and *.vocab are the model and vocabulary files generated by SentencePiece.
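
As a rough sketch, the English-side training and segmentation could look like the following; the vocabulary size and file names are assumptions, not values taken from this repository:

```python
# Sketch of SentencePiece training and subword segmentation (English side only);
# vocab_size and file names are assumptions.
import sentencepiece as spm

# Train a model on the pre-tokenized English text; writes en.model / en.vocab.
spm.SentencePieceTrainer.train(
    input="pretok/train.en", model_prefix="en", vocab_size=8000
)

# Segment the pre-tokenized text into subword pieces.
sp = spm.SentencePieceProcessor(model_file="en.model")
with open("pretok/train.en") as src, open("train.en", "w") as out:
    for line in src:
        out.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```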

data-bin contains the dictionaries and binarized files built from *.en and *.ja with fairseq.
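
fairseq-preprocess is the command-line entry point for this step; a hedged sketch invoking it from Python, with hypothetical train/valid/test file prefixes:

```python
# Sketch of the binarization step with fairseq-preprocess; the train/valid/test
# prefixes are assumptions about how the segmented files are named.
import subprocess

subprocess.run(
    [
        "fairseq-preprocess",
        "--source-lang", "en",
        "--target-lang", "ja",
        "--trainpref", "train",   # reads train.en / train.ja
        "--validpref", "valid",
        "--testpref", "test",
        "--destdir", "data-bin",
    ],
    check=True,
)
```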

checkpoints contains a pre-trained tiny Transformer model.
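
A hedged sketch of loading such a checkpoint through fairseq's hub interface; the checkpoint file name and the example input are assumptions:

```python
# Sketch of loading the tiny Transformer checkpoint; the file name
# checkpoint_best.pt is an assumption.
from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained(
    "checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin",
)
# Input must already be in the same subword-segmented form as the training data.
print(model.translate("▁this ▁is ▁a ▁test"))
```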

For more details, see here.