Datasets for test purposes.
The files used for model training/testing are built in two steps: first, raw texts are tokenized into words (this is called pre-tokenization); then, subword segmentation is applied to the pre-tokenized texts.
`pretok` contains the pre-tokenized files: `*.en` files are tokenized with the Moses tokenizer, and `*.ja` files with the MeCab tokenizer.
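A minimal sketch of this pre-tokenization step in Python, assuming the `sacremoses` and `mecab-python3` packages (the repository may instead use the original Moses scripts and the MeCab command line):

```python
from sacremoses import MosesTokenizer  # Moses tokenizer for English
import MeCab                            # MeCab tokenizer for Japanese

# English: Moses word tokenization.
mt = MosesTokenizer(lang="en")
en_pretok = " ".join(mt.tokenize("This is a test sentence.", escape=False))

# Japanese: MeCab in wakati (space-separated) output mode.
tagger = MeCab.Tagger("-Owakati")
ja_pretok = tagger.parse("これはテスト用の文です。").strip()

print(en_pretok)  # "This is a test sentence ."
print(ja_pretok)  # e.g. "これ は テスト 用 の 文 です 。" (exact split depends on the dictionary)
```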
`*.en` and `*.ja` in the root directory are the files obtained after subword segmentation with SentencePiece. `detok` contains the raw texts.
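To illustrate how the segmented files relate to the pre-tokenized text, here is a small sketch using the SentencePiece Python API; the model filename is an assumption for illustration:

```python
import sentencepiece as spm

# Load a trained SentencePiece model (filename assumed for illustration).
sp = spm.SentencePieceProcessor(model_file="en.model")

pretok_line = "This is a test sentence ."
pieces = sp.encode(pretok_line, out_type=str)  # subword segmentation
print(" ".join(pieces))                        # the form stored in the root *.en file

# decode() reverses the segmentation back to plain text.
print(sp.decode(pieces))
```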
SentencePiece is trained separately on English and Japanese. `*.model` and `*.vocab` are the model and vocabulary files generated by SentencePiece.
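A sketch of how such model and vocabulary files can be produced with the SentencePiece trainer; the input paths, model prefixes, and vocabulary size are assumptions, not necessarily the settings used for these files:

```python
import sentencepiece as spm

# Train one model per language; filenames and vocab_size below are assumptions.
for lang in ("en", "ja"):
    spm.SentencePieceTrainer.train(
        input=f"pretok/train.{lang}",  # pre-tokenized training text (hypothetical name)
        model_prefix=lang,             # writes {lang}.model and {lang}.vocab
        vocab_size=8000,
        character_coverage=0.9995 if lang == "ja" else 1.0,
    )
```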
`data-bin` contains the dictionary and binarized files built from `*.en` and `*.ja` with Fairseq.
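Binarization of this kind is typically done with Fairseq's `fairseq-preprocess` command; a sketch invoked from Python, with split names and file prefixes assumed for illustration:

```python
import subprocess

# Build the dictionaries and binarized dataset (split prefixes are assumptions).
subprocess.run(
    [
        "fairseq-preprocess",
        "--source-lang", "en", "--target-lang", "ja",
        "--trainpref", "train",   # expects train.en / train.ja (subword-segmented)
        "--validpref", "valid",
        "--testpref", "test",
        "--destdir", "data-bin",
        "--workers", "4",
    ],
    check=True,
)
```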
`checkpoints` contains a pre-trained tiny Transformer model.
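One possible way to load and query such a checkpoint via Fairseq's Python hub interface; the checkpoint and SentencePiece filenames below are hypothetical:

```python
from fairseq.models.transformer import TransformerModel

# Filenames are assumptions; adjust to the actual contents of checkpoints/.
model = TransformerModel.from_pretrained(
    "checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin",
    bpe="sentencepiece",
    sentencepiece_model="en.model",
)
print(model.translate("This is a test sentence."))
```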
For more details, see here.