Datasets for test purposes.
The files used for model training/testing are built in two steps: first, raw texts are tokenized into words (this is called pre-tokenization); then, subword segmentation is applied to the pre-tokenized texts.
`pretok` contains the pre-tokenized files: `*.en` files are tokenized with the Moses tokenizer, and `*.ja` files with the MeCab tokenizer.
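A minimal sketch of this pre-tokenization step in Python, assuming the `sacremoses` and `mecab-python3` packages (the repository may instead use the original Moses scripts and the MeCab command line):

```python
from sacremoses import MosesTokenizer  # Moses tokenizer for English
import MeCab                            # MeCab tokenizer for Japanese

# English: Moses word tokenization.
mt = MosesTokenizer(lang="en")
en_pretok = " ".join(mt.tokenize("This is a test sentence.", escape=False))

# Japanese: MeCab in wakati (space-separated) output mode.
tagger = MeCab.Tagger("-Owakati")
ja_pretok = tagger.parse("これはテスト用の文です。").strip()

print(en_pretok)  # "This is a test sentence ."
print(ja_pretok)  # e.g. "これ は テスト 用 の 文 です 。" (exact split depends on the dictionary)
```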
`*.en` and `*.ja` in the root directory are the files obtained after subword segmentation with SentencePiece. `detok` contains the raw texts.
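To illustrate how the segmented files relate to the pre-tokenized text, here is a small sketch using the SentencePiece Python API; the model filename is an assumption for illustration:

```python
import sentencepiece as spm

# Load a trained SentencePiece model (filename assumed for illustration).
sp = spm.SentencePieceProcessor(model_file="en.model")

pretok_line = "This is a test sentence ."
pieces = sp.encode(pretok_line, out_type=str)  # subword segmentation
print(" ".join(pieces))                        # the form stored in the root *.en file

# decode() reverses the segmentation back to plain text.
print(sp.decode(pieces))
```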
SentencePiece is trained separately on English and Japanese. `*.model` and `*.vocab` are the model and vocabulary files generated by SentencePiece.
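A sketch of how such model and vocabulary files can be produced with the SentencePiece trainer; the input paths, model prefixes, and vocabulary size are assumptions, not necessarily the settings used for these files:

```python
import sentencepiece as spm

# Train one model per language; filenames and vocab_size below are assumptions.
for lang in ("en", "ja"):
    spm.SentencePieceTrainer.train(
        input=f"pretok/train.{lang}",  # pre-tokenized training text (hypothetical name)
        model_prefix=lang,             # writes {lang}.model and {lang}.vocab
        vocab_size=8000,
        character_coverage=0.9995 if lang == "ja" else 1.0,
    )
```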
`data-bin` contains the dictionary and binarized files built from `*.en` and `*.ja` with Fairseq.
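Binarization of this kind is typically done with Fairseq's `fairseq-preprocess` command; a sketch invoked from Python, with split names and file prefixes assumed for illustration:

```python
import subprocess

# Build the dictionaries and binarized dataset (split prefixes are assumptions).
subprocess.run(
    [
        "fairseq-preprocess",
        "--source-lang", "en", "--target-lang", "ja",
        "--trainpref", "train",   # expects train.en / train.ja (subword-segmented)
        "--validpref", "valid",
        "--testpref", "test",
        "--destdir", "data-bin",
        "--workers", "4",
    ],
    check=True,
)
```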
`checkpoints` contains a pre-trained tiny Transformer model.
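One possible way to load and query such a checkpoint via Fairseq's Python hub interface; the checkpoint and SentencePiece filenames below are hypothetical:

```python
from fairseq.models.transformer import TransformerModel

# Filenames are assumptions; adjust to the actual contents of checkpoints/.
model = TransformerModel.from_pretrained(
    "checkpoints",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin",
    bpe="sentencepiece",
    sentencepiece_model="en.model",
)
print(model.translate("This is a test sentence."))
```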
For more details, see here.