mt-opus

English-Thai Machine Translation with OPUS data

Data

We used 9 datasets from OPUS to train and validate our models within and across domains (total 5.4M sentence pairs; 68.8M English tokens and 53.1M Thai tokens).

dataset	sentence pairs	English tokens	Thai tokens	description	reference
OpenSubtitles v2018	3.5M	28.4M	7.8M	crowdsourced subtitles	[1]
JW300 v1 en th	0.8M	14.9M	34.6M	Jehovah's Witness site	[2], [3]
GNOME v1	0.5M	2.3M	3.5M	GNOME documentation	[2]
QED v2.0a	0.3M	4.7M	1.2M	crowdsourced educational subtitles	[2]
bible-uedin v1	0.1M	3.6M	2.1M	the Bible	[2], [4]
Tanzil v1	93.5k	2.8M	3.4M	the Quran	[2]
KDE4 v2	92.0k	0.5M	0.2M	KDE4 documentation	[2]
Ubuntu v14.10	46.6k	0.4M	0.2M	Ubuntu documentation	[2]
Tatoeba v20190709	1.1k	6k	1.7k	crowdsourced translations	[2]
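
As an illustration of how per-dataset figures like those in the table could be computed, here is a minimal sketch. It assumes each dataset is available as two line-aligned plain-text files (one sentence per line, as in the Moses-format downloads from OPUS) and that a tokenizer is supplied for each language. The function name `corpus_stats` and its parameters are hypothetical and not taken from this repository, and whitespace splitting is only a placeholder: Thai is not whitespace-delimited, so real counts depend on the word segmenter used.

```python
from typing import Callable, Iterable, List, Tuple


def corpus_stats(
    en_lines: Iterable[str],
    th_lines: Iterable[str],
    en_tokenize: Callable[[str], List[str]] = str.split,
    th_tokenize: Callable[[str], List[str]] = str.split,
) -> Tuple[int, int, int]:
    """Return (sentence pairs, English tokens, Thai tokens) for aligned lines.

    Hypothetical helper: the tokenizers are pluggable because the counts
    depend entirely on how each language is segmented.
    """
    nb_sent = en_tok = th_tok = 0
    for en, th in zip(en_lines, th_lines):
        en, th = en.strip(), th.strip()
        if not en or not th:
            continue  # skip pairs where either side is empty
        nb_sent += 1
        en_tok += len(en_tokenize(en))
        th_tok += len(th_tokenize(th))
    return nb_sent, en_tok, th_tok


# Toy input; the Thai side is pre-segmented with spaces for this demo only.
en_demo = ["Hello world", "", "Thank you very much"]
th_demo = ["สวัสดี ชาว โลก", "ว่าง", "ขอบคุณ มาก"]
print(corpus_stats(en_demo, th_demo))
```

For unsegmented Thai text, `th_tokenize` would need to be a real word segmenter (for example, a function from a Thai NLP library); which tokenizers produced the counts in the table is not stated here.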

Models

Results

References

[1] P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
[2] J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
[3] Ž. Agić and I. Vulić, 2019, JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
[4] C. Christodoulopoulos and M. Steedman, 2015, A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation, 49
