English-Thai Machine Translation with OPUS data
We used nine datasets from OPUS to train and validate our models within and across domains (5.4M sentence pairs in total; 68.8M English tokens and 53.1M Thai tokens).
Dataset | Sentence pairs | English tokens | Thai tokens | Description | Reference |
---|---|---|---|---|---|
OpenSubtitles v2018 | 3.5M | 28.4M | 7.8M | crowdsourced subtitles | [1] |
JW300 v1 (en-th) | 0.8M | 14.9M | 34.6M | Jehovah's Witnesses website | [2], [3] |
GNOME v1 | 0.5M | 2.3M | 3.5M | GNOME documentation | [2] |
QED v2.0a | 0.3M | 4.7M | 1.2M | crowdsourced educational subtitles | [2] |
bible-uedin v1 | 0.1M | 3.6M | 2.1M | the Bible | [2], [4] |
Tanzil v1 | 93.5k | 2.8M | 3.4M | the Quran | [2] |
KDE4 v2 | 92.0k | 0.5M | 0.2M | KDE4 documentation | [2] |
Ubuntu v14.10 | 46.6k | 0.4M | 0.2M | Ubuntu documentation | [2] |
Tatoeba v20190709 | 1.1k | 6k | 1.7k | crowdsourced translations | [2] |
- [1] P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
- [2] J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
- [3] Ž. Agić and I. Vulić, 2019, JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
- [4] C. Christodoulopoulos and M. Steedman, 2015, A Massively Parallel Corpus: The Bible in 100 Languages. Language Resources and Evaluation, 49
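
The sentence-pair and token counts in the table can in principle be recomputed from the downloaded corpora. The following is only a minimal sketch, not the script used for this project. It assumes each dataset is available as two line-aligned plain-text files (one sentence per line, English and Thai). The helper names `read_parallel` and `corpus_stats` are hypothetical. Because Thai is written without spaces between words, the whitespace tokenizer used as the default here will undercount Thai tokens; a proper Thai word segmenter should be passed as `th_tokenize` (the source does not say which tokenizer produced the figures above).

```python
from itertools import zip_longest
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace; a placeholder, not a real Thai tokenizer."""
    return text.split()


def read_parallel(en_path: str, th_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, thai) pairs from two line-aligned text files (assumed layout)."""
    with open(en_path, encoding="utf-8") as f_en, open(th_path, encoding="utf-8") as f_th:
        for en, th in zip_longest(f_en, f_th):
            if en is None or th is None:
                raise ValueError("files have different numbers of lines")
            yield en.rstrip("\n"), th.rstrip("\n")


def corpus_stats(
    pairs: Iterable[Tuple[str, str]],
    en_tokenize: Tokenizer = whitespace_tokenize,
    th_tokenize: Tokenizer = whitespace_tokenize,
) -> Dict[str, int]:
    """Count sentence pairs and tokens on each side of a parallel corpus."""
    stats = {"sentence_pairs": 0, "en_tokens": 0, "th_tokens": 0}
    for en, th in pairs:
        stats["sentence_pairs"] += 1
        stats["en_tokens"] += len(en_tokenize(en))
        stats["th_tokens"] += len(th_tokenize(th))
    return stats


# Toy in-memory example; real use would pass read_parallel(en_file, th_file).
toy_pairs = [("Hello world", "สวัสดี ชาวโลก"), ("Thank you", "ขอบคุณ")]
print(corpus_stats(toy_pairs))
```

Keeping the tokenizers as parameters means the same function can be rerun with whichever English and Thai tokenizers the models actually use, so that reported token counts match what the models see.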