We are publishing pre-trained word vectors for 90 languages, trained on Wikipedia using fastText. These vectors have dimension 300 and were obtained using the skip-gram model described in [1] with default parameters.
The word vectors come in both the default binary and text formats of fastText. In the text format, each line contains a word followed by its embedding, with values separated by spaces. Words are ordered by descending frequency.
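As a minimal sketch of reading the text format described above, the following assumes the standard fastText `.vec` layout, where the first line is a header giving the vocabulary size and dimension; the filename `sample.vec` and the tiny 3-dimensional vocabulary are hypothetical examples, not part of the released data:

```python
# Sketch of parsing a fastText .vec file: a header line with
# vocabulary size and dimension, then one word per line followed
# by its space-separated vector components.

def load_vectors(path):
    """Load word vectors from a fastText text-format file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Tiny hypothetical demo file with 2 words in 3 dimensions.
sample = "2 3\nthe 0.1 0.2 0.3\nof 0.4 0.5 0.6\n"
with open("sample.vec", "w", encoding="utf-8") as f:
    f.write(sample)

vecs = load_vectors("sample.vec")
print(vecs["the"])  # [0.1, 0.2, 0.3]
```

For large files, loading only the first N lines (the most frequent words, given the descending-frequency ordering) is a common way to keep memory use down.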
The models can be downloaded from:
- Afrikaans
- Albanian
- Arabic
- Armenian
- Asturian
- Azerbaijani
- Bashkir
- Basque
- Belarusian
- Bengali
- Bosnian
- Breton
- Bulgarian
- Burmese
- Catalan
- Cebuano
- Chechen
- Chinese
- Chuvash
- Croatian
- Czech
- Danish
- Dutch
- English
- Esperanto
- Estonian
- Farsi
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Italian
- Japanese
- Kannada
- Kazakh
- Khmer
- Korean
- Kyrgyz
- Latin
- Latvian
- Lithuanian
- Luxembourgish
- Macedonian
- Malagasy
- Malayalam
- Malay
- Marathi
- Minangkabau
- Mongolian
- Nepali
- Newar
- Norwegian
- Occitan
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Sanskrit
- Scots
- Serbian
- Serbo-Croatian
- Sinhalese
- Slovak
- Slovene
- Spanish
- Swedish
- Tagalog
- Tajik
- Tamil
- Tatar
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Volapük
- Waray
- Welsh
- Western Frisian
If you use these word embeddings, please cite the following paper:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```