
Win32 port (minor changes) of Google's Word2Vec machine learning code


bgrainger/word2vec

 
 


Win32 port of word2vec

Minor changes have been made so that the code compiles with Microsoft Visual Studio 2013:

  • Use OpenMP instead of pthreads
  • Provide a custom posix_memalign() implementation
  • Include different header files

All changes are wrapped in #ifdef _WIN32 blocks to keep the code backward compatible.

Tools for computing distributed representations of words

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary, using either the Continuous Bag-of-Words or the Skip-gram neural network architecture. The user should specify the following:

  • the desired vector dimensionality
  • the size of the context window for either the Skip-gram or the Continuous Bag-of-Words model
  • the training algorithm: hierarchical softmax and/or negative sampling
  • the threshold for downsampling frequent words
  • the number of threads to use
  • the format of the output word vector file (text or binary)

Usually, the other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets.

The script demo-word.sh downloads a small (100MB) text corpus from the web and trains a small word vector model. After training finishes, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/
