Merge pull request tensorflow#4 from tensorflow/swivel

added swivel model
MingStar · Mar 8, 2016 · 1ecaf09
2 parents 7c41e65 + f3a2f63
commit 1ecaf09
Showing 12 changed files with 2,431 additions and 0 deletions.
diff --git a/swivel/.gitignore b/swivel/.gitignore
@@ -0,0 +1,12 @@
+*.an.tab
+*.pyc
+*.ws.tab
+MEN.tar.gz
+Mtruk.csv
+SimLex-999.zip
+analogy
+fastprep
+myz_naacl13_test_set.tgz
+questions-words.txt
+rw.zip
+ws353simrel.tar.gz
diff --git a/swivel/README.md b/swivel/README.md
@@ -0,0 +1,182 @@
+# Swivel in TensorFlow
+
+This is a [TensorFlow](http://www.tensorflow.org/) implementation of the
+[Swivel algorithm](http://arxiv.org/abs/1602.02215) for generating word
+embeddings.
+
+Swivel works as follows:
+
+1. Compute the co-occurrence statistics from a corpus; that is, determine how
+   often a word *c* appears in the context (e.g., "within ten words") of a focus
+   word *f*.  This results in a sparse *co-occurrence matrix* whose rows
+   represent the focus words, and whose columns represent the context
+   words. Each cell value is the number of times the focus and context words
+   were observed together.
+2. Re-organize the co-occurrence matrix and chop it into smaller pieces.
+3. Assign a random *embedding vector* of fixed dimension (say, 300) to each
+   focus word and to each context word.
+4. Iteratively attempt to approximate the
+   [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
+   (PMI) between words with the dot product of the corresponding embedding
+   vectors.
+
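Steps 1 and 4 can be illustrated with a minimal sketch. It assumes unweighted counts over a symmetric window and computes PMI directly from those counts; the function names and the toy sentence are hypothetical and are not taken from `prep.py` or `swivel.py`.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count (focus, context) pairs where the context is within `window` tokens."""
    counts = Counter()
    for i, focus in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(focus, tokens[j])] += 1
    return counts


def pmi(counts):
    """PMI for every observed pair: log(n_fc * N / (n_f * n_c))."""
    total = sum(counts.values())
    row_sums, col_sums = Counter(), Counter()
    for (f, c), n in counts.items():
        row_sums[f] += n
        col_sums[c] += n
    return {
        (f, c): math.log(n * total / (row_sums[f] * col_sums[c]))
        for (f, c), n in counts.items()
    }


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
scores = pmi(counts)
```

Only observed pairs appear in `scores`; every other cell of the matrix has a count of zero, which is why the full matrix is sparse.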
+Note that the resulting co-occurrence matrix is very sparse (i.e., contains many
+zeros) since most words won't have been observed in the context of other words.
+In the case of very rare words, it seems reasonable to assume that you just
+haven't sampled enough data to spot their co-occurrence yet.  On the other hand,
+if we've failed to observe two common words co-occurring, it seems likely that
+they are *anti-correlated*.
+
+Swivel attempts to capture this intuition by using both the observed and the
+un-observed co-occurrences to inform the way it iteratively adjusts vectors.
+Empirically, this seems to lead to better embeddings, especially for rare words.
+
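One way to express this intuition in code is sketched below, loosely following the idea in the linked paper rather than the code in `swivel.py`. It assumes a squared error for observed cells and a one-sided penalty for unobserved cells, and it omits any per-cell confidence weighting. The function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cell_loss(dot, count, row_sum, col_sum, total):
    """Loss for one matrix cell, given the dot product of its two embeddings."""
    if count > 0:
        # Observed pair: pull the dot product toward the PMI.
        target = math.log(count * total / (row_sum * col_sum))
        return 0.5 * (dot - target) ** 2
    # Unobserved pair: pretend it was seen once to get an upper bound on the
    # PMI, then penalize only dot products that rise above that bound.
    bound = math.log(1.0 * total / (row_sum * col_sum))
    return softplus(dot - bound)


# Same dot product, no observed co-occurrence in either case.
common_pair = cell_loss(1.0, 0, row_sum=50, col_sum=50, total=100)
rare_pair = cell_loss(1.0, 0, row_sum=2, col_sum=2, total=100)
```

Under these assumptions the bound is low for two common words, so `common_pair` is penalized heavily and the vectors are pushed apart, while for two rare words the bound is high and `rare_pair` stays close to zero.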
+# Contents
+
+This release includes the following programs.
+
+* `prep.py` is a program that takes a text corpus and pre-processes it for
+  training. Specifically, it computes a vocabulary and token co-occurrence
+  statistics for the corpus.  It then outputs the information into a format that
+  can be digested by the TensorFlow trainer.
+* `swivel.py` is a TensorFlow program that generates embeddings from the
+  co-occurrence statistics.  It uses the files created by `prep.py` as input,
+  and generates two text files as output: the row and column embeddings.
+* `text2bin.py` combines the row and column vectors generated by Swivel into a
+  flat binary file that can be quickly loaded into memory to perform vector
+  arithmetic.  This can also be used to convert embeddings from
+  [GloVe](http://nlp.stanford.edu/projects/glove/) and
+  [word2vec](https://code.google.com/archive/p/word2vec/) into a form that can
+  be used by the following tools.
+* `nearest.py` is a program that you can use to manually inspect binary
+  embeddings.
+* `eval.mk` is a GNU makefile that will retrieve and normalize several common
+  word similarity and analogy evaluation data sets.
+* `wordsim.py` performs word similarity evaluation of the resulting vectors.
+* `analogy` performs analogy evaluation of the resulting vectors.
+* `fastprep` is a C++ program that works much more quickly than `prep.py`, but
+  also has some additional dependencies to build.
+
+# Building Embeddings with Swivel
+
+To build your own word embeddings with Swivel, you'll need the following:
+
+* A large corpus of text; for example, the
+  [dump of English Wikipedia](https://dumps.wikimedia.org/enwiki/).
+* A working [TensorFlow](http://www.tensorflow.org/) implementation.
+* A machine with plenty of disk space and, ideally, a beefy GPU card.  (We've
+  experimented with the
+  [Nvidia Titan X](http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x),
+  for example.)
+
+You'll then run `prep.py` (or `fastprep`) to prepare the data for Swivel and run
+`swivel.py` to create the embeddings. The resulting embeddings will be output
+into two large text files: one for the row vectors and one for the column
+vectors.  You can use those "as is", or convert them into a binary file using
+`text2bin.py` and then use the tools here to experiment with the resulting
+vectors.
+
+## Preparing the data for training
+
+Once you've downloaded the corpus (e.g., to `/tmp/wiki.txt`), run `prep.py` to
+prepare the data for training:
+
+    ./prep.py --output_dir /tmp/swivel_data --input /tmp/wiki.txt
+
+By default, `prep.py` will make one pass through the text file to compute a
+"vocabulary" of the most frequent words, and then a second pass to compute the
+co-occurrence statistics.  The following options allow you to control this
+behavior:
+
+| Option | Description |
+|:--- |:--- |
+| `--min_count <n>` | Only include words in the generated vocabulary that appear at least *n* times. |
+| `--max_vocab <n>` | Admit at most *n* words into the vocabulary. |
+| `--vocab <filename>` | Use the specified filename as the vocabulary instead of computing it from the corpus.  The file should contain one word per line. |
+
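The effect of `--min_count` and `--max_vocab` can be sketched as follows. This is an illustration under assumptions, not the code in `prep.py`: the function name, the default values, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def build_vocab(lines, min_count=5, max_vocab=None):
    """Return words by descending frequency, filtered by count and capped in size."""
    freq = Counter(word for line in lines for word in line.split())
    kept = [(w, n) for w, n in freq.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically so the result is stable.
    kept.sort(key=lambda wn: (-wn[1], wn[0]))
    words = [w for w, _ in kept]
    return words if max_vocab is None else words[:max_vocab]


corpus = ["a b a", "a b c"]
vocab = build_vocab(corpus, min_count=2)
capped = build_vocab(corpus, min_count=1, max_vocab=1)
```

Here `c` appears only once, so it is dropped when `min_count=2`, and `max_vocab=1` keeps only the most frequent word.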
+The `prep.py` program is pretty simple.  Notably, it does almost no text
+processing: it does no case translation and simply breaks text into tokens by
+splitting on spaces. Feel free to experiment with the `words` function if you'd
+like to do something more sophisticated.
+
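As an example of something slightly more sophisticated, a tokenizer could lowercase the text and drop punctuation. The sketch below is a hypothetical replacement; the actual signature of the `words` function in `prep.py` may differ, and the regular expression only handles ASCII letters and digits.

```python
import re

# A run of ASCII letters or digits, optionally followed by an apostrophe suffix
# such as "'s" or "'t".
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def words(line):
    """Lowercase the line and return its alphanumeric tokens, dropping punctuation."""
    return _TOKEN_RE.findall(line.lower())


tokens = words("The dog's bone, isn't it?")
```

If you lowercase during preparation, remember that evaluation data must use the same capitalization.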
+Unfortunately, `prep.py` is pretty slow.  Also included is `fastprep`, a C++
+equivalent that works much more quickly.  Building `fastprep.cc` is a bit more
+involved: it requires you to pull and build the TensorFlow source code in order
+to provide the libraries and headers that it needs.  See `fastprep.mk` for more
+details.
+
+## Training the embeddings
+
+When `prep.py` completes, it will have produced a directory containing the data
+that the Swivel trainer needs to run.  Train embeddings as follows:
+
+    ./swivel.py --input_base_path /tmp/swivel_data \
+       --output_base_path /tmp/swivel_data
+
+There are a variety of parameters that you can fiddle with to customize the
+embeddings; some that you may want to experiment with include:
+
+| Option | Description |
+|:--- |:--- |
+| `--embedding_size <dim>` | The dimensionality of the embeddings that are created.  By default, 300 dimensional embeddings are created. |
+| `--num_epochs <n>` | The number of iterations through the data that are performed.  By default, 40 epochs are trained. |
+
+As mentioned above, access to a beefy GPU will dramatically reduce the amount of
+time it takes Swivel to train embeddings.
+
+When complete, you should find `row_embedding.tsv` and `col_embedding.tsv` in
+the directory specified by `--output_base_path`.  These files are tab-delimited
+files that contain one embedding per line.  Each line contains the token
+followed by *dim* floating point numbers.
+
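A file in this format (a token followed by tab-separated floating point numbers) can be read with a few lines of Python. This is a minimal sketch; the function names and the sample lines are hypothetical.

```python
def parse_embedding_line(line):
    """Split one 'token<TAB>v1<TAB>v2...' line into (token, [floats])."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], [float(x) for x in fields[1:]]


def load_embeddings(lines):
    """Map each token to its vector; `lines` is any iterable of text lines."""
    return dict(parse_embedding_line(line) for line in lines if line.strip())


sample = ["dog\t0.1\t-0.2\t0.3\n", "cat\t0.0\t0.5\t-1.0\n"]
vectors = load_embeddings(sample)
```

Because `load_embeddings` accepts any iterable of lines, an open file object can be passed in place of `sample`.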
+## Exploring and evaluating the embeddings
+
+There are also some simple tools you can use to explore the embeddings.  These tools
+work with a simple binary vector format that can be `mmap`-ed into memory along
+with a separate vocabulary file.  Use `text2bin.py` to generate these files:
+
+    ./text2bin.py -o vecs.bin -v vocab.txt /tmp/swivel_data/*_embedding.tsv
+
+You can do some simple exploration using `nearest.py`:
+
+    ./nearest.py -v vocab.txt -e vecs.bin
+    query> dog
+    dog
+    dogs
+    cat
+    ...
+    query> man woman king
+    king
+    queen
+    princess
+    ...
+
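A single-word query like `dog` can be thought of as a nearest-neighbor search. The sketch below assumes cosine similarity as the ranking measure, which the text above does not confirm for `nearest.py`; the toy vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=3):
    """Return the k tokens most similar to `query`, the query itself included."""
    q = vectors[query]
    ranked = sorted(vectors, key=lambda w: cosine(q, vectors[w]), reverse=True)
    return ranked[:k]


toy = {
    "dog": [1.0, 0.1, 0.0],
    "dogs": [0.9, 0.2, 0.0],
    "cat": [0.7, 0.7, 0.0],
    "car": [0.0, 0.1, 1.0],
}
neighbors = nearest("dog", toy)
```

With these toy vectors the query word ranks first, followed by `dogs` and then `cat`.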
+To evaluate the embeddings using common word similarity and analogy datasets,
+use `eval.mk` to retrieve the data sets and build the tools:
+
+    make -f eval.mk
+    ./wordsim.py -v vocab.txt -e vecs.bin *.ws.tab
+    ./analogy --vocab vocab.txt --embeddings vecs.bin *.an.tab
+
+The word similarity evaluation compares the embeddings' estimate of "similarity"
+with human judgement using
+[Spearman's rho](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
+as the measure of correlation.  (Bigger numbers are better.)
+
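Spearman's rho is the Pearson correlation of the two rank-transformed lists. The sketch below is a dependency-free illustration, not the code in `wordsim.py`; the ratings and similarity values are made up for the example.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


human = [9.0, 7.5, 3.0, 1.0]  # hypothetical human similarity ratings
model = [0.8, 0.9, 0.2, 0.1]  # hypothetical similarities from the embeddings
rho = spearman_rho(human, model)
```

Only the ordering of the scores matters, so the two lists do not need to share a scale. The function divides by zero if either list is constant.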
+The analogy evaluation tests how well the embeddings can predict analogies like
+"man is to woman as king is to queen".
+
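A common way to predict such an analogy is vector arithmetic: find the word whose vector is closest to `woman - man + king`. The sketch below assumes cosine similarity and excludes the three query words from the candidates; both choices are assumptions rather than a description of the `analogy` tool, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the token nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}
answer = analogy("man", "woman", "king", toy)
```

An evaluation would count a question as correct when the predicted word matches the expected one, which is why consistent capitalization between the embeddings and the test data matters.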
+Note that `eval.mk` forces all evaluation data into lower case.  From there,
+both the word similarity and analogy evaluations assume that the eval data and
+the embeddings use consistent capitalization: if you train embeddings using
+mixed case and evaluate them using lower case, things won't work well.
+
+# Contact
+
+If you have any questions about Swivel, feel free to post to
+[[email protected]](https://groups.google.com/forum/#!forum/swivel-embeddings)
+or contact us directly:
+
+* Noam Shazeer (`[email protected]`)
+* Ryan Doherty (`[email protected]`)
+* Colin Evans (`[email protected]`)
+* Chris Waterson (`[email protected]`)
+