Commit

Found the links to download word2vec binary model trained on google news corpus.
cgsdfc committed Mar 13, 2019
1 parent f7a1227 commit 9a2c9a9
Showing 3 changed files with 14 additions and 2 deletions.
3 changes: 3 additions & 0 deletions GoogleNewsCorpusEmbLink.md
@@ -0,0 +1,3 @@
# Links to the data
- https://doc-04-bc-docs.googleusercontent.com/docs/securesc/hjocr289sqh1r4455mj0jihan5v2ingr/pe4tmd0a5bd3ue6nc1lqmbb8iamd2ics/1552471200000/06848720943842814915/04145494130524406310/0B7XkCwpI5KDYNlNUTTlSS21pQmM?e=download&nonce=dn92bkknfn7l6&user=04145494130524406310&hash=s68fjotcmst190dg9vsf54v9bplaqo0j
- https://deeplearning4jblob.blob.core.windows.net/resources/wordvectors/GoogleNews-vectors-negative300.bin.gz
9 changes: 9 additions & 0 deletions README.md
@@ -53,6 +53,15 @@ You can install these deps with conda:
The script assumes one example per line (e.g. one dialogue or one sentence per line),
where line n of `'path_to_ground_truth.txt'` is scored against line n of `'path_to_predictions.txt'`.
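For instance, two aligned files could look like this (toy contents, hypothetical data; only the line-by-line alignment matters):

```shell
# Each line is one example; line n of the two files forms one scored pair.
printf 'how are you ?\nsee you later .\n' > path_to_ground_truth.txt
printf 'i am fine .\ngoodbye .\n' > path_to_predictions.txt
# Show the aligned pairs side by side.
paste -d '|' path_to_ground_truth.txt path_to_predictions.txt
```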

# Recommended Word Embedding
We recommend the *Word2Vec* vectors trained on the *Google News Corpus*; the original repository recommends the same. Useful links for downloading this pre-trained embedding:
- [word2vec Google News model](https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git) is a GitHub mirror of the Google archive; you need *Git LFS* to clone it.
- [Google Code Archive](https://code.google.com/archive/p/word2vec/)
- [Google Drive](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)


# Where did the code come from?
The main script `embedding_metrics.py` is adapted from [hed-dlg-truncated](https://github.com/julianser/hed-dlg-truncated).
Thanks for their great script!
4 changes: 2 additions & 2 deletions embedding_metrics.py
@@ -29,7 +29,7 @@
__authors__ = ("Chia-Wei Liu", "Iulian Vlad Serban")

from random import randint
- from gensim.models import Word2Vec
+ from gensim.models import KeyedVectors
import numpy as np
import argparse

@@ -186,7 +186,7 @@ def average(fileone, filetwo, w2v):
args = parser.parse_args()

print("loading embeddings file...")
- w2v = Word2Vec.load_word2vec_format(args.embeddings, binary=True)
+ w2v = KeyedVectors.load_word2vec_format(args.embeddings, binary=True)

r = average(args.ground_truth, args.predicted, w2v)
print("Embedding Average Score: %f +/- %f ( %f )" % (r[0], r[1], r[2]))
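The `average` metric printed here is, at heart, a cosine similarity between the mean word vectors of each sentence pair. A minimal numpy sketch under that reading (toy 3-d vectors stand in for the 300-d GoogleNews model; the helper names are illustrative, not from the script):

```python
import numpy as np

def embedding_average(tokens, w2v, dim):
    """Mean of the word vectors of the in-vocabulary tokens (zeros if none)."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# Toy 3-d "embedding" standing in for the 300-d GoogleNews model.
toy_w2v = {"hello": np.array([1.0, 0.0, 0.0]),
           "world": np.array([0.0, 1.0, 0.0])}

score = cosine(embedding_average(["hello", "world"], toy_w2v, dim=3),
               embedding_average(["hello"], toy_w2v, dim=3))
print(score)  # about 0.707
```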
