Merge branch 'master' of https://github.com/spacy-io/sense2vec
honnibal committed Feb 15, 2016
2 parents 5de6ee7 + 2b39a44 commit a7fc529
Showing 4 changed files with 41 additions and 15 deletions.
35 changes: 35 additions & 0 deletions README.md
@@ -0,0 +1,35 @@
# sense2vec

Use spaCy to go beyond vanilla word2vec

Read about sense2vec here:

https://spacy.io/blog/sense2vec-with-spacy

You can use an online demo of the technology here:

https://sense2vec.spacy.io/

We're currently refining the API to make this technology easy to use. Once we've completed that, you'll be able
to download the package from PyPI. For now, the code is available to clarify the blog post.

There are three relevant files in this repository:

# bin/merge_text.py

This script pre-processes text using spaCy, so that the sense2vec model can be trained using Gensim.

# bin/train_word2vec.py

This script reads a directory of text files, and then trains a word2vec model using Gensim. The script includes its own
vocabulary counting code, because Gensim's vocabulary count is a bit slow for our large, sparse vocabulary.
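
The vocabulary-counting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `bin/train_word2vec.py`: the function name `count_vocab`, the `*.txt` file pattern, the `min_count` parameter, and whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def count_vocab(text_dir, min_count=1):
    """Count whitespace-separated tokens across every .txt file in text_dir.

    Hypothetical helper: the real script may tokenize and filter differently.
    """
    counts = Counter()
    for path in sorted(Path(text_dir).glob("*.txt")):
        with path.open(encoding="utf8") as f:
            for line in f:
                # One pass over the file; Counter.update adds 1 per token seen.
                counts.update(line.split())
    # Drop tokens rarer than min_count, which keeps a sparse vocabulary manageable.
    return {token: n for token, n in counts.items() if n >= min_count}
```

Counting in a single streaming pass like this avoids holding the corpus in memory; the resulting counts could then be handed to whatever training code is used.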

# sense2vec/vectors.pyx

To serve the similarity queries, we wrote a small vector-store class in Cython. This made it easier to add an efficient
cache in front of the service. It also uses less memory than Gensim's Word2Vec class, as it doesn't hold the keys as Python
unicode strings.

Similarity queries could be faster if we had made all vectors contiguous in memory, instead of holding them
as an array of pointers. However, we wanted to allow a `.borrow()` method, so that vectors can be added to the store
by reference, without copying the data.
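
The trade-off between copying and borrowing can be illustrated with a small pure-Python sketch. It is not the Cython class in `sense2vec/vectors.pyx`: the class name `VectorStore`, the `add` and `most_similar` methods, and their signatures are assumptions for illustration; only the idea of a `.borrow()` method that stores a reference comes from the text above.

```python
import math
from array import array


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorStore:
    """Toy store holding one separate buffer per vector (hypothetical)."""

    def __init__(self):
        # A list of independent buffers, analogous to an array of pointers.
        self.vectors = []

    def add(self, vec):
        # Copy the data: the store owns its own buffer.
        self.vectors.append(array("f", vec))
        return len(self.vectors) - 1

    def borrow(self, vec):
        # Keep a reference to the caller's buffer: no copy is made.
        self.vectors.append(vec)
        return len(self.vectors) - 1

    def most_similar(self, query):
        # Index of the stored vector with the highest cosine similarity.
        return max(range(len(self.vectors)),
                   key=lambda i: cosine(query, self.vectors[i]))
```

Because each vector lives in its own buffer, a borrowed vector costs no extra memory, but a query has to follow one reference per vector instead of scanning a single contiguous block.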
2 changes: 1 addition & 1 deletion sense2vec/about.py
@@ -3,7 +3,7 @@
 # https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
 # https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py

-__name__ = 'sense2vec'
+__title__ = 'sense2vec'
 __version__ = '0.1.0'
 __summary__ = 'Fancy word2vec'
 __uri__ = 'https://spacy.io'
8 changes: 4 additions & 4 deletions sense2vec/download.py
@@ -14,19 +14,19 @@
 )
 def main(force=False):
     if force:
-        sputnik.purge(about.__name__, about.__version__)
+        sputnik.purge(about.__title__, about.__version__)

     try:
-        sputnik.package(about.__name__, about.__version__, about.__default_model__)
+        sputnik.package(about.__title__, about.__version__, about.__default_model__)
         print("Model already installed. Please run '%s --force to reinstall." % sys.argv[0], file=sys.stderr)
         sys.exit(1)
     except (PackageNotFoundException, CompatiblePackageNotFoundException):
         pass

-    package = sputnik.install(about.__name__, about.__version__, about.__default_model__)
+    package = sputnik.install(about.__title__, about.__version__, about.__default_model__)

     try:
-        sputnik.package(about.__name__, about.__version__, about.__default_model__)
+        sputnik.package(about.__title__, about.__version__, about.__default_model__)
     except (PackageNotFoundException, CompatiblePackageNotFoundException):
         print("Model failed to install. Please run '%s --force." % sys.argv[0], file=sys.stderr)
         sys.exit(1)
11 changes: 1 addition & 10 deletions setup.py
@@ -21,10 +21,6 @@
 MOD_NAMES = ['sense2vec.vectors']


-if sys.version_info[:2] < (2, 7) or (3, 0) <= sys.version_info[0:2] < (3, 4):
-    raise RuntimeError('Python version 2.7 or >= 3.4 required.')
-
-
 # By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
 # http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
 compile_options = {'msvc' : ['/Ox', '/EHsc'],
@@ -36,11 +32,6 @@
 '-L/usr/lib64/atlas', # needed for redhat
 '-lcblas']}

-if sys.platform.startswith('darwin'):
-    compile_options['other'].append('-mmacosx-version-min=10.8')
-    compile_options['other'].append('-stdlib=libc++')
-    link_options['other'].append('-lc++')
-

 class build_ext_options:
     def build_options(self):
@@ -152,7 +143,7 @@ def setup_package():
     prepare_includes(root)

     setup(
-        name=about['__name__'],
+        name=about['__title__'],
         zip_safe=False,
         packages=PACKAGES,
         package_data={'': ['*.pyx', '*.pxd']},
