Rename -hashOnly to -mode floret (#3)
* Rename -hashOnly to -mode floret

In parallel to `spacy init vectors --mode floret`, rename the boolean
option `-hashOnly` to `-mode floret` with the default `-mode fasttext`.

* Set version to v0.10.0.dev1
adrianeboyd authored Oct 4, 2021
1 parent c4e1613 commit d4b0de0
Showing 10 changed files with 81 additions and 44 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -45,12 +45,12 @@ See the [python docs](python/README.md).
`floret` adds two additional command line options to `fasttext`:

```
- -hashOnly both word and char ngrams hashed only in buckets [false]
- -hashCount with hashOnly: number of hashes (1-4) per word / subword [1]
+ -mode fasttext (default) or floret (word and char ngrams hashed in buckets) [fasttext]
+ -hashCount floret mode only: number of hashes (1-4) per word/subword [1]
```

- With `-hashOnly`, the word entries are stored in the same table as the subword
- embeddings (buckets), reducing the size of the saved vector data.
+ With `-mode floret`, the word entries are stored in the same table as the
+ subword embeddings (buckets), reducing the size of the saved vector data.

With `-hashCount 2`, each entry is stored as the sum of 2 rows in the internal
subwords hash table. `floret` supports 1-4 hashes per entry in the embeddings
@@ -64,14 +64,14 @@ hashes per entry, and a compact table of 50K entries rather than the default of
2M entries.

```bash
- floret cbow -dim 300 -minn 4 -maxn 5 -hashOnly -hashCount 2 -bucket 50000 \
+ floret cbow -dim 300 -minn 4 -maxn 5 -mode floret -hashCount 2 -bucket 50000 \
-input input.txt -output vectors
```

- With the `-hashOnly` option, floret will save an additional vector table with
- the file ending `.floret`. The format is very similar to `.vec` with a header
- line followed by one line per vector. The word tokens are replaced with the
- index of the row and the header is extended to contain all the relevant
+ With the `-mode floret` option, floret will save an additional vector table
+ with the file ending `.floret`. The format is very similar to `.vec` with a
+ header line followed by one line per vector. The word tokens are replaced with
+ the index of the row and the header is extended to contain all the relevant
training settings needed to load this table in spaCy.
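
As a rough illustration of that layout, here is a minimal sketch of a loader for such a table (only the "header line, then row index followed by vector values" shape described above is assumed; the exact header fields are whatever floret writes):

```python
import numpy as np

def read_floret_table(path):
    """Read a .floret table: a header line with the training settings,
    followed by one line per row, starting with the row index."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        rows = {}
        for line in f:
            fields = line.rstrip().split(" ")
            rows[int(fields[0])] = np.asarray(fields[1:], dtype="float32")
    return header, rows
```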

To import this vector table in [spaCy](https://spacy.io) v3.2+:
@@ -107,9 +107,9 @@ the table. By representing each entry as the sum of multiple rows, where it's
unlikely that two entries will collide on multiple hashes, most entries will
end up with a distinct representation.

- With the settings `-minn 4 -maxn 5 -hashOnly -hashCount 2`, the embedding for
- the word `apple` is stored internally as the sum of 2 hashed rows for each of
- the word, 4-grams and 5-ngrams. The word is padded with the BOW and EOW
+ With the settings `-minn 4 -maxn 5 -mode floret -hashCount 2`, the embedding
+ for the word `apple` is stored internally as the sum of 2 hashed rows for each
+ of the word, 4-grams and 5-ngrams. The word is padded with the BOW and EOW
characters `<` and `>`, creating the following word and subword entries:

```
@@ -128,7 +128,7 @@ For compatibility with spaCy,
char ngram strings. The final embedding for `apple` is then the sum of two rows
(`-hashCount 2`) per word and char ngram above.
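
To make that lookup concrete, here is a small illustrative sketch of the scheme described above (not floret's actual code: Python's built-in `hash` stands in for floret's murmurhash, and the table values are random):

```python
import numpy as np

def char_ngrams(word, minn=4, maxn=5):
    """The word plus its char n-grams, padded with BOW/EOW as described above."""
    padded = "<" + word + ">"
    grams = [padded]  # the whole word is stored as an entry too
    for n in range(minn, maxn + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

def word_vector(word, table, hash_count=2):
    """Sum hash_count hashed rows per word/char-ngram entry (-hashCount 2)."""
    bucket, dim = table.shape
    vec = np.zeros(dim, dtype="float32")
    for gram in char_ngrams(word):
        for seed in range(hash_count):
            vec += table[hash((seed, gram)) % bucket]
    return vec

table = np.random.default_rng(0).standard_normal((50000, 300), dtype="float32")
print(char_ngrams("apple"))  # ['<apple>', '<app', 'appl', 'pple', 'ple>', '<appl', 'apple', 'pple>']
print(word_vector("apple", table).shape)  # (300,)
```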

- With `-hashOnly`, `floret` will save an additional vector table with the
+ With `-mode floret`, `floret` will save an additional vector table with the
ending `.floret` alongside the usual `.bin` and `.vec` files. The format is
very similar to `.vec` with a header line followed by one line per entry in the
vector table with the row index rather than a word token at the beginning of
8 changes: 4 additions & 4 deletions python/README.md
@@ -19,8 +19,8 @@ pip install floret

Train floret vectors using the options:

- - `hashOnly`: if `True`, train floret vectors, storing both words and subwords
- in the same compact hash table
+ - `mode`: `"floret"`, storing both words and subwords in the same compact hash
+ table
- `hashCount`: store each entry in 1-4 rows in the hash table (recommended:
`2`)
- `bucket`: in combination with `hashCount>1`, the size of the hash table can
@@ -36,7 +36,7 @@ import floret
model = floret.train_unsupervised(
"data.txt",
model="cbow",
- hashOnly=True,
+ mode="floret",
hashCount=2,
bucket=50000,
minn=3,
@@ -56,7 +56,7 @@ model.save_vectors("vectors.vec")
model.save_hash_only_vectors("vectors.floret")
```

- **Note:** with the default setting `hashOnly=False`, `floret` trains original
+ **Note:** with the default setting `mode="fasttext"`, `floret` trains original
fastText vectors.
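
For example, a minimal sketch contrasting the two modes with the options documented above (parameter values here are arbitrary):

```python
import floret

# Default mode: original fastText vectors
ft_model = floret.train_unsupervised("data.txt", model="cbow")

# floret mode: words and subwords share one compact hash table
fl_model = floret.train_unsupervised(
    "data.txt", model="cbow", mode="floret", hashCount=2, bucket=50000
)
fl_model.save_hash_only_vectors("vectors.floret")
```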

## Use floret vectors in spaCy
21 changes: 17 additions & 4 deletions python/floret_module/floret/floret.py
@@ -17,6 +17,7 @@

loss_name = floret.loss_name
model_name = floret.model_name
+ mode_name = floret.mode_name
EOS = "</s>"
BOW = "<"
EOW = ">"
@@ -102,7 +103,7 @@ def set_args(self, args=None):
'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams',
'loss', 'bucket', 'thread', 'lrUpdateRate', 't',
'label', 'verbose', 'pretrainedVectors',
- 'hashOnly', 'hashCount']
+ 'mode', 'hashCount']
for arg_name in arg_names:
setattr(self, arg_name, getattr(args, arg_name))

@@ -416,9 +417,19 @@ def _parse_loss_string(string):
raise ValueError("Unrecognized loss name")


+ def _parse_mode_string(string):
+ if string == "fasttext":
+ return mode_name.fasttext
+ elif string == "floret":
+ return mode_name.floret
+ else:
+ raise ValueError("Unrecognized mode name")


def _build_args(args, manually_set_args):
args["model"] = _parse_model_string(args["model"])
args["loss"] = _parse_loss_string(args["loss"])
+ args["mode"] = _parse_mode_string(args["mode"])
if type(args["autotuneModelSize"]) == int:
args["autotuneModelSize"] = str(args["autotuneModelSize"])

@@ -429,8 +440,10 @@ def _build_args(args, manually_set_args):
a.setManual(k)
a.output = "" # User should use save_model
a.saveOutput = 0 # Never use this
- if a.wordNgrams <= 1 and a.maxn == 0:
+ if a.wordNgrams <= 1 and a.maxn == 0 and len(a.autotuneValidationFile) == 0 and a.mode != mode_name.floret:
a.bucket = 0
+ if a.mode != "floret":
+ a.hashCount = 1
return a


@@ -455,7 +468,7 @@ def load_model(path):
'minCountLabel': 0,
'minn': 3,
'maxn': 6,
- 'hashOnly': False,
+ 'mode': "fasttext_mode",
'hashCount': 1,
'neg': 5,
'wordNgrams': 1,
@@ -557,7 +570,7 @@ def train_unsupervised(*kargs, **kwargs):
"""
arg_names = ['input', 'model', 'lr', 'dim', 'ws', 'epoch', 'minCount',
'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams', 'loss',
- 'bucket', 'hashCount', 'hashOnly', 'thread', 'lrUpdateRate',
+ 'bucket', 'hashCount', 'mode', 'thread', 'lrUpdateRate',
't', 'label', 'verbose', 'pretrainedVectors']
args, manually_set_args = read_args(kargs, kwargs, arg_names,
unsupervised_default)
7 changes: 6 additions & 1 deletion python/floret_module/floret/pybind/floret_pybind.cc
@@ -104,7 +104,7 @@ PYBIND11_MODULE(floret_pybind, m) {
.def_readwrite("bucket", &fasttext::Args::bucket)
.def_readwrite("minn", &fasttext::Args::minn)
.def_readwrite("maxn", &fasttext::Args::maxn)
- .def_readwrite("hashOnly", &fasttext::Args::hashOnly)
+ .def_readwrite("mode", &fasttext::Args::mode)
.def_readwrite("hashCount", &fasttext::Args::hashCount)
.def_readwrite("thread", &fasttext::Args::thread)
.def_readwrite("t", &fasttext::Args::t)
@@ -157,6 +157,11 @@ PYBIND11_MODULE(floret_pybind, m) {
fasttext::metric_name::recallAtPrecisionLabel)
.export_values();

+ py::enum_<fasttext::mode_name>(m, "mode_name")
+ .value("fasttext", fasttext::mode_name::fasttext)
+ .value("floret", fasttext::mode_name::floret);
+ // not exported into the parent scope because the names clash

m.def(
"train",
[](fasttext::FastText& ft, fasttext::Args& a) {
2 changes: 1 addition & 1 deletion setup.py
@@ -22,7 +22,7 @@
import io
import pybind11

- __version__ = '0.10.0.dev0'
+ __version__ = '0.10.0.dev1'
FASTTEXT_SRC = "src"

# Based on https://github.com/pybind/python_example
43 changes: 30 additions & 13 deletions src/args.cc
@@ -31,7 +31,7 @@ Args::Args() {
bucket = 2000000;
minn = 3;
maxn = 6;
- hashOnly = false;
+ mode = mode_name::fasttext;
hashCount = 1;
thread = 12;
lrUpdateRate = 100;
@@ -107,6 +107,16 @@ std::string Args::metricToString(metric_name mn) const {
return "Unknown metric name!"; // should never happen
}

+ std::string Args::modeToString(mode_name mn) const {
+ switch (mn) {
+ case mode_name::fasttext:
+ return "fasttext";
+ case mode_name::floret:
+ return "floret";
+ }
+ return "Unknown mode name!"; // should never happen
+ }

void Args::parseArgs(const std::vector<std::string>& args) {
std::string command(args[1]);
if (command == "supervised") {
@@ -175,9 +185,16 @@ void Args::parseArgs(const std::vector&lt;std::string&gt;&amp; args) {
minn = std::stoi(args.at(ai + 1));
} else if (args[ai] == "-maxn") {
maxn = std::stoi(args.at(ai + 1));
- } else if (args[ai] == "-hashOnly") {
- hashOnly = true;
- ai--;
+ } else if (args[ai] == "-mode") {
+ if (std::string(args.at(ai + 1)) == "fasttext") {
+ mode = mode_name::fasttext;
+ } else if (std::string(args.at(ai + 1)) == "floret"){
+ mode = mode_name::floret;
+ } else {
+ std::cerr << "Unknown mode: " << args.at(ai + 1) << std::endl;
+ printHelp();
+ exit(EXIT_FAILURE);
+ }
} else if (args[ai] == "-hashCount") {
hashCount = std::stoi(args.at(ai + 1));
if (hashCount < 1 || hashCount >= 5) {
@@ -241,10 +258,10 @@ void Args::parseArgs(const std::vector&lt;std::string&gt;&amp; args) {
printHelp();
exit(EXIT_FAILURE);
}
- if (wordNgrams <= 1 && maxn == 0 && !hasAutotune() && !hashOnly) {
+ if (wordNgrams <= 1 && maxn == 0 && !hasAutotune() && mode != mode_name::floret) {
bucket = 0;
}
- if (!hashOnly) {
+ if (mode != mode_name::floret) {
hashCount = 1;
}
}
@@ -278,9 +295,9 @@ void Args::printDictionaryHelp() {
<< "]\n"
<< " -maxn max length of char ngram [" << maxn
<< "]\n"
- << " -hashOnly both word and char ngrams hashed only in buckets ["
- << boolToString(hashOnly) << "]\n"
- << " -hashCount with hashOnly: number of hashes (1-4) per word / subword ["
+ << " -mode fasttext (default) or floret (word and char ngrams hashed in buckets) ["
+ << "fasttext" << "]\n"
+ << " -hashCount floret mode only: number of hashes (1-4) per word/subword ["
<< hashCount << "]\n"
<< " -t sampling threshold [" << t << "]\n"
<< " -label labels prefix [" << label << "]\n";
@@ -353,7 +370,7 @@ void Args::save(std::ostream&amp; out) {
out.write((char*)&(bucket), sizeof(int));
out.write((char*)&(minn), sizeof(int));
out.write((char*)&(maxn), sizeof(int));
- out.write((char*)&(hashOnly), sizeof(bool));
+ out.write((char*)&(mode), sizeof(mode_name));
out.write((char*)&(hashCount), sizeof(int));
out.write((char*)&(lrUpdateRate), sizeof(int));
out.write((char*)&(t), sizeof(double));
@@ -371,7 +388,7 @@ void Args::load(std::istream&amp; in) {
in.read((char*)&(bucket), sizeof(int));
in.read((char*)&(minn), sizeof(int));
in.read((char*)&(maxn), sizeof(int));
- in.read((char*)&(hashOnly), sizeof(bool));
+ in.read((char*)&(mode), sizeof(mode_name));
in.read((char*)&(hashCount), sizeof(int));
in.read((char*)&(lrUpdateRate), sizeof(int));
in.read((char*)&(t), sizeof(double));
@@ -400,8 +417,8 @@ void Args::dump(std::ostream&amp; out) const {
<< " " << minn << std::endl;
out << "maxn"
<< " " << maxn << std::endl;
- out << "hashOnly"
- << " " << hashOnly << std::endl;
+ out << "mode"
+ << " " << modeToString(mode) << std::endl;
out << "hashCount"
<< " " << hashCount << std::endl;
out << "lrUpdateRate"
4 changes: 3 additions & 1 deletion src/args.h
@@ -26,12 +26,14 @@ enum class metric_name : int {
recallAtPrecision,
recallAtPrecisionLabel
};
+ enum class mode_name : int { fasttext = 1, floret };

class Args {
protected:
std::string boolToString(bool) const;
std::string modelToString(model_name) const;
std::string metricToString(metric_name) const;
+ std::string modeToString(mode_name) const;
std::unordered_set<std::string> manualArgs_;

public:
@@ -52,7 +54,7 @@ class Args {
int bucket;
int minn;
int maxn;
- bool hashOnly;
+ mode_name mode;
int hashCount;
int thread;
double t;
8 changes: 4 additions & 4 deletions src/dictionary.cc
@@ -109,7 +109,7 @@ void Dictionary::getSubwords(
int32_t i = getId(word);
ngrams.clear();
substrings.clear();
- if (!args_->hashOnly && i >= 0) {
+ if (args_->mode != mode_name::floret && i >= 0) {
ngrams.push_back(i);
substrings.push_back(words_[i].word);
}
@@ -183,7 +183,7 @@ void Dictionary::computeSubwords(
const std::string& word,
std::vector<int32_t>& ngrams,
std::vector<std::string>* substrings) const {
- if (args_->hashOnly) {
+ if ((args_->mode == mode_name::floret)) {
std::vector<uint32_t> hashes;
murmurhash(word, &hashes);
for (uint32_t hash : hashes) {
@@ -205,7 +205,7 @@ void Dictionary::computeSubwords(
ngram.push_back(word[j++]);
}
if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
- if (args_->hashOnly) {
+ if ((args_->mode == mode_name::floret)) {
std::vector<uint32_t> hashes;
murmurhash(ngram, &hashes);
for (size_t i = 0; i < hashes.size(); i++) {
@@ -233,7 +233,7 @@ void Dictionary::initNgrams() {
computeSubwords(word, words_[i].subwords);
}
// remove word-index subword for all words except 0 (</s>)
- if (args_->hashOnly && i > 0) {
+ if ((args_->mode == mode_name::floret) && i > 0) {
words_[i].subwords.erase(words_[i].subwords.begin());
}
}
4 changes: 2 additions & 2 deletions src/fasttext.cc
@@ -96,7 +96,7 @@ int32_t FastText::getWordId(const std::string&amp; word) const {
}

int32_t FastText::getSubwordId(const std::string& subword) const {
- if (args_->hashOnly) {
+ if (args_->mode == mode_name::floret) {
return -1;
} else {
int32_t h = dict_->hash(subword) % args_->bucket;
@@ -125,7 +125,7 @@ void FastText::getWordVector(Vector&amp; vec, const std::string&amp; word) const {

void FastText::getSubwordVector(Vector& vec, const std::string& subword) const {
vec.zero();
- if (args_->hashOnly) {
+ if (args_->mode == mode_name::floret) {
std::vector<uint32_t> hashes;
dict_->murmurhash(subword, &hashes);
for (size_t i = 0; i < hashes.size(); i++) {
2 changes: 1 addition & 1 deletion src/main.cc
@@ -381,7 +381,7 @@ void train(const std::vector&lt;std::string&gt; args) {
}
fasttext->saveModel(outputFileName);
fasttext->saveVectors(a.output + ".vec");
- if (a.hashOnly) {
+ if (a.mode == mode_name::floret) {
fasttext->saveHashOnlyVectors(a.output + ".floret");
}
if (a.saveOutput) {
