Rename -hashOnly to -mode floret (#3)
* Rename -hashOnly to -mode floret

In parallel to `spacy init vectors --mode floret`, rename the boolean
option `-hashOnly` to `-mode floret` with the default `-mode fasttext`.

* Set version to v0.10.0.dev1
adrianeboyd authored Oct 4, 2021
1 parent c4e1613 commit d4b0de0
Showing 10 changed files with 81 additions and 44 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -45,12 +45,12 @@ See the [python docs](python/README.md).
`floret` adds two additional command line options to `fasttext`:

```
- -hashOnly both word and char ngrams hashed only in buckets [false]
- -hashCount with hashOnly: number of hashes (1-4) per word / subword [1]
+ -mode fasttext (default) or floret (word and char ngrams hashed in buckets) [fasttext]
+ -hashCount floret mode only: number of hashes (1-4) per word/subword [1]
```

- With `-hashOnly`, the word entries are stored in the same table as the subword
- embeddings (buckets), reducing the size of the saved vector data.
+ With `-mode floret`, the word entries are stored in the same table as the
+ subword embeddings (buckets), reducing the size of the saved vector data.

With `-hashCount 2`, each entry is stored as the sum of 2 rows in the internal
subwords hash table. `floret` supports 1-4 hashes per entry in the embeddings
@@ -64,14 +64,14 @@ hashes per entry, and a compact table of 50K entries rather than the default of
2M entries.

```bash
- floret cbow -dim 300 -minn 4 -maxn 5 -hashOnly -hashCount 2 -bucket 50000 \
+ floret cbow -dim 300 -minn 4 -maxn 5 -mode floret -hashCount 2 -bucket 50000 \
-input input.txt -output vectors
```

- With the `-hashOnly` option, floret will save an additional vector table with
- the file ending `.floret`. The format is very similar to `.vec` with a header
- line followed by one line per vector. The word tokens are replaced with the
- index of the row and the header is extended to contain all the relevant
+ With the `-mode floret` option, floret will save an additional vector table
+ with the file ending `.floret`. The format is very similar to `.vec` with a
+ header line followed by one line per vector. The word tokens are replaced with
+ the index of the row and the header is extended to contain all the relevant
training settings needed to load this table in spaCy.
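
As a rough illustration of that layout, here is a minimal sketch of a loader for such a table (only the "header line, then row index followed by vector values" shape described above is assumed; the exact header fields are whatever floret writes):

```python
import numpy as np

def read_floret_table(path):
    """Read a .floret table: a header line with the training settings,
    followed by one line per row, starting with the row index."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        rows = {}
        for line in f:
            fields = line.rstrip().split(" ")
            rows[int(fields[0])] = np.asarray(fields[1:], dtype="float32")
    return header, rows
```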

To import this vector table in [spaCy](https://spacy.io) v3.2+:
@@ -107,9 +107,9 @@ the table. By representing each entry as the sum of multiple rows, where it's
unlikely that two entries will collide on multiple hashes, most entries will
end up with a distinct representation.

- With the settings `-minn 4 -maxn 5 -hashOnly -hashCount 2`, the embedding for
- the word `apple` is stored internally as the sum of 2 hashed rows for each of
- the word, 4-grams and 5-ngrams. The word is padded with the BOW and EOW
+ With the settings `-minn 4 -maxn 5 -mode floret -hashCount 2`, the embedding
+ for the word `apple` is stored internally as the sum of 2 hashed rows for each
+ of the word, 4-grams and 5-ngrams. The word is padded with the BOW and EOW
characters `<` and `>`, creating the following word and subword entries:

```
@@ -128,7 +128,7 @@ For compatibility with spaCy,
char ngram strings. The final embedding for `apple` is then the sum of two rows
(`-hashCount 2`) per word and char ngram above.
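
To make that lookup concrete, here is a small illustrative sketch of the scheme described above (not floret's actual code: Python's built-in `hash` stands in for floret's murmurhash, and the table values are random):

```python
import numpy as np

def char_ngrams(word, minn=4, maxn=5):
    """The word plus its char n-grams, padded with BOW/EOW as described above."""
    padded = "<" + word + ">"
    grams = [padded]  # the whole word is stored as an entry too
    for n in range(minn, maxn + 1):
        grams += [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams

def word_vector(word, table, hash_count=2):
    """Sum hash_count hashed rows per word/char-ngram entry (-hashCount 2)."""
    bucket, dim = table.shape
    vec = np.zeros(dim, dtype="float32")
    for gram in char_ngrams(word):
        for seed in range(hash_count):
            vec += table[hash((seed, gram)) % bucket]
    return vec

table = np.random.default_rng(0).standard_normal((50000, 300), dtype="float32")
print(char_ngrams("apple"))  # ['<apple>', '<app', 'appl', 'pple', 'ple>', '<appl', 'apple', 'pple>']
print(word_vector("apple", table).shape)  # (300,)
```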

- With `-hashOnly`, `floret` will save an additional vector table with the
+ With `-mode floret`, `floret` will save an additional vector table with the
ending `.floret` alongside the usual `.bin` and `.vec` files. The format is
very similar to `.vec` with a header line followed by one line per entry in the
vector table with the row index rather than a word token at the beginning of
8 changes: 4 additions & 4 deletions python/README.md
@@ -19,8 +19,8 @@ pip install floret

Train floret vectors using the options:

- - `hashOnly`: if `True`, train floret vectors, storing both words and subwords
- in the same compact hash table
+ - `mode`: `"floret"`, storing both words and subwords in the same compact hash
+ table
- `hashCount`: store each entry in 1-4 rows in the hash table (recommended:
`2`)
- `bucket`: in combination with `hashCount>1`, the size of the hash table can
@@ -36,7 +36,7 @@ import floret
model = floret.train_unsupervised(
"data.txt",
model="cbow",
- hashOnly=True,
+ mode="floret",
hashCount=2,
bucket=50000,
minn=3,
@@ -56,7 +56,7 @@ model.save_vectors("vectors.vec")
model.save_hash_only_vectors("vectors.floret")
```

- **Note:** with the default setting `hashOnly=False`, `floret` trains original
+ **Note:** with the default setting `mode="fasttext"`, `floret` trains original
fastText vectors.
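
For example, a minimal sketch contrasting the two modes with the options documented above (parameter values here are arbitrary):

```python
import floret

# Default mode: original fastText vectors
ft_model = floret.train_unsupervised("data.txt", model="cbow")

# floret mode: words and subwords share one compact hash table
fl_model = floret.train_unsupervised(
    "data.txt", model="cbow", mode="floret", hashCount=2, bucket=50000
)
fl_model.save_hash_only_vectors("vectors.floret")
```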

## Use floret vectors in spaCy
21 changes: 17 additions & 4 deletions python/floret_module/floret/floret.py
@@ -17,6 +17,7 @@

loss_name = floret.loss_name
model_name = floret.model_name
+ mode_name = floret.mode_name
EOS = "</s>"
BOW = "<"
EOW = ">"
@@ -102,7 +103,7 @@ def set_args(self, args=None):
'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams',
'loss', 'bucket', 'thread', 'lrUpdateRate', 't',
'label', 'verbose', 'pretrainedVectors',
- 'hashOnly', 'hashCount']
+ 'mode', 'hashCount']
for arg_name in arg_names:
setattr(self, arg_name, getattr(args, arg_name))

@@ -416,9 +417,19 @@ def _parse_loss_string(string):
raise ValueError("Unrecognized loss name")


+ def _parse_mode_string(string):
+ if string == "fasttext":
+ return mode_name.fasttext
+ elif string == "floret":
+ return mode_name.floret
+ else:
+ raise ValueError("Unrecognized mode name")


def _build_args(args, manually_set_args):
args["model"] = _parse_model_string(args["model"])
args["loss"] = _parse_loss_string(args["loss"])
+ args["mode"] = _parse_mode_string(args["mode"])
if type(args["autotuneModelSize"]) == int:
args["autotuneModelSize"] = str(args["autotuneModelSize"])

@@ -429,8 +440,10 @@ def _build_args(args, manually_set_args):
a.setManual(k)
a.output = "" # User should use save_model
a.saveOutput = 0 # Never use this
- if a.wordNgrams <= 1 and a.maxn == 0:
+ if a.wordNgrams <= 1 and a.maxn == 0 and len(a.autotuneValidationFile) == 0 and a.mode != mode_name.floret:
a.bucket = 0
+ if a.mode != "floret":
+ a.hashCount = 1
return a


@@ -455,7 +468,7 @@ def load_model(path):
'minCountLabel': 0,
'minn': 3,
'maxn': 6,
- 'hashOnly': False,
+ 'mode': "fasttext_mode",
'hashCount': 1,
'neg': 5,
'wordNgrams': 1,
@@ -557,7 +570,7 @@ def train_unsupervised(*kargs, **kwargs):
"""
arg_names = ['input', 'model', 'lr', 'dim', 'ws', 'epoch', 'minCount',
'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams', 'loss',
- 'bucket', 'hashCount', 'hashOnly', 'thread', 'lrUpdateRate',
+ 'bucket', 'hashCount', 'mode', 'thread', 'lrUpdateRate',
't', 'label', 'verbose', 'pretrainedVectors']
args, manually_set_args = read_args(kargs, kwargs, arg_names,
unsupervised_default)
7 changes: 6 additions & 1 deletion python/floret_module/floret/pybind/floret_pybind.cc
@@ -104,7 +104,7 @@ PYBIND11_MODULE(floret_pybind, m) {
.def_readwrite("bucket", &fasttext::Args::bucket)
.def_readwrite("minn", &fasttext::Args::minn)
.def_readwrite("maxn", &fasttext::Args::maxn)
- .def_readwrite("hashOnly", &fasttext::Args::hashOnly)
+ .def_readwrite("mode", &fasttext::Args::mode)
.def_readwrite("hashCount", &fasttext::Args::hashCount)
.def_readwrite("thread", &fasttext::Args::thread)
.def_readwrite("t", &fasttext::Args::t)
@@ -157,6 +157,11 @@ PYBIND11_MODULE(floret_pybind, m) {
fasttext::metric_name::recallAtPrecisionLabel)
.export_values();

+ py::enum_<fasttext::mode_name>(m, "mode_name")
+ .value("fasttext", fasttext::mode_name::fasttext)
+ .value("floret", fasttext::mode_name::floret);
+ // not exported into the parent scope because the names clash

m.def(
"train",
[](fasttext::FastText& ft, fasttext::Args& a) {
2 changes: 1 addition & 1 deletion setup.py
@@ -22,7 +22,7 @@
import io
import pybind11

- __version__ = '0.10.0.dev0'
+ __version__ = '0.10.0.dev1'
FASTTEXT_SRC = "src"

# Based on https://github.com/pybind/python_example
43 changes: 30 additions & 13 deletions src/args.cc
@@ -31,7 +31,7 @@ Args::Args() {
bucket = 2000000;
minn = 3;
maxn = 6;
- hashOnly = false;
+ mode = mode_name::fasttext;
hashCount = 1;
thread = 12;
lrUpdateRate = 100;
@@ -107,6 +107,16 @@ std::string Args::metricToString(metric_name mn) const {
return "Unknown metric name!"; // should never happen
}

+ std::string Args::modeToString(mode_name mn) const {
+ switch (mn) {
+ case mode_name::fasttext:
+ return "fasttext";
+ case mode_name::floret:
+ return "floret";
+ }
+ return "Unknown mode name!"; // should never happen
+ }

void Args::parseArgs(const std::vector<std::string>& args) {
std::string command(args[1]);
if (command == "supervised") {
@@ -175,9 +185,16 @@ void Args::parseArgs(const std::vector&lt;std::string&gt;&amp; args) {
minn = std::stoi(args.at(ai + 1));
} else if (args[ai] == "-maxn") {
maxn = std::stoi(args.at(ai + 1));
- } else if (args[ai] == "-hashOnly") {
- hashOnly = true;
- ai--;
+ } else if (args[ai] == "-mode") {
+ if (std::string(args.at(ai + 1)) == "fasttext") {
+ mode = mode_name::fasttext;
+ } else if (std::string(args.at(ai + 1)) == "floret"){
+ mode = mode_name::floret;
+ } else {
+ std::cerr << "Unknown mode: " << args.at(ai + 1) << std::endl;
+ printHelp();
+ exit(EXIT_FAILURE);
+ }
} else if (args[ai] == "-hashCount") {
hashCount = std::stoi(args.at(ai + 1));
if (hashCount < 1 || hashCount >= 5) {
@@ -241,10 +258,10 @@ void Args::parseArgs(const std::vector&lt;std::string&gt;&amp; args) {
printHelp();
exit(EXIT_FAILURE);
}
- if (wordNgrams <= 1 && maxn == 0 && !hasAutotune() && !hashOnly) {
+ if (wordNgrams <= 1 && maxn == 0 && !hasAutotune() && mode != mode_name::floret) {
bucket = 0;
}
- if (!hashOnly) {
+ if (mode != mode_name::floret) {
hashCount = 1;
}
}
@@ -278,9 +295,9 @@ void Args::printDictionaryHelp() {
<< "]\n"
<< " -maxn max length of char ngram [" << maxn
<< "]\n"
- << " -hashOnly both word and char ngrams hashed only in buckets ["
- << boolToString(hashOnly) << "]\n"
- << " -hashCount with hashOnly: number of hashes (1-4) per word / subword ["
+ << " -mode fasttext (default) or floret (word and char ngrams hashed in buckets) ["
+ << "fasttext" << "]\n"
+ << " -hashCount floret mode only: number of hashes (1-4) per word/subword ["
<< hashCount << "]\n"
<< " -t sampling threshold [" << t << "]\n"
<< " -label labels prefix [" << label << "]\n";
@@ -353,7 +370,7 @@ void Args::save(std::ostream&amp; out) {
out.write((char*)&(bucket), sizeof(int));
out.write((char*)&(minn), sizeof(int));
out.write((char*)&(maxn), sizeof(int));
- out.write((char*)&(hashOnly), sizeof(bool));
+ out.write((char*)&(mode), sizeof(mode_name));
out.write((char*)&(hashCount), sizeof(int));
out.write((char*)&(lrUpdateRate), sizeof(int));
out.write((char*)&(t), sizeof(double));
@@ -371,7 +388,7 @@ void Args::load(std::istream&amp; in) {
in.read((char*)&(bucket), sizeof(int));
in.read((char*)&(minn), sizeof(int));
in.read((char*)&(maxn), sizeof(int));
- in.read((char*)&(hashOnly), sizeof(bool));
+ in.read((char*)&(mode), sizeof(mode_name));
in.read((char*)&(hashCount), sizeof(int));
in.read((char*)&(lrUpdateRate), sizeof(int));
in.read((char*)&(t), sizeof(double));
@@ -400,8 +417,8 @@ void Args::dump(std::ostream&amp; out) const {
<< " " << minn << std::endl;
out << "maxn"
<< " " << maxn << std::endl;
- out << "hashOnly"
- << " " << hashOnly << std::endl;
+ out << "mode"
+ << " " << modeToString(mode) << std::endl;
out << "hashCount"
<< " " << hashCount << std::endl;
out << "lrUpdateRate"
4 changes: 3 additions & 1 deletion src/args.h
@@ -26,12 +26,14 @@ enum class metric_name : int {
recallAtPrecision,
recallAtPrecisionLabel
};
+ enum class mode_name : int { fasttext = 1, floret };

class Args {
protected:
std::string boolToString(bool) const;
std::string modelToString(model_name) const;
std::string metricToString(metric_name) const;
+ std::string modeToString(mode_name) const;
std::unordered_set<std::string> manualArgs_;

public:
@@ -52,7 +54,7 @@ class Args {
int bucket;
int minn;
int maxn;
- bool hashOnly;
+ mode_name mode;
int hashCount;
int thread;
double t;
8 changes: 4 additions & 4 deletions src/dictionary.cc
@@ -109,7 +109,7 @@ void Dictionary::getSubwords(
int32_t i = getId(word);
ngrams.clear();
substrings.clear();
- if (!args_->hashOnly && i >= 0) {
+ if (args_->mode != mode_name::floret && i >= 0) {
ngrams.push_back(i);
substrings.push_back(words_[i].word);
}
@@ -183,7 +183,7 @@ void Dictionary::computeSubwords(
const std::string& word,
std::vector<int32_t>& ngrams,
std::vector<std::string>* substrings) const {
- if (args_->hashOnly) {
+ if ((args_->mode == mode_name::floret)) {
std::vector<uint32_t> hashes;
murmurhash(word, &hashes);
for (uint32_t hash : hashes) {
@@ -205,7 +205,7 @@ void Dictionary::computeSubwords(
ngram.push_back(word[j++]);
}
if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
- if (args_->hashOnly) {
+ if ((args_->mode == mode_name::floret)) {
std::vector<uint32_t> hashes;
murmurhash(ngram, &hashes);
for (size_t i = 0; i < hashes.size(); i++) {
@@ -233,7 +233,7 @@ void Dictionary::initNgrams() {
computeSubwords(word, words_[i].subwords);
}
// remove word-index subword for all words except 0 (</s>)
- if (args_->hashOnly && i > 0) {
+ if ((args_->mode == mode_name::floret) && i > 0) {
words_[i].subwords.erase(words_[i].subwords.begin());
}
}
4 changes: 2 additions & 2 deletions src/fasttext.cc
@@ -96,7 +96,7 @@ int32_t FastText::getWordId(const std::string&amp; word) const {
}

int32_t FastText::getSubwordId(const std::string& subword) const {
- if (args_->hashOnly) {
+ if (args_->mode == mode_name::floret) {
return -1;
} else {
int32_t h = dict_->hash(subword) % args_->bucket;
@@ -125,7 +125,7 @@ void FastText::getWordVector(Vector&amp; vec, const std::string&amp; word) const {

void FastText::getSubwordVector(Vector& vec, const std::string& subword) const {
vec.zero();
- if (args_->hashOnly) {
+ if (args_->mode == mode_name::floret) {
std::vector<uint32_t> hashes;
dict_->murmurhash(subword, &hashes);
for (size_t i = 0; i < hashes.size(); i++) {
2 changes: 1 addition & 1 deletion src/main.cc
@@ -381,7 +381,7 @@ void train(const std::vector&lt;std::string&gt; args) {
}
fasttext->saveModel(outputFileName);
fasttext->saveVectors(a.output + ".vec");
- if (a.hashOnly) {
+ if (a.mode == mode_name::floret) {
fasttext->saveHashOnlyVectors(a.output + ".floret");
}
if (a.saveOutput) {
