swapped stopwords? for remove-stopwords? for more clarity
added predicates to check user input into core api to avoid null-pointer exceptions on invalid input
convert features to strings before minhashing
swapped remaining print statements for log statements.
andrewmcloud committed Jan 4, 2018
1 parent 5230e40 commit b8e35ea
Showing 8 changed files with 136 additions and 67 deletions.
27 changes: 14 additions & 13 deletions README.md
@@ -4,12 +4,12 @@

consimilo is a library that utilizes locality sensitive hashing (implemented as lsh-forest) and minhashing, to support
*top-k* similar item queries. Finding similar items across expansive data-sets is a common problem that presents itself
in many real world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering,
context filtering, document similarity, etc.). Searching a corpus for *top-k* similar items quickly grows to
an unwieldy complexity at relatively small corpus sizes *(n choose 2)*. LSH reduces the search space by "hashing" items
in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed the lsh-forest
supports a *top-k* most similar items query of ~*O(log n)*. There is an accuracy trade-off that comes with the enormous
increase in query speed. More information can be found in chapter 3 of
in many real world applications (e.g. finding articles from the same source, plagiarism detection, collaborative
filtering, context filtering, document similarity, etc.). Searching a corpus for *top-k* similar items quickly grows
to an unwieldy complexity at relatively small corpus sizes *(n choose 2)*. LSH reduces the search space by "hashing"
items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed the
lsh-forest supports a *top-k* most similar items query of ~*O(log n)*. There is an accuracy trade-off that comes with
the enormous increase in query speed. More information can be found in chapter 3 of
[Mining Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).
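
As a quick sketch of the minhash property this relies on, the fraction of positions in which two minhash signatures agree approximates the Jaccard similarity of the underlying token sets, which is what lets the lsh-forest bucket similar items together. A minimal REPL illustration, assuming the `consimilo.minhash` and `consimilo.text-processing` namespaces changed in this commit (the strings and the printed ratio are hypothetical):

```clojure
(require '[consimilo.minhash :refer [build-minhash]]
         '[consimilo.text-processing :refer [tokenize-text]])

;; Similar token sets produce minhash signatures that agree in many positions.
(let [m1 (build-minhash (tokenize-text "locality sensitive hashing finds similar items"))
      m2 (build-minhash (tokenize-text "locality sensitive hashing finds similar documents"))]
  ;; Fraction of matching positions ~= Jaccard similarity of the two token sets.
  (double (/ (count (filter true? (map = m1 m2)))
             (count m1))))
;; => 0.7 (illustrative value)
```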

## Getting Started
@@ -50,9 +50,10 @@ offline and replace the production forest.

#### Adding strings and files to a forest (helper functions)

consimilo provides helper functions for constructing feature vectors from strings and files. By default, a new forest is
created and stopwords are removed. You may add to an existing forest and/or include stopwords via optional parameters
`:forest` `:stopwords`. The optional parameters are defaulted to `:forest (new-forest)` `:stopwords? true`.
consimilo provides helper functions for constructing feature vectors from strings and files. By default, a new forest
is created and stopwords are removed. You may add to an existing forest and/or keep stopwords via the optional
parameters `:forest` and `:remove-stopwords?`, which default to `:forest (new-forest)` and `:remove-stopwords? true`.

##### Adding documents/strings to a forest

@@ -70,7 +71,7 @@ To add a collection of strings to an **existing** forest and **do not remove** stopwords
```clojure
(consimilo/add-strings-to-forest [{:id id1 :features "my sample string 1"}
{:id id2 :features "my sample string 2"}]
:forest my-forest ;;updates my-forest in place
:stopwords? false))
:remove-stopwords? false))
```
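
With the defaults (a new forest, stopwords removed) the same call reduces to a minimal sketch like the one below, reusing the hypothetical `id1`/`id2` bindings from the example above:

```clojure
(def my-forest
  (consimilo/add-strings-to-forest [{:id id1 :features "my sample string 1"}
                                    {:id id2 :features "my sample string 2"}]))
```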

##### Adding files to a forest
@@ -100,7 +101,7 @@ Once you have your forest `my-forest` built, you can query for `k` most similar
#### Querying a forest with strings and files (helper functions)

consimilo provides helper functions for querying the forest with strings and files. The helper functions `query-string`
and `query-file` have an optional parameter `:stopwords?` which is defaulted `true`, removing stopwords. Queries
and `query-file` have an optional parameter `:remove-stopwords?`, which defaults to `true`, removing stopwords. Queries
against strings and files should be made using the same tokenization scheme used to input items in the forest
(stopwords present or removed).
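
A minimal query sketch under those defaults; `my-forest`, `k`, and the query inputs are hypothetical, and `:remove-stopwords? false` can be passed as documented above when the forest was built with stopwords kept:

```clojure
;; ids of the 5 items most similar to the query text (stopwords removed by default).
(consimilo/query-string my-forest 5 "my sample query string")

;; The same kind of query against a file's extracted text.
(consimilo/query-file my-forest 5 (clojure.java.io/file "docs/query.txt"))
```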

@@ -127,8 +128,8 @@
consimilo provides functions for calculating approximate distance / similarity between the query and *top-k* results.
The function `similarity-k` accepts optional parameters to specify which distance / similarity function should be used.
For calculating Jaccard similarity, use: `:sim-fn :jaccard`, for calculating Hamming distance, use: `:sim-fn :hamming`,
and for calculating cosine distance, use: `:sim-fn :cosine`. `similarity-k` returns a hash-map, `keys` are the *top-k* ids and
`vals` are the similarity / distance. As with the other query functions, queries against strings and files
and for calculating cosine distance, use: `:sim-fn :cosine`. `similarity-k` returns a hash-map, `keys` are the *top-k* ids
and `vals` are the similarity / distance. As with the other query functions, queries against strings and files
should be made using the same tokenization scheme used to input the items in the forest (stopwords present or removed).
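
A minimal sketch of such a call; the ids, returned values, and the `my-forest` binding are hypothetical, and the arguments follow the `similarity-k` multimethod defined in `core.clj` below:

```clojure
;; Jaccard similarity is the default :sim-fn.
(consimilo/similarity-k my-forest 3 "my sample query string")
;; => {:id1 0.81 :id2 0.57 :id3 0.33} ; top-3 ids -> Jaccard similarity (illustrative)

;; Cosine distance instead, querying with a file.
(consimilo/similarity-k my-forest 3 (clojure.java.io/file "docs/query.txt") :sim-fn :cosine)
```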

77 changes: 45 additions & 32 deletions src/consimilo/core.clj
@@ -2,6 +2,9 @@
(:require [consimilo.lsh-forest :refer [new-forest
add-lsh!
index!]]
[consimilo.lsh-util :refer [valid-input-add-strings?
valid-input-add-files?
valid-input?]]

[consimilo.minhash :refer [build-minhash]]
[consimilo.minhash-util :refer [jaccard-similarity
@@ -11,7 +14,8 @@
[consimilo.lsh-query :refer [query]]
[consimilo.text-processing :refer [tokenize-text
extract-text]]
[taoensso.nippy :as nippy]))
[taoensso.nippy :as nippy]
[clojure.tools.logging :as log]))

(defn add-all-to-forest
"Adds each vector in `feature-coll` to an lsh forest and returns the forest.
@@ -27,9 +31,13 @@
([feature-coll]
(add-all-to-forest (new-forest) feature-coll))
([forest feature-coll]
(dorun (pmap #(add-lsh! forest (:id %) (build-minhash (:features %))) feature-coll))
(index! forest)
forest))
(if (valid-input? feature-coll)
(do
(dorun (pmap #(add-lsh! forest (:id %) (build-minhash (:features %))) feature-coll))
(index! forest)
forest)
(log/warn "invalid input, feature-coll must be a collection of maps, each having keys :id and :features;
:features must be a collection"))))
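
For context, a hedged REPL sketch of the new guard, run in this namespace (the input map and the exact log line are illustrative): instead of a NullPointerException deeper in the minhashing pipeline, the call logs a warning and returns nil.

```clojure
;; The :features key is missing, so valid-input? fails and nothing is added.
(add-all-to-forest [{:id 1 :feat [1 2 3]}])
;; WARN consimilo.core - invalid input, feature-coll must be a collection of maps, ...
;; => nil
```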

(defn add-strings-to-forest
"Convenience method for processing documents. Each item of feature-coll should be a map with
@@ -38,17 +46,20 @@
parameters. The feature vector will be minhashed and inserted into the lsh-forest.
Optional Keyword Arguments: :forest - add to an existing forest; default: create new forest
:stopwords? - if true: remove stopwords; default: true
:remove-stopwords? - if true: remove stopwords; default: true
Note: items should be loaded into the forest as few times as possible in large chunks. An expensive
sort is called after items are added to the forest to enable ~log(n) queries.

[feature-coll & {:keys [forest stopwords?]
:or {forest (new-forest) stopwords? true}}]
(add-all-to-forest forest
(map #(assoc % :features
(tokenize-text (:features %)))
feature-coll)))
[feature-coll & {:keys [forest remove-stopwords?]
:or {forest (new-forest) remove-stopwords? true}}]
(if (valid-input-add-strings? feature-coll)
(add-all-to-forest forest
(map #(assoc % :features
(tokenize-text (:features %)))
feature-coll))
(log/warn "invalid input, feature-coll must be a collection of maps, each having keys :id and :features;
:features must be a string")))

(defn add-files-to-forest
"Convenience method for processing files. Files should be a collection of File objects.
@@ -57,16 +68,18 @@
parameters. The feature vector is minhashed and inserted into the lsh-forest.
Optional Keyword Arguments: :forest - add to an existing forest; default: create new forest
:stopwords? - if true: remove stopwords; default: true
:remove-stopwords? - if true: remove stopwords; default: true
Note: items should be loaded into the forest as few times as possible in large chunks. An expensive
sort is called after items are added to the forest to enable ~log(n) queries.
[files & {:keys [forest stopwords?]
:or {forest (new-forest) stopwords? true}}]
(add-strings-to-forest (map (fn [f] {:id (.getName f)
:features (extract-text f)})
files)
:forest forest))
[files & {:keys [forest remove-stopwords?]
:or {forest (new-forest) remove-stopwords? true}}]
(if (valid-input-add-files? files)
(add-strings-to-forest (map (fn [f] {:id (.getName f)
:features (extract-text f)})
files)
:forest forest)
(log/warn "invalid input, files must be a collection of file objects")))

(defn query-forest
"Finds the closest `k` vectors to vector `v` stored in the `forest`."
@@ -80,11 +93,11 @@
parameters. The feature vector is minhashed and used to query the forest. K is the number of results
(top-k most similar items).
Optional Keyword Arguments: :stopwords? - if true: remove stopwords; default: true
Optional Keyword Arguments: :remove-stopwords? - if true: remove stopwords; default: true
Note: for best results query the forest utilizing the same tokenization scheme used to create it"
[forest k string & {:keys [stopwords?]
:or {stopwords? true}}]
[forest k string & {:keys [remove-stopwords?]
:or {remove-stopwords? true}}]
(query-forest forest
k
(tokenize-text string)))
@@ -95,11 +108,11 @@
per the optional arguments. The feature vector is minhashed and used to query the forest. k is the number
of results (top-k most similar items).
Optional Keyword Arguments: :stopwords? - if true: remove stopwords; default: true
Optional Keyword Arguments: :remove-stopwords? - if true: remove stopwords; default: true
Note: for best results query the forest utilizing the same tokenization scheme used to create it"
[forest k file & {:keys [stopwords?]
:or {stopwords? true}}]
[forest k file & {:keys [remove-stopwords?]
:or {remove-stopwords? true}}]
(query-string forest
k
(extract-text file)))
@@ -109,27 +122,27 @@
similarity functions are Jaccard similarity, cosine distance, and Hamming distance. sim-fn defaults to :jaccard,
but can be overridden by passing the optional :sim-fn key with :jaccard, :cosine, or :hamming. similarity-k dispatches
based on input: string, file, or feature-vector.
(fn [forest k input & {:keys [sim-fn stopwords?]
:or {sim-fn :jaccard stopwords? true}}]
(fn [forest k input & {:keys [sim-fn remove-stopwords?]
:or {sim-fn :jaccard remove-stopwords? true}}]
(cond
(coll? input) :feature-vec
(string? input) :string
:else :file)))

(defmethod similarity-k :string
[forest k string & {:keys [sim-fn stopwords?]
:or {sim-fn :jaccard stopwords? true}}]
(let [return (query-string forest k string :stopwords? stopwords?)
[forest k string & {:keys [sim-fn remove-stopwords?]
:or {sim-fn :jaccard remove-stopwords? true}}]
(let [return (query-string forest k string :remove-stopwords? remove-stopwords?)
f (condp = sim-fn
:jaccard jaccard-similarity
:cosine cosine-distance
:hamming hamming-distance)]
(zip-similarity forest return f)))

(defmethod similarity-k :file
[forest k file & {:keys [sim-fn stopwords?]
:or {sim-fn :jaccard stopwords? true}}]
(let [return (query-file forest k file :stopwords? stopwords?)
[forest k file & {:keys [sim-fn remove-stopwords?]
:or {sim-fn :jaccard remove-stopwords? true}}]
(let [return (query-file forest k file :remove-stopwords? remove-stopwords?)
f (condp = sim-fn
:jaccard jaccard-similarity
:cosine cosine-distance
9 changes: 5 additions & 4 deletions src/consimilo/lsh_forest.clj
@@ -10,7 +10,8 @@
sort-tree
trees]]
[consimilo.lsh-query :refer [query]]
[config.core :refer [env]]))
[config.core :refer [env]]
[clojure.tools.logging :as log]))

(defn new-forest
"Create new empty initialized forest structure."
@@ -23,8 +24,8 @@
"add minhash to lsh-forest. key must be a string, will be converted to keyword"
[forest key minhash]
(cond
(get-in @forest [:keys (keywordize key)]) (print "key already added to hash")
(< (count minhash) hashrange) (print "minhash is not correct permutation size")
(get-in @forest [:keys (keywordize key)]) (log/warn "key already added to hash")
(< (count minhash) hashrange) (log/warn "minhash is not correct permutation size")
:else (plant-trees! forest key (slice-minhash minhash hashranges))))

(defn index!
Expand All @@ -39,4 +40,4 @@
"search lsh-forest for top k most similar items, utilizes binary search.
index! must be called prior to build the sorted hashes."
[forest minhash k-items]
(query forest minhash k-items))
(query forest minhash k-items))
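
To make the division of labor concrete, a minimal sketch of the low-level path that `core/add-all-to-forest` wraps; the ids and token vectors are hypothetical:

```clojure
(require '[consimilo.lsh-forest :refer [new-forest add-lsh! index!]]
         '[consimilo.minhash :refer [build-minhash]])

(def forest (new-forest))

;; Keys are strings here; add-lsh! keywordizes them internally.
(add-lsh! forest "doc-1" (build-minhash ["tokens" "for" "document" "one"]))
(add-lsh! forest "doc-2" (build-minhash ["tokens" "for" "document" "two"]))

;; One expensive sort after the batch of adds enables ~log(n) queries.
(index! forest)
```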
30 changes: 29 additions & 1 deletion src/consimilo/lsh_util.clj
@@ -58,4 +58,32 @@
the first is the start of the bucket range and the second
is the end of that bucket."
[minhash hashranges]
(mapv #(slice (first %) (last %) minhash) hashranges))
(mapv #(slice (first %) (last %) minhash) hashranges))

(defn valid-input?
"validates the input of add-*-to-forest functions"
[feature-coll]
(and (->> feature-coll
(map #(and (contains? % :id) (contains? % :features)))
(every? true?))
(->> feature-coll
(map #(coll? (:features %)))
(every? true?))))

(defn valid-input-add-strings?
"validates the input of add-*-to-forest functions"
[feature-coll]
(and (->> feature-coll
(map #(and (contains? % :id) (contains? % :features)))
(every? true?))
(->> feature-coll
(map #(string? (:features %)))
(every? true?))))

(defn valid-input-add-files?
"validates the input of add-*-to-forest functions"
[files]
(and (coll? files)
(->> files
(map #(instance? java.io.File %))
(every? true?))))
2 changes: 1 addition & 1 deletion src/consimilo/minhash.clj
@@ -41,7 +41,7 @@
of documents with varying feature sizes. One minhash should be created for
each document"
[hashvalues feature]
(let [hv (get-hash-bigint feature)
(let [hv (get-hash-bigint (str feature))
a (:a permutations)
b (:b permutations)]
(-> (scalar-mul a hv)
12 changes: 6 additions & 6 deletions src/consimilo/text_processing.clj
@@ -10,18 +10,18 @@
(def ^:private stopwords (set (split-lines (slurp (io/resource "stopwords.txt")))))

(defn- remove-stopwords
"If stopwords?: returns tokenized-text with stopwords removed, else: returns tokenized-text unaltered"
[stopwords? tokenized-text]
(if stopwords?
"If remove-stopwords?: returns tokenized-text with stopwords removed, else: returns tokenized-text unaltered"
[remove-stopwords? tokenized-text]
(if remove-stopwords?
(remove stopwords tokenized-text)
tokenized-text))

(defn tokenize-text
"Tokenizes a string of text. If stopwords?: removes stopwords from token collection"
[text & {:keys [stopwords?] :or {stopwords? true}}]
"Tokenizes a string of text. If remove-stopwords?: removes stopwords from token collection"
[text & {:keys [remove-stopwords?] :or {remove-stopwords? true}}]
(->> (lower-case text)
tokenize
(remove-stopwords stopwords?)))
(remove-stopwords remove-stopwords?)))
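
A hedged sketch of the renamed option at the REPL; the token output is illustrative, since the real stopword list comes from `resources/stopwords.txt`:

```clojure
;; Default: lower-case, tokenize, drop stopwords.
(tokenize-text "The quick brown Fox jumps over the lazy dog")
;; => ("quick" "brown" "fox" "jumps" "lazy" "dog") ; illustrative

;; Keep stopwords when the forest was built that way.
(tokenize-text "The quick brown Fox jumps over the lazy dog" :remove-stopwords? false)
;; => ("the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog") ; illustrative
```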

;;Not currently used
(defn shingle
28 changes: 27 additions & 1 deletion test/consimilo/lsh_util_test.clj
@@ -77,4 +77,30 @@
sorted-vec [[0 1 2] [1 2 3] [2 3 4] [3 4 5] [4 5 6] [5 6 7] [6 7 8] [7 8 9] [8 9 0]]]
(testing "search for min"
(is (= 2
(private-pred-search #(>= (compare (get sorted-vec %) [2 3 4]) 0) (count sorted-vec)))))))
(private-pred-search #(>= (compare (get sorted-vec %) [2 3 4]) 0) (count sorted-vec)))))))

(deftest valid-input?-test
(testing "valid input, correct keys and :features is a collection"
(is (= true (valid-input? [{:id 1 :features [1]} {:id 2 :features [2]}]))))
(testing "invalid input, incorrect keys and :features is a collection"
(is (= false (valid-input? [{:id 1 :feat [1]} {:id 2 :features [2]}]))))
(testing "invalid input, correct keys but :features is not a collection"
(is (= false (valid-input? [{:id 1 :features [1]} {:id 2 :features 2}])))))

(deftest valid-input-add-strings?-test
(testing "valid input, correct keys and :features is a collection"
(is (= true (valid-input-add-strings? [{:id 1 :features "my name is andrew"} {:id 2 :features "i like clojure"}]))))
(testing "invalid input, incorrect keys and :features is a collection"
(is (= false (valid-input-add-strings? [{:id 1 :feat "my name is andrew"} {:id 2 :features "i like clojure"}]))))
(testing "invalid input, correct keys but :features is a collection instead of string"
(is (= false (valid-input-add-strings? [{:id 1 :features "my name is andrew"} {:id 2 :features [2]}])))))

(deftest valid-input-add-files?-test
(testing "valid input, multiple files in collection"
(is (= true (valid-input-add-files? [(clojure.java.io/as-file "t1")
(clojure.java.io/as-file "t1")
(clojure.java.io/as-file "t2")]))))
(testing "valid input, single file in collection"
(is (= true (valid-input-add-files? [(clojure.java.io/file "t1")]))))
(testing "invalid input, no files in collection"
(is (= false (valid-input-add-files? [{:id 1 :features [1]} {:id 2 :features 2}])))))