MinHash query doesn't return jaccard similarity metric #159
Comments
Hey. Since the API doesn't seem to support this, we ended up doing the lookup ourselves, you can do … In our case we use a base64 representation of the MinHash as its key in the LSH, so we can deserialize each returned key and compute the Jaccard similarity. I don't think there's a …
+1 @surkova. Just compute the Jaccard similarity for the keys you get back. You may also want to filter out false-positive results, because LSH doesn't guarantee correctness.
So the false-positive results here are the ones whose Jaccard similarity is lower than the LSH threshold?
Yes.
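To make the suggestion above concrete, here is a minimal sketch of re-scoring the keys returned by a query and dropping the false positives. The names `query_with_similarity`, `minhash_by_key`, and `demo` are hypothetical; the sketch assumes you keep your own mapping from key to MinHash, and it relies only on the `query` method of the LSH index and the `jaccard` method of the MinHash objects.

```python
def query_with_similarity(lsh, query_minhash, minhash_by_key, threshold=None):
    """Return (key, estimated Jaccard) pairs for an LSH query, best match first.

    minhash_by_key is a hypothetical dict you maintain yourself, mapping each
    inserted key to its MinHash. If threshold is given, candidates whose
    estimated Jaccard falls below it (LSH false positives) are dropped.
    """
    results = []
    for key in lsh.query(query_minhash):
        similarity = query_minhash.jaccard(minhash_by_key[key])
        if threshold is None or similarity >= threshold:
            results.append((key, similarity))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


def demo():
    """Illustrative usage; datasketch is imported lazily and must be installed."""
    from datasketch import MinHash, MinHashLSH

    def sketch(words):
        m = MinHash(num_perm=128)
        for word in words:
            m.update(word.encode("utf-8"))
        return m

    stored = {
        "doc1": sketch("the quick brown fox jumps".split()),
        "doc2": sketch("completely different words here".split()),
    }
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for key, m in stored.items():
        lsh.insert(key, m)

    query = sketch("the quick brown fox leaps".split())
    return query_with_similarity(lsh, query, stored, threshold=0.5)
```

The similarity reported here is the MinHash estimate, not the exact Jaccard of the underlying sets, so values close to the threshold may still land on either side of it.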
After some experimentation we found that the most convenient way for us is to work with the MinHash serialized into a base64-encoded string, since you can write those to any database, JSON, or what have you. Suppose you have a LeanMinHash named lean_minhash:

```python
import base64
from datasketch import LeanMinHash

# Serialize the LeanMinHash into a preallocated buffer, then base64-encode it.
buffer = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buffer)
buf_str_64 = base64.b64encode(buffer)
result = buf_str_64.decode("utf-8")
```

Then the other way around, from a base64-encoded string to a LeanMinHash:

```python
# minhash_base64 is a string produced by the encoding step.
minhash_bytea = bytearray(base64.b64decode(minhash_base64))
result = LeanMinHash.deserialize(minhash_bytea)
```
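The round trip above can be wrapped in small helpers and combined with the idea from earlier in the thread of using the base64 string itself as the LSH key. This is only a sketch: `minhash_to_b64`, `b64_to_minhash`, and `similarities_for_keys` are hypothetical names, and the `deserialize` parameter is an illustrative hook that defaults to `LeanMinHash.deserialize`.

```python
import base64


def minhash_to_b64(lean_minhash):
    """Serialize a LeanMinHash into a base64 text string."""
    buffer = bytearray(lean_minhash.bytesize())
    lean_minhash.serialize(buffer)
    return base64.b64encode(buffer).decode("utf-8")


def b64_to_minhash(minhash_base64, deserialize=None):
    """Rebuild a MinHash from its base64 string.

    deserialize is a hypothetical hook (handy for testing); when omitted,
    datasketch's LeanMinHash.deserialize is imported lazily and used.
    """
    if deserialize is None:
        from datasketch import LeanMinHash
        deserialize = LeanMinHash.deserialize
    return deserialize(bytearray(base64.b64decode(minhash_base64)))


def similarities_for_keys(query_minhash, b64_keys, deserialize=None):
    """Map each base64 key (e.g. from an LSH query) to its estimated Jaccard."""
    return {
        key: query_minhash.jaccard(b64_to_minhash(key, deserialize))
        for key in b64_keys
    }
```

Assuming each MinHash was inserted under `minhash_to_b64(lean_minhash)` as its key, `similarities_for_keys(query, lsh.query(query))` would then give the similarity for every returned key without a separate lookup table.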
The size of the base64 string depends on the size of the original data, and the size of the data depends on the number of permutations you use. In our case we chose 128 permutations, which serializes into a string of 700 bytes. Whether to serialize or not really depends on the architecture of your project. For some (many!), serialization is not necessary at all.
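As a rough check of the size figure above, the arithmetic can be sketched as follows. The byte layout used here (an 8-byte seed, a 4-byte count, and one 4-byte value per permutation) is an assumption about the serialized form rather than something stated in this thread, so it is worth comparing against `lean_minhash.bytesize()` on your installed version. The function names are hypothetical.

```python
import base64


def base64_length(num_bytes):
    """Length of the padded base64 encoding of num_bytes raw bytes."""
    return 4 * ((num_bytes + 2) // 3)


def estimated_serialized_bytes(num_perm, seed_bytes=8, count_bytes=4, hash_bytes=4):
    """Raw size under the ASSUMED layout: seed + count + one value per permutation."""
    return seed_bytes + count_bytes + num_perm * hash_bytes


raw = estimated_serialized_bytes(128)
encoded = base64_length(raw)

# Cross-check the length formula against the real encoder on a dummy buffer.
assert encoded == len(base64.b64encode(bytes(raw)))

print(raw, encoded)  # prints "524 700"
```

Under that assumed layout, 128 permutations come to 524 raw bytes and a 700-character base64 string, which agrees with the size reported above. The string length grows linearly with the number of permutations.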
Hi,
When we query a MinHashLSH, the keys above the specified threshold are returned, but the query doesn't return the Jaccard similarity for each key.
Is there a way to make a MinHash LSH query return the keys together with their similarity? It would be good to know how similar they are.
Thanks