MinHash query doesn't return jaccard similarity metric #159

Closed
harshit-2115 opened this issue May 17, 2021 · 9 comments

harshit-2115 commented May 17, 2021

Hi,

When we query a MinHashLSH, the keys above the specified threshold are returned, but the Jaccard similarity for each key is not.

Is there a way to make a MinHashLSH query return the keys together with their similarity? It would be good to know how similar they are.

Thanks


surkova commented May 19, 2021

Hey

Since the API doesn't seem to support this, we ended up doing the lookup ourselves: you can call m1.jaccard(m2).

In our case we use a base64 representation of the MinHash as its key in the LSH, so we can deserialize each result and compute the Jaccard similarity. I don't think there's a get_by_key in the LSH.

Owner

ekzhu commented Jun 2, 2021

+1 @surkova

Just compute the Jaccard for the ones you get. You may also want to filter out false-positive results because the LSH doesn't guarantee correctness.
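
A minimal sketch of that advice, assuming you keep your own dictionary from key to MinHash alongside the index (as noted above, the LSH does not appear to offer a lookup by key). The helper name `score_candidates` and the `store` dictionary are hypothetical; the only library calls taken from this thread are the LSH `query` and `jaccard`. Note that `jaccard` on a MinHash is an estimate, not the exact set similarity.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple


def score_candidates(
    query_minhash: Any,
    candidate_keys: Iterable[Hashable],
    store: Dict[Hashable, Any],
    threshold: float,
) -> List[Tuple[Hashable, float]]:
    """Estimate Jaccard for each candidate key and drop those below threshold.

    `store` is a hypothetical side dictionary mapping each key to the
    MinHash inserted under it. Objects only need a `.jaccard()` method.
    """
    scored = []
    for key in candidate_keys:
        similarity = query_minhash.jaccard(store[key])
        if similarity >= threshold:  # discard LSH false positives
            scored.append((key, similarity))
    # Most similar candidates first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


def demo() -> None:
    """Example usage with datasketch (needs `pip install datasketch`)."""
    from datasketch import MinHash, MinHashLSH

    def minhash_of(words):
        m = MinHash(num_perm=128)
        for word in words:
            m.update(word.encode("utf-8"))
        return m

    docs = {
        "a": "minhash lsh query returns keys".split(),
        "b": "minhash lsh query returns candidate keys".split(),
        "c": "completely unrelated set of words".split(),
    }
    # Side dictionary: the same MinHash objects that go into the index.
    store = {key: minhash_of(words) for key, words in docs.items()}

    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for key, m in store.items():
        lsh.insert(key, m)

    query = minhash_of("minhash lsh query returns keys".split())
    for key, similarity in score_candidates(query, lsh.query(query), store, 0.5):
        print(key, round(similarity, 3))
```

Calling `demo()` builds a small index and prints each surviving key with its estimated similarity. Reusing the LSH threshold as the filter cutoff is a design choice, not a requirement; a stricter or looser cutoff works the same way.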


surkova commented Jun 2, 2021

The false positive results here are the ones that return Jaccard similarity lower than the LSH threshold?

Owner

ekzhu commented Jun 2, 2021

Yes.


hamedmirzaei commented

@surkova
Would you please share some code on how you serialize and deserialize the minhash object?


surkova commented Sep 20, 2021

@hamedmirzaei

After some experimentation, we found that the most convenient way for us is to work with the MinHash serialized into a base64-encoded string, as you can write those to any database, JSON, or what have you.

Suppose you have a LeanMinHash lean_minhash:

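import base64
from datasketch import LeanMinHash

# Serialize the LeanMinHash into a byte buffer, then base64-encode it as text.
buffer = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buffer)
buf_str_64 = base64.b64encode(buffer)
result = buf_str_64.decode("utf-8")

Then the other way around, from base64-encoded string to LeanMinHash:

# minhash_base64 is a string produced as above; decode it and rebuild the object.
minhash_bytea = bytearray(base64.b64decode(minhash_base64))
result = LeanMinHash.deserialize(minhash_bytea)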


hamedmirzaei commented Sep 23, 2021

@surkova
Thank you so much. I didn't know about LeanMinHash at all.
One important question: how big would these serialized keys be in terms of memory usage? Compared with the naive approach of just keeping a map from integer to MinHash objects in RAM, does it give us any benefit?


surkova commented Sep 24, 2021

The size of the base64 string depends on the size of the original data, and the size of the data depends on the number of permutations you use. In our case we chose 128 permutations, which serializes into a string of 700 bytes.
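
The 700-byte figure can be reproduced with a little arithmetic. This is a sketch under an assumption: that the serialized buffer holds an 8-byte seed, a 4-byte count, and 4 bytes per hash value. That layout is consistent with the number quoted above, but check `bytesize()` on your own objects. The function names are hypothetical.

```python
import base64


def serialized_size(num_perm: int) -> int:
    """Assumed raw buffer size: 8-byte seed + 4-byte count + 4 bytes per hash."""
    return 8 + 4 + 4 * num_perm


def base64_length(num_bytes: int) -> int:
    """Length of padded base64 text: 4 characters per 3-byte group."""
    return 4 * ((num_bytes + 2) // 3)


raw = serialized_size(128)          # 524 bytes under the assumed layout
predicted = base64_length(raw)      # 700 characters

# Cross-check the formula against the real encoder using a zero-filled buffer.
actual = len(base64.b64encode(bytes(raw)))
print(raw, predicted, actual)
```

Under the same assumption, 256 permutations would give 1036 raw bytes and 1384 base64 characters, so the string grows roughly linearly with the number of permutations and carries about one third overhead over the raw buffer.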

Whether to serialize really depends on the architecture of your project. For some (many!), serialization is not necessary at all.

hamedmirzaei commented

@surkova
Thanks, I've got my answers. Appreciate your help. Good luck.
