Question about chroma vector database implementation. #11391

Closed
5 tasks done
zzw1123 opened this issue Dec 5, 2024 · 5 comments
Labels
🙋‍♂️ question This issue does not contain proper reproduce steps or it only has limited words without details.

Comments

@zzw1123

zzw1123 commented Dec 5, 2024

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.9.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

As far as I can see, the Chroma search is implemented as follows:

def search_by_vector(self, query_vector: list[float], **kwargs: Any) -> list[Document]:
    collection = self._client.get_or_create_collection(self._collection_name)
    results: QueryResult = collection.query(query_embeddings=query_vector, n_results=kwargs.get("top_k", 4))
    score_threshold = float(kwargs.get("score_threshold") or 0.0)
    ids: list[str] = results["ids"][0]
    documents: list[str] = results["documents"][0]
    metadatas: dict[str, Any] = results["metadatas"][0]
    distances: list[float] = results["distances"][0]
    docs = []
    for index in range(len(ids)):
        distance = distances[index]
        metadata = metadatas[index]
        if distance >= score_threshold:
            metadata["score"] = distance
            doc = Document(
                page_content=documents[index],
                metadata=metadata,
            )
            docs.append(doc)
    # Sort the documents by score in descending order
    docs = sorted(docs, key=lambda x: x.metadata["score"], reverse=True)
    return docs
  1. Chroma's default distance function is L2 distance, where a smaller value means the vectors are more similar. Why, then, are the results sorted in descending order?
  2. Moreover, the L2 distance between two normalized vectors lies in [0, 2], so why is the threshold on the settings page limited to [0, 1]?
    image
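
To make the two points above concrete, here is a minimal sketch in plain Python. It does not call Chroma; it assumes unit-normalized vectors and plain Euclidean (L2) distance as described in this issue, and the names `l2_distance`, `query`, and `candidates` are made up for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Three unit vectors compared against a unit-length query (illustrative data).
query = [1.0, 0.0]
candidates = {
    "same": [1.0, 0.0],        # identical direction
    "orthogonal": [0.0, 1.0],  # unrelated direction
    "opposite": [-1.0, 0.0],   # opposite direction
}

distances = {name: l2_distance(query, vec) for name, vec in candidates.items()}

# Sorting by distance in descending order puts the LEAST similar vector first.
descending = sorted(distances, key=distances.get, reverse=True)
# Sorting in ascending order puts the most similar vector first.
ascending = sorted(distances, key=distances.get)

print(distances)
print("descending:", descending)
print("ascending:", ascending)
```

The distance for the opposite vector reaches 2, which is outside a [0, 1] threshold range, and the descending sort ranks that vector first.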

✔️ Expected Behavior

  1. Don't sort the results in descending order of distance.
  2. Maybe change the allowed value range of the threshold?
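
One possible direction, sketched below, is to convert each distance into a similarity score in [0, 1] before filtering and sorting, so that the existing [0, 1] threshold and the descending sort both make sense. This is only a sketch, not Dify's or Chroma's implementation: `l2_to_score` and `rank_results` are hypothetical helpers, and the mapping assumes unit-normalized embeddings with a plain L2 distance in [0, 2]. If the store returns a different metric (for example a squared distance), the mapping would need to be adjusted.

```python
def l2_to_score(distance: float) -> float:
    """Map an L2 distance in [0, 2] to a similarity score in [0, 1].

    Hypothetical helper: assumes unit-normalized vectors, so 0 is identical
    and 2 is opposite. Values outside the range are clamped.
    """
    return max(0.0, min(1.0, 1.0 - distance / 2.0))


def rank_results(documents, distances, score_threshold=0.0):
    """Keep results whose similarity meets the threshold, most similar first."""
    scored = []
    for text, distance in zip(documents, distances):
        score = l2_to_score(distance)
        if score >= score_threshold:
            scored.append({"page_content": text, "score": score})
    # Higher score means more similar, so a descending sort is now correct.
    return sorted(scored, key=lambda d: d["score"], reverse=True)


# Illustrative data: "far" falls below the threshold and is dropped.
ranked = rank_results(["far", "near", "mid"], [1.6, 0.2, 1.0], score_threshold=0.3)
for item in ranked:
    print(item["page_content"], round(item["score"], 2))
```

With this conversion, a threshold of 0 keeps everything and a threshold of 1 keeps only exact matches, which matches the range shown on the settings page.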

❌ Actual Behavior

See above.

@crazywoola
Member

@bowenliang123

@dosubot dosubot bot added the 🙋‍♂️ question This issue does not contain proper reproduce steps or it only has limited words without details. label Dec 5, 2024

dosubot bot commented Dec 5, 2024

Hey @zzw1123! I'm here to help you with any bugs, questions, or contributions. Let's tackle this together!

  1. The documents are sorted in descending order by score in the search_by_vector function to ensure that the most relevant documents, based on their similarity scores, appear first. This is a common practice in search algorithms to present the most relevant results at the top [1].

  2. The expected range for the 'score_threshold' parameter is between 0 and 1. This range is specified in the documentation, but there is no explanation provided for why it is set between 0 and 1, especially considering that the L2 distance for normalized vectors can range from 0 to 2 [2]. This discrepancy might be due to a design decision or a specific implementation detail that isn't documented.

To continue talking to Dosu, mention @dosu.

@bowenliang123
Contributor

Hi, feel free to raise a PR for your suggested changes in ordering and threshold configs.

@zzw1123
Author

zzw1123 commented Dec 5, 2024

Sure, many thanks.
By the way, how do I switch the vector database to Chroma? I changed the setting in .env from VECTOR_STORE=weaviate to VECTOR_STORE=chroma, but an error occurs:
image


dosubot bot commented Jan 5, 2025

Hi, @zzw1123. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • You inquired about the sorting order of search results using L2 distance in the chroma vector database.
  • I clarified that results are sorted in descending order to show the most relevant results first.
  • There is a noted lack of documentation regarding the threshold setting range.
  • Bowenliang123 invited you to propose changes via a pull request.
  • You also requested guidance on switching the vector database to chroma due to an encountered error.

Next Steps

  • Please confirm if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 5, 2025
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 20, 2025
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 20, 2025