-
-
Notifications
You must be signed in to change notification settings - Fork 44
[Store][Postgres] allow store initialization with utilized distance #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Store][Postgres] allow store initialization with utilized distance #197
Conversation
Thanks for this PR. I'm using So in order to test, I changed in the constructor, and double check it's well configurated (current operator : diff --git a/src/store/src/Bridge/Postgres/Store.php b/src/store/src/Bridge/Postgres/Store.php
index 07bff22..28d4ba5 100644
--- a/src/store/src/Bridge/Postgres/Store.php
+++ b/src/store/src/Bridge/Postgres/Store.php
@@ -34,7 +34,7 @@ final readonly class Store implements VectorStoreInterface, InitializableStoreIn
private \PDO $connection,
private string $tableName,
private string $vectorFieldName = 'embedding',
- private Distance $distance = Distance::L2,
+ private Distance $distance = Distance::Cosine,
) {
} I have crawled https://jolicode.com and https://www.premieroctet.com, indexed all their content, and run the following code: $rows = $connection->executeQuery("select * from {$_SERVER['PLATFORM']}")->fetchAllAssociative();
foreach ($rows as $row) {
$metadata = json_decode($row['metadata'], true, 512, \JSON_THROW_ON_ERROR);
$vector = new Vector(json_decode($row['embedding'], true));
$documents = $store->query($vector, [], 0.000001); //Hack to not get "current row"
if (!$documents) {
continue;
}
echo "Current document: {$metadata['url']}\n";
echo "Found " . count($documents) . " similar documents:\n";
foreach ($documents as $i => $document) {
echo "- {$document->metadata['url']} (score: {$document->score})\n";
// break;
}
die;
echo "\n";
} So:
|
e966024
to
b4e30f5
Compare
$uuid = Uuid::v4(); | ||
$vectorData = [0.1, 0.2, 0.3]; | ||
$minScore = 0.8; | ||
$pdo = $this->createMock(\PDO::class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know the testing strategy on this project yet. But this kind of test tests nothing. All the important things are mocked.
IMHO, it would be much better to use a real instance of pgvector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yeah ... the tests just testing the query is correctly build. But nethertheless i have to remove the tests because there was a merge of tests. But still unit and not functional tests. I would keep with them for now to have just a look that the correct comparison method is utilized.
Thanks, @lyrixx ! I've already added it to the named constructors. Your results generally is looking totally fine to me. The cosine distance, which is being used, returns a value between In the query, the filtering could be problematic. The term The score problem in filtering seems also be valid for the L2 distance as this is a value from 0 to infinite, where 0 is the most fitting match. It seems the |
b4e30f5
to
8c7735a
Compare
Thanks you very much for the explanation. Very clear. And I agree with you for the minScore. May be the name could be "treashhold". It kinda generic haha |
I think if there's something, that is explicitly part of the method signature, we should streamline the mechanism/behavior across the different implementations as part of the store abstraction. And if it's part of the Meaning, if we want to keep the |
Please rebase on main and adopt the changed test style, switching from |
8c7735a
to
4ecc7a1
Compare
From my perspective the argument should then be removed. But this is surely another issue. Instead of changing the score results just for the purpose to have them changed this is surely the better solution, so that the filtering in all storages must be given by the options. I have not checked the code in the other stores but i would assume that no storage really changed the number when distance calculations are utilized. And, as mentioned, at least Cosine and L2 are build with lower = better. But surely this can be discussed in an independent issue and not within this PR 😄 |
@chr-hertel Do you have any idea why the pipeline still fails? PHPStan fails with setting up the ai-bundle and the bot thingy has problems with void return types that are everywhere but fails with the class i have changed 🤔 |
Can you please remove |
Haha. Ok. 1 Minute ago. Nothing i could have known 😀 Sure ... i'll do it ... style changes over and over again. Hard to follow, sorry 🙈 |
Yes sorry, lets wait the discussion in |
4a346a6
to
789357f
Compare
I think rebasing should be safe again - most CS part is done :) |
789357f
to
e5bc8a4
Compare
e5bc8a4
to
a8b4fba
Compare
Thank you @DZunke. |
…ce::query` (DZunke) This PR was merged into the main branch. Discussion ---------- [Store] remove `minScore` parameter from `VectorStoreInterface::query` | Q | A | ------------- | --- | Bug fix? | yes | New feature? | no | Docs? | no | Issues | Fix #195 | License | MIT As discussed within #197 and #195 i have removed the `minScore` argument from the `VectorStoreInterface` as this is semantically not correct for any store. Most stores has never implemented this filter mechanism. It originated from the MongoDB store where it is correct with the internal scoring system but for stores like MariaDB and Postgres the distance calculations are more correct with a `maxScore` option as their results are sorted from `0` as exact match to larger float values for the distance. I have changed those two stores and left the MongoDB store with the minScore implementation. This is a break in the implementation as the interface has changed. Commits ------- 2700074 [Store] remove minScore argument from VectorStoreInterface::query
According to the pgvector documentation there are multiple distance calculations allowed. The current implementation in the store is only the L2 distance with the usage of
<->
. Allowing to utilize the other distance calculation variants would be useful here as mostly the discussion seem to go around the cosine algorithm.