[Store][Postgres] allow store initialization with utilized distance #197

Merged
chr-hertel merged 1 commit into symfony:main from configurable-postgres-distance
Jul 30, 2025

Contributor

@DZunke DZunke commented Jul 24, 2025

| Q             | A
| ------------- | ---
| Bug fix?      | no
| New feature?  | yes
| Docs?         | no
| Issues        | #195
| License       | MIT

According to the pgvector documentation, several distance calculations are supported. The current store implementation only uses the L2 distance, via the <-> operator. Allowing the other distance variants would be useful here, since most of the discussion seems to revolve around the cosine distance.
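
A minimal sketch of how a distance option could map onto pgvector's comparison operators. The operator symbols are pgvector's own; the enum name `DistanceSketch`, its case names, and the `operator()` method are assumptions made for illustration and are not taken from the code in this PR.

```php
<?php

// Hypothetical enum for illustration; not the Distance enum shipped in this PR.
enum DistanceSketch: string
{
    case L2 = 'l2';
    case Cosine = 'cosine';
    case InnerProduct = 'inner_product';
    case L1 = 'l1';

    /** Returns the pgvector operator to use when comparing two vectors. */
    public function operator(): string
    {
        return match ($this) {
            self::L2 => '<->',           // Euclidean distance
            self::Cosine => '<=>',       // cosine distance (1 - cosine similarity)
            self::InnerProduct => '<#>', // negative inner product
            self::L1 => '<+>',           // taxicab distance (recent pgvector releases)
        };
    }
}

echo DistanceSketch::Cosine->operator(), "\n"; // prints "<=>"
```

Keeping the mapping in one place means the query builder only needs the enum value and never has to branch on the distance type itself.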

@lyrixx
Member

lyrixx commented Jul 24, 2025

Thanks for this PR.

I'm using Store::FromDbal(), so the distance option should be added there too, and also in fromPdo().

So in order to test, I changed the default in the constructor and double-checked that it is configured correctly (current operator: <=>).

diff --git a/src/store/src/Bridge/Postgres/Store.php b/src/store/src/Bridge/Postgres/Store.php
index 07bff22..28d4ba5 100644
--- a/src/store/src/Bridge/Postgres/Store.php
+++ b/src/store/src/Bridge/Postgres/Store.php
@@ -34,7 +34,7 @@ final readonly class Store implements VectorStoreInterface, InitializableStoreIn
         private \PDO $connection,
         private string $tableName,
         private string $vectorFieldName = 'embedding',
-        private Distance $distance = Distance::L2,
+        private Distance $distance = Distance::Cosine,
     ) {
     }

I have crawled https://jolicode.com and https://www.premieroctet.com, indexed all their content, and run the following code:

$rows = $connection->executeQuery("select * from {$_SERVER['PLATFORM']}")->fetchAllAssociative();
foreach ($rows as $row) {
    $metadata = json_decode($row['metadata'], true, 512, \JSON_THROW_ON_ERROR);
    $vector = new Vector(json_decode($row['embedding'], true));
    $documents = $store->query($vector, [], 0.000001); //Hack to not get "current row"

    if (!$documents) {
        continue;
    }

    echo "Current document: {$metadata['url']}\n";
    echo "Found " . count($documents) . " similar documents:\n";
    foreach ($documents as $i => $document) {
        echo "- {$document->metadata['url']} (score: {$document->score})\n";
        // break;
    }
    echo "\n";
    die; // Debug only: stop after the first document that has matches
}

So:

  1. We still sort by score ASC
  2. But the lowest score seems to be the best again

Current document: https://jolicode.com/blog/tag/zellij
Found 5 similar documents:
- https://jolicode.com/blog/tag/zellij (score: 0)
- https://jolicode.com/blog/tag/tmux (score: 0.068143753338624)
- https://jolicode.com/blog/tag/agence (score: 0.12764826831817)
- https://jolicode.com/blog/tag/js (score: 0.13115907744191)
- https://jolicode.com/blog/tag/sysadmin (score: 0.13248805321064)

Current document: https://jolicode.com/blog/tag/encodage
Found 5 similar documents:
- https://jolicode.com/blog/tag/encodage (score: 0)
- https://jolicode.com/blog/tag/utf8 (score: 0.045148642848127)
- https://jolicode.com/qui-sommes-nous/equipe/marion-hurteau (score: 0.07487888379814)
- https://jolicode.com/blog/ce-que-vous-devez-savoir-sur-les-chaa-r-nes-de-caracta-res (score: 0.07784386727315)
- https://jolicode.com/blog/tag/qualite (score: 0.083560306575813)

@chr-hertel chr-hertel added the Store (Issues & PRs about the AI Store component) and Status: Needs Work labels Jul 24, 2025
@DZunke DZunke force-pushed the configurable-postgres-distance branch from e966024 to b4e30f5 Compare July 25, 2025 09:53
$uuid = Uuid::v4();
$vectorData = [0.1, 0.2, 0.3];
$minScore = 0.8;
$pdo = $this->createMock(\PDO::class);
Member

I don't really know the testing strategy on this project yet. But this kind of test tests nothing. All the important things are mocked.

IMHO, it would be much better to use a real instance of pgvector.

Contributor Author

Thanks. Yeah, the tests only check that the query is built correctly. Nevertheless, I have to remove the tests because there was a merge of tests in the meantime. They are still unit tests, not functional tests. I would keep them for now, just to verify that the correct comparison method is used.

@DZunke
Contributor Author

DZunke commented Jul 25, 2025

> Thanks for this PR.
>
> I'm using Store::FromDbal(), so the distance option should be added there too, and also in fromPdo().
>
> So in order to test, I changed the default in the constructor and double-checked that it is configured correctly (current operator: <=>).
> ...

Thanks, @lyrixx! I've already added it to the named constructors.

Your results generally look totally fine to me. The cosine distance, which is being used here, returns a value between 0 and 2, where 0 means the vectors point in the same direction (the elements are effectively identical) and 2 means they point in opposite directions (very different). So sorting ASC means that the most similar document comes first. Sorting DESC would put the most dissimilar document first.
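
To make the 0 to 2 range concrete, here is a small sketch that computes the cosine distance in plain PHP. The function name `cosineDistance` is an assumption for illustration; pgvector performs this calculation inside the database when the <=> operator is used.

```php
<?php

/**
 * Cosine distance: 1 - (a . b) / (|a| * |b|). The result lies in [0, 2].
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineDistance(array $a, array $b): float
{
    if (count($a) !== count($b) || [] === $a) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if (0.0 === $normA || 0.0 === $normB) {
        throw new InvalidArgumentException('Cosine distance is undefined for zero vectors.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

printf("same direction: %.1f\n", cosineDistance([1.0, 0.0], [2.0, 0.0]));  // 0.0
printf("orthogonal:     %.1f\n", cosineDistance([1.0, 0.0], [0.0, 1.0]));  // 1.0
printf("opposite:       %.1f\n", cosineDistance([1.0, 0.0], [-1.0, 0.0])); // 2.0
```

Note that the length of the vectors does not matter, only their direction, which is why `[1.0, 0.0]` and `[2.0, 0.0]` have a distance of 0.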

In the query, the filtering could be problematic. The term minScore is not the best wording here, especially in combination with the >= comparator. What’s currently labeled as minScore should actually be a maxScore, at least for cosine search 🙈

The score problem in filtering also seems to apply to the L2 distance, as it yields a value from 0 to infinity, where 0 is the best match. It seems the minScore wording comes from the MongoDB implementation; at least that seems to be where it started and where it is correct.
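
A short sketch of what a "maximum distance" filter means in practice, using the scores from the zellij example earlier in this thread. The function name `filterByMaxScore` is hypothetical and only mirrors the maxScore wording suggested above; it is not part of the store.

```php
<?php

/**
 * Keeps results whose distance is at most $maxScore, closest match first.
 *
 * @param array<string, float> $distances URL => distance (lower = more similar)
 *
 * @return array<string, float>
 */
function filterByMaxScore(array $distances, float $maxScore): array
{
    $kept = array_filter($distances, static fn (float $d): bool => $d <= $maxScore);
    asort($kept); // ascending order: best match first, keys preserved

    return $kept;
}

// Distances taken from the example output posted in this conversation.
$distances = [
    'https://jolicode.com/blog/tag/agence' => 0.12764826831817,
    'https://jolicode.com/blog/tag/zellij' => 0.0,
    'https://jolicode.com/blog/tag/js' => 0.13115907744191,
    'https://jolicode.com/blog/tag/tmux' => 0.068143753338624,
];

foreach (filterByMaxScore($distances, 0.1) as $url => $distance) {
    echo "- {$url} (score: {$distance})\n";
}
```

With a cap of 0.1 only the zellij and tmux pages remain. A `>=` comparison against the same value would instead keep the two least similar pages, which is the naming problem described above.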

@DZunke DZunke force-pushed the configurable-postgres-distance branch from b4e30f5 to 8c7735a Compare July 25, 2025 11:12
@lyrixx
Member

lyrixx commented Jul 25, 2025

Thank you very much for the explanation. Very clear.

And I agree with you about minScore. Maybe the name could be "threshold". It's kind of generic, haha.

@chr-hertel
Member

I think if something is explicitly part of the method signature, we should streamline the mechanism/behavior across the different implementations as part of the store abstraction. And if it's part of the $options array, we would expect it to be leaky, and it is fair for it to be vendor- or option-specific.

Meaning, if we want to keep minScore as an explicit argument, we should streamline the behavior no matter which store and option, e.g. the score is always between 0.0 and 1.0, and the higher the number, the more similar.
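
For illustration only, here is one way distances could be mapped onto a 0.0 to 1.0 "higher is more similar" scale. Both function names and both formulas are assumptions chosen for this sketch; nothing in this thread indicates that any store applies them.

```php
<?php

// Hypothetical helpers: map "lower is better" distances onto a 0.0..1.0 similarity.

function cosineDistanceToSimilarity(float $distance): float
{
    // Cosine distance lies in [0, 2]; rescale linearly so 0 -> 1.0 and 2 -> 0.0.
    return 1.0 - $distance / 2.0;
}

function l2DistanceToSimilarity(float $distance): float
{
    // L2 distance lies in [0, inf); 1 / (1 + d) maps 0 -> 1.0 and large d towards 0.0.
    return 1.0 / (1.0 + $distance);
}

printf("cosine distance 0.0 -> %.2f\n", cosineDistanceToSimilarity(0.0)); // 1.00
printf("cosine distance 2.0 -> %.2f\n", cosineDistanceToSimilarity(2.0)); // 0.00
printf("L2 distance 1.0     -> %.2f\n", l2DistanceToSimilarity(1.0));     // 0.50
```

The L2 mapping is not linear, so a fixed threshold such as 0.8 would mean something different per distance type, which is one reason a uniform score can be harder to reason about than it first appears.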

@chr-hertel
Member

Please rebase on main and adopt the changed test style, switching from #[Test] attribute to test method prefix, see #214.

@DZunke DZunke force-pushed the configurable-postgres-distance branch from 8c7735a to 4ecc7a1 Compare July 29, 2025 10:17
@DZunke
Contributor Author

DZunke commented Jul 29, 2025

> I think if something is explicitly part of the method signature, we should streamline the mechanism/behavior across the different implementations as part of the store abstraction. And if it's part of the $options array, we would expect it to be leaky, and it is fair for it to be vendor- or option-specific.
>
> Meaning, if we want to keep minScore as an explicit argument, we should streamline the behavior no matter which store and option, e.g. the score is always between 0.0 and 1.0, and the higher the number, the more similar.

From my perspective, the argument should then be removed. But this is surely another issue. Instead of changing the score results just for the sake of changing them, removing the argument is surely the better solution, so that filtering in all stores has to be passed through the options. I have not checked the code of the other stores, but I would assume that no store really transforms the number when distance calculations are used. And, as mentioned, at least Cosine and L2 are built on lower = better.

But surely this can be discussed in an independent issue and not within this PR 😄

@DZunke
Contributor Author

DZunke commented Jul 29, 2025

> Please rebase on main and adopt the changed test style, switching from #[Test] attribute to test method prefix, see #214.

@chr-hertel Do you have any idea why the pipeline still fails? PHPStan fails while setting up the ai-bundle, and the bot thingy has problems with void return types that are everywhere, but it only fails on the class I have changed 🤔

@OskarStark
Contributor

Can you please remove void, like I did?

#223

@DZunke
Contributor Author

DZunke commented Jul 29, 2025

Haha. OK, 1 minute ago. Nothing I could have known 😀 Sure, I'll do it ... style changes over and over again. Hard to follow, sorry 🙈

@OskarStark
Contributor

Yes, sorry, let's wait for the discussion in

@DZunke DZunke force-pushed the configurable-postgres-distance branch 3 times, most recently from 4a346a6 to 789357f Compare July 29, 2025 10:44
@chr-hertel
Member

I think rebasing should be safe again - most CS part is done :)

@DZunke DZunke force-pushed the configurable-postgres-distance branch from 789357f to e5bc8a4 Compare July 30, 2025 16:24
@DZunke DZunke force-pushed the configurable-postgres-distance branch from e5bc8a4 to a8b4fba Compare July 30, 2025 16:25
@chr-hertel
Member

Thank you @DZunke.

@chr-hertel chr-hertel merged commit 59f1422 into symfony:main Jul 30, 2025
12 checks passed
chr-hertel added a commit that referenced this pull request Aug 1, 2025
…ce::query` (DZunke)

This PR was merged into the main branch.

Discussion
----------

[Store] remove `minScore` parameter from `VectorStoreInterface::query`

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| Docs?         | no
| Issues        | Fix #195
| License       | MIT

As discussed in #197 and #195, I have removed the `minScore` argument from the `VectorStoreInterface`, as it is not semantically correct for every store. Most stores have never implemented this filter mechanism. It originated in the MongoDB store, where it matches the internal scoring system, but for stores like MariaDB and Postgres a `maxScore` option fits the distance calculations better, as their results are sorted from `0` (exact match) up to larger float values for the distance.

I have changed those two stores and left the MongoDB store with the `minScore` implementation.

This is a breaking change, as the interface has changed.
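
As a rough sketch of how a `maxScore` cap could appear in a pgvector query, the helper below only builds the SQL string. The function `buildSimilarityQuery`, the `id` column, the `scored` alias, and the placeholder names are all hypothetical; the actual store query in the repository may look quite different, and binding or casting the embedding parameter is left out.

```php
<?php

/**
 * Builds a pgvector similarity query with an optional distance cap.
 * Hypothetical helper: table and column names must come from trusted
 * configuration, because identifiers are not escaped here.
 */
function buildSimilarityQuery(
    string $table,
    string $vectorField,
    string $operator,
    ?float $maxScore = null,
    int $limit = 5,
): string {
    // Only pgvector's distance operators are accepted.
    if (!in_array($operator, ['<->', '<=>', '<#>', '<+>'], true)) {
        throw new InvalidArgumentException(sprintf('Unsupported operator "%s".', $operator));
    }

    // The inner query computes the distance once and exposes it as "score".
    $inner = sprintf(
        'SELECT id, metadata, (%s %s :embedding) AS score FROM %s',
        $vectorField,
        $operator,
        $table,
    );

    // Lower distance = better match, so the cap is an upper bound.
    $where = null !== $maxScore ? ' WHERE score <= :maxScore' : '';

    return sprintf(
        'SELECT id, metadata, score FROM (%s) AS scored%s ORDER BY score ASC LIMIT %d',
        $inner,
        $where,
        $limit,
    );
}

echo buildSimilarityQuery('documents', 'embedding', '<=>', 0.2), "\n";
```

The subquery keeps the `:embedding` placeholder to a single occurrence, and the filter is only added when a cap is given, so callers that do not pass `maxScore` get the plain nearest-neighbour ordering.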

Commits
-------

2700074 [Store] remove minScore argument from VectorStoreInterface::query