Releases: prrao87/db-hub-fastapi

0.7.0

29 Apr 00:46
2314010

What's Changed

  • Aggregation endpoints for Elastic added in #24, fixing #22
  • Significant performance improvements for all databases via asyncio and multiprocessing, demonstrated in #26

0.6.0

25 Apr 22:13

What's Changed

  • Added bulk-indexing code and API for Weaviate: An ML-first vector database for similarity/hybrid search by @prrao87 in #21

0.5.0

24 Apr 19:31
34d988a

What's Changed

Added code for Qdrant, a vector database built in Rust

Includes:

Key features

Bulk-index both the data and its associated vectors (sentence embeddings generated via sentence-transformers) into Qdrant, so that we can perform similarity search on phrases.

  • Unlike keyword-based search, similarity search requires vectors produced by an NLP (typically transformer) model
    • We use a pretrained model from sentence-transformers
    • The model used is multi-qa-distilbert-cos-v1: as per the docs, "This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs."
  • Unlike the other cases, generating sentence embeddings for a large batch of text is quite slow on a CPU, so code is provided to produce ONNX-optimized and quantized models, letting us both generate and index the vectors into the db more rapidly without a GPU
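As a toy illustration of the last step in how "cos"-suffixed sentence-transformers models (like multi-qa-distilbert-cos-v1) produce a single sentence vector: the per-token embeddings from the transformer are mean-pooled and then L2-normalized. This is a minimal pure-Python sketch with made-up 3-dimensional vectors, not the actual model code.

```python
# Toy sketch: mean pooling + L2 normalization, the pooling head that
# "cos" sentence-transformers models apply to per-token embeddings.
# The 3-dim token vectors below are fabricated for illustration.
import math

def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average the token vectors element-wise into one sentence vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]  # two fake token embeddings
sentence_vec = l2_normalize(mean_pool(tokens))
print(sentence_vec)
```

Because the output is unit-length, cosine similarity between two such vectors reduces to a plain dot product, which is what makes these models convenient for vector databases like Qdrant.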

Notes on ONNX performance

It looks like ONNX does utilize all available CPU cores when processing the text and generating the embeddings (the image below was captured on an AWS EC2 T2 Ubuntu instance with a single 4-core CPU).

[Image: CPU utilization across all four cores during embedding generation]

On average, the entire wine reviews dataset of 129,971 reviews is vectorized and ingested into Qdrant in 34 minutes via the quantized ONNX model, as opposed to more than 1 hour for the regular sbert model downloaded from the sentence-transformers repo. The quantized ONNX model is also ~33% smaller than the original model.

  • sbert model: Processes roughly 51 items/sec
  • Quantized onnxruntime model: Processes roughly 92 items/sec

This amounts to a roughly 1.8x speedup in indexing, with a ~26% smaller (quantized) model that loads and runs faster. To verify that the embeddings from the quantized model are of similar quality, some example cosine similarities are shown below.

Example results:

The following results are for the sentence-transformers/multi-qa-MiniLM-L6-cos-v1 model that was built for semantic similarity tasks.

Vanilla model

---
Loading vanilla sentence transformer model
---
Similarity between 'I'm very happy' and 'I am so glad': [0.74601071]
Similarity between 'I'm very happy' and 'I'm so sad': [0.6456476]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09541589]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.27607652]

Quantized ONNX model

---
Loading quantized ONNX model
---
The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
Similarity between 'I'm very happy' and 'I am so glad': [0.74153285]
Similarity between 'I'm very happy' and 'I'm so sad': [0.65299551]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09312761]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.26112114]

As can be seen, the similarity scores are very close to those of the vanilla model, but the quantized model is ~26% smaller and processes sentences much faster on the same CPU.
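The scores in the tables above are cosine similarities between sentence embeddings. For reference, here is the quantity being computed, in a minimal pure-Python sketch (the toy 2-dim vectors are illustrative, not real embeddings):

```python
# Cosine similarity: dot product of two vectors divided by the product
# of their magnitudes. Ranges from -1 (opposite) to 1 (same direction).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```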

0.4.3

21 Apr 16:55
def1585

In this release

  • srsly is a fast and lightweight JSON serialization library from Explosion.

    • It eliminates a lot of boilerplate in the util functions that read/write compressed JSONL files (in gzip format)
    • Using this library keeps each bulk-indexing script very simple, adds little to the pip install time, and reduces the line count quite significantly
    • The code bases for Elasticsearch, Meilisearch and Neo4j have all been updated to use srsly to read gzipped JSONL
    • The same approach will be used for future DBs to keep things clean and readable
  • For Meilisearch, the settings specification has been moved to a settings.json so that all settings are clean and easy to find in one place
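To make the boilerplate savings concrete, here is a stdlib sketch of what reading and writing a gzipped JSONL file looks like without srsly; with srsly, each of these helpers collapses to roughly a one-line call (per the srsly docs, something like `srsly.read_gzip_jsonl(path)`). The file path and records below are fabricated for the example.

```python
# Stdlib boilerplate that srsly's gzip-JSONL helpers replace: streaming
# one JSON record per line out of (and into) a .jsonl.gz file.
import gzip
import json
import tempfile
from pathlib import Path

def read_gzip_jsonl(path: Path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def write_gzip_jsonl(path: Path, records) -> None:
    """Write each record as one JSON line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Round-trip a couple of fake wine reviews.
path = Path(tempfile.mkdtemp()) / "winemag.jsonl.gz"
write_gzip_jsonl(path, [{"id": 1, "points": 90}, {"id": 2, "points": 87}])
print(list(read_gzip_jsonl(path)))
```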

0.4.2

20 Apr 12:31
23529c7

Enhancements

This release contains updates and enhancements from #15 and #16.

#15 results in a ~4x reduction in indexing time for Meilisearch. The key changes are as follows:

  • Files are processed concurrently (using ProcessPoolExecutor from concurrent.futures) rather than sequentially
  • The process pool is attached to the running event loop, allowing non-blocking execution of each executor task (reading JSON data and validating it with Pydantic)
  • aiofiles was also tried for processing files asynchronously, but the bottleneck appears to be Pydantic validation, not file I/O
    • It will be interesting to see how pydantic 2 compares with this approach in the future!
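The pattern described above can be sketched as follows. This is a minimal illustration, not the repo's actual indexing code: `validate()` stands in for "read one file's JSON and validate it with Pydantic", and the integer chunks are placeholders for file contents.

```python
# Sketch: CPU-bound work runs in a ProcessPoolExecutor, and the event loop
# awaits it via run_in_executor, so tasks overlap instead of running
# sequentially and no single task blocks the loop.
import asyncio
from concurrent.futures import ProcessPoolExecutor

def validate(chunk: list[int]) -> int:
    # Placeholder for CPU-bound work (e.g. Pydantic validation of records).
    return sum(chunk)

async def process_chunks(chunks: list[list[int]]) -> list[int]:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # Each chunk is handed to a worker process; awaiting the futures
        # together lets them run concurrently.
        tasks = [loop.run_in_executor(pool, validate, c) for c in chunks]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    results = asyncio.run(process_chunks([[1, 2], [3, 4], [5, 6]]))
    print(results)  # [3, 7, 11]
```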

0.4.1

18 Apr 15:32
1b23cd6

Improvements to Meilisearch section

  • #11 resolves an issue where missing files caused the script to fail
  • #12 improves indexing performance by gathering async tasks first (rather than processing them in a blocking manner)
  • #13 cleans up the comments and docs and fixes a problem with the docker container not starting when the minor version is missing
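The #12 change boils down to creating all the coroutines first and awaiting them together with asyncio.gather, so the waits overlap, instead of awaiting each one in a loop. A minimal sketch (the `index_batch` coroutine is a stand-in for an async Meilisearch call, not the repo's actual function):

```python
# Gathering async tasks: five 0.1s waits overlap into ~0.1s total,
# where sequential awaiting would take ~0.5s.
import asyncio
import time

async def index_batch(batch_id: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an async indexing request
    return batch_id

async def main() -> list[int]:
    tasks = [index_batch(i) for i in range(5)]  # nothing awaited yet
    return await asyncio.gather(*tasks)         # all five run concurrently

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # ~0.1s, not ~0.5s
```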

0.4.0

17 Apr 14:01

What's in this release

#8 adds Meilisearch, a fast and responsive search engine database written in Rust. Like the other databases in this repo, the async Python client is used to bulk-index the dataset into the db and async queries are used in FastAPI. The following tasks are implemented:

  • Set up Meilisearch DB instance via Docker compose and include .env.example
  • Add async bulk indexing script
    • Include schema checks
    • Add methods to set searchable, sortable and filterable fields
  • Add API code for querying db
  • Add docs describing Meilisearch and some of its limitations compared to other dbs
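The searchable/sortable/filterable step above amounts to a small settings payload. A hypothetical sketch follows: the field names are illustrative (not the repo's actual schema), and the commented-out client calls reflect the `update_*_attributes` methods the Meilisearch Python clients expose, applied here only as an assumption about the approach.

```python
# Hypothetical Meilisearch settings payload; field names are made up
# for illustration. Keys follow Meilisearch's settings naming.
settings = {
    "searchableAttributes": ["title", "description", "variety"],
    "sortableAttributes": ["points", "price"],
    "filterableAttributes": ["country", "variety"],
}

# With an async client this would be applied roughly as:
#   index = client.index("wines")
#   await index.update_searchable_attributes(settings["searchableAttributes"])
#   await index.update_sortable_attributes(settings["sortableAttributes"])
#   await index.update_filterable_attributes(settings["filterableAttributes"])
print(sorted(settings))
```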

0.3.0

15 Apr 16:16
013ebba

What's in this update

Includes updates from #5 and #6.

Elasticsearch

This release introduces Elasticsearch indexing and API code to the repo.

  • Include docker files to set up a basic license (free) Elasticsearch database
  • Create a wines alias and its associated index in Elastic
  • Bulk-index the wines dataset into Elastic
  • Test queries in Kibana
  • Build FastAPI application to query result from Elastic via JSON queries sent to the backend
  • Test out sample queries via OpenAPI browser
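The "JSON queries sent to the backend" are Elasticsearch query-DSL documents built as plain dicts. A hedged sketch of the shape of such a query — the field names (`description`, `country`) are illustrative, not necessarily the repo's mapping:

```python
# Sketch of an Elasticsearch bool query: full-text match on the review
# text, plus an exact-value filter. Built as a dict, sent as JSON.
def build_search_query(terms: str, country: str) -> dict:
    return {
        "query": {
            "bool": {
                "must": [{"match": {"description": terms}}],
                "filter": [{"term": {"country": country}}],
            }
        }
    }

query = build_search_query("fruity full-bodied", "Italy")
print(query["query"]["bool"]["filter"])
```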

Neo4j

  • Minor fixes to docs: typos and clarity
  • Fix type hints in schema and API
  • Set the docker container tag as the API version for simplicity (every time the FastAPI container tag changes, the API version in the docs follow suit with the same number)
  • Fix issues with type hints in API routers
    • Neo4j queries return vanilla dicts, and for some reason, FastAPI + Pydantic don't parse these prior to sending them as a response (this isn't an issue in Elastic)
  • Update README example cURL request and docs
  • Fix linting issues

0.2.1

14 Apr 19:37
6806908

Refactor Neo4j data loader schema and queries

This release is for #4.

  • There's no need to complicate things by converting the existing data to a nested dict -- keeping the original dict from the raw data makes more sense from a query-building perspective
  • Running each portion separately (nodes first, then edges) is also unnecessary -- a single build query with WITH and MERGE statements does the job
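An illustrative version of such a single-pass build query, written as a Cypher string in Python. The labels and property names (`Wine`, `Country`, `IS_FROM`, etc.) are hypothetical, not the repo's actual schema; the point is that UNWIND takes the raw dicts unchanged, and WITH carries the created node forward so the relationship is MERGEd in the same statement.

```python
# Hypothetical single-pass Cypher build query: nodes and edges created
# together via MERGE + WITH, with the raw dicts passed as a parameter.
build_query = """
UNWIND $data AS record
MERGE (wine:Wine {id: record.id})
  SET wine.title = record.title
WITH wine, record
MERGE (country:Country {name: record.country})
MERGE (wine)-[:IS_FROM]->(country)
"""

# The raw dicts go in unchanged as the $data parameter, e.g.:
#   await session.run(build_query, data=[{"id": 1, "title": "...", "country": "Italy"}])
print("query uses UNWIND/WITH/MERGE:",
      all(kw in build_query for kw in ("UNWIND", "WITH", "MERGE")))
```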

0.2.0

14 Apr 17:53
395eafc

Updates

  • Add uvloop to speed up the async event loop (the AsyncGraphDatabase driver already supports this)
  • Slim down docker image for faster builds
  • Add new endpoints
  • Update docs for more clarity
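Adopting uvloop typically takes a couple of lines at startup. A minimal sketch, with a fallback so the app still runs where uvloop isn't installed (this is the general opt-in pattern, not necessarily this repo's exact wiring):

```python
# Opt into uvloop's faster event loop when available, otherwise keep
# asyncio's default loop.
import asyncio

try:
    import uvloop  # drop-in replacement event loop implementation
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass  # stdlib event loop still works, just without the speedup

async def ping() -> str:
    await asyncio.sleep(0)
    return "pong"

print(asyncio.run(ping()))  # "pong" under either event loop
```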