diff --git a/guides/using-trieve-vector-inference.mdx b/guides/using-trieve-vector-inference.mdx
new file mode 100644
index 0000000..85a3acc
--- /dev/null
+++ b/guides/using-trieve-vector-inference.mdx
@@ -0,0 +1,105 @@
+---
+title: 'Install Trieve Vector Inference'
+description: 'Install Trieve Vector Inference'
+icon: 'files'
+---
+
+## Installation Requirements
+
+- `eksctl` >= 0.171 ([eksctl installation guide](https://eksctl.io/installation))
+- `aws` >= 2.15 ([aws installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
+- `kubectl` >= 1.28 ([kubectl installation guide](https://kubernetes.io/docs/tasks/tools/#kubectl))
+- `helm` >= 3.14 ([helm installation guide](https://helm.sh/docs/intro/install/#helm))
+
+You'll also need a license to run Trieve Vector Inference.
+
+### Getting your license
+
+Contact us:
+- Email us at humans@trieve.ai
+- [book a meeting](https://cal.com/nick.k/meet)
+- Call us @ 628-222-4090
+
+## Check AWS quota
+
+Ensure you have quotas for:
+
+1) At least **4 vCPUs** for On-Demand G and VT instances in the region of choice.
+
+Check quota for *us-east-2* [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+2) At least **1 load balancer** for each model you want.
+
+Check quota for *us-east-2* [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+## Deploying the Cluster
+
+### Setting up environment variables
+
+These environment variables are used to create the EKS cluster and install the needed plugins.
+
+Your AWS Account ID:
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+```
+
+Your AWS REGION:
+```sh
+export AWS_REGION=us-east-2
+```
+
+Your Kubernetes cluster name:
+
+```sh
+export CLUSTER_NAME=trieve-gpu
+```
+
+Your machine types. We recommend `g4dn.xlarge`, as it is the cheapest GPU instance on AWS; a single small CPU node is also needed for extra utility.
+
+```sh
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+```
+
+### Create your cluster
+
+Download and run the `bootstrap-eks.sh` script:
+
+```sh
+wget cdn.trieve.ai/bootstrap-eks.sh
+bash bootstrap-eks.sh
+```
+
+This will take around 25 minutes to complete.
+
+## Install Trieve Vector Inference
+
+### Specify your embedding models
+
+Modify `embedding_models.yaml` to include the models that you want to use.
+
+### Install the helm chart
+
+```sh
+helm upgrade -i vector-inference oci://registry-1.docker.io/trieve/embeddings-helm -f embedding_models.yaml
+```
+
+### Get your model endpoints
+
+```sh
+kubectl get ingress
+```
+
+![](./assets/ingress.png)
+
+## Using Trieve Vector Inference
+
+Use the `ADDRESS` of the model's ingress as its endpoint:
+
+```sh
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "cancer"}' \
+  --url "http://$ENDPOINT/embed"
+```
+
+## Optional: Delete the cluster
+
+```sh
+CLUSTER_NAME=trieve-gpu
+REGION=us-east-2
+
+helm uninstall vector-inference
+helm uninstall nvdp -n kube-system
+helm uninstall aws-load-balancer-controller -n kube-system
+eksctl delete cluster --region=${REGION} --name=${CLUSTER_NAME}
+```
diff --git a/mint.json b/mint.json
index 740ad85..b8404d5 100644
--- a/mint.json
+++ b/mint.json
@@ -35,6 +35,10 @@
     {
       "name": "API Reference",
       "url": "api-reference"
+    },
+    {
+      "name": "Vector Inference",
+      "url": "vector-inference"
     }
   ],
   "anchors": [
@@ -66,7 +70,9 @@
         "getting-started/introduction",
         "getting-started/quickstart",
         "getting-started/trieve-primitives",
-        "getting-started/screenshots"
+        "getting-started/screenshots",
+        "vector-inference/introduction",
+        "vector-inference/pricing"
       ]
     },
     {
@@ -75,7 +81,9 @@
         "self-hosting/docker-compose",
         "self-hosting/local-kube",
         "self-hosting/aws",
-        "self-hosting/gcp"
+        "self-hosting/gcp",
+        "vector-inference/aws-installation",
+        "vector-inference/troubleshooting"
       ]
     },
     {
@@ -85,7 +93,21 @@
         "guides/uploading-files",
         "guides/searching-with-trieve",
         "guides/recommending-with-trieve",
-        "guides/RAG-with-trieve"
+        "guides/RAG-with-trieve",
+        "vector-inference/rerank",
+        "vector-inference/splade",
+        "vector-inference/dense",
+        "vector-inference/openai"
+      ]
+    },
+    {
+      "group": "API Reference",
+      "pages": [
+        "vector-inference/embed",
+        "vector-inference/embed_all",
+        "vector-inference/embed_sparse",
+        "vector-inference/reranker",
+        "vector-inference/openai_compat"
       ]
     },
     {
diff --git a/vector-inference/aws-installation.mdx b/vector-inference/aws-installation.mdx
new file mode 100644
index 0000000..ed82233
--- /dev/null
+++ b/vector-inference/aws-installation.mdx
@@ -0,0 +1,191 @@
+---
+title: 'AWS Installation'
+description: 'Install Trieve Vector Inference in your own AWS account'
+icon: 'aws'
+---
+
+## Installation Requirements
+
+- `eksctl` >= 0.171 ([eksctl installation guide](https://eksctl.io/installation))
+- `aws` >= 2.15 ([aws installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
+- `kubectl` >= 1.28 ([kubectl installation guide](https://kubernetes.io/docs/tasks/tools/#kubectl))
+- `helm` >= 3.14 ([helm installation guide](https://helm.sh/docs/intro/install/#helm))
+
+You'll also need a license to run Trieve Vector Inference.
+
+### Getting your license
+
+Contact us:
+- Email us at humans@trieve.ai
+- [book a meeting](https://cal.com/nick.k/meet)
+- Call us @ 628-222-4090
+
+Our pricing is [here](/vector-inference/pricing).
+
+## Check AWS quota
+
+  Ensure you have quotas for both GPUs and load balancers.
+
+1) At least **4 vCPUs** for On-Demand G and VT instances in the region of choice.
+
+Check your quota [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+2) At least **1 load balancer** for each model you want.
+
+Check your quota [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
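+
+You can also check these quotas from the AWS CLI. This is a sketch rather than a required step; the quota code below is taken from the console link above, and the region is the recommended default, so adjust both for your setup.
+
+```sh
+# Check the On-Demand G and VT instance quota referenced by the console link above.
+aws service-quotas get-service-quota \
+  --service-code ec2 \
+  --quota-code L-3819A6DF \
+  --region us-east-2 \
+  --query 'Quota.Value'
+```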
+
+## Deploying the Cluster
+
+### Setting up environment variables
+
+These environment variables are used to create the EKS cluster and install the needed plugins.
+
+Your AWS Account ID:
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+```
+
+Your AWS REGION:
+```sh
+export AWS_REGION=us-east-2
+```
+
+Your Kubernetes cluster name:
+
+```sh
+export CLUSTER_NAME=trieve-gpu
+```
+
+Your machine types. We recommend `g4dn.xlarge`, as it is the cheapest GPU instance on AWS; a single small CPU node is also needed for extra utility.
+
+```sh
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+```
+
+Disable AWS CLI pagination (optional):
+
+```sh
+export AWS_PAGER=""
+```
+
+**To use our recommended defaults:**
+
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+export AWS_REGION=us-east-2
+export CLUSTER_NAME=trieve-gpu
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+export AWS_PAGER=""
+```
+
+### Create your cluster
+
+Download the `bootstrap-eks.sh` script:
+```sh
+wget cdn.trieve.ai/bootstrap-eks.sh
+```
+
+Run `bootstrap-eks.sh` with bash:
+
+```sh
+bash bootstrap-eks.sh
+```
+
+This will take around 25 minutes to complete.
+
+## Install Trieve Vector Inference
+
+### Configure `embedding_models.yaml`
+
+First, download the example configuration file:
+
+```sh
+wget https://cdn.trieve.ai/embedding_models.yaml
+```
+
+Now you can modify your `embedding_models.yaml`. This file defines all of the models that you want to use.
+
+```yaml embedding_models.yaml
+accessKey: ""
+
+models:
+  bgeM3:
+    replicas: 2
+    revision: main
+    modelName: BAAI/bge-m3 # The end of the URL https://huggingface.co/BAAI/bge-m3
+    hfToken: "" # If you have a private hugging face repo
+  spladeDoc:
+    replicas: 2
+    modelName: naver/efficient-splade-VI-BT-large-doc # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-doc
+    isSplade: true
+  spladeQuery:
+    replicas: 2
+    modelName: naver/efficient-splade-VI-BT-large-query # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-query
+    isSplade: true
+  bge-reranker:
+    replicas: 2
+    modelName: BAAI/bge-reranker-large
+    isSplade: false
+```
+
+### Install the helm chart
+
+```sh
+helm upgrade -i vector-inference \
+  oci://registry-1.docker.io/trieve/embeddings-helm \
+  -f embedding_models.yaml
+```
+
+### Get your model endpoints
+
+```sh
+kubectl get ingress
+```
+
+The output looks something like this:
+
+```
+NAME                                              CLASS   HOSTS   ADDRESS                                                                   PORTS   AGE
+vector-inference-embedding-bge-reranker-ingress   alb     *       k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com   80      73s
+vector-inference-embedding-bgem3-ingress          alb     *       k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com   80      73s
+vector-inference-embedding-spladedoc-ingress      alb     *       k8s-default-vectorin-8af81ad2bd-192706382.us-east-2.elb.amazonaws.com    80      72s
+vector-inference-embedding-spladequery-ingress    alb     *       k8s-default-vectorin-10404abaee-1617952667.us-east-2.elb.amazonaws.com   80      3m20s
+```
+
+## Using Trieve Vector Inference
+
+Each `ingress` point uses its own Application Load Balancer within AWS. The `Address` provided is the model's endpoint, which you can use to make [dense embedding](/vector-inference/embed), [sparse embedding](/vector-inference/embed_sparse), or [reranker](/vector-inference/reranker) calls, depending on the models you chose.
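+
+For example, assuming the `bgeM3` deployment from the sample output above (the ingress name and resulting hostname will differ for your cluster), you could grab the address and make a dense embedding call like this:
+
+```sh
+# Pull the ALB hostname off the ingress and hit the dense embedding route.
+export ENDPOINT=$(kubectl get ingress vector-inference-embedding-bgem3-ingress \
+  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
+
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "test input"}' \
+  --url "http://$ENDPOINT/embed"
+```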
The `Address` provided is the model's endpoint that you can make [dense embeddings](/vector-inference/embed), [sparse embeddings](/vector-inference/embed_sparse), or [reranker calls](/vector-inference/reranker) based on the models you chose + +Check out the guides for more information on configuration + + + + How to setup a dedicated instance for the sparse splade embedding model + + + How to use private or gated hugging face models. Or any models that you want + + + Trieve Vector Inference has openai compatible routes + + + +## Optional: Delete the cluster + +```sh +CLUSTER_NAME=trieve-gpu +REGION=us-east-2 + +aws eks update-kubeconfig --region ${REGION} --name ${CLUSTER_NAME} + +helm uninstall vector-release +helm uninstall nvdp -n kube-system +helm uninstall aws-load-balancer-controller -n kube-system +eksctl delete cluster --region=${REGION} --name=${CLUSTER_NAME} +``` diff --git a/vector-inference/dense.mdx b/vector-inference/dense.mdx new file mode 100644 index 0000000..00cce87 --- /dev/null +++ b/vector-inference/dense.mdx @@ -0,0 +1,48 @@ +--- +title: 'Using Custom Models' +icon: brackets-curly +description: How to use gated or private models hosted on huggingface +mode: wide +--- + +## Custom or fine tuned models in Trieve Vector Inference + +The [open source text models](https://huggingface.co/spaces/mteb/leaderboard) on hugging face may not be what you always want, + + + + +To use a private or custom model with Trieve Vector Inference, you will need to update your `embedding_models.yaml` file. + +If the model is a private or gated hugging face model, you will need to include your huggingface api token + +```yaml embedding_models.yaml +... +models: + ... + my-custom-model: + replicas: 1 + revision: main + modelName: trieve/private-model-example + hfToken: "hf_**********************************" +... +``` + + + +Update TVI to include your models + +```bash +helm upgrade -i vector-inference \ + oci://registry-1.docker.io/trieve/embeddings-helm \ + -f embedding_models.yaml +``` + + + +```sh +kubectl get ing +``` + + + diff --git a/vector-inference/embed.mdx b/vector-inference/embed.mdx new file mode 100644 index 0000000..5cb4153 --- /dev/null +++ b/vector-inference/embed.mdx @@ -0,0 +1,113 @@ +--- +title: 'Create Embedding' +sidebarTitle: 'POST /embed' +description: 'Get Embeddings. Returns a 424 status code if the model is not an embedding model' +--- + +Generating an embedding from a dense embedding model + + + +```json RAW Json +{ + "inputs": "The model input", + "prompt_name": null, + "normalize": true, + "truncate": false, + "truncation_direction": "right" +} +``` + +```sh curl +curl -X POST \ + -H "Content-Type: application/json"\ + -d '{"inputs": "test input"}' \ + --url "http://$ENDPOINT/embed" +``` + +```py python +import requests + +endpoint = "" + +requests.post(f"{endpoint}/embed", json={ + "inputs": ["test input", "test input 2"] +}); + +## or + +requests.post(f"{endpoint}/embed", json={ + "inputs": "test single input" +}); +``` + + + + +```json 200 Embeddings +[ + [ + 0.038483415, + -0.00076982786, + -0.020039458 + ... + ], + [ + 0.04496114, + -0.039057795, + -0.022400795, + ... 
+  ]
+]
+```
+
+```json 413
+{
+  "error": "Batch size error",
+  "error_type": "validation"
+}
+```
+
+```json 422
+{
+  "error": "Tokenization error",
+  "error_type": "validation"
+}
+```
+
+```json 424
+{
+  "error": "Inference failed",
+  "error_type": "backend"
+}
+```
+
+```json 429
+{
+  "error": "Model is overloaded",
+  "error_type": "overloaded"
+}
+```
+
+
+  Inputs that need to be embedded
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
diff --git a/vector-inference/embed_all.mdx b/vector-inference/embed_all.mdx
new file mode 100644
index 0000000..2d46cc0
--- /dev/null
+++ b/vector-inference/embed_all.mdx
@@ -0,0 +1,114 @@
+---
+title: 'Create Embedding'
+sidebarTitle: 'POST /embed_all'
+description: 'Get Embeddings. Returns a 424 status code if the model is not an embedding model'
+---
+
+Generating an embedding from a dense embedding model.
+
+
+```json Raw JSON
+{
+  "inputs": "The model input",
+  "prompt_name": null,
+  "truncate": false,
+  "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "test input"}' \
+  --url http://$ENDPOINT/embed_all
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/embed_all", json={
+    "inputs": ["test input", "test input 2"]
+})
+
+## or
+
+requests.post(f"{endpoint}/embed_all", json={
+    "inputs": "test single input"
+})
+```
+
+
+
+```json 200 Embeddings
+[
+  [
+    0.038483415,
+    -0.00076982786,
+    -0.020039458
+    ...
+  ],
+  [
+    0.04496114,
+    -0.039057795,
+    -0.022400795,
+    ...
+  ]
+]
+```
+
+```json 413
+{
+  "error": "Batch size error",
+  "error_type": "validation"
+}
+```
+
+```json 422
+{
+  "error": "Tokenization error",
+  "error_type": "validation"
+}
+```
+
+```json 424
+{
+  "error": "Inference failed",
+  "error_type": "backend"
+}
+```
+
+```json 429
+{
+  "error": "Model is overloaded",
+  "error_type": "overloaded"
+}
+```
+
+
+
+  Inputs that need to be embedded
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
diff --git a/vector-inference/embed_sparse.mdx b/vector-inference/embed_sparse.mdx
new file mode 100644
index 0000000..59d18b9
--- /dev/null
+++ b/vector-inference/embed_sparse.mdx
@@ -0,0 +1,127 @@
+---
+title: 'Create Sparse Embedding'
+sidebarTitle: 'POST /embed_sparse'
+description: 'Get Sparse Embeddings. Returns a 424 status code if the model is not a Splade embedding model'
+---
+
+Generating an embedding from a sparse embedding model.
+The main ones that we support right now are the Splade models.
+
+
+```json Raw JSON
+{
+  "inputs": "The model input",
+  "prompt_name": null,
+  "truncate": false,
+  "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "test input"}' \
+  --url http://$ENDPOINT/embed_sparse
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/embed_sparse", json={
+    "inputs": ["test input", "test input 2"]
+})
+
+## or
+
+requests.post(f"{endpoint}/embed_sparse", json={
+    "inputs": "test single input"
+})
+```
+
+
+
+```json 200 Embeddings
+[
+  // Embedding 1
+  [
+    {
+      "index": 1012,
+      "value": 0.9970703
+    },
+    {
+      "index": 4456,
+      "value": 2.7832031
+    }
+  ],
+  // Embedding 2
+  [
+    {
+      "index": 990,
+      "value": 2.783203
+    },
+    {
+      "index": 3021,
+      "value": 10.9970703
+    },
+    ...
+  ],
+  ...
+]
+```
+
+```json 413
+{
+  "error": "Batch size error",
+  "error_type": "validation"
+}
+```
+
+```json 422
+{
+  "error": "Tokenization error",
+  "error_type": "validation"
+}
+```
+
+```json 424
+{
+  "error": "Inference failed",
+  "error_type": "backend"
+}
+```
+
+```json 429
+{
+  "error": "Model is overloaded",
+  "error_type": "overloaded"
+}
+```
+
+
+
+  Inputs that need to be embedded
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
diff --git a/vector-inference/introduction.mdx b/vector-inference/introduction.mdx
new file mode 100644
index 0000000..b0ddc66
--- /dev/null
+++ b/vector-inference/introduction.mdx
@@ -0,0 +1,77 @@
+---
+title: Introduction
+description: Trieve Vector Inference is an on-prem solution for fast vector inference
+icon: rocket
+---
+
+## Inspiration
+
+SaaS offerings for text embeddings have 2 major issues:
+1) They have higher latency due to batch processing.
+2) They have heavy rate limits.
+
+Trieve Vector Inference was created so you can host dedicated embedding servers in your own cloud.
+
+## Performance Difference
+
+Benchmarks were run using [wrk2](https://github.com/giltene/wrk2) for 30 seconds with 12 threads and 40 active connections.
+
+The machine used for testing was an `m5.large` in `us-west-1`.
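+
+For reference, a load pattern like this can be reproduced with something along the following lines. This is a sketch, not the exact harness used for the numbers below: the target request rate, endpoint, and request body are assumptions you should adjust for the model being tested.
+
+```sh
+# wrk2 POST script: body and headers for the embedding request.
+cat > post.lua <<'EOF'
+wrk.method = "POST"
+wrk.body   = '{"inputs": "test input"}'
+wrk.headers["Content-Type"] = "application/json"
+EOF
+
+# 12 threads, 40 connections, 30 seconds, at an assumed constant request rate.
+wrk -t12 -c40 -d30s -R100 --latency -s post.lua "http://$ENDPOINT/embed"
+```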
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|----------|----------|
+| P50 Latency | 193.15 ms | 179.33 ms | 185.21 ms | 19.06 ms | 14.69 ms | 21.36 ms |
+| P90 Latency | 261.25 ms | 271.87 ms | 296.19 ms | 23.09 ms | 16.90 ms | 29.81 ms |
+| P99 Latency | 621.05 ms | 402.43 ms | 306.94 ms | 24.27 ms | 18.80 ms | 30.29 ms |
+| Requests Made | 324 | 324 | 324 | 324 | 324 | 324 |
+| Requests Failed | 0 | 0 | 3 | 0 | 0 | 0 |
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|----------|----------|
+| P50 Latency | 180.74 ms | 182.62 ms | 515.84 ms | 16.48 ms | 14.35 ms | 23.22 ms |
+| P90 Latency | 222.34 ms | 262.65 ms | 654.85 ms | 20.70 ms | 16.15 ms | 29.71 ms |
+| P99 Latency | 1.11 sec | 363.01 ms | 724.48 ms | 22.82 ms | 19.82 ms | 31.07 ms |
+| Requests Made | 2,991 | 2,991 | 2963 | 3,015 | 3,024 | 3,024 |
+| Requests Failed | 0 | 2,986 | 0 | 0 | 0 | 0 |
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|-----------|----------|
+| P50 Latency | 15.70 sec | 15.82 sec | 17.97 sec | 24.40 ms | 14.86 ms | 23.74 ms |
+| P90 Latency | 22.01 sec | 21.91 sec | 25.30 sec | 25.14 ms | 17.81 ms | 31.74 ms |
+| P99 Latency | 23.59 sec | 23.12 sec | 27.03 sec | 27.61 ms | 19.52 ms | 34.11 ms |
+| Requests Made | 6,234 | 6,771 | 2963 | 30,002 | 30,002 | 30,001 |
+| Requests Failed | 0 | 6,711 | 0 | 0 | 0 | 0 |
+
+
+
+\* Failed requests occurred when rate limiting was hit (the Jina AI rate limit is 60 RPM, or 300 RPM on the premium plan)
+
+\** `jina-embeddings-v2-base-en` on SageMaker with `ml.g4dn.xlarge`
+
+## See more
+
+
+    Adding Trieve Vector Inference into your AWS account
+
+    Using the `/embed` route
+
+    Check out the API Reference to see all of the available endpoints for Trieve Vector Inference
+
+    Check out the API Reference to see all of the available endpoints for Trieve Vector Inference
+
+
diff --git a/vector-inference/openai.mdx b/vector-inference/openai.mdx
new file mode 100644
index 0000000..778df13
--- /dev/null
+++ b/vector-inference/openai.mdx
@@ -0,0 +1,45 @@
+---
+title: "Using OpenAI SDK"
+icon: microchip-ai
+description: How to integrate TVI with existing OpenAI-compatible endpoints
+---
+
+Trieve Vector Inference is compatible with the OpenAI API. This means you're able to just replace the endpoint without changing any pre-existing code.
+Here's an example with the `openai` Python SDK.
+
+
+  ```sh
+  pip install openai requests python-dotenv
+  ```
+
+
+  Replace `base_url` with your embedding endpoint.
+
+  ```python openai_compatibility.py
+  import openai
+  import os
+  from dotenv import load_dotenv
+
+  load_dotenv()
+
+  # Your model's endpoint from `kubectl get ingress`.
+  # Include the /v1 suffix so the SDK's /embeddings path maps to TVI's /v1/embeddings route.
+  endpoint = "http://"
+
+  client = openai.OpenAI(
+      # This is the default and can be omitted
+      api_key=os.environ.get("OPENAI_API_KEY"),
+      base_url=endpoint
+  )
+
+  embedding = client.embeddings.create(
+      input="This is some example input",
+      model="BAAI/bge-m3"
+  )
+  ```
+
diff --git a/vector-inference/openai_compat.mdx b/vector-inference/openai_compat.mdx
new file mode 100644
index 0000000..76bc121
--- /dev/null
+++ b/vector-inference/openai_compat.mdx
@@ -0,0 +1,134 @@
+---
+title: 'OpenAI compatible embeddings route'
+sidebarTitle: 'POST /v1/embeddings'
+description: 'OpenAI compatible route. Returns a 424 status code if the model is not an embedding model'
+---
+
+Generating an embedding from a dense embedding model.
+
+
+```json Raw JSON
+{
+  "encoding_format": "float",
+  "input": "string",
+  "model": null,
+  "user": null
+}
+```
+
+```sh curl
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"input": "test input"}' \
+  --url http://$ENDPOINT/v1/embeddings
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/v1/embeddings", json={
+    "input": ["test input", "test input 2"]
+})
+
+## or
+
+requests.post(f"{endpoint}/v1/embeddings", json={
+    "input": "test single input"
+})
+```
+
+
+
+```json 200 Embeddings
+{
+  "data": [
+    {
+      "embedding": [
+        0.038483415,
+        -0.00076982786,
+        -0.020039458
+        ...
+      ],
+      "index": 0,
+      "object": "embedding"
+    },
+    {
+      "embedding": [
+        0.038483415,
+        -0.00076982786,
+        -0.020039458
+        ...
+      ],
+      "index": 1,
+      "object": "embedding"
+    },
+    ...
+  ],
+  "model": "thenlper/gte-base",
+  "object": "list",
+  "usage": {
+    "prompt_tokens": 512,
+    "total_tokens": 512
+  }
+}
+```
+
+```json 413
+{
+  "error": "Batch size error",
+  "error_type": "validation"
+}
+```
+
+```json 422
+{
+  "error": "Tokenization error",
+  "error_type": "validation"
+}
+```
+
+```json 424
+{
+  "error": "Inference failed",
+  "error_type": "backend"
+}
+```
+
+```json 429
+{
+  "error": "Model is overloaded",
+  "error_type": "overloaded"
+}
+```
+
+
+
+  Inputs that need to be embedded
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
diff --git a/vector-inference/pricing.mdx b/vector-inference/pricing.mdx
new file mode 100644
index 0000000..8966037
--- /dev/null
+++ b/vector-inference/pricing.mdx
@@ -0,0 +1,64 @@
+---
+title: Pricing
+description: The pricing design of Trieve Vector Inference
+mode: wide
+icon: money-bill
+---
+
+Trieve Vector Inference is meant to be an on-prem solution, so a license is needed for use.
+
+To obtain a license for Trieve Vector Inference, contact us:
+
+- Email us at humans@trieve.ai
+- [book a meeting](https://cal.com/nick.k/meet)
+- Call us @ 628-222-4090
+
+| Price | Includes |
+|-------|----------|
+| $0\* per month | Hosting License, Unlimited Clusters |
+| $500 per month | Hosting License, Unlimited Clusters, Dedicated Slack Support |
+| $1000+ per month | Hosting License, Unlimited Clusters, Dedicated Slack Support, 99.9% SLA, Managed and hosted by Trieve |
+
+\* Free for companies with < 10 employees or pre-seed startups
diff --git a/vector-inference/rerank.mdx b/vector-inference/rerank.mdx
new file mode 100644
index 0000000..fed542d
--- /dev/null
+++ b/vector-inference/rerank.mdx
@@ -0,0 +1,53 @@
+---
+title: "Working with Reranker"
+mode: wide
+icon: arrow-up-arrow-down
+---
+
+## What is a Reranker / CrossEncoder?
+
+A `Reranker` model provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.
+
+## Using Rerankers with Trieve Vector Inference
+
+
+To use a reranker model with Trieve Vector Inference, you will need to update your `embedding_models.yaml` file.
+
+```yaml embedding_models.yaml
+...
+models:
+  ...
+  my-reranker-model:
+    replicas: 1
+    revision: main
+    modelName: BAAI/bge-reranker-large
+...
+```
+
+
+Update TVI to include your models:
+
+```bash
+helm upgrade -i vector-inference \
+  oci://registry-1.docker.io/trieve/embeddings-helm \
+  -f embedding_models.yaml
+```
+
+
+```sh
+kubectl get ing
+```
+
+The output looks like this:
+
+```
+NAME                                              CLASS   HOSTS   ADDRESS                                                                  PORTS   AGE
+vector-inference-embedding-bge-reranker-ingress   alb     *       k8s-default-vectorin-b09efe8cf6-890425945.us-west-1.elb.amazonaws.com   80      77m
+```
+
+
+
diff --git a/vector-inference/reranker.mdx b/vector-inference/reranker.mdx
new file mode 100644
index 0000000..023f723
--- /dev/null
+++ b/vector-inference/reranker.mdx
@@ -0,0 +1,140 @@
+---
+title: 'Get ranks'
+sidebarTitle: 'POST /rerank'
+description: 'Runs Reranker. Returns a 424 status code if the model is not a Reranker model'
+---
+
+
+```json Raw JSON
+{
+  "query": "What are some good electric cars",
+  "texts": [
+    "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+    "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+  ],
+  "raw_scores": false,
+  "return_text": false,
+  "truncate": false,
+  "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "What are some good electric cars",
+    "texts": [
+      "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+      "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+    ],
+    "raw_scores": false,
+    "return_text": false,
+    "truncate": false,
+    "truncation_direction": "right"
+  }' \
+  --url http://$ENDPOINT/rerank
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/rerank", json={
+    "query": "What are some good electric cars",
+    "texts": [
+        "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+        "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+    ],
+    "raw_scores": False,
+    "return_text": False,
+    "truncate": False,
+    "truncation_direction": "right"
+})
+
+## or, relying on the defaults for the optional fields
+
+requests.post(f"{endpoint}/rerank", json={
+    "query": "What are some good electric cars",
+    "texts": ["The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."]
+})
+```
+
+
+
+```json 200 Ranks
+[
+  {
+    "index":1,
+    "score":0.15253653,
+    // if return_text = true
+    "text": "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+  },
+  {
+    "index":0,
+    "score":0.00498227,
+    // if return_text = true
+    "text": "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ..."
+  }
+]
+```
+
+```json 413
+{
+  "error": "Batch size error",
+  "error_type": "validation"
+}
+```
+
+```json 422
+{
+  "error": "Tokenization error",
+  "error_type": "validation"
+}
+```
+
+```json 424
+{
+  "error": "Inference failed",
+  "error_type": "backend"
+}
+```
+
+```json 429
+{
+  "error": "Model is overloaded",
+  "error_type": "overloaded"
+}
+```
+
+
+
+  The query to compare each text against
+
+
+  The texts to rerank against the query
+
+
+  Output the raw reranker score or the normalized score between 0 and 1.
+  When `false`, the score is between 0 and 1; otherwise the range is indeterminate
+
+
+  Return the text along with each rank
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
diff --git a/vector-inference/splade.mdx b/vector-inference/splade.mdx
new file mode 100644
index 0000000..a28855d
--- /dev/null
+++ b/vector-inference/splade.mdx
@@ -0,0 +1,63 @@
+---
+title: "Working with Splade v2"
+icon: magnifying-glass
+description: Learn how to use Splade with TVI.
+mode: wide
+---
+
+## What is Splade?
+
+`Splade` is similar to other inverted index approaches like `bm25`. `Splade` includes neural term expansion, meaning that it is able to match on synonyms much better than traditional bm25.
+
+## Using Splade with Trieve Vector Inference
+
+
+To use Splade with Trieve Vector Inference, you will need to configure both the `doc` and `query` models.
+
+The Splade `document` model is the model you use to encode documents, while the `query` model is the one that encodes the query you will be searching with.
+
+```yaml embedding_models.yaml
+models:
+  # ...
+  spladeDoc:
+    replicas: 1
+    modelName: naver/efficient-splade-VI-BT-large-doc
+    isSplade: true
+  spladeQuery:
+    replicas: 1
+    modelName: naver/efficient-splade-VI-BT-large-query
+    isSplade: true
+  # ...
+```
+
+
+Update TVI to include your models:
+
+```bash
+helm upgrade -i vector-inference \
+  oci://registry-1.docker.io/trieve/embeddings-helm \
+  -f embedding_models.yaml
+```
+
+
+```sh
+kubectl get ing
+```
+
+
+  ```sh
+  ENDPOINT="k8s-default-vectorin...elb.amazonaws.com"
+
+  curl -X POST \
+    -H "Content-Type: application/json" \
+    -d '{"inputs": "test input"}' \
+    --url http://$ENDPOINT/embed_sparse
+  ```
+
+  For more information check out the [API reference](/vector-inference/embed_sparse) for sparse vectors
+
+
diff --git a/vector-inference/troubleshooting.mdx b/vector-inference/troubleshooting.mdx
new file mode 100644
index 0000000..355b91b
--- /dev/null
+++ b/vector-inference/troubleshooting.mdx
@@ -0,0 +1,51 @@
+---
+title: Troubleshooting
+icon: 'triangle-exclamation'
+description: 'Common issues with self hosting'
+---
+
+There are a lot of moving parts in `eksctl`. Here’s a list of common issues we’ve seen customers run into:
+
+
+
+    This error happens when deleting the cluster and some pods in `kube-system` refuse to stop.
+    To fix this, run the following command, and the deletion process should be able to proceed.
+
+    ```sh
+    kubectl get pods -n kube-system -o name | xargs kubectl -n kube-system delete
+    ```
+
+
+
+    This happens when the cluster doesn't properly delete its load balancers. To fix this:
+
+
+
+      Run this to get the available load balancers:
+      ```sh
+      kubectl get ingress
+      ```
+
+      The output should look like this:
+      ```
+      NAME                                           CLASS   HOSTS   ADDRESS                                                                   PORTS   AGE
+      vector-inference-embedding-bgem3-ingress       alb     *       k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com   80      3d19h
+      vector-inference-embedding-nomic-ingress       alb     *       k8s-default-vectorin-eb664ce6e9-238019709.us-east-2.elb.amazonaws.com    80      2d20h
+      vector-inference-embedding-spladedoc-ingress   alb     *       k8s-default-vectorin-8af81ad2bd-192706382.us-east-2.elb.amazonaws.com    80      3d19h
+      ```
+
+
+      Go to EC2 > Load Balancers ([link](https://us-west-1.console.aws.amazon.com/ec2/home?region=us-west-1#LoadBalancers:v=3;$case=tags:false%5C,client:false;$regex=tags:false%5C,client:false)) and delete the ALBs that match the ingress hostnames
+
+
+      The delete script should then be able to resume
+
+