diff --git a/guides/using-trieve-vector-inference.mdx b/guides/using-trieve-vector-inference.mdx
new file mode 100644
index 0000000..85a3acc
--- /dev/null
+++ b/guides/using-trieve-vector-inference.mdx
@@ -0,0 +1,105 @@
+---
+title: 'Install Trieve Vector Inference'
+description: 'Install Trieve Vector Inference'
+icon: 'files'
+---
+
+## Installation Requirements
+
+- `eksctl` >= 0.171 ([eksctl installation guide](https://eksctl.io/installation))
+- `aws` >= 2.15 ([aws installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
+- `kubectl` >= 1.28 ([kubectl installation guide](https://kubernetes.io/docs/tasks/tools/#kubectl))
+- `helm` >= 3.14 ([helm installation guide](https://helm.sh/docs/intro/install/#helm))
+
+You'll also need a license to run Trieve Vector Inference.
+
+### Getting your license
+
+Contact us to get a license:
+- Email us at humans@trieve.ai
+- [Book a meeting](https://cal.com/nick.k/meet)
+- Call us at 628-222-4090
+
+## Check AWS quota
+
+Ensure you have quotas for both GPUs and load balancers:
+
+1) At least **4 vCPUs** for On-Demand G and VT instances in the region of choice.
+
+Check the quota for *us-east-2* [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+2) At least **1 load balancer** per model you want to deploy.
+
+Check the quota for *us-east-2* [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+## Deploying the Cluster
+
+### Setting up environment variables
+
+These variables are used to create the EKS cluster and install the needed plugins.
+
+Your AWS Account ID:
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+```
+
+Your AWS REGION:
+```sh
+export AWS_REGION=us-east-2
+```
+
+Your Kubernetes cluster name:
+
+```sh
+export CLUSTER_NAME=trieve-gpu
+```
+
+Your machine types. We recommend `g4dn.xlarge` for the GPU nodes, as it is the cheapest GPU instance on AWS. A single small CPU node is also needed for extra utility workloads.
+
+```sh
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+```
+
+### Create your cluster
+
+Download the `bootstrap-eks.sh` script and run it with bash:
+
+```sh
+wget cdn.trieve.ai/bootstrap-eks.sh
+bash bootstrap-eks.sh
+```
+
+This will take around 25 minutes to complete.
+
+## Install Trieve Vector Inference
+
+### Specify your embedding models
+
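+First download the example configuration file (the same one used in the AWS installation guide):
+
+```sh
+wget https://cdn.trieve.ai/embedding_models.yaml
+```
+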
+Modify `embedding_models.yaml` to define the models that you want to use.
+
+### Install the helm chart
+
+```sh
+helm upgrade -i vector-inference oci://registry-1.docker.io/trieve/embeddings-helm -f embedding_models.yaml
+```
+
+### Get your model endpoints
+
+```sh
+kubectl get ingress
+```
+
+
+
+## Using Trieve Vector Inference
+
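+Set `ENDPOINT` to one of the addresses returned by `kubectl get ingress` (the hostname below is illustrative), then call the model's `/embed` route:
+
+```sh
+export ENDPOINT="k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com"
+```
+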
+```sh
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "cancer"}' \
+  --url "http://$ENDPOINT/embed"
+```
+
+## Optional: Delete the cluster
+
+```sh
+CLUSTER_NAME=trieve-gpu
+REGION=us-east-2
+
+aws eks update-kubeconfig --region ${REGION} --name ${CLUSTER_NAME}
+
+helm uninstall vector-inference
+helm uninstall nvdp -n kube-system
+helm uninstall aws-load-balancer-controller -n kube-system
+eksctl delete cluster --region=${REGION} --name=${CLUSTER_NAME}
+```
diff --git a/mint.json b/mint.json
index 740ad85..b8404d5 100644
--- a/mint.json
+++ b/mint.json
@@ -35,6 +35,10 @@
{
"name": "API Reference",
"url": "api-reference"
+ },
+ {
+ "name": "Vector Inference",
+ "url": "vector-inference"
}
],
"anchors": [
@@ -66,7 +70,9 @@
"getting-started/introduction",
"getting-started/quickstart",
"getting-started/trieve-primitives",
- "getting-started/screenshots"
+ "getting-started/screenshots",
+ "vector-inference/introduction",
+ "vector-inference/pricing"
]
},
{
@@ -75,7 +81,9 @@
"self-hosting/docker-compose",
"self-hosting/local-kube",
"self-hosting/aws",
- "self-hosting/gcp"
+ "self-hosting/gcp",
+ "vector-inference/aws-installation",
+ "vector-inference/troubleshooting"
]
},
{
@@ -85,7 +93,21 @@
"guides/uploading-files",
"guides/searching-with-trieve",
"guides/recommending-with-trieve",
- "guides/RAG-with-trieve"
+ "guides/RAG-with-trieve",
+ "vector-inference/rerank",
+ "vector-inference/splade",
+ "vector-inference/dense",
+ "vector-inference/openai"
+ ]
+ },
+ {
+ "group": "API Reference",
+ "pages": [
+ "vector-inference/embed",
+ "vector-inference/embed_all",
+ "vector-inference/embed_sparse",
+ "vector-inference/reranker",
+ "vector-inference/openai_compat"
]
},
{
diff --git a/vector-inference/aws-installation.mdx b/vector-inference/aws-installation.mdx
new file mode 100644
index 0000000..ed82233
--- /dev/null
+++ b/vector-inference/aws-installation.mdx
@@ -0,0 +1,191 @@
+---
+title: 'AWS Installation'
+description: 'Install Trieve Vector Inference in your own AWS account'
+icon: 'aws'
+---
+
+## Installation Requirements
+
+- `eksctl` >= 0.171 ([eksctl installation guide](https://eksctl.io/installation))
+- `aws` >= 2.15 ([aws installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
+- `kubectl` >= 1.28 ([kubectl installation guide](https://kubernetes.io/docs/tasks/tools/#kubectl))
+- `helm` >= 3.14 ([helm installation guide](https://helm.sh/docs/intro/install/#helm))
+
+You'll also need a license to run Trieve Vector Inference.
+
+### Getting your license
+
+Contact us:
+- Email us at humans@trieve.ai
+- [Book a meeting](https://cal.com/nick.k/meet)
+- Call us at 628-222-4090
+
+Our pricing is listed [here](/vector-inference/pricing).
+
+## Check AWS quota
+
+
+ Ensure you have quotas for both GPUs and load balancers.
+
+
+1) At least **4 vCPUs** for On-Demand G and VT instances in your region of choice.
+
+Check the quota [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
+2) At least **1 load balancer** per model you want to deploy.
+
+Check the quota [here](https://us-west-1.console.aws.amazon.com/servicequotas/home/services/ec2/quotas/L-3819A6DF)
+
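+You can also check these quotas from the CLI with `aws service-quotas`; a minimal sketch, assuming the quota codes below (verify them against the console pages linked above):
+
+```sh
+# On-Demand G and VT instance vCPUs (EC2) -- quota code is an assumption, verify in the console
+aws service-quotas get-service-quota \
+  --service-code ec2 \
+  --quota-code L-DB2E81BA
+
+# Application Load Balancers per region -- quota code is an assumption, verify in the console
+aws service-quotas get-service-quota \
+  --service-code elasticloadbalancing \
+  --quota-code L-53DA6B97
+```
+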
+## Deploying the Cluster
+
+### Setting up environment variables
+
+These variables are used to create the EKS cluster and install the needed plugins.
+
+Your AWS Account ID:
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+```
+
+Your AWS REGION:
+```sh
+export AWS_REGION=us-east-2
+```
+
+Your Kubernetes cluster name:
+
+```sh
+export CLUSTER_NAME=trieve-gpu
+```
+
+Your machine types. We recommend `g4dn.xlarge` for the GPU nodes, as it is the cheapest GPU instance on AWS. A single small CPU node is also needed for extra utility workloads.
+
+```sh
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+```
+
+Disable AWS CLI pagination (optional):
+
+```sh
+export AWS_PAGER=""
+```
+
+**To use our recommended defaults**
+
+```sh
+export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query "Account" --output text)"
+export AWS_REGION=us-east-2
+export CLUSTER_NAME=trieve-gpu
+export CPU_INSTANCE_TYPE=t3.small
+export GPU_INSTANCE_TYPE=g4dn.xlarge
+export GPU_COUNT=1
+export AWS_PAGER=""
+```
+
+### Create your cluster
+
+Download the `bootstrap-eks.sh` script
+```sh
+wget cdn.trieve.ai/bootstrap-eks.sh
+```
+
+Run `bootstrap-eks.sh` with bash
+
+```sh
+bash bootstrap-eks.sh
+```
+
+This will take around 25 minutes to complete.
+
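+Once it finishes, you can point `kubectl` at the new cluster and confirm the nodes are ready (the bootstrap script may already configure your kubeconfig; shown here for completeness):
+
+```sh
+aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME
+kubectl get nodes
+```
+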
+## Install Trieve Vector Inference
+
+### Configure `embedding_models.yaml`
+
+First download the example configuration file
+
+```sh
+wget https://cdn.trieve.ai/embedding_models.yaml
+```
+
+Now you can modify your `embedding_models.yaml`; this file defines all the models that you want to use:
+
+```yaml embedding_models.yaml
+accessKey: ""
+
+models:
+  bgeM3:
+    replicas: 2
+    revision: main
+    modelName: BAAI/bge-m3 # The end of the URL https://huggingface.co/BAAI/bge-m3
+    hfToken: "" # If you have a private hugging face repo
+  spladeDoc:
+    replicas: 2
+    modelName: naver/efficient-splade-VI-BT-large-doc # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-doc
+    isSplade: true
+  spladeQuery:
+    replicas: 2
+    modelName: naver/efficient-splade-VI-BT-large-query # The end of the URL https://huggingface.co/naver/efficient-splade-VI-BT-large-query
+    isSplade: true
+  bge-reranker:
+    replicas: 2
+    modelName: BAAI/bge-reranker-large
+    isSplade: false
+```
+
+### Install the helm chart
+
+```sh
+helm upgrade -i vector-inference \
+ oci://registry-1.docker.io/trieve/embeddings-helm \
+ -f embedding_models.yaml
+```
+
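+The model pods can take a few minutes to pull images and download model weights. You can watch them come up with a standard `kubectl` check (this assumes the chart installs into the default namespace):
+
+```sh
+kubectl get pods -w
+```
+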
+### Get your model endpoints
+
+```sh
+kubectl get ingress
+```
+
+The output looks something like this:
+
+```
+NAME CLASS HOSTS ADDRESS PORTS AGE
+vector-inference-embedding-bge-reranker-ingress alb * k8s-default-vectorin-18b7ade77a-2040086997.us-east-2.elb.amazonaws.com 80 73s
+vector-inference-embedding-bgem3-ingress alb * k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com 80 73s
+vector-inference-embedding-spladedoc-ingress alb * k8s-default-vectorin-8af81ad2bd-192706382.us-east-2.elb.amazonaws.com 80 72s
+vector-inference-embedding-spladequery-ingress alb * k8s-default-vectorin-10404abaee-1617952667.us-east-2.elb.amazonaws.com 80 3m20s
+```
+
+## Using Trieve Vector Inference
+
+Each `ingress` point uses its own Application Load Balancer within AWS. The `ADDRESS` column is the model's endpoint, against which you can make [dense embedding](/vector-inference/embed), [sparse embedding](/vector-inference/embed_sparse), or [reranker](/vector-inference/reranker) calls, depending on the models you chose.
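+
+For example, a quick sanity check against the BGE-M3 ingress from the output above (substitute your own `ADDRESS` value):
+
+```sh
+export ENDPOINT="k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com"
+
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "test input"}' \
+  --url "http://$ENDPOINT/embed"
+```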
+
+Check out the guides for more information on configuration:
+
+
+
+ How to set up a dedicated instance for the sparse splade embedding model
+
+
+ How to use private or gated Hugging Face models, or any other models that you want
+
+
+ Trieve Vector Inference has OpenAI-compatible routes
+
+
+
+## Optional: Delete the cluster
+
+```sh
+CLUSTER_NAME=trieve-gpu
+REGION=us-east-2
+
+aws eks update-kubeconfig --region ${REGION} --name ${CLUSTER_NAME}
+
+helm uninstall vector-inference
+helm uninstall nvdp -n kube-system
+helm uninstall aws-load-balancer-controller -n kube-system
+eksctl delete cluster --region=${REGION} --name=${CLUSTER_NAME}
+```
diff --git a/vector-inference/dense.mdx b/vector-inference/dense.mdx
new file mode 100644
index 0000000..00cce87
--- /dev/null
+++ b/vector-inference/dense.mdx
@@ -0,0 +1,48 @@
+---
+title: 'Using Custom Models'
+icon: brackets-curly
+description: How to use gated or private models hosted on Hugging Face
+mode: wide
+---
+
+## Custom or fine tuned models in Trieve Vector Inference
+
+The [open source text models](https://huggingface.co/spaces/mteb/leaderboard) on Hugging Face may not always be what you want; you can also serve your own custom or fine-tuned models.
+
+
+
+
+To use a private or custom model with Trieve Vector Inference, you will need to update your `embedding_models.yaml` file.
+
+If the model is a private or gated Hugging Face model, you will need to include your Hugging Face API token:
+
+```yaml embedding_models.yaml
+...
+models:
+  ...
+  my-custom-model:
+    replicas: 1
+    revision: main
+    modelName: trieve/private-model-example
+    hfToken: "hf_**********************************"
+...
+```
+
+
+
+Update TVI to include your models
+
+```bash
+helm upgrade -i vector-inference \
+ oci://registry-1.docker.io/trieve/embeddings-helm \
+ -f embedding_models.yaml
+```
+
+
+
+```sh
+kubectl get ing
+```
+
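+Once the ingress for `my-custom-model` has an address, you can embed against it like any other dense model; a minimal sketch (the hostname placeholder is illustrative):
+
+```sh
+export ENDPOINT="<address-from-kubectl-get-ing>"
+
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"inputs": "test input"}' \
+  --url "http://$ENDPOINT/embed"
+```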
+
+
diff --git a/vector-inference/embed.mdx b/vector-inference/embed.mdx
new file mode 100644
index 0000000..5cb4153
--- /dev/null
+++ b/vector-inference/embed.mdx
@@ -0,0 +1,113 @@
+---
+title: 'Create Embedding'
+sidebarTitle: 'POST /embed'
+description: 'Get Embeddings. Returns a 424 status code if the model is not an embedding model'
+---
+
+Generating an embedding from a dense embedding model
+
+
+
+```json RAW Json
+{
+ "inputs": "The model input",
+ "prompt_name": null,
+ "normalize": true,
+ "truncate": false,
+ "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+ -H "Content-Type: application/json"\
+ -d '{"inputs": "test input"}' \
+ --url "http://$ENDPOINT/embed"
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/embed", json={
+ "inputs": ["test input", "test input 2"]
+});
+
+## or
+
+requests.post(f"{endpoint}/embed", json={
+ "inputs": "test single input"
+});
+```
+
+
+
+
+```json 200 Embeddings
+[
+ [
+ 0.038483415,
+ -0.00076982786,
+ -0.020039458
+ ...
+ ],
+ [
+ 0.04496114,
+ -0.039057795,
+ -0.022400795,
+ ...
+ ]
+]
+```
+
+```json 413
+{
+ "error": "Batch size error",
+ "error_type": "validation"
+}
+```
+
+```json 422
+{
+ "error": "Tokenization error",
+ "error_type": "validation"
+}
+```
+
+```json 424
+{
+ "error": "Inference failed",
+ "error_type": "backend"
+}
+```
+
+```json 429
+{
+ "error": "Model is overloaded",
+ "error_type": "overloaded"
+}
+```
+
+
+
+ Inputs that need to be embedded
+
+
+
+
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
+
diff --git a/vector-inference/embed_all.mdx b/vector-inference/embed_all.mdx
new file mode 100644
index 0000000..2d46cc0
--- /dev/null
+++ b/vector-inference/embed_all.mdx
@@ -0,0 +1,114 @@
+---
+title: 'Create Embedding'
+sidebarTitle: 'POST /embed_all'
+description: 'Get Embeddings. Returns a 424 status code if the model is not an embedding model'
+---
+
+Generating an embedding from a dense embedding model
+
+
+
+```json RAW Json
+{
+ "inputs": "The model input",
+ "prompt_name": null,
+ "truncate": false,
+ "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+ -H "Content-Type: application/json"\
+ -d '{"inputs": "test input"}' \
+ --url http://$ENDPOINT/embed_all
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/embed_all", json={
+ "inputs": ["test input", "test input 2"]
+});
+
+## or
+
+requests.post(f"{endpoint}/embed_all", json={
+ "inputs": "test single input"
+});
+
+
+```
+
+
+
+
+```json 200 Embeddings
+[
+ [
+ 0.038483415,
+ -0.00076982786,
+ -0.020039458
+ ...
+ ],
+ [
+ 0.04496114,
+ -0.039057795,
+ -0.022400795,
+ ...
+ ]
+]
+```
+
+```json 413
+{
+ "error": "Batch size error",
+ "error_type": "validation"
+}
+```
+
+```json 422
+{
+ "error": "Tokenization error",
+ "error_type": "validation"
+}
+```
+
+```json 424
+{
+ "error": "Inference failed",
+ "error_type": "backend"
+}
+```
+
+```json 429
+{
+ "error": "Model is overloaded",
+ "error_type": "overloaded"
+}
+```
+
+
+
+
+
+ Inputs that need to be embedded
+
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
+
+
diff --git a/vector-inference/embed_sparse.mdx b/vector-inference/embed_sparse.mdx
new file mode 100644
index 0000000..59d18b9
--- /dev/null
+++ b/vector-inference/embed_sparse.mdx
@@ -0,0 +1,127 @@
+---
+title: 'Create Sparse Embedding'
+sidebarTitle: 'POST /embed_sparse'
+description: 'Get Sparse Embeddings. Returns a 424 status code if the model is not a Splade embedding model'
+---
+
+Generating an embedding from a sparse embedding model.
+The main ones that we support right now are the Splade models.
+
+
+
+```json RAW Json
+{
+ "inputs": "The model input",
+ "prompt_name": null,
+ "truncate": false,
+ "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+ -H "Content-Type: application/json"\
+ -d '{"inputs": "test input"}' \
+ --url http://$ENDPOINT/embed_sparse
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/embed_sparse", json={
+ "inputs": ["test input", "test input 2"]
+});
+
+## or
+
+requests.post(f"{endpoint}/embed_sparse", json={
+ "inputs": "test single input"
+});
+
+
+```
+
+
+
+
+```json 200 Embeddings
+[
+ // Embedding 1
+ [
+ {
+ "index": 1012,
+ "value": 0.9970703
+ },
+ {
+ "index": 4456,
+ "value": 2.7832031
+ }
+ ],
+ // Embedding 2
+ [
+ {
+ "index": 990,
+ "value": 2.783203
+ },
+ {
+ "index": 3021,
+ "value": 10.9970703
+ },
+ ...
+ ],
+ ...
+]
+```
+
+```json 413
+{
+ "error": "Batch size error",
+ "error_type": "validation"
+}
+```
+
+```json 422
+{
+ "error": "Tokenization error",
+ "error_type": "validation"
+}
+```
+
+```json 424
+{
+ "error": "Inference failed",
+ "error_type": "backend"
+}
+```
+
+```json 429
+{
+ "error": "Model is overloaded",
+ "error_type": "overloaded"
+}
+```
+
+
+
+
+
+ Inputs that need to be embedded
+
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
+
+
diff --git a/vector-inference/introduction.mdx b/vector-inference/introduction.mdx
new file mode 100644
index 0000000..b0ddc66
--- /dev/null
+++ b/vector-inference/introduction.mdx
@@ -0,0 +1,77 @@
+---
+title: Introduction
+description: Trieve Vector Inference is an on-prem solution for fast vector inference
+icon: rocket
+---
+
+## Inspiration
+
+SaaS offerings for text embeddings have 2 major issues:
+1) They have higher latency, due to batch processing.
+2) They have heavy rate limits.
+
+Trieve Vector Inference was created so you can host dedicated embedding servers in your own cloud.
+
+## Performance Difference
+
+Benchmarks were run using [wrk2](https://github.com/giltene/wrk2) over 30 seconds with 12 threads and 40 active connections.
+
+The machine used for testing was an `m5.large` in `us-west-1`.
+
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|----------|----------|
+| P50 Latency | 193.15 ms | 179.33 ms | 185.21 ms | 19.06 ms | 14.69 ms | 21.36 ms |
+| P90 Latency | 261.25 ms | 271.87 ms | 296.19 ms | 23.09 ms | 16.90 ms | 29.81 ms |
+| P99 Latency | 621.05 ms | 402.43 ms | 306.94 ms | 24.27 ms | 18.80 ms | 30.29 ms |
+| Requests Made | 324 | 324 | 324 | 324 | 324 | 324 |
+| Requests Failed | 0 | 0 | 3 | 0 | 0 | 0 |
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|----------|----------|
+| P50 Latency | 180.74 ms | 182.62 ms | 515.84 ms | 16.48 ms | 14.35 ms | 23.22 ms |
+| P90 Latency | 222.34 ms | 262.65 ms | 654.85 ms | 20.70 ms | 16.15 ms | 29.71 ms |
+| P99 Latency | 1.11 sec | 363.01 ms | 724.48 ms | 22.82 ms | 19.82 ms | 31.07 ms |
+| Requests Made | 2,991 | 2,991 | 2963 | 3,015 | 3,024 | 3,024 |
+| Requests Failed | 0 | 2,986 | 0 | 0 | 0 | 0 |
+
+
+
+| | OPENAI Cloud | JINA AI Cloud* | JINA (SageMaker)** | TVI Jina | TVI BGE-M3 | TVI Nomic |
+|-------------|---------------|----------------|---------------------|-------------|-----------|----------|
+| P50 Latency | 15.70 sec | 15.82 sec | 17.97 sec | 24.40 ms | 14.86 ms | 23.74 ms |
+| P90 Latency | 22.01 sec | 21.91 sec | 25.30 sec | 25.14 ms | 17.81 ms | 31.74 ms |
+| P99 Latency | 23.59 sec | 23.12 sec | 27.03 sec | 27.61 ms | 19.52 ms | 34.11 ms |
+| Requests Made | 6,234 | 6,771 | 2963 | 30,002 | 30,002 | 30,001 |
+| Requests Failed | 0 | 6,711 | 0 | 0 | 0 | 0 |
+
+
+
+
+\* Failed requests occurred when rate limiting was hit (the Jina AI rate limit is 60 RPM, or 300 RPM on the premium plan)
+
+\** `jina-embeddings-v2-base-en` on SageMaker with `ml.g4dn.xlarge`
+
+## See more
+
+
+
+ Adding Trieve Vector Inference into your AWS account
+
+
+
+ Using the `/embed` route
+
+
+
+ Check out the API Reference to see all of the available endpoints for Trieve Vector Inference
+
+
+
+ Check out the API Reference to see all of the available endpoints for Trieve Vector Inference
+
+
diff --git a/vector-inference/openai.mdx b/vector-inference/openai.mdx
new file mode 100644
index 0000000..778df13
--- /dev/null
+++ b/vector-inference/openai.mdx
@@ -0,0 +1,45 @@
+---
+title: "Using OpenAI SDK"
+icon: microchip-ai
+description: How to integrate TVI with existing OpenAI-compatible endpoints
+---
+
+Trieve Vector Inference is compatible with the OpenAI API. This means you can simply swap the endpoint without changing any pre-existing code.
+Here's an example with the `openai` Python SDK.
+
+
+
+ ```sh
+ pip install openai python-dotenv
+ ```
+
+
+
+ Replace `base_url` with your embedding endpoint.
+
+ ```python openai_compatibility.py
+ import os
+
+ import openai
+ from dotenv import load_dotenv
+
+ load_dotenv()
+
+ # Your model's ADDRESS from `kubectl get ingress`.
+ # The OpenAI-compatible routes are served under /v1/.
+ endpoint = "http://<your-model-endpoint>/v1/"
+
+ client = openai.OpenAI(
+     # This is the default and can be omitted
+     api_key=os.environ.get("OPENAI_API_KEY"),
+     base_url=endpoint,
+ )
+
+ embedding = client.embeddings.create(
+     input="This is some example input",
+     model="BAAI/bge-m3",
+ )
+
+ print(embedding.data[0].embedding[:5])
+ ```
+
+
diff --git a/vector-inference/openai_compat.mdx b/vector-inference/openai_compat.mdx
new file mode 100644
index 0000000..76bc121
--- /dev/null
+++ b/vector-inference/openai_compat.mdx
@@ -0,0 +1,134 @@
+---
+title: 'OpenAI compatible embeddings route'
+sidebarTitle: 'POST /v1/embeddings'
+description: 'OpenAI compatible route. Returns a 424 status code if the model is not an embedding model'
+---
+
+Generating an embedding from a dense embedding model
+
+
+
+```json Raw JSON
+{
+  "encoding_format": "float",
+  "input": "string",
+  "model": null,
+  "user": null
+}
+```
+
+```sh curl
+curl -X POST \
+ -H "Content-Type: application/json"\
+ -d '{"input": "test input"}' \
+ --url http://$ENDPOINT/v1/embeddings
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/v1/embeddings", json={
+ "input": ["test input", "test input 2"]
+});
+
+## or
+
+requests.post(f"{endpoint}/v1/embeddings", json={
+ "input": "test single input"
+});
+
+
+```
+
+
+
+
+```json 200 Embeddings
+{
+ "data": [
+ {
+ "embedding": [
+ 0.038483415,
+ -0.00076982786,
+ -0.020039458
+ ...
+ ],
+ "index": 0,
+ "object": "embedding"
+ },
+ {
+ "embedding": [
+ 0.038483415,
+ -0.00076982786,
+ -0.020039458
+ ...
+ ],
+ "index": 1,
+ "object": "embedding"
+ },
+ ...
+ ],
+ "model": "thenlper/gte-base",
+ "object": "list",
+ "usage": {
+ "prompt_tokens": 512,
+ "total_tokens": 512
+ }
+}
+```
+
+```json 413
+{
+ "error": "Batch size error",
+ "error_type": "validation"
+}
+```
+
+```json 422
+{
+ "error": "Tokenization error",
+ "error_type": "validation"
+}
+```
+
+```json 424
+{
+ "error": "Inference failed",
+ "error_type": "backend"
+}
+```
+
+```json 429
+{
+ "error": "Model is overloaded",
+ "error_type": "overloaded"
+}
+```
+
+
+
+
+
+ Inputs that need to be embedded
+
+
+
+
+
+
+The name of the prompt that should be used for encoding. If not set, no prompt will be applied.
+
+Must be a key in the `sentence-transformers` configuration prompts dictionary.
+
+For example if `prompt_name` is **"doc"** then the sentence **"How to get fast inference?"** will be encoded as **"doc: How to get fast inference?"** because the prompt text will be prepended before any text to encode.
+
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
+
+
diff --git a/vector-inference/pricing.mdx b/vector-inference/pricing.mdx
new file mode 100644
index 0000000..8966037
--- /dev/null
+++ b/vector-inference/pricing.mdx
@@ -0,0 +1,64 @@
+---
+title: Pricing
+description: The pricing design for Trieve Vector Inference
+mode: wide
+icon: money-bill
+---
+
+Trieve Vector Inference is meant to be an on-prem solution, so a license is needed to use it.
+
+To obtain a license for Trieve Vector Inference, contact us:
+
+- Email us at humans@trieve.ai
+- [Book a meeting](https://cal.com/nick.k/meet)
+- Call us at 628-222-4090
+
+
+
+
+
+
+
+ $0*
+ per month
+
+
+
+
+ Hosting License
+
+ Unlimited Clusters
+
+
+
+
+
+
+
+
+ $500
+ per month
+
+
+
+
+ Hosting License
+
+ Unlimited Clusters
+
+ Dedicated Slack Support
+
+
+
+
+
+
+
+
+
+ $1000+
+ per month
+
+
+
+
+ Hosting License
+
+ Unlimited Clusters
+
+ Dedicated Slack Support
+
+ 99.9% SLA
+
+ Managed and hosted by Trieve
+
+
+
+\* Free for < 10 employees or Pre-seed
diff --git a/vector-inference/rerank.mdx b/vector-inference/rerank.mdx
new file mode 100644
index 0000000..fed542d
--- /dev/null
+++ b/vector-inference/rerank.mdx
@@ -0,0 +1,53 @@
+---
+title: "Working with Reranker"
+mode: wide
+icon: arrow-up-arrow-down
+---
+
+## What is a Reranker / CrossEncoder?
+
+A `Reranker` model provides a powerful semantic boost to the search quality of any keyword or vector search system without requiring any overhaul or replacement.
+
+## Using Rerankers with Trieve Vector Inference
+
+
+
+To use a reranker model with Trieve Vector Inference, you will need to update your `embedding_models.yaml` file:
+
+```yaml embedding_models.yaml
+...
+models:
+  ...
+  my-reranker-model:
+    replicas: 1
+    revision: main
+    modelName: BAAI/bge-reranker-large
+...
+```
+
+
+
+Update TVI to include your models
+
+```bash
+helm upgrade -i vector-inference \
+ oci://registry-1.docker.io/trieve/embeddings-helm \
+ -f embedding_models.yaml
+```
+
+
+
+```sh
+kubectl get ing
+```
+
+The output looks like this:
+
+```
+NAME                                               CLASS   HOSTS   ADDRESS                                                                  PORTS   AGE
+vector-inference-embedding-bge-reranker-ingress    alb     *       k8s-default-vectorin-b09efe8cf6-890425945.us-west-1.elb.amazonaws.com   80      77m
+```
+
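+You can then call the reranker's `/rerank` route directly; a minimal sketch using the `ADDRESS` from the output above:
+
+```sh
+export ENDPOINT="k8s-default-vectorin-b09efe8cf6-890425945.us-west-1.elb.amazonaws.com"
+
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"query": "What are some good electric cars", "texts": ["The Tesla Cybertruck is an all-electric truck", "The Mercedes CLR GTR is a racing car"]}' \
+  --url "http://$ENDPOINT/rerank"
+```
+
+See the [`/rerank` API reference](/vector-inference/reranker) for the full request and response schema.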
+
+
+
diff --git a/vector-inference/reranker.mdx b/vector-inference/reranker.mdx
new file mode 100644
index 0000000..023f723
--- /dev/null
+++ b/vector-inference/reranker.mdx
@@ -0,0 +1,140 @@
+---
+title: 'Get ranks'
+sidebarTitle: 'POST /rerank'
+description: 'Runs Reranker. Returns a 424 status code if the model is not a Reranker model'
+---
+
+
+
+```json Raw Json
+{
+ "query": "What are some good electric cars",
+ "texts": [
+ "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+ "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+ ],
+ "raw_scores": false,
+ "return_text": false,
+ "truncate": false,
+ "truncation_direction": "right"
+}
+```
+
+```sh curl
+curl -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "query": "What are some good electric cars",
+ "texts": [
+ "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+ "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+ ],
+ "raw_scores": false,
+ "return_text": false,
+ "truncate": false,
+ "truncation_direction": "right"
+ }' \
+ --url http://$ENDPOINT/rerank
+```
+
+```py python
+import requests
+
+endpoint = ""
+
+requests.post(f"{endpoint}/rerank", json={
+ "query": "What are some good electric cars",
+ "texts": [
+ "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ...",
+ "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+ ],
+ "raw_scores": False,
+ "return_text": False,
+ "truncate": False,
+ "truncation_direction": "right"
+});
+
+## or
+
+requests.post(f"{endpoint}/rerank", json={
+ "inputs": "test single input"
+});
+
+
+```
+
+
+
+
+```json 200 Ranks
+[
+ {
+ "index":1,
+ "score":0.15253653,
+ // if return_text = true
+ "text": "The Tesla Cybertruck is an all-electric, battery-powered light-duty truck unveiled by Tesla, Inc. ..."
+ },
+ {
+ "index":0,
+ "score":0.00498227
+ // if return_text = true
+ "text": "Here’s the information about the Mercedes CLR GTR: The Mercedes CLR GTR is a remarkable racing car ..."
+ }
+]
+```
+
+```json 413
+{
+ "error": "Batch size error",
+ "error_type": "validation"
+}
+```
+
+```json 422
+{
+ "error": "Tokenization error",
+ "error_type": "validation"
+}
+```
+
+```json 424
+{
+ "error": "Inference failed",
+ "error_type": "backend"
+}
+```
+
+```json 429
+{
+ "error": "Model is overloaded",
+ "error_type": "overloaded"
+}
+```
+
+
+
+
+
+ The query to rank the texts against
+
+
+
+ The texts to be ranked
+
+
+
+ Output the raw reranker score or the normalized score.
+ When `false`, the score is normalized between 0 and 1; otherwise the range is indeterminate
+
+
+
+ Return the text along with each rank
+
+
+
+Automatically truncate inputs that are longer than the maximum supported size
+
+
+
+
+
diff --git a/vector-inference/splade.mdx b/vector-inference/splade.mdx
new file mode 100644
index 0000000..a28855d
--- /dev/null
+++ b/vector-inference/splade.mdx
@@ -0,0 +1,63 @@
+---
+title: "Working with Splade v2"
+icon: magnifying-glass
+description: Learn how to use splade with TVI.
+mode: wide
+---
+
+## What is splade?
+
+`Splade` is similar to other inverted-index approaches like `bm25`. `Splade` adds neural term expansion, meaning that it is able to match on synonyms much better than traditional bm25.
+
+## Using Splade with Trieve Vector Inference
+
+
+
+To use splade with Trieve Vector Inference, you will need to configure both the `doc` and `query` models.
+
+The splade `document` model is the model you use to encode documents, while the `query` model is the one used to encode the query that you will be searching with.
+
+```yaml embedding_models.yaml
+models:
+  # ...
+  spladeDoc:
+    replicas: 1
+    modelName: naver/efficient-splade-VI-BT-large-doc
+    isSplade: true
+  spladeQuery:
+    replicas: 1
+    modelName: naver/efficient-splade-VI-BT-large-query
+    isSplade: true
+  # ...
+```
+
+
+
+Update TVI to include your models
+
+```bash
+helm upgrade -i vector-inference \
+ oci://registry-1.docker.io/trieve/embeddings-helm \
+ -f embedding_models.yaml
+```
+
+
+
+```sh
+kubectl get ing
+```
+
+
+
+ ```sh
+ ENDPOINT="k8s-default-vectorin...elb.amazonaws.com"
+
+ curl -X POST \
+ -H "Content-Type: application/json"\
+ -d '{"inputs": "test input"}' \
+ --url http://$ENDPOINT/embed_sparse
+ ```
+
+ For more information check out the [API reference](/vector-inference/embed_sparse) for sparse vectors
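+
+ Since the `doc` and `query` models get separate ingresses, a typical flow (hostnames are illustrative) is to encode documents with the doc endpoint at index time and encode the search query with the query endpoint at search time:
+
+ ```sh
+ DOC_ENDPOINT="k8s-default-vectorin-...-doc.elb.amazonaws.com"
+ QUERY_ENDPOINT="k8s-default-vectorin-...-query.elb.amazonaws.com"
+
+ # Encode a document for indexing
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{"inputs": "document text to index"}' \
+   --url http://$DOC_ENDPOINT/embed_sparse
+
+ # Encode a search query
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{"inputs": "what the user searched for"}' \
+   --url http://$QUERY_ENDPOINT/embed_sparse
+ ```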
+
+
diff --git a/vector-inference/troubleshooting.mdx b/vector-inference/troubleshooting.mdx
new file mode 100644
index 0000000..355b91b
--- /dev/null
+++ b/vector-inference/troubleshooting.mdx
@@ -0,0 +1,51 @@
+---
+title: Troubleshooting
+icon: 'triangle-exclamation'
+description: 'Common issues with self hosting'
+---
+
+There are a lot of moving parts in `eksctl`. Here’s a list of common issues we’ve seen customers run into:
+
+
+
+
+ This error happens when deleting the cluster and some pods in `kube-system` refuse to stop.
+ To fix this run the following command and the deletion process should be able to proceed.
+
+ ```sh
+ kubectl get pods -n kube-system -o name | xargs kubectl -n kube-system delete
+ ```
+
+
+
+ This happens when the cluster doesn't properly delete its load balancers. To fix this:
+
+
+
+ Run this to get the available load balancers
+ ```sh
+ kubectl get ingress
+ ```
+
+
+ The output should look like this
+ ```
+ NAME CLASS HOSTS ADDRESS PORTS AGE
+ vector-inference-embedding-bgem3-ingress alb * k8s-default-vectorin-25e84e25f0-1362792264.us-east-2.elb.amazonaws.com 80 3d19h
+ vector-inference-embedding-nomic-ingress alb * k8s-default-vectorin-eb664ce6e9-238019709.us-east-2.elb.amazonaws.com 80 2d20h
+ vector-inference-embedding-spladedoc-ingress alb * k8s-default-vectorin-8af81ad2bd-192706382.us-east-2.elb.amazonaws.com 80 3d19h
+ ```
+
+
+
+
+ Go to EC2 > Load Balancers ([link](https://us-west-1.console.aws.amazon.com/ec2/home?region=us-west-1#LoadBalancers:v=3;$case=tags:false%5C,client:false;$regex=tags:false%5C,client:false)) and delete the ALBs whose names match the ingress addresses listed above
+
+
+
+
+ The delete script should then be able to resume.
+
+
+
+