Nexa SDK is a comprehensive toolkit supporting ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), and text-to-speech (TTS) capabilities. Additionally, it offers an OpenAI-compatible API server with JSON schema mode for function calling and streaming support, and a user-friendly Streamlit UI.
Model Support:
- ONNX & GGML models
- Conversion Engine
- Inference Engine:
  - Text Generation
  - Image Generation
  - Vision-Language Models (VLM)
  - Text-to-Speech (TTS)

Detailed API documentation is available here.

Server:
- OpenAI-compatible API
- JSON schema mode for function calling
- Streaming support
- Streamlit UI for interactive model deployment and testing
Here is how Nexa SDK compares with other similar tools:
Feature | Nexa SDK | ollama | Optimum | LM Studio |
---|---|---|---|---|
GGML Support | ✅ | ✅ | ❌ | ✅ |
ONNX Support | ✅ | ❌ | ✅ | ❌ |
Text Generation | ✅ | ✅ | ✅ | ✅ |
Image Generation | ✅ | ❌ | ❌ | ❌ |
Vision-Language Models | ✅ | ✅ | ✅ | ✅ |
Text-to-Speech | ✅ | ❌ | ✅ | ❌ |
Server Capability | ✅ | ✅ | ✅ | ✅ |
User Interface | ✅ | ❌ | ❌ | ✅ |
We provide pre-built wheels for various Python versions, platforms, and backends on our index page for convenient installation:
pip install nexaai --index-url https://nexaai.github.io/nexa-sdk/whl/cpu --extra-index-url https://pypi.org/simple
For the GPU version supporting Metal (macOS):
CMAKE_ARGS="-DGGML_METAL=ON -DSD_METAL=ON" pip install nexaai --index-url https://nexaai.github.io/nexa-sdk/whl/metal --extra-index-url https://pypi.org/simple
For the GPU version supporting CUDA (Linux/Windows):
CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --index-url https://nexaai.github.io/nexa-sdk/whl/cu124 --extra-index-url https://pypi.org/simple
Note
The CUDA wheels are built with CUDA 12.4, but should be compatible with all CUDA 12.x versions.
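After installation, a quick way to confirm everything is in place is to check the installed package and print the CLI version (the -V flag is documented in the CLI reference below):

```bash
# Confirm the nexaai package is installed
pip show nexaai

# Print the Nexa SDK version via the CLI
nexa -V
```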
FAQ: Building Issues for llava
If you encounter issues while building llava, try the following command:
CMAKE_ARGS="-DCMAKE_CXX_FLAGS=-fopenmp" pip install nexaai
Note: The Docker image does not currently support GPU acceleration.

To use Nexa SDK in Docker, first pull the image:
docker pull nexa4ai/sdk:latest
Then replace the following placeholders with your model directory, Nexa command, and relative model path:
docker run -v <your_model_dir>:/model -it nexa4ai/sdk:latest [nexa_command] [your_model_relative_path]
Example:
docker run -v /home/ubuntu/.cache/nexa/hub/official:/model -it nexa4ai/sdk:latest nexa gen-text /model/Phi-3-mini-128k-instruct/q4_0.gguf
This will create an interactive text-generation session.
Model | Type | Format | Command |
---|---|---|---|
octopus-v2 | NLP | GGUF | nexa run octopus-v2 |
octopus-v4 | NLP | GGUF | nexa run octopus-v4 |
tinyllama | NLP | GGUF | nexa run tinyllama |
llama2 | NLP | GGUF/ONNX | nexa run llama2 |
llama3 | NLP | GGUF/ONNX | nexa run llama3 |
llama3.1 | NLP | GGUF/ONNX | nexa run llama3.1 |
gemma | NLP | GGUF/ONNX | nexa run gemma |
gemma2 | NLP | GGUF | nexa run gemma2 |
qwen1.5 | NLP | GGUF | nexa run qwen1.5 |
qwen2 | NLP | GGUF/ONNX | nexa run qwen2 |
mistral | NLP | GGUF/ONNX | nexa run mistral |
codegemma | NLP | GGUF | nexa run codegemma |
codellama | NLP | GGUF | nexa run codellama |
codeqwen | NLP | GGUF | nexa run codeqwen |
deepseek-coder | NLP | GGUF | nexa run deepseek-coder |
dolphin-mistral | NLP | GGUF | nexa run dolphin-mistral |
phi2 | NLP | GGUF | nexa run phi2 |
phi3 | NLP | GGUF/ONNX | nexa run phi3 |
llama2-uncensored | NLP | GGUF | nexa run llama2-uncensored |
llama3-uncensored | NLP | GGUF | nexa run llama3-uncensored |
llama2-function-calling | NLP | GGUF | nexa run llama2-function-calling |
nanollava | Multimodal | GGUF | nexa run nanollava |
llava-phi3 | Multimodal | GGUF | nexa run llava-phi3 |
llava-llama3 | Multimodal | GGUF | nexa run llava-llama3 |
llava1.6-mistral | Multimodal | GGUF | nexa run llava1.6-mistral |
llava1.6-vicuna | Multimodal | GGUF | nexa run llava1.6-vicuna |
stable-diffusion-v1-4 | Computer Vision | GGUF | nexa run sd1-4 |
stable-diffusion-v1-5 | Computer Vision | GGUF/ONNX | nexa run sd1-5 |
lcm-dreamshaper | Computer Vision | GGUF/ONNX | nexa run lcm-dreamshaper |
hassaku-lcm | Computer Vision | GGUF | nexa run hassaku-lcm |
anything-lcm | Computer Vision | GGUF | nexa run anything-lcm |
faster-whisper-tiny | Audio | BIN | nexa run faster-whisper-tiny |
faster-whisper-small | Audio | BIN | nexa run faster-whisper-small |
faster-whisper-medium | Audio | BIN | nexa run faster-whisper-medium |
faster-whisper-base | Audio | BIN | nexa run faster-whisper-base |
faster-whisper-large | Audio | BIN | nexa run faster-whisper-large |
usage: nexa [-h] [-V] {run,onnx,server,pull,remove,clean,list,login,whoami,logout} ...
Nexa CLI tool for handling various model operations.
positional arguments:
{run,onnx,server,pull,remove,clean,list,login,whoami,logout}
sub-command help
run Run inference for various tasks using GGUF models.
onnx Run inference for various tasks using ONNX models.
server Run the Nexa AI Text Generation Service
pull Pull a model from official or hub.
remove Remove a model from local machine.
clean Clean up all model files.
list List all models in the local machine.
login Login to Nexa API.
whoami Show current user information.
logout Logout from Nexa API.
options:
-h, --help show this help message and exit
-V, --version Show the version of the Nexa SDK.
List all models on your local computer.
nexa list
Download a model file to your local computer from Nexa Model Hub.
nexa pull MODEL_PATH
usage: nexa pull [-h] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
nexa pull llama2
Remove a model from your local computer.
nexa remove MODEL_PATH
usage: nexa remove [-h] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
nexa remove llama2
Remove all downloaded models on your local computer.
nexa clean
Run a model on your local computer. If the model file is not yet downloaded, it will be automatically fetched first.
By default, nexa will run GGUF models. To run ONNX models, use nexa onnx MODEL_PATH.
To run a text generation model:
nexa run MODEL_PATH
usage: nexa run [-h] [-t TEMPERATURE] [-m MAX_NEW_TOKENS] [-k TOP_K] [-p TOP_P] [-sw [STOP_WORDS ...]] [-pf] [-st] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
-pf, --profiling Enable profiling logs for the inference process
-st, --streamlit Run the inference in Streamlit UI
Text generation options:
-t, --temperature TEMPERATURE
Temperature for sampling
-m, --max_new_tokens MAX_NEW_TOKENS
Maximum number of new tokens to generate
-k, --top_k TOP_K Top-k sampling parameter
-p, --top_p TOP_P Top-p sampling parameter
-sw, --stop_words [STOP_WORDS ...]
List of stop words for early stopping
nexa run llama2
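For reference, here is an illustrative invocation combining several of the text generation options above; the parameter values are arbitrary examples, not recommended defaults:

```bash
# Run llama2 with a lower temperature, a 256-token limit, and custom stop words
nexa run llama2 -t 0.7 -m 256 -k 40 -p 0.9 -sw "</s>" "User:"
```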
To run an image generation model:
nexa run MODEL_PATH
usage: nexa run [-h] [-i2i] [-ns NUM_INFERENCE_STEPS] [-np NUM_IMAGES_PER_PROMPT] [-H HEIGHT] [-W WIDTH] [-g GUIDANCE_SCALE] [-o OUTPUT] [-s RANDOM_SEED] [-st] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
-st, --streamlit Run the inference in Streamlit UI
Image generation options:
-i2i, --img2img Whether to run image-to-image generation
-ns, --num_inference_steps NUM_INFERENCE_STEPS
Number of inference steps
-np, --num_images_per_prompt NUM_IMAGES_PER_PROMPT
Number of images to generate per prompt
-H, --height HEIGHT Height of the output image
-W, --width WIDTH Width of the output image
-g, --guidance_scale GUIDANCE_SCALE
Guidance scale for diffusion
-o, --output OUTPUT Output path for the generated image
-s, --random_seed RANDOM_SEED
Random seed for image generation
--lora_dir LORA_DIR Path to directory containing LoRA files
--wtype WTYPE Weight type (f32, f16, q4_0, q4_1, q5_0, q5_1, q8_0)
--control_net_path CONTROL_NET_PATH
Path to control net model
--control_image_path CONTROL_IMAGE_PATH
Path to image condition for Control Net
--control_strength CONTROL_STRENGTH
Strength to apply Control Net
nexa run sd1-4
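Similarly, the image generation options above can be combined as follows; the values are illustrative only:

```bash
# Generate a 512x512 image with 20 diffusion steps, a fixed seed, and a custom output path
nexa run sd1-4 -ns 20 -H 512 -W 512 -g 7.5 -s 42 -o generated.png
```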
To run a vision-language model (VLM):
nexa run MODEL_PATH
usage: nexa run [-h] [-t TEMPERATURE] [-m MAX_NEW_TOKENS] [-k TOP_K] [-p TOP_P] [-sw [STOP_WORDS ...]] [-pf] [-st] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
-pf, --profiling Enable profiling logs for the inference process
-st, --streamlit Run the inference in Streamlit UI
VLM generation options:
-t, --temperature TEMPERATURE
Temperature for sampling
-m, --max_new_tokens MAX_NEW_TOKENS
Maximum number of new tokens to generate
-k, --top_k TOP_K Top-k sampling parameter
-p, --top_p TOP_P Top-p sampling parameter
-sw, --stop_words [STOP_WORDS ...]
List of stop words for early stopping
nexa run nanollava
To run an audio model (automatic speech recognition):
nexa run MODEL_PATH
usage: nexa run [-h] [-o OUTPUT_DIR] [-b BEAM_SIZE] [-l LANGUAGE] [--task TASK] [-t TEMPERATURE] [-c COMPUTE_TYPE] [-st] model_path
positional arguments:
model_path Path or identifier for the model in Nexa Model Hub
options:
-h, --help show this help message and exit
-st, --streamlit Run the inference in Streamlit UI
Automatic Speech Recognition options:
-b, --beam_size BEAM_SIZE
Beam size to use for transcription
-l, --language LANGUAGE
The language spoken in the audio. It should be a language code such as 'en' or 'fr'.
--task TASK Task to execute (transcribe or translate)
-c, --compute_type COMPUTE_TYPE
Type to use for computation (e.g., float16, int8, int8_float16)
nexa run faster-whisper-tiny
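As an illustration, the speech recognition options above can be combined like this (values are examples only):

```bash
# Transcribe English audio with beam size 5 using float16 computation
nexa run faster-whisper-tiny -b 5 -l en --task transcribe -c float16
```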
Start a local server using models on your local computer.
nexa server MODEL_PATH
usage: nexa server [-h] [--host HOST] [--port PORT] [--reload] model_path
positional arguments:
model_path Path or identifier for the model in S3
options:
-h, --help show this help message and exit
--host HOST Host to bind the server to
--port PORT Port to bind the server to
--reload Enable automatic reloading on code changes
nexa server llama2
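For example, to bind the server to all interfaces on a specific port (the port value is arbitrary and is reused by the endpoint examples below):

```bash
# Serve llama2 over the OpenAI-compatible API on port 8000
nexa server llama2 --host 0.0.0.0 --port 8000
```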
For model_path in nexa commands, it's better to follow the standard format to ensure correct model loading and execution. The standard format for model_path is:
- [user_name]/[repo_name]:[tag_name] (user's model)
- [repo_name]:[tag_name] (official model)

For example:
gemma-2b:q4_0
Meta-Llama-3-8B-Instruct:onnx-cpu-int8
alanzhuly/Qwen2-1B-Instruct:q4_0
You can start a local server using models on your local computer with the nexa server command. Here's the usage syntax:
usage: nexa server [-h] [--host HOST] [--port PORT] [--reload] model_path
- --host: Host to bind the server to
- --port: Port to bind the server to
- --reload: Enable automatic reloading on code changes
nexa server gemma
nexa server llama2-function-calling
nexa server sd1-5
nexa server faster-whisper-large
By default, nexa server will run GGUF models. To run ONNX models, simply add onnx after nexa server.
1. Text Generation: /v1/completions
Generates text based on a single prompt.
Example request body:
{
"prompt": "Tell me a story",
"temperature": 1,
"max_new_tokens": 128,
"top_k": 50,
"top_p": 1,
"stop_words": ["string"]
}
Example response:
{
"result": "Once upon a time, in a small village nestled among rolling hills..."
}
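Assuming the server was started locally on port 8000 (adjust the host and port to match your nexa server invocation), the endpoint can be called with curl:

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Tell me a story",
        "temperature": 1,
        "max_new_tokens": 128,
        "top_k": 50,
        "top_p": 1,
        "stop_words": []
      }'
```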
2. Chat Completions: /v1/chat/completions
Handles chat completions with support for conversation history.
Example request body:
{
"messages": [
{
"role": "user",
"content": "Tell me a story"
}
],
"max_tokens": 128,
"temperature": 0.1,
"stream": false,
"stop_words": []
}
Example response:
{
"id": "f83502df-7f5a-4825-a922-f5cece4081de",
"object": "chat.completion",
"created": 1723441724.914671,
"choices": [
{
"message": {
"role": "assistant",
"content": "In the heart of a mystical forest..."
}
}
]
}
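A minimal curl sketch for this endpoint, again assuming a local server on port 8000:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "max_tokens": 128,
        "temperature": 0.1,
        "stream": false,
        "stop_words": []
      }'
```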
3. Function Calling: /v1/function-calling
Calls the most appropriate function based on the user's prompt.
Example request body:
{
"messages": [
{
"role": "user",
"content": "Extract Jason is 25 years old"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "UserDetail",
"parameters": {
"properties": {
"name": {
"description": "The user's name",
"type": "string"
},
"age": {
"description": "The user's age",
"type": "integer"
}
},
"required": ["name", "age"],
"type": "object"
}
}
}
],
"tool_choice": "auto"
}
Format for defining a function (tool):
{
"type": "function",
"function": {
"name": "function_name",
"description": "function_description",
"parameters": {
"type": "object",
"properties": {
"property_name": {
"type": "string | number | boolean | object | array",
"description": "string"
}
},
"required": ["array_of_required_property_names"]
}
}
}
Example response:
{
"id": "chatcmpl-7a9b0dfb-878f-4f75-8dc7-24177081c1d0",
"object": "chat.completion",
"created": 1724186442,
"model": "/home/ubuntu/.cache/nexa/hub/official/Llama2-7b-function-calling/q3_K_M.gguf",
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"logprobs": null,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call__0_UserDetail_cmpl-8d5cf645-7f35-4af2-a554-2ccea1a67bdd",
"type": "function",
"function": {
"name": "UserDetail",
"arguments": "{ \"name\": \"Jason\", \"age\": 25 }"
}
}
],
"function_call": {
"name": "",
"arguments": "{ \"name\": \"Jason\", \"age\": 25 }"
}
}
}
],
"usage": {
"completion_tokens": 15,
"prompt_tokens": 316,
"total_tokens": 331
}
}
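The request above can be sent the same way; the sketch below assumes a local server on port 8000 and mirrors the UserDetail tool definition from the example request (with the property descriptions reworded slightly):

```bash
curl -X POST http://localhost:8000/v1/function-calling \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Extract Jason is 25 years old"}
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "UserDetail",
              "parameters": {
                "type": "object",
                "properties": {
                  "name": {"type": "string", "description": "Name of the user"},
                  "age": {"type": "integer", "description": "Age of the user"}
                },
                "required": ["name", "age"]
              }
            }
          }
        ],
        "tool_choice": "auto"
      }'
```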
4. Text-to-Image: /v1/txt2img
Generates images based on a single prompt.
Example request body:
{
"prompt": "A girl, standing in a field of flowers, vivid",
"image_path": "",
"cfg_scale": 7,
"width": 256,
"height": 256,
"sample_steps": 20,
"seed": 0,
"negative_prompt": ""
}
Example response:
{
"created": 1724186615.5426757,
"data": [
{
"base64": "base64_of_generated_image",
"url": "path/to/generated_image"
}
]
}
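As with the text endpoints, /v1/txt2img can be exercised with curl against a local server (port 8000 assumed); the response contains the generated image both as base64 data and as a file path:

```bash
curl -X POST http://localhost:8000/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "A girl, standing in a field of flowers, vivid",
        "image_path": "",
        "cfg_scale": 7,
        "width": 256,
        "height": 256,
        "sample_steps": 20,
        "seed": 0,
        "negative_prompt": ""
      }'
```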
5. Image-to-Image: /v1/img2img
Modifies existing images based on a single prompt.
Example request body:
{
"prompt": "A girl, standing in a field of flowers, vivid",
"image_path": "path/to/image",
"cfg_scale": 7,
"width": 256,
"height": 256,
"sample_steps": 20,
"seed": 0,
"negative_prompt": ""
}
Example response:
{
"created": 1724186615.5426757,
"data": [
{
"base64": "base64_of_generated_image",
"url": "path/to/generated_image"
}
]
}
6. Audio Transcriptions: /v1/audio/transcriptions
Transcribes audio files to text.
Parameters:
- beam_size (integer): Beam size for transcription (default: 5)
- language (string): Language code (e.g., 'en', 'fr')
- temperature (number): Temperature for sampling (default: 0)
Request (multipart/form-data):
{
"file" (form-data): The audio file to transcribe (required)
}
Example response:
{
"text": " And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
}
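Because this endpoint takes multipart form data rather than JSON, the audio file is uploaded with -F. The file path below is a placeholder, a local server on port 8000 is assumed, and the optional parameters listed above are omitted:

```bash
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@/path/to/audio.wav"
```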
7. Audio Translations: /v1/audio/translations
Translates audio files to text in English.
Parameters:
- beam_size (integer): Beam size for transcription (default: 5)
- temperature (number): Temperature for sampling (default: 0)
Request (multipart/form-data):
{
"file" (form-data): The audio file to transcribe (required)
}
Example response:
{
"text": " Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday"
}
We would like to thank the following projects: