Nexa SDK - Local On-Device Inference Framework

nexa-sdk-demo.mp4

Nexa SDK - Local On-Device Inference Framework

On-Device Model Hub | Documentation | Discord | Blogs | X (Twitter)

Nexa SDK is a local on-device inference framework for ONNX and GGML models, supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. Installable via Python Package or Executable Installer.

Features

Device Support: CPU, GPU (CUDA, Metal, ROCm), iOS
Server: OpenAI-compatible API, JSON schema for function calling and streaming support
Local UI: Streamlit for interactive model deployment and testing

Latest News 🔥

Support SYCL backend for Intel GPU on Windows
Support GPU acceleration for FLUX and other Computer Vision models
Optimize the benchmark system for GGUF benchmark evaluation, now at least 50x faster than lm-eval-harness in GGUF benchmark with 8 workers: nexa eval <model_path> --tasks gpqa --num_workers 8
Support Nexa AI's own vision language model (0.9B parameters): nexa run omniVLM and audio language model (2.9B parameters): nexa run omniaudio
Support audio language model: nexa run qwen2audio, we are the first open-source toolkit to support audio language model with GGML tensor library.
Support Android Kotlin binding for local inference on Android devices.
Support iOS Swift binding for local inference on iOS mobile devices.
Support embedding model: nexa embed <model_path> <prompt>
Support pull and run supported Computer Vision models in GGUF format from HuggingFace or ModelScope: nexa run -hf <hf_model_id> -mt COMPUTER_VISION or nexa run -ms <ms_model_id> -mt COMPUTER_VISION
Support pull and run NLP models in GGUF format from HuggingFace or ModelScope: nexa run -hf <hf_model_id> -mt NLP or nexa run -ms <ms_model_id> -mt NLP

Welcome to submit your requests through issues, we ship weekly.

Install Option 1: Executable Installer

macOS Installer

Windows Installer

Linux Installer

curl -fsSL https://public-storage.nexa4ai.com/install.sh | sh

FAQ: cannot use executable with nexaai python package already installed

Try using nexa-exe instead:

nexa-exe <command>

Install Option 2: Python Package

We have released pre-built wheels for various Python versions, platforms, and backends for convenient installation on our index page.

CPU

pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cpu --extra-index-url https://pypi.org/simple --no-cache-dir

Apple GPU (Metal)

For the GPU version supporting Metal (macOS):

CMAKE_ARGS="-DGGML_METAL=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir

FAQ: cannot use Metal/GPU on M1

Try the following command:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
conda create -n nexasdk python=3.10
conda activate nexasdk
CMAKE_ARGS="-DGGML_METAL=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir

Nvidia GPU (CUDA)

To install with CUDA support, make sure you have CUDA Toolkit 12.0 or later installed.

For Linux:

CMAKE_ARGS="-DGGML_CUDA=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows PowerShell:

$env:CMAKE_ARGS="-DGGML_CUDA=ON"; pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Command Prompt:

set CMAKE_ARGS="-DGGML_CUDA=ON" & pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Git Bash:

CMAKE_ARGS="-DGGML_CUDA=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

FAQ: Building Issues for llava

If you encounter the following issue while building:

try the following command:

CMAKE_ARGS="-DCMAKE_CXX_FLAGS=-fopenmp" pip install nexaai

Intel GPU (SYCL)

For Windows:

Make sure you have the following installed:

Latest Intel GPU driver
Microsoft Visual Studio
Intel oneAPI
Ninja (SYCL on Windows only support Ninja build.)
Then install Nexa SDK:

.\scripts\windows-build-sycl.bat

AMD GPU (ROCm)

To install with ROCm support, make sure you have ROCm 6.2.1 or later installed.

For Linux:

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/rocm621 --extra-index-url https://pypi.org/simple --no-cache-dir

GPU (Vulkan)

To install with Vulkan support, make sure you have Vulkan SDK 1.3.261.1 or later installed.

For Windows PowerShell:

$env:CMAKE_ARGS="-DGGML_VULKAN=on"; pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Command Prompt:

set CMAKE_ARGS="-DGGML_VULKAN=on" & pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Git Bash:

CMAKE_ARGS="-DGGML_VULKAN=on" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir

Local Build

How to clone this repo

git clone --recursive https://github.com/NexaAI/nexa-sdk

If you forget to use --recursive, you can use below command to add submodule

git submodule update --init --recursive

Then you can build and install the package

pip install -e .

Differentiation

Below is our differentiation from other similar tools:

Feature	Nexa SDK	ollama	Optimum	LM Studio
GGML Support	✅	✅	❌	✅
ONNX Support	✅	❌	✅	❌
Text Generation	✅	✅	✅	✅
Image Generation	✅	❌	❌	❌
Vision-Language Models	✅	✅	✅	✅
Audio-Language Models	✅	❌	❌	❌
Text-to-Speech	✅	❌	✅	❌
Server Capability	✅	✅	✅	✅
User Interface	✅	❌	❌	✅
Executable Installation	✅	✅	❌	✅

Supported Models & Model Hub

Our on-device model hub offers all types of quantized models (text, image, audio, multimodal) with filters for RAM, file size, Tasks, etc. to help you easily explore models with UI. Explore on-device models at On-device Model Hub

Supported model examples (full list at Model Hub):

Model	Type	Format	Command
omniaudio	AudioLM	GGUF	`nexa run omniaudio`
qwen2audio	AudioLM	GGUF	`nexa run qwen2audio`
octopus-v2	Function Call	GGUF	`nexa run octopus-v2`
octo-net	Text	GGUF	`nexa run octo-net`
omniVLM	Multimodal	GGUF	`nexa run omniVLM`
nanollava	Multimodal	GGUF	`nexa run nanollava`
llava-phi3	Multimodal	GGUF	`nexa run llava-phi3`
llava-llama3	Multimodal	GGUF	`nexa run llava-llama3`
llava1.6-mistral	Multimodal	GGUF	`nexa run llava1.6-mistral`
llava1.6-vicuna	Multimodal	GGUF	`nexa run llava1.6-vicuna`
llama3.2	Text	GGUF	`nexa run llama3.2`
llama3-uncensored	Text	GGUF	`nexa run llama3-uncensored`
gemma2	Text	GGUF	`nexa run gemma2`
qwen2.5	Text	GGUF	`nexa run qwen2.5`
mathqwen	Text	GGUF	`nexa run mathqwen`
codeqwen	Text	GGUF	`nexa run codeqwen`
mistral	Text	GGUF/ONNX	`nexa run mistral`
deepseek-coder	Text	GGUF	`nexa run deepseek-coder`
DeepSeek-R1-Distill-Qwen-1.5B	Text	GGUF	`nexa run DeepSeek-R1-Distill-Qwen-1.5B:q4_K_M`
DeepSeek-R1-Distill-Llama-8B	Text	GGUF	`nexa run DeepSeek-R1-Distill-Llama-8B:q4_K_M`
phi3.5	Text	GGUF	`nexa run phi3.5`
openelm	Text	GGUF	`nexa run openelm`
stable-diffusion-v2-1	Image Generation	GGUF	`nexa run sd2-1`
stable-diffusion-3-medium	Image Generation	GGUF	`nexa run sd3`
FLUX.1-schnell	Image Generation	GGUF	`nexa run flux`
lcm-dreamshaper	Image Generation	GGUF/ONNX	`nexa run lcm-dreamshaper`
whisper-large-v3-turbo	Speech-to-Text	BIN	`nexa run faster-whisper-large-turbo`
whisper-tiny.en	Speech-to-Text	ONNX	`nexa run whisper-tiny.en`
mxbai-embed-large-v1	Embedding	GGUF	`nexa embed mxbai`
nomic-embed-text-v1.5	Embedding	GGUF	`nexa embed nomic`
all-MiniLM-L12-v2	Embedding	GGUF	`nexa embed all-MiniLM-L12-v2:fp16`
bark-small	Text-to-Speech	GGUF	`nexa run bark-small:fp16`
OuteTTS-0.1-350M	Text-to-Speech	GGUF	`nexa run OuteTTS-0.1-350M:q4_K_M`
OuteTTS-0.2-500M	Text-to-Speech	GGUF	`nexa run OuteTTS-0.2-500M:q4_K_M`

Run Models from 🤗 HuggingFace or 🤖 ModelScope

You can pull, convert (to .gguf), quantize and run llama.cpp supported text generation models from HF or MS with Nexa SDK.

Run .gguf File

Use nexa run -hf <hf-model-id> or nexa run -ms <ms-model-id> to run models with provided .gguf files:

nexa run -hf Qwen/Qwen2.5-Coder-7B-Instruct-GGUF

nexa run -ms Qwen/Qwen2.5-Coder-7B-Instruct-GGUF

Note: You will be prompted to select a single .gguf file. If your desired quantization version has multiple split files (like fp16-00001-of-00004), please use Nexa's conversion tool (see below) to convert and quantize the model locally.

Convert .safetensors Files

Install Nexa Python package, and install Nexa conversion tool with pip install "nexaai[convert]", then convert models from huggingface with nexa convert <hf-model-id>:

nexa convert HuggingFaceTB/SmolLM2-135M-Instruct

Or you can convert models from ModelScope with nexa convert -ms <ms-model-id>:

nexa convert -ms Qwen/Qwen2.5-7B-Instruct

Note: Check our leaderboard for performance benchmarks of different quantized versions of mainstream language models and HuggingFace docs to learn about quantization options.

📋 You can view downloaded and converted models with nexa list

Documentation

Note

If you want to use ONNX model, just replace pip install nexaai with pip install "nexaai[onnx]" in provided commands.
If you want to run benchmark evaluation, just replace pip install nexaai with pip install "nexaai[eval]" in provided commands.
If you want to convert and quantize huggingface models to GGUF models, just replace pip install nexaai with pip install "nexaai[convert]" in provided commands.
If you want to use TTS model, just replace pip install nexaai with pip install nexaai[tts] in provided commands.
For Chinese developers, we recommend you to use Tsinghua Open Source Mirror as extra index url, just replace --extra-index-url https://pypi.org/simple with --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple in provided commands.

CLI Reference

Here's a brief overview of the main CLI commands:

nexa run: Run inference for various tasks using GGUF models.
nexa onnx: Run inference for various tasks using ONNX models.
nexa convert: Convert and quantize huggingface models to GGUF models.
nexa server: Run the Nexa AI Text Generation Service.
nexa eval: Run the Nexa AI Evaluation Tasks.
nexa pull: Pull a model from official or hub.
nexa remove: Remove a model from local machine.
nexa clean: Clean up all model files.
nexa list: List all models in the local machine.
nexa login: Login to Nexa API.
nexa whoami: Show current user information.
nexa logout: Logout from Nexa API.

For detailed information on CLI commands and usage, please refer to the CLI Reference document.

Start Local Server

To start a local server using models on your local computer, you can use the nexa server command. For detailed information on server setup, API endpoints, and usage examples, please refer to the Server Reference document.

Benchmark

Install Nexa Python package, and install Nexa benchmark tool with pip install "nexaai[eval]", then evaluate the benchmark of a model with the following command:

nexa eval <model_path> --tasks <task> --num_workers <num_workers>

Swift Package

Swift SDK: Provides a Swifty API, allowing Swift developers to easily integrate and use llama.cpp models in their projects.

More Docs

Acknowledgements

We would like to thank the following projects:

Name		Name	Last commit message	Last commit date
Latest commit History 1,438 Commits
.github		.github
android		android
assets		assets
dependency		dependency
docs		docs
examples		examples
nexa		nexa
scripts		scripts
swift		swift
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
CLI.md		CLI.md
CMakeLists.txt		CMakeLists.txt
Dockerfile		Dockerfile
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Package.swift		Package.swift
README.md		README.md
SERVER.md		SERVER.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Nexa SDK - Local On-Device Inference Framework

Features

Latest News 🔥

Install Option 1: Executable Installer

Install Option 2: Python Package

Differentiation

Supported Models & Model Hub

Run Models from 🤗 HuggingFace or 🤖 ModelScope

Run .gguf File

Convert .safetensors Files

Documentation

CLI Reference

Start Local Server

Benchmark

Swift Package

Acknowledgements

About

Releases 97

Packages

Contributors 29

Languages

License

NexaAI/nexa-sdk

Folders and files

Latest commit

History

Repository files navigation

Nexa SDK - Local On-Device Inference Framework

Features

Latest News 🔥

Install Option 1: Executable Installer

Install Option 2: Python Package

Differentiation

Supported Models & Model Hub

Run Models from 🤗 HuggingFace or 🤖 ModelScope

Run .gguf File

Convert .safetensors Files

Documentation

CLI Reference

Start Local Server

Benchmark

Swift Package

Acknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 97

Packages 0

Contributors 29

Languages

Packages