ialacol (pronounced "localai") is an open-source project that provides a boring, lightweight, self-hosted, private, and commercially usable LLM streaming service. It is built on top of ctransformers.
This project is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.
See Recipes below for deployment instructions.
- LLaMa 2 variants
- OpenLLaMA variants
- StarCoder variants
- WizardCoder
- StarChat variants
- MPT-7B
- MPT-30B
- Falcon
And all LLMs supported by ctransformers.
- Compatibility with OpenAI APIs, allowing you to use any frameworks that are built on top of OpenAI APIs such as langchain.
- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
- Streaming first, for a better UX.
- Optional CUDA acceleration.
To quickly get started with ialacol, follow the steps below:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol
By default, it deploys Meta's Llama 2 Chat model quantized by TheBloke.
Port-forward
kubectl port-forward svc/llama-2-7b-chat 8000:8000
Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
http://localhost:8000/v1/chat/completions
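The same request can be made from Python with only the standard library. A minimal sketch, assuming the port-forward above is active (the request object is built up front so it can be inspected without a live server):

```python
import json
import urllib.request

# Build the same chat-completion payload as the curl example above.
payload = {
    "messages": [{"role": "user", "content": "How are you?"}],
    "model": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To send it (requires the kubectl port-forward from the previous step):
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)
#       print(reply["choices"][0]["message"]["content"])
```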
Alternatively, use OpenAI's client library (see more examples in the examples/openai folder):
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin -g user "Hello world!"
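With "stream": true, an OpenAI-compatible server replies with server-sent events: each chunk arrives as a data: {...} line, terminated by a data: [DONE] sentinel. A sketch of a client-side parser for that format (the sample lines below are illustrative, not a captured response):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative chunks in the shape OpenAI-compatible servers emit:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_content(sample)))  # -> Hello
```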
To enable GPU/CUDA acceleration, you need to use the container image built for GPU and add the GPU_LAYERS environment variable. The value of GPU_LAYERS is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.
deployment.image=ghcr.io/chenhunghan/ialacol-cuda11:latest selects the CUDA 11 image.
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.
For example
helm install llama2-7b-chat-cuda11 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda11.yaml
This deploys the Llama 2 7B Chat model with 40 layers offloaded to the GPU; inference is accelerated by CUDA 11.
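The CUDA-specific settings boil down to the two values described above; a minimal values override might look like this (key names taken from the chart parameters mentioned earlier, with the 40-layer count from this example):

```yaml
deployment:
  image: ghcr.io/chenhunghan/ialacol-cuda11:latest
  env:
    GPU_LAYERS: 40
```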
LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", making the LLM more "creative"; top_p and top_k also contribute to the "randomness".
If you want the LLM to be creative:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
If you want the LLM to be more consistent and generate the same result for the same input:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
http://localhost:8000/v1/chat/completions
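The two parameter sets above can be wrapped in a small helper. A sketch, with the preset values copied from the curl examples (which pass the numbers as strings, so the same form is kept here):

```python
def sampling_params(creative: bool) -> dict:
    """Return the sampling presets used in the curl examples above."""
    if creative:
        # High temperature, no nucleus/top-k truncation: more randomness.
        return {"temperature": "2", "top_p": "1.0", "top_k": "0"}
    # Low temperature with tight top_p: near-deterministic output.
    return {"temperature": "0.1", "top_p": "0.1", "top_k": "40"}

payload = {
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "model": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    "stream": False,
    **sampling_params(creative=False),
}
```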
- Support the starcoder model type via ctransformers
- Mimic the rest of the OpenAI API, including GET /models and POST /completions
- GPU acceleration (CUDA/METAL)
- Support POST /embeddings backed by huggingface Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
- Support the Apache-2.0 fastchat-t5-3b model
- Support more Apache-2.0 models such as codet5p and others listed here
Deploy Meta's Llama 2 Chat model quantized by TheBloke.
7B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml
13B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml
70B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml
Deploy OpenLLaMA 7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml
Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml
Deploy MosaicML's MPT-7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml
Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml
Deploy Uncensored Falcon 7B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml
Deploy Uncensored Falcon 40B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml
Deploy the starchat-beta model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml
Deploy the WizardCoder model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml
Deploy the lightweight pythia-70m model with only 70 million parameters (~40MB), quantized by rustformers.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml
Deploy the RedPajama 3B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml
Deploy the stableLM 7B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt