ialacol (pronounced "localai") is an open-source project that provides a boring, lightweight, self-hosted, private, and commercially usable LLM streaming service. It is built on top of ctransformers.
This project is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.
See Recipes below for deployment instructions.
- LLaMa 2 variants
- OpenLLaMA variants
- StarCoder variants
- WizardCoder
- StarChat variants
- MPT-7B
- MPT-30B
- Falcon
And all LLMs supported by ctransformers.
- Compatibility with OpenAI APIs, allowing you to use any frameworks that are built on top of OpenAI APIs such as langchain.
- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
- Streaming first, for a better UX.
- Optional CUDA acceleration.
To quickly get started with ialacol, follow the steps below:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol
By default, it deploys Meta's Llama 2 Chat model quantized by TheBloke.
Port-forward
kubectl port-forward svc/llama-2-7b-chat 8000:8000
Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
http://localhost:8000/v1/chat/completions
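The same request can be made from Python with only the standard library. A minimal sketch, assuming the port-forward above is active (the request object is built up front so it can be inspected without a live server):

```python
import json
import urllib.request

# Build the same chat-completion payload as the curl example above.
payload = {
    "messages": [{"role": "user", "content": "How are you?"}],
    "model": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To send it (requires the kubectl port-forward from the previous step):
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)
#       print(reply["choices"][0]["message"]["content"])
```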
Alternatively, use OpenAI's client library (see more examples in the examples/openai folder):
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin -g user "Hello world!"
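With "stream": true, an OpenAI-compatible server replies with server-sent events: each chunk arrives as a data: {...} line, terminated by a data: [DONE] sentinel. A sketch of a client-side parser for that format (the sample lines below are illustrative, not a captured response):

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative chunks in the shape OpenAI-compatible servers emit:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_content(sample)))  # -> Hello
```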
To enable GPU/CUDA acceleration, you need to use the container image built for GPU and add the GPU_LAYERS environment variable. The value of GPU_LAYERS is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.
deployment.image=ghcr.io/chenhunghan/ialacol-cuda11:latest selects the CUDA 11 image.
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.
For example
helm install llama2-7b-chat-cuda11 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda11.yaml
This deploys the Llama 2 7B Chat model with 40 layers offloaded to the GPU; inference is accelerated by CUDA 11.
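The CUDA-specific settings boil down to the two values described above; a minimal values override might look like this (key names taken from the chart parameters mentioned earlier, with the 40-layer count from this example):

```yaml
deployment:
  image: ghcr.io/chenhunghan/ialacol-cuda11:latest
  env:
    GPU_LAYERS: 40
```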
LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", making the LLM more "creative"; top_p and top_k also contribute to the "randomness".
If you want the LLM to be creative:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
If you want the LLM to be more consistent and generate the same result for the same input:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
http://localhost:8000/v1/chat/completions
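The two parameter sets above can be wrapped in a small helper. A sketch, with the preset values copied from the curl examples (which pass the numbers as strings, so the same form is kept here):

```python
def sampling_params(creative: bool) -> dict:
    """Return the sampling presets used in the curl examples above."""
    if creative:
        # High temperature, no nucleus/top-k truncation: more randomness.
        return {"temperature": "2", "top_p": "1.0", "top_k": "0"}
    # Low temperature with tight top_p: near-deterministic output.
    return {"temperature": "0.1", "top_p": "0.1", "top_k": "40"}

payload = {
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "model": "llama-2-7b-chat.ggmlv3.q4_0.bin",
    "stream": False,
    **sampling_params(creative=False),
}
```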
- Support the starcoder model type via ctransformers
- Mimic the rest of the OpenAI API, including GET /models and POST /completions
- GPU acceleration (CUDA/METAL)
- Support POST /embeddings backed by huggingface Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
- Support the Apache-2.0 fastchat-t5-3b model
- Support more Apache-2.0 models such as codet5p and others listed here
Deploy Meta's Llama 2 Chat model quantized by TheBloke.
7B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml
13B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml
70B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml
Deploy OpenLLaMA 7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml
Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml
Deploy MosaicML's MPT-7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml
Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml
Deploy Uncensored Falcon 7B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml
Deploy Uncensored Falcon 40B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml
Deploy the starchat-beta model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml
Deploy the WizardCoder model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml
Deploy the lightweight pythia-70m model with only 70 million parameters (~40MB), quantized by rustformers.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml
Deploy the RedPajama 3B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml
Deploy the stableLM 7B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt