Inference of Meta's LLaMA model (and others) in pure C/C++.
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.
- Plain C/C++ implementation without any dependencies
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
This fork was created following these instructions, based on gcc 8.5 and nvcc 10.2. To use it, you will need the following software packages installed. The section "Install prerequisites" describes the process in detail. The installation of gcc 8.5 and cmake 3.27 in particular might take several hours.
- Nvidia CUDA Compiler nvcc 10.2 - nvcc --version
- GCC and CXX (g++) 8.5 - gcc --version
- cmake >= 3.14 - cmake --version
- nano, curl, libcurl4-openssl-dev, python3-pip and jtop
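The packages in the last bullet can be installed straight from the package manager; the commands below are one possible way to do it, assuming jtop is provided by the jetson-stats pip package:
sudo apt-get update
sudo apt-get install -y nano curl libcurl4-openssl-dev python3-pip
sudo pip3 install -U jetson-stats   # provides the jtop monitoring tool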
We need to add a few extra flags to the recommended first instruction, cmake -B build, otherwise several errors like Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions). would stop the compilation. There will be a few warning: constexpr if statements are a C++17 feature messages during the second instruction, but we can ignore them. Let's start with the first one:
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off
And 15 seconds later we're ready for the last step, the instruction that takes about 85 minutes to compile llama.cpp:
cmake --build build --config Release
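If the build runs out of memory on the board, one optional tweak (not part of the original instructions) is to cap the number of parallel compile jobs, at the cost of a longer build:
cmake --build build --config Release -j 2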
Now you can use the binaries from the build/bin folder. If you want to make them globally available, add this to your ~/.bashrc file:
export PATH="$PATH:$HOME/Llama.cpp/build/bin"
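To confirm the new PATH entry is picked up, reload the shell configuration and look up one of the binaries; the expected location assumes the repository really sits in $HOME/Llama.cpp:
source ~/.bashrc
which llama-server   # should print $HOME/Llama.cpp/build/bin/llama-server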
A lightweight, OpenAI API compatible, HTTP server for serving LLMs.
- Start a local HTTP server with default configuration on port 8080
llama-server -m model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
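As a quick smoke test of the OpenAI-compatible endpoint, a request can be sent with curl; the prompt below is only an example and assumes the server is running on port 8080:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}]}'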
- Support multiple users and parallel decoding
# up to 4 concurrent requests, each with 4096 max context
llama-server -m model.gguf -c 16384 -np 4
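A rough way to see the parallel slots in action is to fire several requests at once; the /completion endpoint and the prompts below are illustrative:
# two concurrent requests, decoded in parallel server slots
curl http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 16}' &
curl http://localhost:8080/completion -d '{"prompt": "Bonjour", "n_predict": 16}' &
wait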
- Enable speculative decoding
# the draft.gguf model should be a small variant of the target model.gguf
llama-server -m model.gguf -md draft.gguf
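A concrete pairing could look like the sketch below; the file names are hypothetical, and the draft model must share the target's vocabulary for speculative decoding to work:
# example file names only; any small draft sharing the target's tokenizer should do
llama-server -m Llama-3.1-8B-Instruct-Q4_K_M.gguf -md Llama-3.2-1B-Instruct-Q4_K_M.gguf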
- Serve an embedding model
# use the /embedding endpoint
llama-server -m model.gguf --embedding --pooling cls -ub 8192
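Once the embedding server is up, a request might look like the following; the input text is made up and the response format may differ between versions:
curl http://localhost:8080/embedding -H "Content-Type: application/json" \
  -d '{"content": "Hello world"}'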
- Serve a reranking model
# use the /reranking endpoint
llama-server -m model.gguf --reranking
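A reranking request pairs a query with candidate documents; the field names and texts below are illustrative:
curl http://localhost:8080/reranking -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?", "documents": ["Paris is the capital of France.", "Berlin is a large city."]}'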
- Constrain all outputs with a grammar
# custom grammar
llama-server -m model.gguf --grammar-file grammar.gbnf
# JSON
llama-server -m model.gguf --grammar-file grammars/json.gbnf
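As a sketch of what a custom grammar file can contain, the snippet below creates a minimal grammar.gbnf that restricts the model to a yes/no answer (this file is not shipped with the project):
cat > grammar.gbnf <<'EOF'
root ::= "yes" | "no"
EOF
llama-server -m model.gguf --grammar-file grammar.gbnf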