Llama.cpp


Inference of Meta's LLaMA model (and others) in pure C/C++. This fork supports the Jetson Nano 4GB board (JetPack 4.6.1, CUDA 10.2).

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.

  • Plain C/C++ implementation without any dependencies
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the example below)
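
For example, the hybrid inference mentioned above lets a model that does not fit in the Nano's 4 GB of shared memory run with only some of its layers offloaded to the GPU. A minimal sketch; the layer count and model file name are placeholders to adjust for your model:

# offload 16 layers to the GPU and keep the rest on the CPU
llama-cli -m model.gguf -ngl 16 -p "Hello from the Jetson Nano"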

Building the fork

This fork was created following this instruction, using gcc 8.5 and nvcc 10.2. To build it, you will need the software packages listed below; the "Install prerequisites" section of that instruction describes the process in detail. Of these, installing gcc 8.5 and cmake 3.27 can take several hours.

  • Nvidia CUDA Compiler nvcc 10.2 - nvcc --version
  • GCC and CXX (g++) 8.5 - gcc --version
  • cmake >= 3.14 - cmake --version
  • nano, curl, libcurl4-openssl-dev, python3-pip, and jtop (see the install sketch below)
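
The smaller prerequisites can be installed from JetPack's Ubuntu repositories; a minimal sketch, assuming jtop is provided by the jetson-stats pip package (gcc 8.5 and cmake are built from source as described in the linked instruction):

sudo apt-get update
sudo apt-get install -y nano curl libcurl4-openssl-dev python3-pip
sudo -H pip3 install -U jetson-stats   # provides the jtop monitoring tool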

We need to add a few extra flags to the recommended first command, cmake -B build; otherwise the build stops with several errors like Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler extensions). The second command will print a few warning: constexpr if statements are a C++17 feature messages, but they can be ignored. Let's start with the first command:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=14 -DCMAKE_CUDA_STANDARD_REQUIRED=true -DGGML_CPU_ARM_ARCH=armv8-a -DGGML_NATIVE=off

The configure step takes about 15 seconds. Then we are ready for the last step, the command that needs about 85 minutes to compile llama.cpp:

cmake --build build --config Release
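
On a 4 GB board the build can exhaust memory if too many compiler jobs run at once; capping the job count is an optional tweak that is not part of the original instruction:

# limit the build to two parallel jobs
cmake --build build --config Release -j 2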

Now you can use the binaries from the build/bin folder. If you want to make them globally available, add this to your ~/.bashrc file:

export PATH="$PATH:$HOME/Llama.cpp/build/bin"
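
After reloading the shell configuration, the tools are reachable from any directory (the path above assumes the repository was cloned to ~/Llama.cpp):

source ~/.bashrc
llama-server --help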

llama-server

llama-server is a lightweight, OpenAI-API-compatible HTTP server for serving LLMs.

  • Start a local HTTP server with default configuration on port 8080
    llama-server -m model.gguf --port 8080
    
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
  • Support multiple users and parallel decoding
    # up to 4 concurrent requests, each with 4096 max context
    llama-server -m model.gguf -c 16384 -np 4
  • Enable speculative decoding
    # the draft.gguf model should be a small variant of the target model.gguf
    llama-server -m model.gguf -md draft.gguf
  • Serve an embedding model
    # use the /embedding endpoint
    llama-server -m model.gguf --embedding --pooling cls -ub 8192
  • Serve a reranking model
    # use the /reranking endpoint
    llama-server -m model.gguf --reranking
  • Constrain all outputs with a grammar
    # custom grammar
    llama-server -m model.gguf --grammar-file grammar.gbnf
    
    # JSON
    llama-server -m model.gguf --grammar-file grammars/json.gbnf
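
Because the server exposes the OpenAI chat completions API, it can also be queried with plain curl. A minimal sketch, assuming the server from the first example is running on localhost:8080:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          { "role": "user", "content": "Say hello from the Jetson Nano" }
        ],
        "max_tokens": 64
      }'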
