Practical Llama 3, 3.1 and 3.2 inference implemented in a single Java file.
This project is the successor of llama2.java based on llama2.c by Andrej Karpathy and his excellent educational videos.
Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.
- Single file, no dependencies
- GGUF format parser
- Llama 3 tokenizer based on minbpe
- Llama 3 inference with Grouped-Query Attention
- Support for Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings)
- Support for Q8_0 and Q4_0 quantizations
- Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
- Simple CLI with --chat and --instruct modes.
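To give a feel for what the GGUF parser in the feature list has to deal with, here is a minimal sketch that reads only the fixed-size header at the start of a .gguf file. It assumes the GGUF v2/v3 layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian). The class and record names are hypothetical and are not taken from Llama3.java.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from Llama3.java: reads only the fixed-size GGUF header.
final class GgufHeaderSketch {
    // Fields in file order (assumed GGUF v2/v3): version, tensor count, metadata key/value count.
    // The two counts are unsigned 64-bit in the file; they are read as signed longs here.
    record Header(int version, long tensorCount, long metadataKvCount) {}

    static Header parse(ByteBuffer buffer) {
        // Work on a little-endian view so the caller's position and byte order stay untouched.
        ByteBuffer in = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        in.get(magic);
        if (!"GGUF".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("Not a GGUF file: bad magic");
        }
        int version = in.getInt();
        long tensorCount = in.getLong();
        long metadataKvCount = in.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    public static void main(String[] args) {
        // Build a fake 24-byte header in memory instead of opening a real model file.
        ByteBuffer fake = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        fake.put("GGUF".getBytes(StandardCharsets.US_ASCII)).putInt(3).putLong(291).putLong(24);
        fake.flip();
        System.out.println(parse(fake));
    }
}
```

The metadata key/value pairs and tensor descriptors follow the header; a full parser would continue reading from that point.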
Here's the interactive --chat mode in action:
Download pure Q4_0 and (optionally) Q8_0 quantized .gguf files from:
- https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF
- https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF
- https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
- https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF
The pure Q4_0 quantized models are recommended, except for the very small models (1B), where the command below fetches Q8_0 instead. Please be gentle with the huggingface.co servers:
# Llama 3.2 (3B)
curl -L -O https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
# Llama 3.2 (1B)
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf
# Llama 3.1 (8B)
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf
# Llama 3 (8B)
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
# Optionally download the Q8_0 quantized models
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the token_embd.weights/output.weights tensors are quantized with Q6_K instead of Q4_0.
A pure Q4_0 quantization can be generated from a high precision (F32, F16, BFLOAT16) .gguf source with the llama-quantize utility from llama.cpp as follows:
./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
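Purity matters because a pure file needs only one decoding routine per quantization type, whereas a mixed file also needs a Q6_K decoder. As an illustration, here is a minimal sketch that decodes a single Q4_0 block. It assumes the llama.cpp block layout: 32 weights per block, stored as a little-endian half-precision scale followed by 16 bytes of packed 4-bit values. The class and method names are hypothetical and not taken from Llama3.java.

```java
// Hypothetical helper, not taken from Llama3.java: decodes one Q4_0 block (assumed llama.cpp layout).
final class Q40BlockSketch {
    static final int BLOCK_SIZE = 32;                       // weights per block
    static final int BYTES_PER_BLOCK = 2 + BLOCK_SIZE / 2;  // fp16 scale + 16 bytes of packed nibbles = 18

    /** Decodes the 18-byte block starting at {@code offset} into 32 floats. */
    static float[] dequantize(byte[] data, int offset) {
        // The scale is stored as a little-endian IEEE 754 half-precision value.
        short scaleBits = (short) ((data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(scaleBits); // available since Java 20
        float[] out = new float[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE / 2; i++) {
            int packed = data[offset + 2 + i] & 0xFF;
            // Low nibble holds weight i, high nibble holds weight i + 16; both are biased by 8.
            out[i] = ((packed & 0x0F) - 8) * scale;
            out[i + BLOCK_SIZE / 2] = ((packed >>> 4) - 8) * scale;
        }
        return out;
    }
}
```

At 18 bytes per 32 weights, Q4_0 costs 4.5 bits per weight; by the same reasoning a Q8_0 block (half-precision scale plus 32 signed bytes) costs 8.5 bits per weight.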
Java 21+ is required, in particular for the MemorySegment mmap-ing feature.
jbang is a perfect fit for this use case, just run:
jbang Llama3.java --help
Or execute directly, also via jbang:
chmod +x Llama3.java
./Llama3.java --help
java --enable-preview --source 21 --add-modules jdk.incubator.vector Llama3.java -i --model Meta-Llama-3-8B-Instruct-Q4_0.gguf
A simple Makefile is provided; run make to produce llama3.jar, or build it manually:
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
Run the resulting llama3.jar as follows:
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
Important Note
On GraalVM, please note that the Graal compiler doesn't support the Vector API yet; run with -Dllama.VectorAPI=false, but expect sub-optimal performance.
Vanilla OpenJDK 21+ is recommended for now, which supports the Vector API.
Vanilla llama.cpp built with make -j 20.
./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
Executed as follows:
./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
-n 512 \
-s 42 \
-p "<|start_of_header_id|>user<|end_of_header_id|>Why is the sky blue?<|eot_id|><|start_of_header_id|>assistant<|end_of_header_id|>\n\n" \
--interactive-specials
Collected the "eval time" metric in tokens/s.
Running on OpenJDK 21.0.2.
jbang Llama3.java \
--model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
--max-tokens 512 \
--seed 42 \
--stream false \
--prompt "Why is the sky blue?"
Model | tokens/s | Implementation |
---|---|---|
Llama-3-8B-Instruct-Q4_0.gguf | 7.53 | llama.cpp |
Llama-3-8B-Instruct-Q4_0.gguf | 6.95 | llama3.java |
Llama-3-8B-Instruct-Q8_0.gguf | 5.16 | llama.cpp |
Llama-3-8B-Instruct-Q8_0.gguf | 4.02 | llama3.java |
Notes: running on a single CCD, e.g. taskset -c 0-15 jbang Llama3.java ..., since inference is constrained by memory bandwidth.
Model | tokens/s | Implementation |
---|---|---|
Llama-3-8B-Instruct-Q4_0.gguf | 9.26 | llama.cpp |
Llama-3-8B-Instruct-Q4_0.gguf | 8.03 | llama3.java |
Llama-3-8B-Instruct-Q8_0.gguf | 5.79 | llama.cpp |
Llama-3-8B-Instruct-Q8_0.gguf | 4.92 | llama3.java |
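The memory-bandwidth constraint mentioned in the notes can be turned into a rough ceiling: if every weight has to be read once per generated token, tokens/s cannot exceed the usable bandwidth divided by the size of the weights. The sketch below works that out for Q4_0 (18 bytes per 32 weights) and Q8_0 (34 bytes per 32 weights). The 8e9 weight count and the 45 GB/s bandwidth are illustrative assumptions, not measurements from the machines above, and the estimate ignores the KV cache and any non-quantized tensors.

```java
// Hypothetical back-of-the-envelope helper; the figures in main are assumptions, not measurements.
final class BandwidthBoundSketch {
    /** Bytes occupied by the weights when each 32-weight block takes bytesPerBlock bytes. */
    static double weightBytes(double weights, int bytesPerBlock) {
        return weights / 32 * bytesPerBlock;
    }

    /** Upper bound on tokens/s if every weight is read exactly once per generated token. */
    static double maxTokensPerSecond(double weights, int bytesPerBlock, double bandwidthBytesPerSecond) {
        return bandwidthBytesPerSecond / weightBytes(weights, bytesPerBlock);
    }

    public static void main(String[] args) {
        double weights = 8e9;     // assumed round figure for an 8B model
        double bandwidth = 45e9;  // assumed 45 GB/s of usable memory bandwidth
        System.out.printf("Q4_0 ceiling: %.1f tokens/s%n", maxTokensPerSecond(weights, 18, bandwidth));
        System.out.printf("Q8_0 ceiling: %.1f tokens/s%n", maxTokensPerSecond(weights, 34, bandwidth));
    }
}
```

Under these assumptions the Q8_0 ceiling is 18/34 of the Q4_0 ceiling, which is why the larger quantization is slower in both tables regardless of implementation.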
MIT