Overview of popular open-source large language model inference engines. An inference engine is the program that loads a model's weights and generates text responses based on given inputs (a minimal usage example follows the engine list below).
Feel free to create a PR or issue if you want a new engine column or feature row added, or a status updated.
- vLLM: Designed to provide SOTA throughput.
- TensorRT-LLM: Nvidia's high-performance, extensible, PyTorch-like API, designed for use with the Nvidia Triton Inference Server.
- llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
- TGI: Hugging Face's fast and flexible engine designed for high throughput.
- LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
- DeepSpeed-MII / DeepSpeed-FastGen: Microsoft's high-performance implementation, including the SOTA Dynamic SplitFuse scheduler.
- ExLlamaV2: Efficiently runs language models on modern consumer GPUs. Implements the SOTA EXL2 quantization method.
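All of these engines implement the basic loop described in the overview: load a model's weights, then generate text for incoming prompts. As one concrete illustration, here is a minimal sketch using vLLM's offline Python API; the model name is a placeholder and exact arguments can vary between vLLM versions:

```python
# Minimal sketch of the "load weights, then generate" loop using vLLM's offline API.
# The model name is a placeholder; use any supported HF-format model you have access to.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # loads the model's weights
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)  # generated continuation for each prompt
```

The server-oriented engines (TGI, LightLLM, Triton with TensorRT-LLM, etc.) wrap the same loop behind an HTTP API rather than an in-process call.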
✅ Included | 🟠 Inferior Alternative | 🌩️ Exists but has Issues | 🔨 PR | 🗓️ Planned | ❓ Unclear / Unofficial | ❌ Not Implemented
| | vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | FastGen | ExLlamaV2 |
|---|---|---|---|---|---|---|---|
Optimizations | |||||||
FlashAttention2 | ✅ 1 | ✅ 2 | 🟠 3 | ✅ 4 | ✅ | ✅ | ✅ |
PagedAttention | ✅ 4 | ✅ 2 | ❌ 5 | ✅ | 🟠*** 6 | ✅ | ✅ 7 |
Speculative Decoding | 🔨 8 | ✅ 9 | ✅ 10 | ✅ 11 | ❌ | ❌ 12 | ✅ |
Tensor Parallel | ✅ | ✅ 13 | 🟠** 14 | ✅ 15 | ✅ | ✅ 16 | ❌ |
Pipeline Parallel | ✅ 17 | ✅ 18 | ❌ 19 | ❓ 15 | ❌ | ❌ 20 | ❌ |
Optim. / Scheduler | |||||||
Dyn. SplitFuse (SOTA 21) | 🗓️ 21 | 🗓️ 22 | ❌ | ❌ | ❌ | ✅ 21 | ❌ |
Efficient Router (better) | ❌ | ❌ | ❌ | ❌ | ✅ 23 | ❌ | ❌ |
Cont. Batching | ✅ 21 | ✅ 24 | ✅ | ✅ | ❌ | ✅ 16 | ❓ 25 |
Optim. / Quant | |||||||
EXL2 (SOTA 26) | 🔨 27 | ❌ | ❌ | ✅ 28 | ❌ | ❌ | ✅ |
AWQ | 🌩️ 29 | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
Other Quants | (yes) 30 | GPTQ | GGUF 31 | (yes) 32 | ? | ? | ? |
Features | |||||||
OpenAI-Style API | ✅ | ❌ 33 | ✅ [^13] | ✅ 34 | ✅ 35 | ❌ | ❌ |
Feat. / Sampling | |||||||
Beam Search | ✅ | ✅ 2 | ✅ 36 | 🟠**** 37 | ❌ | ❌ 38 | ❌ 39 |
JSON / Grammars via Outlines | ✅ | 🗓️ | ✅ | ✅ | ? | ? | ✅ |
Models | |||||||
Llama 2 / 3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Mistral | ✅ | ✅ | ✅ | ✅ | ✅ 40 | ✅ | ✅ |
Mixtral | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Implementation | |||||||
Core Language | Python | C++ | C++ | Py / Rust | Python | Python | Python |
GPU API | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
Repo | |||||||
License | Apache 2 | Apache 2 | MIT | Apache 2 41 | Apache 2 | Apache 2 | MIT |
Github Stars | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
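Regarding the "OpenAI-Style API" row above: engines that expose an OpenAI-compatible endpoint (e.g. vLLM, llama.cpp's server, TGI, LightLLM) can be queried with the standard `openai` Python client by overriding its base URL. A minimal sketch, assuming a server is already running locally on port 8000 and using a placeholder model name:

```python
# Sketch: query a locally hosted OpenAI-compatible endpoint.
# Assumes an engine is already serving on localhost:8000; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local servers usually ignore the key

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is continuous batching?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```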
- BentoML benchmark (June 5, 2024): compares LMDeploy, MLC-LLM, TGI, TensorRT-LLM, and vLLM.
*Supports Triton for one-off kernels such as FlashAttention (FusedAttention) or quantization, or allows Triton plugins; however, the project doesn't otherwise use Triton.
**Tensor split that is processed sequentially rather than in parallel, so only one GPU is active at a time.
****TGI maintainers suggest using `best_of` instead of beam search (`best_of` creates `n` generations and selects the one with the highest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
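For reference, a sketch of what `best_of` looks like against a running TGI server's `/generate` endpoint; this assumes TGI is already serving on localhost:8080 and that sampling is enabled, which TGI typically requires when `best_of` > 1:

```python
# Sketch: request several sampled candidates from TGI and let `best_of` pick one.
# Assumes a TGI server is already running on localhost:8080.
import requests

payload = {
    "inputs": "Write a regex that matches an ISO 8601 date.",
    "parameters": {
        "best_of": 4,        # generate 4 candidates, return the best-scoring one
        "do_sample": True,   # sampling is typically required when best_of > 1
        "max_new_tokens": 64,
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json()["generated_text"])
```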
Footnotes
- https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md
- https://github.com/ggerganov/llama.cpp/pull/5021 (FlashAttention, but not FlashAttention2)
- https://github.com/huggingface/text-generation-inference/issues/753#issuecomment-1663525606
- https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md
- https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5
- https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- https://github.com/ggerganov/llama.cpp/blob/fe680e3d1080a765e5d3150ffd7bab189742898d/examples/speculative/README.md
- https://github.com/huggingface/text-generation-inference/pull/1308
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184
- https://github.com/ggerganov/llama.cpp/issues/4014#issuecomment-1804925896
- https://github.com/huggingface/text-generation-inference/issues/1031#issuecomment-1727976990
- https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35
- "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597
- https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364
- https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562
- https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
- https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md
- https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers
- https://github.com/huggingface/text-generation-inference/pull/1211
- https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst
- https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8
- https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md
- https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21
- https://huggingface.co/docs/text-generation-inference/messages_api
- https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9
- https://github.com/ggerganov/llama.cpp/tree/master/examples/beam-search
- https://github.com/huggingface/text-generation-inference/issues/722#issuecomment-1658823644
- https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043
- https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514
- https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848