Overview of popular open-source large language model inference engines. An inference engine is the program that loads a model's weights and generates text responses based on given inputs (a minimal usage example follows the engine list below).
Feel free to create a PR or issue if you want a new engine column or feature row added, or a status updated.
- vLLM: Designed to provide SOTA throughput.
- TensorRT-LLM: Nvidia's high-performance, extensible, PyTorch-like API, designed for use with the Nvidia Triton Inference Server.
- llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
- TGI: Hugging Face's fast and flexible engine designed for high throughput.
- LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
- DeepSpeed-MII / DeepSpeed-FastGen: Microsoft's high-performance implementation, including the SOTA Dynamic SplitFuse scheduler.
- ExLlamaV2: Efficiently runs language models on modern consumer GPUs. Implements the SOTA EXL2 quantization method.
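All of these engines implement the basic loop described in the overview: load a model's weights, then generate text for incoming prompts. As one concrete illustration, here is a minimal sketch using vLLM's offline Python API; the model name is a placeholder and exact arguments can vary between vLLM versions:

```python
# Minimal sketch of the "load weights, then generate" loop using vLLM's offline API.
# The model name is a placeholder; use any supported HF-format model you have access to.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")   # loads the model's weights
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)  # generated continuation for each prompt
```

The server-oriented engines (TGI, LightLLM, Triton with TensorRT-LLM, etc.) wrap the same loop behind an HTTP API rather than an in-process call.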
✅ Included | 🟠 Inferior Alternative | 🌩️ Exists but has Issues | 🔨 PR | 🗓️ Planned | ❓ Unclear / Unofficial | ❌ Not Implemented
| | vLLM | TensorRT-LLM | llama.cpp | TGI | LightLLM | FastGen | ExLlamaV2 |
|---|---|---|---|---|---|---|---|
Optimizations | |||||||
FlashAttention2 | ✅ 1 | ✅ 2 | 🟠 3 | ✅ 4 | ✅ | ✅ | ✅ |
PagedAttention | ✅ 4 | ✅ 2 | ❌ 5 | ✅ | 🟠*** 6 | ✅ | ✅ 7 |
Speculative Decoding | 🔨 8 | ✅ 9 | ✅ 10 | ✅ 11 | ❌ | ❌ 12 | ✅ |
Tensor Parallel | ✅ | ✅ 13 | 🟠** 14 | ✅ 15 | ✅ | ✅ 16 | ❌ |
Pipeline Parallel | ✅ 17 | ✅ 18 | ❌ 19 | ❓ 15 | ❌ | ❌ 20 | ❌ |
Optim. / Scheduler | |||||||
Dyn. SplitFuse (SOTA 21) | 🗓️ 21 | 🗓️ 22 | ❌ | ❌ | ❌ | ✅ 21 | ❌ |
Efficient Router (better) | ❌ | ❌ | ❌ | ❌ | ✅ 23 | ❌ | ❌ |
Cont. Batching | ✅ 21 | ✅ 24 | ✅ | ✅ | ❌ | ✅ 16 | ❓ 25 |
Optim. / Quant | |||||||
EXL2 (SOTA 26) | 🔨 27 | ❌ | ❌ | ✅ 28 | ❌ | ❌ | ✅ |
AWQ | 🌩️ 29 | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
Other Quants | (yes) 30 | GPTQ | GGUF 31 | (yes) 32 | ? | ? | ? |
Features | |||||||
OpenAI-Style API | ✅ | ❌ 33 | ✅ [^13] | ✅ 34 | ✅ 35 | ❌ | ❌ |
Feat. / Sampling | |||||||
Beam Search | ✅ | ✅ 2 | ✅ 36 | 🟠**** 37 | ❌ | ❌ 38 | ❌ 39 |
JSON / Grammars via Outlines | ✅ | 🗓️ | ✅ | ✅ | ? | ? | ✅ |
Models | |||||||
Llama 2 / 3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Mistral | ✅ | ✅ | ✅ | ✅ | ✅ 40 | ✅ | ✅ |
Mixtral | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Implementation | |||||||
Core Language | Python | C++ | C++ | Py / Rust | Python | Python | Python |
GPU API | CUDA* | CUDA* | Metal / CUDA | CUDA* | Triton / CUDA | CUDA* | CUDA |
Repo | |||||||
License | Apache 2 | Apache 2 | MIT | Apache 2 41 | Apache 2 | Apache 2 | MIT |
Github Stars | 17K | 6K | 54K | 8K | 2K | 2K | 3K |
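Regarding the "OpenAI-Style API" row above: engines that expose an OpenAI-compatible endpoint (e.g. vLLM, llama.cpp's server, TGI, LightLLM) can be queried with the standard `openai` Python client by overriding its base URL. A minimal sketch, assuming a server is already running locally on port 8000 and using a placeholder model name:

```python
# Sketch: query a locally hosted OpenAI-compatible endpoint.
# Assumes an engine is already serving on localhost:8000; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # local servers usually ignore the key

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "What is continuous batching?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```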
- BentoML benchmark (June 5, 2024): compares LMDeploy, MLC-LLM, TGI, TensorRT-LLM, and vLLM.
*Supports Triton for one-off kernels such as FlashAttention (FusedAttention) or quantization, or allows Triton plugins; however, the project doesn't otherwise use Triton.
**Tensor split that is processed sequentially rather than in parallel, so only one GPU is active at a time.
****TGI maintainers suggest using `best_of` instead of beam search (`best_of` creates `n` generations and selects the one with the highest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.
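For reference, a sketch of what `best_of` looks like against a running TGI server's `/generate` endpoint; this assumes TGI is already serving on localhost:8080 and that sampling is enabled, which TGI typically requires when `best_of` > 1:

```python
# Sketch: request several sampled candidates from TGI and let `best_of` pick one.
# Assumes a TGI server is already running on localhost:8080.
import requests

payload = {
    "inputs": "Write a regex that matches an ISO 8601 date.",
    "parameters": {
        "best_of": 4,        # generate 4 candidates, return the best-scoring one
        "do_sample": True,   # sampling is typically required when best_of > 1
        "max_new_tokens": 64,
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json()["generated_text"])
```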
Footnotes
- https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md
- https://github.com/ggerganov/llama.cpp/pull/5021 (FlashAttention, but not FlashAttention2)
- https://github.com/huggingface/text-generation-inference/issues/753#issuecomment-1663525606
- https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md
- https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5
- https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- https://github.com/ggerganov/llama.cpp/blob/fe680e3d1080a765e5d3150ffd7bab189742898d/examples/speculative/README.md
- https://github.com/huggingface/text-generation-inference/pull/1308
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184
- https://github.com/ggerganov/llama.cpp/issues/4014#issuecomment-1804925896
- https://github.com/huggingface/text-generation-inference/issues/1031#issuecomment-1727976990
- https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35
- "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597
- https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364
- https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562
- https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
- https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md
- https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460
- https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers
- https://github.com/huggingface/text-generation-inference/pull/1211
- https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst
- https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8
- https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md
- https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21
- https://huggingface.co/docs/text-generation-inference/messages_api
- https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9
- https://github.com/ggerganov/llama.cpp/tree/master/examples/beam-search
- https://github.com/huggingface/text-generation-inference/issues/722#issuecomment-1658823644
- https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043
- https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514
- https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848