Skip to content

Latest commit

 

History

History
115 lines (96 loc) · 10.6 KB

README.md

File metadata and controls

115 lines (96 loc) · 10.6 KB

Open Source LLM Inference Engines

Overview of popular open source large language model inference engines. An inference engine is the program which loads a models weights and generates text responses based on given inputs.

Feel free to create a PR or issue if you want a new engine column, feature row, or update a status.

Compared Inference Engines

  • vLLM: Designed to provide SOTA throughput.
  • TensorRT-LLM: Nvidias design for a high performance extensible pytorch-like API for use with Nvidia Triton Inference Server.
  • llama.cpp: Pure C++ without any dependencies, with Apple Silicon prioritized.
  • TGI: HuggingFace' fast and flexible engine designed for high throughput.
  • LightLLM: Lightweight, fast and flexible framework targeting performance, written purely in Python / Triton.
  • DeepSpeed-MII / DeepSpeed-FastGen: Microsofts high performance implementation including SOTA Dynamic Splitfuse
  • ExLlamaV2: Efficiently run language models on modern consumer GPUs. Implements SOTA quantization method, EXL2.

Comparison Table

✅ Included | 🟠 Inferior Alternative | 🌩️ Exists but has Issues | 🔨 PR | 🗓️ Planned |❓ Unclear / Unofficial | ❌ Not Implemented

vLLM TensorRT-LLM llama.cpp TGI LightLLM Fastgen ExLlamaV2
Optimizations
FlashAttention2 1 2 🟠 3 4
PagedAttention 4 2 5 🟠*** 6 7
Speculative Decoding 🔨 8 9 10 11 12
Tensor Parallel 13 🟠** 14 15 16
Pipeline Parallel 17 18 19 15 20
Optim. / Scheduler
Dyn. SplitFuse (SOTA21) 🗓️ 21 🗓️ 22 21
Efficient Rtr (better) 23
Cont. Batching 21 24 16 25
Optim. / Quant
EXL2 (SOTA26) 🔨 27 28
AWQ 🌩️ 29
Other Quants (yes) 30 GPTQ GGUF 31 (yes) 32 ? ? ?
Features
OpenAI-Style API 33 ✅ [^13] 34 35
Feat. / Sampling
Beam Search 2 36 🟠**** 37 38 39
JSON / Grammars via Outlines 🗓️ ? ?
Models
Llama 2 / 3
Mistral 40
Mixtral
Implementation
Core Language Python C++ C++ Py / Rust Python Python Python
GPU API CUDA* CUDA* Metal / CUDA CUDA* Triton / CUDA CUDA* CUDA
Repo
License Apache 2 Apache 2 MIT Apache 2 41 Apache 2 Apache 2 MIT
Github Stars 17K 6K 54K 8K 2K 2K 3K

Benchmarks

Notes

*Supports Triton for one-off such as FlashAttention (FusedAttention) / quantization, or allows Triton plugins, however the project doesn't use Triton otherwise.

**Sequentially processed tensor split

***"TokenAttention is the special case of PagedAttention when block size equals to 1, which we have tested before and find it under-utilizes GPU compute compared to larger block size. Unless LightLLM's Triton kernel implementation is surprisingly fast, this should not bring speedup."

****TGI maintainers suggest using best_of instead of beam search. (best_of creates n generations and selects the one with the lowest logprob). Anecdotally, beam search is much better at finding the best generation for "non-creative" tasks.

Footnotes

  1. https://github.com/vllm-project/vllm/issues/485#issuecomment-1693009046

  2. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_attention.md 2 3

  3. https://github.com/ggerganov/llama.cpp/pull/5021 FlashAttention, but not FlashAttention2

  4. https://github.com/huggingface/text-generation-inference/issues/753#issuecomment-1663525606 2

  5. https://github.com/ggerganov/llama.cpp/issues/1955

  6. https://github.com/ModelTC/lightllm/blob/main/docs/TokenAttention.md

  7. https://github.com/turboderp/exllamav2/commit/affc3508c1d18e4294a5062f794f44112a8b07c5

  8. https://github.com/vllm-project/vllm/pull/1797

  9. https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html

  10. https://github.com/ggerganov/llama.cpp/blob/fe680e3d1080a765e5d3150ffd7bab189742898d/examples/speculative/README.md

  11. https://github.com/huggingface/text-generation-inference/pull/1308

  12. https://github.com/microsoft/DeepSpeed-MII/issues/254

  13. https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/bindings.cpp#L184

  14. https://github.com/ggerganov/llama.cpp/issues/4014#issuecomment-1804925896

  15. https://github.com/huggingface/text-generation-inference/issues/1031#issuecomment-1727976990 2

  16. https://github.com/microsoft/DeepSpeed-MII 2

  17. https://github.com/vllm-project/vllm/issues/387

  18. https://github.com/NVIDIA/TensorRT-LLM/blob/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/tensorrt_llm/auto_parallel/config.py#L35

  19. "without specific architecture tricks, you will only be using one GPU at a time, and your performance will suffer compared to a single GPU due to communication and synchronization overhead." https://github.com/ggerganov/llama.cpp/issues/4238#issuecomment-1832768597

  20. https://github.com/microsoft/DeepSpeed-MII/issues/329#issuecomment-1830317364

  21. https://blog.vllm.ai/2023/11/14/notes-vllm-vs-deepspeed.html, https://github.com/vllm-project/vllm/issues/1562 2 3 4

  22. https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752

  23. https://github.com/ModelTC/lightllm/blob/a9cf0152ad84beb663cddaf93a784092a47d1515/docs/LightLLM.md#efficient-router

  24. https://github.com/NVIDIA/TensorRT-LLM/blob/main/README.md

  25. https://github.com/turboderp/exllamav2/discussions/19#discussioncomment-6989460

  26. https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/#pareto-frontiers

  27. https://github.com/vllm-project/vllm/issues/296

  28. https://github.com/huggingface/text-generation-inference/pull/1211

  29. https://github.com/vllm-project/vllm/blob/main/docs/source/quantization/auto_awq.rst

  30. https://github.com/vllm-project/vllm/blob/1f24755bf802a2061bd46f3dd1191b7898f13f45/vllm/model_executor/quantization_utils/squeezellm.py#L8

  31. https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md

  32. https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/cli.py#L15-L21

  33. https://github.com/NVIDIA/TensorRT-LLM/issues/334

  34. https://huggingface.co/docs/text-generation-inference/messages_api

  35. https://github.com/ModelTC/lightllm/blob/main/lightllm/server/api_models.py#L9

  36. https://github.com/ggerganov/llama.cpp/tree/master/examples/beam-search

  37. https://github.com/huggingface/text-generation-inference/issues/722#issuecomment-1658823644

  38. https://github.com/microsoft/DeepSpeed-MII/issues/286#issuecomment-1808510043

  39. https://github.com/turboderp/exllamav2/issues/84

  40. https://github.com/ModelTC/lightllm/issues/224#issuecomment-1827365514

  41. https://raw.githubusercontent.com/huggingface/text-generation-inference/main/LICENSE, https://twitter.com/julien_c/status/1777328456709062848