
📖A small collection of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.

📒Introduction

Awesome-LLM-Inference: a small collection of 📙awesome LLM inference papers with code. ❤️Star🌟👆🏻 this repo to support me if it helps you~

©️Citations

```bibtex
@misc{Awesome-LLM-Inference2023,
  title={Awesome-LLM-Inference: A small collection for Awesome LLM Inference Papers with codes},
  url={https://github.com/DefTruth/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
  author={Yanjun Qiu},
  year={2023}
}
```

🎉Download PDFs

Awesome-LLM-Inference-v0.3.pdf: 500 pages, covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ, FlashDecoding, FlashDecoding++, FP8-LM, LLM-FP4, StreamingLLM, etc.

📙Awesome LLM Inference Papers with Codes

📖Contents

* 📖LLM Algorithmic/Eval Survey
* 📖LLM Train/Inference Framework
* 📖Continuous/In-flight Batching
* 📖Weight/Activation Quantize/Compress
* 📖IO/FLOPs-Aware Attention Optimization
* 📖KV Cache Scheduling/Quantize/Compress
* 📖GEMM, Tensor Cores, WMMA
* 📖LLM CPU/Single GPU/Mobile Inference
* 📖Non Transformer Architecture
* 📖Sampling, Position Embedding, Others

📖LLM Algorithmic/Eval Survey

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2023.10 | [Evaluating] Evaluating Large Language Models: A Comprehensive Survey | [arxiv][pdf] | [GitHub][Awesome-LLMs-Evaluation] | ⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up? | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | 🔥🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.12 | 🔥🔥[Efficient LLMs] Efficient Large Language Models: A Survey | [arxiv][pdf] | [GitHub][Efficient-LLMs-Survey] | ⭐️⭐️⭐️⭐️⭐️ |

📖LLM Train/Inference Framework

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2020.05 | 🔥🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism | [arxiv][pdf] | [GitHub][Megatron-LM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] | ⭐️⭐️⭐️ |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] | ⭐️⭐️⭐️ |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.09 | 🔥🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] | ⭐️⭐️⭐️ |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] | ⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[TensorRT-LLM] NVIDIA TensorRT-LLM | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | [github][blog] | [GitHub][deepspeed-fastgen] | ⭐️⭐️⭐️⭐️⭐️ |
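For a sense of how these serving engines are driven in practice, here is a minimal offline-inference sketch using vLLM's Python API; the model name and sampling values are illustrative only, not recommendations:

```python
# Minimal vLLM offline-inference sketch (illustrative model and settings).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any vLLM-supported HF causal LM
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Requests are batched and scheduled internally (continuous batching +
# PagedAttention), so a plain list of prompts is enough.
outputs = llm.generate(["The key idea behind PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)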

📖Continuous/In-flight Batching

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2022.07 | 🔥🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[In-flight Batching] NVIDIA TensorRT-LLM Batch Manager | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | [github][blog] | [GitHub][deepspeed-fastgen] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances | [arxiv][pdf] | [GitHub][SpotServe] | ⭐️⭐️⭐️ |
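Orca's key idea, iteration-level scheduling, can be shown with a toy loop: the batch is re-formed after every decode step instead of waiting for the slowest request to finish. All names below are hypothetical stand-ins, not any serving system's actual API:

```python
# Toy sketch of continuous (iteration-level) batching à la Orca.
from collections import deque

def decode_step(batch):
    """Stand-in for one forward pass; returns requests that just finished."""
    done = []
    for req in batch:
        req["generated"] += 1
        if req["generated"] >= req["max_tokens"]:
            done.append(req)
    return done

def serve(requests, max_batch_size=4):
    waiting, running = deque(requests), []
    while waiting or running:
        # Admit new requests into free slots at EVERY iteration,
        # rather than only when the whole batch drains.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        for req in decode_step(running):
            running.remove(req)
            print(f"finished {req['id']} after {req['generated']} tokens")

serve([{"id": i, "generated": 0, "max_tokens": 3 + i} for i in range(6)])
```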

📖Weight/Activation Quantize/Compress

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2022.06 | 🔥🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.08 | [FP8-Quantization] FP8 Quantization: The Power of the Exponent | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] | ⭐️⭐️⭐️ |
| 2022.10 | 🔥🔥[GPTQ] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | [arxiv][pdf] | [GitHub][gptq] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.11 | 🔥🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.11 | 🔥🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
| 2023.06 | 🔥🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] | ⭐️⭐️⭐️ |
| 2023.06 | [SqueezeLLM] SqueezeLLM: Dense-and-Sparse Quantization | [arxiv][pdf] | [GitHub][SqueezeLLM] | ⭐️⭐️⭐️ |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
| 2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization | [ZhiHu Tech Blog] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.10 | [FP8-LM] FP8-LM: Training FP8 Large Language Models | [arxiv][pdf] | [GitHub][MS-AMP] | ⭐️⭐️⭐️ |
| 2023.10 | [LLM-Shearing] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | [arxiv][pdf] | [GitHub][LLM-Shearing] | ⭐️⭐️⭐️ |
| 2023.10 | [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers | [arxiv][pdf] | [GitHub][LLM-FP4] | ⭐️⭐️⭐️ |
| 2023.11 | [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.11 | [OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM | [arxiv][pdf] | [GitHub][smoothquantplus] | ⭐️⭐️⭐️ |
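As a concrete example of one technique in this table, below is a NumPy sketch of SmoothQuant's difficulty-migration step: a per-channel scale s = max|X|^α / max|W|^(1−α) moves activation outliers into the weights while keeping Y = XW mathematically unchanged. α is the paper's smoothing factor; the data here is synthetic:

```python
# NumPy sketch of SmoothQuant's offline scale migration (synthetic data).
import numpy as np

def smooth(X, W, alpha=0.5):
    """Preserves Y = X @ W: (X / s) @ (diag(s) @ W)."""
    act_max = np.abs(X).max(axis=0)       # per input-channel activation range
    wgt_max = np.abs(W).max(axis=1)       # per input-channel weight range
    s = act_max**alpha / wgt_max**(1 - alpha)
    s = np.clip(s, 1e-5, None)            # guard against degenerate channels
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16)); X[:, 3] *= 50   # inject an activation outlier channel
W = rng.normal(size=(16, 32))
Xs, Ws = smooth(X, W)
assert np.allclose(X @ W, Xs @ Ws)            # the matmul result is unchanged
print(np.abs(X).max(), "->", np.abs(Xs).max())  # activation outlier is tamed
```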

📖IO/FLOPs-Aware Attention Optimization

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.05 | 🔥🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.10 | [Online Softmax] Self-attention Does Not Need O(n²) Memory | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.01 | [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | [arxiv][pdf] | [GitHub][sparsegpt] | ⭐️⭐️⭐️ |
| 2023.05 | [FlashAttention] From Online Softmax to FlashAttention | [cse599m][flashattn.pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.05 | [FLOP, I/O] Dissecting Batching Effects in GPT Inference | [blog en/cn] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.06 | [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention | [arxiv][pdf] | [GitHub][dynamic-sparse-flash-attention] | ⭐️⭐️⭐️ |
| 2023.07 | 🔥🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[Flash-Decoding] Flash-Decoding for long-context inference | [tech report] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | [Flash-Decoding++] FlashDecoding++: Faster Large Language Model Inference on GPUs | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | [Streaming Attention Approximation] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
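The common thread in this table is the online softmax trick. A toy NumPy version of the one-pass normalizer, i.e. the recurrence that FlashAttention and Flash-Decoding tile over, might look like this:

```python
# Toy NumPy sketch of the online (one-pass) softmax normalizer: the running
# max m and running sum l are rescaled as each block of scores arrives, so the
# full score row never has to be materialized at once.
import numpy as np

def online_softmax(scores, block=4):
    m, l = -np.inf, 0.0
    for i in range(0, len(scores), block):
        x = scores[i:i + block]
        m_new = max(m, x.max())
        l = l * np.exp(m - m_new) + np.exp(x - m_new).sum()  # rescale old sum
        m = m_new
    # Second pass only to emit final values in this toy version; FlashAttention
    # instead rescales partial attention outputs on the fly.
    return np.exp(scores - m) / l

x = np.random.default_rng(0).normal(size=10)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```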

📖KV Cache Scheduling/Quantize/Compress

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.05 | [KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.06 | [H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [arxiv][pdf] | [GitHub][H2O] | ⭐️⭐️⭐️ |
| 2023.09 | 🔥🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization | [ZhiHu Tech Blog] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT-LLM | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[Adaptive KV Cache Compress] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | [CacheGen] CacheGen: Fast Context Loading for Language Model Applications | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
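A toy sketch of the PagedAttention bookkeeping may help: each sequence holds a block table mapping its logical token positions to fixed-size physical KV blocks drawn from a shared pool, so the cache no longer needs one contiguous allocation per sequence. All names here are hypothetical; real blocks would hold K/V tensors:

```python
# Toy sketch of PagedAttention-style KV cache paging (hypothetical names).
BLOCK_SIZE = 4
free_blocks = list(range(16))     # physical block ids in a global pool
block_tables = {}                 # seq_id -> list of physical block ids

def append_token(seq_id, pos):
    """Allocate a new physical block only when a sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if pos % BLOCK_SIZE == 0:
        table.append(free_blocks.pop())
    # (block, offset) where this token's K/V entries would be written
    return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

for pos in range(6):
    print("seq0 token", pos, "->", append_token("seq0", pos))
```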

📖GEMM, Tensor Cores, WMMA

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] | ⭐️⭐️⭐️ |

📖LLM CPU/Single GPU/Mobile Inference

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] | ⭐️⭐️⭐️ |
| 2023.11 | [LLM CPU Inference] Efficient LLM Inference on CPUs | [arxiv][pdf] | [GitHub][intel-extension-for-transformers] | ⭐️⭐️⭐️ |
| 2023.12 | [LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.12 | [OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |

📖Non Transformer Architecture

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2023.05 | 🔥🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era | [arxiv][pdf] | [GitHub][RWKV-LM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.12 | 🔥🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces | [arxiv][pdf] | [GitHub][mamba] | ⭐️⭐️⭐️⭐️⭐️ |
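Both entries replace quadratic attention with a recurrent state. A minimal NumPy sketch of the underlying linear state-space scan, omitting Mamba's input-dependent (selective) parameterization and discretization details, looks like this:

```python
# Minimal NumPy sketch of a diagonal linear state-space recurrence:
#   h_t = A * h_{t-1} + B * x_t ;  y_t = C . h_t
# O(T) sequential steps with O(1) state per step, instead of O(T^2) attention.
import numpy as np

def ssm_scan(x, A, B, C):
    h, ys = np.zeros_like(A), []
    for x_t in x:                 # one state update per token
        h = A * h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=8), A=np.full(4, 0.9),
             B=rng.normal(size=4), C=rng.normal(size=4))
print(y)
```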

📖Sampling, Position Embedding, Others

| Date | Title | Paper | Code | Recommend |
|:---:|:---|:---:|:---:|:---:|
| 2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2021.04 | 🔥🔥[RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] | ⭐️⭐️⭐️ |
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] | ⭐️⭐️⭐️ |
| 2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.09 | 🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] | ⭐️⭐️⭐️ |
| 2023.09 | 🔥🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] | ⭐️⭐️⭐️ |
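As a quick illustration of RoPE from the table above, here is a NumPy sketch that rotates each pair of feature dimensions by a position-dependent angle, so the dot product between rotated q and k depends only on their relative position:

```python
# NumPy sketch of rotary position embedding (RoPE).
import numpy as np

def rope(x, pos, base=10000.0):
    """x: (d,) with even d; returns x rotated for absolute position `pos`."""
    d = x.shape[0]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    ang = pos * inv_freq
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.ones(8), np.ones(8)
# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n,
# so both of these print the same value.
print(rope(q, 5) @ rope(k, 3), rope(q, 7) @ rope(k, 5))
```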

©️License

GNU General Public License v3.0

🎉Contribute

You are welcome to submit a PR to this repo!
