Awesome-LLM-Inference: a small collection of 📙 Awesome LLM Inference papers with code. ❤️ Star 🌟 this repo to support me if it helps you.
@misc{Awesome-LLM-Inference@2023,
title={Awesome-LLM-Inference: A small collection for Awesome LLM Inference Papers with codes},
url={https://github.com/DefTruth/Awesome-LLM-Inference},
note={Open-source software available at https://github.com/DefTruth/Awesome-LLM-Inference},
author={Yanjun Qiu},
year={2023}
}
Awesome-LLM-Inference-v0.3.pdf: 500 pages, covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ, FlashDecoding, FlashDecoding++, FP8-LM, LLM-FP4, StreamingLLM, etc.
- LLM Algorithmic/Eval Survey
- LLM Train/Inference Framework
- Weight/Activation Quantize/Compress
- Continuous/In-flight Batching
- IO/FLOPs-Aware Attention Optimization
- KV Cache Scheduling/Quantize/Compress
- GEMM, Tensor Cores, WMMA
- LLM CPU/Single GPU/Mobile Inference
- Non-Transformer Architecture
- Sampling, Position Embed, Others

LLM Algorithmic/Eval Survey

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2023.10 | [Evaluating] Evaluating Large Language Models: A Comprehensive Survey | [arxiv][pdf] | [GitHub][Awesome-LLMs-Evaluation] | ⭐️⭐️⭐️ |
2023.11 | 🔥🔥[Runtime Performance] Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | [ChatGPT Anniversary] ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up? | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [Algorithmic Survey] The Efficiency Spectrum of Large Language Models: An Algorithmic Survey | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [Security and Privacy] A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | 🔥🔥[LLMCompass] A Hardware Evaluation Framework for Large Language Model Inference | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.12 | 🔥🔥[Efficient LLMs] Efficient Large Language Models: A Survey | [arxiv][pdf] | [GitHub][Efficient-LLMs-Survey] | ⭐️⭐️⭐️⭐️⭐️ |

LLM Train/Inference Framework

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2020.05 | 🔥🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism | [arxiv][pdf] | [GitHub][Megatron-LM] | ⭐️⭐️⭐️⭐️⭐️ |
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] | ⭐️⭐️⭐️ |
2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] | ⭐️⭐️⭐️ |
2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.09 | 🔥🔥[vLLM] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] | ⭐️⭐️⭐️⭐️⭐️ |
2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] | ⭐️⭐️⭐️ |
2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] | ⭐️⭐️⭐️ |
2023.10 | 🔥🔥[TensorRT-LLM] NVIDIA TensorRT LLM | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | 🔥🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | [github][blog] | [GitHub][deepspeed-fastgen] | ⭐️⭐️⭐️⭐️⭐️ |

Continuous/In-flight Batching

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2022.07 | 🔥🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.10 | 🔥🔥[In-flight Batching] NVIDIA TensorRT LLM Batch Manager | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | 🔥🔥[DeepSpeed-FastGen 2x vLLM?] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | [github][blog] | [GitHub][deepspeed-fastgen] | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [SpotServe] SpotServe: Serving Generative Large Language Models on Preemptible Instances | [arxiv][pdf] | [GitHub][SpotServe] | ⭐️⭐️⭐️ |

Weight/Activation Quantize/Compress

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2022.06 | 🔥🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️⭐️⭐️ |
2022.08 | [FP8-Quantization] FP8 Quantization: The Power of the Exponent | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] | ⭐️⭐️⭐️ |
2022.10 | 🔥🔥[GPTQ] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | [arxiv][pdf] | [GitHub][gptq] | ⭐️⭐️⭐️⭐️⭐️ |
2022.11 | 🔥🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] | ⭐️⭐️⭐️⭐️⭐️ |
2022.11 | 🔥🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] | ⭐️⭐️⭐️⭐️⭐️ |
2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
2023.06 | 🔥🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] | ⭐️⭐️⭐️⭐️⭐️ |
2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] | ⭐️⭐️⭐️ |
2023.06 | [SqueezeLLM] SqueezeLLM: Dense-and-Sparse Quantization | [arxiv][pdf] | [GitHub][SqueezeLLM] | ⭐️⭐️⭐️ |
2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization | [ZhiHu Tech Blog] | | ⭐️⭐️⭐️ |
2023.10 | [FP8-LM] FP8-LM: Training FP8 Large Language Models | [arxiv][pdf] | [GitHub][MS-AMP] | ⭐️⭐️⭐️ |
2023.10 | [LLM-Shearing] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | [arxiv][pdf] | [GitHub][LLM-Shearing] | ⭐️⭐️⭐️ |
2023.10 | [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers | [arxiv][pdf] | [GitHub][LLM-FP4] | ⭐️⭐️⭐️ |
2023.11 | [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [SmoothQuant+] SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM | [arxiv][pdf] | [GitHub][smoothquantplus] | ⭐️⭐️⭐️ |
2023.11 | [OdysseyLLM W4A8] A Speed Odyssey for Deployable Quantization of LLMs | [arxiv][pdf] | | ⭐️⭐️⭐️ |

IO/FLOPs-Aware Attention Optimization

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2018.05 | [Online Softmax] Online normalizer calculation for softmax | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2022.05 | 🔥🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
2022.10 | [Online Softmax] Self-attention Does Not Need O(n^2) Memory | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.05 | [FlashAttention] From Online Softmax to FlashAttention | [cse599m][flashattn.pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.05 | [FLOP, I/O] Dissecting Batching Effects in GPT Inference | [blog en/cn] | | ⭐️⭐️⭐️ |
2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
2023.06 | [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention | [arxiv][pdf] | [GitHub][dynamic-sparse-flash-attention] | ⭐️⭐️⭐️ |
2023.07 | 🔥🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
2023.10 | 🔥🔥[Flash-Decoding] Flash-Decoding for long-context inference | [tech report] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | [Flash-Decoding++] FlashDecoding++: Faster Large Language Model Inference on GPUs | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.01 | [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | [arxiv][pdf] | [GitHub][sparsegpt] | ⭐️⭐️⭐️ |
2023.11 | 🔥🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.11 | [Streaming Attention Approximation] One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space | [arxiv][pdf] | | ⭐️⭐️⭐️ |

KV Cache Scheduling/Quantize/Compress

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
2023.05 | [KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.06 | [H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | [arxiv][pdf] | [GitHub][H2O] | ⭐️⭐️⭐️ |
2023.09 | 🔥🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] | ⭐️⭐️⭐️⭐️⭐️ |
2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization | [ZhiHu Tech Blog] | | ⭐️⭐️⭐️ |
2023.10 | 🔥🔥[TensorRT-LLM KV Cache FP8] NVIDIA TensorRT LLM | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
2023.10 | 🔥🔥[Adaptive KV Cache Compress] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.10 | [CacheGen] CacheGen: Fast Context Loading for Language Model Applications | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [KV-Cache Optimizations] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO | [arxiv][pdf] | | ⭐️⭐️⭐️ |

GEMM, Tensor Cores, WMMA

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] | ⭐️⭐️⭐️ |

LLM CPU/Single GPU/Mobile Inference

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] | ⭐️⭐️⭐️ |
2023.11 | [LLM CPU Inference] Efficient LLM Inference on CPUs | [arxiv][pdf] | [GitHub][intel-extension-for-transformers] | ⭐️⭐️⭐️ |
2023.12 | [LinguaLinked] LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices | [arxiv][pdf] | | ⭐️⭐️⭐️ |
2023.12 | [OpenVINO] Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO | [arxiv][pdf] | | ⭐️⭐️⭐️ |

Non-Transformer Architecture

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2023.05 | 🔥🔥🔥[RWKV] RWKV: Reinventing RNNs for the Transformer Era | [arxiv][pdf] | [GitHub][RWKV-LM] | ⭐️⭐️⭐️⭐️⭐️ |
2023.12 | 🔥🔥🔥[Mamba] Mamba: Linear-Time Sequence Modeling with Selective State Spaces | [arxiv][pdf] | [GitHub][mamba] | ⭐️⭐️⭐️⭐️⭐️ |

Sampling, Position Embed, Others

Date | Title | Paper | Code | Recommend |
---|---|---|---|---|
2019.11 | 🔥🔥[MQA] Fast Transformer Decoding: One Write-Head is All You Need | [arxiv][pdf] | | ⭐️⭐️⭐️⭐️⭐️ |
2023.05 | 🔥🔥🔥[GQA] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | [arxiv][pdf] | [GitHub][flaxformer] | ⭐️⭐️⭐️⭐️⭐️ |
2021.04 | 🔥🔥[RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] | ⭐️⭐️⭐️ |
2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] | ⭐️⭐️⭐️ |
2023.09 | 🔥🔥[StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] | ⭐️⭐️⭐️ |
2023.09 | 🔥🔥[Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] | ⭐️⭐️⭐️ |
GNU General Public License v3.0
Contributions are welcome: feel free to submit a PR to this repo!