Awesome LLM compression research papers and tools to accelerate LLM training and inference.

- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers, NeurIPS 2022 [Paper] [Code]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022 [Paper] [Code]
- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models, arXiv 2022 [Paper]
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling, arXiv 2023 [Paper]
- Quantized Distributed Training of Large Models with Convergence Guarantees, arXiv 2023 [Paper]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, ICML 2023 [Paper] [Code]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, ICLR 2023 [Paper] [Code]
- RPTQ: Reorder-based Post-training Quantization for Large Language Models, arXiv 2023 [Paper] [Code]
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, arXiv 2023 [Paper] [Code]
- QLoRA: Efficient Finetuning of Quantized LLMs, arXiv 2023 [Paper] [Code]
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models, arXiv 2023 [Paper]
- The Quantization Model of Neural Scaling, arXiv 2023 [Paper]
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization, arXiv 2023 [Paper]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv 2023 [Paper] [Code]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models, arXiv 2023 [Paper]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, arXiv 2023 [Paper] [Code]
- OWQ: Lessons learned from activation outliers for weight quantization in large language models, arXiv 2023 [Paper]
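
The post-training quantization papers above differ in how they handle outliers and calibration, but they share a common primitive: rounding weights to a low-bit grid with a per-channel (or per-group) scale. Below is a minimal NumPy sketch of plain symmetric round-to-nearest INT8 weight quantization as a point of reference; the function names, the 8-bit setting, and the toy usage are illustrative assumptions, not the method of any specific paper listed.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per output channel.

    w: weight matrix of shape (out_features, in_features).
    Returns the integer weights and the per-channel scales.
    (Illustrative sketch; not taken from any paper above.)
    """
    qmax = 2 ** (n_bits - 1) - 1                       # e.g. 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                    # avoid division by zero
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale

def dequantize(w_int: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map the integer weights back to floating point for comparison."""
    return w_int.astype(np.float32) * scale

# Toy usage: quantize a random layer and measure the reconstruction error.
w = np.random.randn(16, 64).astype(np.float32)
w_int, scale = quantize_per_channel(w)
err = np.abs(w - dequantize(w_int, scale)).mean()
print(f"mean absolute quantization error: {err:.5f}")
```
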
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, ICLR 2023 [Paper]
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, arXiv 2023 [Paper] [Code]
- LLM-Pruner: On the Structural Pruning of Large Language Models, arXiv 2023 [Paper] [Code]
- Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models, ICLR 2023 TinyPapers [Paper]
- Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering, arXiv 2023 [Paper] [Code]
- Learning to Compress Prompts with Gist Tokens, arXiv 2023 [Paper] [Code]
- Efficient Prompting via Dynamic In-Context Learning, arXiv 2023 [Paper]
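
The pruning papers above differ in how they choose what to remove (one-shot reconstruction, structural groups, self-information scores), but the baseline they are typically compared against is unstructured magnitude pruning. The NumPy sketch below shows that baseline only; the sparsity level and function name are illustrative choices, not from any listed paper.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights until roughly `sparsity` of them are zero.

    (Illustrative unstructured-pruning baseline, not any paper's specific method.)
    """
    k = int(w.size * sparsity)                  # number of weights to remove
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold                # keep only weights above the cutoff
    return w * mask

# Toy usage: prune half of a random weight matrix.
w = np.random.randn(16, 64).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.5)
print("achieved sparsity:", float((w_sparse == 0).mean()))
```
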
- Lifting the Curse of Capacity Gap in Distilling Language Models, ACL 2023 [Paper] [Code]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, ACL 2023 [Paper]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions, arXiv 2023 [Paper] [Code]
- Large Language Model Distillation Doesn't Need a Teacher, arXiv 2023 [Paper] [Code]
- The False Promise of Imitating Proprietary LLMs, arXiv 2023 [Paper]
- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo, arXiv 2023 [Paper] [Code]
- PaD: Program-aided Distillation Specializes Large Models in Reasoning, arXiv 2023 [Paper]
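
The distillation papers above vary widely in data and recipe (instruction data, rationales, program-aided traces), but the classic objective they build on is training a student against the teacher's temperature-softened output distribution. The NumPy sketch below shows only that generic soft-label loss; the temperature, shapes, and function names are illustrative assumptions, not the setup of any listed paper.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    z = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0) -> float:
    """KL divergence between temperature-softened teacher and student distributions,
    scaled by T^2 as in classic soft-label knowledge distillation.
    (Illustrative sketch; not any listed paper's specific objective.)"""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-12) - np.log(s + 1e-12))).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
teacher = np.random.randn(4, 10)
student = np.random.randn(4, 10)
print("soft-label distillation loss:", distillation_loss(student, teacher))
```
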
- BMCook: Model Compression for Big Models [Code]
- llama.cpp: Inference of the LLaMA model in pure C/C++ [Code]
- LangChain: Building applications with LLMs through composability [Code]
- GPTQ-for-LLaMA: 4-bit quantization of LLaMA using GPTQ [Code]
- Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]