Mars2018/Awesome-LLM-Compression

 
 

Awesome LLM Compression

Awesome LLM compression research papers and tools to accelerate LLM training and inference.

Papers

Survey

  • A Survey on Model Compression for Large Language Models
    Arxiv 2023 [Paper]

  • The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
    Arxiv 2023 [Paper]

  • Efficient Large Language Models: A Survey
    Arxiv 2023 [Paper] [GitHub Page]

  • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
    Arxiv 2023 [Paper]

  • Understanding LLMs: A Comprehensive Overview from Training to Inference
    Arxiv 2024 [Paper]

  • A Survey of Resource-efficient LLM and Multimodal Foundation Models
    Arxiv 2024 [Paper]

  • A Survey on Hardware Accelerators for Large Language Models
    Arxiv 2024 [Paper]

  • A Comprehensive Survey of Compression Algorithms for Language Models
    Arxiv 2024 [Paper]

  • Model Compression and Efficient Inference for Large Language Models: A Survey
    Arxiv 2024 [Paper]

Quantization

  • ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
    NeurIPS 2022 [Paper] [Code (DeepSpeed)]

  • LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
    NeurIPS 2022 [Paper] [Code]

  • Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models
    NeurIPS 2022 [Paper] [Code]

  • LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
    Arxiv 2022 [Paper]

  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
    ICML 2023 [Paper] [Code]

  • FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
    ICML 2023 [Paper] [Code (DeepSpeed)]

  • Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
    ICML 2023 [Paper] [Code]

  • The case for 4-bit precision: k-bit Inference Scaling Laws
    ICML 2023 [Paper]

  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    ICLR 2023 [Paper] [Code]

  • PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
    ACL 2023 [Paper]

  • Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
    ACL 2023 [Paper]

  • QLoRA: Efficient Finetuning of Quantized LLMs
    NeurIPS 2023 [Paper] [Code]

  • The Quantization Model of Neural Scaling
    NeurIPS 2023 [Paper]

  • Quantized Distributed Training of Large Models with Convergence Guarantees
    Arxiv 2023 [Paper]

  • RPTQ: Reorder-based Post-training Quantization for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
    Arxiv 2023 [Paper] [Code]

  • Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
    Arxiv 2023 [Paper]

  • Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
    NeurIPS 2023 [Paper]

  • Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
    Arxiv 2023 [Paper]

  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
    Arxiv 2023 [Paper] [Code]

  • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
    Arxiv 2023 [Paper] [Code]

  • OWQ: Lessons learned from activation outliers for weight quantization in large language models
    Arxiv 2023 [Paper]

  • SqueezeLLM: Dense-and-Sparse Quantization
    Arxiv 2023 [Paper] [Code]

  • INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
    Arxiv 2023 [Paper]

  • INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
    Arxiv 2023 [Paper] [Code]

  • QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
    Arxiv 2023 [Paper]

  • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
    Arxiv 2023 [Paper] [Code (DeepSpeed)]

  • OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
    ISCA 2023 [Paper]

  • NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search
    Arxiv 2023 [Paper]

  • GPT-Zip: Deep Compression of Finetuned Large Language Models
    ICML 2023 Workshop ES-FoMO [Paper]

  • Generating Efficient Kernels for Quantized Inference on Large Language Models
    ICML 2023 Workshop ES-FoMO [Paper]

  • Gradient-Based Post-Training Quantization: Challenging the Status Quo
    Arxiv 2023 [Paper]

  • FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
    Arxiv 2023 [Paper]

  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
    ICLR 2024 [Paper] [Code]

  • FPTQ: Fine-grained Post-Training Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
    Arxiv 2023 [Paper]

  • QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm
    Arxiv 2023 [Paper]

  • Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
    Arxiv 2023 [Paper]

  • Understanding the Impact of Post-Training Quantization on Large-scale Language Models
    Arxiv 2023 [Paper]

  • MEMORY-VQ: Compression for Tractable Internet-Scale Memory
    Arxiv 2023 [Paper]

  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
    Arxiv 2023 [Paper] [Code]

  • Efficient Post-training Quantization with FP8 Formats
    Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]

  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
    Arxiv 2023 [Paper]

  • ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
    Arxiv 2023 [Paper]

  • PB-LLM: Partially Binarized Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
    Arxiv 2023 [Paper]

  • QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
    Arxiv 2023 [Paper]

  • TEQ: Trainable Equivalent Transformation for Quantization of LLMs
    Arxiv 2023 [Paper] [Code (Intel® Neural Compressor)]

  • BitNet: Scaling 1-bit Transformers for Large Language Models
    Arxiv 2023 [Paper]

  • FP8-LM: Training FP8 Large Language Models
    Arxiv 2023 [Paper] [Code]

  • QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
    Arxiv 2023 [Paper] [Code]

  • AFPQ: Asymmetric Floating Point Quantization for LLMs
    Arxiv 2023 [Paper] [Code]

  • AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
    Arxiv 2023 [Paper]

  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
    Arxiv 2023 [Paper]

  • QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
    Arxiv 2023 [Paper]

  • Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
    Arxiv 2023 [Paper]

  • How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
    Arxiv 2023 [Paper]

  • A Speed Odyssey for Deployable Quantization of LLMs
    Arxiv 2023 [Paper]

  • Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
    Arxiv 2023 [Paper]

  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
    NeurIPS 2023 [Paper] [Code]

  • Efficient LLM Inference on CPUs
    NeurIPS 2023 Workshop on Efficient Natural Language and Speech Processing [Paper] [Code]

  • The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
    EMNLP Findings 2023 [Paper]

  • Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models
    EMNLP 2023 [Paper]

  • Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
    EMNLP 2023 [Paper] [Code]

  • Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
    EMNLP 2023 [Paper]

  • Watermarking LLMs with Weight Quantization
    EMNLP 2023 [Paper] [Code]

  • Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
    EMNLP 2023 [Paper]

  • LLM-FP4: 4-Bit Floating-Point Quantized Transformers
    EMNLP 2023 [Paper] [Code]

  • Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
    AAAI 2024 [Paper]

  • SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM
    Arxiv 2023 [Paper]

  • CBQ: Cross-Block Quantization for Large Language Models
    Arxiv 2023 [Paper]

  • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
    Arxiv 2023 [Paper]

  • QuIP: 2-Bit Quantization of Large Language Models With Guarantees
    NeurIPS 2023 [Paper] [Code]

  • A Performance Evaluation of a Quantized Large Language Model on Various Smartphones
    Arxiv 2023 [Paper]

  • FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
    Arxiv 2024 [Paper]

  • Extreme Compression of Large Language Models via Additive Quantization
    Arxiv 2024 [Paper]

  • Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models
    Arxiv 2024 [Paper]

  • Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models
    Arxiv 2024 [Paper]

  • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
    Arxiv 2024 [Paper]

  • KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
    Arxiv 2024 [Paper]

  • Can Large Language Models Understand Context?
    Arxiv 2024 [Paper]

  • AffineQuant: Affine Transformation Quantization for Large Language Models
    EACL 2024 [Paper]

Pruning and Sparsity

  • The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
    ICLR 2023 [Paper]

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
    ICML 2023 [Paper] [Code]

  • LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
    ICML 2023 [Paper] [Code]

  • LLM-Pruner: On the Structural Pruning of Large Language Models
    NeurIPS 2023 [Paper] [Code]

  • ZipLM: Inference-Aware Structured Pruning of Language Models
    NeurIPS 2023 [Paper] [Code]

  • H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
    NeurIPS 2023 [Paper] [Code]

  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
    NeurIPS 2023 [Paper]

  • The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
    NeurIPS 2023 [Paper] [Code]

  • Learning to Compress Prompts with Gist Tokens
    NeurIPS 2023 [Paper]

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
    NeurIPS 2023 [Paper]

  • Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
    ICLR 2023 TinyPapers [Paper]

  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Arxiv 2023 [Paper] [Code]

  • Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
    Arxiv 2023 [Paper] [Code]

  • Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
    ACL 2023 [Paper] [Code]

  • Structured Pruning for Efficient Generative Pre-trained Language Models
    ACL 2023 [Paper]

  • A Simple and Effective Pruning Approach for Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
    Arxiv 2023 [Paper]

  • Structural pruning of large language models via neural architecture search
    AutoML 2023 [Paper]

  • Pruning Large Language Models via Accuracy Predictor
    ICASSP 2024 [Paper]

  • Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
    VLDB 2024 [Paper] [Code]

  • Compressing LLMs: The Truth is Rarely Pure and Never Simple
    Arxiv 2023 [Paper]

  • Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity
    Arxiv 2023 [Paper] [Code]

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
    Arxiv 2023 [Paper]

  • Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
    Arxiv 2023 [Paper] [Code]

  • Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
    Arxiv 2023 [Paper] [Code]

  • Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
    Arxiv 2023 [Paper] [Code]

  • Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
    Arxiv 2023 [Paper] [Code]

  • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
    ICASSP 2024 [Paper]

  • Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning
    EMNLP 2023 Findings [Paper]

  • The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
    EMNLP Findings 2023 [Paper]

  • Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
    Arxiv 2023 [Paper]

  • LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
    Arxiv 2023 [Paper]

  • ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
    Arxiv 2023 [Paper]

  • E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
    Arxiv 2023 [Paper]

  • Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
    Arxiv 2023 [Paper] [Code]

  • How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
    Arxiv 2023 [Paper]

  • BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
    OpenReview [Paper] [Code]

  • Pushing Gradient Towards Zero: A Novel Pruning Method for Large Language Models
    OpenReview 2023 [Paper]

  • An Efficient Plug-and-Play Post-Training Pruning Strategy in Large Language Models
    Preprints 2023 [Paper]

  • Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
    Arxiv 2023 [Paper] [Code]

  • LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
    Arxiv 2023 [Paper]

  • Mini-GPTs: Efficient Large Language Models through Contextual Pruning
    Arxiv 2023 [Paper] [Code]

  • The LLM Surgeon
    Arxiv 2023 [Paper]

  • Fluctuation-based Adaptive Structured Pruning for Large Language Models
    AAAI 2024 [Paper]

  • How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark
    CPAL 2024 [Paper]

  • PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs
    Arxiv 2023 [Paper]

  • Fast and Optimal Weight Update for Pruned Large Language Models
    Arxiv 2024 [Paper]

  • APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
    Arxiv 2024 [Paper]

  • Scaling Sparse Fine-Tuning to Large Language Models
    Arxiv 2024 [Paper]

  • SliceGPT: Compress Large Language Models by Deleting Rows and Columns
    ICLR 2024 [Paper] [Code]

Distillation

  • Lifting the Curse of Capacity Gap in Distilling Language Models
    ACL 2023 [Paper] [Code]

  • Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
    ACL 2023 [Paper]

  • Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
    ACL 2023 [Paper]

  • SCOTT: Self-Consistent Chain-of-Thought Distillation
    ACL 2023 [Paper]

  • DISCO: Distilling Counterfactuals with Large Language Models
    ACL 2023 [Paper] [Code]

  • LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
    Arxiv 2023 [Paper] [Code]

  • How To Train Your (Compressed) Large Language Model
    Arxiv 2023 [Paper]

  • The False Promise of Imitating Proprietary LLMs
    Arxiv 2023 [Paper]

  • GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
    Arxiv 2023 [Paper] [Code]

  • PaD: Program-aided Distillation Specializes Large Models in Reasoning
    Arxiv 2023 [Paper]

  • Knowledge Distillation of Large Language Models
    Arxiv 2023 [Paper] [Code]

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
    Arxiv 2023 [Paper]

  • Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction
    Arxiv 2023 [Paper]

  • Task-agnostic Distillation of Encoder-Decoder Language Models
    Arxiv 2023 [Paper]

  • Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
    Arxiv 2023 [Paper]

  • Can a student Large Language Model perform as well as it's teacher?
    Arxiv 2023 [Paper]

  • Multistage Collaborative Knowledge Distillation from Large Language Models
    Arxiv 2023 [Paper]

  • Lion: Adversarial Distillation of Closed-Source Large Language Model
    EMNLP 2023 [Paper] [Code]

  • MCC-KD: Multi-CoT Consistent Knowledge Distillation
    EMNLP 2023 [Paper]

  • PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation
    EMNLP 2023 [Paper]

  • YODA: Teacher-Student Progressive Learning for Language Models
    Arxiv 2023 [Paper]

Efficient Prompting

  • Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
    ACL 2023 [Paper] [Code]

  • Batch Prompting: Efficient Inference with Large Language Model APIs
    EMNLP 2023 [Paper] [Code]

  • Adapting Language Models to Compress Contexts
    EMNLP 2023 [Paper] [Code]

  • Compressing Context to Enhance Inference Efficiency of Large Language Models
    EMNLP 2023 [Paper] [Code]

  • LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
    EMNLP 2023 [Paper] [Code]

  • Vector-Quantized Prompt Learning for Paraphrase Generation
    EMNLP 2023 Findings [Paper]

  • Efficient Prompting via Dynamic In-Context Learning
    Arxiv 2023 [Paper]

  • Learning to Compress Prompts with Gist Tokens
    Arxiv 2023 [Paper] [Code]

  • In-context Autoencoder for Context Compression in a Large Language Model
    Arxiv 2023 [Paper]

  • Discrete Prompt Compression with Reinforcement Learning
    Arxiv 2023 [Paper]

  • BatchPrompt: Accomplish more with less
    Arxiv 2023 [Paper]

  • (Dynamic) Prompting might be all you need to repair Compressed LLMs
    Arxiv 2023 [Paper]

  • RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
    Arxiv 2023 [Paper] [Code]

  • LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
    Arxiv 2023 [Paper] [Code]

  • Extending Context Window of Large Language Models via Semantic Compression
    Arxiv 2023 [Paper]

  • Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
    Arxiv 2023 [Paper]

  • The Impact of Reasoning Step Length on Large Language Models
    Arxiv 2024 [Paper]

Other

  • TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
    Arxiv 2023 [Paper]

  • Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
    Arxiv 2023 [Paper]

  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
    Arxiv 2023 [Paper]

  • Scaling In-Context Demonstrations with Structured Attention
    Arxiv 2023 [Paper]

  • Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
    Arxiv 2023 [Paper] [Code]

  • CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
    Arxiv 2023 [Paper]

  • Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping
    Arxiv 2023 [Paper]

  • LLMCad: Fast and Scalable On-device Large Language Model Inference
    Arxiv 2023 [Paper]

  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
    Arxiv 2023 [Paper] [Code]

  • LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
    Arxiv 2023 [Paper] [Code]

  • Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
    Arxiv 2023 [Paper]

  • Efficient Streaming Language Models with Attention Sinks
    Arxiv 2023 [Paper] [Code]

  • Efficient Large Language Models Fine-Tuning On Graphs
    Arxiv 2023 [Paper]

  • SparQ Attention: Bandwidth-Efficient LLM Inference
    Arxiv 2023 [Paper]

  • Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
    Arxiv 2023 [Paper]

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
    Arxiv 2023 [Paper] [Code]

  • Dataset Quantization
    ICCV 2023 [Paper] [Code]

  • Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
    NeurIPS 2023 [Paper] [Code]

  • Context Compression for Auto-regressive Transformers with Sentinel Tokens
    EMNLP 2023 [Paper] [Code]

  • TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
    EMNLP 2023 Findings [Paper]

  • Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression
    EMNLP 2023 Findings [Paper]

  • FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
    Arxiv 2024 [Paper]

  • LoMA: Lossless Compressed Memory Attention
    Arxiv 2024 [Paper]

  • Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
    Arxiv 2024 [Paper] [Code]

  • BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
    Arxiv 2024 [Paper] [Code]

  • CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks
    Arxiv 2024 [Paper]

Tools

  • BMCook: Model Compression for Big Models [Code]

  • llama.cpp: Inference of LLaMA model in pure C/C++ [Code]

  • LangChain: Building applications with LLMs through composability [Code]

  • GPTQ-for-LLaMA: 4 bits quantization of LLaMA using GPTQ [Code]

  • Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]

  • vllm: A high-throughput and memory-efficient inference and serving engine for LLMs [Code]

  • LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA) [Code]

  • gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python. [Code]

  • Efficient-Tuning-LLMs: (Efficient Finetuning of QLoRA LLMs). QLoRA, LLama, bloom, baichuan-7B, GLM [Code]

  • bitsandbytes: 8-bit CUDA functions for PyTorch [Code]

  • ExLlama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [Code]

  • lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]

  • Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]

  • llama.onnx: LLaMa/RWKV onnx models, quantization and testcase [Code]

  • fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend. [Code]

  • Sparsebit: A model compression and acceleration toolbox based on pytorch. [Code]

  • llama2.c: Inference Llama 2 in one file of pure C [Code]

  • Megatron-LM: Ongoing research training transformer models at scale [Code]

  • ggml: Tensor library for machine learning [Code]

  • LLamaSharp: C#/.NET binding of llama.cpp, including LLaMa/GPT model inference and quantization, ASP.NET core integration and UI [Code]

  • rwkv.cpp: INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model [Code]

  • Can my GPU run this LLM?: Calculate GPU memory requirement & breakdown for training/inference of LLM models. Supports ggml/bnb quantization [Code]

  • TinyChatEngine: On-Device LLM Inference Library [Code]

  • TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. [Code]

  • IntLLaMA: A fast and light quantization solution for LLaMA [Code]

  • EasyLLM: Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. While enhancing usability, it also ensures training efficiency [Code]

  • GreenBit LLaMA: Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs [Code]

Contributing

This is an active repository and your contributions are always welcome! Before you add papers/tools into the awesome list, please make sure that:

  • The paper or tool is related to Large Language Models (LLMs). Compression algorithms or tools evaluated only on small-scale language models (e.g., BERT) should not be included in the list.
  • The paper is inserted in the correct position in chronological order (by publication or arXiv release date).
  • For a paper posted on arXiv, the [Paper] link points to the arXiv abstract page, not the PDF.
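As a convenience for the last rule, here is a minimal sketch that rewrites an arXiv PDF link to the matching abstract-page link before it is added to the list. The helper name `to_abs_link` is hypothetical and not part of this repository, and the sketch assumes new-style arXiv identifiers (e.g. `2306.00978`); old-style identifiers and non-arXiv links are returned unchanged.

```python
import re

# Matches arXiv PDF links such as https://arxiv.org/pdf/2306.00978.pdf
# or https://arxiv.org/pdf/2306.00978v2 (new-style identifiers only).
_ARXIV_PDF = re.compile(
    r"^https?://arxiv\.org/pdf/(?P<id>\d{4}\.\d{4,5})(?P<ver>v\d+)?(?:\.pdf)?$"
)


def to_abs_link(url: str) -> str:
    """Return the arXiv abstract-page URL for a PDF URL.

    Hypothetical helper for illustration: any URL that does not look like
    a new-style arXiv PDF link is returned as given.
    """
    match = _ARXIV_PDF.match(url.strip())
    if match is None:
        return url
    # Keep an explicit version suffix (e.g. "v2") if the PDF link had one.
    return f"https://arxiv.org/abs/{match.group('id')}{match.group('ver') or ''}"


if __name__ == "__main__":
    print(to_abs_link("https://arxiv.org/pdf/2306.00978.pdf"))
```

Links that already point to an abstract page pass through untouched, so the helper can be applied to every [Paper] link in a contribution without special-casing.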
