This repo contains a comprehensive paper list on model quantization for efficient deep learning, covering AI conferences, journals, and arXiv. As a highlight, we categorize the papers by model structure and application scenario, and label the quantization methods with keywords.
This repo is being actively updated, and contributions in any form to make this list more comprehensive are welcome. Special thanks to collaborator Zhikai Li, and all researchers who have contributed to this repo!
If you find this repo useful, please consider ★STARing and feel free to share it with others!
[Update: June, 2023] Added new arXiv papers uploaded in May 2023, especially from the hot field of LLM quantization.
[Update: June, 2023] Reborn this repo! New style, better experience!
Keywords: PTQ: post-training quantization | Non-uniform: non-uniform quantization | MP: mixed-precision quantization | Extreme: binary or ternary quantization
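To make the keywords concrete, here is a minimal NumPy sketch of uniform affine (asymmetric) quantization, the baseline that the PTQ, non-uniform, mixed-precision, and extreme methods below all build on. The function names are illustrative, not taken from any listed paper:

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Asymmetric uniform (affine) quantization: x ~ scale * (q - zero_point).

    Assumes x has a non-degenerate range (x.max() > x.min())."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Map integers back to the float domain for comparison/debugging.
    return scale * (q.astype(np.float32) - zero_point)
```

Post-training quantization (PTQ) estimates `scale`/`zero_point` from a small calibration set rather than learning them during training; non-uniform methods replace the evenly spaced grid with learned quantization levels.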
- "A Survey of Quantization Methods for Efficient Neural Network Inference", Book Chapter: Low-Power Computer Vision, 2021. [paper]
- "Full Stack Optimization of Transformer Inference: a Survey", arXiv, 2023. [paper]
- "A White Paper on Neural Network Quantization", arXiv, 2021. [paper]
- "Binary Neural Networks: A Survey", PR, 2020. [paper] [Extreme]
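For the Extreme keyword, binary networks such as those covered in the survey above constrain each weight to two values. A minimal sketch of scaled sign binarization with a straight-through estimator (STE) for the backward pass, assuming XNOR-Net-style scaling; the helper names are illustrative:

```python
import numpy as np

def binarize(w):
    # Scaled sign binarization: each weight becomes +alpha or -alpha,
    # where alpha is the mean absolute value of the tensor (XNOR-Net style).
    alpha = np.abs(w).mean()
    return alpha * np.where(w >= 0, 1.0, -1.0)

def ste_backward(w, grad_out, clip=1.0):
    # Straight-through estimator: sign() has zero gradient almost everywhere,
    # so training passes the incoming gradient through, masked to |w| <= clip.
    return grad_out * (np.abs(w) <= clip)
```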
- "One-Shot Model for Mixed-Precision Quantization", CVPR, 2023. [paper] [MP]
- "Adaptive Data-Free Quantization", CVPR, 2023. [paper]
- "Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization", CVPR, 2023. [paper] [PTQ]
- "Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective", CVPR, 2023. [paper] [code] [PTQ]
- "GENIE: Show Me the Data for Quantization", CVPR, 2023. [paper] [code] [PTQ]
- "Bayesian asymmetric quantized neural networks", PR, 2023. [paper]
- "Distribution-sensitive Information Retention for Accurate Binary Neural Network", IJCV, 2023. [paper] [Extreme]
- "SDQ: Stochastic Differentiable Quantization with Mixed Precision", ICML, 2022. [paper] [MP]
- "Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks", ICML, 2022. [paper] [code]
- "GACT: Activation Compressed Training for Generic Network Architectures", ICML, 2022. [paper] [code]
- "Overcoming Oscillations in Quantization-Aware Training", ICML, 2022. [paper] [code]
- "Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation", CVPR, 2022. [paper] [code] [Non-uniform]
- "Learnable Lookup Table for Neural Network Quantization", CVPR, 2022. [paper] [code]
- "Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error", CVPR, 2022. [paper] [PTQ] [Non-uniform]
- "Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization", CVPR, 2022. [paper] [Non-uniform] [MP]
- "IntraQ: Learning Synthetic Images With Intra-Class Heterogeneity for Zero-Shot Network Quantization", CVPR, 2022. [paper] [code]
- "Instance-Aware Dynamic Neural Network Quantization", CVPR, 2022. [paper]
- "Leveraging Inter-Layer Dependency for Post-Training Quantization", NeurIPS, 2022. [paper] [PTQ]
- "Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques", NeurIPS, 2022. [paper]
- "Entropy-Driven Mixed-Precision Quantization for Deep Network Design", NeurIPS, 2022. [paper] [MP]
- "Redistribution of Weights and Activations for AdderNet Quantization", NeurIPS, 2022. [paper]
- "FP8 Quantization: The Power of the Exponent", NeurIPS, 2022. [paper] [code]
- "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning", NeurIPS, 2022. [paper] [code] [PTQ]
- "ClimbQ: Class Imbalanced Quantization Enabling Robustness on Efficient Inferences", NeurIPS, 2022. [paper]
- "Non-Uniform Step Size Quantization for Accurate Post-Training Quantization", ECCV, 2022. [paper] [PTQ] [Non-uniform]
- "Towards Accurate Network Quantization with Equivalent Smooth Regularizer", ECCV, 2022. [paper]
- "BASQ: Branch-wise Activation-clipping Search Quantization for Sub-4-bit Neural Networks", ECCV, 2022. [paper] [code]
- "RDO-Q: Extremely Fine-Grained Channel-Wise Quantization via Rate-Distortion Optimization", ECCV, 2022. [paper]
- "Mixed-Precision Neural Network Quantization via Learned Layer-Wise Importance", ECCV, 2022. [paper] [code] [MP]
- "Symmetry Regularization and Saturating Nonlinearity for Robust Quantization", ECCV, 2022. [paper]
- "RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization", IJCAI, 2022. [paper] [code] [PTQ]
- "MultiQuant: Training Once for Multi-bit Quantization of Neural Networks", IJCAI, 2022. [paper]
- "F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization", ICLR, 2022. [paper]
- "8-bit Optimizers via Block-wise Quantization", ICLR, 2022. [paper] [code]
- "Information Bottleneck: Exact Analysis of (Quantized) Neural Networks", ICLR, 2022. [paper] [code]
- "QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization", ICLR, 2022. [paper] [code] [PTQ]
- "SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation", ICLR, 2022. [paper] [code] [PTQ]
- "FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization", FPGA, 2022. [paper] [MP]
- "Accurate Post Training Quantization with Small Calibration Sets", ICML, 2021. [paper] [code] [PTQ]
- "How Do Adam and Training Strategies Help BNNs Optimization?", ICML, 2021. [paper] [code]
- "ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training", ICML, 2021. [paper] [code]
- "HAWQ-V3: Dyadic Neural Network Quantization", ICML, 2021. [paper] [code] [MP]
- "Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution", ICML, 2021. [paper] [MP]
- "Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators", ICML, 2021. [paper] [code]
- "Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples", NeurIPS, 2021. [paper] [code]
- "Post-Training Sparsity-Aware Quantization", NeurIPS, 2021. [paper] [code] [PTQ]
- "Diversifying Sample Generation for Accurate Data-Free Quantization", CVPR, 2021. [paper] [PTQ]
- "Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks", CVPR, 2021. [paper] [code]
- "Learnable Companding Quantization for Accurate Low-bit Neural Networks", CVPR, 2021. [paper]
- "Zero-shot Adversarial Quantization", CVPR, 2021. [paper] [code]
- "Network Quantization with Element-wise Gradient Scaling", CVPR, 2021. [paper] [code]
- "High-Capacity Expert Binary Networks", ICLR, 2021. [paper] [code] [Extreme]
- "Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network", ICLR, 2021. [paper] [code] [Extreme]
- "BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction", ICLR, 2021. [paper] [code] [PTQ]
- "Neural gradients are near-lognormal: improved quantized and sparse training", ICLR, 2021. [paper]
- "Training with Quantization Noise for Extreme Model Compression", ICLR, 2021. [paper]
- "BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization", ICLR, 2021. [paper] [code] [MP]
- "Simple Augmentation Goes a Long Way: ADRL for DNN Quantization", ICLR, 2021. [paper]
- "Distribution Adaptive INT8 Quantization for Training CNNs", AAAI, 2021. [paper]
- "Stochastic Precision Ensemble: Self-Knowledge Distillation for Quantized Deep Neural Networks", AAAI, 2021. [paper]
- "Optimizing Information Theory Based Bitwise Bottlenecks for Efficient Mixed-Precision Activation Quantization", AAAI, 2021. [paper] [MP]
- "OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization", AAAI, 2021. [paper]
- "Scalable Verification of Quantized Neural Networks", AAAI, 2021. [paper] [code]
- "Uncertainty Quantification in CNN through the Bootstrap of Convex Neural Networks", AAAI, 2021. [paper]
- "FracBits: Mixed Precision Quantization via Fractional Bit-Widths", AAAI, 2021. [paper] [MP]
- "Post-training Quantization with Multiple Points: Mixed Precision without Mixed Precision", AAAI, 2021. [paper] [PTQ] [MP]
- "ZeroQ: A Novel Zero Shot Quantization Framework", CVPR, 2020. [paper] [code] [PTQ]
- "LSQ+: Improving Low-bit Quantization Through Learnable Offsets and Better Initialization", CVPR, 2020. [paper]
- "HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks", NeurIPS, 2020. [paper] [MP]
- "Learned Step Size Quantization", ICLR, 2020. [paper]
- "HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision", ICCV, 2019. [paper] [MP]
- "Data-Free Quantization Through Weight Equalization and Bias Correction", ICCV, 2019. [paper] [PTQ]
- "HAQ: Hardware-Aware Automated Quantization with Mixed Precision", CVPR, 2019. [paper] [code] [MP]
- "PACT: Parameterized Clipping Activation for Quantized Neural Networks", arXiv, 2018. [paper]
- "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", CVPR, 2018. [paper]
- "Improving Post-Training Quantization on Object Detection with Task Loss-Guided Lp Metric", arXiv, 2023. [paper] [PTQ]
- "AQD: Towards Accurate Quantized Object Detection", CVPR, 2021. [paper]
- "BiDet: An Efficient Binarized Object Detector", CVPR, 2020. [paper] [code] [Extreme]
- "Fully Quantized Network for Object Detection", CVPR, 2019. [paper]
- "Toward Accurate Post-Training Quantization for Image Super Resolution", CVPR, 2023. [paper] [code] [PTQ]
- "EBSR: Enhanced Binary Neural Network for Image Super-Resolution", arXiv, 2023. [paper] [Extreme]
- "CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution", ECCV, 2022. [paper] [code]
- "Dynamic Dual Trainable Bounds for Ultra-low Precision Super-Resolution Networks", ECCV, 2022. [paper] [code]
- "DAQ: Channel-Wise Distribution-Aware Quantization for Deep Image Super-Resolution Networks", WACV, 2022. [paper] [code]
- "Fully Quantized Image Super-Resolution Networks", ACM MM, 2021. [paper] [code]
- "PAMS: Quantized Super-Resolution via Parameterized Max Scale", ECCV, 2020. [paper] [code]
- "Training Binary Neural Network without Batch Normalization for Image Super-Resolution", AAAI, 2021. [paper] [Extreme]
- "Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis", arXiv, 2023. [paper] [Extreme]
- "BiPointNet: Binary Neural Network for Point Clouds", ICLR, 2021. [paper] [code] [Extreme]
- "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023. [paper] [PTQ]
- "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023. [paper]
- "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023. [paper]
- "Q-Diffusion: Quantizing Diffusion Models", arXiv, 2023. [paper] [PTQ]
- "QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection", 2023. [paper]
- "Output Sensitivity-Aware DETR Quantization", 2023. [paper]
- "RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers", arXiv, 2022. [paper] [PTQ]
- "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", arXiv, 2022. [paper]
- "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022. [paper]
- "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023. [paper] [PTQ]
- "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022. [paper] [code]
- "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022. [paper] [code] [PTQ]
- "PTQ4ViT: Post-Training Quantization for Vision Transformers with Twin Uniform Quantization", ECCV, 2022. [paper] [code] [PTQ]
- "FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer", IJCAI, 2022. [paper] [code] [PTQ]
- "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022. [paper]
- "Post-Training Quantization for Vision Transformer", NeurIPS, 2021. [paper] [PTQ]
- "SqueezeLLM: Dense-and-Sparse Quantization", arXiv, 2023. [paper] [PTQ]
- "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression", arXiv, 2023. [paper] [PTQ]
- "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration", arXiv, 2023. [paper] [PTQ]
- "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models", arXiv, 2023. [paper]
- "QLoRA: Efficient Finetuning of Quantized LLMs", arXiv, 2023. [paper]
- "Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization", arXiv, 2023. [paper]
- "Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling", arXiv, 2023. [paper] [PTQ]
- "RPTQ: Reorder-based Post-training Quantization for Large Language Models", arXiv, 2023. [paper] [code] [PTQ]
- "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models", ICML, 2023. [paper] [code] [PTQ]
- "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", ICLR, 2023. [paper] [code] [PTQ]
- "LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models", arXiv, 2022. [paper]
- "BiBERT: Accurate Fully Binarized BERT", ICLR, 2022. [paper] [code] [Extreme]
- "BiT: Robustly Binarized Multi-distilled Transformer", NeurIPS, 2022. [paper] [code] [Extreme]
- "Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models", NeurIPS, 2022. [paper] [code] [PTQ]
- "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", NeurIPS, 2022. [paper] [code]
- "Towards Efficient Post-training Quantization of Pre-trained Language Models", NeurIPS, 2022. [paper] [PTQ]
- "ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers", NeurIPS, 2022. [paper] [code] [PTQ]
- "Compression of Generative Pre-trained Language Models via Quantization", ACL, 2022. [paper]
- "I-BERT: Integer-only BERT Quantization", ICML, 2021. [paper] [code]
- "BinaryBERT: Pushing the Limit of BERT Quantization", ACL, 2021. [paper] [code] [Extreme]
- "On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers", ACL, 2021. [paper]
- "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP, 2021. [paper] [code]
- "KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization", arXiv, 2021. [paper]
- "TernaryBERT: Distillation-aware Ultra-low Bit BERT", EMNLP, 2020. [paper] [code] [Extreme]
- "Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation", EMNLP, 2020. [paper]
- "GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference", MICRO, 2020. [paper]
- "Towards Fully 8-bit Integer Inference for the Transformer Model", IJCAI, 2020. [paper]
- "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT", AAAI, 2020. [paper]
- "Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model", ICML, 2019. [paper]
- "Q8BERT: Quantized 8Bit BERT", EMC2 Workshop, 2019. [paper]
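Many of the LLM entries above (e.g. GPTQ, AWQ, SpQR) store weights in a group-wise low-bit integer format. The sketch below shows only that round-to-nearest storage format, not the error-compensating solvers those papers actually propose; it assumes the weight count is divisible by the group size and each group has a nonzero weight:

```python
import numpy as np

def quantize_weights_groupwise(w, num_bits=4, group_size=128):
    # Symmetric group-wise quantization: one float scale per group of weights,
    # integers in [-(2^(b-1)), 2^(b-1) - 1].
    # Assumes w.size % group_size == 0 and each group contains a nonzero weight.
    qmax = 2 ** (num_bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_groupwise(q, scale, shape):
    # Reconstruct the float weights for use in a standard matmul.
    return (q.astype(np.float32) * scale).reshape(shape)
```

Smaller groups give finer scales (better accuracy) at the cost of more scale storage, which is why group sizes like 64 or 128 are a common trade-off in weight-only LLM quantization.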