Efficient transformers: A survey.
Y Tay, M Dehghani, D Bahri, D Metzler.
ACM Computing Surveys, 2022.
[Paper]
A survey on efficient training of transformers.
B Zhuang, J Liu, Z Pan, H He, Y Weng, C Shen.
arXiv:2302.01107, 2023.
[Paper]
Full stack optimization of transformer inference: a survey.
S Kim, C Hooper, T Wattanawong, M Kang, R Yan, H Genc, G Dinh, Q Huang, K Keutzer, et al.
arXiv:2302.14017, 2023.
[Paper]
SmoothQuant: Accurate and efficient post-training quantization for large language models.
G Xiao, J Lin, M Seznec, H Wu, J Demouth, S Han.
ICML, 2023.
[Paper]
[Github]
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.
G Xiao, J Tang, J Zuo, J Guo, S Yang, H Tang, Y Fu, S Han.
arXiv:2410.10819, 2024.
[Paper]
[Github]
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference.
HM Chen, W Luk, KFC Yiu, R Li, K Mishchenko, SI Venieris, H Fan.
arXiv:2405.18628, 2024.
[Paper]
[Github]
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving.
Y Lin, H Tang, S Yang, Z Zhang, G Xiao, C Gan, S Han.
arXiv, 2024.
[Paper]
[Github]
Efficient streaming language models with attention sinks.
G Xiao, Y Tian, B Chen, S Han, M Lewis.
ICLR, 2024.
[Paper]
[Github]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
J Lin, J Tang, H Tang, S Yang, WM Chen, WC Wang, G Xiao, X Dang, C Gan, S Han.
MLSys, 2024.
[Paper]
[Github]
EfficientVLM: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning.
T Wang, W Zhou, Y Zeng, X Zhang.
arXiv:2210.07795, 2022.
[Paper]
[Github]
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices.
X Chu, L Qiao, X Lin, S Xu, et al.
arXiv, 2023.
[Paper]
[Github]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model.
X Chu, L Qiao, X Zhang, S Xu, F Wei, Y Yang, et al.
arXiv, 2024.
[Paper]
[Github]
SnapFusion: Text-to-image diffusion model on mobile devices within two seconds.
Y Li, H Wang, Q Jin, J Hu, P Chemerys, Y Fu, Y Wang, S Tulyakov, J Ren.
NeurIPS, 2023.
[Paper]
[Github]
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models.
J Chen, H Cai, J Chen, E Xie, S Yang, H Tang, M Li, Y Lu, S Han.
arXiv:2410.10733, 2024.
[Paper]
[Github]
DeepCache: Accelerating diffusion models for free.
X Ma, G Fang, X Wang.
CVPR, 2024.
[Paper]
[Github]
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models.
M Li, T Cai, J Cao, Q Zhang, H Cai, J Bai, Y Jia, MY Liu, K Li, S Han.
CVPR, 2024.
[Paper]
[Github]
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers.
E Xie, J Chen, J Chen, H Cai, Y Lin, Z Zhang, M Li, Y Lu, S Han.
arXiv:2410.10629, 2024.
[Paper]
[Github]
Tiny Machine Learning: Progress and Futures.
J Lin, L Zhu, WM Chen, WC Wang, et al.
IEEE Circuits and Systems Magazine 23 (3), 8-34, 2023.
[Paper]
Intelligence at the extreme edge: A survey on reformable TinyML.
V Rajapakse, I Karunanayake, N Ahmed.
ACM Computing Surveys, 2023.
[Paper]
PockEngine: Sparse and Efficient Fine-tuning in a Pocket.
L Zhu, L Hu, J Lin, WM Chen, WC Wang, C Gan, S Han.
MICRO, 2023.
[Paper]
On-device training under 256KB memory.
J Lin, L Zhu, WM Chen, WC Wang, et al.
NeurIPS, 2022.
[Paper]
Training Machine Learning Models at the Edge: A Survey.
AR Khouas, MR Bouadjenek, H Hacid, et al.
arXiv, 2024.
[Paper]
MCUNet: Tiny deep learning on IoT devices.
J Lin, WM Chen, Y Lin, C Gan, S Han.
NeurIPS, 2020.
[Paper]
[Github]
MCUNetV2: Memory-efficient patch-based inference for tiny deep learning.
J Lin, WM Chen, H Cai, C Gan, S Han.
arXiv:2110.15352, 2021.
[Paper]
[Github]