`.gitignore`:

```
.DS_Store

data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
*.pyc
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#  For a library or package, you might want to ignore these files since the code is
#  intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#  According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#  However, in case of collaboration, if having platform-specific dependencies or dependencies
#  having no cross-platform support, pipenv may install dependencies that don't work, or not
#  install all needed dependencies.
#Pipfile.lock

# poetry
#  Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#  This is especially recommended for binary packages to ensure reproducibility, and is more
#  commonly ignored for libraries.
#  https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
#  Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
#  pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
#  in version control.
#  https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file. For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

*.pt
**/*.pt
**/*.pyc
*.json
__pycache__
```
`README.md`:
# AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

**Efficient and accurate** low-bit weight quantization (INT3/4) for LLMs, supporting **instruction-tuned** models and **multi-modal** LMs.



The current release supports:

- AWQ search for accurate quantization.
- Pre-computed AWQ model zoo for LLMs (LLaMA, OPT, Vicuna, LLaVA; load to generate quantized weights).
- Memory-efficient 4-bit Linear layers in PyTorch.
- Efficient CUDA kernel implementation for fast inference (supporting both the context and decoding stages).
- Examples of 4-bit inference with an instruction-tuned model (Vicuna) and a multi-modal LM (LLaVA).
## Contents

- [Install](#install)
- [AWQ Model Zoo](#awq-model-zoo)
- [Examples](#examples)
- [Usage](#usage)
- [Reference](#reference)
## Install

1. Clone this repository and navigate to the AWQ folder:
```
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
```

2. Install the package:
```
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install the kernel implementation:
```
cd awq/kernels
python setup.py install
```
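After building the kernels, a quick smoke test can confirm the CUDA extension is importable. A minimal sketch, assuming the extension module built in step 3 is named `awq_inference_engine` (check `awq/kernels/setup.py` in your checkout for the actual name):

```python
# Hypothetical post-install smoke test; the extension module name is assumed.
import torch
import awq_inference_engine  # CUDA extension built by `python setup.py install`

print(torch.__version__, torch.cuda.is_available())  # the kernels require a CUDA GPU
```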
## AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

```bash
# git lfs install  # install git lfs if not already installed
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
```
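Each cached search result is a regular PyTorch checkpoint holding the per-channel scales (and clipping ranges) found by the AWQ search. A minimal sketch of inspecting one (the exact entry names are an assumption and may differ between releases):

```python
import torch

# Load a pre-computed AWQ search result from the cloned model zoo.
awq_results = torch.load("awq_cache/opt-6.7b-w4-g128.pt", map_location="cpu")

# Print the top-level structure; keys are illustrative, not guaranteed.
for key, value in awq_results.items():
    print(key, type(value))
```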
The detailed support list:

| Models | Sizes                       | INT4-g128 | INT3-g128 |
| ------ | --------------------------- | --------- | --------- |
| LLaMA  | 7B/13B/30B/65B              | ✅        | ✅        |
| OPT    | 125m/1.3B/2.7B/6.7B/13B/30B | ✅        | ✅        |
| Vicuna | 7B/13B                      | ✅        |           |
| LLaVA  | 13B                         | ✅        |           |
## Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of applying AWQ under the `./examples` directory: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning). AWQ reduces the GPU memory needed for model serving and speeds up token generation, while its accurate quantization preserves the quality of the models' reasoning outputs. You should be able to observe **memory savings** when running the models with 4-bit weights.

Note that we perform AWQ using only textual calibration data, despite running on multi-modal inputs. Please refer to `./examples` for details.


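As a rough back-of-envelope estimate (our arithmetic, not a measurement from the paper): weight memory scales linearly with bits per weight, so 4-bit weights shrink the weight footprint to about a quarter of FP16, plus a small per-group overhead for scales and zero points:

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GiB (ignores activations and the KV cache)."""
    return n_params * bits_per_weight / 8 / 1024**3

n = 7e9  # e.g. a 7B-parameter model
print(f"FP16: {weight_memory_gib(n, 16):.1f} GiB")  # ~13.0 GiB
print(f"INT4: {weight_memory_gib(n, 4):.1f} GiB")   # ~3.3 GiB
```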
## Usage

We provide several sample scripts to run AWQ (please refer to `./scripts`). We use OPT-6.7B as an example below.

1. Perform the AWQ search and save the search results (we have already done this for you):
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/opt-6.7b-w4-g128.pt
```
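For intuition about what the search produces: AWQ looks for per-channel scales that protect salient weight channels, folding the scale into the weights and out of the activations so the layer's output is unchanged before quantization. A toy illustration of this equivalent transformation (our sketch, not the repo's search code):

```python
import torch

x = torch.randn(256, 512)   # calibration activations (toy data)
w = torch.randn(1024, 512)  # weight of a linear layer

# Scale salient input channels up in the weights (and down in the activations),
# chosen here from average activation magnitude; AWQ searches for the best scaling.
s = x.abs().mean(dim=0).clamp(min=1e-5) ** 0.5
w_scaled, x_scaled = w * s, x / s

# The transformation is mathematically output-preserving:
assert torch.allclose(x @ w.t(), x_scaled @ w_scaled.t(), atol=1e-3)
```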
2. Evaluate the AWQ-quantized model on WikiText-2 (simulated pseudo-quantization):
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/opt-6.7b-w4-g128.pt \
    --q_backend fake
```
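Pseudo ("fake") quantization rounds weights to the low-bit grid and immediately dequantizes them, so accuracy can be evaluated without custom kernels. A minimal group-wise asymmetric sketch of the general idea (not necessarily the repo's exact implementation):

```python
import torch

def fake_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128) -> torch.Tensor:
    """Round weights to an n_bit asymmetric grid per group, then dequantize."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)  # one scale/zero point per group of weights
    w_min, w_max = w.amin(dim=1, keepdim=True), w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / (2**n_bit - 1)
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, 2**n_bit - 1)
    return ((q - zero) * scale).reshape(orig_shape)  # back to floating point

w = torch.randn(4096, 4096)
print((w - fake_quantize(w)).abs().mean())  # small average quantization error
```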
3. Generate real quantized weights (INT4):
```bash
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/opt-6.7b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/opt-6.7b-w4-g128-awq.pt
```
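Real quantization additionally stores the integer weights in packed form, e.g. two 4-bit values per byte, which the CUDA kernels dequantize on the fly. An illustrative packing scheme (the repo's kernels use their own memory layout):

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit integers (values 0..15) into single uint8 bytes."""
    q = q.to(torch.uint8).reshape(-1, 2)
    return q[:, 0] | (q[:, 1] << 4)  # low nibble first, high nibble second

q = torch.randint(0, 16, (8,))
print(q.numel(), "int4 values ->", pack_int4(q).numel(), "bytes")
```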
4. Load and evaluate the real quantized model (you should now see lower GPU memory usage):
```bash
python -m awq.entry --model_path /PATH/TO/OPT/opt-6.7b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/opt-6.7b-w4-g128-awq.pt
```
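To check the savings yourself, you can compare peak allocations around model loading and generation using PyTorch's generic memory utilities (independent of this repo):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... load the FP16 or the INT4-quantized model and run generation here ...
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```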
## Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```
## Related Projects

[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)

[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://arxiv.org/abs/2210.17323)

[Vicuna and FastChat](https://github.com/lm-sys/FastChat#readme)

[LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)