GPTQ-for-LLaMa

4-bit quantization of LLaMa using GPTQ

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This code is based on GPTQ.
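To make the Bits and group-size columns in the tables below concrete, here is a minimal, illustrative sketch of group-wise round-to-nearest (RTN) quantization, the baseline that GPTQ improves on by additionally compensating rounding error with second-order (Hessian-based) updates. This is not this repository's API; the function name and error measure are purely illustrative.

# Illustrative sketch only: group-wise round-to-nearest (RTN) quantization.
# GPTQ goes further by correcting the rounding error of each column using
# second-order information; this snippet only shows what wbits/groupsize mean.
import torch

def rtn_quantize(weight: torch.Tensor, wbits: int = 4, groupsize: int = 64) -> torch.Tensor:
    """Quantize each row of `weight` in contiguous column groups of `groupsize`."""
    qmax = 2 ** wbits - 1                                # e.g. 15 quantization levels for 4 bits
    out = torch.empty_like(weight)
    for start in range(0, weight.shape[1], groupsize):
        group = weight[:, start:start + groupsize]
        wmin = group.min(dim=1, keepdim=True).values
        wmax = group.max(dim=1, keepdim=True).values
        scale = (wmax - wmin).clamp(min=1e-8) / qmax     # one scale/zero-point per group
        q = torch.clamp(torch.round((group - wmin) / scale), 0, qmax)
        out[:, start:start + groupsize] = q * scale + wmin   # dequantized weights
    return out

# Smaller groups (e.g. 64) track the weight distribution more closely than a single
# scale per row, which is why the group-size 64 rows below show lower perplexity.
w = torch.randn(8, 256)
print((w - rtn_quantize(w, wbits=4, groupsize=64)).abs().mean())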

Result

| Model (LLaMa-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | - | 6.79 | 10.67 | 8.28 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | - | 20.86 | 37.54 | 22.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |

| Model (LLaMa-13B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.08 | 8.06 | 6.58 |
| RTN | 4 | - | 5.52 | 8.62 | 6.96 |
| GPTQ | 4 | - | 5.35 | 8.40 | 6.82 |
| GPTQ | 4 | 64 | 5.18 | 8.18 | 6.66 |
| RTN | 3 | - | 11.41 | 21.21 | 13.20 |
| GPTQ | 3 | - | 6.80 | 10.45 | 8.31 |
| GPTQ | 3 | 64 | 5.50 | 8.60 | 7.00 |

Quantizing a model requires a large amount of CPU memory. For example, quantizing LLaMa-13B requires about 42 GB, and LLaMa-33B requires more than 64 GB.

Depending on the GPU and driver, there may be a difference in performance; this gap decreases as the model size increases (see IST-DASLab/gptq#1).

According to the GPTQ paper, the difference in perplexity between FP16 and GPTQ also decreases as the model size increases.

Dependencies

All experiments were run on a single NVIDIA RTX3090.
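
As a hedged sketch of the environment the commands below assume: the scripts use PyTorch, Hugging Face Transformers (for the LLaMa model and tokenizer), and Datasets (for calibration/evaluation data such as c4 and wikitext2). Exact versions are not pinned here; something along these lines should work:

# Assumed minimal environment; package versions are not pinned by this README
pip install torch transformers datasets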

Language Generation

LLaMa

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64

To run other LLaMa models, replace llama-7b-hf with one of llama-13b-hf, llama-30b-hf, or llama-65b-hf, as in the example below.
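
For example, quantizing the 13B model with the same options as above (only the model name changes):

# Run GPTQ on LLaMa-13B and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-13b-hf c4 --wbits 4 --groupsize 64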

ZeroShot

See the zeroShot/ folder.

CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 4-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"

The CUDA kernels support 2, 3, 4, and 8 bits. In general, 4-bit quantization is recommended.

The CUDA kernel does not support group size.
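
The same --wbits / --save / --load flags shown above apply at the other supported bit-widths. For instance, a 3-bit variant of the save-and-benchmark workflow might look like this (llama7b-3bit.pt is just an illustrative file name):

# Save a 3-bit compressed model (no --groupsize, since the CUDA kernel does not support it)
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --save llama7b-3bit.pt
# Benchmark generation with the saved 3-bit model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --load llama7b-3bit.pt --benchmark 2048 --check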

Memory Usage

| Model | Bits | Memory (MiB) | Benchmark (ppl) | Wikitext2 | PTB | C4 | Checkpoint size (GB) |
|---|---|---|---|---|---|---|---|
| LLaMa-7B with FP16 | 16 | 13940 | 5.23 | 5.67 | 8.79 | 7.05 | 12.5 |
| LLaMa-13B with FP16 | 16 | OOM | - | 5.08 | 8.06 | 6.58 | 24.2 |
| LLaMa-7B with GPTQ | 8 | 7748 | 5.39 | 5.67 | 8.81 | 7.08 | 6.5 |
| LLaMa-13B with GPTQ | 8 | 14570 | 5.00 | 5.09 | 8.06 | 6.61 | 12.4 |
| LLaMa-7B with GPTQ | 4 | 4740 | 6.23 | 6.79 | 10.67 | 8.28 | 3.5 |
| LLaMa-13B with GPTQ | 4 | 8410 | 5.14 | 5.35 | 8.40 | 6.82 | 6.5 |
| LLaMa-7B with GPTQ | 3 | 3852 | 11.43 | 17.94 | 31.44 | 19.65 | 2.75 |
| LLaMa-13B with GPTQ | 3 | 6870 | 5.58 | 6.77 | 10.29 | 8.34 | 5.06 |
| LLaMa-7B with GPTQ | 2 | 3076 | 4152 | 30749 | 45936 | 5045 | 2.0 |
| LLaMa-13B with GPTQ | 2 | 5275 | 6903 | 13203 | 1384 | 8.34 | 5.06 |
| LLaMa-33B with GPTQ | 4 | 19499 | 4.59 | - | - | - | 15.7 |

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMa, a powerful LLM.
