GPTQ-for-LLaMa

4-bit quantization of LLaMa using GPTQ

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This code is based on GPTQ.
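To make the Bits and group-size columns in the tables below concrete, here is a minimal, illustrative sketch of group-wise round-to-nearest (RTN) quantization, the baseline that GPTQ improves on by additionally compensating rounding error with second-order (Hessian-based) updates. This is not this repository's API; the function name and error measure are purely illustrative.

# Illustrative sketch only: group-wise round-to-nearest (RTN) quantization.
# GPTQ goes further by correcting the rounding error of each column using
# second-order information; this snippet only shows what wbits/groupsize mean.
import torch

def rtn_quantize(weight: torch.Tensor, wbits: int = 4, groupsize: int = 64) -> torch.Tensor:
    """Quantize each row of `weight` in contiguous column groups of `groupsize`."""
    qmax = 2 ** wbits - 1                                # e.g. 15 quantization levels for 4 bits
    out = torch.empty_like(weight)
    for start in range(0, weight.shape[1], groupsize):
        group = weight[:, start:start + groupsize]
        wmin = group.min(dim=1, keepdim=True).values
        wmax = group.max(dim=1, keepdim=True).values
        scale = (wmax - wmin).clamp(min=1e-8) / qmax     # one scale/zero-point per group
        q = torch.clamp(torch.round((group - wmin) / scale), 0, qmax)
        out[:, start:start + groupsize] = q * scale + wmin   # dequantized weights
    return out

# Smaller groups (e.g. 64) track the weight distribution more closely than a single
# scale per row, which is why the group-size 64 rows below show lower perplexity.
w = torch.randn(8, 256)
print((w - rtn_quantize(w, wbits=4, groupsize=64)).abs().mean())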

Result

| Model (LLaMa-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | - | 6.79 | 10.67 | 8.28 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | - | 20.86 | 37.54 | 22.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |

| Model (LLaMa-13B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.08 | 8.06 | 6.58 |
| RTN | 4 | - | 5.52 | 8.62 | 6.96 |
| GPTQ | 4 | - | 5.35 | 8.40 | 6.82 |
| GPTQ | 4 | 64 | 5.18 | 8.18 | 6.66 |
| RTN | 3 | - | 11.41 | 21.21 | 13.20 |
| GPTQ | 3 | - | 6.80 | 10.45 | 8.31 |
| GPTQ | 3 | 64 | 5.50 | 8.60 | 7.00 |

Quantizing a model requires a large amount of CPU memory. For example, quantizing LLaMa-13B requires about 42 GB, and LLaMa-33B requires more than 64 GB.

Depending on the GPU and driver, there may be a difference in performance; this gap decreases as the model size increases (see IST-DASLab/gptq#1).

According to the GPTQ paper, the difference in perplexity between FP16 and GPTQ also decreases as the model size increases.

Dependencies

All experiments were run on a single NVIDIA RTX3090.
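
As a hedged sketch of the environment the commands below assume: the scripts use PyTorch, Hugging Face Transformers (for the LLaMa model and tokenizer), and Datasets (for calibration/evaluation data such as c4 and wikitext2). Exact versions are not pinned here; something along these lines should work:

# Assumed minimal environment; package versions are not pinned by this README
pip install torch transformers datasets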

Language Generation

LLaMa

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64

To run other LLaMa models, replace llama-7b-hf with one of llama-13b-hf, llama-30b-hf, or llama-65b-hf, as in the example below.
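
For example, quantizing the 13B model with the same options as above (only the model name changes):

# Run GPTQ on LLaMa-13B and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-13b-hf c4 --wbits 4 --groupsize 64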

ZeroShot

See the zeroShot/ folder.

CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 4-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 2048 --check

# Model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"

The CUDA kernels support 2, 3, 4, and 8 bits. In general, 4-bit quantization is recommended.

The CUDA kernel does not support group size.
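
The same --wbits / --save / --load flags shown above apply at the other supported bit-widths. For instance, a 3-bit variant of the save-and-benchmark workflow might look like this (llama7b-3bit.pt is just an illustrative file name):

# Save a 3-bit compressed model (no --groupsize, since the CUDA kernel does not support it)
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --save llama7b-3bit.pt
# Benchmark generation with the saved 3-bit model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 3 --load llama7b-3bit.pt --benchmark 2048 --check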

Memory Usage

| Model | Bits | Memory (MiB) | Benchmark (ppl) | Wikitext2 | PTB | C4 | Checkpoint size (GB) |
|---|---|---|---|---|---|---|---|
| LLaMa-7B with FP16 | 16 | 13940 | 5.23 | 5.67 | 8.79 | 7.05 | 12.5 |
| LLaMa-13B with FP16 | 16 | OOM | - | 5.08 | 8.06 | 6.58 | 24.2 |
| LLaMa-7B with GPTQ | 8 | 7748 | 5.39 | 5.67 | 8.81 | 7.08 | 6.5 |
| LLaMa-13B with GPTQ | 8 | 14570 | 5.00 | 5.09 | 8.06 | 6.61 | 12.4 |
| LLaMa-7B with GPTQ | 4 | 4740 | 6.23 | 6.79 | 10.67 | 8.28 | 3.5 |
| LLaMa-13B with GPTQ | 4 | 8410 | 5.14 | 5.35 | 8.40 | 6.82 | 6.5 |
| LLaMa-7B with GPTQ | 3 | 3852 | 11.43 | 17.94 | 31.44 | 19.65 | 2.75 |
| LLaMa-13B with GPTQ | 3 | 6870 | 5.58 | 6.77 | 10.29 | 8.34 | 5.06 |
| LLaMa-7B with GPTQ | 2 | 3076 | 4152 | 30749 | 45936 | 5045 | 2.0 |
| LLaMa-13B with GPTQ | 2 | 5275 | 6903 | 13203 | 1384 | 8.34 | 5.06 |
| LLaMa-33B with GPTQ | 4 | 19499 | 4.59 | - | - | - | 15.7 |

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMa, a powerful LLM.
