forked from ModelTC/llmc

This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

zhiwei-dong/llmc

llmc: Towards Accurate and Efficient LLM Compression

[ English | 中文 | 日本語 ]

llmc is an off-the-shelf tool for compressing LLMs. It leverages state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising performance.

English doc is here.

Chinese doc is here.

Docker Hub is here.

Community:

News

  • Sep 24, 2024: 🔥 We have released the INT4 and INT8 models of Llama-3.1-405B quantized using LLMC. You can download the model parameters here.

  • Sep 23, 2024: 🔥 We now support exporting real quantized models from LLMC to advanced inference backends such as SGLang, AutoAWQ, and MLC-LLM for quantized inference deployment, enabling reduced memory usage and faster inference speeds. For detailed usage, please refer to the SGLang documentation, AutoAWQ documentation, and MLC-LLM documentation.

  • Sep 9, 2024: 🔥 We fix the export of quantized LLMs to vLLM (see here). We also provide some of our best-practice configs for superior performance (see Best Practice here).

  • Sep 3, 2024: 🚀 We support evaluating llmc models with OpenCompass. Follow this doc and give it a try!

  • Aug 22, 2024: 🔥 We support many small language models, including the current SOTA SmolLM (see Supported Model List). We also support downstream task evaluation through our modified lm-evaluation-harness 🤗. Specifically, users can first use the save_trans mode (see the save part in Configuration) to save a weight-modified model. After obtaining the transformed model, they can evaluate the quantized model directly by referring to run_lm_eval.sh. More details can be found here.

  • Jul 23, 2024: 🍺🍺🍺 We release a brand-new version of our benchmark paper:

    LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit.

    Ruihao Gong*, Yang Yong*, Shiqiao Gu*, Yushi Huang*, Chengtao Lv, Yunchen Zhang, Xianglong Liu📧, Dacheng Tao

    (* denotes equal contribution, 📧 denotes corresponding author.)

    Instead of focusing on the best practice, we modularly and fairly benchmark LLM quantization considering calibration data, algorithms, and data formats. With detailed observation and analysis, we provide a variety of novel findings for improving performance and methods under different configurations. With the powerful toolkit LLMC and comprehensive insights, future LLM researchers can efficiently integrate suitable algorithms and low-bit formats for their applications, thereby democratizing the compression of large language models.

  • Jul 16, 2024: 🔥 We now support Wanda and Naive (Magnitude) for LLM sparsification, as well as layer-wise mixed-bit quantization!

  • Jul 14, 2024: 🔥 We now support the rotation-based quantization method QuaRot!

  • Jul 4, 2024: 📱 We have opened our discussion channel. If you have any questions, please join our community:

  • May 17, 2024: 🚀 We now support some advanced large models, e.g., LLaVA, Mixtral, LLaMA V3, and Qwen V2. Give them a try!

  • May 13, 2024: 🍺🍺🍺 We release our quantization benchmark paper:

    LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models.

    Ruihao Gong*, Yang Yong*, Shiqiao Gu*, Yushi Huang*, Yunchen Zhang, Xianglong Liu📧, Dacheng Tao

    (* denotes equal contribution, 📧 denotes corresponding author.)

    We modularly and fairly benchmark quantization techniques considering calibration cost, inference efficiency, and quantized accuracy. Nearly 600 experiments on diverse models and datasets provide three insightful takeaways on calibration data, the algorithm pipeline, and quantization configuration selection. Based on these takeaways, we design a best practice for the LLM PTQ pipeline that achieves the best balance of accuracy and efficiency under various scenarios.

  • Mar 7, 2024: πŸš€ We release the quantization part of a powerful and efficient LLM compression tool. Notably, our benchmark paper is coming soon😊.

Highlighted Features

  • Quantize LLMs such as Llama2-70B and OPT-175B and evaluate their PPL on a single A100/H100/H800 GPU 💥.
  • Choose from SOTA compression algorithms whose implementations are aligned with their original repositories; multiple algorithms can be applied sequentially to one LLM 💥.
  • A transformed model (save_trans mode in the save part of Configuration) exported by our tool with a specific compression algorithm can then be naively quantized by multiple backends, e.g., Lightllm and TensorRT-LLM, yielding a model optimized by that algorithm which the corresponding backend can run 💥.
  • Our compressed model (save_lightllm mode in the save part of Configuration), which has a small memory footprint, can be run directly by Lightllm 💥.

Usage

  1. Clone this repository and install packages:

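    # clone the repository (upstream URL shown; use your fork's URL if needed)
    git clone https://github.com/ModelTC/llmc.git
    # install packages
    cd llmc
    pip install -r requirements.txt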
  2. Prepare models and data.

    # After downloading LLMs from huggingface, prepare calibration and evaluation data as follows:
    cd tools
    python download_calib_dataset.py --save_path [calib data path]
    python download_eval_dataset.py --save_path [eval data path]
  3. Choose an algorithm to quantize your model:

    # Here is an example using AWQ:
    cd scripts
    # Set the path of llmc, ``llmc_path``, in the bash file. To pick a different config, change the
    # ``--config`` argument in run_awq_llama.sh to one of the configs under
    # ``llmc/configs/quantization/Awq/`` or to your own config modeled on those we provide.
    bash run_awq_llama.sh

Configuration

To help users design their own configs, we explain the settings shared by all configs provided under llmc/configs/:

  • model:

    model:
        # Replace with the name of a class in ``llmc/models/*.py``.
        type: Llama
        # Replace with the path of your model.
        path: model path
        torch_dtype: auto
  • calib:

    # Note: some algorithms, such as naive, do not need ``calib``, so you can remove this part.
    calib:
        # Replace with the name of the calibration dataset downloaded earlier, e.g., pileval, c4,
        # wikitext2, or ptb.
        name: pileval
        download: False
        # Replace with the path of that calibration dataset.
        path: calib data path
        n_samples: 128
        bs: -1
        seq_len: 512
        # Replace with the name of a function in ``llmc/data/dataset/specified_preproc.py``.
        preproc: general
        seed: *seed
  • eval:

    # Add this part if you want to evaluate the PPL of your pretrained/transformed/fake_quant model.
    eval:
        # Choose the positions at which to evaluate the model: pretrain, transformed,
        # and/or fake_quant.
        eval_pos: [pretrain, transformed, fake_quant]
        # Replace with the name of the eval dataset downloaded earlier, e.g., c4, wikitext2, ptb,
        # or a list such as [c4, wikitext2].
        name: wikitext2
        download: False
        path: eval data path
        # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
        # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
        bs: 1
        inference_per_block: False
        seq_len: 2048
  • save:

    save:
        # Set ``save_trans`` to True to export the transformed model, e.g., a parameter-modified
        # model whose performance and structure are the same as the original model. Users can
        # apply naive quantization to the transformed model to obtain the same performance as
        # the model quantized by the specific algorithm.
        save_trans: False
        # Set ``save_lightllm`` or ``save_trtllm`` to True to export a real quantized model, e.g.,
        # low-bit weights with weight and activation quantization parameters.
        save_lightllm: False
        # Set ``save_fake`` to True to export the fake_quant model, e.g.,
        # dequantized weights with activation quantization parameters.
        save_fake: False
        save_path: ./save
  • quant:

    quant:
        # Replace with the name of a class in ``llmc/compression/quantization/*.py``.
        method: OmniQuant
        # Weight-only quantization has no ``act`` part.
        weight:
            bit: 8
            symmetric: True
            # Quantization granularity: per_channel, per_tensor, per_head (not recommended).
            granularity: per_channel
            group_size: -1
            # Calibration algorithms: learnable, mse, and minmax (default).
            calib_algo: learnable
            # Use Straight-Through Estimation, which is necessary for learnable
            # calibration algorithms.
            ste: True
        act:
            bit: 8
            symmetric: True
            # Quantization granularity: per_token, per_tensor
            granularity: per_token
            ste: True
            # Static quantization (quantization during calibration) or dynamic
            # quantization (quantization during inference).
            static: True
        # This part is designed for specific algorithms. Users can refer to
        # the configs we provide to design their own.
        special:
            let: True
            lwc_lr: 0.01
            let_lr: 0.005
            use_shift: False
            alpha: 0.5
            deactive_amp: True
            epochs: 20
            wd: 0
        # If quant_out is True, use the outputs of the preceding quantized block as the
        # calibration data of the following block.
        quant_out: True
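
To make the `weight` settings and the real/fake distinction in the `save` part more concrete, here is a minimal pure-Python sketch of symmetric min-max weight quantization. It is not llmc's implementation: the function names are hypothetical, rows are assumed to be output channels, and only the `per_channel` and `per_tensor` granularities are covered. The integer codes plus scales stand in for what a real quantized model stores, while the dequantized floats stand in for a fake_quant weight.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_group(values: List[float], bit: int) -> Tuple[List[int], float]:
    """Symmetric min-max quantization of one group that shares a single scale."""
    qmax = 2 ** (bit - 1) - 1  # largest positive code, e.g. 127 for bit=8
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_weight(
    weight: Matrix, bit: int = 8, granularity: str = "per_channel"
) -> Tuple[List[List[int]], List[float]]:
    """Return integer codes plus one scale per row of `weight`."""
    if granularity == "per_channel":
        # Each row (assumed here to be one output channel) gets its own scale.
        pairs = [quantize_group(row, bit) for row in weight]
        return [codes for codes, _ in pairs], [scale for _, scale in pairs]
    if granularity == "per_tensor":
        # One scale for the whole matrix, repeated per row for a uniform return type.
        flat = [v for row in weight for v in row]
        codes, scale = quantize_group(flat, bit)
        width = len(weight[0])
        rows = [codes[i:i + width] for i in range(0, len(codes), width)]
        return rows, [scale] * len(weight)
    raise ValueError(f"unsupported granularity: {granularity}")


def dequantize(codes: List[List[int]], scales: List[float]) -> Matrix:
    """Map integer codes back to floats; this plays the role of a fake_quant weight."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


def row_errors(a: Matrix, b: Matrix) -> List[float]:
    """Largest absolute reconstruction error in each row."""
    return [max(abs(x - y) for x, y in zip(ra, rb)) for ra, rb in zip(a, b)]


# The second row is much larger than the first, which hurts a shared scale.
weight = [[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]]
for granularity in ("per_channel", "per_tensor"):
    codes, scales = quantize_weight(weight, bit=8, granularity=granularity)
    fake_quant = dequantize(codes, scales)
    print(granularity, "per-row max abs error:", row_errors(weight, fake_quant))
```

With a shared scale, the small-magnitude row is reconstructed less accurately than with one scale per row, which is the usual motivation for finer granularities.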

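The `eval` part above reports PPL (perplexity) at a given `seq_len`. As a rough illustration of what that number measures, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token. It also assumes the evaluation text is cut into non-overlapping windows of `seq_len` tokens, which is a common convention and an assumption here, not something this README states. The names `split_into_windows` and `perplexity` are hypothetical and not part of llmc.

```python
import math
from typing import List


def split_into_windows(token_ids: List[int], seq_len: int) -> List[List[int]]:
    """Cut a token stream into non-overlapping windows of `seq_len` tokens.

    A trailing remainder shorter than `seq_len` is dropped.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    n_windows = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_windows)]


def perplexity(token_log_probs: List[float]) -> float:
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every correct token has PPL 4
# (up to floating-point rounding).
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))
print(split_into_windows(list(range(10)), seq_len=4))
```

Lower PPL means the model assigns higher probability to the reference text, so comparing PPL at the pretrain, transformed, and fake_quant positions shows how much quality each stage costs.
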
Supported Model List

✅ BLOOM

✅ LLaMA

✅ LLaMA V2

✅ StarCoder

✅ OPT

✅ Falcon

✅ InternLM2

✅ Mistral

✅ LLaMA V3

✅ Mixtral

✅ Qwen V2

✅ LLaVA

✅ InternLM2.5

✅ StableLM

✅ Gemma2

✅ Phi2

✅ Phi 1.5

✅ MiniCPM

✅ SmolLM

You can add your own model type by referring to the files under llmc/models/*.py.

Supported Algorithm List

Quantization

✅ Naive

✅ AWQ

✅ GPTQ

✅ SmoothQuant

✅ OS+

✅ OmniQuant

✅ NormTweaking

✅ AdaDim

✅ QUIK

✅ SpQR

✅ DGQ

✅ OWQ

✅ LLM.int8()

✅ HQQ

✅ QuaRot

Pruning

✅ Naive (Magnitude)

✅ Wanda

✅ ShortGPT

Acknowledgments

We develop our code referring to the following repos:

Citation

If you find our LLM-QBench paper/llmc toolkit useful or relevant to your research, please kindly cite our paper:

@misc{llmc,
   author = {llmc contributors},
   title = {llmc: Towards Accurate and Efficient LLM Compression},
   year = {2024},
   publisher = {GitHub},
   journal = {GitHub repository},
   howpublished = {\url{https://github.com/ModelTC/llmc}},
}

@misc{gong2024llmqbench,
      title={LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@misc{gong2024llmcbenchmarkinglargelanguage,
      title={LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Chengtao Lv and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.06001},
}
