[QUANT] Add GPTQModel Dynamic Quantization + lm_head #3790
Quantization
Conversation
@merrymercy @Ying1123 @zhyncs Ready for code review. Please trigger the SGLang CI so we can be sure there are no regressions. The code/logic follows our vLLM code for the same core feature, which has already been released.

Thanks so much. After removing the dependency, we shall merge it.

Rebased. Waiting for the CI tests to run. I am still confused about which dependency I should remove. Right now, the code is compatible with vllm 0.7.2. I have already removed the dependencies that require vllm 0.7.3.

@Qubitium there are still some conflicts. I will ask @yizhang2077 to help review "how to remove the vllm dependency".
@zhaochenyang20 @yizhang2077 I already addressed the perceived increase in vLLM dependency in this reply: There is no increase in vLLM dependency. Zero. Actually, it is a net negative, i.e. less vLLM dependency, since we moved a few GPTQ config classes inside SGLang. At first glance at the gptq.py changes, you might think the imports below are extra dependencies:

```python
import logging
from fractions import Fraction
from typing import Any, Dict, List, Optional, Union

import torch
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.gptq_marlin import (
    GPTQMarlinLinearMethod,
    GPTQMarlinMoEMethod,
)
from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
from vllm.model_executor.layers.quantization.utils.marlin_utils import (
    check_marlin_supported,
)
from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types
```

But all of the above are already implicit dependencies whenever SGLang runs any GPTQ-related kernel, since SGLang calls vLLM to load its GPTQ kernels. What I mean is that they are imported 100% of the time if you use GPTQ quantization. Here, I am just importing them all explicitly. In execution, this PR actually reduces the vLLM dependency rather than increasing it. Before this PR: A (SGLang) calls B (vLLM), which imports X, Y, Z. This PR: A calls X, Y, Z directly. We removed the dependency on B because B requires vLLM 0.7.3. Or to be even more clear: now A (SGLang) calls B (SGLang), which imports X, Y, Z (vLLM). The X, Y, Z imports are now visible in SGLang, minus the B vLLM dependency.

Edit: These imports cannot be easily removed. You can remove them from SGLang, but they will always be imported, so in that sense, did we actually remove the imports or just hide them under a rug? The only way to remove them is to also move quant …
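To make the dependency argument concrete, here is a conceptual sketch of the import shape before and after. The arrows in the comments are illustrative, not the actual SGLang file layout; the two imports themselves are real vLLM modules taken from the list above.

```python
# Before this PR (conceptual):
#   SGLang gptq code  ->  vLLM GPTQ config classes ("B", requires vLLM 0.7.3)
#                           ->  GPTQLinearMethod, MarlinLinearMethod, ... ("X, Y, Z")
#
# After this PR (conceptual): the GPTQ config classes live inside SGLang, and the
# kernel entry points are imported directly, so the 0.7.3-only layer ("B") is gone:
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
```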
The latest CI error appears unrelated to our changes: KeyError: 'debug_tensor_dump_output_folder'. Fix in #4046.

Also, thanks so much for the PR. You are MY HERO. Too big a PR to review and to do 😂 @Qubitium

@Qubitium I will merge it after the CI. Thanks!

The merge/conflict with …

@zhaochenyang20 Previously failed tests are now passing! The remaining 2 failed tests appear unrelated to this PR.

@zhaochenyang20 Can you check? All the CI tests are good. The only failing test appears unrelated. I don't want to do more rebases against master because the branch is clean, and by the time you review the CI, master will be out of sync again (but cleanly mergeable), so there should be no issues. If there is any conflict, I will merge, but I want to avoid merging since there is sooo much merge activity. lol

@Qubitium will merge it right now

@Qubitium merged!
Motivation
Per-module GPTQ quantization control/override + allow loading of quantized `lm_head`:

Before PR:
- All `modules` share the same `GPTQConfig` (bits, group_size, etc)
- Quantized `lm_head` cannot be loaded

This PR:
- Each `module` can have a unique `GPTQConfig` (bits, group_size, etc)
- `modules` can be optionally skipped entirely for quantization load based on `dynamic` override
- `lm_head` can be loaded

`Dynamic` Logic:
- `prefix`: aka the module's full path name, such as `mlp.down_proj`.
- `negative` rule match: the module is skipped and loaded as a normal, non-GPTQ-quantized module.
- `positive` rule match: the GPTQ config for this module is overridden by the key/values dictated in the match rule. For example, you can optionally override `bits` and `group_size` per module.
- `None` or no match: nothing happens; the base `GPTQConfig` is used.

Notes:
- `dynamic` config sample/doc: https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override (an illustrative sketch follows this list)
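For reference, a minimal sketch of what the `dynamic` rules might look like, written as a Python dict; the same structure would be serialized under the `dynamic` key of `quantization_config.json`. The regex strings, the `+:`/`-:` prefix convention, and the module paths are illustrative, following the GPTQModel doc linked above; consult that doc for the exact format.

```python
# Illustrative `dynamic` rules: regex keys are matched against the module `prefix`,
# values are per-module GPTQConfig overrides. The "+:" / "-:" rule prefixes are an
# assumption based on the GPTQModel doc linked above ("+:" or none = positive, "-:" = negative).
dynamic = {
    # positive match: quantize all mlp.down_proj modules with 8 bits / group_size 64
    r"+:.*\.mlp\.down_proj.*": {"bits": 8, "group_size": 64},
    # negative match: skip GPTQ for layer 0's attention projections; they are loaded
    # as normal, non-quantized modules
    r"-:.*\.layers\.0\.self_attn\..*": {},
}
```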
Modifications
- `dynamic` property added to `quantization_config.json` in GPTQ models; it contains regex rules paired with overriding values in dict format. The actual code for this feature is in vLLM (0.7.3). The dynamic override logic code is directly copied from vLLM and was also written by us.
- `prefix` (module weight path) must be passed down in the loading logic so the `dynamic` override can take effect before GPTQ linear layer creation. The `dynamic` override rules use the `prefix` (weight path/name) value for matching, and if a match is found, the base `GPTQConfig` properties are overridden (see the sketch after this list).
- `lm_head.linear_method` property changed to `lm_head.quant_method` to allow correct GPTQ linear layer loading.
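As a rough illustration of that matching step, here is a minimal sketch. The function name, the `+:`/`-:` rule-prefix convention, and the return contract are assumptions for illustration only, not the actual SGLang/vLLM API.

```python
import re
from typing import Dict, Optional, Union

Override = Dict[str, Union[int, bool]]


def get_dynamic_override(
    dynamic: Dict[str, Override],
    prefix: str,  # full module path, e.g. "model.layers.3.mlp.down_proj"
) -> Optional[Union[Override, bool]]:
    """Hypothetical helper: resolve the dynamic rule for one module prefix."""
    for rule, overrides in dynamic.items():
        if rule.startswith("-:"):
            # negative rule: skip GPTQ for this module entirely
            if re.match(rule[2:], prefix):
                return False
        elif re.match(rule.removeprefix("+:"), prefix):
            # positive rule: per-module override of base GPTQConfig fields
            return overrides
    return None  # no match: base GPTQConfig applies unchanged


# Usage sketch: the loader would call this before creating the GPTQ linear layer.
# override = get_dynamic_override(dynamic, prefix="model.layers.3.mlp.down_proj")
#   False        -> load as a regular, unquantized linear module
#   dict         -> apply bits/group_size overrides to a copy of the base GPTQConfig
#   None         -> use the base GPTQConfig as-is
```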
TODO:
- `dynamic` GPTQ model unit tests

Checklist