[QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization #3790

Merged
55 commits
dab77cf
Changed VocabParallelEmbedding.linear_method to quant_method to be co…
ZX-ModelCloud Feb 22, 2025
9c097bd
call param.packed_factor instead of param.pack_factor
ZX-ModelCloud Feb 22, 2025
4e26757
add monkey_patch_vllm_get_linear_quant_method()
ZX-ModelCloud Feb 22, 2025
c52612c
pass prefix argument
ZX-ModelCloud Feb 22, 2025
84630d8
fix gptq_marlin error
ZX-ModelCloud Feb 22, 2025
7ad7159
cleanup
ZX-ModelCloud Feb 22, 2025
c870c5f
add prefix
ZX-ModelCloud Feb 22, 2025
7f3ffa0
add prefix
ZX-ModelCloud Feb 22, 2025
29a0e2a
use clearer api name and re-order args
Qubitium Feb 22, 2025
a143440
format
Qubitium Feb 22, 2025
82461e5
move import to top
Qubitium Feb 22, 2025
e726f1b
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 22, 2025
0d5a66d
reduce vllm depend: move dynamic config extraction method to sglang
Qubitium Feb 22, 2025
29518ba
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 22, 2025
cedd221
add unittest
ZX-ModelCloud Feb 25, 2025
3f64919
update unittest
ZX-ModelCloud Feb 25, 2025
cd06ba8
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 25, 2025
0085065
code format
ZX-ModelCloud Feb 25, 2025
7542b80
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 27, 2025
9c73721
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 28, 2025
5f72987
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Feb 28, 2025
a3b5811
add gptqmodel tests to run_suite.py
Qubitium Mar 1, 2025
498bb0f
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 1, 2025
bd26863
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 1, 2025
21842dc
Update quantization.md
Qubitium Mar 1, 2025
b9b7f9b
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 1, 2025
21cf4e5
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 2, 2025
3abe3c2
format
Qubitium Mar 2, 2025
bc4a63e
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 2, 2025
caaeaf0
remove vllm depends
Qubitium Mar 3, 2025
cf3ef86
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
b60c657
remove more vllm 0.7.3 specific depend
Qubitium Mar 3, 2025
4aa0c5a
Merge branch 'compat_gptqmodel_dynamic' of https://github.com/ZX-Mode…
Qubitium Mar 3, 2025
585f65a
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
eb3f6b3
all prefix code use add_prefix
Qubitium Mar 3, 2025
131e055
Merge branch 'compat_gptqmodel_dynamic' of https://github.com/ZX-Mode…
Qubitium Mar 3, 2025
fdd4ff3
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
d29fe7f
format
Qubitium Mar 3, 2025
95be5bb
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
ee8bbd5
simplify
Qubitium Mar 3, 2025
d31410e
assert output
ZX-ModelCloud Mar 3, 2025
f00b8de
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
ff5f364
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
adf7df3
fix ci
Qubitium Mar 3, 2025
d1d9eb7
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 3, 2025
4101ce9
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
ea4952e
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
a4c269f
try to fix circular imports from vllm
Qubitium Mar 4, 2025
1dd58c5
try (2): fix circular imports
Qubitium Mar 4, 2025
91c09cc
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
218e12b
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
937ca01
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
c2bba8d
format
Qubitium Mar 4, 2025
cef4e20
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 4, 2025
97f3ebc
Merge branch 'main' into compat_gptqmodel_dynamic
Qubitium Mar 5, 2025
remove more vllm 0.7.3 specific depend
Qubitium committed Mar 3, 2025
commit b60c65789a0162c27c928d028299d095a9eff5a1
17 changes: 9 additions & 8 deletions python/sglang/srt/layers/quantization/gptq.py
@@ -9,7 +9,6 @@
     GPTQMarlinMoEMethod,
 )
 from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
-from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Config
 from vllm.model_executor.layers.quantization.utils.marlin_utils import (
     check_marlin_supported,
 )
@@ -276,13 +275,15 @@ def get_quant_method(
         from sglang.srt.layers.quantization import get_linear_quant_method

         if isinstance(layer, FusedMoE):
-            if layer.num_experts > 32:
-                # For MoEs with many experts the moe_wna16 kernel is faster
-                return MoeWNA16Config.from_config(self.full_config).get_quant_method(
-                    layer, prefix
-                )
-            else:
-                return GPTQMarlinMoEMethod(self)
+            return GPTQMarlinMoEMethod(self)
+            # TODO: re-enable after SGLang syncs with vllm >= 0.7.3
+            # if layer.num_experts > 32:
+            # # For MoEs with many experts the moe_wna16 kernel is faster
+            # return MoeWNA16Config.from_config(self.full_config).get_quant_method(
+            # layer, prefix
+            # )
+            # else:
+            # return GPTQMarlinMoEMethod(self)
         return get_linear_quant_method(self, layer, prefix, GPTQMarlinLinearMethod)

     @classmethod
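For readers skimming the hunk, the dispatch it leaves behind can be sketched in isolation. This is a minimal illustration, not the SGLang implementation: the `FusedMoE` and `Linear` classes are stand-ins, `select_quant_method` and the `moe_wna16_available` flag are hypothetical names, and the function returns plain strings instead of method objects. Only the threshold of 32 experts and the three branch outcomes are taken from the diff.

```python
from dataclasses import dataclass


@dataclass
class FusedMoE:
    """Stand-in for the real fused MoE layer; only num_experts matters here."""
    num_experts: int


@dataclass
class Linear:
    """Stand-in for any non-MoE layer handled by the linear path."""
    name: str = "linear"


# Threshold taken from the diff: above this, moe_wna16 was the preferred kernel.
MANY_EXPERTS_THRESHOLD = 32


def select_quant_method(layer, moe_wna16_available=False):
    """Return a label for the quant method the dispatch would choose.

    moe_wna16_available is a hypothetical switch modelling the TODO in the
    diff: False mirrors the state after this commit, True mirrors the branch
    that was commented out.
    """
    if isinstance(layer, FusedMoE):
        if moe_wna16_available and layer.num_experts > MANY_EXPERTS_THRESHOLD:
            return "moe_wna16"
        return "gptq_marlin_moe"
    # Every non-MoE layer falls through to the linear quant method lookup.
    return "gptq_marlin_linear"


if __name__ == "__main__":
    print(select_quant_method(FusedMoE(num_experts=64)))
    print(select_quant_method(FusedMoE(num_experts=64), moe_wna16_available=True))
    print(select_quant_method(Linear()))
```

With the switch left at its default, every MoE layer takes the Marlin MoE path regardless of expert count, which is the behaviour this commit introduces until the commented-out branch is restored.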