[QUANT] Add GPTQModel Dynamic Quantization + lm_head #3790
Quantization
Conversation
@merrymercy @Ying1123 @zhyncs Ready for code review. Please trigger the SGLang CI so we can be sure there are no regressions. The code/logic follows our vLLM code for the same core feature, which has already been released.

Thanks so much. After removing the dependency, we shall merge it.

Rebased. Waiting for the CI tests to run. I am still confused about which dependency I should remove. Right now, the code is compatible with vllm 0.7.2. I have already removed the dependencies that require vllm 0.7.3.

@Qubitium there are still some conflicts. I will ask @yizhang2077 to help review "how to remove the vllm dependency".
@zhaochenyang20 @yizhang2077 I already addressed the perceived increase in vLLM dependency in this reply: There is no increase in vLLM dependency. Zero. Actually, it is a net negative, i.e. less vLLM dependency, since we moved a few GPTQ config classes inside SGLang. At first glance at the gptq.py changes, you might think the imports below are extra dependencies:

```python
import logging
from fractions import Fraction
from typing import Any, Dict, List, Optional, Union

import torch
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.gptq_marlin import (
    GPTQMarlinLinearMethod,
    GPTQMarlinMoEMethod,
)
from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
from vllm.model_executor.layers.quantization.utils.marlin_utils import (
    check_marlin_supported,
)
from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types
```

But all of the above are already implicit dependencies whenever SGLang runs any GPTQ-related kernel, since SGLang calls vLLM to load its GPTQ kernels. What I mean is that they are imported 100% of the time if you use GPTQ quantization. Here, I am just importing them all explicitly. In execution, this PR actually reduces the vLLM dependency rather than increasing it. Before this PR: A (SGLang) calls B (vLLM), which imports X, Y, Z. This PR: A calls X, Y, Z directly. We removed the dependency on B because B requires vLLM 0.7.3. Or to be even more clear: now A (SGLang) calls B (SGLang), which imports X, Y, Z (vLLM). The X, Y, Z imports are now visible in SGLang, minus the B vLLM dependency.

Edit: These imports cannot be easily removed. You can remove them from SGLang, but they will always be imported, so in that sense, did we actually remove the imports or just hide them under a rug? The only way to remove them is to also move quant …
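To make the dependency argument concrete, here is a conceptual sketch of the import shape before and after. The arrows in the comments are illustrative, not the actual SGLang file layout; the two imports themselves are real vLLM modules taken from the list above.

```python
# Before this PR (conceptual):
#   SGLang gptq code  ->  vLLM GPTQ config classes ("B", requires vLLM 0.7.3)
#                           ->  GPTQLinearMethod, MarlinLinearMethod, ... ("X, Y, Z")
#
# After this PR (conceptual): the GPTQ config classes live inside SGLang, and the
# kernel entry points are imported directly, so the 0.7.3-only layer ("B") is gone:
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
```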
The latest CI error appears unrelated to our changes: KeyError: 'debug_tensor_dump_output_folder'. Fix in #4046.

Also, thanks so much for the PR. You are MY HERO. Too big a PR to review and to do 😂 @Qubitium

@Qubitium I will merge it after the CI. Thanks!

The merge/conflict with …

@zhaochenyang20 Previously failed tests are now passing! The remaining 2 failed tests appear unrelated to this PR.

@zhaochenyang20 Can you check? All the CI tests are good. The only failing test appears unrelated. I don't want to do more rebases against master because the branch is clean, and by the time you review the CI, master will be out of sync again (but cleanly mergeable), so there should be no issues. If there is any conflict, I will merge, but I want to avoid merging since there is sooo much merge activity. lol

@Qubitium will merge it right now

@Qubitium merged!
Motivation
Per-module GPTQ quantization control/override + allow loading of quantized `lm_head`:

Before PR:
- All `modules` share the same `GPTQConfig` (bits, group_size, etc)
- Quantized `lm_head` cannot be loaded

This PR:
- Each `module` can have a unique `GPTQConfig` (bits, group_size, etc)
- `modules` can be optionally skipped entirely for quantization load based on `dynamic` override
- `lm_head` can be loaded

`Dynamic` Logic:
- `prefix`: aka the module's full path name, such as `mlp.down_proj`.
- `negative` rule match: the module is skipped and loaded as a normal, non-GPTQ-quantized module.
- `positive` rule match: the GPTQ config for this module is overridden by the key/values dictated in the match rule. For example, you can optionally override `bits` and `group_size` per module.
- `None` or no match: nothing happens; the base `GPTQConfig` is used.

Notes:
- `dynamic` config sample/doc: https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override (an illustrative sketch follows this list)
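For reference, a minimal sketch of what the `dynamic` rules might look like, written as a Python dict; the same structure would be serialized under the `dynamic` key of `quantization_config.json`. The regex strings, the `+:`/`-:` prefix convention, and the module paths are illustrative, following the GPTQModel doc linked above; consult that doc for the exact format.

```python
# Illustrative `dynamic` rules: regex keys are matched against the module `prefix`,
# values are per-module GPTQConfig overrides. The "+:" / "-:" rule prefixes are an
# assumption based on the GPTQModel doc linked above ("+:" or none = positive, "-:" = negative).
dynamic = {
    # positive match: quantize all mlp.down_proj modules with 8 bits / group_size 64
    r"+:.*\.mlp\.down_proj.*": {"bits": 8, "group_size": 64},
    # negative match: skip GPTQ for layer 0's attention projections; they are loaded
    # as normal, non-quantized modules
    r"-:.*\.layers\.0\.self_attn\..*": {},
}
```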
Modifications
- `dynamic` property added to `quantization_config.json` in GPTQ models; it contains regex rules paired with overriding values in dict format. The actual code for this feature is in vLLM (0.7.3). The dynamic override logic code is directly copied from vLLM and was also written by us.
- `prefix` (module weight path) must be passed down in the loading logic so the `dynamic` override can take effect before GPTQ linear layer creation. The `dynamic` override rules use the `prefix` (weight path/name) value for matching, and if a match is found, the base `GPTQConfig` properties are overridden (see the sketch after this list).
- `lm_head.linear_method` property changed to `lm_head.quant_method` to allow correct GPTQ linear layer loading.
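As a rough illustration of that matching step, here is a minimal sketch. The function name, the `+:`/`-:` rule-prefix convention, and the return contract are assumptions for illustration only, not the actual SGLang/vLLM API.

```python
import re
from typing import Dict, Optional, Union

Override = Dict[str, Union[int, bool]]


def get_dynamic_override(
    dynamic: Dict[str, Override],
    prefix: str,  # full module path, e.g. "model.layers.3.mlp.down_proj"
) -> Optional[Union[Override, bool]]:
    """Hypothetical helper: resolve the dynamic rule for one module prefix."""
    for rule, overrides in dynamic.items():
        if rule.startswith("-:"):
            # negative rule: skip GPTQ for this module entirely
            if re.match(rule[2:], prefix):
                return False
        elif re.match(rule.removeprefix("+:"), prefix):
            # positive rule: per-module override of base GPTQConfig fields
            return overrides
    return None  # no match: base GPTQConfig applies unchanged


# Usage sketch: the loader would call this before creating the GPTQ linear layer.
# override = get_dynamic_override(dynamic, prefix="model.layers.3.mlp.down_proj")
#   False        -> load as a regular, unquantized linear module
#   dict         -> apply bits/group_size overrides to a copy of the base GPTQConfig
#   None         -> use the base GPTQConfig as-is
```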
TODO:
- `dynamic` GPTQ model unit tests

Checklist