
[QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization #3790

Merged

Conversation

Qubitium
Contributor

@Qubitium Qubitium commented Feb 22, 2025

Motivation

Per-module GPTQ quantization control/override + allow loading of a quantized lm_head:

Before PR:

  • All GPTQ modules share the same GPTQConfig (bits, group_size, etc)
  • Cannot load quantized lm_head

This PR:

  • Every module can have a unique GPTQConfig (bits, group_size, etc.)
  • Specific modules can optionally be skipped entirely for quantized loading based on a dynamic override.
  • A quantized lm_head can be loaded

Dynamic Logic:

  • Rules are matched against the prefix, i.e. the module's full path name, such as mlp.down_proj.
  • If a negative rule matches, the module is skipped and loaded as a normal, non-GPTQ-quantized module.
  • If a positive rule matches, the GPTQ config for this module is overridden by the key/value pairs dictated in the matching rule. For example, you can optionally override bits or group_size per module.
  • If dynamic is None or no rule matches, nothing happens and the base GPTQConfig is used. (A sketch of this matching flow follows the list.)
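
The following is a minimal sketch of that matching flow, assuming regex rules keyed by module path and a "-:" prefix marking negative rules; the function name and exact rule syntax here are illustrative assumptions, not the actual vLLM/SGLang API:

import re
from typing import Any, Dict, Optional

def resolve_dynamic(
    dynamic: Optional[Dict[str, Dict[str, Any]]],
    prefix: str,
    base_config: Dict[str, Any],
) -> Optional[Dict[str, Any]]:
    """Return the effective config for a module prefix (e.g. "model.layers.0.mlp.down_proj"),
    or None if the module should be loaded as a normal, non-GPTQ module."""
    if not dynamic:
        return base_config  # no rules at all: base GPTQConfig applies
    for pattern, overrides in dynamic.items():
        negative = pattern.startswith("-:")
        regex = pattern[2:] if negative else pattern
        if re.search(regex, prefix):
            if negative:
                return None  # negative match: skip the quantized load for this module
            return {**base_config, **overrides}  # positive match: override bits, group_size, ...
    return base_config  # no match: base GPTQConfig applies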

Notes:

  • Depends on GPTQ models quantized by GPTQModel
  • Backward compatible with all existing non-dynamic GPTQ models.

dynamic config sample/doc: https://github.com/ModelCloud/GPTQModel?tab=readme-ov-file#dynamic-quantization-per-module-quantizeconfig-override

Modifications

  • A dynamic property is added to quantization_config.json in GPTQ models; it contains regex rules paired with overriding values in dict format (an illustrative example follows this list). The same feature exists in vLLM (0.7.3); the dynamic override logic code is copied directly from vLLM and was also written by us.

  • The prefix (module weight path) must be passed down through the loading logic so the dynamic override can take effect before GPTQ linear layer creation. The dynamic override rules use the prefix (weight path/name) for matching, and if a match is found, the base GPTQConfig properties are overridden.

  • The lm_head.linear_method property is changed to lm_head.quant_method to allow correct GPTQ linear layer loading.
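
For illustration only, the dynamic field written into quantization_config.json might look roughly like the Python dict below; the regex patterns, the "-:" negative-rule prefix, and the override values are made up here, and the GPTQModel README linked above documents the authoritative format:

# Hypothetical excerpt of a GPTQModel quantization config with dynamic rules.
dynamic = {
    # positive rule: matching modules use 8 bits / group_size 64 instead of the base config
    r".*\.mlp\..*down_proj.*": {"bits": 8, "group_size": 64},
    # negative rule ("-:" prefix, illustrative): matching modules load as normal, non-GPTQ layers
    r"-:model\.layers\.0\..*": {},
}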

TODO:

  • Validate Working State
  • Add dynamic GPTQ model unit tests
  • Clean-up code/structure

Checklist

@Qubitium Qubitium changed the title [WIP] Support GPTQModel Dynamic Quantization + lm_head Quantization Support GPTQModel Dynamic Quantization + lm_head Quantization Feb 25, 2025
@Qubitium Qubitium marked this pull request as ready for review February 25, 2025 05:33
Signed-off-by: ZX-ModelCloud <[email protected]>
@Qubitium Qubitium changed the title Support GPTQModel Dynamic Quantization + lm_head Quantization [QUANT] Add GPTQModel Dynamic Quantization + lm_head Quantization Feb 25, 2025
@Qubitium
Contributor Author

Qubitium commented Feb 25, 2025

@merrymercy @Ying1123 @zhyncs Ready for code review. Please trigger the sglang CI so we can be sure there are no regressions.

The code/logic follows our vLLM code for the same core feature, released in v0.7.3.

  • The biggest change to SGLang is maintaining and passing the prefix value through the loading code. The prefix is the module weight's full path/key name in the weight file. Even though GPTQModel dynamic is currently the only feature using this prefix, we believe it will be widely useful for other quantization frameworks if they choose to add a dynamic feature.
  • For the other part, the override_ method code, we copied our code from vLLM and duplicated it here so there is less dependency on a specific version of vLLM.
  • A unit test was added to cover all three dynamic conditions: negative match, positive match, no match.
  • A unit test was added for lm_head quantization (only models with non-tied embedding/lm_head weights support lm_head quantization).

@zhaochenyang20
Collaborator

Thanks so much. After removing the dependency, we shall merge it.

@Qubitium
Contributor Author

Qubitium commented Mar 3, 2025

Thanks so much. After removing the dependency, we shall merge it.

Rebased. Waiting for CI tests to run. I am still confused about which dependency I should remove. Right now, the code is compatible with vLLM 0.7.2; I have already removed the dependencies which require vLLM 0.7.3.

@zhaochenyang20
Collaborator

@Qubitium There are still some conflicts. I will ask @yizhang2077 to help review "how to remove the vLLM dependency".

@Qubitium
Contributor Author

Qubitium commented Mar 3, 2025

@Qubitium There are still some conflicts. I will ask @yizhang2077 to help review "how to remove the vLLM dependency".

@zhaochenyang20 @yizhang2077 I already addressed the supposed increase in vLLM dependency in this reply:
#3790 (comment)

There is no increase in vLLM dependency. Zero. Actually, it is a net negative, less vLLM dependency, since we moved a few GPTQ config classes inside SGLang. At first glance at the gptq.py changes, you might think the imports below are extra dependencies:

import logging
from fractions import Fraction
from typing import Any, Dict, List, Optional, Union

import torch
from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
from vllm.model_executor.layers.quantization.gptq_marlin import (
    GPTQMarlinLinearMethod,
    GPTQMarlinMoEMethod,
)
from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
from vllm.model_executor.layers.quantization.utils.marlin_utils import (
    check_marlin_supported,
)
from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types

But all of the above are already implicit dependencies whenever SGLang runs any GPTQ-related kernel, since SGLang calls vLLM to load its GPTQ kernels. What I mean is that they are imported 100% of the time if you use GPTQ quantization. Here, I am just importing them all explicitly. Before this PR, the imports were hidden from code view; this PR exposes them.

In execution, this PR actually reduces the vLLM dependency; it does not add to it.

Before this PR:

A (SGLang) calls B (vLLM) which imports X, Y, Z.

This PR:

A (SGLang) calls X, Y, Z directly. The dependency on B is removed because B requires vLLM 0.7.3.

Or to be even more clear:

A (SGLang) calls B (SGLang) which imports X, Y, Z (vLLM).

The X, Y, Z imports are now visible in SGLang, minus the B vLLM dependency.

Edit: These imports cannot easily be removed. You can remove them from SGLang, but they will always be imported, so in that sense, did we actually remove the imports or just hide them under a rug? The only way to truly remove them is to also move the quant linear_method kernels from vLLM into SGLang, but that is another PR and out of scope for this one, which is already large in edits.

@Qubitium
Contributor Author

Qubitium commented Mar 4, 2025

The latest CI error appears unrelated to our changes: KeyError: 'debug_tensor_dump_output_folder'. I have no idea what this variable is within SGLang. Is it injected by CI?


FIX in #4046

@zhaochenyang20
Collaborator

Also, thanks so much for the PR. You are MY HERO. Too big PR to review and to do 😂 @Qubitium

@zhaochenyang20
Collaborator

@Qubitium I will merge it after the CI. thanks!

@Qubitium
Contributor Author

Qubitium commented Mar 4, 2025

@Qubitium I will merge it after the CI. thanks!

The merge conflicts with main are never ending! lol. God, I hope the conflicts stop happening, since we touched every single file. I guess we brought this on ourselves. Crying. lol

@Qubitium
Contributor Author

Qubitium commented Mar 4, 2025

@zhaochenyang20 The previously failed tests are passing! The remaining 2 failed tests appear unrelated to this PR.

@Qubitium
Contributor Author

Qubitium commented Mar 5, 2025

@zhaochenyang20 Can you check? All the CI tests are good; the only failing test appears unrelated. I don't want to do another rebase with master because the branch is clean, and by the time you review the CI, master will be out of sync again (but cleanly mergeable), so there should be no issues. If there is any conflict, I will merge, but I want to avoid merging since there is sooo much merging activity. lol

@zhaochenyang20
Collaborator

@Qubitium will merge it right now

@zhaochenyang20 zhaochenyang20 merged commit 56a724e into sgl-project:main Mar 5, 2025
34 of 35 checks passed
@zhaochenyang20
Collaborator

@Qubitium merged!

@Qubitium Qubitium deleted the compat_gptqmodel_dynamic branch March 5, 2025 09:11