[ Misc ] Support Act Order in Compressed Tensors #6358

Open

wants to merge 147 commits into base: main

Changes from all commits (147 commits):
1dfc42d  added (Jun 26, 2024)
aa4a9f5  nits (Jun 26, 2024)
27f9a03  cleanup (Jun 26, 2024)
de7a064  stash (Jun 27, 2024)
ec6a833  refactor gptq marlin (robertgshaw2-neuralmagic, Jul 3, 2024)
966f7be  back out w4a16 act-order compressed tensors (robertgshaw2-neuralmagic, Jul 3, 2024)
d391f44  back out w4a16 act-order compressed tensors (robertgshaw2-neuralmagic, Jul 3, 2024)
db075c3  missed (robertgshaw2-neuralmagic, Jul 3, 2024)
695dc05  formatted' (robertgshaw2-neuralmagic, Jul 3, 2024)
75c8a11  fix models without gidx (robertgshaw2-neuralmagic, Jul 3, 2024)
525cf08  format (robertgshaw2-neuralmagic, Jul 3, 2024)
81f028e  fix test failure (robertgshaw2-neuralmagic, Jul 3, 2024)
a8fbe89  fix perms not being on gpu (robertgshaw2-neuralmagic, Jul 3, 2024)
cc843ad  stash (robertgshaw2-neuralmagic, Jul 3, 2024)
b260c90  stage (robertgshaw2-neuralmagic, Jul 3, 2024)
c8e97b1  updated (robertgshaw2-neuralmagic, Jul 3, 2024)
e58063d  nit (robertgshaw2-neuralmagic, Jul 3, 2024)
383e471  added (robertgshaw2-neuralmagic, Jul 3, 2024)
865b743  format (robertgshaw2-neuralmagic, Jul 3, 2024)
9c24525  newline (robertgshaw2-neuralmagic, Jul 3, 2024)
8b5ac5a  formatting (robertgshaw2-neuralmagic, Jul 3, 2024)
0e46e4b  working (robertgshaw2-neuralmagic, Jul 3, 2024)
a47a251  added compressed tensors fp8 to automation (robertgshaw2-neuralmagic, Jul 3, 2024)
c6be536  missed file (robertgshaw2-neuralmagic, Jul 3, 2024)
0441171  format (robertgshaw2-neuralmagic, Jul 3, 2024)
d404f00  remove unnecessary file changes (robertgshaw2-neuralmagic, Jul 3, 2024)
6569323  restructure quant ops (robertgshaw2-neuralmagic, Jul 3, 2024)
aa56475  updated to transpose in process_after_loading (robertgshaw2-neuralmagic, Jul 3, 2024)
d94d07e  updated with varuns suggestion (robertgshaw2-neuralmagic, Jul 3, 2024)
54308d7  fixed nit (robertgshaw2-neuralmagic, Jul 3, 2024)
173b93b  name change (robertgshaw2-neuralmagic, Jul 3, 2024)
afa1ee1  format (robertgshaw2-neuralmagic, Jul 3, 2024)
5ffe0e4  fixed (robertgshaw2-neuralmagic, Jul 3, 2024)
4c0e565  fixed tests (robertgshaw2-neuralmagic, Jul 3, 2024)
ee58d33  Merge branch 'unify-w8a8' into compressed-tensors-fp8 (robertgshaw2-neuralmagic, Jul 3, 2024)
282a038  merge w8a8 unify (robertgshaw2-neuralmagic, Jul 3, 2024)
a0fd035  fix nit (robertgshaw2-neuralmagic, Jul 3, 2024)
ba1116b  nits (robertgshaw2-neuralmagic, Jul 3, 2024)
c1d4375  cleanup (robertgshaw2-neuralmagic, Jul 3, 2024)
a12bfd5  stash (robertgshaw2-neuralmagic, Jul 6, 2024)
6aad8f6  Merge branch 'main' into compressed-tensors-fp8 (robertgshaw2-neuralmagic, Jul 6, 2024)
4fc0177  autofp8 working (robertgshaw2-neuralmagic, Jul 6, 2024)
1d99867  stash (robertgshaw2-neuralmagic, Jul 6, 2024)
ccee126  stash (robertgshaw2-neuralmagic, Jul 6, 2024)
0969c67  format (robertgshaw2-neuralmagic, Jul 6, 2024)
b2eeb84  fix imported marlin_permute_scales (robertgshaw2-neuralmagic, Jul 6, 2024)
9316f92  format (robertgshaw2-neuralmagic, Jul 7, 2024)
4ff23c8  added w8a8 to correctness testing (robertgshaw2-neuralmagic, Jul 7, 2024)
08a8e4e  added testing (robertgshaw2-neuralmagic, Jul 7, 2024)
4238ac9  format (robertgshaw2-neuralmagic, Jul 7, 2024)
d1c7517  merged (robertgshaw2-neuralmagic, Jul 7, 2024)
94d6b35  stash (robertgshaw2-neuralmagic, Jul 7, 2024)
d48ba9d  readded (robertgshaw2-neuralmagic, Jul 7, 2024)
0dd2c6a  remove nm-vllm-env (robertgshaw2-neuralmagic, Jul 7, 2024)
29f40f5  remove old qwen2 moe (robertgshaw2-neuralmagic, Jul 7, 2024)
ad17c88  readded utils (robertgshaw2-neuralmagic, Jul 7, 2024)
fd7d825  format (robertgshaw2-neuralmagic, Jul 7, 2024)
697edfa  Update models-small.txt (robertgshaw2-neuralmagic, Jul 7, 2024)
e30bd57  gptq marlin tests passing (robertgshaw2-neuralmagic, Jul 7, 2024)
382d230  add missing files (robertgshaw2-neuralmagic, Jul 7, 2024)
ba4c7b3  refactoring in progress (robertgshaw2-neuralmagic, Jul 7, 2024)
0916182  Update models-small.txt (robertgshaw2-neuralmagic, Jul 7, 2024)
de0242f  stash (robertgshaw2-neuralmagic, Jul 7, 2024)
9fe4fce  removed lm-eval (robertgshaw2-neuralmagic, Jul 7, 2024)
c044a86  stash (robertgshaw2-neuralmagic, Jul 7, 2024)
a5f0aee  remove run (robertgshaw2-neuralmagic, Jul 7, 2024)
d3299f8  Merge branch 'main' into compressed-tensors-fp8 (robertgshaw2-neuralmagic, Jul 7, 2024)
bcfcd38  added integration test for compressed-tensors-w4-a16 (robertgshaw2-neuralmagic, Jul 7, 2024)
763ab2c  formatting (robertgshaw2-neuralmagic, Jul 7, 2024)
950de45  Merge branch 'compressed-tensors-fp8' into refactor-gptq-marlin (robertgshaw2-neuralmagic, Jul 7, 2024)
eb2fdfa  removed (robertgshaw2-neuralmagic, Jul 7, 2024)
2f49425  Merge branch 'refactor-gptq-marlin' of https://github.com/neuralmagic… (robertgshaw2-neuralmagic, Jul 7, 2024)
93812eb  add comment (robertgshaw2-neuralmagic, Jul 7, 2024)
d4b25cf  Update w8a8_utils.py (robertgshaw2-neuralmagic, Jul 7, 2024)
48b220e  Update w8a8_utils.py (robertgshaw2-neuralmagic, Jul 7, 2024)
f1d8ee4  cleanup unnessary changes (robertgshaw2-neuralmagic, Jul 7, 2024)
cfe27be  Merge branch 'refactor-gptq-marlin' of https://github.com/neuralmagic… (robertgshaw2-neuralmagic, Jul 7, 2024)
72b9368  fix gptq marlin (robertgshaw2-neuralmagic, Jul 7, 2024)
73ae598  formatting (robertgshaw2-neuralmagic, Jul 7, 2024)
f854c54  cleanup (robertgshaw2-neuralmagic, Jul 7, 2024)
13d4e93  Merge branch 'main' into refactor-gptq-marlin (robertgshaw2-neuralmagic, Jul 7, 2024)
4e09688  Update benchmark_marlin.py (robertgshaw2-neuralmagic, Jul 7, 2024)
db694e0  Update compressed_tensors_wNa16.py (robertgshaw2-neuralmagic, Jul 7, 2024)
4b2dba2  Update marlin_utils_test.py (robertgshaw2-neuralmagic, Jul 7, 2024)
9d8d12f  Update test_marlin_gemm.py (robertgshaw2-neuralmagic, Jul 7, 2024)
54cf4f2  format (robertgshaw2-neuralmagic, Jul 7, 2024)
7abc2b1  Merge branch 'refactor-gptq-marlin' of https://github.com/neuralmagic… (robertgshaw2-neuralmagic, Jul 7, 2024)
ed178d4  formatting (robertgshaw2-neuralmagic, Jul 7, 2024)
03b11b2  more formatting (robertgshaw2-neuralmagic, Jul 7, 2024)
e2a5e7a  fix (robertgshaw2-neuralmagic, Jul 7, 2024)
6f62ada  yapf (robertgshaw2-neuralmagic, Jul 7, 2024)
933bec3  fixed failing tests (robertgshaw2-neuralmagic, Jul 8, 2024)
fe6ae88  tweak scores (robertgshaw2-neuralmagic, Jul 8, 2024)
8285ef6  tweak scores (robertgshaw2-neuralmagic, Jul 8, 2024)
fcc8925  stash (robertgshaw2-neuralmagic, Jul 9, 2024)
c0b5d13  format (robertgshaw2-neuralmagic, Jul 9, 2024)
f6910a5  seems to still be working (robertgshaw2-neuralmagic, Jul 9, 2024)
84ed30f  stash (robertgshaw2-neuralmagic, Jul 11, 2024)
62368af  added tests (robertgshaw2-neuralmagic, Jul 11, 2024)
b618961  seems to be working! (robertgshaw2-neuralmagic, Jul 12, 2024)
f2755f2  Update build.sh (robertgshaw2-neuralmagic, Jul 12, 2024)
cd392f5  Merge branch 'main' into act-order (robertgshaw2-neuralmagic, Jul 12, 2024)
b092079  Merge branch 'act-order' of https://github.com/neuralmagic/nm-vllm in… (robertgshaw2-neuralmagic, Jul 12, 2024)
5cbed16  cleanup bad merge (robertgshaw2-neuralmagic, Jul 12, 2024)
054e2db  removed files that should not have been added (robertgshaw2-neuralmagic, Jul 12, 2024)
7e0b0ec  Update run-lm-eval-gsm-vllm-baseline.sh (robertgshaw2-neuralmagic, Jul 12, 2024)
bddf9d3  Update test_compressed_tensors.py (robertgshaw2-neuralmagic, Jul 12, 2024)
ad43c4e  undo (robertgshaw2-neuralmagic, Jul 12, 2024)
0aa9181  undo bad merge (robertgshaw2-neuralmagic, Jul 12, 2024)
777e74b  last undo? (robertgshaw2-neuralmagic, Jul 12, 2024)
77988d3  twas not last (robertgshaw2-neuralmagic, Jul 12, 2024)
39ed988  cleanup (robertgshaw2-neuralmagic, Jul 12, 2024)
7d2fff8  stash (robertgshaw2-neuralmagic, Jul 12, 2024)
2e74b0b  remove more (robertgshaw2-neuralmagic, Jul 12, 2024)
a845475  fix (robertgshaw2-neuralmagic, Jul 12, 2024)
2e7bf61  format (robertgshaw2-neuralmagic, Jul 12, 2024)
18596e2  format (robertgshaw2-neuralmagic, Jul 12, 2024)
48aae94  more cleanup (robertgshaw2-neuralmagic, Jul 12, 2024)
b34ca83  undo changes to gptq marlin (robertgshaw2-neuralmagic, Jul 12, 2024)
881afd7  another nit (robertgshaw2-neuralmagic, Jul 12, 2024)
3cd8b55  another nit (robertgshaw2-neuralmagic, Jul 12, 2024)
02637af  final bad merge? (robertgshaw2-neuralmagic, Jul 12, 2024)
81f41ed  last bad merge? (robertgshaw2-neuralmagic, Jul 12, 2024)
536fdde  cleanup (robertgshaw2-neuralmagic, Jul 12, 2024)
1d10244  stopping point (robertgshaw2-neuralmagic, Jul 12, 2024)
4c96377  stash (robertgshaw2-neuralmagic, Jul 12, 2024)
4ca4a08  updated (robertgshaw2-neuralmagic, Jul 19, 2024)
1080488  Merge branch 'main' into act-order (robertgshaw2-neuralmagic, Jul 22, 2024)
8531380  Merge branch 'act-order' of https://github.com/neuralmagic/nm-vllm in… (robertgshaw2-neuralmagic, Jul 22, 2024)
0ddd524  updated to have a defualt (robertgshaw2-neuralmagic, Jul 22, 2024)
052cc93  switch order of arguments (robertgshaw2-neuralmagic, Jul 22, 2024)
6211660  switch everything to actorder from act_order (robertgshaw2-neuralmagic, Jul 22, 2024)
a0d0251  more cleanup (robertgshaw2-neuralmagic, Jul 22, 2024)
f187922  more name change (robertgshaw2-neuralmagic, Jul 22, 2024)
434b471  merge in main (kylesayrs, Aug 17, 2024)
04ed5d7  reorder for better diff (kylesayrs, Aug 17, 2024)
07ad850  remove doubled variables, fix shape for marlin_permute_scales (kylesayrs, Aug 17, 2024)
d2a923a  merge in main (kylesayrs, Aug 17, 2024)
22de619  merge in main (kylesayrs, Aug 17, 2024)
fb8ffb2  use BasevLLMParameter (kylesayrs, Aug 17, 2024)
3bb7294  apply style (kylesayrs, Aug 17, 2024)
0e396fc  documentation (kylesayrs, Aug 17, 2024)
14495ba  use layer.group_size (kylesayrs, Aug 18, 2024)
2f46596  add warning (kylesayrs, Aug 29, 2024)
ef08596  Merge remote-tracking branch 'upstream/main' into act-order (kylesayrs, Aug 30, 2024)
22e579e  Merge remote-tracking branch 'upstream/main' into act-order (kylesayrs, Sep 1, 2024)
cc2c9ab  Group Index Conditioning (#405) (kylesayrs, Sep 3, 2024)
14 changes: 14 additions & 0 deletions tests/quantization/test_compressed_tensors.py
@@ -159,5 +159,19 @@ def test_compressed_tensors_fp8(vllm_runner):
 def test_compressed_tensors_kv_cache(vllm_runner):
     model_path = "nm-testing/TinyLlama-1.1B-compressed-tensors-kv-cache-scheme"
     with vllm_runner(model_path, kv_cache_dtype="fp8") as llm:
         output = llm.generate_greedy("Hello world!", max_tokens=20)
         assert output
+
+
+def test_compressed_tensors_actorder_weight(vllm_runner):
+    model_path = "kylesayrs/TinyLlama-1.1B-Chat-v1.0-actorder-weight-e2e"
+    with vllm_runner(model_path) as llm:
+        output = llm.generate_greedy("Hello world!", max_tokens=20)
+        assert output
+
+
+def test_compressed_tensors_actorder_group(vllm_runner):
+    model_path = "kylesayrs/TinyLlama-1.1B-Chat-v1.0-actorder-group-e2e"
+    with vllm_runner(model_path) as llm:
+        output = llm.generate_greedy("Hello world!", max_tokens=20)
+        assert output
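For context, here is a minimal standalone sketch of what these two new tests exercise, written against vLLM's public LLM API rather than the vllm_runner test fixture. The model name is one of the test checkpoints above; temperature 0 approximates generate_greedy.

    # Standalone sketch (not part of the diff): load an act-order
    # compressed-tensors checkpoint and run greedy generation.
    from vllm import LLM, SamplingParams

    llm = LLM(model="kylesayrs/TinyLlama-1.1B-Chat-v1.0-actorder-group-e2e")
    greedy = SamplingParams(temperature=0.0, max_tokens=20)  # greedy decoding
    outputs = llm.generate(["Hello world!"], greedy)
    print(outputs[0].outputs[0].text)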
2 changes: 2 additions & 0 deletions tests/weight_loading/models.txt
@@ -15,6 +15,8 @@
 compressed-tensors, nm-testing/Phi-3-mini-128k-instruct-FP8, main
 compressed-tensors, neuralmagic/Phi-3-medium-128k-instruct-quantized.w4a16, main
 compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized, main
 compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantized, main
+compressed-tensors, kylesayrs/TinyLlama-1.1B-Chat-v1.0-actorder-weight-e2e, main
+compressed-tensors, kylesayrs/TinyLlama-1.1B-Chat-v1.0-actorder-group-e2e, main
 awq, casperhansen/mixtral-instruct-awq, main
 awq_marlin, casperhansen/mixtral-instruct-awq, main
 fp8, neuralmagic/Meta-Llama-3-8B-Instruct-FP8-KV, main
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py
@@ -3,12 +3,13 @@
 import torch

 from vllm import _custom_ops as ops
+from vllm.logger import init_logger
 from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
     CompressedTensorsScheme)
 from vllm.model_executor.layers.quantization.utils.marlin_utils import (
     apply_gptq_marlin_linear, marlin_make_empty_g_idx, marlin_make_workspace,
-    marlin_permute_scales, replace_tensor, verify_marlin_supported,
-    verify_marlin_supports_shape)
+    marlin_permute_scales, marlin_sort_g_idx, replace_tensor,
+    verify_marlin_supported, verify_marlin_supports_shape)
 from vllm.model_executor.parameter import (BasevLLMParameter,
                                            ChannelQuantScaleParameter,
                                            GroupQuantScaleParameter,
@@ -22,6 +23,8 @@
 }
 WNA16_SUPPORTED_BITS = list(WNA16_SUPPORTED_TYPES_MAP.keys())

+logger = init_logger(__name__)
+

 class CompressedTensorsWNA16(CompressedTensorsScheme):

@@ -119,9 +122,15 @@ def create_weights(self, layer: torch.nn.Module, input_size: int,
                 dtype=torch.int64),
             weight_loader=weight_loader)

+        # group index (for activation reordering)
+        weight_g_idx = BasevLLMParameter(data=torch.full(
+            (input_size_per_partition, ), -1, dtype=torch.int32),
+                                         weight_loader=weight_loader)
+
         layer.register_parameter("weight_packed", weight)
         layer.register_parameter("weight_scale", weight_scale)
         layer.register_parameter("weight_shape", weight_shape)
+        layer.register_parameter("weight_g_idx", weight_g_idx)

         layer.input_size_per_partition = input_size_per_partition
         layer.output_size_per_partition = output_size_per_partition
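A note on the sentinel above: valid group indices are non-negative, so initializing weight_g_idx to all -1 lets process_weights_after_loading (next hunk) detect whether the checkpoint actually supplied a group index. A small illustration with a made-up K of 8:

    import torch

    # K = 8 channels for illustration; the real code sizes this tensor
    # with input_size_per_partition.
    weight_g_idx = torch.full((8, ), -1, dtype=torch.int32)
    assert -1 in weight_g_idx  # nothing loaded yet: no act-order

    # Simulate the weight loader copying a checkpoint's g_idx in place.
    weight_g_idx.copy_(torch.tensor([0, 0, 1, 1, 2, 2, 3, 3],
                                    dtype=torch.int32))
    assert -1 not in weight_g_idx  # checkpoint carried a g_idx: act-order path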
@@ -137,9 +146,15 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         layer.workspace = marlin_make_workspace(
             layer.output_size_per_partition, device)

-        # Act-order not supported in compressed-tensors yet, so set to empty.
-        layer.g_idx = marlin_make_empty_g_idx(device)
-        layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)
+        # Handle sorting for activation reordering if needed.
+        has_g_idx = -1 not in layer.weight_g_idx
+        if has_g_idx:
+            g_idx, g_idx_sort_indices = marlin_sort_g_idx(layer.weight_g_idx)
+            layer.g_idx_sort_indices = g_idx_sort_indices
+            replace_tensor(layer, "weight_g_idx", g_idx)
+        else:
+            layer.weight_g_idx = marlin_make_empty_g_idx(device)
+            layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)

         # No zero-point
         layer.weight_zp = marlin_make_empty_g_idx(device)
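marlin_sort_g_idx itself is not shown in this diff. Based on how it is used here, it sorts the per-channel group indices so the Marlin kernel can walk groups contiguously, returning the sorted g_idx together with the channel permutation. A sketch of that behavior (an assumption about the helper, not its verbatim source):

    import torch

    def sort_g_idx(g_idx: torch.Tensor):
        """Sort group indices; return (sorted g_idx, channel permutation)."""
        perm = torch.argsort(g_idx)
        return g_idx[perm], perm.to(torch.int)

    g_idx = torch.tensor([1, 0, 2, 0, 1], dtype=torch.int32)
    sorted_g_idx, perm = sort_g_idx(g_idx)
    # sorted_g_idx == [0, 0, 1, 1, 2]; perm is one valid permutation,
    # e.g. [1, 3, 0, 4, 2] (order within ties is not guaranteed).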
@@ -161,7 +176,8 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         # Permute scales from compressed-tensors format to marlin format.
         marlin_scales = marlin_permute_scales(
             layer.weight_scale,
-            size_k=layer.input_size_per_partition,
+            size_k=(layer.input_size
+                    if has_g_idx else layer.input_size_per_partition),
             size_n=layer.output_size_per_partition,
             group_size=layer.group_size)
         replace_tensor(layer, "weight_scale", marlin_scales)
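The size_k switch is the subtle part of this hunk: with a group index present, the scales are repeated on every rank (see marlin_repeat_scales_on_all_ranks below), so the scale tensor spans the full K dimension rather than this rank's shard. A toy calculation with assumed sizes:

    # Assumed sizes: K = 4096 split across 2 ranks, group_size = 128.
    input_size, input_size_per_partition, group_size = 4096, 2048, 128

    for has_g_idx in (True, False):
        size_k = input_size if has_g_idx else input_size_per_partition
        print(has_g_idx, size_k // group_size)  # True: 32 groups; False: 16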
@@ -174,7 +190,7 @@ def apply_weights(self, layer: torch.nn.Module, x: torch.Tensor,
             weight=layer.weight_packed,
             weight_scale=layer.weight_scale,
             weight_zp=layer.weight_zp,
-            g_idx=layer.g_idx,
+            g_idx=layer.weight_g_idx,
             g_idx_sort_indices=layer.g_idx_sort_indices,
             workspace=layer.workspace,
             wtype=self.quant_type,
vllm/model_executor/layers/quantization/compressed_tensors/utils.py
@@ -40,6 +40,19 @@ class QuantizationStrategy(str, Enum):
     TOKEN = "token"


+class ActivationOrdering(str, Enum):
+    """
+    Enum storing strategies for activation ordering
+
+    Group: reorder groups and weight\n
+    Weight: only reorder weight, not groups. Slightly lower latency and
+    accuracy compared to group actorder\n
+    """
+
+    GROUP = "group"
+    WEIGHT = "weight"
+
+
 class QuantizationArgs(BaseModel):
     """
     User facing arguments used to define a quantization config
@@ -58,6 +71,8 @@
         observed with every sample. Defaults to False for static
         quantization. Note that enabling dynamic quantization
         will change the default observer to a memoryless one
+    :param actorder: whether to apply group quantization in decreasing order of
+        activation. Defaults to None for arbitrary ordering
     """

     num_bits: int = 8
@@ -67,6 +82,7 @@
     strategy: Optional[QuantizationStrategy] = None
     block_structure: Optional[str] = None
     dynamic: bool = False
+    actorder: Optional[ActivationOrdering] = None
     observer: str = Field(
         default="minmax",
         description=("The class to use to compute the quantization param - "
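Because ActivationOrdering subclasses both str and Enum, the pydantic field accepts either the enum member or its raw string value. A hedged construction example using only the fields visible in these hunks:

    # Sketch: parsing a quantization config fragment with act-order enabled.
    args = QuantizationArgs(num_bits=4, strategy="group", actorder="group")
    assert args.actorder is ActivationOrdering.GROUP

    # Omitting actorder keeps the default: arbitrary (unreordered) groups.
    no_reorder = QuantizationArgs(num_bits=4, strategy="group")
    assert no_reorder.actorder is None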
10 changes: 5 additions & 5 deletions vllm/model_executor/layers/quantization/utils/marlin_utils.py
@@ -129,16 +129,16 @@ def marlin_make_workspace(output_size_per_partition: int,
                        requires_grad=False)


-def marlin_is_k_full(act_order: bool, is_row_parallel: bool) -> bool:
-    return (not act_order) or (act_order and not is_row_parallel)
+def marlin_is_k_full(has_g_idx: bool, is_row_parallel: bool) -> bool:
+    return (not has_g_idx) or (not is_row_parallel)


-def marlin_repeat_scales_on_all_ranks(act_order: bool, group_size: int,
+def marlin_repeat_scales_on_all_ranks(has_g_idx: bool, group_size: int,
                                       is_row_parallel: bool) -> bool:
-    # Need to repeat scales on every rank if act_ordering or
+    # Need to repeat scales on every rank if actorder or
     # channelwise and RowParallelLinear
     is_channelwise = group_size == -1
-    return act_order or (is_channelwise and is_row_parallel)
+    return has_g_idx or (is_channelwise and is_row_parallel)


 def marlin_make_empty_g_idx(device: torch.device) -> torch.Tensor:
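Beyond the rename, the marlin_is_k_full hunk also simplifies the boolean: (not a) or (a and not b) is logically equivalent to (not a) or (not b). A quick exhaustive check of the equivalence:

    from itertools import product

    # Verify the simplification over all four input combinations.
    for has_g_idx, is_row_parallel in product((False, True), repeat=2):
        old = (not has_g_idx) or (has_g_idx and not is_row_parallel)
        new = (not has_g_idx) or (not is_row_parallel)
        assert old == new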