[ Misc ] Support Act Order in Compressed Tensors #6358
base: main
Conversation
Thanks for the changes, LGTM with a model smoke test.
LGTM
# G_IDX (for activation reordering)
g_idx = BasevLLMParameter(data=torch.empty(input_size_per_partition,
                                           dtype=torch.int32),
                          weight_loader=weight_loader)
Is it okay to make this parameter in every case? What about older checkpoints that don't have this parameter?
gptq_marlin_gemm supports passing an empty tensor for g_idx; I'd prefer that, or a nullptr, to avoid excess memory usage.
I think my question was worded weirdly, sorry. I am just concerned about the weight loader trying to find this parameter in the checkpoint, and it not being present.
I regression tested using neuralmagic/TinyLlama-1.1B-Chat-v1.0-marlin without issue.
I'd just update to only create the parameter if self.actorder is True.
This is because the g_idx passed to the kernel is conditional on the actorder flag in the config:
https://github.com/vllm-project/vllm/pull/6358/files#diff-df5f822218e5ac1430f35a806bc9cebd78c99cfe1e6738de89ae3e9f5a1fdbecR162
If self.actorder is True, it'll use the created parameter. Otherwise, it'll create an empty one. So I don't think you need to initialize it here if self.actorder is False.
Yeah, that seems to be the case from this else-case later, so there's no need to make the parameter:
layer.weight_g_idx = marlin_make_empty_g_idx(device)
layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)
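A rough sketch of the conditional handling being discussed above, assuming vLLM's marlin utilities (marlin_sort_g_idx, marlin_make_empty_g_idx) and a simplified stand-in for CompressedTensorsWNA16; this is illustrative only, not the PR's exact diff:

```python
import torch

from vllm.model_executor.layers.quantization.utils.marlin_utils import (
    marlin_make_empty_g_idx, marlin_sort_g_idx)
from vllm.model_executor.parameter import BasevLLMParameter


class WNA16ActOrderSketch:
    """Illustrative stand-in for the relevant parts of CompressedTensorsWNA16."""

    def __init__(self, actorder: bool):
        self.actorder = actorder

    def create_weights(self, layer, input_size_per_partition, weight_loader):
        if self.actorder:
            # Only allocate and register g_idx when activation reordering is
            # enabled, so older checkpoints without this tensor load unchanged.
            g_idx = BasevLLMParameter(data=torch.empty(input_size_per_partition,
                                                       dtype=torch.int32),
                                      weight_loader=weight_loader)
            layer.register_parameter("weight_g_idx", g_idx)

    def process_weights_after_loading(self, layer, device):
        if self.actorder:
            # Sort g_idx once so the kernel can gather activations group-wise.
            g_idx, sort_indices = marlin_sort_g_idx(layer.weight_g_idx)
            layer.weight_g_idx = torch.nn.Parameter(g_idx, requires_grad=False)
            layer.g_idx_sort_indices = torch.nn.Parameter(sort_indices,
                                                          requires_grad=False)
        else:
            # gptq_marlin_gemm accepts empty tensors here, avoiding the extra
            # allocation raised as a concern above.
            layer.weight_g_idx = marlin_make_empty_g_idx(device)
            layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)
```

Either way the kernel receives a valid (possibly empty) g_idx, which is the behavior the reviewers converge on above.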
Do not merge, tensor parallel bug needs to be fixed.
False alarm on the tensor parallelism bug. Regression testing was performed with …
Moving to draft while support for static_grouping actorder is added.
Actually, I'll make a separate PR to address the static_grouping feature.
@@ -119,14 +127,21 @@ def create_weights(self, layer: torch.nn.Module, input_size: int,
                                            dtype=torch.int64),
                                 weight_loader=weight_loader)

        # G_IDX (for activation reordering)
Can you add a test case for this?
- tests/quantization/test_compressed_tensors.py
- add a model to models.txt under tests/weight_loading
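A hypothetical shape for the requested test, assuming the vllm_runner fixture used by the existing quantization tests and the actorder checkpoint named in this PR's evaluation; the test name and prompt are made up:

```python
import pytest


@pytest.mark.parametrize("model", ["ksayers/gwen_actorder"])
def test_compressed_tensors_actorder(vllm_runner, model):
    # One greedy generation is enough to exercise g_idx loading and the
    # activation-ordering path through the marlin kernel.
    with vllm_runner(model) as llm:
        outputs = llm.generate_greedy(["Hello, my name is"], max_tokens=8)
    assert outputs
```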
LGTM once the remaining issues are addressed
New requirements have been added to act-order to support different strategies. I've made a PR I'd like to merge into this branch which conditions activation ordering on g_idx directly rather than relying on the config. See neuralmagic#405
Moved to #8135
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
Add support for compressed-tensors models which have been quantized using activation ordering (group-wise quantization in decreasing order of activation).
- actorder argument to CompressedTensorsWNA16
- weight_g_idx layer parameter
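For reference, a minimal way to exercise one of the checkpoints listed under Evaluation below (a usage sketch, not part of the PR; the compressed-tensors scheme is assumed to be detected from the checkpoint config):

```python
from vllm import LLM, SamplingParams

# ksayers/gwen_actorder is the activation-ordering checkpoint evaluated below.
llm = LLM(model="ksayers/gwen_actorder")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
print(outputs[0].outputs[0].text)
```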
Evaluation
Accuracy
- Full Precision
- Group Quantization Only (ksayers/gwen_group)
- Activation Ordering (ksayers/gwen_actorder)
Latency Regression
- Group Quantization Only
- Activation Ordering