[ Misc ] Support Act Order in Compressed Tensors #6358

Open · wants to merge 147 commits into base: main
Conversation

@robertgshaw2-neuralmagic (Collaborator) commented on Jul 12, 2024

Summary

Add support for compressed-tensors models that have been quantized with activation ordering (group-wise quantization applied in decreasing order of activation magnitude); a short sketch of the resulting g_idx convention follows the list below.

  • add actorder argument to CompressedTensorsWNA16
  • add weight_g_idx layer parameter
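
For illustration, here is a minimal, hypothetical sketch (not code from this PR) of the g_idx convention that activation ordering introduces: each entry records which quantization group an input column belongs to after columns are reordered by activation magnitude.

    import torch

    input_size = 8
    group_size = 4

    # Stand-in for the activation-magnitude ordering; in practice the
    # permutation comes from calibration statistics.
    perm = torch.randperm(input_size)

    # g_idx[col] = quantization group that column `col` was assigned to.
    g_idx = torch.empty(input_size, dtype=torch.int32)
    g_idx[perm] = (torch.arange(input_size) // group_size).to(torch.int32)

    # Without act-order this would be the contiguous [0, 0, 0, 0, 1, 1, 1, 1].
    print(g_idx)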

Evaluation

Accuracy

Full Precision

vllm (pretrained=Qwen/Qwen2-0.5B-Instruct,add_bos_token=True), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.384|±  |0.0308|
|     |       |strict-match    |     5|exact_match|↑  |0.384|±  |0.0308|
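
For reference, tables in this form are standard lm-eval output; a command along these lines (reconstructed from the run header above, so the exact flags are an assumption) produces them:

    lm_eval --model vllm \
        --model_args pretrained=Qwen/Qwen2-0.5B-Instruct,add_bos_token=True \
        --tasks gsm8k \
        --num_fewshot 5 \
        --limit 250 \
        --batch_size auto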

Group Quantization Only (kylesayrs/gwen_group)

vllm (pretrained=kylesayrs/gwen_group,add_bos_token=True), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.216|±  |0.0261|
|     |       |strict-match    |     5|exact_match|↑  |0.196|±  |0.0252|

Activation Ordering (ksayers/gwen_actorder)

vllm (pretrained=ksayers/gwen_actorder,add_bos_token=True), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.248|±  |0.0274|
|     |       |strict-match    |     5|exact_match|↑  |0.248|±  |0.0274|

Latency Regression

Namespace(model='/home/ksayers/llm-compressor/gwen_actorder/', speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=32, output_len=128, batch_size=32, n=1, use_beam_search=False, num_iters_warmup=10, num_iters=30, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False, kv_cache_dtype='auto', quantization_param_path=None, profile=False, profile_result_dir=None, device='auto', block_size=16, enable_chunked_prefill=False, enable_prefix_caching=False, use_v2_block_manager=False, ray_workers_use_nsight=False, download_dir=None, output_json=None, gpu_memory_utilization=0.9, load_format='auto', distributed_executor_backend=None, otlp_traces_endpoint=None)
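
This argument dump matches vLLM's benchmarks/benchmark_latency.py; a command along these lines (reconstructed from the Namespace above, so treat the flag spelling as an assumption) reproduces the setup:

    python benchmarks/benchmark_latency.py \
        --model /home/ksayers/llm-compressor/gwen_actorder/ \
        --input-len 32 \
        --output-len 128 \
        --batch-size 32 \
        --num-iters-warmup 10 \
        --num-iters 30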

Group Quantization Only

Avg latency: 0.8884373404396076 seconds
10% percentile latency: 0.8715801022946834 seconds
25% percentile latency: 0.8739993472117931 seconds
50% percentile latency: 0.876951577141881 seconds
75% percentile latency: 0.8830150356516242 seconds
90% percentile latency: 0.9393035409972071 seconds
99% percentile latency: 0.9404808702412992 seconds

Activation Ordering

Avg latency: 0.9159474782645702 seconds
10% percentile latency: 0.9001966264098883 seconds
25% percentile latency: 0.9010569080710411 seconds
50% percentile latency: 0.9041027296334505 seconds
75% percentile latency: 0.9064613012596965 seconds
90% percentile latency: 0.9662564094178379 seconds
99% percentile latency: 0.9761117453686893 seconds

@mgoin (Member) left a comment:

Thanks for the changes, LGTM with a model smoke test.

@alexm-neuralmagic (Collaborator):

LGTM

Comment on lines 130 to 133:

    # G_IDX (for activation reordering)
    g_idx = BasevLLMParameter(data=torch.empty(input_size_per_partition,
                                               dtype=torch.int32),
                              weight_loader=weight_loader)
@mgoin (Member):

Is it okay to make this parameter in every case? What about older checkpoints that don't have this parameter?

(Contributor):

gptq_marlin_gemm supports passing an empty tensor for g_idx; I'd prefer that, or a nullptr, to avoid excess memory usage.

@mgoin (Member):

I think my question was worded weirdly, sorry. I am just concerned about the weight loader trying to find this parameter in the checkpoint and it not being present.

(Contributor):

I regression tested using neuralmagic/TinyLlama-1.1B-Chat-v1.0-marlin without issue.

@dsikka (Contributor) commented on Aug 29, 2024:

I'd just update to only create the parameter if self.actorder is True.

(Contributor):

This is because the g_idx passed to the kernel is conditional on the actorder flag in the config:
https://github.com/vllm-project/vllm/pull/6358/files#diff-df5f822218e5ac1430f35a806bc9cebd78c99cfe1e6738de89ae3e9f5a1fdbecR162

(Contributor):

If self.actorder is True, it'll use the created parameter. Otherwise, it'll create an empty one. So I don't think you need to initialize it here if self.actorder is False.

@mgoin (Member):

Yeah, that seems to be the case from this else-case later, so there is no need to make the parameter:

            layer.weight_g_idx = marlin_make_empty_g_idx(device)
            layer.g_idx_sort_indices = marlin_make_empty_g_idx(device)
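
To make the suggestion concrete, here is a minimal sketch of the conditional creation being discussed (assumed placement inside create_weights; the names follow the snippet above, and the register_parameter call is an assumption, not the merged diff):

    # Only allocate a loadable g_idx when activation ordering is enabled;
    # otherwise process_weights_after_loading falls back to the empty
    # g_idx from marlin_make_empty_g_idx, as in the else-case quoted above.
    if self.actorder:
        g_idx = BasevLLMParameter(data=torch.empty(input_size_per_partition,
                                                   dtype=torch.int32),
                                  weight_loader=weight_loader)
        layer.register_parameter("weight_g_idx", g_idx)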

@kylesayrs (Contributor):

Do not merge; the tensor parallel bug needs to be fixed.

@kylesayrs (Contributor):

False alarm on the tensor parallelism bug. Regression testing was performed with TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T, with and without activation ordering, at tensor_parallel_size=2.

@kylesayrs (Contributor):

Moving to draft while support for static_grouping actorder is added.

@kylesayrs (Contributor):

Actually, I'll make a separate PR to address the static_grouping feature.

@@ -119,14 +127,21 @@ def create_weights(self, layer: torch.nn.Module, input_size: int,
                                              dtype=torch.int64),
                                    weight_loader=weight_loader)

            # G_IDX (for activation reordering)

(Contributor):

Can you add a test case for this?

  1. tests/quantization/test_compressed_tensors.py
  2. add a model to models.txt under tests/weight_loading


@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Aug 29, 2024.
@mgoin (Member) left a comment:

LGTM once the remaining issues are addressed.

@kylesayrs (Contributor) commented on Sep 1, 2024:

New requirements have been added to act-order to support different strategies, such as weight-only ordering and group ordering. See neuralmagic/compressed-tensors#146.

I've made a PR that I'd like to merge into this branch; it conditions activation ordering on g_idx directly rather than relying on the config. See neuralmagic#405.
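
In other words, roughly (a hypothetical sketch of the idea, not the code in neuralmagic#405):

    import torch

    def has_actorder(weight_g_idx: torch.Tensor) -> bool:
        # Hypothetical helper: an empty or monotonically non-decreasing
        # g_idx implies the columns were never reordered, so activation
        # ordering can be inferred from the tensor itself rather than
        # from the config.
        return (weight_g_idx.numel() > 0
                and bool(torch.any(weight_g_idx[:-1] > weight_g_idx[1:])))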

@kylesayrs (Contributor):

Moved to #8135.

@simon-mo requested a review from youkaichao as a code owner on November 26, 2024.

mergify bot commented on Nov 26, 2024:

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

The mergify bot added the needs-rebase label on Nov 26, 2024.
Labels: needs-rebase, ready