
Error deploying GPTQ models to sagemaker #235

Open
2 of 4 tasks
GlacierPurpleBison opened this issue Feb 11, 2024 · 5 comments
@GlacierPurpleBison

System Info

I have used the following guide to deploy LoRAX to SageMaker. I am able to do so successfully using unquantized models, and have deployed OpenHermes 2.5 without issue. However, when I try the GPTQ version of OpenHermes or Mixtral, I consistently get the following error:

```
File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
  return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
  raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
  return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 79, in Warmup
  max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 726, in warmup
  _, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
  return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 855, in generate_token
  raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 852, in generate_token
  out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 426, in forward
  logits = model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 979, in forward
  hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 922, in forward
  hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 849, in forward
  attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mixtral_modeling.py", line 379, in forward
  qkv = self.query_key_value(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 601, in forward
  result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
  return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 399, in forward
  return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
  out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
  return super().apply(*args, **kwargs)  # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_fwd
  return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
  output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
  matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
  timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
  config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
  return triton.testing.do_bench(
File "/opt/conda/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench
  fn()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 80, in kernel_call
  self.fn.run(
File "/opt/conda/lib/python3.10/site-packages/triton/runtime/jit.py", line 550, in run
  bin.c_wrapper(
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 692, in __getattribute__
  self._init_handles()
File "/opt/conda/lib/python3.10/site-packages/triton/compiler/compiler.py", line 683, in _init_handles
  mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
RuntimeError: Triton Error [CUDA]: device kernel image is invalid
```

I am using the latest LoRAX image and am unable to figure out how to resolve this. Can someone please help figure this out?
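The final frames show Triton failing to load a compiled kernel binary. A `RuntimeError: Triton Error [CUDA]: device kernel image is invalid` typically means the kernel binary was built for a different GPU compute capability than the device it is being loaded onto. A minimal, hypothetical diagnostic (not from this issue; the `parse_compute_cap` helper and the `nvidia-smi` query are assumptions) for checking the host GPU's capability before digging further:

```python
# Hypothetical diagnostic sketch: compare the GPU's compute capability
# (A10G on g5 instances reports 8.6) against what the container's
# PyTorch/Triton build was compiled for.
import subprocess


def parse_compute_cap(smi_output: str) -> tuple[int, int]:
    """Parse output of `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.

    Uses the first GPU's line; e.g. "8.6\n" -> (8, 6).
    """
    major, minor = smi_output.strip().splitlines()[0].split(".")
    return int(major), int(minor)


def gpu_compute_cap() -> tuple[int, int]:
    """Query the first GPU's compute capability via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        text=True,
    )
    return parse_compute_cap(out)


if __name__ == "__main__":
    # On a g5.12xlarge (A10G) this is expected to print (8, 6).
    print(gpu_compute_cap())
```

If the reported capability differs from what the image's kernels target, that would explain why the same image works for some quantization paths (which use different kernels) but not the Triton GPTQ path.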

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Using A10G (g5.12xlarge) for deployment - working with fp16 but not GPTQ
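For reference, the repro outside SageMaker can be sketched as a plain `docker run`. This is an assumption-laden sketch: the model ID, revision, and `--quantize gptq` flag are taken from the launcher log later in this thread, but the exact volume mount and port mapping are illustrative, not confirmed by the reporter.

```shell
# Sketch of a non-SageMaker repro (flags inferred from the launcher log
# in this thread; paths and ports are illustrative).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  ghcr.io/predibase/lorax:latest \
  --model-id TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ \
  --revision gptq-4bit-32g-actorder_True \
  --quantize gptq
```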

Expected behavior

Expected successful deployment, but not working with GPTQ.

@geoffreyangus
Contributor

Hi @GlacierPurpleBison, taking a look. Do you know if this bug is a SageMaker-specific bug, or if this occurs when initializing a vanilla LoRAX docker container as well?

@GlacierPurpleBison
Author

GlacierPurpleBison commented Feb 16, 2024

Hi @geoffreyangus, it isn't working on SageMaker notebooks using Docker directly either. It seems to get stuck waiting for the shard to be ready, and when I run the client I get a connection refused error. When I use teknium's OpenHermes (unquantized) directly, I am able to connect to the client and the output is as expected.

```
Status: Downloaded newer image for ghcr.io/predibase/lorax:latest
2024-02-16T07:04:28.688487Z INFO lorax_launcher: Args { model_id: "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ", adapter_id: None, source: "hub", adapter_source: "hub", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), compile: false, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "69d6ffd6f9c6", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_headers: [], cors_allow_methods: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-02-16T07:04:28.688602Z INFO download: lorax_launcher: Starting download process.
2024-02-16T07:04:33.112129Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.

2024-02-16T07:04:33.693079Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-02-16T07:04:33.693239Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-02-16T07:04:43.702379Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:04:53.710063Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:03.717915Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:13.726304Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:23.734374Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:33.742445Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:43.750895Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:05:51.652610Z INFO lorax_launcher: weights.py:253 Using exllama kernels

2024-02-16T07:05:53.758610Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:03.766768Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:13.775196Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-02-16T07:06:16.909582Z INFO lorax_launcher: weights.py:253 Using exllama kernels
```

Unfortunately, I am not able to verify this outside of SageMaker: I have a Mac, so I can't test it locally for consistency.

I am using the same A10G machine for the SageMaker notebook instance. But you'll notice that I am not getting the CUDA Triton error here.
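Since the client-side symptom here is just "connection refused", it can help to separate "nothing is listening" from "router up, shard stuck" by polling a health endpoint before issuing a generate request. The sketch below is hedged: the `/health` and `/generate` routes and the request shape are assumptions based on common text-generation-inference-style servers, not details confirmed in this thread.

```python
# Hedged sketch: poll an assumed /health route on the mapped port, and build
# an assumed /generate payload. Neither endpoint is confirmed by this thread.
import json
import urllib.request


def build_generate_request(prompt: str, max_new_tokens: int = 64) -> dict:
    """Construct an assumed TGI-style /generate request body."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def check_health(base_url: str = "http://localhost:8080") -> bool:
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused / timeout: the shard never became ready,
        # or nothing is listening on this port at all.
        return False


if __name__ == "__main__":
    print(check_health())
    print(json.dumps(build_generate_request("Hello")))
```

If `/health` never returns 200 while the launcher loops on "Waiting for shard to be ready", the failure is inside shard startup rather than in the client or networking.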

@GlacierPurpleBison
Author

Btw, I am not getting any error when I try to deploy AWQ. I am only getting this with GPTQ.

@GlacierPurpleBison
Author

GlacierPurpleBison commented Feb 19, 2024

Hey @geoffreyangus, were you able to check further on this?

@geoffreyangus
Contributor

Hi @GlacierPurpleBison, I apologize; I took the Presidents' Day weekend off. I'll take a look at it this week, thanks!
