
Problem Inference LLAMA-3.3 70B Instruct #436

Closed
Tortoise17 opened this issue Feb 12, 2025 · 10 comments

Comments

@Tortoise17

I am trying to use LLAMA 3.3 70B, and the log below is printed before I get the text back.


2025-02-12 13:46:44,045 INFO config.py L54: PyTorch version 2.6.0+cu126 available.
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
2025-02-12 13:46:44,477 WARNING qlinear_cuda.py L18: CUDA extension not installed.
2025-02-12 13:46:44,478 WARNING qlinear_cuda_old.py L17: CUDA extension not installed.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:58<00:00,  9.83s/it]
llama-3_3_70B/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
llama-3_3_70B/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:633: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

My question: when I run it on GPU, it uses 40 GB of memory on a single GPU, but inference takes almost 7-9 minutes including model loading. If I try to run it on CPU, a single inference takes hours.

Are these inference times with auto-round[GPU] expected, or is there a mistake somewhere that could be better optimized? Please guide me. Is there any way to improve inference time on CPU, and do you have any advice for CPU-based use of this model? (A rough way to check where the time goes is sketched below.)
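For reference, a hedged timing sketch to separate model-load time from generation time; the checkpoint name, prompt, and token count are placeholders, not the exact setup used above:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"  # assumed checkpoint name

t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
t1 = time.time()

inputs = tokenizer("Shorten the following text: ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
t2 = time.time()

print(f"load: {t1 - t0:.1f}s  generate: {t2 - t1:.1f}s")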

@wenhuach21
Contributor

wenhuach21 commented Feb 12, 2025

1. Is there a warning in the log similar to "exllamav2 kernel is not installed" when you run it on CUDA?
2. Could you attach your prompt and your device type? If the CPU does not support AVX, it will be quite slow (see the sketch below for a quick check).
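For the AVX question, a minimal Linux-only check (not part of auto-round; it just reads /proc/cpuinfo):

# Quick Linux-only check of CPU SIMD flags via /proc/cpuinfo.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in ("avx", "avx2", "avx512f"):
    print(feature, "yes" if feature in flags else "no")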

@wenhuach21
Contributor

On CUDA, try using torch.float16:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,          # path to the quantized checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    ## revision="12cbcc0",        ## AutoGPTQ format
)
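As a follow-up, a hedged sketch of running generation with an explicit attention_mask and sampling enabled, which should also address the pad_token_id / temperature warnings in the log above. The prompt and parameter values are placeholders, and it assumes the model and quantized_model_dir from the snippet above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
inputs = tokenizer("Shorten the following text: ...", return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,                          # passes attention_mask along with input_ids
    max_new_tokens=256,
    do_sample=True,                    # required for temperature / top_p to take effect
    temperature=0.6,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))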

@Tortoise17
Author

@wenhuach21 The prompt:

prompt = [
    "The diplomat is ashamed and holds his hand in front of his face. He is now being investigated after the death of his husband. According to media reports, the couple had been married for 5 years and lived together in the district in Syria. The consulate employee initially called for an ambulance himself because his husband had suddenly felt ill and had fallen. He had also taken sleeping pills and drunk a lot. However, the investigators later discovered traces of blood throughout the apartment. After the autopsy, the police assumed that a blow to the back of the head led to death. At the police headquarters, he stuck to the version that his husband had fallen. He could not explain where the injuries we found on the body came from. There are signs of repeated, blunt injuries and I can say with certainty that the victim had been beaten. The consulate general did not comment on the case when asked. The Foreign Office said that the embassy in Syria and the consulate general were in close contact with the authorities in the Investigate the case. The signs of the injuries are all over the apartment."]

The instruction was to shorten the text while keeping the important points.

The device type is CUDA. I have used the defaults:

torch_dtype='auto',
device_map='auto'

Do these need to be changed too?

@Tortoise17
Author

@wenhuach21 Please provide the proper link for the exllamav2 kernel installation. It has failed with two versions so far.

@wenhuach21
Contributor

Additionally, for model serving, it's best to use frameworks like vLLM, SGLang, or TGI, as they are highly optimized for memory efficiency and performance. You can use the GPTQ or AWQ formats, which we provide for most models.
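For illustration, a minimal vLLM sketch, assuming the OPEA int4 checkpoint mentioned later in this thread; adjust the model name and tensor_parallel_size to your setup. This is a sketch, not a vetted serving configuration:

from vllm import LLM, SamplingParams

# Assumed checkpoint; substitute the GPTQ/AWQ-format model you actually use.
llm = LLM(model="OPEA/Llama-3.3-70B-Instruct-int4-sym-inc",
          tensor_parallel_size=1)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Shorten the following text: ..."], params)
print(outputs[0].outputs[0].text)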

@Tortoise17
Author

Could you provide guidance on how to use those optimized frameworks, like SGLang or TGI, for the 70B model?

@wenhuach21
Contributor

For the exllamav2 kernel: pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@b8b4127

For CPU inference, if your device has a GPU, you need to specify the backend as shown in https://github.com/intel/auto-round?tab=readme-ov-file#cpuhpucuda:

from transformers import AutoModelForCausalLM
from auto_round import AutoRoundConfig

backend = "cpu"  ## cpu, hpu, cuda
quantization_config = AutoRoundConfig(backend=backend)

quantized_model_path = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="cpu",
                                             quantization_config=quantization_config)
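A short hedged usage sketch following on from the snippet above; torch.set_num_threads is an optional tweak, and the thread count and prompt are placeholders:

import torch
from transformers import AutoTokenizer

torch.set_num_threads(32)  # optional: roughly match your physical core count

tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
inputs = tokenizer("Shorten the following text: ...", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))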

@wenhuach21
Contributor

Regarding "Could you provide guidance on how to use those optimized frameworks, like SGLang or TGI, for the 70B model?":

Please follow their user guides.

@wenhuach21
Contributor

https://github.com/vllm-project/vllm
https://github.com/sgl-project/sglang

@Tortoise17
Author

@wenhuach21 First of all, a billion thanks. I have now succeeded in installing the AutoGPTQ kernel, and inference time after model load is now roughly in seconds, less than a minute on the CUDA device. I still have to work on the CPU-based setup for the 70B model and see how it can be better optimized for CPU-only use. I will write to you about that once I finish the installation and tests.
