
Problem Inference LLAMA-3.3 70B Instruct #436

Closed
Tortoise17 opened this issue Feb 12, 2025 · 10 comments

Comments

@Tortoise17

I am trying to use LLAMA 3.3 70B, and the log below is printed before I get the text back.


2025-02-12 13:46:44,045 INFO config.py L54: PyTorch version 2.6.0+cu126 available.
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
llama-3_3_70B/lib/python3.10/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
2025-02-12 13:46:44,477 WARNING qlinear_cuda.py L18: CUDA extension not installed.
2025-02-12 13:46:44,478 WARNING qlinear_cuda_old.py L17: CUDA extension not installed.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:58<00:00,  9.83s/it]
llama-3_3_70B/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:628: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
llama-3_3_70B/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:633: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

My question: when I run it on GPU, it uses 40 GB of memory on a single GPU, but inference takes almost 7-9 minutes including model loading. If I try to run it on CPU, a single inference takes hours.

Are these inference times with auto-round[GPU] expected, or is there a mistake somewhere that could be better optimized? Please guide me. Is there any way to improve inference time on CPU, and do you have any advice for CPU-based use of this model? (A rough way to check where the time goes is sketched below.)
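For reference, a hedged timing sketch to separate model-load time from generation time; the checkpoint name, prompt, and token count are placeholders, not the exact setup used above:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"  # assumed checkpoint name

t0 = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
t1 = time.time()

inputs = tokenizer("Shorten the following text: ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
t2 = time.time()

print(f"load: {t1 - t0:.1f}s  generate: {t2 - t1:.1f}s")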

@wenhuach21
Contributor

wenhuach21 commented Feb 12, 2025

1. Is there a warning in the log similar to "exllamav2 kernel is not installed" when you run it on CUDA?
2. Could you attach your prompt and your device type? If the CPU does not support AVX, it will be quite slow (see the sketch below for a quick check).
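For the AVX question, a minimal Linux-only check (not part of auto-round; it just reads /proc/cpuinfo):

# Quick Linux-only check of CPU SIMD flags via /proc/cpuinfo.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in ("avx", "avx2", "avx512f"):
    print(feature, "yes" if feature in flags else "no")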

@wenhuach21
Contributor

On CUDA, try using torch.float16:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,          # path to the quantized checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    ## revision="12cbcc0",        ## AutoGPTQ format
)
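As a follow-up, a hedged sketch of running generation with an explicit attention_mask and sampling enabled, which should also address the pad_token_id / temperature warnings in the log above. The prompt and parameter values are placeholders, and it assumes the model and quantized_model_dir from the snippet above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
inputs = tokenizer("Shorten the following text: ...", return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,                          # passes attention_mask along with input_ids
    max_new_tokens=256,
    do_sample=True,                    # required for temperature / top_p to take effect
    temperature=0.6,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))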

@Tortoise17
Author

@wenhuach21 The prompt:

prompt = [
    "The diplomat is ashamed and holds his hand in front of his face. He is now being investigated after the death of his husband. According to media reports, the couple had been married for 5 years and lived together in the district in Syria. The consulate employee initially called for an ambulance himself because his husband had suddenly felt ill and had fallen. He had also taken sleeping pills and drunk a lot. However, the investigators later discovered traces of blood throughout the apartment. After the autopsy, the police assumed that a blow to the back of the head led to death. At the police headquarters, he stuck to the version that his husband had fallen. He could not explain where the injuries we found on the body came from. There are signs of repeated, blunt injuries and I can say with certainty that the victim had been beaten. The consulate general did not comment on the case when asked. The Foreign Office said that the embassy in Syria and the consulate general were in close contact with the authorities in the Investigate the case. The signs of the injuries are all over the apartment."]

The instruction was to shorten the text while keeping the important points.

The device type is CUDA. I have used the defaults:

torch_dtype='auto',
device_map='auto'

Do these need to be changed too?

@Tortoise17
Author

@wenhuach21 Please provide the proper link for the exllamav2 kernel installation. It has failed with two versions so far.

@wenhuach21
Contributor

Additionally, for model serving, it's best to use frameworks like vLLM, SGLang, or TGI, as they are highly optimized for memory efficiency and performance. You can use the GPTQ or AWQ formats, which we provide for most models.
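For illustration, a minimal vLLM sketch, assuming the OPEA int4 checkpoint mentioned later in this thread; adjust the model name and tensor_parallel_size to your setup. This is a sketch, not a vetted serving configuration:

from vllm import LLM, SamplingParams

# Assumed checkpoint; substitute the GPTQ/AWQ-format model you actually use.
llm = LLM(model="OPEA/Llama-3.3-70B-Instruct-int4-sym-inc",
          tensor_parallel_size=1)

params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Shorten the following text: ..."], params)
print(outputs[0].outputs[0].text)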

@Tortoise17
Author

Could you provide guidance on how to use those optimized frameworks, like SGLang or TGI, for the 70B model?

@wenhuach21
Contributor

For the exllamav2 kernel: pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@b8b4127

For CPU inference, if your device has a GPU, you need to specify the backend as shown in https://github.com/intel/auto-round?tab=readme-ov-file#cpuhpucuda:

from transformers import AutoModelForCausalLM
from auto_round import AutoRoundConfig

backend = "cpu"  ## cpu, hpu, cuda
quantization_config = AutoRoundConfig(backend=backend)

quantized_model_path = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="cpu",
                                             quantization_config=quantization_config)
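A short hedged usage sketch following on from the snippet above; torch.set_num_threads is an optional tweak, and the thread count and prompt are placeholders:

import torch
from transformers import AutoTokenizer

torch.set_num_threads(32)  # optional: roughly match your physical core count

tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
inputs = tokenizer("Shorten the following text: ...", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))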

@wenhuach21
Contributor

Regarding "Could you provide guidance on how to use those optimized frameworks, like SGLang or TGI, for the 70B model?":

Please follow their user guides.

@wenhuach21
Contributor

https://github.com/vllm-project/vllm
https://github.com/sgl-project/sglang

@Tortoise17
Author

@wenhuach21 First of all, a billion thanks. I have now succeeded in installing the AutoGPTQ kernel, and inference time after model load is now roughly in seconds, less than a minute on the CUDA device. I still have to work on the CPU-based setup for the 70B model and see how it can be better optimized for CPU-only use. I will write to you about that once I finish the installation and tests.
