Problem Inference LLAMA-3.3 70B Instruct #436
Comments
1. Is there a warning in the log similar to "exllamav2 kernel is not installed" when you run it on CUDA?
On CUDA, try to use torch.float16 when loading the model with AutoModelForCausalLM.from_pretrained (see the sketch below).
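A minimal sketch of what that float16 load could look like; the model path, device_map, and tokenizer lines are assumptions, not the exact code from the original comment:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed quantized checkpoint; substitute the model you are actually loading.
model_path = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # keep non-quantized tensors in fp16 on CUDA
    device_map="auto",          # place layers on the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_path)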
@wenhuach21 The prompt: the instruction was to shorten the length while keeping the important points. Does the device_type need to be changed too?
@wenhuach21 Please provide the proper link for the exllamav2 kernel installation. It has failed with two versions so far.
Additionally, for model serving, it is best to use frameworks like vLLM, SGLang, or TGI, as they are highly optimized for memory efficiency and performance. You can use the GPTQ or AWQ formats, which we provide for most models.
Could you give some guidance on how to use those optimized frameworks, such as SGLang or TGI, for the 70B model?
exllamav2:
pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@b8b4127

For CPU inference, if your device has a GPU, you need to specify the backend as shown in https://github.com/intel/auto-round?tab=readme-ov-file#cpuhpucuda:

from transformers import AutoModelForCausalLM
from auto_round import AutoRoundConfig

backend = "cpu"  # cpu, hpu, cuda
quantization_config = AutoRoundConfig(backend=backend)
quantized_model_path = "OPEA/Llama-3.3-70B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_path,
    device_map="cpu",
    quantization_config=quantization_config,
)
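As a follow-up, a minimal sketch of running a prompt once the model above is loaded; the prompt text and the max_new_tokens value are illustrative assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

# Hypothetical prompt; replace with your own shortening/summarization instruction.
prompt = "Shorten the following text, keeping the important points: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# A modest max_new_tokens keeps CPU inference time bounded (assumed value).
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))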
Please follow their user guides.
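Not part of the original reply, but as a rough illustration of the vLLM route mentioned above, here is a serving sketch; the checkpoint name, tensor_parallel_size, and max_model_len are assumptions and should be adjusted to your hardware:

from vllm import LLM, SamplingParams

# Assumed quantized checkpoint in a format vLLM can load (e.g. GPTQ); substitute your own.
llm = LLM(
    model="OPEA/Llama-3.3-70B-Instruct-int4-sym-inc",
    tensor_parallel_size=1,  # increase to shard the 70B model across multiple GPUs
    max_model_len=4096,      # assumed context limit to bound memory use
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Shorten the following text, keeping the important points: ..."],
    sampling,
)
print(outputs[0].outputs[0].text)

SGLang and TGI follow the same pattern: point the server at the quantized checkpoint and query it over their HTTP APIs, per their respective user guides.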
@wenhuach21 First of all, a billion thanks. I have now successfully installed the kernel.
I am trying to use LLAMA 3.3 70B, and the log below appears before I get the text back.
The question is: when I run it on GPU, it takes 40 GB on a single GPU, and inference takes almost 7-9 minutes including model loading. If I try to run it on CPU, a single inference takes hours.
Are these inference times expected with auto-round[GPU], or is there a mistake somewhere so that it could be better optimized? Please guide me if you can. Is there any way to improve the inference time on CPU, and do you have any advice for CPU-based use of this model?