
[Bug] torch.OutOfMemoryError: CUDA out of memory when quantizing in a notebook #2915

Open
3 tasks
EvoNexusX opened this issue Dec 17, 2024 · 1 comment

Comments

@EvoNexusX

EvoNexusX commented Dec 17, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Version: 0.6.3
Configuration: GPU T4 ×2 (16 GB ×2)
When quantizing glm4-9b with AWQ, GPU 1 sits at nearly 100% utilization while GPU 2 stays at 0%, and the run crashes with an OOM error near the end of quantization; it looks like GPU 2 is never used.
Do I need to set some additional parameters?

!lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --work-dir $WORK_DIR
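
For reference, a lower-memory variant of the same command, adjusting only flags that already appear above (--calib-samples and --calib-seqlen). The values are illustrative and not verified on this setup; calibration activations scale with sequence length, so halving --calib-seqlen roughly halves the per-layer activation footprint:

!lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 64 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --work-dir $WORK_DIR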

Reproduction

1

Environment

1

Error traceback

1
@EvoNexusX
Author

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████| 10/10 [00:50<00:00, 5.06s/it]
Move transformer.embedding to GPU.
Move transformer.rotary_pos_emb to GPU.
Move transformer.encoder.layers.0 to CPU.
Move transformer.encoder.layers.1 to CPU.
Move transformer.encoder.layers.2 to CPU.
Move transformer.encoder.layers.3 to CPU.
Move transformer.encoder.layers.4 to CPU.
Move transformer.encoder.layers.5 to CPU.
Move transformer.encoder.layers.6 to CPU.
Move transformer.encoder.layers.7 to CPU.
Move transformer.encoder.layers.8 to CPU.
Move transformer.encoder.layers.9 to CPU.
Move transformer.encoder.layers.10 to CPU.
Move transformer.encoder.layers.11 to CPU.
Move transformer.encoder.layers.12 to CPU.
Move transformer.encoder.layers.13 to CPU.
Move transformer.encoder.layers.14 to CPU.
Move transformer.encoder.layers.15 to CPU.
Move transformer.encoder.layers.16 to CPU.
Move transformer.encoder.layers.17 to CPU.
Move transformer.encoder.layers.18 to CPU.
Move transformer.encoder.layers.19 to CPU.
Move transformer.encoder.layers.20 to CPU.
Move transformer.encoder.layers.21 to CPU.
Move transformer.encoder.layers.22 to CPU.
Move transformer.encoder.layers.23 to CPU.
Move transformer.encoder.layers.24 to CPU.
Move transformer.encoder.layers.25 to CPU.
Move transformer.encoder.layers.26 to CPU.
Move transformer.encoder.layers.27 to CPU.
Move transformer.encoder.layers.28 to CPU.
Move transformer.encoder.layers.29 to CPU.
Move transformer.encoder.layers.30 to CPU.
Move transformer.encoder.layers.31 to CPU.
Move transformer.encoder.layers.32 to CPU.
Move transformer.encoder.layers.33 to CPU.
Move transformer.encoder.layers.34 to CPU.
Move transformer.encoder.layers.35 to CPU.
Move transformer.encoder.layers.36 to CPU.
Move transformer.encoder.layers.37 to CPU.
Move transformer.encoder.layers.38 to CPU.
Move transformer.encoder.layers.39 to CPU.
Move transformer.encoder.final_layernorm to GPU.
Move transformer.output_layer to CPU.
Loading calibrate dataset ...
Token indices sequence length is longer than the specified maximum sequence length for this model (1104488 > 128000). Running this sequence through the model will result in indexing errors
transformer.encoder.layers.0, samples: 128, max gpu memory: 7.56 GB
transformer.encoder.layers.1, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.2, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.3, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.4, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.5, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.6, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.7, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.8, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.9, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.10, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.11, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.12, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.13, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.14, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.15, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.16, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.17, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.18, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.19, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.20, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.21, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.22, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.23, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.24, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.25, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.26, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.27, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.28, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.29, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.30, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.31, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.32, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.33, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.34, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.35, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.36, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.37, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.38, samples: 128, max gpu memory: 9.56 GB
transformer.encoder.layers.39, samples: 128, max gpu memory: 9.56 GB
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 42, in run
    args.run(args)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/lite/apis/auto_awq.py", line 90, in auto_awq
    vl_model, model, tokenizer, work_dir = calibrate(model,
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/lite/apis/calibrate.py", line 305, in calibrate
    calib_ctx.calibrate(all_data)
  File "/opt/conda/lib/python3.10/site-packages/lmdeploy/lite/quantization/calibration.py", line 238, in calibrate
    _ = model(data.to(self.device))
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/1/modeling_chatglm.py", line 777, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/1/modeling_chatglm.py", line 634, in forward
    hidden_states = self.final_layernorm(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/1/modeling_chatglm.py", line 156, in forward
    return (self.weight * hidden_states).to(input_dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.39 GiB is free. Process 25419 has 13.35 GiB memory in use. Of the allocated memory 13.18 GiB is allocated by PyTorch, and 42.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
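
The allocator hint at the end of the traceback suggests one mitigation to try first: enabling expandable segments in PyTorch's CUDA caching allocator to reduce fragmentation. A minimal sketch of rerunning with that setting, reusing the exact flags from the original command (whether this frees enough headroom on a 16 GB T4 is untested):

!PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 1 \
  --work-dir $WORK_DIR

Note that fragmentation only accounts for the 42.87 MiB of reserved-but-unallocated memory here, so if the 2.00 GiB allocation still fails, lowering --calib-seqlen or --calib-samples as sketched earlier is the more direct lever.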
