
[Bug] torch.OutOfMemoryError #1773

Open
2 tasks done
QingChengLineOne opened this issue Dec 23, 2024 · 3 comments
@QingChengLineOne
Prerequisites

Problem type

I am evaluating with an officially supported task / model / dataset.

Environment

{'CUDA available': True,
'CUDA_HOME': '/usr/local/cuda',
'GCC': 'gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0',
'GPU 0,1': 'NVIDIA GeForce RTX 4090',
'MMEngine': '0.10.5',
'MUSA available': False,
'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
'OpenCV': '4.10.0',
'PyTorch': '2.5.1+cu124',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2024.2-Product Build 20240605 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v3.5.3 (Git Hash '
'66f0cb9eb66affd2da3bf5f8d897376f04aae6af)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX2\n'
' - CUDA Runtime 12.4\n'
' - NVCC architecture flags: '
'-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
' - CuDNN 90.1\n'
' - Magma 2.6.1\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=12.4, '
'CUDNN_VERSION=9.1.0, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON '
'-DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wsuggest-override '
'-Wno-psabi -Wno-error=old-style-cast '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, '
'USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, '
'USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, '
'USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, '
'USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, \n',
'Python': '3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0]',
'TorchVision': '0.20.1+cu124',
'lmdeploy': "not installed:No module named 'lmdeploy'",
'numpy_random_seed': 2147483648,
'opencompass': '0.3.8+',
'sys.platform': 'linux',
'transformers': '4.47.1'}

Reproduces the problem - code/configuration sample

CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --datasets tydiqa_gen \
    --hf-type chat \
    --hf-path /public/zzy/model/Phi/Phi-3-mini-128k-instruct \
    --batch-size 1 \
    --debug

Reproduces the problem - command or script

The Python command above.

Reproduces the problem - error message

/public/zzy/fintuning/opencompass/opencompass/__init__.py:19: UserWarning: Starting from v0.4.0, all AMOTIC configuration files currently located in ./configs/datasets, ./configs/models, and ./configs/summarizers will be migrated to the opencompass/configs/ package. Please update your configuration file paths accordingly.
_warn_about_config_migration()
12/22 15:16:50 - OpenCompass - WARNING - Found ambiguous patterns, using the first matched config.
+----------------------+---------------------------------------------------------------------------------------+
| Ambiguous patterns | Matched files |
|----------------------+---------------------------------------------------------------------------------------|
| tydiqa_gen | configs/datasets/tydiqa/tydiqa_gen.py |
| | /public/zzy/fintuning/opencompass/opencompass/configs/./datasets/tydiqa/tydiqa_gen.py |
+----------------------+---------------------------------------------------------------------------------------+
12/22 15:16:50 - OpenCompass - INFO - Loading tydiqa_gen: configs/datasets/tydiqa/tydiqa_gen.py
12/22 15:16:50 - OpenCompass - WARNING - Found ambiguous patterns, using the first matched config.
+----------------------+--------------------------------------------------------------------------------+
| Ambiguous patterns | Matched files |
|----------------------+--------------------------------------------------------------------------------|
| example | configs/summarizers/example.py |
| | /public/zzy/fintuning/opencompass/opencompass/configs/./summarizers/example.py |
+----------------------+--------------------------------------------------------------------------------+
12/22 15:16:50 - OpenCompass - INFO - Loading example: configs/summarizers/example.py
12/22 15:16:50 - OpenCompass - INFO - Current exp folder: outputs/default/20241222_151650
12/22 15:16:51 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
12/22 15:16:51 - OpenCompass - INFO - Partitioned into 1 tasks.
12/22 15:16:52 - OpenCompass - WARNING - Only use 1 GPUs for total 2 available GPUs in debug mode.
12/22 15:16:52 - OpenCompass - INFO - Task [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_arabic,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_bengali,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_english,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_finnish,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_indonesian,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_japanese,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_korean,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_russian,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_swahili,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_telugu,Phi-3-mini-128k-instruct_hf/tydiqa-goldp_thai]
flash-attention package not found, consider installing for better performance: No module named 'flash_attn'.
Current flash-attenton does not support window_size. Either upgrade or use attn_implementation='eager'.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [01:20<00:00, 40.22s/it]
We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.
12/22 15:18:13 - OpenCompass - INFO - using stop words: ['<|assistant|>', '<|endoftext|>', '<|end|>']
12/22 15:18:14 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_arabic]
[2024-12-22 15:18:14,186] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 15:18:14,187] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
0%| | 0/921 [00:00<?, ?it/s]The seen_tokens attribute is deprecated and will be removed in v4.41. Use the cache_position model input instead.
get_max_cache() is deprecated for all Cache classes. Use get_max_cache_shape() instead. Calling get_max_cache() will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 921/921 [1:47:55<00:00, 7.03s/it]
12/22 17:06:09 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_bengali]
[2024-12-22 17:06:09,609] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 17:06:09,609] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 113/113 [17:35<00:00, 9.34s/it]
12/22 17:23:45 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_english]
[2024-12-22 17:23:45,217] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 17:23:45,217] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 440/440 [12:06<00:00, 1.65s/it]
12/22 17:35:51 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_finnish]
[2024-12-22 17:35:51,916] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 17:35:51,916] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 782/782 [1:16:55<00:00, 5.90s/it]
12/22 18:52:47 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_indonesian]
[2024-12-22 18:52:47,531] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 18:52:47,531] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 565/565 [28:03<00:00, 2.98s/it]
12/22 19:20:50 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_japanese]
[2024-12-22 19:20:50,824] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 19:20:50,824] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 455/455 [25:30<00:00, 3.36s/it]
12/22 19:46:21 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_korean]
[2024-12-22 19:46:21,344] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 19:46:21,344] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 276/276 [27:02<00:00, 5.88s/it]
12/22 20:13:23 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_russian]
[2024-12-22 20:13:23,953] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 20:13:23,953] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 812/812 [23:46<00:00, 1.76s/it]
12/22 20:37:10 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_swahili]
[2024-12-22 20:37:10,967] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 20:37:10,967] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 499/499 [36:18<00:00, 4.37s/it]
12/22 21:13:29 - OpenCompass - INFO - Start inferencing [Phi-3-mini-128k-instruct_hf/tydiqa-goldp_telugu]
[2024-12-22 21:13:30,003] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting build dataloader
[2024-12-22 21:13:30,003] [opencompass.openicl.icl_inferencer.icl_gen_inferencer] [INFO] Starting inference process...
16%|███████████████▌ | 110/669 [25:30<2:09:37, 13.91s/it]
Traceback (most recent call last):
File "/public/zzy/fintuning/opencompass/run.py", line 4, in <module>
main()
File "/public/zzy/fintuning/opencompass/opencompass/cli/main.py", line 308, in main
runner(tasks)
File "/public/zzy/fintuning/opencompass/opencompass/runners/base.py", line 38, in __call__
status = self.launch(tasks)
File "/public/zzy/fintuning/opencompass/opencompass/runners/local.py", line 128, in launch
task.run(cur_model=getattr(self, 'cur_model',
File "/public/zzy/fintuning/opencompass/opencompass/tasks/openicl_infer.py", line 89, in run
self._inference()
File "/public/zzy/fintuning/opencompass/opencompass/tasks/openicl_infer.py", line 134, in _inference
inferencer.inference(retriever,
File "/public/zzy/fintuning/opencompass/opencompass/openicl/icl_inferencer/icl_gen_inferencer.py", line 153, in inference
results = self.model.generate_from_template(
File "/public/zzy/fintuning/opencompass/opencompass/models/base.py", line 201, in generate_from_template
return self.generate(inputs, max_out_len=max_out_len, **kwargs)
File "/public/zzy/fintuning/opencompass/opencompass/models/huggingface_above_v4_33.py", line 479, in generate
outputs = self.model.generate(**tokens, **generation_kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
result = self._sample(
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/transformers/generation/utils.py", line 3251, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Phi-3-mini-128k-instruct/modeling_phi3.py", line 1286, in forward
outputs = self.model(
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Phi-3-mini-128k-instruct/modeling_phi3.py", line 1164, in forward
layer_outputs = decoder_layer(
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Phi-3-mini-128k-instruct/modeling_phi3.py", line 885, in forward
attn_outputs, self_attn_weights, present_key_value = self.self_attn(
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/Phi-3-mini-128k-instruct/modeling_phi3.py", line 405, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(value_states.dtype)
File "/root/anaconda/envs/opencompass/lib/python3.10/site-packages/torch/nn/functional.py", line 2142, in softmax
ret = input.softmax(dim, dtype=dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.83 GiB. GPU 0 has a total capacity of 23.65 GiB of which 9.06 GiB is free. Process 43896 has 954.00 MiB memory in use. Process 44069 has 13.65 GiB memory in use. Of the allocated memory 12.72 GiB is allocated by PyTorch, and 485.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
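The size of the failed allocation is consistent with eager attention materializing the full score tensor of shape [batch, num_heads, q_len, k_len] in float32 before the softmax at modeling_phi3.py:405. A rough back-of-the-envelope sketch, assuming Phi-3-mini's 32 attention heads (a config detail not shown in the log):

```python
def eager_attn_softmax_bytes(num_heads: int, q_len: int, k_len: int,
                             batch: int = 1, bytes_per_el: int = 4) -> int:
    """Bytes needed for the [batch, num_heads, q_len, k_len] float32
    attention-score tensor that eager attention materializes before softmax."""
    return batch * num_heads * q_len * k_len * bytes_per_el

GiB = 1024 ** 3
# During prefill of a single ~11.9k-token Telugu context, q_len == k_len,
# so the score tensor alone needs roughly 16.9 GiB:
print(eager_attn_softmax_bytes(num_heads=32, q_len=11_900, k_len=11_900) / GiB)
```

This is why batch size 1 does not bound the peak: one unusually long TyDiQA context is enough to request ~17 GiB for a single sample, which exceeds what is free on a 24 GB RTX 4090.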

Other information

No response

@tonysy
Collaborator

tonysy commented Dec 24, 2024

You can decrease the batch size to avoid running out of memory.
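If reducing the batch size is not enough, two other knobs may help: the allocator hint printed in the OOM message itself, and capping the input length so eager attention never builds a huge score matrix. A sketch, assuming this OpenCompass version accepts a --max-seq-len flag (verify with `python run.py --help`):

```shell
# Allocator hint taken from the OOM message (may reduce fragmentation)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Cap the prompt length so no single sample requests ~17 GiB of scores
CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --datasets tydiqa_gen \
    --hf-type chat \
    --hf-path /public/zzy/model/Phi/Phi-3-mini-128k-instruct \
    --batch-size 1 \
    --max-seq-len 4096 \
    --debug
```

Truncating long contexts trades some accuracy on the longest TyDiQA passages for a bounded memory footprint.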

@QingChengLineOne
Author

You can decrease the batch size to avoid out-of-memory

But my batch size is already 1.

@tonysy
Collaborator

tonysy commented Dec 25, 2024

Got it. If the OOM still occurs, tensor parallelism may be required.
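A possible sketch of that route, assuming LMDeploy is installed (the environment above reports it is not) and that this OpenCompass version supports the -a/--accelerator switch (verify with `python run.py --help`):

```shell
# LMDeploy is reported as "not installed" in the environment dump above
pip install lmdeploy

# Re-run with the LMDeploy backend instead of the HF eager-attention path
CUDA_VISIBLE_DEVICES=0,1 python run.py \
    --datasets tydiqa_gen \
    --hf-type chat \
    --hf-path /public/zzy/model/Phi/Phi-3-mini-128k-instruct \
    --batch-size 1 \
    -a lmdeploy
```

Sharding across both 4090s (tp=2) is then set in the LMDeploy engine config rather than on this command line; the exact knob depends on the OpenCompass and LMDeploy versions in use.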
