Problem description:
When MPS is configured on an A16 card and the number of replicas is set to 6, device allocation fails with "CUDA-capable device(s) is/are busy or unavailable". With the replica count reduced to 5, everything runs normally.
The same problem occurs on T4 and A30 cards: the same error is reported when the replica count exceeds 21 on a T4 or 30 on an A30.
When MPS is enabled on A16, A30, and T4 cards, what is the maximum number of replicas supported?
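For context, the per-client limits that the device plugin passes to the MPS control daemon appear to shrink as the replica count grows. Below is a small sketch of that arithmetic, assuming the GPU is split evenly across replicas (an assumption, not an official formula); for 6 replicas on the 15356 MiB A16 it reproduces the 2559M pinned-memory limit and 16% active-thread percentage visible in the control-daemon log further down.

```python
# Sketch: how per-replica MPS limits relate to the replica count, assuming an
# even split of GPU memory and SM threads across replicas (assumption).
def per_replica_limits(total_mem_mib: int, replicas: int):
    mem_limit_mib = total_mem_mib // replicas   # pinned device memory per client
    thread_pct = int(100 / replicas)            # active thread percentage per client
    return mem_limit_mib, thread_pct

# A16 as reported by nvidia-smi below: 15356 MiB total
for replicas in (5, 6):
    mem, pct = per_replica_limits(15356, replicas)
    print(f"replicas={replicas}: ~{mem} MiB pinned memory, ~{pct}% active threads")
# replicas=6 -> ~2559 MiB and 16%, matching the control-daemon log.
```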
Config information:
k8s-device-plugin: 0.17.0
GPU Operator version: 24.9.0
CUDA version: 12.4
CUDA driver version: 12.4
k8s version: 1.26
Sample: PyTorch 2.4.0+cu124
Logs:
sample output:
using cuda:0 device.
Using 8 dataloader workers every process
using 60000 images for training, 10000 images for validation.
Traceback (most recent call last):
File "/workspace/code/pt-examples/resnet/train.py", line 133, in
main()
File "/workspace/code/pt-examples/resnet/train.py", line 74, in main
net.to(device)
File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
return t.to(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
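The train.py internals are not shown in this issue, so here is a minimal standalone snippet (model choice and file name are assumptions) that exercises the same failing step, creating a CUDA context via net.to(device), and can be launched once per MPS replica to probe a given replica count:

```python
# minimal_repro.py -- assumed minimal reproduction; the real train.py is not shown here.
import torch
import torchvision

def main():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"using {device} device.")

    # The traceback above fails inside net.to(device), i.e. while creating the
    # CUDA context and moving parameters onto the MPS-shared GPU.
    net = torchvision.models.resnet18(weights=None)
    net.to(device)

    # Tiny forward pass to confirm the context is actually usable.
    x = torch.randn(1, 3, 224, 224, device=device)
    with torch.no_grad():
        y = net(x)
    print("forward pass ok, output shape:", tuple(y.shape))

if __name__ == "__main__":
    main()
```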
mps-control-daemon output:
I0107 06:20:18.770820 196 main.go:203] Retrieving MPS daemons.
W0107 06:20:18.781612 196 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0107 06:20:18.803608 196 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
I0107 06:20:18.803664 196 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
I0107 06:20:18.809265 196 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
[2025-01-07 06:20:18.806 Control 209] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2025-01-07 06:20:18.806 Control 209] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2025-01-07 06:20:18.807 Control 209] Accepting connection...
[2025-01-07 06:20:18.807 Control 209] NEW UI
[2025-01-07 06:20:18.807 Control 209] Cmd:set_default_device_pinned_mem_limit 0 2559M
[2025-01-07 06:20:18.807 Control 209] UI closed
[2025-01-07 06:20:18.808 Control 209] Accepting connection...
[2025-01-07 06:20:18.808 Control 209] NEW UI
[2025-01-07 06:20:18.808 Control 209] Cmd:set_default_active_thread_percentage 16
[2025-01-07 06:20:18.809 Control 209] 16.0
[2025-01-07 06:20:18.809 Control 209] UI closed
mps server output:
[2025-01-07 12:57:36.231 Other 82] Startup
[2025-01-07 12:57:36.231 Other 82] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
[2025-01-07 12:57:36.231 Other 82] Initializing server process
[2025-01-07 12:57:36.324 Server 82] Creating server context on device 0 (NVIDIA A16)
[2025-01-07 12:57:36.457 Server 82] Created named shared memory region /cuda.shm.0.52.1
[2025-01-07 12:57:36.457 Server 82] Active Threads Percentage set to 12.0
[2025-01-07 12:57:36.457 Server 82] Device pinned memory limit for device 0 set to 0x77f00000 bytes
[2025-01-07 12:57:36.457 Server 82] Server Priority set to 0
[2025-01-07 12:57:36.457 Server 82] Server has started
[2025-01-07 12:57:36.457 Server 82] Received new client request
[2025-01-07 12:57:36.457 Server 82] Worker created
[2025-01-07 12:57:36.457 Server 82] Creating worker thread
[2025-01-07 12:57:36.515 Server 82] Received new client request
[2025-01-07 12:57:36.516 Server 82] Worker created
[2025-01-07 12:57:36.516 Server 82] Creating worker thread
[2025-01-07 12:57:36.516 Server 82] Device NVIDIA A16 (uuid GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78) is associated
[2025-01-07 12:57:36.516 Server 82] Status of client {0, 1} is ACTIVE
[2025-01-07 12:57:36.657 Server 82] Client process 0 encountered a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Server is handling a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Status of client {0, 1} is INACTIVE
[2025-01-07 12:57:36.657 Server 82] The following devices will be reset:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.657 Server 82] The following clients have a sticky error set:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.777 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:36.777 Server 82] Client {0, 1} exit
[2025-01-07 12:57:36.777 Server 82] Client disconnected. Number of active client contexts is 0.
[2025-01-07 12:57:36.777 Server 82] Destroy server context on device 0
[2025-01-07 12:57:37.237 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:37.237 Server 82] Client process disconnected
dmesg output:
[ 268.026103] NVRM: GPU at PCI:0000:00:0c: GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78
[ 268.026131] NVRM: GPU Board Serial Number: 1321722074061
[ 268.026144] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
[ 579.875139] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A16 Off | 00000000:00:0C.0 Off | 0 |
| 0% 49C P8 16W / 62W | 1MiB / 15356MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+