MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! #1118

Thomas-syq opened this issue on Jan 13, 2025
Problem description:
When MPS is configured on an A16 card with the replica count set to 6, device allocation fails with "all CUDA-capable devices are busy or unavailable". With the replica count reduced to 5, training runs normally.
The same problem occurs on T4 and A30 cards: the same error is reported when the replica count exceeds 21 on a T4 and 30 on an A30.
With MPS enabled on A16, A30, and T4 cards, what is the maximum number of replicas supported?

Config information:
k8s-device-plugin: 0.17.0
GPU Operator version: 24.9.0
CUDA version: 12.4
CUDA driver version: 12.4
Kubernetes version: 1.26
Sample: PyTorch 2.4.0+cu124
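
For reference, a minimal check I run in the same pod before starting training, to confirm the client sees the MPS pipe directory advertised by the control daemon (log below) and that torch can see the GPU. This is an illustrative sketch only, not part of train.py:

import os
import torch

# The env var name comes from the mps-control-daemon log below.
print("CUDA_MPS_PIPE_DIRECTORY =", os.environ.get("CUDA_MPS_PIPE_DIRECTORY"))
print("torch.cuda.is_available() =", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("device 0:", props.name, f"{props.total_memory / 2**20:.0f} MiB")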

Logs:
sample output:
using cuda:0 device.
Using 8 dataloader workers every process
using 60000 images for training, 10000 images for validation.
Traceback (most recent call last):
  File "/workspace/code/pt-examples/resnet/train.py", line 133, in <module>
    main()
  File "/workspace/code/pt-examples/resnet/train.py", line 74, in main
    net.to(device)
  File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1174, in to
    return self._apply(convert)
  File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 780, in _apply
    module._apply(fn)
  File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
  File "/root/.conda/envs/ptcuda124/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
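
The failure does not depend on the full training script; a stripped-down sketch of the step that fails (the resnet18 stand-in and the comments are mine, the real script is train.py above), with CUDA_LAUNCH_BLOCKING=1 set as the error message suggests:

# Minimal repro sketch -- assumption: the first CUDA context creation /
# allocation fails the same way whenever the GPU is shared into 6 MPS replicas.
import os
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")  # synchronous error reporting, per the message above

import torch
import torchvision

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"using {device} device.")

net = torchvision.models.resnet18(num_classes=10)  # hypothetical stand-in for the model in train.py
net.to(device)  # raises "CUDA-capable device(s) is/are busy or unavailable" when replicas=6
print("model moved to", device)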

	mps-control-daemon output:
		I0107 06:20:18.770820     196 main.go:203] Retrieving MPS daemons.
		W0107 06:20:18.781612     196 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
		I0107 06:20:18.803608     196 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
		I0107 06:20:18.803664     196 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
		I0107 06:20:18.809265     196 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
		[2025-01-07 06:20:18.806 Control   209] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
		[2025-01-07 06:20:18.806 Control   209] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
		[2025-01-07 06:20:18.807 Control   209] Accepting connection...
		[2025-01-07 06:20:18.807 Control   209] NEW UI
		[2025-01-07 06:20:18.807 Control   209] Cmd:set_default_device_pinned_mem_limit 0 2559M
		[2025-01-07 06:20:18.807 Control   209] UI closed
		[2025-01-07 06:20:18.808 Control   209] Accepting connection...
		[2025-01-07 06:20:18.808 Control   209] NEW UI
		[2025-01-07 06:20:18.808 Control   209] Cmd:set_default_active_thread_percentage 16
		[2025-01-07 06:20:18.809 Control   209] 16.0
		[2025-01-07 06:20:18.809 Control   209] UI closed
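
The two Cmd lines above look like an even split of the A16 GPU across the 6 replicas. A quick check of the arithmetic (my own calculation, not taken from the plugin source; total memory is from the nvidia-smi output below):

total_mem_mib = 15356          # A16 memory reported by nvidia-smi below
total_threads_pct = 100
replicas = 6

print(total_mem_mib // replicas)      # 2559 -> matches set_default_device_pinned_mem_limit 0 2559M
print(total_threads_pct // replicas)  # 16   -> matches set_default_active_thread_percentage 16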
	
	mps server output:
		[2025-01-07 12:57:36.231 Other    82] Startup
		[2025-01-07 12:57:36.231 Other    82] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
		[2025-01-07 12:57:36.231 Other    82] Initializing server process
		[2025-01-07 12:57:36.324 Server    82] Creating server context on device 0 (NVIDIA A16)
		[2025-01-07 12:57:36.457 Server    82] Created named shared memory region /cuda.shm.0.52.1
		[2025-01-07 12:57:36.457 Server    82] Active Threads Percentage set to 12.0
		[2025-01-07 12:57:36.457 Server    82] Device pinned memory limit for device 0 set to 0x77f00000 bytes
		[2025-01-07 12:57:36.457 Server    82] Server Priority set to 0
		[2025-01-07 12:57:36.457 Server    82] Server has started
		[2025-01-07 12:57:36.457 Server    82] Received new client request
		[2025-01-07 12:57:36.457 Server    82] Worker created
		[2025-01-07 12:57:36.457 Server    82] Creating worker thread
		[2025-01-07 12:57:36.515 Server    82] Received new client request
		[2025-01-07 12:57:36.516 Server    82] Worker created
		[2025-01-07 12:57:36.516 Server    82] Creating worker thread
		[2025-01-07 12:57:36.516 Server    82] Device NVIDIA A16 (uuid GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78) is associated
		[2025-01-07 12:57:36.516 Server    82] Status of client {0, 1} is ACTIVE
		[2025-01-07 12:57:36.657 Server    82] Client process 0 encountered a fatal GPU error.
		[2025-01-07 12:57:36.657 Server    82] Server is handling a fatal GPU error.
		[2025-01-07 12:57:36.657 Server    82] Status of client {0, 1} is INACTIVE
		[2025-01-07 12:57:36.657 Server    82] The following devices will be reset:
		[2025-01-07 12:57:36.657 Server    82] 0
		[2025-01-07 12:57:36.657 Server    82] The following clients have a sticky error set:
		[2025-01-07 12:57:36.657 Server    82] 0
		[2025-01-07 12:57:36.777 Server    82] Receive command failed, assuming client exit
		[2025-01-07 12:57:36.777 Server    82] Client {0, 1} exit
		[2025-01-07 12:57:36.777 Server    82] Client disconnected. Number of active client contexts is 0.
		[2025-01-07 12:57:36.777 Server    82] Destroy server context on device 0
		[2025-01-07 12:57:37.237 Server    82] Receive command failed, assuming client exit
		[2025-01-07 12:57:37.237 Server    82] Client process disconnected
	
	dmesg output:
		[  268.026103] NVRM: GPU at PCI:0000:00:0c: GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78
		[  268.026131] NVRM: GPU Board Serial Number: 1321722074061
		[  268.026144] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
		[  579.875139] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
		
	nvidia-smi output:
           
			+-----------------------------------------------------------------------------------------+
			| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
			|-----------------------------------------+------------------------+----------------------+
			| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
			| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
			|                                         |                        |               MIG M. |
			|=========================================+========================+======================|
			|   0  NVIDIA A16                     Off |   00000000:00:0C.0 Off |                    0 |
			|  0%   49C    P8             16W /   62W |       1MiB /  15356MiB |      0%   E. Process |
			|                                         |                        |                  N/A |
			+-----------------------------------------+------------------------+----------------------+

			+-----------------------------------------------------------------------------------------+
			| Processes:                                                                              |
			|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
			|        ID   ID                                                               Usage      |
			|=========================================================================================|
			|  No running processes found                                                             |
			+-----------------------------------------------------------------------------------------+