Training keeps failing with: RuntimeError: CUDA driver error: invalid argument #4103

listwebit · 2025-05-07T02:15:54Z

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots

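Training works fine with a small model, but after switching to a somewhat larger model it fails with: RuntimeError: CUDA driver error: invalid argument
Even if the larger model cannot be trained, I would expect it to report an OOM error instead.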

gpuxdn033137081244:2111:2111 [0] NCCL INFO 192 Bytes -> Algo 1 proto 0 time 120.315826

[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/cli/rlhf.py", line 5, in <module>
[rank1]: rlhf_main()
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/llm/train/rlhf.py", line 99, in rlhf_main
[rank1]: return SwiftRLHF(args).main()
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/llm/base.py", line 47, in main
[rank1]: result = self.run()
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/llm/train/sft.py", line 147, in run
[rank1]: return self.train(trainer)
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/llm/train/sft.py", line 207, in train
[rank1]: trainer.train(trainer.args.resume_from_checkpoint)
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/trainers/mixin.py", line 321, in train
[rank1]: res = super().train(*args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank1]: return inner_training_loop(
[rank1]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3736, in training_step
[rank1]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank1]: File "/mnt4/code/chonghan.ll/ms-swift/swift/trainers/rlhf_trainer/rlhf_mixin.py", line 98, in compute_loss
[rank1]: res = super().compute_loss(model, inputs, return_outputs=return_outputs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/trl/trainer/cpo_trainer.py", line 861, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: File "/opt/conda/lib/python3.10/site-packages/trl/trainer/cpo_trainer.py", line 839, in get_batch_loss_metrics
[rank1]: self.accelerator.gather_for_metrics(policy_rejected_logits).detach().mean().item()
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2502, in gather_for_metrics
[rank1]: data = self.gather(input_data)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 2458, in gather
[rank1]: return gather(tensor)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 376, in wrapper
[rank1]: return function(*args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 437, in gather
[rank1]: return _gpu_gather(tensor)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 356, in _gpu_gather
[rank1]: return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
[rank1]: return func(data, *args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 341, in _gpu_gather_one
[rank1]: output_tensors = torch.empty(
[rank1]: RuntimeError: CUDA driver error: invalid argument
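
The traceback ends at the `torch.empty` call inside accelerate's `_gpu_gather_one`, reached from `gather_for_metrics(policy_rejected_logits)`, so the failing step is the allocation of the buffer that receives the logits from all ranks. One thing worth checking is how large that buffer would be. Below is a rough sketch, assuming the buffer holds one copy of the logits tensor per rank; the helper name `gathered_bytes` and the shape, dtype size, and rank count are all hypothetical, not values from this report, and this does not confirm the cause of the error.

```python
from math import prod


def gathered_bytes(shape, bytes_per_element, world_size):
    """Bytes needed for a buffer holding one copy of `shape` per rank.

    Assumption: the gather output is world_size copies of the local tensor.
    """
    return world_size * prod(shape) * bytes_per_element


# Hypothetical values for illustration only (not taken from the report):
# logits of shape (batch, seq_len, vocab_size), 2 bytes per element, 8 ranks.
shape = (2, 4096, 152_064)
size = gathered_bytes(shape, bytes_per_element=2, world_size=8)
print(f"gather buffer: {size / 2**30:.1f} GiB")
```

Plugging in the real per-rank batch size, sequence length, vocabulary size, and number of GPUs would show whether this single allocation is unusually large for the bigger model.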

Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version

Additional context
Add any other context about the problem here
