Describe the bug
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
swift infer \
    --model "/mnt/modelhub/QwQ-32B" \
    --infer_backend vllm \
    --ddp_backend nccl \
    --tensor_parallel_size 2 \
    --val_dataset "input_data_v2_20250506.jsonl" \
    --gpu_memory_utilization 0.9 \
    --max_model_len 4096 \
    --max_length 4096 \
    --max_batch_size 400 \
    --dataset_num_proc 16 \
    --result_path "pred_a100_qwq32B_1w_result_v2_20250506.jsonl" \
    --system "prompt_v2.txt" \
    --remove_unused_columns False
Error:
[E506 17:21:08.668667255 socket.cpp:1011] [c10d] The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
[W506 17:21:08.668906079 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 33.145.123.167:40341 - retrying (try=0, timeout=600000ms, delay=51298ms): The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2a5aab446 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
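The log shows a c10d client that cannot open a TCP connection to 33.145.123.167:40341 within the timeout. A minimal sketch for checking from the same machine whether that address accepts TCP connections at all is given below. It uses only the Python standard library; `can_connect` is a hypothetical helper name, not part of swift or PyTorch, and the 5-second default timeout is an arbitrary choice.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout_s.

    Hypothetical diagnostic helper: it only tests reachability and does not
    speak the c10d TCPStore protocol.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Example call, using the address copied from the log above:
#   can_connect("33.145.123.167", 40341)
```

The check is presumably only meaningful while the job is running, since the store would not be listening otherwise, and the port may differ between runs. A `False` result says the address is not reachable from this machine, but not why.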
Your hardware and system info
The same problem occurs with every version, and it is reproducible.