With tp=2, swift infer does not support batch inference for the deepseek-r1-distill-qwen series and QwQ-32B models #4130

phoenixbai · 2025-05-08T05:12:35Z

Describe the bug
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
swift infer \
    --model "/mnt/modelhub/QwQ-32B" \
    --infer_backend vllm \
    --ddp_backend nccl \
    --tensor_parallel_size 2 \
    --val_dataset "input_data_v2_20250506.jsonl" \
    --gpu_memory_utilization 0.9 \
    --max_model_len 4096 \
    --max_length 4096 \
    --max_batch_size 400 \
    --dataset_num_proc 16 \
    --result_path "pred_a100_qwq32B_1w_result_v2_20250506.jsonl" \
    --system "prompt_v2.txt" \
    --remove_unused_columns False

Error:
[E506 17:21:08.668667255 socket.cpp:1011] [c10d] The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
[W506 17:21:08.668906079 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 33.145.123.167:40341 - retrying (try=0, timeout=600000ms, delay=51298ms): The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2a5aab446 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
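
One thing that may be worth checking before digging into the timeout itself is whether the launch settings ask for more GPUs than are visible. The sketch below is not a confirmed diagnosis. It assumes (this is not stated anywhere in the report) that each process started via NPROC_PER_NODE drives its own vLLM engine spanning tensor_parallel_size GPUs. The helper names are hypothetical and are not part of swift or vLLM.

```python
def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count the device IDs listed in a CUDA_VISIBLE_DEVICES string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def gpus_required(nproc_per_node: int, tensor_parallel_size: int) -> int:
    """GPUs needed under the ASSUMPTION that every launched process
    owns a separate tensor-parallel group of the given size."""
    return nproc_per_node * tensor_parallel_size


def check_launch(nproc_per_node: int, tensor_parallel_size: int,
                 cuda_visible_devices: str):
    """Return (fits, required, visible) for the given launch settings."""
    visible = visible_gpu_count(cuda_visible_devices)
    required = gpus_required(nproc_per_node, tensor_parallel_size)
    return required <= visible, required, visible


if __name__ == "__main__":
    # Values taken from the command in this report.
    fits, required, visible = check_launch(8, 2, "0,1,2,3,4,5,6,7")
    print(f"fits={fits} required={required} visible={visible}")
```

If that assumption holds, the settings in the report would need 16 GPUs while only 8 are visible, and NPROC_PER_NODE=4 with --tensor_parallel_size 2 would fit. Whether this explains the c10d timeout above is unverified.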

Your hardware and system info
The same problem occurs with every version, and it is reproducible.
