
swift infer with tp=2 does not support batch inference for the deepseek-r1-distill-qwen series and qwq32B models #4130


Open
phoenixbai opened this issue May 8, 2025 · 0 comments


@phoenixbai

Describe the bug
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
swift infer \
    --model "/mnt/modelhub/QwQ-32B" \
    --infer_backend vllm \
    --ddp_backend nccl \
    --tensor_parallel_size 2 \
    --val_dataset "input_data_v2_20250506.jsonl" \
    --gpu_memory_utilization 0.9 \
    --max_model_len 4096 \
    --max_length 4096 \
    --max_batch_size 400 \
    --dataset_num_proc 16 \
    --result_path "pred_a100_qwq32B_1w_result_v2_20250506.jsonl" \
    --system "prompt_v2.txt" \
    --remove_unused_columns False

Error:
[E506 17:21:08.668667255 socket.cpp:1011] [c10d] The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
[W506 17:21:08.668906079 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 33.145.123.167:40341 - retrying (try=0, timeout=600000ms, delay=51298ms): The client socket has timed out after 600000ms while trying to connect to (33.145.123.167, 40341).
Exception raised from throwTimeoutError at ../torch/csrc/distributed/c10d/socket.cpp:1013 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fe2a5aab446 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
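
The log above shows a c10d TCP client timing out while connecting to the TCPStore at (33.145.123.167, 40341). As a rough diagnostic, the sketch below checks whether a given host and port accept a TCP connection at all. `can_connect` is a hypothetical helper written for illustration, not part of swift, vLLM, or PyTorch, and it assumes the check is run from the same machine while the job is still waiting, since the port is chosen per run.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability and
    says nothing about whether a TCPStore is actually serving on that port.
    """
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `can_connect("33.145.123.167", 40341)` returning False while the job hangs would suggest that nothing is listening on that address or that the address is not reachable from the worker. A True result would point away from basic network reachability as the cause.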

Your hardware and system info
The same problem occurs on every version; it is reproducible.
