[QUESTION] Does FLUX support RoCE NICs? #51

Open
xuzhenguoloveyjh opened this issue Feb 15, 2025 · 1 comment

@xuzhenguoloveyjh

Description
I hit an error while running a cross-node test using NVSHMEM. The script I used is test_ag_kernel_crossnode.py, and the error output is below.

/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed

I tested on a 2-node A100 cluster in which each node has 8 GPUs and a RoCE NIC. What could be the possible reasons?
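
For readers triaging a log like the one above, here is a minimal Python sketch that splits each error line into file, line number, status, and message, and picks out the first entry. It assumes the first line printed is the innermost failure and the later lines are its callers, which matches the order shown here. It also assumes that the status printed for `ibv_modify_qp` is a Linux errno value (110 is `ETIMEDOUT` on Linux), since libibverbs documents that call as returning errno on failure. The helper names `parse_nvshmem_errors` and `root_cause` are made up for illustration and are not part of NVSHMEM or FLUX.

```python
import re

# Matches lines of the form "<file>:<line>: non-zero status: <code> <message>".
LOG_LINE = re.compile(
    r"^(?P<file>\S+):(?P<line>\d+): non-zero status: (?P<status>\d+) (?P<message>.+)$"
)

# Assumption: the ibv_modify_qp status is a Linux errno value.
# Other statuses in the log are left undecoded on purpose.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}


def parse_nvshmem_errors(log_text):
    """Return one dict per recognised error line, in log order."""
    entries = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # ignore anything that is not an error line
        entries.append(
            {
                "file": match["file"],
                "line": int(match["line"]),
                "status": int(match["status"]),
                "message": match["message"],
            }
        )
    return entries


def root_cause(entries):
    """Return the first entry (assumed innermost failure), or None if empty."""
    return entries[0] if entries else None


SAMPLE = """\
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:422: non-zero status: 110 ibv_modify_qp failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1437: non-zero status: 7 ep_connect failed
/flux/3rdparty/nvshmem/src/modules/transport/ibrc/ibrc.cpp:1504: non-zero status: 7 transport create connect failed
/flux/3rdparty/nvshmem/src/host/transport/transport.cpp:394: non-zero status: 7 connect EPS failed
/flux/3rdparty/nvshmem/src/host/init/init.cu:981: non-zero status: 7 nvshmem setup connections failed
"""

first = root_cause(parse_nvshmem_errors(SAMPLE))
print(first["message"], LINUX_ERRNO_NAMES.get(first["status"], "unknown"))
```

If the errno reading is right, a timeout while moving the queue pair between states would point at connectivity or addressing between the two nodes rather than at the FLUX kernel itself, but the thread does not confirm this.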

@houqi
Collaborator

houqi commented Mar 6, 2025

FLUX has not been tested on RoCE NICs. This may be a problem with NVSHMEM itself. Can you run the NVSHMEM examples with nvshmrun on the RoCE NIC?
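
Alongside running the NVSHMEM examples, it may help to check which GID table entries the RoCE NIC exposes on each node. The sketch below reads the Linux sysfs files under `/sys/class/infiniband/<device>/ports/<port>/gid_attrs/types/` and lists the entries whose type is `RoCE v2`. It assumes a Linux host with the RDMA drivers loaded. The function names are hypothetical. As far as I know, NVSHMEM documents an `NVSHMEM_IB_GID_INDEX` environment variable for choosing the GID index on RoCE, but please verify that against the NVSHMEM documentation for your version. Nothing in this thread confirms that the GID index is the cause of the error.

```python
from pathlib import Path


def list_gid_types(sysfs_root="/sys/class/infiniband"):
    """Return sorted (device, port, gid_index, gid_type) tuples for populated GID slots."""
    results = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return results  # no RDMA devices visible on this host
    for type_file in root.glob("*/ports/*/gid_attrs/types/*"):
        # Layout: <root>/<device>/ports/<port>/gid_attrs/types/<index>
        device = type_file.parts[-6]
        port = type_file.parts[-4]
        index = type_file.parts[-1]
        if not (port.isdigit() and index.isdigit()):
            continue
        try:
            gid_type = type_file.read_text().strip()
        except OSError:
            # Unpopulated GID slots can fail on read; skip them.
            continue
        if gid_type:
            results.append((device, int(port), int(index), gid_type))
    return sorted(results)


def roce_v2_indices(entries):
    """Keep only the (device, port, gid_index) triples whose type is RoCE v2."""
    return [(dev, port, idx) for dev, port, idx, kind in entries if kind == "RoCE v2"]


if __name__ == "__main__":
    for device, port, index in roce_v2_indices(list_gid_types()):
        print(f"{device} port {port}: GID index {index} is RoCE v2")
```

Comparing this output across the two nodes shows whether both expose a RoCE v2 entry at the same index, which is one thing worth ruling out before concluding that the problem is in NVSHMEM or FLUX.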
