SEGV in ompi_request_default_test_all() when triggering IPoIB networking problem during ucc_perftest run #13191
Comments
@bfaccini Was OMPI built with UCC? If so, I think you need to disable UCC via --mca coll ^ucc to run the UCC perf tests.
It looks like :
I don't get it: do you mean I need to use "--mca coll ^ucc" at build time or at run time? And in any case, don't you think the analysis and the proposed fix are OK?
I'm not sure the patch is enough. UCC uses UCX to move data, and UCX is also used by OMPI as the default PML. Since the collective framework is initialized after the PML, I wonder how connection establishment succeeded during PML setup but then failed during UCC setup. If you disable UCC, does this test succeed, and is the connection between your processes correctly established?
Actually, never mind: I confirmed with my colleagues that the presence of UCC in OMPI's coll framework won't affect the functionality of the test itself.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.7rc1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Git clone.
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
With the following reproducer with ucc_perftest :
that leads to corefiles being created on several nodes.
I was therefore able to dig into these core files to try to understand the cause of the SEGVs. They are likely due to a wrong or missing error path in the OMPI/UCC code, which is hit when an IPoIB networking issue triggers the original error messages above.
Here are my findings.
The fully unwound stack is as follows:
where you can see that the unresolved symbol/frame in the previously detailed stack is in fact in oob_allgather_test().
The reason for the SEGV is:
where reqs[0] is garbage when it is dereferenced:
Looking at the corresponding source code in "ompi/mca/coll/ucc/coll_ucc_module.c" :
and just to be complete :
Based on all of this, it appears that the following patch (against v4.1.7rc1, the fairly recent OMPI version we are running) would stop OMPI/UCC from dumping core by gracefully handling any error during isend/irecv:
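To illustrate the general idea only (this is not the patch itself, and not Open MPI or UCC code), here is a minimal self-contained sketch of the error path described above. Every name in it (fake_request, sk_isend, sk_irecv, sk_allgather_progress, and so on) is hypothetical, and the mock assumes that a failed post leaves the request slot unfilled, which is what the garbage reqs[0] in the core file suggests. The point is simply that the request slots are reset before each post, the return code of each post is checked, and the progress function returns an error instead of later testing requests that were never created.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not Open MPI or UCC symbols. */
typedef struct { int complete; } fake_request;

typedef enum { SK_OK = 0, SK_INPROGRESS = 1, SK_ERROR = -1 } sk_status;

typedef struct {
    int           iter;     /* next exchange step to post */
    int           nsteps;   /* total number of exchange steps */
    fake_request *reqs[2];  /* [0] = send, [1] = receive of the last posted step */
} sk_allgather_req;

static fake_request sk_send_slot = { 1 };  /* mock requests, already complete */
static fake_request sk_recv_slot = { 1 };
static int sk_fail_send_at = -1;           /* step at which the mock send fails */

/* Mock non-blocking send. Assumption: on failure, *out is left untouched. */
static int sk_isend(int step, fake_request **out)
{
    if (step == sk_fail_send_at) {
        return -1;
    }
    *out = &sk_send_slot;
    return 0;
}

/* Mock non-blocking receive; always succeeds in this sketch. */
static int sk_irecv(int step, fake_request **out)
{
    (void)step;
    *out = &sk_recv_slot;
    return 0;
}

/* Only called when both slots hold successfully posted requests. */
static int sk_pending(const sk_allgather_req *r)
{
    return !(r->reqs[0]->complete && r->reqs[1]->complete);
}

static sk_status sk_allgather_progress(sk_allgather_req *r)
{
    assert(r != NULL);
    for (; r->iter < r->nsteps; r->iter++) {
        if (r->iter > 0 && sk_pending(r)) {
            return SK_INPROGRESS;          /* previous step not finished yet */
        }
        r->reqs[0] = NULL;                 /* never keep stale pointers around */
        r->reqs[1] = NULL;
        if (sk_isend(r->iter, &r->reqs[0]) != 0 ||
            sk_irecv(r->iter, &r->reqs[1]) != 0) {
            return SK_ERROR;               /* report the failure to the caller */
        }
    }
    if (r->nsteps > 0 && sk_pending(r)) {
        return SK_INPROGRESS;              /* last step still in flight */
    }
    return SK_OK;
}

/* Small driver: run the exchange once, with the send failing at step fail_at. */
static sk_status sk_run(int nsteps, int fail_at)
{
    sk_allgather_req r = { 0, nsteps, { NULL, NULL } };
    sk_fail_send_at = fail_at;
    return sk_allgather_progress(&r);
}
```

In this sketch the caller is expected to treat SK_ERROR as terminal and not call the progress function again on the same request. A real fix would also have to decide what to do with a send that was posted successfully when the matching receive fails, which the mock does not address.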