
SEGV in ompi_request_default_test_all() when triggering IPoIB networking problem during ucc_perftest run #13191


Open
bfaccini opened this issue Apr 14, 2025 · 6 comments

Comments

@bfaccini
Contributor

bfaccini commented Apr 14, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.7rc1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 24.04.2 LTS (Noble Numbat), Kernel 6.8.0-57-generic
  • Computer hardware:
  • Network type: IPoIB

Details of the problem

With the following reproducer using ucc_perftest:

$ srun -A admin -p admin -N64 --mpi=pmix --ntasks-per-node=8 --container-image=<container-image> env UCX_TLS=self,tcp ucc_perftest -c alltoall -m host -b 1048576 -e 2147483648 -n 2 
srun: job 357102 queued and waiting for resources 
srun: job 357102 has been allocated resources 
[1744257072.724104] [node0190:1259783:0] sock.c:334 UCX ERROR connect(fd=252, dest_addr=<ip address>) failed: Connection timed out 
[node0190.<domain>:1259783] pml_ucx.c:424 Error: ucp_ep_create(proc=496) failed: Destination is unreachable 
[node0190.<domain>:1259783] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 496 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 7
.................

This leads to core files being created on several nodes.
I have been able to dig into these core files to understand the cause of the SEGVs, which appears to be a wrong/missing error path in the OMPI/UCC code that is exercised when the IPoIB networking issue triggers the original errors/messages above.

Here are my findings.
The fully unwound stack looks like this:

#0  ompi_request_default_test_all (count=2, requests=0x555555a2f228, completed=0x7fffffffc5c4, statuses=0x0) at request/req_test.c:187
#1  0x00007ffff50139ac in oob_allgather_test (req=0x555555a2f200) at coll_ucc_module.c:182
#2  0x00007ffff7f8ea5c in ucc_core_addr_exchange (context=context@entry=0x555555a2e990, oob=oob@entry=0x555555a2e9a8, addr_storage=addr_storage@entry=0x555555a2eaa0) at core/ucc_context.c:461
#3  0x00007ffff7f8f657 in ucc_context_create_proc_info (lib=0x5555559d12b0, params=params@entry=0x7fffffffc960, config=0x555555a2e840, context=context@entry=0x7ffff50213c8 <mca_coll_ucc_component+392>, proc_info=0x7ffff7fbca60 <ucc_local_proc>)
    at core/ucc_context.c:723
#4  0x00007ffff7f901f0 in ucc_context_create (lib=<optimized out>, params=params@entry=0x7fffffffc960, config=<optimized out>, context=context@entry=0x7ffff50213c8 <mca_coll_ucc_component+392>) at core/ucc_context.c:866
#5  0x00007ffff5013cb1 in mca_coll_ucc_init_ctx () at coll_ucc_module.c:302
#6  0x00007ffff501583f in mca_coll_ucc_comm_query (comm=0x55555557d240 <ompi_mpi_comm_world>, priority=0x7fffffffcb6c) at coll_ucc_module.c:488
#7  0x00007ffff7ee5e4c in query_2_0_0 (module=<synthetic pointer>, priority=0x7fffffffcb6c, comm=0x55555557d240 <ompi_mpi_comm_world>, component=0x7ffff5021240 <mca_coll_ucc_component>) at base/coll_base_comm_select.c:540
#8  query (module=<synthetic pointer>, priority=0x7fffffffcb6c, comm=<optimized out>, component=0x7ffff5021240 <mca_coll_ucc_component>) at base/coll_base_comm_select.c:523
#9  check_one_component (module=<synthetic pointer>, component=0x7ffff5021240 <mca_coll_ucc_component>, comm=<optimized out>) at base/coll_base_comm_select.c:486
#10 check_components (comm=comm@entry=0x55555557d240 <ompi_mpi_comm_world>, components=<optimized out>) at base/coll_base_comm_select.c:406
#11 0x00007ffff7ee6446 in mca_coll_base_comm_select (comm=0x55555557d240 <ompi_mpi_comm_world>) at base/coll_base_comm_select.c:114
#12 0x00007ffff7f33613 in ompi_mpi_init (argc=<optimized out>, argc@entry=0, argv=<optimized out>, argv@entry=0x0, requested=0, provided=0x7fffffffcdf4, reinit_ok=reinit_ok@entry=false) at runtime/ompi_mpi_init.c:957
#13 0x00007ffff7ed6c2c in PMPI_Init (argc=0x0, argv=0x0) at pinit.c:69
#14 0x000055555555dbf4 in ucc_pt_bootstrap_mpi::ucc_pt_bootstrap_mpi() ()
#15 0x0000555555565666 in ucc_pt_comm::ucc_pt_comm(ucc_pt_comm_config) ()
#16 0x0000555555558f2a in main ()

Here you can see that the unresolved symbol/frame in the previously detailed stack is in fact in oob_allgather_test().

The reason for the SEGV is the following:

(gdb) p/x *(oob_allgather_req_t *)0x555555a2f200
$1 = {sbuf = 0x555555a2ea00, rbuf = 0x555555a710c0, oob_coll_ctx = 0x55555557d240, msglen = 0x8, iter = 0x1, reqs = {0x726568, 0x555555a8fa48}}

where reqs[0] contains garbage that then gets dereferenced:

(gdb) p/x $rip
$3 = 0x7ffff7eb39e8
(gdb) x/10i ($rip - 0x18)
   0x7ffff7eb39d0 <ompi_request_default_test_all+48>:   cmpq   $0x1,0x58(%rax)
   0x7ffff7eb39d5 <ompi_request_default_test_all+53>:   je     0x7ffff7eb39f0 <ompi_request_default_test_all+80>
   0x7ffff7eb39d7 <ompi_request_default_test_all+55>:   lea    0x1(%r12),%rax
   0x7ffff7eb39dc <ompi_request_default_test_all+60>:   cmp    %rax,%rdi
   0x7ffff7eb39df <ompi_request_default_test_all+63>:   je     0x7ffff7eb39fe <ompi_request_default_test_all+94>
   0x7ffff7eb39e1 <ompi_request_default_test_all+65>:   mov    %rax,%r12
   0x7ffff7eb39e4 <ompi_request_default_test_all+68>:   mov    (%rbx,%r12,8),%rax
=> 0x7ffff7eb39e8 <ompi_request_default_test_all+72>:   mov    0x60(%rax),%esi
   0x7ffff7eb39eb <ompi_request_default_test_all+75>:   cmp    $0x1,%esi
   0x7ffff7eb39ee <ompi_request_default_test_all+78>:   jne    0x7ffff7eb39d0 <ompi_request_default_test_all+48>
(gdb) x/gx ($rax + 0x60)
0x7265c8:       Cannot access memory at address 0x7265c8
(gdb) p/x $rbx + $r12 * 0x8
$4 = 0x555555a2f228
(gdb) x/gx ($rbx + $r12 * 0x8)
0x555555a2f228: 0x0000000000726568
(gdb) p/x $rax
$5 = 0x726568
(gdb) x/gx ($rax + 0x60)
0x7265c8:       Cannot access memory at address 0x7265c8
(gdb)
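
To make the failure mode explicit, here is a minimal standalone sketch (this is not the actual OMPI source; the struct layout and the field name are assumptions chosen only to mirror the disassembly above). The test loop loads each request pointer and then reads a field at offset 0x60 from it, so a stale value left in reqs[0] faults on that first field access:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for ompi_request_t: only the layout matters for this illustration. */
typedef struct fake_request {
    char pad[0x60];       /* padding so the flag sits at offset 0x60, where the faulting load happens */
    int  complete_flag;   /* hypothetical completion flag */
} fake_request_t;

/* Mirrors the disassembled loop: load requests[i], then read the flag at +0x60. */
static void test_all(size_t count, fake_request_t **requests, int *completed)
{
    size_t done = 0;
    for (size_t i = 0; i < count; i++) {
        fake_request_t *r = requests[i];   /* mov (%rbx,%r12,8),%rax */
        if (r->complete_flag == 1) {       /* mov 0x60(%rax),%esi -> SEGV when r is garbage */
            done++;
        }
    }
    *completed = (done == count);
}

int main(void)
{
    fake_request_t good = { .complete_flag = 1 };
    /* reqs[0] holds the same garbage value seen in the core, reqs[1] is valid. */
    fake_request_t *reqs[2] = { (fake_request_t *)0x726568, &good };
    int completed = 0;
    test_all(2, reqs, &completed);         /* faults here, just like the backtrace */
    printf("completed=%d\n", completed);
    return 0;
}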

Looking at the corresponding source code in "ompi/mca/coll/ucc/coll_ucc_module.c":

141
142 typedef struct oob_allgather_req{
143     void           *sbuf;
144     void           *rbuf;
145     void           *oob_coll_ctx;
146     size_t          msglen;
147     int             iter;
148     ompi_request_t *reqs[2];
149 } oob_allgather_req_t;
150
151 static ucc_status_t oob_allgather_test(void *req)
152 {
153     oob_allgather_req_t *oob_req = (oob_allgather_req_t*)req;
154     ompi_communicator_t *comm    = (ompi_communicator_t *)oob_req->oob_coll_ctx;
155     char                *tmpsend = NULL;
156     char                *tmprecv = NULL;
157     size_t               msglen  = oob_req->msglen;
158     int                  probe_count = 5;
159     int rank, size, sendto, recvfrom, recvdatafrom,
160         senddatafrom, completed, probe;
161
162     size = ompi_comm_size(comm);
163     rank = ompi_comm_rank(comm);
164     if (oob_req->iter == 0) {
165         tmprecv = (char*) oob_req->rbuf + (ptrdiff_t)rank * (ptrdiff_t)msglen;
166         memcpy(tmprecv, oob_req->sbuf, msglen);
167     }
168     sendto   = (rank + 1) % size;
169     recvfrom = (rank - 1 + size) % size;
170     for (; oob_req->iter < size - 1; oob_req->iter++) {
171         if (oob_req->iter > 0) {             <<<< iter is 0 for the 1st loop iteration ...
172             probe = 0;
173             do {
174                 ompi_request_test_all(2, oob_req->reqs, &completed, MPI_STATUS_IGNORE);
                    <<<<<< during the 2nd iteration (iter == 1), ompi_request_test_all() is called with a garbled reqs[0] !!
175                 probe++;
176             } while (!completed && probe < probe_count);
177             if (!completed) {
178                 return UCC_INPROGRESS;
179             }
180         }
181         recvdatafrom = (rank - oob_req->iter - 1 + size) % size;
182         senddatafrom = (rank - oob_req->iter + size) % size;
183         tmprecv = (char*)oob_req->rbuf + (ptrdiff_t)recvdatafrom * (ptrdiff_t)msglen;
184         tmpsend = (char*)oob_req->rbuf + (ptrdiff_t)senddatafrom * (ptrdiff_t)msglen;
185         MCA_PML_CALL(isend(tmpsend, msglen, MPI_BYTE, sendto, MCA_COLL_BASE_TAG_UCC,
186                            MCA_PML_BASE_SEND_STANDARD, comm, &oob_req->reqs[0]));
            <<<<<< isend triggers an error, so reqs[0] is not populated !!
187         MCA_PML_CALL(irecv(tmprecv, msglen, MPI_BYTE, recvfrom,
188                            MCA_COLL_BASE_TAG_UCC, comm, &oob_req->reqs[1]));
            <<<<<< irecv does not report an error, so reqs[1] is populated.
189     }
190     probe = 0;
191     do {
192         ompi_request_test_all(2, oob_req->reqs, &completed, MPI_STATUS_IGNORE);
193         probe++;
194     } while (!completed && probe < probe_count);
195     if (!completed) {
196         return UCC_INPROGRESS;
197     }
198     return UCC_OK;
199 }
200
201 static ucc_status_t oob_allgather_free(void *req)
202 {
203     free(req);
204     return UCC_OK;
205 }
206
207 static ucc_status_t oob_allgather(void *sbuf, void *rbuf, size_t msglen,
208                                   void *oob_coll_ctx, void **req)
209 {
210     oob_allgather_req_t *oob_req = malloc(sizeof(*oob_req));
211     oob_req->sbuf                = sbuf;
212     oob_req->rbuf                = rbuf;
213     oob_req->msglen              = msglen;
214     oob_req->oob_coll_ctx        = oob_coll_ctx;
215     oob_req->iter                = 0;
216     *req                         = oob_req;
217     return UCC_OK;
218 }
219

And just to be complete, in "ompi/request/request.h":

#define ompi_request_test_all   (ompi_request_functions.req_test_all)

(gdb) x/i ompi_request_functions.req_test_all

   0x7ffff7eb39a0 <ompi_request_default_test_all>:      endbr64

Based on all of this, it appears that the following patch (against v4.1.7rc1, the fairly recent OMPI version we are running) would keep OMPI/UCC from dumping core by gracefully handling any error during isend/irecv:

~/ompi$ git status
HEAD detached at v4.1.7rc1
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)

        modified:   ompi/mca/coll/ucc/coll_ucc_module.c

no changes added to commit (use "git add" and/or "git commit -a")
~/ompi$ git diff
diff --git a/ompi/mca/coll/ucc/coll_ucc_module.c b/ompi/mca/coll/ucc/coll_ucc_module.c
index 1686697618..dfa2674a3d 100644
--- a/ompi/mca/coll/ucc/coll_ucc_module.c
+++ b/ompi/mca/coll/ucc/coll_ucc_module.c
@@ -158,6 +158,7 @@ static ucc_status_t oob_allgather_test(void *req)
     int                  probe_count = 5;
     int rank, size, sendto, recvfrom, recvdatafrom,
         senddatafrom, completed, probe;
+    int rc;
 
     size = ompi_comm_size(comm);
     rank = ompi_comm_rank(comm);
@@ -182,10 +183,12 @@ static ucc_status_t oob_allgather_test(void *req)
         senddatafrom = (rank - oob_req->iter + size) % size;
         tmprecv = (char*)oob_req->rbuf + (ptrdiff_t)recvdatafrom * (ptrdiff_t)msglen;
         tmpsend = (char*)oob_req->rbuf + (ptrdiff_t)senddatafrom * (ptrdiff_t)msglen;
-        MCA_PML_CALL(isend(tmpsend, msglen, MPI_BYTE, sendto, MCA_COLL_BASE_TAG_UCC,
+        rc = MCA_PML_CALL(isend(tmpsend, msglen, MPI_BYTE, sendto, MCA_COLL_BASE_TAG_UCC,
                            MCA_PML_BASE_SEND_STANDARD, comm, &oob_req->reqs[0]));
-        MCA_PML_CALL(irecv(tmprecv, msglen, MPI_BYTE, recvfrom,
+        if (OMPI_SUCCESS != rc) return rc;
+        rc = MCA_PML_CALL(irecv(tmprecv, msglen, MPI_BYTE, recvfrom,
                            MCA_COLL_BASE_TAG_UCC, comm, &oob_req->reqs[1]));
+        if (OMPI_SUCCESS != rc) return rc;
     }
     probe = 0;
     do {
@@ -213,6 +216,8 @@ static ucc_status_t oob_allgather(void *sbuf, void *rbuf, size_t msglen,
     oob_req->msglen              = msglen;
     oob_req->oob_coll_ctx        = oob_coll_ctx;
     oob_req->iter                = 0;
+    oob_req->reqs[0]             = NULL;
+    oob_req->reqs[1]             = NULL;
     *req                         = oob_req;
     return UCC_OK;
 }
~/ompi$
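
As a side note on the proposed change: oob_allgather_test() returns a ucc_status_t, while rc here carries an OMPI error code, so a variant of the same patch could translate the failure into a UCC status instead of returning rc directly. A minimal sketch of that variant (the choice of UCC_ERR_NO_MESSAGE is only illustrative, not the actual fix):

        rc = MCA_PML_CALL(isend(tmpsend, msglen, MPI_BYTE, sendto, MCA_COLL_BASE_TAG_UCC,
                                MCA_PML_BASE_SEND_STANDARD, comm, &oob_req->reqs[0]));
        if (OMPI_SUCCESS != rc) {
            return UCC_ERR_NO_MESSAGE;   /* illustrative UCC status for a failed PML call */
        }
        rc = MCA_PML_CALL(irecv(tmprecv, msglen, MPI_BYTE, recvfrom,
                                MCA_COLL_BASE_TAG_UCC, comm, &oob_req->reqs[1]));
        if (OMPI_SUCCESS != rc) {
            return UCC_ERR_NO_MESSAGE;
        }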
@janjust
Contributor

janjust commented Apr 14, 2025

@bfaccini was ompi built with ucc? if so, I think you need to disable ucc via --mca coll ^ucc to run ucc perf tests.
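
If the suggestion is about run time, the standard Open MPI way to pass that through the environment with an srun launch like the one above would be roughly as follows (the launch line below is only illustrative, reusing the options from the original reproducer):

env OMPI_MCA_coll=^ucc UCX_TLS=self,tcp ucc_perftest -c alltoall -m host -b 1048576 -e 2147483648 -n 2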

@bfaccini
Contributor Author

#13194

@bfaccini
Contributor Author

bfaccini commented Apr 14, 2025

was ompi built with ucc?

It looks like it was:

$ ompi_info
                 Package: Open MPI root@sharp-ci-01 Distribution
                Open MPI: 4.1.7rc1
  Open MPI repo revision: v4.1.5-175-ga2335dd1c5
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.1.7rc1
  Open RTE repo revision: v4.1.5-175-ga2335dd1c5
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.1.7rc1
      OPAL repo revision: v4.1.5-175-ga2335dd1c5
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.1.7rc1
                  Prefix: /opt/hpcx/ompi
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: sharp-ci-01
           Configured by: root
           Configured on: Wed Dec 25 17:24:17 UTC 2024
          Configure host: sharp-ci-01
  Configure command line: '--prefix=/build-result/hpcx-v2.22-gcc-inbox-ubuntu24.04-cuda12-x86_64/ompi' '--with-libevent=internal' '--enable-mpi1-compatibility' '--without-xpmem' '--with-cuda=/hpc/local/oss/cuda12.6.1/ubuntu24.04' '--with-slurm' '--with-platform=contrib/platform/mellanox/optimized' '--with-hcoll=/build-result/hpcx-v2.22-gcc-inbox-ubuntu24.04-cuda12-x86_64/hcoll' '--with-ucx=/build-result/hpcx-v2.22-gcc-inbox-ubuntu24.04-cuda12-x86_64/ucx' '--with-ucc=/build-result/hpcx-v2.22-gcc-inbox-ubuntu24.04-cuda12-x86_64/ucc'
                Built by: 
                Built on: Wed Dec 25 17:37:46 UTC 2024
              Built host: sharp-ci-01
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 13.2.0
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.7)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.7)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.7)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.7)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.7)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.7)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.7)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.7)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.7)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.7)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.7)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.7)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.7)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.7)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.7)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.7)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.7)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.7)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.7)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.7)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.7)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.7)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.7)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.7)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.7)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.7)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.7)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.7)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.7)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.7)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.7)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.7)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.7)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.7)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.7)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.7)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.7)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.7)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.7)

I think you need to disable ucc via --mca coll ^ucc to run ucc perf tests

I don't get it. Do you mean I need to use "--mca coll ^ucc" at build time or at run time?

And anyway, don't you think the analysis and fix proposal are OK?

@bosilca
Member

bosilca commented Apr 17, 2025

I'm not sure the patch is enough. UCC uses UCX to move data, and UCX is also used by OMPI as the default PML. Since the collective framework is initialized after the PML, I wonder how the connection establishment succeeded during the PML setup but then failed during the UCC setup.

If you disable UCC, does this test succeed, and is the connection between your processes correctly established?

@janjust
Contributor

janjust commented Apr 17, 2025

Actually, never mind: I confirmed with my colleagues that the presence of UCC in OMPI's coll framework won't affect the functionality of the test itself.

@bfaccini
Contributor Author

bfaccini commented May 6, 2025

PR #13194 has been abandoned; the patch has now been pushed as part of PR #13238.
