SIGFPE in MPI_INIT possibly related to XML #13291

Closed
@mankoff

Description

I've just set up a new machine running Debian 13 (Trixie), which ships OpenMPI 5.0.7-1 and gfortran 14.2.0. The hardware is an AMD Ryzen 7840U.

I think I've isolated the issue to OpenMPI or one of its dependencies (an XML library?). Here's an MWE:

program hello_mpi
  use mpi
  use ieee_exceptions, only: ieee_divide_by_zero, ieee_invalid, ieee_overflow, ieee_set_halting_mode
  implicit none

  integer :: ierr, rank, size

#ifdef TEST
  call ieee_set_halting_mode(ieee_divide_by_zero, .false.)
  call ieee_set_halting_mode(ieee_invalid, .false.)
  call ieee_set_halting_mode(ieee_overflow, .false.)
#endif

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  print *, "Hello from rank", rank, "of", size

#ifdef TEST
  call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
  call ieee_set_halting_mode(ieee_invalid, .true.)
  call ieee_set_halting_mode(ieee_overflow, .true.)
#endif

  call MPI_Finalize(ierr)
end program hello_mpi

If I compile with -D TEST, so that FPE halting is switched off before MPI_Init is called, it runs:

$ mpif90 -ffpe-trap=invalid,zero,overflow -D TEST hello.F90 ; mpirun -np 1 ./a.out 
 Hello from rank           0 of           1

If I compile without TEST, meaning FPE traps are enabled when calling MPI_Init(ierr), I get this stack trace:

$ mpif90 -ffpe-trap=invalid,zero,overflow hello.F90 ; mpirun -np 1 ./a.out 

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x1526776232ba in ???
#1  0x152677622375 in ???
#2  0x152677359def in ???
#3  0x1526752edb43 in xmlXPathInit
#4  0x1526752a6892 in xmlInitParser
#5  0x152675281ec4 in xmlCheckVersion
#6  0x15267797bc61 in ???
#7  0x15267692b03b in ???
#8  0x15267691d328 in ???
#9  0x152676650789 in ???
#10  0x152676651163 in pmix_hwloc_setup_topology
#11  0x15267665b2d9 in PMIx_Init
#12  0x152676e9741b in ompi_rte_init
#13  0x152676e9b429 in ???
#14  0x152676e9c187 in ompi_mpi_instance_init
#15  0x152676e9371f in ompi_mpi_init
#16  0x152676ec449e in MPI_Init
#17  0x152677959bc9 in mpi_init
#18  0x56233d65321b in MAIN__
#19  0x56233d653368 in main
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 535607 on node fw13 exited on
signal 8 (Floating point exception).
--------------------------------------------------------------------------

Oddly, if I compile with the -g flag, the backtrace contains less information. When I step through the program in gdb, it doesn't take me deep enough into the stack to debug this. I'm not sure what the next steps are, but I'm happy to help dig deeper if anyone has suggestions.
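
A possible workaround, generalizing the TEST branch of the MWE, is to save whatever halting modes are currently active, mask them only around MPI_Init, and restore them right afterwards. This is a sketch under assumptions, not a confirmed fix: it assumes MPI_Init is the only call that raises the exception (the MWE above re-enables the traps later, after the print), the names hello_mpi_masked, saved and nprocs are made up for illustration, and clearing the flags before restoring is an extra precaution rather than something the MWE showed to be necessary.

```fortran
program hello_mpi_masked
  use mpi
  use ieee_exceptions, only: ieee_usual, ieee_get_halting_mode, &
                             ieee_set_halting_mode, ieee_set_flag
  implicit none

  ! ieee_usual is the standard array [overflow, divide_by_zero, invalid],
  ! i.e. the same three exceptions passed to -ffpe-trap in the MWE.
  logical :: saved(size(ieee_usual))
  integer :: i, ierr, rank, nprocs

  ! Remember which exceptions currently halt, then mask all three.
  do i = 1, size(ieee_usual)
     call ieee_get_halting_mode(ieee_usual(i), saved(i))
     call ieee_set_halting_mode(ieee_usual(i), .false.)
  end do

  call MPI_Init(ierr)

  ! Clear flags that may have been raised inside MPI_Init (precaution),
  ! then put the halting modes back exactly as they were.
  do i = 1, size(ieee_usual)
     call ieee_set_flag(ieee_usual(i), .false.)
     call ieee_set_halting_mode(ieee_usual(i), saved(i))
  end do

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  print *, "Hello from rank", rank, "of", nprocs

  call MPI_Finalize(ierr)
end program hello_mpi_masked
```

It should build with the same mpif90 -ffpe-trap=invalid,zero,overflow command as the MWE, without -D TEST. Saving and restoring the modes, rather than hard-coding .true. afterwards, keeps the program's behavior unchanged when it is compiled without -ffpe-trap.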
