Closed
I've just set up a new machine with Debian 13 Trixie, which includes OpenMPI 5.0.7-1 and gfortran 14.2.0. The hardware is an AMD Ryzen 7840U.
I think I've isolated the issue to OpenMPI or one of its dependencies (xml?). Here's an MWE:
program hello_mpi
use mpi
use ieee_exceptions, only: ieee_divide_by_zero, ieee_invalid, ieee_overflow, ieee_set_halting_mode
implicit none
integer :: ierr, rank, size
#ifdef TEST
call ieee_set_halting_mode(ieee_divide_by_zero, .false.)
call ieee_set_halting_mode(ieee_invalid, .false.)
call ieee_set_halting_mode(ieee_overflow, .false.)
#endif
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
print *, "Hello from rank", rank, "of", size
#ifdef TEST
call ieee_set_halting_mode(ieee_divide_by_zero, .true.)
call ieee_set_halting_mode(ieee_invalid, .true.)
call ieee_set_halting_mode(ieee_overflow, .true.)
#endif
call MPI_Finalize(ierr)
end program hello_mpi
If I compile with -D TEST, so that the FPE traps are disabled around the MPI calls, it runs:
$ mpif90 -ffpe-trap=invalid,zero,overflow -D TEST hello.F90 ; mpirun -np 1 ./a.out
Hello from rank 0 of 1
If I compile without TEST, meaning FPE traps are enabled when calling MPI_Init(ierr), this is the stack trace:
$ mpif90 -ffpe-trap=invalid,zero,overflow hello.F90 ; mpirun -np 1 ./a.out
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x1526776232ba in ???
#1 0x152677622375 in ???
#2 0x152677359def in ???
#3 0x1526752edb43 in xmlXPathInit
#4 0x1526752a6892 in xmlInitParser
#5 0x152675281ec4 in xmlCheckVersion
#6 0x15267797bc61 in ???
#7 0x15267692b03b in ???
#8 0x15267691d328 in ???
#9 0x152676650789 in ???
#10 0x152676651163 in pmix_hwloc_setup_topology
#11 0x15267665b2d9 in PMIx_Init
#12 0x152676e9741b in ompi_rte_init
#13 0x152676e9b429 in ???
#14 0x152676e9c187 in ompi_mpi_instance_init
#15 0x152676e9371f in ompi_mpi_init
#16 0x152676ec449e in MPI_Init
#17 0x152677959bc9 in mpi_init
#18 0x56233d65321b in MAIN__
#19 0x56233d653368 in main
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 535607 on node fw13 exited on
signal 8 (Floating point exception).
--------------------------------------------------------------------------
Oddly, if I compile with the -g flag, there is less info. When I step through this, gdb doesn't take me deep enough in the stack to debug it. I'm not sure of next steps, but I'm happy to help dig deeper if anyone has any suggestions.
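For reference, the TEST workaround in the MWE could be generalized so that it restores whichever halting modes were active before the call instead of hard-coding .true. afterwards. This is only a sketch, assuming the SIGFPE is raised solely inside MPI_Init as the backtrace suggests. The module name mpi_fpe_guard and the subroutine init_mpi_without_fpe_traps are hypothetical names chosen for illustration, not part of MPI or gfortran:

```fortran
! Sketch only: mpi_fpe_guard and init_mpi_without_fpe_traps are
! hypothetical names, not part of MPI or the Fortran standard.
module mpi_fpe_guard
  use mpi
  use ieee_exceptions, only: ieee_divide_by_zero, ieee_invalid, ieee_overflow, &
                             ieee_get_halting_mode, ieee_set_halting_mode
  implicit none
contains
  subroutine init_mpi_without_fpe_traps(ierr)
    integer, intent(out) :: ierr
    logical :: was_zero, was_invalid, was_overflow

    ! Remember which traps are currently active (e.g. set by -ffpe-trap).
    call ieee_get_halting_mode(ieee_divide_by_zero, was_zero)
    call ieee_get_halting_mode(ieee_invalid, was_invalid)
    call ieee_get_halting_mode(ieee_overflow, was_overflow)

    ! Mask the traps while MPI_Init runs.
    call ieee_set_halting_mode(ieee_divide_by_zero, .false.)
    call ieee_set_halting_mode(ieee_invalid, .false.)
    call ieee_set_halting_mode(ieee_overflow, .false.)

    call MPI_Init(ierr)

    ! Restore the caller's original halting modes.
    call ieee_set_halting_mode(ieee_divide_by_zero, was_zero)
    call ieee_set_halting_mode(ieee_invalid, was_invalid)
    call ieee_set_halting_mode(ieee_overflow, was_overflow)
  end subroutine init_mpi_without_fpe_traps
end module mpi_fpe_guard
```

In the MWE, adding "use mpi_fpe_guard" and replacing "call MPI_Init(ierr)" with "call init_mpi_without_fpe_traps(ierr)" would take the place of both #ifdef TEST blocks. Unlike the MWE, the traps would be active again during MPI_Comm_rank and MPI_Comm_size.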