How is PRTE's failure detector supposed to work? #13215


Open
arielscht opened this issue Apr 27, 2025 · 8 comments


@arielscht

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from the source available at https://www.open-mpi.org/software/ompi/v5.0/

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04 and Ubuntu 22.04
  • Computer hardware:
  • Network type: LAN

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I wrote a simple MPI application for leader election with ULFM. It uses the simplest algorithm possible: the working process with the lowest rank is elected as the leader.
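To give an idea of the election step, here is a simplified sketch (not my exact code, just an illustration of the "lowest alive rank wins" idea using the ULFM extensions from <mpi-ext.h>):

/* Sketch only: elect the lowest-ranked process that has not been
 * reported as failed.  Uses the ULFM extensions (MPIX_*) from <mpi-ext.h>. */
#include <mpi.h>
#include <mpi-ext.h>

static int elect_leader(MPI_Comm comm)
{
    MPI_Group comm_grp, failed_grp;
    int size, leader = -1;

    /* Acknowledge the failures noticed so far and fetch them as a group. */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed_grp);

    MPI_Comm_group(comm, &comm_grp);
    MPI_Comm_size(comm, &size);

    /* The leader is the lowest rank of comm that is not in the failed group. */
    for (int r = 0; r < size && leader < 0; r++) {
        int in_failed;
        MPI_Group_translate_ranks(comm_grp, 1, &r, failed_grp, &in_failed);
        if (in_failed == MPI_UNDEFINED)   /* rank r has not been reported failed */
            leader = r;
    }

    MPI_Group_free(&failed_grp);
    MPI_Group_free(&comm_grp);
    return leader;
}

Of course, different processes may have acknowledged different failure sets at the moment they run this, which is exactly the kind of inconsistency I want to study.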

My goal is to observe the failure-detection behavior of the processes. For that, I tried enabling the PRTE failure detector, which supposedly uses a heartbeat mechanism whose period and timeout can be configured with --prtemca parameters.

I'm running the application on 2 nodes inside the same local network, and I was expecting processes on machine A to detect processes on machine B as faulty if I shut down the communication link between the two machines, but nothing happens.

The only way I've been able to detect failures so far has been by manually killing processes on the corresponding machines.

So I'm wondering why the failure detector is not reporting processes as failed even though it is obviously not receiving any heartbeat replies. It seems like the errmgr_detector_heartbeat_period and errmgr_detector_heartbeat_timeout parameters are not being taken into account at all.

I'm trying to investigate some fundamental distributed-systems concepts on Open MPI with ULFM. We know that failure detection can make mistakes in an asynchronous system subject to even a single process failure. To show this, I'm trying to demonstrate that different processes can detect different faults and cause an inconsistency while electing a leader, for example because of a communication link failure, since a failed process cannot be distinguished from a slow one or from a broken link.

It would be of great help if someone could tell me why I'm not getting any MPI_ERR_PROC_FAILED in the scenario above.

I'm running the application with the following flags:

$ mpirun --with-ft ulfm --hostfile hostfile.txt -n 4 /home/leader-election

I have already played around with the values of the heartbeat period and heartbeat timeout, but nothing was detected in any case.
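For example, one of the combinations I tried looked roughly like this (the specific values are only an illustration):

$ mpirun --with-ft ulfm --prtemca errmgr_detector_heartbeat_period 2 --prtemca errmgr_detector_heartbeat_timeout 5 --hostfile hostfile.txt -n 4 /home/leader-election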

Thank you in advance!

@rhc54
Contributor

rhc54 commented Apr 28, 2025

Not entirely clear what made you think that PRRTE would monitor heartbeats from a client app? Were you perhaps confused with some heartbeat method for detecting that a PRRTE daemon had failed (which is quite distinct from an app proc failing)?

FWIW: PRRTE does not have heartbeat detection in it, for app procs or its remote daemons. It relies on connection timeout/failure to detect that a daemon has died, which can (depending on system settings) take some seconds to occur.

If you want to detect app proc failure, then we rely on PMIx to provide that detection. If the proc fails, the PMIx connection between the proc and local PRRTE daemon will close - this is used to detect that the proc has failed. PRRTE will then (with ULFM requested) generate a PMIx error that the MPI library traps and uses to notify the remaining procs of the failure.

You don't need heartbeats for any of that - the loss of connection serves for failure detection. If you want to detect a "stalled" process, then you can use the PMIx heartbeat sensor. I don't believe there is any OMPI integration with that capability at this time, so you probably wouldn't be able to use it without modifying your app to use PMIx directly - doable, but requires a little knowledge beyond MPI.

Again, though, you don't need heartbeats for failure detection - only for detecting "stalled" procs.

@arielscht
Author

> Not entirely clear what made you think that PRRTE would monitor heartbeats from a client app? Were you perhaps confused with some heartbeat method for detecting that a PRRTE daemon had failed (which is quite distinct from an app proc failing)?

I was studying ULFM to understand how it is able to detect process failures. At first I understood that the implementation just provided the minimum set of methods required for designing fault-tolerant systems, with no restriction on which kind of FT strategy to use.

While looking at some examples, I noticed that Open MPI had some parameters to enable ft_detector and also to tune its heartbeat period and timeout.

In the Open MPI documentation, I found the MPI-level option below, which has an orange box saying it is deprecated and that failure detection has been moved to the PRTE level. So that's where my question comes from.
[screenshot of the deprecated MPI-level failure-detector option from the Open MPI documentation]

I'm not used to the Open MPI runtime environment, so I don't fully understand what PRRTE, PMIx, etc. are. I mentioned them based on the documentation.

However, looking at the open sockets and running processes, I noticed that each app process was connected to a prterun process (on the host machine) and to a prted process (on the remote machine). Given that, I expected these centralized processes on each machine to heartbeat every app process on their machine and notify the others about any timeout, which would be a fault suspicion.

Since you said there is no such thing, I would be grateful if you could explain what these PRTE-level options actually do:
[screenshot of the PRTE-level options from the documentation]

Thank you for your time!

@rhc54
Contributor

rhc54 commented Apr 28, 2025

Traveling today but will respond more later. Thanks for the info! Exactly what I was hoping you'd explain. The docs are outdated and incorrect. Those options don't exist.

App procs only connect to the local PRRTE daemon (prted), not to mpirun; that connection is made via the PMIx library. The daemons connect to mpirun.

App procs do connect to each other via MPI of course.

@rhc54
Contributor

rhc54 commented Apr 29, 2025

A little more info. PRRTE is the runtime that is launching and monitoring application procs. PMIx is a library used to provide the connection between an application proc and the local runtime daemon (prted). Each app proc uses PMIx to establish a TCP connection to the local prted - period. There are no connections to any other prted.

There are two categories of failures to consider. One or more application procs can fail. This is detected by the local prted when its PMIx library notifies it that the TCP connection to the local app proc has failed. The local prted then notifies mpirun, which generates an event notifying all other procs in the app - telling them the identity of the proc that failed. The MPI library in each proc then calls your registered error handler to tell the app that "proc A has died". The app is responsible for whatever happens in response.

The other failure category is a prted failing. This is what that documentation referred to: a prior attempt to use heartbeats between the daemons to detect daemon failure. Unfortunately, sys admins get very upset at runtimes that clog their management network with heartbeats at scale - that network is actually rather busy managing the cluster. So we don't do that - prted failure is detected by seeing the TCP connection between the daemons fail. We don't have a mechanism for recovering from that scenario, so PRRTE will simply terminate all running apps and exit.

Bottom line here: there are no heartbeats anywhere in the system. Failure detection is done by socket closure. At the MPI level, your sole need is to (a) register the error handler, and (b) do something when it is called.
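Roughly, at the MPI level that boils down to something like the following sketch (just to illustrate the shape; it assumes the ULFM build so that MPIX_ERR_PROC_FAILED and the MPIX_* calls from <mpi-ext.h> are available):

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED, MPIX_Comm_* (ULFM extensions) */

/* Called by the MPI library when an operation on comm returns an error. */
static void proc_failure_handler(MPI_Comm *comm, int *errcode, ...)
{
    int eclass;
    MPI_Error_class(*errcode, &eclass);
    if (eclass != MPIX_ERR_PROC_FAILED) {
        MPI_Abort(*comm, *errcode);        /* not a process failure: give up */
        return;
    }

    /* Learn which procs failed, then do whatever recovery the app wants
     * (re-elect a leader, shrink the communicator, etc.). */
    MPI_Group failed;
    int nfailed;
    MPIX_Comm_failure_ack(*comm);
    MPIX_Comm_failure_get_acked(*comm, &failed);
    MPI_Group_size(failed, &nfailed);
    printf("error handler: %d process(es) reported failed\n", nfailed);
    MPI_Group_free(&failed);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* (a) register the error handler ... */
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(proc_failure_handler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* ... (b) the application's communication goes here; a failed peer
     * surfaces as MPIX_ERR_PROC_FAILED on operations that involve it,
     * which triggers the handler above. */

    MPI_Finalize();
    return 0;
}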

@arielscht
Author

Everything is clear now; I'll keep all of this in mind while writing the code for my experiments.

Thank you very much for the explanation!

@arielscht
Author

Before closing the issue, one last question about the implementation.

Does Open MPI only use TCP for communication, or does it also support UDP? If so, is there a parameter I can set to use UDP instead?

I'm asking because failures are detected when the TCP connection closes; with UDP it would not be possible to detect any failure, since the protocol is connectionless.

@jsquyres
Member

Open MPI uses a bunch of different communications protocols.

For the run time (e.g., your questions have been about PRRTE), that's all TCP.

But for MPI, a variety of different protocols and network types are supported. Plain vanilla TCP is supported, but we do not have a UDP module (although some of the underlying HPC-class network support modules utilize connectionless and/or unreliable transports).
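For example, the plain TCP path can be selected explicitly with the usual MCA parameters (illustrative; which components are available depends on how your build was configured, and ./your_app is just a placeholder):

$ mpirun --mca pml ob1 --mca btl tcp,self -n 4 ./your_app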

@rhc54
Contributor

rhc54 commented Apr 30, 2025

Correct - the runtime is solely TCP. We did consider adding a UDP option at one point in the past, but it would have required a lot of work. We would have to create all the support infrastructure to make it reliable (retry, ack, etc.), deal with UDP routing differences, etc. In the end, it just didn't feel worth all the trouble.

If someone ever comes up with some must-have advantage, and is willing to provide the effort to do it, then we'd certainly be open to the idea.
