How is PRTE's failure detector supposed to work? #13215


Open
arielscht opened this issue Apr 27, 2025 · 8 comments


@arielscht

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.7

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from the source available at https://www.open-mpi.org/software/ompi/v5.0/

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04 and Ubuntu 22.04
  • Computer hardware:
  • Network type: LAN

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I wrote a simple MPI application for leader election with ULFM. It uses the simplest algorithm possible: the working process with the lowest rank is elected as the leader.
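To give an idea of the election step, here is a simplified sketch (not my exact code, just an illustration of the "lowest alive rank wins" idea using the ULFM extensions from <mpi-ext.h>):

/* Sketch only: elect the lowest-ranked process that has not been
 * reported as failed.  Uses the ULFM extensions (MPIX_*) from <mpi-ext.h>. */
#include <mpi.h>
#include <mpi-ext.h>

static int elect_leader(MPI_Comm comm)
{
    MPI_Group comm_grp, failed_grp;
    int size, leader = -1;

    /* Acknowledge the failures noticed so far and fetch them as a group. */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed_grp);

    MPI_Comm_group(comm, &comm_grp);
    MPI_Comm_size(comm, &size);

    /* The leader is the lowest rank of comm that is not in the failed group. */
    for (int r = 0; r < size && leader < 0; r++) {
        int in_failed;
        MPI_Group_translate_ranks(comm_grp, 1, &r, failed_grp, &in_failed);
        if (in_failed == MPI_UNDEFINED)   /* rank r has not been reported failed */
            leader = r;
    }

    MPI_Group_free(&failed_grp);
    MPI_Group_free(&comm_grp);
    return leader;
}

Of course, different processes may have acknowledged different failure sets at the moment they run this, which is exactly the kind of inconsistency I want to study.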

My goal is to observe the failure-detection behavior of the processes. For that, I tried enabling the PRTE failure detector, which supposedly uses a heartbeat mechanism whose period and timeout can be configured with --prtemca parameters.

I'm running the application on 2 nodes inside the same local network, and I was expecting processes on machine A to detect processes on machine B as faulty if I shut down the communication link between the two machines, but nothing happens.

The only way I've been able to detect failures so far has been by manually killing processes on the corresponding machines.

So I'm wondering why the failure detector is not reporting processes as failed even though it is obviously not receiving any heartbeat replies. It seems like the errmgr_detector_heartbeat_period and errmgr_detector_heartbeat_timeout parameters are not being taken into account at all.

I'm trying to investigate some fundamental distributed-systems concepts on Open MPI with ULFM. We know that failure detection can make mistakes in an asynchronous system subject to even a single process failure. To show this, I'm trying to demonstrate that different processes can detect different faults and cause an inconsistency while electing a leader, for example because of a communication link failure, since a failed process cannot be distinguished from a slow one or from a broken link.

It would be of great help if someone could tell me why I'm not getting any MPI_ERR_PROC_FAILED in the scenario above.

I'm running the application with the following flags:

$ mpirun --with-ft ulfm --hostfile hostfile.txt -n 4 /home/leader-election

I have already played around with the values of the heartbeat period and heartbeat timeout, but nothing was detected in any case.
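For example, one of the combinations I tried looked roughly like this (the specific values are only an illustration):

$ mpirun --with-ft ulfm --prtemca errmgr_detector_heartbeat_period 2 --prtemca errmgr_detector_heartbeat_timeout 5 --hostfile hostfile.txt -n 4 /home/leader-election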

Thank you in advance!

@rhc54
Contributor

rhc54 commented Apr 28, 2025

Not entirely clear what made you think that PRRTE would monitor heartbeats from a client app? Were you perhaps confused with some heartbeat method for detecting that a PRRTE daemon had failed (which is quite distinct from an app proc failing)?

FWIW: PRRTE does not have heartbeat detection in it, for app procs or its remote daemons. It relies on connection timeout/failure to detect that a daemon has died, which can (depending on system settings) take some seconds to occur.

If you want to detect app proc failure, then we rely on PMIx to provide that detection. If the proc fails, the PMIx connection between the proc and local PRRTE daemon will close - this is used to detect that the proc has failed. PRRTE will then (with ULFM requested) generate a PMIx error that the MPI library traps and uses to notify the remaining procs of the failure.

You don't need heartbeats for any of that - the loss of connection serves for failure detection. If you want to detect a "stalled" process, then you can use the PMIx heartbeat sensor. I don't believe there is any OMPI integration with that capability at this time, so you probably wouldn't be able to use it without modifying your app to use PMIx directly - doable, but requires a little knowledge beyond MPI.

Again, though, you don't need heartbeats for failure detection - only for detecting "stalled" procs.

@arielscht
Author

> Not entirely clear what made you think that PRRTE would monitor heartbeats from a client app? Were you perhaps confused with some heartbeat method for detecting that a PRRTE daemon had failed (which is quite distinct from an app proc failing)?

I was studying ULFM to understand how it is able to detect process failures. At first I understood that the implementation just provided the minimum set of methods required for designing fault-tolerant systems, with no restriction on which kind of FT strategy to use.

While looking at some examples, I noticed that Open MPI had some parameters to enable ft_detector and also to tune its heartbeat period and timeout.

In the Open MPI documentation, I found the MPI-level option below, which has an orange box saying it is deprecated and that failure detection has been moved to the PRTE level. So that's where my question comes from.
[screenshot of the deprecated MPI-level failure-detector option from the Open MPI documentation]

I'm not used to the Open MPI runtime environment, so I don't fully understand what PRRTE, PMIx, etc. are. I mentioned them based on the documentation.

However, looking at the open sockets and running processes, I noticed that each app process was connected to a prterun process (on the host machine) and to a prted process (on the remote machine). Given that, I expected these centralized processes on each machine to heartbeat every app process on their machine and notify the others about any timeout, which would be a fault suspicion.

Since you said there is no such thing, I would be grateful if you could explain what these PRTE-level options actually do:
[screenshot of the PRTE-level options from the documentation]

Thank you for your time!

@rhc54
Contributor

rhc54 commented Apr 28, 2025

Traveling today but will respond more later. Thanks for the info! Exactly what I was hoping you'd explain. The docs are outdated and incorrect. Those options don't exist.

App procs only connect to the local PRRTE daemon (prted), not to mpirun; that connection is made via the PMIx library. The daemons connect to mpirun.

App procs do connect to each other via MPI of course.

@rhc54
Contributor

rhc54 commented Apr 29, 2025

A little more info. PRRTE is the runtime that is launching and monitoring application procs. PMIx is a library used to provide the connection between an application proc and the local runtime daemon (prted). Each app proc uses PMIx to establish a TCP connection to the local prted - period. There are no connections to any other prted.

There are two categories of failures to consider. One or more application procs can fail. This is detected by the local prted when its PMIx library notifies it that the TCP connection to the local app proc has failed. The local prted then notifies mpirun, which generates an event notifying all other procs in the app - telling them the identity of the proc that failed. The MPI library in each proc then calls your registered error handler to tell the app that "proc A has died". The app is responsible for whatever happens in response.

The other failure category is a prted failing. This is what that documentation referred to: a prior attempt to use heartbeats between the daemons to detect daemon failure. Unfortunately, sys admins get very upset at runtimes that clog their management network with heartbeats at scale - that network is actually rather busy managing the cluster. So we don't do that - prted failure is detected by seeing the TCP connection between the daemons fail. We don't have a mechanism for recovering from that scenario, so PRRTE will simply terminate all running apps and exit.

Bottom line here: there are no heartbeats anywhere in the system. Failure detection is done by socket closure. At the MPI level, your sole need is to (a) register the error handler, and (b) do something when it is called.
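Roughly, at the MPI level that boils down to something like the following sketch (just to illustrate the shape; it assumes the ULFM build so that MPIX_ERR_PROC_FAILED and the MPIX_* calls from <mpi-ext.h> are available):

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED, MPIX_Comm_* (ULFM extensions) */

/* Called by the MPI library when an operation on comm returns an error. */
static void proc_failure_handler(MPI_Comm *comm, int *errcode, ...)
{
    int eclass;
    MPI_Error_class(*errcode, &eclass);
    if (eclass != MPIX_ERR_PROC_FAILED) {
        MPI_Abort(*comm, *errcode);        /* not a process failure: give up */
        return;
    }

    /* Learn which procs failed, then do whatever recovery the app wants
     * (re-elect a leader, shrink the communicator, etc.). */
    MPI_Group failed;
    int nfailed;
    MPIX_Comm_failure_ack(*comm);
    MPIX_Comm_failure_get_acked(*comm, &failed);
    MPI_Group_size(failed, &nfailed);
    printf("error handler: %d process(es) reported failed\n", nfailed);
    MPI_Group_free(&failed);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* (a) register the error handler ... */
    MPI_Errhandler eh;
    MPI_Comm_create_errhandler(proc_failure_handler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

    /* ... (b) the application's communication goes here; a failed peer
     * surfaces as MPIX_ERR_PROC_FAILED on operations that involve it,
     * which triggers the handler above. */

    MPI_Finalize();
    return 0;
}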

@arielscht
Author

Everything is clear now; I'll keep all of this in mind while writing the code for my experiments.

Thank you very much for the explanation!

@arielscht
Author

Before closing the issue, one last question about the implementation.

Does Open MPI only use TCP for communication, or does it also support UDP? If so, is there a parameter I can set to use UDP instead?

I'm asking because failures are detected when the TCP connection closes; with UDP it would not be possible to detect any failure, since the protocol is connectionless.

@jsquyres
Member

Open MPI uses a bunch of different communications protocols.

For the run time (e.g., your questions have been about PRRTE), that's all TCP.

But for MPI, a variety of different protocols and network types are supported. Plain vanilla TCP is supported, but we do not have a UDP module (although some of the underlying HPC-class network support modules utilize connectionless and/or unreliable transports).
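For example, the plain TCP path can be selected explicitly with the usual MCA parameters (illustrative; which components are available depends on how your build was configured, and ./your_app is just a placeholder):

$ mpirun --mca pml ob1 --mca btl tcp,self -n 4 ./your_app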

@rhc54
Contributor

rhc54 commented Apr 30, 2025

Correct - the runtime is solely TCP. We did consider adding a UDP option at one point in the past, but it would have required a lot of work. We would have to create all the support infrastructure to make it reliable (retry, ack, etc.), deal with UDP routing differences, etc. In the end, it just didn't feel worth all the trouble.

If someone ever comes up with some must-have advantage, and is willing to provide the effort to do it, then we'd certainly be open to the idea.
