[sled-agent] handle disappearing Propolis zones #7794
Conversation
At present, sled-agents don't really have any way to detect the unexpected disappearance of a Propolis zone. If a Propolis zone is deleted, such as by someone `ssh`ing into the sled and running `zoneadm` commands, the sled-agent won't detect this and instead believes the instance to still exist.

Fortunately, this is a fairly rare turn of events: if a VMM panics or other bad things happen, the zone is not usually torn down. Instead, the `propolis-server` process is restarted, in a state where it is no longer aware that it's supposed to have been running an instance. VMs in such a state report to the sled-agent that they're screwed up, and it knows to treat them as having failed. This is all discussed in detail in [RFD 486 What Shall We Do With The Failèd Instance?][486].

Unfortunately, under the rules laid down in that RFD, sled-agents will _only_ treat a Propolis zone as having failed when the `propolis-server` returns one of the errors that *affirmatively indicate* that it has crashed and been restarted. All other errors that occur while checking an instance's state are retried, whether they are HTTP errors returned by the `propolis-server` process, or (critically, in this case) failures to establish a TCP connection because the `propolis-server` process no longer exists. This is pretty bad, as the sled-agent is now left believing that the instance was in its last observed state indefinitely, and that instance (which is now Way Gone) cannot be stopped, restarted, or deleted through normal means. That _sucks_, man!

This commit changes sled-agent to behave more intelligently in this situation. Now, when attempts to check `propolis-server`'s instance-state-monitor API endpoint fail with communication errors, the sled-agent will run `zoneadm list` to find out whether the Propolis zone is still there. If it isn't, we now move the instance to `Failed`, because it's..., you know, totally gone.

Fixes #7563

[486]: https://rfd.shared.oxide.computer/rfd/0486
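To make the intended flow concrete, here's a minimal sketch (in Rust, but not the actual sled-agent code) of the decision described above. `PollOutcome`, `PropolisError`, and `classify_poll` are simplified, hypothetical stand-ins; the real implementation checks the zone via illumos-utils (`zoneadm list`) and reports a `ZoneGone` update to the instance's main task, as shown in the diff excerpts further down.

```rust
/// Hypothetical, simplified outcome of one poll of propolis-server's
/// instance-state-monitor endpoint (not sled-agent's real type).
enum PollOutcome {
    /// Propolis answered; keep watching the VMM as usual.
    StillRunning,
    /// Propolis is unreachable but its zone still exists; retry later.
    Retry,
    /// The Propolis zone itself is gone; move the instance to Failed.
    ZoneGone,
}

/// Hypothetical, simplified error from talking to propolis-server.
enum PropolisError {
    /// Could not establish a connection at all (e.g. the process is gone).
    Communication,
    /// Propolis answered, but with an HTTP error.
    Http(u16),
}

/// Classify one monitoring attempt, given whether `zoneadm list` still
/// shows the Propolis zone.
fn classify_poll(
    propolis_result: Result<(), PropolisError>,
    zone_still_exists: bool,
) -> PollOutcome {
    match propolis_result {
        // Propolis answered: nothing unusual to do.
        Ok(()) => PollOutcome::StillRunning,
        // A communication error *and* a missing zone means the zone was
        // torn down out from under us: report that, so the instance can
        // be marked Failed instead of being retried forever.
        Err(PropolisError::Communication) if !zone_still_exists => {
            PollOutcome::ZoneGone
        }
        // All other errors (HTTP errors, or connection failures while the
        // zone is still present) are retried, per RFD 486.
        Err(_) => PollOutcome::Retry,
    }
}
```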
I'm going to test this on a racklette by manually deleting a zone. Opening the PR now while I wait for the TUF repo build.
As I mentioned in chat, we'll want to check that the termination machinery that runs on the main task after getting the `ZoneGone` message doesn't do anything untoward if the zone is already missing, but as long as it doesn't I think this looks good. Thanks for putting this together!
Yup, not gonna merge this until I've actually tested it in real life!
Hmm, clearly I've gotten something wrong: when I...
I think this is because...
hm, the cleanup path might be getting stuck on...something...when the zone is already halted. Now, after bd46bd6, when I run
BRM42220036 # tail -f /var/svc/log/oxide-sled-agent:default.log | looker -c 'r.component?.contains("Instance")'
I see:
22:30:44.727Z INFO SledAgent (InstanceManager): Propolis zone is no longer running!
error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
file = sled-agent/src/instance.rs:410
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
zone = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
zone_state = Down
22:30:44.727Z WARN SledAgent (InstanceManager): Propolis zone has gone away entirely! Moving to Failed
file = sled-agent/src/instance.rs:575
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z ERRO SledAgent (InstanceManager): Failed to take zone bundle for terminated instance
file = sled-agent/src/instance.rs:1431
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
reason = BundleFailed(failed to enumerate zone service processes\n\nCaused by:\n 0: Failed to run a command\n 1: Error running command in zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22)\n 2: Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22))
zone_name = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
file = sled-agent/src/instance.rs:1444
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
and...that's it. Nexus still sees the instance as it was before. Going to make sure we're not getting stuck somewhere.
OH:
The line sled-agent is panicking on is here: omicron/sled-agent/src/instance.rs line 1445 (at bd46bd6). But, as far as I can tell, it looks like `Zones::halt_and_remove{_logged}` should handle removing zones that are already halted: omicron/illumos-utils/src/zone.rs lines 225 to 273 (at 4a7556c). Unfortunately the rest of the panic seems to have gotten eaten.
oh there we go, the panic got eaten because I had been using the looker filter above. Grepping the raw log instead:
BRM42220036 # cat /var/svc/log/oxide-sled-agent:default.log | grep -A 20 panic
thread 'tokio-runtime-worker' panicked at sled-agent/src/instance.rs:1445:64:
called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': uninstall operation is invalid for down zones.")) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Mar 14 22:30:45 Stopping because all processes in service exited. ]
[ Mar 14 22:30:45 Executing stop method (:kill). ]
[ Mar 14 22:30:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Mar 14 22:30:45 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
Per the error above, I guess what happened is that the zone had not finished halting, and the uninstall was therefore rejected while the zone was still down.
okay, so after e4f322f, I now see the sled-agent doing what I believe is the correct thing:
EDIT: OH NEVER MIND I'M JUST DUMB, Nexus just auto-restarted it, it wasn't even on the same sled anymore:
So, everything seems to be working properly now!
illumos-utils/src/zone.rs
Outdated
let (halt, uninstall) = loop {
    let mut poll = tokio::time::interval(
        std::time::Duration::from_secs(1),
    );
    match state {
        // For states where we could be running, attempt to halt.
        zone::State::Running | zone::State::Ready => {
            break (true, true);
        }
        // For zones where we never performed installation, simply
        // delete the zone - uninstallation is invalid.
        zone::State::Configured => break (false, false),
        // Attempting to uninstall a zone in the "down" state will
        // fail. Instead, wait for it to finish shutting down and
        // then uninstall it.
        zone::State::Down | zone::State::ShuttingDown => {
            poll.tick().await;
            match Self::find(name).await? {
                None => return Ok(None),
                Some(zone) => state = zone.state(),
            }
        }
        // For most zone states, perform uninstallation.
        _ => break (false, true),
    }
I wondered if this polling ought to have a timeout attached to it (although I think higher-level code ought to be responsible for applying it).
I'd like to see this resolved - although I think your change here kinda makes sense (if we're in a transient state, wait for it to not be transient), the docs on the illumos man page scare me ("the zone can become stuck in one of these states").
In that case, if we don't have a higher-level timeout at all the call-sites, it seems plausible that this change could cause a different part of the sled agent to become wedged. Note that this API is used for all zones, not just VMM instances, so this could be relevant if e.g. we fail to tear down an internal service zone.
The questions I have here are
- What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...
- In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?
One of the example call stacks:
- ensure_all_omicron_zones
- zone_bundle_and_try_remove
- zone.runtime.stop()
- NOTE - today, this only logs an error! It does not stop progress
- Zones::halt_and_remove_logged
My fear is that any zone wedged here will prevent any subsequent changes from happening, period, rather than continuing to try and make progress
- What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...
Truthfully, I'm not sure either. This feels extremely vaguely-defined, and may need to be resolved on a case-by-case basis
- In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?
I think we could change this to make a timeout the caller's responsibility, but I think that also will require us to go check out the callsites (like the one I mentioned above) because we're basically changing semantics, even if they aren't explicitly part of the function signature.
Yeah, I think that, since most of the code that calls into this function already handles other errors halting and removing the zone, perhaps the timeout should just be here, and we return an error if the zone doesn't reach a state where it can be uninstalled in a timely manner. Note that the present behavior of this function is to fail immediately if the zone is in `Down` or `ShuttingDown`, so the worst case of giving it a couple minutes before returning an error is similar to what this function already does...
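For concreteness, here's a rough sketch (not the code in this PR) of the kind of bound being discussed: wrapping the wait in `tokio::time::timeout` so that a zone stuck in `Down`/`ShuttingDown` eventually produces an error instead of wedging the caller. `wait_for_haltable_state` is a hypothetical stand-in for the polling loop shown above.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the polling loop above: wait until the zone
/// leaves the transient Down/ShuttingDown states (details elided).
async fn wait_for_haltable_state(_zone_name: &str) -> Result<(), String> {
    unimplemented!("stand-in for the polling shown in the diff above")
}

/// Sketch: bound the wait so a stuck zone becomes an error, not a hang.
async fn halt_with_deadline(zone_name: &str) -> Result<(), String> {
    const HALT_DEADLINE: Duration = Duration::from_secs(120);

    match tokio::time::timeout(
        HALT_DEADLINE,
        wait_for_haltable_state(zone_name),
    )
    .await
    {
        // The zone reached a state where halt/uninstall can proceed.
        Ok(Ok(())) => Ok(()),
        // The polling itself failed (e.g. looking up the zone errored).
        Ok(Err(e)) => Err(format!("failed waiting on zone {zone_name}: {e}")),
        // The zone never left Down/ShuttingDown within the deadline;
        // return an error rather than looping forever.
        Err(_elapsed) => Err(format!(
            "zone {zone_name} did not become haltable within {HALT_DEADLINE:?}"
        )),
    }
}
```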
Ok(Some(zone)) => {
    info!(
        self.log,
        "Propolis zone is no longer running!";
        "error" => %e,
        "zone" => %self.zone_name,
        "zone_state" => ?zone.state(),
    );
    Ok(InstanceMonitorUpdate::ZoneGone)
}
I believe this is already true - but to confirm, `monitor` can only be called after the VMM's zone was initialized, correct? (e.g., there's no concern about racing with zone startup here, right?)
That's correct, the function that starts the state monitor task is called with a `RunningZone`, which is only returned if the zone has already reached running.
illumos-utils/src/zone.rs
Outdated
@smklein what do you think of this change?
Ok(Ok(_)) => {}
Ok(Err(e)) => panic!("{e}"),
Err(_) => {
    panic!("Zone {zname:?} could not be halted within 5 minutes")
}
I don't love that we panic here, but we were `unwrap`ping it previously, so...this preserves the current behavior, at least...
agreed. We can, and probably should, make a less hostile solution here, but I agree that at least this part of the change is a lateral movement!
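As a sketch of that less hostile direction (not code from this PR, and using simplified, hypothetical stand-in types rather than sled-agent's real ones), the halt failure or timeout could be propagated as an error instead of panicking:

```rust
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug)]
struct HaltError(String);

/// Hypothetical stand-in for the real halt/uninstall path
/// (`Zones::halt_and_remove_logged` in illumos-utils).
async fn halt_and_remove(_zname: &str) -> Result<(), HaltError> {
    unimplemented!("stand-in for the real halt/uninstall call")
}

/// Sketch: surface halt failures to the caller instead of panicking.
async fn terminate_zone(zname: &str) -> Result<(), HaltError> {
    const HALT_TIMEOUT: Duration = Duration::from_secs(300);
    match tokio::time::timeout(HALT_TIMEOUT, halt_and_remove(zname)).await {
        // The zone was halted and removed.
        Ok(Ok(())) => Ok(()),
        // The halt itself failed; pass the error up.
        Ok(Err(e)) => Err(e),
        // Timed out; report that as an error, too.
        Err(_elapsed) => Err(HaltError(format!(
            "zone {zname:?} could not be halted within {HALT_TIMEOUT:?}"
        ))),
    }
}
```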
@@ -69,7 +69,25 @@ pub struct AdmError {
     op: Operation,
     zone: String,
     #[source]
-    err: zone::ZoneError,
+    err: AdmErrorKind,
It was necessary to include a structured enum error variant for the case of "zone is `ShuttingDown` or `Down`" so that it could be differentiated from other, less-retryable zoneadm errors. Initially, I was going to have `halt_and_remove(_logged)?` return an enum of either `InvalidState` or `AdmError` to indicate that, but `install_omicron_zone` calls `halt_and_remove_logged` and bubbles up the error, and that returns an `AdmError`.

I don't love hacking up every place we construct an `AdmError` just to handle this one case, but it seemed like the best solution I could come up with in a pinch. Alternatively, `install_omicron_zone` could return a boxed error type of either a `HaltError` or a normal `AdmError`, but that also felt inconsistent with every other function here returning `AdmError`...I could be convinced either way.