[sled-agent] handle disappearing Propolis zones #7794

Merged: 4 commits merged into main on Mar 18, 2025

Conversation


@hawkw hawkw commented Mar 13, 2025

At present, sled-agents don't really have any way to detect the unexpected disappearance of a Propolis zone. If a Propolis zone is deleted, such as by someone `ssh`ing into the sled and running `zoneadm` commands, the sled-agent won't detect this and will continue to believe that the instance still exists.

Fortunately, this is a fairly rare turn of events: if a VMM panics or other bad things happen, the zone is not usually torn down. Instead, the `propolis-server` process is restarted, in a state where it is no longer aware that it's supposed to have been running an instance. VMs in such a state report to the sled-agent that they're screwed up, and it knows to treat them as having failed. This is all discussed in detail in [RFD 486 What Shall We Do With The Failèd Instance?](https://rfd.shared.oxide.computer/rfd/0486).

Unfortunately, under the rules laid down in that RFD, sled-agents will _only_ treat a Propolis zone as having failed when the `propolis-server` returns one of the errors that *affirmatively indicate* that it has crashed and been restarted. All other errors that occur while checking an instance's state are retried, whether they are HTTP errors returned by the `propolis-server` process or (critically, in this case) failures to establish a TCP connection because the `propolis-server` process no longer exists. This is pretty bad, as the sled-agent is left believing that the instance remains in its last observed state indefinitely, and that instance (which is now Way Gone) cannot be stopped, restarted, or deleted through normal means. That _sucks_, man!

This commit changes sled-agent to behave more intelligently in this situation. Now, when attempts to check `propolis-server`'s instance-state-monitor API endpoint fail with communication errors, the sled-agent will run `zoneadm list` to find out whether the Propolis zone is still there. If it isn't, we now move the instance to `Failed`, because it's, you know, totally gone.
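As a rough illustration of the shape of that check (the type and function names below are invented for the example and are not the actual sled-agent items; the real logic lives in `sled-agent/src/instance.rs` and uses `Zones::find` from `illumos-utils`):

```rust
/// Hypothetical outcome type for the example; the real monitor task uses its
/// own message enum.
#[derive(Debug)]
enum MonitorUpdate {
    /// The Propolis zone no longer exists: report the VMM as Failed.
    ZoneGone,
    /// The zone is still there, so treat the error as transient and retry.
    Retry(String),
}

/// Stand-in for a `zoneadm`-backed lookup of a zone by name (`Zones::find`
/// in illumos-utils). Stubbed here so the sketch is self-contained.
async fn find_zone(_name: &str) -> std::io::Result<Option<()>> {
    Ok(None)
}

/// On a communication error from propolis-server, check whether the zone
/// itself is gone before deciding to retry.
async fn on_communication_error(zone_name: &str, err: String) -> MonitorUpdate {
    match find_zone(zone_name).await {
        // No zone at all: the instance can never recover, so fail it.
        Ok(None) => MonitorUpdate::ZoneGone,
        // Zone exists: propolis-server may just be slow or restarting.
        Ok(Some(_)) => MonitorUpdate::Retry(err),
        // Couldn't determine the zone's state: be conservative and retry.
        Err(e) => MonitorUpdate::Retry(format!("{err} (zone lookup failed: {e})")),
    }
}
```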

Fixes #7563


hawkw commented Mar 13, 2025

I'm going to test this on a racklette by manually deleting a zone. Opening the PR now while I wait for the TUF repo build.

@gjcolombo (Contributor) left a comment

As I mentioned in chat, we'll want to check that the termination machinery that runs on the main task after getting the ZoneGone message doesn't do anything untoward if the zone is already missing, but as long as it doesn't I think this looks good. Thanks for putting this together!


hawkw commented Mar 13, 2025

> As I mentioned in chat, we'll want to check that the termination machinery that runs on the main task after getting the ZoneGone message doesn't do anything untoward if the zone is already missing, but as long as it doesn't I think this looks good. Thanks for putting this together!

Yup, not gonna merge this until I've actually tested it in real life!


hawkw commented Mar 14, 2025

Hmm, clearly I've gotten something wrong: when I zoneadm halt a running Propolis zone, sled-agent goes into the "communication error, but the zone still exists" path, which is...wrong:

19:34:08.340Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:917
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    new_vmm_state = VmmRuntimeState { state: Running, gen: Generation(4), time_updated: 2025-03-14T19:34:08.340221801Z }
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
19:34:08.340Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:814
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    state = SledVmmState { vmm_state: VmmRuntimeState { state: Running, gen: Generation(4), time_updated: 2025-03-14T19:34:08.340221801Z }, migration_in: None, migration_out: None }
19:38:00.396Z WARN SledAgent (InstanceManager): communication error checking up on Propolis, but the zone still exists...
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:415
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    zone = oxz_propolis-server_c068f7a7-37b9-43de-9d79-c26dd1345d0e
19:38:00.396Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:331
    generation = 3
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    retry_in = 43.827ms
19:38:00.464Z WARN SledAgent (InstanceManager): communication error checking up on Propolis, but the zone still exists...
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:415
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    zone = oxz_propolis-server_c068f7a7-37b9-43de-9d79-c26dd1345d0e
19:38:00.464Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:331
    generation = 3
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    retry_in = 54.500286ms
19:41:15.596Z WARN SledAgent (InstanceManager): communication error checking up on Propolis, but the zone still exists...
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:415
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    zone = oxz_propolis-server_c068f7a7-37b9-43de-9d79-c26dd1345d0e
19:41:15.596Z WARN SledAgent (InstanceManager): Failed to poll Propolis state
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:331
    generation = 3
    instance_id = 676060ed-f3e3-4581-b2bc-07a7bc585ca0
    propolis_id = c068f7a7-37b9-43de-9d79-c26dd1345d0e
    retry_in = 157.891271ms

I think this is because Zones::find runs zoneadm with the -cip flags, so it includes zones which are installed but not running. We should change the logic to specifically test if the zone is running.
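Concretely, something along these lines would distinguish the two cases (a sketch against the `zone` crate's types as used elsewhere in this PR, not the exact fix that landed):

```rust
/// Sketch: a zone that shows up in `zoneadm list -cip` output still counts as
/// "no longer running" unless it is actually in the Running state. (Whether
/// Ready should also count is a judgment call; this sketch only accepts
/// Running.)
fn zone_is_running(zone: &zone::Zone) -> bool {
    matches!(zone.state(), zone::State::Running)
}
```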


hawkw commented Mar 14, 2025

Hm, the cleanup path might be getting stuck on...something...when the zone is already halted. Now, after bd46bd6, when I run `zoneadm -z oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad halt`, I see this:

BRM42220036 # tail -f /var/svc/log/oxide-sled-agent:default.log | looker -c 'r.component?.contains("Instance")'
22:30:44.727Z INFO SledAgent (InstanceManager): Propolis zone is no longer running!
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:410
    instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
    propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
    zone = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
    zone_state = Down
22:30:44.727Z WARN SledAgent (InstanceManager): Propolis zone has gone away entirely! Moving to Failed
    file = sled-agent/src/instance.rs:575
    instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
    propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z ERRO SledAgent (InstanceManager): Failed to take zone bundle for terminated instance
    file = sled-agent/src/instance.rs:1431
    instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
    propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
    reason = BundleFailed(failed to enumerate zone service processes\n\nCaused by:\n    0: Failed to run a command\n    1: Error running command in zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22)\n    2: Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22))
    zone_name = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
    file = sled-agent/src/instance.rs:1444
    instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
    propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad

and...that's it. Nexus still sees the instance as Running.

Going to make sure we're not getting stuck somewhere.


hawkw commented Mar 14, 2025

OH:

22:30:44.835Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
    file = sled-agent/src/instance.rs:1444
    instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
    propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
thread 'tokio-runtime-worker' panicked at sled-agent/src/instance.rs:1445:64:
22:30:50.802Z WARN SledAgent (BootstrapAgentStartup): Deleting existing VNIC
    file = sled-hardware/src/cleanup.rs:100
    vnic_kind = OxideControlVnic
    vnic_name = oxControlInstance0


hawkw commented Mar 14, 2025

The line sled-agent is panicking on is here:

Zones::halt_and_remove_logged(&self.log, &zname).await.unwrap();

but, as far as I can tell, it looks like Zones::halt_and_remove{_logged} should handle removing zones that are already halted:
pub async fn halt_and_remove(
    name: &str,
) -> Result<Option<zone::State>, AdmError> {
    match Self::find(name).await? {
        None => Ok(None),
        Some(zone) => {
            let state = zone.state();
            let (halt, uninstall) = match state {
                // For states where we could be running, attempt to halt.
                zone::State::Running | zone::State::Ready => (true, true),
                // For zones where we never performed installation, simply
                // delete the zone - uninstallation is invalid.
                zone::State::Configured => (false, false),
                // For most zone states, perform uninstallation.
                _ => (false, true),
            };
            if halt {
                zone::Adm::new(name).halt().await.map_err(|err| {
                    AdmError {
                        op: Operation::Halt,
                        zone: name.to_string(),
                        err,
                    }
                })?;
            }
            if uninstall {
                zone::Adm::new(name)
                    .uninstall(/* force= */ true)
                    .await
                    .map_err(|err| AdmError {
                        op: Operation::Uninstall,
                        zone: name.to_string(),
                        err,
                    })?;
            }
            zone::Config::new(name)
                .delete(/* force= */ true)
                .run()
                .await
                .map_err(|err| AdmError {
                    op: Operation::Delete,
                    zone: name.to_string(),
                    err,
                })?;
            Ok(Some(state))
        }
    }
}

Unfortunately the rest of the panic seems to have gotten eaten.


hawkw commented Mar 14, 2025

Oh, there we go. The panic got eaten because I had been using grep to filter the logs, and the message is multiline:

BRM42220036 # cat /var/svc/log/oxide-sled-agent:default.log | grep -A 20 panic
thread 'tokio-runtime-worker' panicked at sled-agent/src/instance.rs:1445:64:
called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': uninstall operation is invalid for down zones.")) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Mar 14 22:30:45 Stopping because all processes in service exited. ]
[ Mar 14 22:30:45 Executing stop method (:kill). ]
[ Mar 14 22:30:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Mar 14 22:30:45 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"


hawkw commented Mar 14, 2025

Per man 7 zones:

SHUTTING_DOWN
DOWN
Indicates that the zone is being halted. The zone can
become stuck in one of these states if it is unable to
tear down the application environment state (such as
mounted file systems) or if some portion of the
virtual platform cannot be destroyed. Such cases
require operator intervention.

I guess what happened is that the zone had not finished halting, and the Zones::halt_and_remove function can't actually handle zones that someone else has started halting but haven't finished yet. We probably need to wait until it's fully halted before trying to remove it? I'll have to think a bit about where the right place to handle this is...


hawkw commented Mar 17, 2025

Okay, so after e4f322f, I now see the sled-agent doing what I believe is the correct thing:

21:24:08.336Z INFO SledAgent (InstanceManager): Observing new propolis state: ObservedPropolisState { vmm_state: PropolisInstanceState(Running), migration_in: None, migration_out: None, time: 2025-03-17T21:24:08.336605288Z }
    file = sled-agent/src/instance.rs:904
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
21:24:08.336Z INFO SledAgent (InstanceManager): updated state after observing Propolis state change
    file = sled-agent/src/instance.rs:926
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    new_vmm_state = VmmRuntimeState { state: Running, gen: Generation(4), time_updated: 2025-03-17T21:24:08.336605288Z }
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
21:24:08.336Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:823
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
    state = SledVmmState { vmm_state: VmmRuntimeState { state: Running, gen: Generation(4), time_updated: 2025-03-17T21:24:08.336605288Z }, migration_in: None, migration_out: None }
21:26:54.613Z INFO SledAgent (InstanceManager): Propolis zone is no longer running!
    error = Communication Error: error sending request for url (http://[fd00:1122:3344:104::1:0]:12400/instance/state-monitor)
    file = sled-agent/src/instance.rs:410
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
    zone = oxz_propolis-server_c8b625b4-7096-4f8d-b4ed-a5087df609af
    zone_state = ShuttingDown
21:26:54.613Z WARN SledAgent (InstanceManager): Propolis zone has gone away entirely! Moving to Failed
    file = sled-agent/src/instance.rs:575
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
21:26:54.614Z INFO SledAgent (ZoneBundler): creating zone bundle
    context = ZoneBundleContext { storage_dirs: ["/pool/int/87b6379b-99bb-4af3-b0cf-52d6077310a9/debug/bundle/zone", "/pool/int/f2a80c57-3d02-4ae1-b07d-a9da74aa24c4/debug/bundle/zone"], cause: TerminatedInstance, extra_log_dirs: ["/pool/ext/dda09115-f052-459b-a7b1-c8b6970671d2/crypt/debug", "/pool/ext/30d1c715-2219-4504-95f6-b3507e508db1/crypt/debug", "/pool/ext/df7b22d2-7983-42da-aa72-ae20527b67c3/crypt/debug", "/pool/ext/1e3544bb-28f3-4572-920b-fe7fbdc24fb8/crypt/debug", "/pool/ext/c0e60a5b-2087-4938-9358-70ab1f7b4d2f/crypt/debug", "/pool/ext/8797e2a6-32cf-4653-aa84-fef64107b5de/crypt/debug", "/pool/ext/d26248cf-1d7b-4fe0-ba65-9d9c7062d225/crypt/debug", "/pool/ext/49a77078-831c-4975-b66e-4ec6fab4b384/crypt/debug", "/pool/ext/5e5b22b6-1c99-4063-8470-eca1d07b5055/crypt/debug", "/pool/ext/1146dcc3-ba63-4234-82a3-aa4178fef668/crypt/debug"] }
    file = sled-agent/src/zone_bundle.rs:354
    zone_name = oxz_propolis-server_c8b625b4-7096-4f8d-b4ed-a5087df609af
21:26:54.706Z ERRO SledAgent (InstanceManager): Failed to take zone bundle for terminated instance
    file = sled-agent/src/instance.rs:1431
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
    reason = BundleFailed(failed to enumerate zone service processes\n\nCaused by:\n    0: Failed to run a command\n    1: Error running command in zone 'oxz_propolis-server_c8b625b4-7096-4f8d-b4ed-a5087df609af': Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22)\n    2: Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22))
    zone_name = oxz_propolis-server_c8b625b4-7096-4f8d-b4ed-a5087df609af
21:26:54.706Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_c8b625b4-7096-4f8d-b4ed-a5087df609af
    file = sled-agent/src/instance.rs:1444
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
21:26:56.303Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Installed
    file = illumos-utils/src/zone.rs:298
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
21:26:56.326Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    file = sled-agent/src/instance.rs:823
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
    state = SledVmmState { vmm_state: VmmRuntimeState { state: Failed, gen: Generation(5), time_updated: 2025-03-17T21:26:56.326055012Z }, migration_in: None, migration_out: None }
21:26:56.341Z INFO SledAgent (InstanceManager): State monitoring task complete
    file = sled-agent/src/instance.rs:1381
    instance_id = 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
    propolis_id = c8b625b4-7096-4f8d-b4ed-a5087df609af
although weirdly, Nexus seems to have not updated the instance state to `Failed`:
root@oxz_switch1:~# omdb db instances
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:103::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:102::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (129.0.0)
ID                                   STATE   PROPOLIS_ID                          SLED_ID                              HOST_SERIAL NAME
4ebf2723-3aeb-49ad-a9b7-a556675c9ae6 running e44522b0-cde7-439a-a537-67e98fc20f35 5447746b-5d2c-4d3c-9011-1dbb7d3cff7b BRM42220062 my-cool-instance

Going to have to pick through nexus logs I guess!

EDIT: OH NEVER MIND I'M JUST DUMB, Nexus just auto-restarted it, it wasn't even on the same sled anymore:

== RUNTIME STATE ===============================================================
               nexus state: Vmm
(i)     external API state: Running
           last updated at: 2025-03-17T21:26:56.488262Z (generation 5)
       needs reincarnation: false
/!\          karmic status: cooling down (TimeDelta { secs: 3066, nanos: 844029127 } remaining)
      last reincarnated at: Some(2025-03-17T21:26:57.332468Z)
             active VMM ID: Some(e44522b0-cde7-439a-a537-67e98fc20f35)
             target VMM ID: None
              migration ID: None
              updater lock: UNLOCKED at generation: 4

== ACTIVE VMM ==================================================================
                        ID: e44522b0-cde7-439a-a537-67e98fc20f35
               instance ID: 4ebf2723-3aeb-49ad-a9b7-a556675c9ae6
                created at: 2025-03-17 21:26:57.287666 UTC
                     state: running
                updated at: 2025-03-17T21:27:08.864616Z (generation 4)
          propolis address: fd00:1122:3344:103::1:0:12400
                   sled ID: 5447746b-5d2c-4d3c-9011-1dbb7d3cff7b

So, everything seems to be working properly now!

Comment on lines 232 to 256
let (halt, uninstall) = loop {
    let mut poll = tokio::time::interval(
        std::time::Duration::from_secs(1),
    );
    match state {
        // For states where we could be running, attempt to halt.
        zone::State::Running | zone::State::Ready => {
            break (true, true);
        }
        // For zones where we never performed installation, simply
        // delete the zone - uninstallation is invalid.
        zone::State::Configured => break (false, false),
        // Attempting to uninstall a zone in the "down" state will
        // fail. Instead, wait for it to finish shutting down and
        // then uninstall it.
        zone::State::Down | zone::State::ShuttingDown => {
            poll.tick().await;
            match Self::find(name).await? {
                None => return Ok(None),
                Some(zone) => state = zone.state(),
            }
        }
        // For most zone states, perform uninstallation.
        _ => break (false, true),
    }
hawkw (Member Author):

I wondered if this polling ought to have a timeout attached to it (although I think higher-level code ought to be responsible for applying it).

Collaborator:

I'd like to see this resolved. Although I think your change here kinda makes sense (if we're in a transient state, wait for it to not be transient), the docs on the illumos man page scare me ("the zone can become stuck in one of these states").

In that case, if we don't have a higher-level timeout at all the call-sites, it seems plausible that this change could cause a different part of the sled agent to become wedged. Note that this API is used for all zones, not just VMM instances, so this could be relevant if e.g. we fail to tear down an internal service zone.

hawkw (Member Author):

The questions I have here are

  1. What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...
  2. In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?

Collaborator:

One of the example call stacks:

  • ensure_all_omicron_zones
  • zone_bundle_and_try_remove
  • zone.runtime.stop()
    • NOTE - today, this only logs an error! It does not stop progress
  • Zones::halt_and_remove_logged

My fear is that any zone wedged here will prevent any subsequent changes from happening, period, rather than continuing to try and make progress

Collaborator:

> 1. What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...

Truthfully, I'm not sure either. This feels extremely vaguely-defined, and may need to be resolved on a case-by-case basis

> 2. In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?

I think we could change this to make a timeout the caller's responsibility, but I think that also will require us to go check out the callsites (like the one I mentioned above) because we're basically changing semantics, even if they aren't explicitly part of the function signature.

hawkw (Member Author):

Yeah, I think that, since most of the code that calls into this function already handles other errors from halting and removing the zone, perhaps the timeout should just live here, and we return an error if the zone doesn't reach a state where it can be uninstalled in a timely manner. Note that the present behavior of this function is to fail immediately if the zone is in Down or ShuttingDown, so the worst case of giving it a couple of minutes before returning an error is similar to what this function already does...
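As a rough sketch of that shape (the deadline and the error type here are illustrative choices, and the state lookup is stubbed; this is not the code as merged):

```rust
use std::time::Duration;

/// Stub for re-querying the zone's current state; the real code would call
/// `Zones::find` and inspect the returned `zone::State`.
async fn current_state(_name: &str) -> Option<zone::State> {
    None
}

/// Wait for a zone stuck in Down/ShuttingDown to settle, giving up after a
/// fixed deadline instead of looping forever.
async fn wait_until_settled(name: &str) -> Result<(), String> {
    let wait = async {
        loop {
            match current_state(name).await {
                Some(zone::State::Down) | Some(zone::State::ShuttingDown) => {
                    tokio::time::sleep(Duration::from_secs(1)).await;
                }
                // Gone, or in a state where halt/uninstall can proceed.
                _ => break,
            }
        }
    };
    tokio::time::timeout(Duration::from_secs(300), wait)
        .await
        .map_err(|_| format!("zone {name} did not finish halting before the deadline"))
}
```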

@hawkw hawkw requested a review from smklein March 17, 2025 21:39
Comment on lines +409 to +418
Ok(Some(zone)) => {
    info!(
        self.log,
        "Propolis zone is no longer running!";
        "error" => %e,
        "zone" => %self.zone_name,
        "zone_state" => ?zone.state(),
    );
    Ok(InstanceMonitorUpdate::ZoneGone)
}
Collaborator:

I believe this is already true - but to confirm, monitor can only be called after the VMM's zone was initialized, correct?

(e.g., there's no concern about racing with zone startup here, right?)

hawkw (Member Author):

That's correct, the function that starts the state monitor task is called with a RunningZone, which is only returned if the zone has already reached running.

@hawkw (Member Author) left a comment

@smklein what do you think of this change?

Comment on lines +1464 to +1468
Ok(Ok(_)) => {}
Ok(Err(e)) => panic!("{e}"),
Err(_) => {
    panic!("Zone {zname:?} could not be halted within 5 minutes")
}
hawkw (Member Author):

I don't love that we panic here, but we were unwrapping it previously, so...this preserves the current behavior, at least...

Collaborator:

Agreed. We can, and probably should, make a less hostile solution here, but I agree that at least this part of the change is a lateral movement!

@@ -69,7 +69,25 @@ pub struct AdmError {
     op: Operation,
     zone: String,
     #[source]
-    err: zone::ZoneError,
+    err: AdmErrorKind,
hawkw (Member Author):

It was necessary to include a structured enum error variant in the case of "zone is ShuttingDown or Down" so that it could be differentiated from other, less-retryable zoneadm errors. Initially, I was going to have halt_and_remove(_logged)? return an enum of either InvalidState or AdmError to indicate that, but install_omicron_zone calls halt_and_remove_logged and bubbles up the error, and that returns an AdmError.

I don't love hacking up every place we construct an AdmError just to handle this one case, but it seemed like the best solution I could come up with in a pinch. Alternatively, install_omicron_zone could return a boxed error type of either a HaltError or a normal AdmError, but that also felt inconsistent with every other function here returning AdmError...I could be convinced either way.
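For concreteness, the kind of shape I mean is roughly this (illustrative only; the variant names and derive choices are guesses rather than what actually landed):

```rust
/// Illustrative only: a structured error kind lets callers match on "the zone
/// was mid-halt" without parsing zoneadm's stderr.
#[derive(Debug, thiserror::Error)]
enum AdmErrorKind {
    /// The zone was Down/ShuttingDown, where halt/uninstall are invalid; the
    /// caller may wait and retry once the zone settles.
    #[error("operation is invalid while the zone is {state:?}")]
    InvalidState { state: zone::State },
    /// Any other error from the underlying zone operation.
    #[error(transparent)]
    Zone(#[from] zone::ZoneError),
}
```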

@hawkw hawkw requested a review from smklein March 17, 2025 23:30
@hawkw hawkw merged commit 8c13222 into main Mar 18, 2025
16 checks passed
@hawkw hawkw deleted the eliza/dude-wheres-my-zone branch March 18, 2025 16:55
Successfully merging this pull request may close these issues.

if a Propolis zone is abruptly deleted, sled-agent is left completely unaware