[sled-agent] handle disappearing Propolis zones #7794
Conversation
At present, sled-agents don't really have any way to detect the unexpected disappearance of a Propolis zone. If a Propolis zone is deleted, such as by someone `ssh`ing into the sled and running `zoneadm` commands, the sled-agent won't detect this and instead believes the instance to still exist.

Fortunately, this is a fairly rare turn of events: if a VMM panics or other bad things happen, the zone is not usually torn down. Instead, the `propolis-server` process is restarted, in a state where it is no longer aware that it's supposed to have been running an instance. VMs in such a state report to the sled-agent that they're screwed up, and it knows to treat them as having failed. This is all discussed in detail in [RFD 486 What Shall We Do With The Failèd Instance?][486].

Unfortunately, under the rules laid down in that RFD, sled-agents will _only_ treat a Propolis zone as having failed when the `propolis-server` returns one of the errors that *affirmatively indicate* that it has crashed and been restarted. All other errors that occur while checking an instance's state are retried, whether they are HTTP errors returned by the `propolis-server` process, or (critically, in this case) failures to establish a TCP connection because the `propolis-server` process no longer exists. This is pretty bad, as the sled-agent is now left believing that the instance was in its last observed state indefinitely, and that instance (which is now Way Gone) cannot be stopped, restarted, or deleted through normal means. That _sucks_, man!

This commit changes sled-agent to behave more intelligently in this situation. Now, when attempts to check `propolis-server`'s instance-state-monitor API endpoint fail with communication errors, the sled-agent will run `zoneadm list` to find out whether the Propolis zone is still there. If it isn't, we now move the instance to `Failed`, because it's..., you know, totally gone.

Fixes #7563

[486]: https://rfd.shared.oxide.computer/rfd/0486
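To make the intended flow concrete, here's a minimal sketch (in Rust, but not the actual sled-agent code) of the decision described above. `PollOutcome`, `PropolisError`, and `classify_poll` are simplified, hypothetical stand-ins; the real implementation checks the zone via illumos-utils (`zoneadm list`) and reports a `ZoneGone` update to the instance's main task, as shown in the diff excerpts further down.

```rust
/// Hypothetical, simplified outcome of one poll of propolis-server's
/// instance-state-monitor endpoint (not sled-agent's real type).
enum PollOutcome {
    /// Propolis answered; keep watching the VMM as usual.
    StillRunning,
    /// Propolis is unreachable but its zone still exists; retry later.
    Retry,
    /// The Propolis zone itself is gone; move the instance to Failed.
    ZoneGone,
}

/// Hypothetical, simplified error from talking to propolis-server.
enum PropolisError {
    /// Could not establish a connection at all (e.g. the process is gone).
    Communication,
    /// Propolis answered, but with an HTTP error.
    Http(u16),
}

/// Classify one monitoring attempt, given whether `zoneadm list` still
/// shows the Propolis zone.
fn classify_poll(
    propolis_result: Result<(), PropolisError>,
    zone_still_exists: bool,
) -> PollOutcome {
    match propolis_result {
        // Propolis answered: nothing unusual to do.
        Ok(()) => PollOutcome::StillRunning,
        // A communication error *and* a missing zone means the zone was
        // torn down out from under us: report that, so the instance can
        // be marked Failed instead of being retried forever.
        Err(PropolisError::Communication) if !zone_still_exists => {
            PollOutcome::ZoneGone
        }
        // All other errors (HTTP errors, or connection failures while the
        // zone is still present) are retried, per RFD 486.
        Err(_) => PollOutcome::Retry,
    }
}
```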
I'm going to test this on a racklette by manually deleting a zone. Opening the PR now while I wait for the TUF repo build.
As I mentioned in chat, we'll want to check that the termination machinery that runs on the main task after getting the `ZoneGone` message doesn't do anything untoward if the zone is already missing, but as long as it doesn't I think this looks good. Thanks for putting this together!
Yup, not gonna merge this until I've actually tested it in real life!
Hmm, clearly I've gotten something wrong: when I...
I think this is because...
hm, the cleanup path might be getting stuck on...something...when the zone is already halted. Now, after bd46bd6, when I run
BRM42220036 # tail -f /var/svc/log/oxide-sled-agent:default.log | looker -c 'r.component?.contains("Instance")'
I see:
22:30:44.727Z INFO SledAgent (InstanceManager): Propolis zone is no longer running!
error = Communication Error: error sending request for url (http://[fd00:1122:3344:101::1:0]:12400/instance/state-monitor)
file = sled-agent/src/instance.rs:410
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
zone = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
zone_state = Down
22:30:44.727Z WARN SledAgent (InstanceManager): Propolis zone has gone away entirely! Moving to Failed
file = sled-agent/src/instance.rs:575
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z ERRO SledAgent (InstanceManager): Failed to take zone bundle for terminated instance
file = sled-agent/src/instance.rs:1431
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
reason = BundleFailed(failed to enumerate zone service processes\n\nCaused by:\n 0: Failed to run a command\n 1: Error running command in zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22)\n 2: Failed to start execution of [svcs -H -o fmri]: Invalid argument (os error 22))
zone_name = oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
22:30:44.835Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad
file = sled-agent/src/instance.rs:1444
instance_id = 95cedf3c-7489-4ea6-bc8d-d2484fe81b91
propolis_id = 307dbec2-1445-47ee-a47e-b220ed46b3ad
and...that's it. Nexus still sees the instance as it was before. Going to make sure we're not getting stuck somewhere.
OH:
The line sled-agent is panicking on is here: omicron/sled-agent/src/instance.rs line 1445 (at bd46bd6). But, as far as I can tell, it looks like `Zones::halt_and_remove{_logged}` should handle removing zones that are already halted: omicron/illumos-utils/src/zone.rs lines 225 to 273 (at 4a7556c). Unfortunately the rest of the panic seems to have gotten eaten.
oh there we go, the panic got eaten because I had been using the looker filter above. Grepping the raw log instead:
BRM42220036 # cat /var/svc/log/oxide-sled-agent:default.log | grep -A 20 panic
thread 'tokio-runtime-worker' panicked at sled-agent/src/instance.rs:1445:64:
called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_307dbec2-1445-47ee-a47e-b220ed46b3ad': uninstall operation is invalid for down zones.")) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Mar 14 22:30:45 Stopping because all processes in service exited. ]
[ Mar 14 22:30:45 Executing stop method (:kill). ]
[ Mar 14 22:30:45 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Mar 14 22:30:45 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
Per the error above, I guess what happened is that the zone had not finished halting, and the uninstall was therefore rejected while the zone was still down.
okay, so after e4f322f, I now see the sled-agent doing what I believe is the correct thing:
EDIT: OH NEVER MIND I'M JUST DUMB, Nexus just auto-restarted it, it wasn't even on the same sled anymore:
So, everything seems to be working properly now!
illumos-utils/src/zone.rs
Outdated
let (halt, uninstall) = loop {
    let mut poll = tokio::time::interval(
        std::time::Duration::from_secs(1),
    );
    match state {
        // For states where we could be running, attempt to halt.
        zone::State::Running | zone::State::Ready => {
            break (true, true);
        }
        // For zones where we never performed installation, simply
        // delete the zone - uninstallation is invalid.
        zone::State::Configured => break (false, false),
        // Attempting to uninstall a zone in the "down" state will
        // fail. Instead, wait for it to finish shutting down and
        // then uninstall it.
        zone::State::Down | zone::State::ShuttingDown => {
            poll.tick().await;
            match Self::find(name).await? {
                None => return Ok(None),
                Some(zone) => state = zone.state(),
            }
        }
        // For most zone states, perform uninstallation.
        _ => break (false, true),
    }
I wondered if this polling ought to have a timeout attached to it (although I think higher-level code ought to be responsible for applying it).
I'd like to see this resolved - although I think your change here kinda makes sense (if we're in a transient state, wait for it to not be transient), the docs on the illumos man page scare me ("the zone can become stuck in one of these states").
In that case, if we don't have a higher-level timeout at all the call-sites, it seems plausible that this change could cause a different part of the sled agent to become wedged. Note that this API is used for all zones, not just VMM instances, so this could be relevant if e.g. we fail to tear down an internal service zone.
The questions I have here are
- What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...
- In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?
One of the example call stacks:
- ensure_all_omicron_zones
- zone_bundle_and_try_remove
- zone.runtime.stop()
- NOTE - today, this only logs an error! It does not stop progress
- Zones::halt_and_remove_logged
My fear is that any zone wedged here will prevent any subsequent changes from happening, period, rather than continuing to try and make progress
- What should we do if we do hit a timeout? Just return an error? The man page suggests that if the zone becomes stuck in one of those states, "operator intervention is required", but I have no idea what kind of operator intervention would unfuck it...
Truthfully, I'm not sure either. This feels extremely vaguely-defined, and may need to be resolved on a case-by-case basis
- In that case, do you think the timeout ought to be this function's responsibility, to ensure all uses of it cannot loop forever? Do we think the same timeout would be reasonable for all callers?
I think we could change this to make a timeout the caller's responsibility, but I think that also will require us to go check out the callsites (like the one I mentioned above) because we're basically changing semantics, even if they aren't explicitly part of the function signature.
Yeah, I think that, since most of the code that calls into this function already handles other errors halting and removing the zone, perhaps the timeout should just be here, and we return an error if the zone doesn't reach a state where it can be uninstalled in a timely manner. Note that the present behavior of this function is to fail immediately if the zone is in `Down` or `ShuttingDown`, so the worst case of giving it a couple minutes before returning an error is similar to what this function already does...
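For concreteness, here's a rough sketch (not the code in this PR) of the kind of bound being discussed: wrapping the wait in `tokio::time::timeout` so that a zone stuck in `Down`/`ShuttingDown` eventually produces an error instead of wedging the caller. `wait_for_haltable_state` is a hypothetical stand-in for the polling loop shown above.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the polling loop above: wait until the zone
/// leaves the transient Down/ShuttingDown states (details elided).
async fn wait_for_haltable_state(_zone_name: &str) -> Result<(), String> {
    unimplemented!("stand-in for the polling shown in the diff above")
}

/// Sketch: bound the wait so a stuck zone becomes an error, not a hang.
async fn halt_with_deadline(zone_name: &str) -> Result<(), String> {
    const HALT_DEADLINE: Duration = Duration::from_secs(120);

    match tokio::time::timeout(
        HALT_DEADLINE,
        wait_for_haltable_state(zone_name),
    )
    .await
    {
        // The zone reached a state where halt/uninstall can proceed.
        Ok(Ok(())) => Ok(()),
        // The polling itself failed (e.g. looking up the zone errored).
        Ok(Err(e)) => Err(format!("failed waiting on zone {zone_name}: {e}")),
        // The zone never left Down/ShuttingDown within the deadline;
        // return an error rather than looping forever.
        Err(_elapsed) => Err(format!(
            "zone {zone_name} did not become haltable within {HALT_DEADLINE:?}"
        )),
    }
}
```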
Ok(Some(zone)) => {
    info!(
        self.log,
        "Propolis zone is no longer running!";
        "error" => %e,
        "zone" => %self.zone_name,
        "zone_state" => ?zone.state(),
    );
    Ok(InstanceMonitorUpdate::ZoneGone)
}
I believe this is already true - but to confirm, `monitor` can only be called after the VMM's zone was initialized, correct? (e.g., there's no concern about racing with zone startup here, right?)
That's correct, the function that starts the state monitor task is called with a `RunningZone`, which is only returned if the zone has already reached running.
illumos-utils/src/zone.rs
Outdated
@smklein what do you think of this change?
Ok(Ok(_)) => {}
Ok(Err(e)) => panic!("{e}"),
Err(_) => {
    panic!("Zone {zname:?} could not be halted within 5 minutes")
}
I don't love that we panic here, but we were `unwrap`ping it previously, so...this preserves the current behavior, at least...
agreed. We can, and probably should, make a less hostile solution here, but I agree that at least this part of the change is a lateral movement!
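As a sketch of that less hostile direction (not code from this PR, and using simplified, hypothetical stand-in types rather than sled-agent's real ones), the halt failure or timeout could be propagated as an error instead of panicking:

```rust
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug)]
struct HaltError(String);

/// Hypothetical stand-in for the real halt/uninstall path
/// (`Zones::halt_and_remove_logged` in illumos-utils).
async fn halt_and_remove(_zname: &str) -> Result<(), HaltError> {
    unimplemented!("stand-in for the real halt/uninstall call")
}

/// Sketch: surface halt failures to the caller instead of panicking.
async fn terminate_zone(zname: &str) -> Result<(), HaltError> {
    const HALT_TIMEOUT: Duration = Duration::from_secs(300);
    match tokio::time::timeout(HALT_TIMEOUT, halt_and_remove(zname)).await {
        // The zone was halted and removed.
        Ok(Ok(())) => Ok(()),
        // The halt itself failed; pass the error up.
        Ok(Err(e)) => Err(e),
        // Timed out; report that as an error, too.
        Err(_elapsed) => Err(HaltError(format!(
            "zone {zname:?} could not be halted within {HALT_TIMEOUT:?}"
        ))),
    }
}
```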
@@ -69,7 +69,25 @@ pub struct AdmError {
     op: Operation,
     zone: String,
     #[source]
-    err: zone::ZoneError,
+    err: AdmErrorKind,
It was necessary to include a structured enum error variant for the case of "zone is `ShuttingDown` or `Down`" so that it could be differentiated from other, less-retryable zoneadm errors. Initially, I was going to have `halt_and_remove(_logged)?` return an enum of either `InvalidState` or `AdmError` to indicate that, but `install_omicron_zone` calls `halt_and_remove_logged` and bubbles up the error, and that returns an `AdmError`.

I don't love hacking up every place we construct an `AdmError` just to handle this one case, but it seemed like the best solution I could come up with in a pinch. Alternatively, `install_omicron_zone` could return a boxed error type of either a `HaltError` or a normal `AdmError`, but that also felt inconsistent with every other function here returning `AdmError`...I could be convinced either way.