
releasing CSI claim after node GC can result in loop in volumewatcher #25349

Open
tgross opened this issue Mar 11, 2025 · 3 comments

tgross (Member) commented Mar 11, 2025

When a node has been GC'd, a CSI volume claim can apparently get stuck when the volumewatcher attempts to release it, leaving the volumewatcher in a loop. The workaround is to temporarily re-register the volume (with a different node ID?).

Logs look like the following:

```
2025-03-11T17:01:50.738Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-0[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
  |

2025-03-11T17:01:50.746Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-0[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
  |

2025-03-11T17:01:50.751Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-1[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
```

Ref: https://hashicorp.atlassian.net/browse/NET-12298
(reported internally by an ENT customer)

ygersie (Contributor) commented Mar 11, 2025

Hey @tgross, thanks for creating the issue. Yes, a forced CSI volume re-registration is a workaround to get out of this loop.

tgross self-assigned this Mar 11, 2025
tgross (Member, Author) commented Mar 11, 2025

It looks like we're hitting this when we're trying to send the controller RPC to detach the volume and need to tell the controller what node (from the storage provider's perspective, e.g. the AWS EC2 instance ID) the volume is on. The node RPC step already handles this gracefully, assuming that a GC'd node will have already done what it can to unmount the volume (Nomad can't really do anything about that at that point). But the controller RPC doesn't include the same logic to bail out early. That should be a fairly easy fix; it just needs a little testing.
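A minimal Go sketch of the early-exit idea described above (all names here are hypothetical stand-ins, not Nomad's actual internals):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the internal types involved; illustrative only.
var errUnknownNode = errors.New("unknown node")

type claim struct{ nodeID string }

// lookupExternalNodeID stands in for the lookup that maps a Nomad node ID to
// the storage provider's node ID (e.g. an EC2 instance ID). For a GC'd node
// it fails, which is what produces the "missing external node ID" error.
func lookupExternalNodeID(nodeID string) (string, error) {
	return "", fmt.Errorf("missing external node ID: %w", errUnknownNode)
}

// controllerDetach sketches the early-exit guard: if the node has been GC'd,
// there is nothing useful to tell the controller plugin, so give up cleanly
// rather than returning an error that re-triggers the watcher.
func controllerDetach(c *claim) error {
	externalID, err := lookupExternalNodeID(c.nodeID)
	if err != nil {
		if errors.Is(err, errUnknownNode) {
			return nil // node is gone; nothing left to detach against
		}
		return err
	}
	_ = externalID // send the controller unpublish RPC here
	return nil
}

func main() {
	fmt.Println(controllerDetach(&claim{nodeID: "2fc5938a"})) // <nil>
}
```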

What's a little trickier is the volumewatcher getting into a tight loop. We have to checkpoint the claim when we unpublish, which writes to state. That state write causes the blocking query in the volumewatcher to unblock, which is normally what we want so that we queue up another pass to make sure everything's been cleaned up in the case of an error. But we're not applying appropriate rate-limiting of attempts. I'll need to do that too.
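A minimal sketch of the rate-limiting idea, assuming a hypothetical watcher loop that wakes on state changes (not Nomad's actual volumewatcher code):

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// watcher is an illustrative stand-in: even if the blocking query unblocks
// immediately after each checkpoint write, a pass should not start more often
// than the limiter allows.
type watcher struct {
	limiter *rate.Limiter
	updates chan struct{} // fires when the blocking query sees a state change
}

func newWatcher() *watcher {
	return &watcher{
		// at most one pass per second, with a burst of one
		limiter: rate.NewLimiter(rate.Every(time.Second), 1),
		updates: make(chan struct{}, 1),
	}
}

func (w *watcher) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-w.updates:
			// Throttle: our own checkpoint writes unblock the query, so
			// without this the error path becomes a tight loop.
			if err := w.limiter.Wait(ctx); err != nil {
				return
			}
			w.reconcile()
		}
	}
}

func (w *watcher) reconcile() { /* release claims, checkpoint, etc. */ }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	w := newWatcher()
	w.updates <- struct{}{}
	w.run(ctx)
}
```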

(Longer term I'd like to build out something more like what we have for the eval broker here, so that we're pulling changes rather than getting them pushed, but that's a much bigger re-architecture that was waiting on seeing how dynamic host volumes got implemented in case they needed to use the same machinery. They don't.)

tgross added this to the 1.10.0 milestone Mar 11, 2025
ygersie (Contributor) commented Mar 12, 2025

Just for context, I've run into this before, as described here. It triggers right after a new leader election.
