
releasing CSI claim after node GC can result in loop in volumewatcher #25349

Open
tgross opened this issue Mar 11, 2025 · 3 comments

tgross (Member) commented Mar 11, 2025

When a node has been GC'd, a CSI volume claim can apparently get stuck when the volumewatcher attempts to release it, leaving the volumewatcher in a loop. The workaround is to temporarily re-register the volume (with a different node ID?).

Logs look like the following:

```
2025-03-11T17:01:50.738Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-0[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
  |

2025-03-11T17:01:50.746Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-0[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
  |

2025-03-11T17:01:50.751Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=my-namespace volume_id=redis-shard-1[0]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: 2fc5938a-02d1-57d0-6d6c-b7766419c059
```

Ref: https://hashicorp.atlassian.net/browse/NET-12298
(reported internally by an ENT customer)

ygersie (Contributor) commented Mar 11, 2025

Hey @tgross, thanks for creating the issue. Yes, a forced CSI volume re-registration is a workaround to get out of this loop.

tgross self-assigned this Mar 11, 2025
tgross (Member, Author) commented Mar 11, 2025

It looks like we're hitting this when we're trying to send the controller RPC to detach the volume and need to tell the controller what node (from the storage provider's perspective, e.g. the AWS EC2 instance ID) the volume is on. The node RPC step already handles this gracefully, assuming that a GC'd node will have already done what it can to unmount the volume (Nomad can't really do anything about that at that point). But the controller RPC doesn't include the same logic to bail out early. That should be a fairly easy fix; it just needs a little testing.
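A minimal Go sketch of the early-exit idea described above (all names here are hypothetical stand-ins, not Nomad's actual internals):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the internal types involved; illustrative only.
var errUnknownNode = errors.New("unknown node")

type claim struct{ nodeID string }

// lookupExternalNodeID stands in for the lookup that maps a Nomad node ID to
// the storage provider's node ID (e.g. an EC2 instance ID). For a GC'd node
// it fails, which is what produces the "missing external node ID" error.
func lookupExternalNodeID(nodeID string) (string, error) {
	return "", fmt.Errorf("missing external node ID: %w", errUnknownNode)
}

// controllerDetach sketches the early-exit guard: if the node has been GC'd,
// there is nothing useful to tell the controller plugin, so give up cleanly
// rather than returning an error that re-triggers the watcher.
func controllerDetach(c *claim) error {
	externalID, err := lookupExternalNodeID(c.nodeID)
	if err != nil {
		if errors.Is(err, errUnknownNode) {
			return nil // node is gone; nothing left to detach against
		}
		return err
	}
	_ = externalID // send the controller unpublish RPC here
	return nil
}

func main() {
	fmt.Println(controllerDetach(&claim{nodeID: "2fc5938a"})) // <nil>
}
```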

What's a little trickier is the volumewatcher getting into a tight loop. We have to checkpoint the claim when we unpublish, which writes to state. That state write causes the blocking query in the volumewatcher to unblock, which is normally what we want so that we queue up another pass to make sure everything's been cleaned up in the case of an error. But we're not applying appropriate rate-limiting of attempts. I'll need to do that too.
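A minimal sketch of the rate-limiting idea, assuming a hypothetical watcher loop that wakes on state changes (not Nomad's actual volumewatcher code):

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// watcher is an illustrative stand-in: even if the blocking query unblocks
// immediately after each checkpoint write, a pass should not start more often
// than the limiter allows.
type watcher struct {
	limiter *rate.Limiter
	updates chan struct{} // fires when the blocking query sees a state change
}

func newWatcher() *watcher {
	return &watcher{
		// at most one pass per second, with a burst of one
		limiter: rate.NewLimiter(rate.Every(time.Second), 1),
		updates: make(chan struct{}, 1),
	}
}

func (w *watcher) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-w.updates:
			// Throttle: our own checkpoint writes unblock the query, so
			// without this the error path becomes a tight loop.
			if err := w.limiter.Wait(ctx); err != nil {
				return
			}
			w.reconcile()
		}
	}
}

func (w *watcher) reconcile() { /* release claims, checkpoint, etc. */ }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	w := newWatcher()
	w.updates <- struct{}{}
	w.run(ctx)
}
```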

(Longer term I'd like to build out something more like what we have for the eval broker here, so that we're pulling changes rather than getting them pushed, but that's a much bigger re-architecture that was waiting on seeing how dynamic host volumes got implemented in case they needed to use the same machinery. They don't.)

tgross added this to the 1.10.0 milestone Mar 11, 2025
ygersie (Contributor) commented Mar 12, 2025

Just for context, I've run into this before, as described here. It triggers right after a new leader election.
