When a node has been GC'd, it appears that a CSI volume claim can get stuck when the volumewatcher attempts to release its claims, leaving the volumewatcher in a loop. The workaround is to temporarily re-register the volume (with a different node ID?).
It looks like we're hitting this when we're trying to send the controller RPC to detach the volume and need to tell the controller which node (from the storage provider's perspective, e.g. the AWS EC2 instance ID) the volume is on. The node RPC step already handles this gracefully, on the assumption that a GC'd node will have already done what it can to unmount the volume (Nomad can't really do anything about that at that point). But the controller RPC doesn't include the same logic to bail out early. That should be a fairly easy fix; it just needs a little testing.
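The early-bailout could look something like the sketch below. All the names here (`lookupNode`, `controllerDetach`, `sendControllerRPC`) are hypothetical stand-ins, not Nomad's real state store or RPC API; the point is just the guard before the controller RPC.

```go
package main

import "fmt"

// Node is an illustrative stand-in for a Nomad client node record.
type Node struct{ ID string }

// lookupNode simulates a state-store lookup that returns nil for a GC'd node.
func lookupNode(nodes map[string]*Node, id string) *Node { return nodes[id] }

// controllerDetach sketches the proposed fix: if the claim's node has been
// GC'd, skip the controller unpublish RPC instead of retrying forever,
// mirroring what the node RPC step already does.
func controllerDetach(nodes map[string]*Node, nodeID string) error {
	node := lookupNode(nodes, nodeID)
	if node == nil {
		// Node was garbage-collected: we can no longer resolve the
		// storage provider's node ID (e.g. the EC2 instance ID), so
		// treat the detach as done rather than loop on the RPC.
		return nil
	}
	return sendControllerRPC(node)
}

// sendControllerRPC stands in for the real controller unpublish call.
func sendControllerRPC(node *Node) error {
	fmt.Printf("detaching volume from node %s\n", node.ID)
	return nil
}

func main() {
	nodes := map[string]*Node{"node-1": {ID: "node-1"}}
	_ = controllerDetach(nodes, "node-1")  // live node: RPC is sent
	_ = controllerDetach(nodes, "node-gc") // GC'd node: bail out early, no RPC
}
```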
What's a little trickier is the volumewatcher getting into a tight loop. We have to checkpoint the claim when we unpublish, which writes to state. That state write causes the blocking query in the volumewatcher to unblock, which is normally what we want, so that we queue up another pass to make sure everything's been cleaned up in the case of an error. But we're not applying appropriate rate-limiting to these attempts. I'll need to fix that too.
(Longer term I'd like to build out something more like what we have for the eval broker here, so that we're pulling changes rather than getting them pushed, but that's a much bigger re-architecture that was waiting on seeing how dynamic host volumes got implemented in case they needed to use the same machinery. They don't.)
Logs look like the following:
Ref: https://hashicorp.atlassian.net/browse/NET-12298
(reported internally from an ENT customer)