distributed provisioning: unset "selected-node" for nodes which have no driver running #544
Comments
This isn't ideal because "provisioning" will be started by the central provisioner for all nodes and then must be made to fail for those which do have a driver, which will emit additional events.
This is conceptually very similar to setting […]. If so, then this is probably the right solution for this issue because it avoids the problem entirely. There's a slight race (node has the right labels, is selected for a PVC, labels get removed, driver no longer runs -> PVC stuck), but that should be rare and can be documented as a caveat for admins.
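For reference, a minimal client-go sketch (an assumed helper, not code from external-provisioner) of how "does this node currently run the driver" can be checked via the node's CSINode object, which kubelet updates as the driver's node plugin registers and unregisters; the node and driver names are illustrative:

```go
// Package example: hypothetical helper for checking CSI driver registration.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeHasDriver reports whether the named node's CSINode object lists the
// given CSI driver, i.e. whether the driver's node plugin is registered there.
func nodeHasDriver(ctx context.Context, cs kubernetes.Interface, nodeName, driverName string) (bool, error) {
	csiNode, err := cs.StorageV1().CSINodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName {
			return true, nil
		}
	}
	return false, nil
}
```

Relying on CSINode instead of labels avoids the label-removal race mentioned above, since the entry disappears only when the node plugin actually unregisters.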
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Due to a bug in the scheduler a node with no driver instance might be picked and the volume is stuck in pending as the "no capacity -> reschedule" recovery is never triggered [[0]](kubernetes/kubernetes#122109), [[1]](kubernetes-csi/external-provisioner#544).

See #400

Co-authored-by: lukasmetzner <[email protected]>
Co-authored-by: Julian Tölle <[email protected]>
We modified the response for `NodeGetInfo` to return an additional Topology Segment. We assumed that this only “adds” new info, but in practice it breaks the spec. When trying to schedule a volume to nodes, the container orchestration system should verify that the node fulfills at least one accessible topology of the volume, where “fulfills” means that all supplied segments match. This is not implemented in the same way between Kubernetes and Nomad:

- **Kubernetes**: requirements are fulfilled if the volume specifies a subset of the node's topology
- **Nomad**: requirements are fulfilled if the volume specifies all of the node's topology

We made these changes to work around a bug in the Kubernetes scheduler ([here](kubernetes-csi/external-provisioner#544)) where nodes without the CSI plugin would still be considered for scheduling, but then creating and attaching the volume fails with no automatic reconciliation of this error.
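For illustration, here is a hedged sketch of what such a response looks like at the CSI level, using the CSI spec's Go bindings; the segment keys and values are made up, not the driver's actual labels:

```go
// Package driver: illustrative NodeGetInfo response with an extra topology segment.
package driver

import (
	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeGetInfoResponse builds a NodeGetInfo response whose accessible topology
// contains one extra segment beyond the one used for placement decisions.
func nodeGetInfoResponse(nodeID string) *csi.NodeGetInfoResponse {
	return &csi.NodeGetInfoResponse{
		NodeId: nodeID,
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				// Segment the volumes actually request (illustrative key).
				"csi.example.com/location": "region-a",
				// Additional segment: Kubernetes still matches a volume that
				// only requests the location key (subset match), while Nomad
				// requires the volume to list every segment reported here.
				"csi.example.com/server-type": "shared",
			},
		},
	}
}
```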
When deploying external-provisioner alongside the CSI driver on each node, there is one problem: if the scheduler picks a node which has no driver instance, then the volume is stuck because the usual "no capacity -> reschedule" recovery is never triggered.
A custom scheduler extension and capacity tracking can minimize the risk, but cannot prevent this entirely.
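As a rough illustration only (not the provisioner's actual code), "unsetting selected-node" amounts to removing the `volume.kubernetes.io/selected-node` annotation from the stuck PVC so that scheduling can be retried; a minimal client-go sketch, with names and error handling simplified:

```go
// Package recovery: hypothetical helper that clears the selected-node annotation.
package recovery

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// Annotation set by the scheduler for volumes with delayed binding.
const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

// clearSelectedNode removes the annotation via a JSON merge patch (a null
// value deletes the key), allowing the PVC to be scheduled onto another node.
func clearSelectedNode(ctx context.Context, cs kubernetes.Interface, namespace, pvcName string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:null}}}`, selectedNodeAnnotation))
	_, err := cs.CoreV1().PersistentVolumeClaims(namespace).Patch(
		ctx, pvcName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

A merge patch is used here instead of an update so that concurrent changes to other PVC fields are not overwritten.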
Possible solutions: