FlinkDeployment health interpreter does not account for ImagePullBackOff Error #6023
Comments
Apologies, but I think I spoke too soon. This morning I had only looked at the deployment without verifying the ResourceBinding. I just tried this again and noticed that the ResourceBinding correctly reflects that the FlinkDeployment is |
What's shown on the status part in this case? And can you remind me why you think the health interpreter should NOT determine the status is |
Kindly ping @deefreak and @bharathguvvala for the confirmation. Link to the interpreter logic here: Lines 10 to 21 in d79c600
Here is an example:
Using the existing health interpreter, this FlinkDeployment would return |
I consider this healthy since the error is due to human error rather than a cluster-wide problem (for example, if they've defined the image ref incorrectly). If there is a cluster-wide issue pulling images, then perhaps that is a different story, but I'm not sure we have enough granularity in the existing status to differentiate those cases.
Yeah, it makes sense to me. Another possible reason is an image pull timeout due to network issues.
What happened:
Tests done by @deefreak and @bharathguvvala show that the current FlinkDeployment health interpreter determines that the FlinkDeployment is unhealthy when it has an ImagePullBackOff Error.
I have confirmed this behavior and identified the root cause: the interpreter does not account for this specific error case.
What you expected to happen:
The FlinkDeployment health interpreter should only mark FlinkDeployments unhealthy if they are unable to be scheduled (meaning they are stuck in a RECONCILING or CREATED state, unrelated to a jobManagerDeploymentStatus of ERROR).
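To make the expected behavior concrete, here is a rough Python sketch of one possible reading of the rule above. It is not the Karmada interpreter code. The function name `is_healthy`, the `status.jobStatus.state` field path, and the treatment of a missing status as healthy are all assumptions made for illustration; only `jobManagerDeploymentStatus`, `ERROR`, `RECONCILING` and `CREATED` come from this issue.

```python
# Job states that this issue treats as "unable to be scheduled".
UNSCHEDULED_JOB_STATES = {"RECONCILING", "CREATED"}


def is_healthy(flink_deployment: dict) -> bool:
    """Hypothetical health check; field paths below are assumed, not confirmed."""
    status = flink_deployment.get("status") or {}
    jm_status = status.get("jobManagerDeploymentStatus")
    # Assumption: the job state is exposed at status.jobStatus.state.
    job_state = (status.get("jobStatus") or {}).get("state")

    # A JobManager deployment ERROR (for example ImagePullBackOff caused by a
    # bad image ref) is treated as a user-side problem, so it is not unhealthy.
    if jm_status == "ERROR":
        return True

    # Otherwise, unhealthy only while stuck in an unscheduled state.
    # Assumption: a missing state is treated as healthy in this sketch.
    return job_state not in UNSCHEDULED_JOB_STATES
```

Under this reading, a FlinkDeployment with `jobManagerDeploymentStatus: ERROR` stays healthy even while its job state is still `RECONCILING`, whereas the same job state without the `ERROR` status is reported as unhealthy.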
Karmada Version: v1.12.2