FlinkDeployment health interpreter does not account for ImagePullBackOff Error #6023

Closed
mszacillo opened this issue Jan 7, 2025 · 6 comments · Fixed by #6073
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mszacillo
Contributor

mszacillo commented Jan 7, 2025

What happened:

Tests done by @deefreak and @bharathguvvala show that the current FlinkDeployment health interpreter marks a FlinkDeployment as unhealthy when it hits an ImagePullBackOff error. I have confirmed this behavior and identified the root cause: the interpreter does not account for this specific error case.

What you expected to happen:

The FlinkDeployment health interpreter should only mark FlinkDeployments unhealthy if they are unable to be scheduled (meaning they are stuck in a RECONCILING or CREATED state for reasons unrelated to a jobManagerDeploymentStatus of ERROR).

Karmada Version: v1.12.2

@mszacillo mszacillo added the kind/bug Categorizes issue or PR as related to a bug. label Jan 7, 2025
@mszacillo
Contributor Author

Apologies, but I think I spoke too soon. This morning I had only looked at the deployment without verifying the ResourceBinding. I just tried this again and noticed that the ResourceBinding correctly reflects that the FlinkDeployment is healthy. Not sure if this ticket is necessary, but I will wait for confirmation.

@RainbowMango
Member

> Tests done by @deefreak and @bharathguvvala show that the current FlinkDeployment health interpreter determines that the FlinkDeployment is unhealthy when it has an ImagePullBackOff Error.

What's shown on the status part in this case?

And can you remind me why you think the health interpreter should NOT determine the status is unhealthy? Do you mean the image might be pulled successfully after a backoff period?

@RainbowMango
Member

Kindly ping @deefreak and @bharathguvvala for the confirmation.

Link to the interpreter logic here:

healthInterpretation:
  luaScript: >
    function InterpretHealth(observedObj)
      if observedObj.status ~= nil and observedObj.status.jobStatus ~= nil then
        if observedObj.status.jobStatus.state ~= 'CREATED' and observedObj.status.jobStatus.state ~= 'RECONCILING' then
          return true
        else
          return observedObj.status.jobManagerDeploymentStatus == 'ERROR'
        end
      end
      return false
    end

@mszacillo
Contributor Author

mszacillo commented Jan 9, 2025

> What's shown on the status part in this case?

Here is an example:

status:
  clusterInfo: {}
  error: '{"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"rpc
    error: code = NotFound desc = failed to pull and unpack image \"test-image":
    failed to resolve reference \"test-image:
    not found","additionalMetadata":{"reason":"ErrImagePull"},"throwableList":[]}'
  jobManagerDeploymentStatus: ERROR
  jobStatus:
    jobId: 789522148c5fca4b2f6f45227655030e
    savepointInfo:
      lastPeriodicSavepointTimestamp: 0
      savepointHistory: []
    state: RECONCILING
  lifecycleState: DEPLOYED

With the existing health interpreter, this FlinkDeployment is reported as healthy: it was able to be scheduled, and the failure comes from an incorrect image reference defined by the user. In the interpreter above, when jobStatus.state is RECONCILING or CREATED but observedObj.status.jobManagerDeploymentStatus == 'ERROR', we return healthy.
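
To make the branch logic concrete, here is a minimal sketch that runs the quoted InterpretHealth function in plain Lua against a trimmed-down version of the status above. The flinkDeployment helper is hypothetical, and the DEPLOYING, READY, and RUNNING values are assumptions chosen for contrast; only the RECONCILING + ERROR case comes from this issue.

```lua
-- InterpretHealth as quoted from the interpreter earlier in this thread.
function InterpretHealth(observedObj)
  if observedObj.status ~= nil and observedObj.status.jobStatus ~= nil then
    if observedObj.status.jobStatus.state ~= 'CREATED' and observedObj.status.jobStatus.state ~= 'RECONCILING' then
      return true
    else
      return observedObj.status.jobManagerDeploymentStatus == 'ERROR'
    end
  end
  return false
end

-- Hypothetical helper (not part of Karmada): builds an observed object
-- containing only the two fields the script actually reads.
local function flinkDeployment(state, jmStatus)
  return {
    status = {
      jobManagerDeploymentStatus = jmStatus,
      jobStatus = { state = state },
    },
  }
end

-- The case from this issue: image pull failed, job stuck in RECONCILING.
print(InterpretHealth(flinkDeployment('RECONCILING', 'ERROR')))     -- true (healthy)

-- Assumed contrast cases, not taken from the issue.
print(InterpretHealth(flinkDeployment('RECONCILING', 'DEPLOYING'))) -- false
print(InterpretHealth(flinkDeployment('RUNNING', 'READY')))         -- true
print(InterpretHealth({}))                                          -- false (no status yet)
```

The first call mirrors the posted status and takes the else branch, so the ERROR comparison is what makes the result healthy.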

@mszacillo
Contributor Author

> And can you remind me why you think the health interpreter should NOT determine the status is unhealthy? Do you mean the image might be pulled successfully after a backoff period?

I consider this healthy since the error is due to human error rather than a cluster-wide problem (for example, if the user has defined the image ref incorrectly). If there is a cluster-wide issue pulling images, then perhaps that is a different story, but I'm not sure the existing status gives us enough granularity to differentiate those cases.

@RainbowMango
Member

> I consider this healthy since the error is due to human-error rather than a cluster-wide problem (if they've defined the image ref incorrectly for example).

Yeah, it makes sense to me. Another possible reason is an image pull timeout due to network issues.
