FlinkDeployment health interpreter does not account for ImagePullBackOff Error #6023

mszacillo · 2025-01-07T21:20:17Z

What happened:

Tests done by @deefreak and @bharathguvvala show that the current FlinkDeployment health interpreter determines that the FlinkDeployment is unhealthy when it has an ImagePullBackOff Error. ~~I have confirmed this behavior and have identified that the root cause is due to the interpreter not accounting for this specific error case.~~

What you expected to happen:

The FlinkDeployment health interpreter should only mark FlinkDeployments unhealthy if they are unable to be scheduled (meaning they are stuck in a RECONCILING or CREATED state, unrelated to a jobManagerDeploymentStatus of ERROR).

Karmada Version: v1.12.2

mszacillo · 2025-01-08T04:02:54Z

Apologies, but I think I spoke too soon. This morning I had only looked at the deployment without verifying the ResourceBinding. I just tried this again and noticed that the ResourceBinding correctly reflects that the FlinkDeployment is healthy. Not sure if this ticket is necessary, but I will wait for confirmation.

RainbowMango · 2025-01-09T08:10:50Z

> Tests done by @deefreak and @bharathguvvala show that the current FlinkDeployment health interpreter determines that the FlinkDeployment is unhealthy when it has an ImagePullBackOff Error.

What's shown on the status part in this case?

And can you remind me why you think the health interpreter should NOT determine the status is unhealthy? Do you mean the image might be pulled successfully after a backoff period?

RainbowMango · 2025-01-09T08:12:33Z

Kindly ping @deefreak and @bharathguvvala for the confirmation.

Link to the interpreter logic here:

karmada/pkg/resourceinterpreter/default/thirdparty/resourcecustomizations/flink.apache.org/v1beta1/FlinkDeployment/customizations.yaml

Lines 10 to 21 in d79c600

    
    healthInterpretation:
      luaScript: >
        function InterpretHealth(observedObj)
          if observedObj.status ~= nil and observedObj.status.jobStatus ~= nil then
            if observedObj.status.jobStatus.state ~= 'CREATED' and observedObj.status.jobStatus.state ~= 'RECONCILING' then
              return true
            else
              return observedObj.status.jobManagerDeploymentStatus == 'ERROR'
            end
          end
          return false
        end

mszacillo · 2025-01-09T18:40:34Z

> What's shown on the status part in this case?

Here is an example:

status:
  clusterInfo: {}
  error: '{"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"rpc
    error: code = NotFound desc = failed to pull and unpack image \"test-image":
    failed to resolve reference \"test-image:
    not found","additionalMetadata":{"reason":"ErrImagePull"},"throwableList":[]}'
  jobManagerDeploymentStatus: ERROR
  jobStatus:
    jobId: 789522148c5fca4b2f6f45227655030e
    savepointInfo:
      lastPeriodicSavepointTimestamp: 0
      savepointHistory: []
    state: RECONCILING
  lifecycleState: DEPLOYED

Using the existing health interpreter, this FlinkDeployment is reported as healthy: it was able to be scheduled, but the user defined an incorrect image reference. In the above interpreter, if jobStatus.state is RECONCILING or CREATED and observedObj.status.jobManagerDeploymentStatus == 'ERROR', then we return healthy.
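To make that evaluation concrete, here is a minimal sketch that runs the InterpretHealth function quoted earlier in this thread against a few status shapes. The `makeObj` helper and the specific test inputs are assumptions added for illustration; only the two fields the interpreter reads are populated, and the first case mirrors the example status above.

```lua
-- Copy of the InterpretHealth function quoted earlier in this thread.
function InterpretHealth(observedObj)
  if observedObj.status ~= nil and observedObj.status.jobStatus ~= nil then
    if observedObj.status.jobStatus.state ~= 'CREATED' and observedObj.status.jobStatus.state ~= 'RECONCILING' then
      return true
    else
      return observedObj.status.jobManagerDeploymentStatus == 'ERROR'
    end
  end
  return false
end

-- Hypothetical helper (not part of Karmada): builds a table containing only
-- the two status fields that the interpreter inspects.
function makeObj(state, jobManagerDeploymentStatus)
  return {
    status = {
      jobManagerDeploymentStatus = jobManagerDeploymentStatus,
      jobStatus = { state = state },
    },
  }
end

-- Example status from this comment: RECONCILING with a jobManager ERROR.
print(InterpretHealth(makeObj('RECONCILING', 'ERROR')))     -- true
-- Still RECONCILING, but with no jobManager error reported.
print(InterpretHealth(makeObj('RECONCILING', 'DEPLOYING'))) -- false
-- Any state other than CREATED or RECONCILING is treated as healthy.
print(InterpretHealth(makeObj('RUNNING', 'READY')))         -- true
-- No status at all.
print(InterpretHealth({}))                                  -- false
```

The first case is the ImagePullBackOff scenario: the job is stuck in RECONCILING, yet the ERROR deployment status makes the interpreter return true.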

mszacillo · 2025-01-09T18:42:05Z

> And can you remind me why you think the health interpreter should NOT determine the status is unhealthy? Do you mean the image might be pulled successfully after a backoff period?

I consider this healthy since the error is due to human-error rather than a cluster-wide problem (if they've defined the image ref incorrectly for example). If there is a cluster-wide issue pulling images, then perhaps that is a different story, but I'm not sure we have enough granularity in the existing status to differentiate those cases.

RainbowMango · 2025-01-10T04:01:24Z

> I consider this healthy since the error is due to human-error rather than a cluster-wide problem (if they've defined the image ref incorrectly for example).

Yeah, it makes sense to me. Another possible cause is an image pull timeout due to network issues.

mszacillo added the kind/bug Categorizes issue or PR as related to a bug. label Jan 7, 2025

github-project-automation bot added this to Karmada Overall Backlog Jan 7, 2025

mszacillo mentioned this issue Jan 21, 2025

Updating FlinkDeployment interpreter to display error status, improving health interpreter #6073

Merged

karmada-bot closed this as completed in #6073 Feb 5, 2025
