OutlierDetection might require longer time for an ejected host to prove itself #37602

spacewander · 2024-12-10T10:31:07Z

If you are reporting any crash or any potential security issue, do not
open an issue in this repo. Please report the issue via emailing
[email protected] where the issue will be triaged appropriately.

Title: OutlierDetection might require longer time for an ejected host to prove itself

Description:

Describe the issue.

envoy/source/common/upstream/outlier_detection_impl.cc

Lines 371 to 375 in f5663c1

    
             if ((min(base_eject_time * monitor->ejectTimeBackoff(), max_eject_time) + jitter) <= 
        
                 (now - monitor->lastEjectionTime().value())) { 
        
               unejectHost(host); 
        
             } 
        
           }

As base_eject_time * monitor->ejectTimeBackoff() is expected to be smaller than max_eject_time, we can get a result that maxEjectTimeBackoff should be not greater than upper_bound(max_eject_time / base_eject_time).

Assumed max_eject_time = 7, base_eject_time = 2, so maxEjectTimeBackoff = 4.

However, according to

envoy/source/common/upstream/outlier_detection_impl.cc

Lines 516 to 520 in f5663c1

    
           if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) < 
        
               (max_eject_time + base_eject_time)) { 
        
             host_monitors_[host]->ejectTimeBackoff()++; 
        
           }

the maxEjectTimeBackoff could be 5. Since when maxEjectTimeBackoff = 4, the base_eject_time * monitor->ejectTimeBackoff() already be larger than max_eject_time, I am not sure whether it's by design to continue increasing the ejectTimeBackoff to 5.

A larger maxEjectTimeBackoff will require the host to spend a longer time to decrease the ejectTimeBackoff back to zero.

Change the condition to:

if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) < 
     max_eject_time) { 
   host_monitors_[host]->ejectTimeBackoff()++; 
 }

can make the maxEjectTimeBackoff equal to the one calculated from the max_eject_time cap.

The text was updated successfully, but these errors were encountered:

wbpcode · 2024-12-10T17:17:21Z

I think it's by design. If a host is ejected frequently, then the eject time for this host will relatively longer. But the eject time will never longer than the max_eject time.

spacewander · 2024-12-11T03:07:02Z

There is no doubt that the eject time for the host should be longer. Now the question is, how long is good enough?

Let's do some math work. Define variables:
max_eject_time, base_eject_time, maxEjectTimeBackoff (ejectTimeBackoff higher than this doesn't contribute to the real eject time as it's higher than max_eject_time), realMaxEjectTimeBackoff.

maxEjectTimeBackoff = upper_bound(max_eject_time / base_eject_time)

if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) < 
     (max_eject_time + base_eject_time)) { 
   host_monitors_[host]->ejectTimeBackoff()++; 
 }

realMaxEjectTimeBackoff is the max host_monitors_[host]->ejectTimeBackoff(), which is equal to realMaxEjectTimeBackoff = upper_bound(max_eject_time / base_eject_time) + 1.

So there will always be a one-off that doesn't contribute to the max ejection.

BTW, I wonder why outlier detection doesn't use exponential backoff, which is more natural for continuously failing nodes.

wbpcode · 2024-12-11T03:26:20Z

BTW, I wonder why outlier detection doesn't use exponential backoff, which is more natural for continuously failing nodes.

I think just because there is no strong requirement to that so there is no related implementation.

github-actions · 2025-01-10T04:01:36Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions · 2025-01-17T08:01:31Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

spacewander added the triage Issue requires triage label Dec 10, 2024

wbpcode added area/outlier_detection and removed triage Issue requires triage labels Dec 10, 2024

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 10, 2025

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OutlierDetection might require longer time for an ejected host to prove itself #37602

OutlierDetection might require longer time for an ejected host to prove itself #37602

spacewander commented Dec 10, 2024

wbpcode commented Dec 10, 2024 •

edited

Loading

spacewander commented Dec 11, 2024

wbpcode commented Dec 11, 2024

github-actions bot commented Jan 10, 2025

github-actions bot commented Jan 17, 2025

OutlierDetection might require longer time for an ejected host to prove itself #37602

OutlierDetection might require longer time for an ejected host to prove itself #37602

Comments

spacewander commented Dec 10, 2024

wbpcode commented Dec 10, 2024 • edited Loading

spacewander commented Dec 11, 2024

wbpcode commented Dec 11, 2024

github-actions bot commented Jan 10, 2025

github-actions bot commented Jan 17, 2025

wbpcode commented Dec 10, 2024 •

edited

Loading