Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OutlierDetection might require longer time for an ejected host to prove itself #37602

Closed
spacewander opened this issue Dec 10, 2024 · 5 comments
Labels
area/outlier_detection stale stalebot believes this issue/PR has not been touched recently

Comments

@spacewander
Copy link
Contributor

If you are reporting any crash or any potential security issue, do not
open an issue in this repo. Please report the issue via emailing
[email protected] where the issue will be triaged appropriately.

Title: OutlierDetection might require longer time for an ejected host to prove itself

Description:

Describe the issue.

if ((min(base_eject_time * monitor->ejectTimeBackoff(), max_eject_time) + jitter) <=
(now - monitor->lastEjectionTime().value())) {
unejectHost(host);
}
}

As base_eject_time * monitor->ejectTimeBackoff() is expected to be smaller than max_eject_time, we can get a result that maxEjectTimeBackoff should be not greater than upper_bound(max_eject_time / base_eject_time).

Assumed max_eject_time = 7, base_eject_time = 2, so maxEjectTimeBackoff = 4.

However, according to

if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) <
(max_eject_time + base_eject_time)) {
host_monitors_[host]->ejectTimeBackoff()++;
}

the maxEjectTimeBackoff could be 5. Since when maxEjectTimeBackoff = 4, the base_eject_time * monitor->ejectTimeBackoff() already be larger than max_eject_time, I am not sure whether it's by design to continue increasing the ejectTimeBackoff to 5.

A larger maxEjectTimeBackoff will require the host to spend a longer time to decrease the ejectTimeBackoff back to zero.

Change the condition to:

if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) < 
     max_eject_time) { 
   host_monitors_[host]->ejectTimeBackoff()++; 
 } 

can make the maxEjectTimeBackoff equal to the one calculated from the max_eject_time cap.

@spacewander spacewander added the triage Issue requires triage label Dec 10, 2024
@wbpcode
Copy link
Member

wbpcode commented Dec 10, 2024

I think it's by design. If a host is ejected frequently, then the eject time for this host will relatively longer. But the eject time will never longer than the max_eject time.

@wbpcode wbpcode added area/outlier_detection and removed triage Issue requires triage labels Dec 10, 2024
@spacewander
Copy link
Contributor Author

There is no doubt that the eject time for the host should be longer. Now the question is, how long is good enough?

Let's do some math work. Define variables:
max_eject_time, base_eject_time, maxEjectTimeBackoff (ejectTimeBackoff higher than this doesn't contribute to the real eject time as it's higher than max_eject_time), realMaxEjectTimeBackoff.

maxEjectTimeBackoff = upper_bound(max_eject_time / base_eject_time)

if ((host_monitors_[host]->ejectTimeBackoff() * base_eject_time) < 
     (max_eject_time + base_eject_time)) { 
   host_monitors_[host]->ejectTimeBackoff()++; 
 }

realMaxEjectTimeBackoff is the max host_monitors_[host]->ejectTimeBackoff(), which is equal to realMaxEjectTimeBackoff = upper_bound(max_eject_time / base_eject_time) + 1.

So there will always be a one-off that doesn't contribute to the max ejection.

BTW, I wonder why outlier detection doesn't use exponential backoff, which is more natural for continuously failing nodes.

@wbpcode
Copy link
Member

wbpcode commented Dec 11, 2024

BTW, I wonder why outlier detection doesn't use exponential backoff, which is more natural for continuously failing nodes.

I think just because there is no strong requirement to that so there is no related implementation.

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 10, 2025
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/outlier_detection stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

2 participants