Don't retry failures because of execution timeouts #9232

Open
OleksandrBerchenko opened this issue Jun 11, 2020 · 1 comment
Labels
area:core area:Scheduler including HA (high availability) scheduler kind:feature Feature Requests

Comments

@OleksandrBerchenko
Description

When a task reaches execution_timeout, it fails but is still retried. Either a timed-out task should not be retried, or it should be possible to define a separate timeout for the "total" task execution that covers all retries.
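To make the effect concrete, here is a rough back-of-the-envelope sketch of how long a task can occupy a slot when every attempt runs into the timeout. The helper name `worst_case_runtime` is made up for illustration; the parameters mirror the operator arguments `execution_timeout`, `retries` and `retry_delay`. It assumes a fixed retry delay and ignores scheduling latency and exponential backoff.

```python
from datetime import timedelta


def worst_case_runtime(execution_timeout, retries, retry_delay=timedelta(0)):
    """Upper bound on wall-clock time if every attempt hits execution_timeout.

    Hypothetical helper, not part of Airflow: one initial attempt plus
    `retries` further attempts, with `retry_delay` between attempts.
    """
    attempts = retries + 1
    return attempts * execution_timeout + retries * retry_delay


bound = worst_case_runtime(
    execution_timeout=timedelta(hours=1),
    retries=3,
    retry_delay=timedelta(minutes=5),
)
print(bound)  # 4:15:00
```

So a task that is supposed to be cut off after one hour can, under these assumptions, keep going for more than four hours.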

Use case / motivation

In our case, the current behavior makes the execution_timeout feature useless: we have retries in place to guard against random issues such as network connectivity problems. At the same time, we want to make sure that tasks don't run for too long, and execution_timeout combined with retries just makes them run even longer.
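The semantics being asked for could look roughly like the sketch below. This is plain Python, not Airflow code: `ExecutionTimeout`, `run_with_retries` and the `clock` parameter are all hypothetical names used only to illustrate the idea. Transient errors are retried, a timeout is treated as final, and all attempts share one overall time budget.

```python
import time


class ExecutionTimeout(Exception):
    """Hypothetical marker for 'this attempt ran out of time'."""


def run_with_retries(task, retries, total_timeout, clock=time.monotonic):
    """Retry transient failures, but never past one shared deadline.

    `task` is called with the number of seconds still available and is
    expected to raise ExecutionTimeout if it cannot finish within them.
    """
    deadline = clock() + total_timeout
    attempts = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise ExecutionTimeout("total time budget exhausted")
        attempts += 1
        try:
            return task(remaining)
        except ExecutionTimeout:
            raise  # a timeout is final: do not retry
        except Exception:
            if attempts > retries:
                raise  # retries for transient errors are used up
```

With this shape, a flaky network call still gets its retries, while a task that is simply too slow fails once and stays failed.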

See also https://stackoverflow.com/questions/53830604/airflow-execution-timeout-resetting-every-retry: one more request for the same feature.

@OleksandrBerchenko OleksandrBerchenko added the kind:feature Feature Requests label Jun 11, 2020
@boring-cyborg
boring-cyborg bot commented Jun 11, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

rwitzel added a commit to rwitzel/airflow that referenced this issue Oct 27, 2020
Why? If the Airflow scheduler is restarted (e.g. due to
rollout of a new Kubernetes version), then the timeout
behaviour of a sensor should not be affected.
But before the code change, the timer would start from
zero again when the sensor is retried. This was unexpected.

Solution: When a sensor is retried, it uses the start
date of the earliest try to decide whether it has
timed out. To stay backwards-compatible, the new behaviour
is only active when explicitly enabled for that sensor.

Note: The exponential backoff feature for poking still uses
the start date of the current try. This is to keep the
code change small. No issues expected from that.

related: apache#9232 (the linked issue cares about execution_timeout for tasks
in general, not only sensors)
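
The idea in the commit message above can be sketched as follows. This is an illustration only, not the code from the referenced commit: the function `sensor_timed_out` and the opt-in flag `use_first_try` are hypothetical names, and the current try is assumed to be the one with the latest start date.

```python
from datetime import datetime, timedelta


def sensor_timed_out(try_start_dates, timeout, now, use_first_try=False):
    """Decide whether a sensor has exceeded its timeout.

    try_start_dates: start date of every try so far (assumed non-empty).
    use_first_try:   hypothetical opt-in flag; when True the timer runs
                     from the earliest try instead of the current one.
    """
    if use_first_try:
        started = min(try_start_dates)  # timer survives retries
    else:
        started = max(try_start_dates)  # timer restarts on each retry
    return now - started > timeout


first = datetime(2020, 10, 27, 12, 0)
tries = [first, first + timedelta(minutes=50)]  # retried after 50 minutes
now = first + timedelta(minutes=70)

print(sensor_timed_out(tries, timedelta(hours=1), now))                      # False
print(sensor_timed_out(tries, timedelta(hours=1), now, use_first_try=True))  # True
```

With the default, the retried sensor has only "used" 20 minutes of its hour; measured from the earliest try it is already 10 minutes over.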
@jscheffl jscheffl added area:Scheduler including HA (high availability) scheduler area:core labels Aug 21, 2023