Backfilling ignores task-level start_date and end_date #17734

Open
SDubrulle opened this issue Aug 19, 2021 · 6 comments
Labels: affected_version:2.1, airflow3.0:candidate, area:backfill, kind:bug
Milestone: Airflow 3.0.0

Comments

SDubrulle commented Aug 19, 2021

Apache Airflow version:

2.1.0

OS:

Debian GNU/Linux 10

Apache Airflow Provider versions:

Irrelevant for this issue.

Deployment:

K8S using a custom helm chart

What happened:

When backfilling a DAG with the --mark-success flag, every task within the specified time range is set to success, regardless of the task's own start_date and end_date.

What you expected to happen:

I expected the task-level start_date and end_date to be taken into account. That is, a task instance would only be created if the DAG run's execution date falls within the task's date range.
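
The expected rule can be sketched in plain Python, independent of Airflow. The helper name `task_in_range` is hypothetical, and treating both bounds as inclusive is an assumption made for illustration, not a description of how Airflow implements the check. The dates match the DAG in the reproduction steps.

```python
from datetime import datetime
from typing import Optional


def task_in_range(
    execution_date: datetime,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> bool:
    """Return True if a run's execution date falls inside the task's date range.

    Hypothetical helper: a missing bound is treated as unbounded, and both
    bounds are assumed to be inclusive.
    """
    if start_date is not None and execution_date < start_date:
        return False
    if end_date is not None and execution_date > end_date:
        return False
    return True


# The task-level start_date from the reproduction DAG.
task_start = datetime(2021, 8, 10)

# A run before the task's start_date should not get a task instance.
before = task_in_range(datetime(2021, 8, 5), start_date=task_start)
# A run after the task's start_date should.
after = task_in_range(datetime(2021, 8, 12), start_date=task_start)
print(before, after)
```

Under this rule, a backfill over the whole range would create instances of task_defined_start_date only for runs on or after 2021-08-10.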

How to reproduce it:

DAG:

from datetime import datetime
from airflow.decorators import dag, task

@dag(
    default_args={},
    description="Start date test DAG",
    schedule_interval="0 0 * * *",
    start_date=datetime(2021, 8, 1),
    catchup=True,
)
def start_date_test_dag():

    @task()
    def since_dag_start_date():
        print("Done")

    @task(start_date=datetime(2021, 8, 10))
    def task_defined_start_date():
        print("Done")

    [since_dag_start_date(), task_defined_start_date()]

dag = start_date_test_dag()

Backfill command:

airflow dags backfill -s 2020-08-01 -e 2021-8-18 --mark-success start_date_test_dag


Are you willing to submit a PR?

Sure, if this is unwanted behaviour and there is no workaround (other than backfilling each task individually).

SDubrulle added the kind:bug label Aug 19, 2021

eladkal (Contributor) commented Aug 26, 2021

It makes sense that backfill should respect the task-level start_date and end_date.

As another example, suppose we have a DAG like this:

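from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {}  # placeholder; not defined in the original snippet

with DAG(
    'my_dag',
    default_args=default_args,
    schedule_interval='@daily',
    start_date=datetime(2019, 1, 1),
) as dag:
    # Inherits the DAG-level start_date (2019-01-01).
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    # Task-level start_date is later than the DAG's start_date.
    t2 = BashOperator(
        task_id='print_date2',
        bash_command='date',
        start_date=datetime(2021, 1, 1),
    )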

and you run airflow dags backfill my_dag --start-date 2019-01-01 --end-date 2019-06-01, I wouldn't expect print_date2 to be part of this backfill at all. From my perspective, the backfill --start-date and --end-date specify the range of runs that need to be created; within those runs, it doesn't make sense to include tasks whose own date range lies outside this window.

If someone has a different take on this, I would love to hear the reasons.

uranusjr (Member) commented

I believe this difference originates in the dependency definitions for (manually triggered) backfills and scheduler-triggered runs. In dependencies_deps.py, there's an ExecDateAfterStartDateDep in SCHEDULER_QUEUED_DEPS that skips a task if its start_date is later than the run's logical date (execution_date), but that dependency is not declared in BACKFILL_QUEUED_DEPS.
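
To make the effect of the missing dependency concrete, here is a toy model in plain Python. It does not import Airflow: `ToyTaskInstance`, `exec_date_after_start_date`, `other_deps_met`, and the two `TOY_*` lists are hypothetical stand-ins, and every other dependency in the real sets is collapsed into a single always-true placeholder. It only illustrates how the same task instance can pass one dependency set and fail the other.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional


@dataclass
class ToyTaskInstance:
    """Minimal stand-in for a task instance; not Airflow's TaskInstance."""
    task_id: str
    execution_date: datetime
    task_start_date: Optional[datetime] = None


Dep = Callable[[ToyTaskInstance], bool]


def exec_date_after_start_date(ti: ToyTaskInstance) -> bool:
    """Toy analogue of a start-date dependency: fail if the run predates the task."""
    return ti.task_start_date is None or ti.execution_date >= ti.task_start_date


def other_deps_met(ti: ToyTaskInstance) -> bool:
    """Placeholder for all other dependencies, assumed satisfied here."""
    return True


# Toy dependency sets: only the scheduler set includes the start-date check.
TOY_SCHEDULER_QUEUED_DEPS: List[Dep] = [other_deps_met, exec_date_after_start_date]
TOY_BACKFILL_QUEUED_DEPS: List[Dep] = [other_deps_met]


def deps_met(ti: ToyTaskInstance, deps: List[Dep]) -> bool:
    """A task instance is runnable only if every dependency in the set passes."""
    return all(dep(ti) for dep in deps)


# print_date2 from the example above: task starts in 2021, run is dated 2019.
ti = ToyTaskInstance("print_date2", datetime(2019, 3, 1), datetime(2021, 1, 1))

scheduler_ok = deps_met(ti, TOY_SCHEDULER_QUEUED_DEPS)
backfill_ok = deps_met(ti, TOY_BACKFILL_QUEUED_DEPS)
print(scheduler_ok, backfill_ok)  # prints: False True
```

In this model, adding the start-date check to the backfill list would make both paths agree; whether that is the right fix in Airflow itself is the open question in this thread.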

This difference dates all the way back to #5079, when BACKFILL_QUEUED_DEPS was introduced, and unfortunately no rationale was provided, from what I can tell. Maybe some of the people involved at the time remember something?

/cc @kaxil @ashb

eladkal added the area:backfill label and removed the area:core label Feb 9, 2023
uranusjr (Member) commented

@dstandish one thing to note when reimplementing backfilling with the scheduler.

uranusjr added this to the Airflow 3.0.0 milestone Jul 31, 2024
dstandish (Contributor) commented

> @dstandish one thing to note when reimplementing backfilling with the scheduler.

Thanks, @uranusjr.

This seems to be desirable behavior, no?

kaxil added the airflow3.0:candidate label Jul 31, 2024
uranusjr (Member) commented Aug 7, 2024

Do you mean ignoring is desirable, or respecting? Airflow currently does both: runs triggered by the airflow dags backfill CLI ignore the task-level dates, while runs triggered by the scheduler (via catchup=True) respect them.
