Backfills#

Dagster supports data backfills over every partition in a partition set or over a subset of its partitions.

Relevant APIs#

Name | Description
PartitionSetDefinition | The class to define a partition set.

Overview#

After defining a partition set, you can use a backfill to launch pipeline runs for each partition in the set.


Launching Backfills#

Using Dagit#

You can launch and monitor backfills from the Partitions tab of a pipeline's page in Dagit.

To launch a backfill, click the "Launch backfill" button at the top center of the Partitions tab. This opens the "Launch backfill" modal, which lets you select the set of partitions to launch the backfill over.

Click the button at the bottom right to submit the runs. What happens next depends on your Run Coordinator. With the default run coordinator, the modal closes after all runs have been launched. With the queued run coordinator, the modal closes after all runs have been queued.

After all the runs have been submitted, you'll be returned to the partitions page, with a filter for runs inside the backfill. This refreshes periodically and allows you to see how the backfill is progressing. Boxes become green or red as steps in the backfill runs succeed or fail.

Using the Backfill CLI#

You can also launch backfills using the backfill CLI.

In the Partitions section, we defined a pipeline and a date_partition_set partition set that targeted the pipeline:

from dagster import Partition, PartitionSetDefinition


def get_date_partitions():
    """Every day in the month of May, 2020"""
    return [Partition(f"2020-05-{str(day).zfill(2)}") for day in range(1, 32)]


def run_config_for_date_partition(partition):
    date = partition.value
    return {"solids": {"process_data_for_date": {"config": {"date": date}}}}


date_partition_set = PartitionSetDefinition(
    name="date_partition_set",
    pipeline_name="my_data_pipeline",
    partition_fn=get_date_partitions,
    run_config_fn_for_partition=run_config_for_date_partition,
)
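To make the shape of this partition set concrete, the sketch below reproduces the two functions with plain strings in place of Partition objects, so it runs without Dagster installed. The names date_partition_names and run_config_for_date are stand-ins introduced here for illustration; they are not part of Dagster's API.

```python
def date_partition_names():
    """Every day in May 2020 as YYYY-MM-DD (stand-in for get_date_partitions)."""
    return [f"2020-05-{day:02d}" for day in range(1, 32)]


def run_config_for_date(date):
    """Mirror run_config_for_date_partition, taking the partition value directly."""
    return {"solids": {"process_data_for_date": {"config": {"date": date}}}}


names = date_partition_names()
print(len(names))           # one partition per day of the month
print(names[0], names[-1])  # first and last partition names
print(run_config_for_date(names[0]))
```

Each partition therefore corresponds to one pipeline run whose only varying input is the date passed to the process_data_for_date solid.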

Let's also set up a RepositoryDefinition and a workspace.yaml for this pipeline and partition set:

# repo.py
from dagster import repository


@repository
def my_repository():
    return [
        my_data_pipeline,
        date_partition_set,
    ]

# workspace.yaml
load_from:
  - python_file: repo.py

See details in Repositories and Workspaces if you are not familiar with these Dagster concepts.

Now we can run the command dagster pipeline backfill to execute the backfill. To run all partitions, simply run the command with the arguments to specify the pipeline and partition set.
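As a rough illustration of that invocation, the sketch below assembles the command line from the pipeline and partition set defined above. The helper backfill_command is hypothetical and only builds the argument list; it uses just the -p, --partition-set, and --partitions options that appear on this page, and it does not execute anything.

```python
import shlex


def backfill_command(pipeline, partition_set, partitions=None):
    """Build the argv for `dagster pipeline backfill` (hypothetical helper)."""
    argv = [
        "dagster", "pipeline", "backfill",
        "-p", pipeline,
        "--partition-set", partition_set,
    ]
    if partitions:
        # Optional: restrict the backfill to a comma-separated list of names.
        argv += ["--partitions", ",".join(partitions)]
    return argv


cmd = backfill_command("my_data_pipeline", "date_partition_set")
print(shlex.join(cmd))
```

The resulting list could be handed to subprocess.run if you want to script backfills, assuming the dagster executable is on your PATH and the command is run from the directory containing workspace.yaml.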

Examples#

Executing a subset of partitions#

You can also backfill a subset of the partitions in a partition set.

You can specify the --partitions argument and provide a comma-separated list of partition names you want to backfill:

$ dagster pipeline backfill -p my_data_pipeline --partition-set date_partition_set --partitions 2020-05-01,2020-05-02,2020-05-03

Alternatively, you can specify a range of partitions using the --from and --to arguments:

$ dagster pipeline backfill -p my_data_pipeline --partition-set date_partition_set --from 2020-05-03 --to 2020-05-05
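The two selection modes can be pictured with a small sketch over the date partitions defined earlier. The function select_partitions is a hypothetical helper, not Dagster's implementation, and it assumes that a range follows the partition set's own ordering and includes both endpoints.

```python
def select_partitions(names, partitions=None, start=None, end=None):
    """Return the partition names a backfill would target (hypothetical helper)."""
    if partitions is not None and (start is not None or end is not None):
        raise ValueError("use either an explicit list or a range, not both")
    if partitions is not None:
        wanted = {p.strip() for p in partitions.split(",") if p.strip()}
        unknown = wanted - set(names)
        if unknown:
            raise ValueError(f"unknown partitions: {sorted(unknown)}")
        # Keep the partition set's ordering rather than the order typed.
        return [n for n in names if n in wanted]
    # Assumption: both bounds are inclusive; a missing bound means "open-ended".
    lo = names.index(start) if start is not None else 0
    hi = names.index(end) + 1 if end is not None else len(names)
    return names[lo:hi]


may_2020 = [f"2020-05-{day:02d}" for day in range(1, 32)]
explicit = select_partitions(may_2020, partitions="2020-05-01,2020-05-02,2020-05-03")
ranged = select_partitions(may_2020, start="2020-05-03", end="2020-05-05")
print(explicit)
print(ranged)
```

Under these assumptions, an explicit list and a range are just two ways of naming a slice of the same ordered partition list, which is why only one of them should be supplied at a time.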