Apache Airflow Reference
About Apache Airflow Reference
The Apache Airflow Reference is a searchable cheat sheet covering the core concepts and APIs used to build data pipelines with Apache Airflow. It is organized into six categories: DAG (Directed Acyclic Graph definitions), Operators (PythonOperator, BashOperator, EmailOperator, BranchPythonOperator), Sensors (FileSensor, ExternalTaskSensor, HttpSensor, S3KeySensor), Scheduler (CLI commands for listing, triggering, testing, and backfilling DAGs), Variables (Airflow Variables API and XCom inter-task communication), and Connections (CLI-based connection setup, hook usage with PostgreSQL and S3).
This reference is used by data engineers, pipeline developers, and MLOps practitioners who build automated workflows on Apache Airflow. It covers both the classic DAG constructor pattern and the modern TaskFlow API using the @dag and @task decorators. The code examples use realistic variable names and patterns — for instance, the schedule_interval examples show actual cron expressions alongside the named presets (@daily, @hourly, @weekly), and the BranchPythonOperator example demonstrates date-based conditional logic.
Each entry in the reference includes the Python syntax or CLI command, a concise description of what it does, and a complete, runnable code snippet. Sensor examples show how to configure poke_interval and timeout values. Connection examples show both the CLI airflow connections add command and the programmatic approach using BaseHook.get_connection() and provider-specific hooks (PostgresHook, S3Hook). The XCom section covers both the push/pull pattern and TaskFlow automatic value passing.
Key Features
- Six searchable categories: DAG, Operators, Sensors, Scheduler, Variables, Connections
- DAG definitions: classic constructor pattern and modern @dag/@task TaskFlow API decorators
- Operator coverage: PythonOperator, BashOperator, EmailOperator, BranchPythonOperator with Jinja templates
- Sensor coverage: FileSensor, ExternalTaskSensor, HttpSensor, and S3KeySensor with poke/timeout config
- Scheduler CLI: airflow dags list/trigger/backfill and airflow tasks test with date arguments
- Variables API: Variable.get/set with JSON deserialization and Jinja template syntax
- XCom push/pull pattern for passing data between tasks via the task instance context
- Connection hooks: PostgresHook.get_pandas_df and S3Hook.load_file with conn_id references
Frequently Asked Questions
What is the difference between the classic DAG constructor and the TaskFlow API?
The classic pattern creates a DAG object, then defines operator instances that reference it via dag=dag. The TaskFlow API (@dag and @task decorators) allows you to write Python functions directly as tasks, with XCom passing handled automatically — function return values become the input of downstream tasks.
How do I set task dependencies in Airflow?
Use the >> operator to chain tasks: extract_task >> transform_task >> load_task. For fan-out and fan-in patterns, use lists: [task_a, task_b] >> task_c >> [task_d, task_e]. This sets up the upstream/downstream dependency graph that the scheduler respects.
What is the difference between poke_interval and timeout in sensors?
poke_interval is how often (in seconds) the sensor checks the condition — for example, every 60 seconds. timeout is the maximum total time (in seconds) the sensor will wait before raising a timeout error. Setting a reasonable timeout prevents stuck DAG runs from blocking the scheduler indefinitely.
How does XCom work for passing data between tasks?
A task can push a value using context["ti"].xcom_push(key="result", value=42), and a downstream task retrieves it with context["ti"].xcom_pull(task_ids="push_task", key="result"). In the TaskFlow API, function return values are automatically pushed as XCom and injected as parameters in downstream functions.
What is the purpose of catchup=False in a DAG definition?
When catchup=False, Airflow only runs the DAG for the current schedule interval, not for any missed intervals between start_date and today. Setting catchup=True (the default) would cause Airflow to backfill all missed runs since start_date, which is often undesirable for production pipelines.
How do I store and access secrets like API keys in Airflow?
Use Variable.set("api_key", "your-secret") to store the value, then Variable.get("api_key") to retrieve it at runtime. In Jinja templates, reference it as {{ var.value.api_key }}. For production, configure a secrets backend (AWS Secrets Manager, HashiCorp Vault) to avoid storing secrets in the Airflow metadata DB.
What is the airflow dags backfill command used for?
The backfill command re-runs a DAG over a specified historical date range: airflow dags backfill my_pipeline --start-date 2024-01-01 --end-date 2024-01-31 triggers one run per schedule interval within that range. This is useful after fixing a bug in a DAG that caused failed runs.
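Laid out as a command (the dag_id my_pipeline is the example used above):

```shell
# Re-run my_pipeline for every schedule interval in January 2024.
airflow dags backfill my_pipeline \
    --start-date 2024-01-01 \
    --end-date 2024-01-31
```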
How do I test a single task without running the whole DAG?
Use airflow tasks test <dag_id> <task_id> <execution_date>. For example: airflow tasks test my_pipeline run_python 2024-01-01. This runs the task locally without interacting with the metadata database, making it ideal for rapid iteration during development.
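As a command (dag_id and task_id are the examples used above):

```shell
# Run one task in isolation; no state is written to the metadata DB,
# so it can be re-run freely while iterating on the callable.
airflow tasks test my_pipeline run_python 2024-01-01
```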