
Understanding Apache Airflow DAGs: A Complete Guide

Introduction

Apache Airflow is a leading platform for programmatically authoring, scheduling, and monitoring workflows. At the heart of Airflow lies the concept of DAGs (Directed Acyclic Graphs). DAGs define how tasks are organized, the sequence in which they run, and the logic that binds them together. In this blog, we will explore what DAGs are, their components, and how to create them with practical examples.


What is a DAG in Apache Airflow?

DAG stands for Directed Acyclic Graph. In simple terms, it is a graph where:

  • Directed: Tasks flow in a specific direction (e.g., Task A -> Task B).
  • Acyclic: There are no loops or cycles; a task cannot depend on itself directly or indirectly.

DAGs allow you to design workflows where each task has a clear order of execution based on dependencies.


Structure of a DAG File

An Airflow DAG file is a Python script containing four major parts:

1. Import Statements

You begin by importing necessary modules and operators:

from airflow import DAG
from airflow.operators.bash import BashOperator
import datetime

2. DAG Definition

This section defines the DAG and its configuration:

with DAG(
    dag_id='example_dag',
    start_date=datetime.datetime(2023, 10, 5),
    schedule_interval='@daily',
    catchup=False,
    tags=['example']
) as dag:

Key Parameters of DAG:

  • dag_id: Unique identifier of the DAG.
  • start_date: The date from which the scheduler begins creating DAG runs.
  • schedule_interval: How often the DAG runs (e.g., @daily, @hourly).
  • catchup: Whether to backfill missed runs between start_date and the current date.
  • tags: Metadata tags for categorizing DAGs in the UI.

3. Task Definition

Define individual tasks using operators:

task_1 = BashOperator(
    task_id='print_hello',
    bash_command="echo 'Hello, Airflow!'"
)

task_2 = BashOperator(
    task_id='print_done',
    bash_command="echo 'Task Completed!'"
)

Components of a Task:

  • task_id: Unique ID for the task.
  • bash_command: Command that will run when the task executes.

4. Setting Dependencies

Define the sequence in which tasks should run:

task_1 >> task_2

This means task_2 will run only after task_1 is complete.


Complete Example of a Simple DAG

from airflow import DAG
from airflow.operators.bash import BashOperator
import datetime

with DAG(
    dag_id='simple_airflow_dag',
    start_date=datetime.datetime(2023, 10, 5),
    schedule_interval='@daily',
    catchup=False,
    tags=['example']
) as dag:

    task_1 = BashOperator(
        task_id='print_hello',
        bash_command="echo 'Hello, Airflow!'"
    )

    task_2 = BashOperator(
        task_id='print_done',
        bash_command="echo 'Task Completed!'"
    )

    task_1 >> task_2  # Setting dependency

Advanced Concepts in DAGs

Parallel Tasks

You can run tasks in parallel:

task_1 >> [task_2, task_3]  # Both task_2 and task_3 run after task_1
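Note that task_3 is not one of the tasks defined earlier; as a minimal sketch (under the same assumptions as the complete example above), you could define it and set up the fan-out like this:

task_3 = BashOperator(
    task_id='print_parallel',
    bash_command="echo 'Running alongside print_done'"
)

# Fan-out: task_2 and task_3 both start once task_1 succeeds
task_1 >> [task_2, task_3]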

Dynamic DAGs

Tasks and dependencies can be generated dynamically using Python logic, making workflows flexible and reusable.
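For example, here is a minimal sketch (the table names are made up for illustration, and the loop is assumed to sit inside a with DAG(...) block) that generates one BashOperator per table and chains them in sequence:

previous = None
for table in ['customers', 'orders', 'invoices']:
    load_task = BashOperator(
        task_id=f'load_{table}',               # one task per table
        bash_command=f"echo 'Loading {table}'"
    )
    # Chain the generated tasks so they run one after another
    if previous:
        previous >> load_task
    previous = load_task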

Sensors

You can add sensors to wait for external events, like a file appearing in storage.
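As a rough sketch, the built-in FileSensor can poll for a file before downstream tasks run (the connection id and file path below are assumptions for illustration):

from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id='wait_for_input_file',
    fs_conn_id='fs_default',     # filesystem connection pointing at a base path
    filepath='data/input.csv',   # path relative to that base path
    poke_interval=60,            # check every 60 seconds
    timeout=60 * 60              # give up after one hour
)

wait_for_file >> task_1  # downstream work starts only after the file appears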


Best Practices for Writing DAGs

  1. Use Clear Naming Conventions for DAGs and tasks.
  2. Avoid Heavy Computation in the DAG File; only define structure and dependencies there.
  3. Use Templates for dynamic commands and paths.
  4. Modularize Complex Workflows using TaskGroups (SubDAGs are deprecated in newer Airflow versions).
  5. Set Appropriate Retries and Timeout Policies for robustness (points 3-5 are sketched in the example below).
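Below is a rough sketch combining points 3-5 (the dag_id, commands, and timings are assumptions for illustration): retries and timeouts are set once through default_args, the bash command uses a Jinja template ({{ ds }} renders to the run's logical date), and related tasks are grouped with a TaskGroup.

import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

default_args = {
    'retries': 2,                                        # retry failed tasks twice
    'retry_delay': datetime.timedelta(minutes=5),
    'execution_timeout': datetime.timedelta(minutes=30)  # fail tasks that hang
}

with DAG(
    dag_id='best_practices_dag',
    start_date=datetime.datetime(2023, 10, 5),
    schedule_interval='@daily',
    catchup=False,
    default_args=default_args
) as dag:

    # Jinja template: {{ ds }} is rendered to the logical date at runtime
    extract = BashOperator(
        task_id='extract',
        bash_command="echo 'Extracting data for {{ ds }}'"
    )

    # TaskGroup keeps related tasks together in the code and in the UI
    with TaskGroup(group_id='transform') as transform:
        clean = BashOperator(task_id='clean', bash_command="echo 'Cleaning'")
        enrich = BashOperator(task_id='enrich', bash_command="echo 'Enriching'")
        clean >> enrich

    extract >> transform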

Conclusion

Apache Airflow DAGs provide a powerful way to define and manage complex workflows in a structured, code-driven manner. With clear task definitions and dependencies, Airflow makes automation, monitoring, and scaling of workflows easy and efficient.

By mastering DAGs, you unlock the real power of Apache Airflow to handle everything from simple scripts to complex data pipelines.
