NashTech Blog

Getting started with Airflow

Introduction to Apache Airflow

Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. In Airflow, workflows are defined in Python as directed acyclic graphs (DAGs), where each node is a task and the edges express dependencies between tasks. It is widely used for data engineering and machine learning pipelines; its flexibility makes it possible to orchestrate complex workflows across many platforms and services.

Key Concepts in Airflow

DAG (Directed Acyclic Graph): A collection of tasks with dependencies that define the flow of execution.
Task: A single unit of work or operation, such as running a Python function, a bash command, or a database query.
Operator: Predefined templates for tasks (e.g., PythonOperator, BashOperator).
Scheduler: Monitors DAGs and triggers tasks based on dependencies and schedules.
Executor: Manages task execution across available resources.
Task Instance: A specific run of a task for a particular DAG run.
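To make these concepts concrete, here is a minimal, Airflow-free sketch in plain Python (the task names are hypothetical) of what a DAG of tasks with dependencies boils down to: the scheduler's core job is to execute tasks in an order that respects every dependency edge.

```python
from graphlib import TopologicalSorter

# A toy "DAG": each task maps to the set of tasks it depends on.
# (hypothetical names; real Airflow DAGs are built with operators)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_task(name):
    """Stand-in for an operator's execute(): one unit of work."""
    return f"ran {name}"

# Resolve the graph into a valid execution order, then "run" each task.
order = list(TopologicalSorter(dag).static_order())
results = [run_task(t) for t in order]
print(order)  # ['extract', 'transform', 'load', 'notify']
```

Everything Airflow adds on top of this picture — retries, schedules, distributed executors, a UI — is machinery around that same idea of ordered task execution.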

Setting Up Airflow

Installation

Create a Docker Compose File

version: '3'
services:
  airflow:
    image: apache/airflow:2.6.0
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: >
      bash -c "
      airflow db init &&
      airflow users create -u admin -p admin -r Admin -e admin@example.com -f Admin -l User &&
      airflow webserver &
      airflow scheduler
      "
Create the Necessary Directories

mkdir -p dags logs plugins

Start Airflow

docker-compose up -d

Once started, the Airflow UI is accessible at http://localhost:8080 with login credentials admin/admin.

Writing Your First DAG

Creating the DAG File

In the dags folder, create a new Python file called simple_dag.py:

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

def print_hello():
    print("Hello, Seemab!")

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'simple_dag',
    default_args=default_args,
    description='A simple DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    # Task 1: Python task that prints a greeting
    hello_task = PythonOperator(
        task_id='hello_task',
        python_callable=print_hello,
    )

    # Task 2: Bash command to display the current date
    date_task = BashOperator(
        task_id='date_task',
        bash_command='date',
    )

    # Set dependencies
    hello_task >> date_task

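One scheduling subtlety worth knowing: Airflow triggers a DAG run at the end of its schedule interval, so the run whose logical date equals start_date actually executes one interval later. A small plain-Python sketch of that arithmetic, assuming the daily schedule used above:

```python
from datetime import datetime, timedelta

# Values matching the DAG definition above
start_date = datetime(2023, 1, 1)
interval = timedelta(days=1)

# The run covering [logical_date, logical_date + interval) is only
# triggered once that interval has fully elapsed.
logical_date = start_date
actual_trigger_time = logical_date + interval

print(logical_date)         # 2023-01-01 00:00:00
print(actual_trigger_time)  # 2023-01-02 00:00:00
```

This is why a freshly deployed daily DAG with yesterday's start_date runs once immediately, while one dated today waits until tomorrow.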
Running the DAG

Restart the Airflow container if needed so the scheduler picks up new DAGs:

docker-compose restart airflow

Extending the DAG with More Tasks (using email notification)

Add Email Notification

from airflow.operators.email import EmailOperator

# Task 3: Email notification (define inside the DAG context; update with valid email details)
email_task = EmailOperator(
    task_id='email_task',
    to='you@example.com',
    subject='Airflow Task Complete',
    html_content='<p>Your task has completed successfully.</p>',
)

# Update dependencies
hello_task >> date_task >> email_task

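EmailOperator only works once Airflow knows how to reach an SMTP server. One way to provide that (a sketch — all values below are placeholders to replace with your real SMTP provider's details) is to add SMTP settings to the environment section of the docker-compose file:

```yaml
environment:
  # Placeholder values - substitute your SMTP provider's settings
  - AIRFLOW__SMTP__SMTP_HOST=smtp.example.com
  - AIRFLOW__SMTP__SMTP_PORT=587
  - AIRFLOW__SMTP__SMTP_STARTTLS=True
  - AIRFLOW__SMTP__SMTP_USER=you@example.com
  - AIRFLOW__SMTP__SMTP_PASSWORD=your-password
  - AIRFLOW__SMTP__SMTP_MAIL_FROM=you@example.com
```

Without this configuration the email task will fail at runtime even though the DAG itself parses and schedules normally.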

Conclusion

You’ve created your first DAG, experimented with dependencies, and extended it with Python, Bash, and email tasks. Airflow offers much more, including integrations with cloud platforms, data transformation tools, and database systems.