Tracking Apache Airflow

A guide to enabling tracking for Apache Airflow DAGs.

Databand provides monitoring, alerting, and analytical functionality that helps you track the health and reliability of your Airflow DAGs.

Databand helps you to:

  • Proactively identify unusually long-running processes
  • View out-of-the-box management dashboards
  • Run comparisons to identify the root cause of issues

Databand allows you to monitor multiple Airflow instances, providing a centralized tracking system for company-wide DAGs.

The Airflow integration works by reading data from the Airflow database and syncing it with metadata observed at the task level and other system levels.

Prerequisites

  • Airflow UI URL
  • Location of Airflow dags/ directory

To find this information for managed Airflow environments, refer to the reference guides below:

  • Amazon Managed Workflows (MWAA) - a managed Apache Airflow service that helps you create, schedule, monitor, and manage workflows. This use case assumes you have an existing MWAA instance. Reference: Integrating Amazon Managed Workflows Airflow
  • Google Cloud Composer (GCC) - a managed Apache Airflow service that helps you create, schedule, monitor, and manage workflows. This use case assumes you have an existing Cloud Composer environment. Reference: Integrating Google Cloud Composer

Setting up Airflow Integration

To fully integrate Databand with your Airflow environment:

  1. Create a Syncer in Databand
  2. Enable communication between Airflow and Databand via a dbnd_config connection
  3. Install DBND on the Airflow scheduler
  4. Upload and enable the databand_airflow_monitor DAG in Airflow

Create Syncer in Databand

  1. In the Databand web application (https://yourcompanyname.databand.ai/), click your user profile picture and open the 'Settings' page.
  2. In 'Settings', go to the 'Syncer' page. This page shows your active Airflow integrations.
  3. Click the 'Add' button to add an integration with a new Airflow instance.
  4. Configure the Syncer with the information about your Airflow environment.

Below we break the fields into 'required' and 'optional' and provide a short description of each field.

Required fields

  • Airflow mode - Select your environment (Amazon MWAA, Google Composer, On premise).
    If you are using Airflow on premise and your Airflow main page URL ends with:
    • /home - select RBAC mode
    • /admin - select flask-admin mode
    • if you are using the experimental mode, you probably already know it
  • Airflow URL - Airflow Webserver URL
  • Airflow Name - user-provided label to identify Airflow instance

Optional fields

  • Monitor as DAG - available in On premise mode; enable it to run the Databand monitor as a DAG inside your Airflow environment
  • External URL - an external URL for the Airflow instance, if applicable (On premise mode only)
  • DAG IDs to fetch from - specific DAG IDs to sync from the Airflow environment; leaving the field empty syncs ALL DAGs in the environment

Advanced settings

  • Environment - internal name of the Airflow environment
  • DagRun Page Size - the number of DAG runs to fetch in each sync iteration (every 10 seconds)
  • DagRun Start Time Window - the number of days of historical data to pull from the Airflow DB

If your mode is Amazon MWAA, Google Composer, or On premise with 'Monitor as DAG' enabled, then after clicking 'Save' you will see a dialog window with a ready-to-use JSON file and instructions on how to use it to enable communication from Airflow to Databand. This process is also explained in the next section below.

Create dbnd_config connection in Airflow

To enable communication from Airflow to Databand, create a connection in the Airflow UI with the following values:

  • Conn Id: dbnd_config
  • Conn Type: HTTP
  • Extra: see JSON object below
{
  "core": {
    "databand_url": "<databand_url>,
    "databand_access_token": "<token>"
  },
  "airflow_monitor": {
    "syncer_name": "<airflow_syncer_name>"
  }
}

Replace the values in <...> as follows: <databand_url> is the URL of your Databand environment, <token> is a Databand access token, and <airflow_syncer_name> is the name you gave your Syncer in Databand.

πŸ“˜ dbnd_config Extra field

The JSON object contains the minimum parameters required to establish communication between Airflow and Databand.

In general, the JSON keys map to the parameters found in databand-core.cfg. See our Configuration documentation for how to extend this object with additional parameters.
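
If you prefer the command line over the Airflow UI, the same connection can be created with the Airflow CLI. This is a minimal sketch assuming the Airflow 2.x CLI; replace the <...> placeholders as described above:

# Create the dbnd_config connection with the Extra JSON inline
airflow connections add dbnd_config \
    --conn-type http \
    --conn-extra '{"core": {"databand_url": "<databand_url>", "databand_access_token": "<token>"}, "airflow_monitor": {"syncer_name": "<airflow_syncer_name>"}}'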

Installing DBND on Airflow scheduler

Run the following command to install the required dbnd PyPI packages on the Airflow Scheduler:

pip install dbnd-airflow-auto-tracking dbnd-airflow-monitor[direct_db]

πŸ“˜ Installing Python packages on managed Airflow environments

See the deployment model references above for how to install packages on specific managed Airflow services.

NOTE: Installing new Python packages on managed Airflow environments will trigger an automatic restart of the Airflow Scheduler.
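
On managed services that install dependencies from a requirements file (for example, Amazon MWAA), you would add the same packages there instead of running pip directly; see the reference guides above for the exact procedure. A sketch of the relevant requirements.txt entries:

# requirements.txt (managed Airflow environment, e.g. MWAA)
dbnd-airflow-auto-tracking
dbnd-airflow-monitor[direct_db]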

Upload and enable Databand Monitor DAG

The Monitor DAG issues a dbnd CLI command every 10 seconds to retrieve execution metadata directly from the Airflow DB.

To complete the integration process:

  • Download the Monitor DAG from GitHub
  • Upload the Monitor DAG to your Airflow dags/ directory
  • Enable the databand_airflow_monitor DAG in the Airflow UI (see the sketch below)
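
As a minimal sketch, assuming a standard $AIRFLOW_HOME layout and the Airflow 2.x CLI (the local file name is an example):

# Place the downloaded Monitor DAG into the dags/ directory
cp databand_airflow_monitor.py $AIRFLOW_HOME/dags/

# Once the scheduler has parsed the file, unpause (enable) the DAG
airflow dags unpause databand_airflow_monitor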
