
Tracking Apache Airflow

A guide to enabling tracking for Apache Airflow DAGs

Databand provides various monitoring, alerting, and analytical functionality that helps you monitor the health and reliability of your Airflow DAGs.

Databand helps you to:

  • Proactively identify unusually long-running processes
  • View out-of-the-box management dashboards
  • Run comparisons to identify the root cause of issues

Databand allows you to monitor multiple Airflow instances, providing a centralized tracking system for company-wide DAGs.

The Airflow integration works by tracking data from the Airflow database and syncing it with metadata observed from tasks and other system levels. For details on exactly what is tracked and how it can be configured, see the Data Collection Cheat Sheet.

Prerequisites

  • Airflow UI URL
  • Location of Airflow /dags directory

To find this information on managed Airflow environments, refer to the reference guides in the table below:

  • Amazon Managed Workflows (MWAA) - Amazon Managed Workflows is a managed Apache Airflow service that helps you create, schedule, monitor, and manage workflows. In this use case, you need an existing MWAA instance. Reference: Integrating Amazon Managed Workflows Airflow
  • Google Cloud Composer (GCC) - Cloud Composer is a managed Apache Airflow service that helps you create, schedule, monitor, and manage workflows. In this use case, you need an existing Cloud Composer environment. Reference: Integrating Google Cloud Composer

Setting up Airflow Integration

To fully integrate Databand with your Airflow environment:

  1. Make sure you have enabled communication between Apache Airflow and Databand (from Apache Airflow to Databand for DAG mode).
  2. Install dbnd-airflow-auto-tracking on your Airflow cluster (scheduler and workers); a sample install command follows this list. NOTE: Installing new Python packages on managed Airflow environments will trigger an automatic restart of the Airflow Scheduler.
  3. Create the databand_airflow_monitor DAG in Airflow. Create a new file databand_airflow_monitor.py with the following DAG definition and add it to your project's DAGs folder:
from airflow_monitor.monitor_as_dag import get_monitor_dag
# This DAG is used by Databand to monitor your Airflow installation.
dag = get_monitor_dag()
  4. Deploy your new DAG and enable it in the Airflow UI.
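
For step 2, installation is typically done with pip on every scheduler and worker node; on managed environments such as MWAA or Cloud Composer, add the package through the environment's own package-management mechanism instead:

pip install dbnd-airflow-auto-tracking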

Create Syncer in Databand

  1. In the Databand web application (https://yourcompanyname.databand.ai/), click your user profile picture and open the 'Settings' page.
  2. In 'Settings', go to the 'Airflow Syncers' page. This page shows your active Airflow integrations.
  3. Click the 'Add' button to add an integration for a new Airflow instance.
  4. Configure the Syncer with the information about your Airflow environment.

Configure Syncer

Required fields

  • Airflow Mode - Select your environment (Amazon MWAA, Google Composer, OnPrem).

    If you are using on-premises Airflow, and your Airflow main page URL ends with:

    • /home - select RBAC mode
    • /admin - select FlaskAdmin mode
  • Monitor as DAG - If you are in DAG mode, make sure you click the toggle to enable this option.

  • Airflow URL - Airflow Webserver URL

  • Syncer Name - A user-provided label that identifies the Airflow instance

Optional fields

  • External URL - The externally reachable Airflow URL, used when you are in on-premises mode
  • DAG IDs to Sync - Specific DAG IDs to sync from your Airflow environment. Leaving this field empty syncs all DAGs to Databand.
  • Include source code - Specifies whether to send the DAGs' source code to Databand.
  • Include logs collection - Specifies whether to send logs from DAGs to Databand.
  • Bytes to collect from the head of log - The number of KB to collect from the head of each DAG's log. This field is present only when 'Include logs collection' is enabled. The default value is 8 KB, and 8096 KB is the maximum allowed.
  • Bytes to collect from the tail of log - The number of KB to collect from the end of each DAG's log. This field is present only when 'Include logs collection' is enabled. The default value is 8 KB, and 8096 KB is the maximum allowed.

Advanced settings

  • Environment - Internal name of your Airflow environment
  • DagRun Page Size - The number of DAG runs to sync from the Airflow database every 10 seconds
  • DagRun Start Time Window - The number of days of historical data to pull from your Airflow database

If your mode is Amazon MWAA, Google Composer, or OnPrem with 'Monitor as DAG' enabled, clicking 'Continue' opens a dialog window with a ready-to-use JSON file and instructions on how to use it to enable communication from Airflow to Databand. This process is also explained in the Connect and Validate the Syncer section below.

Set Automatic Alerts

At this step, you can configure automatic alerts on your pipelines. A standalone alert definition is created for each pipeline in an Airflow instance, and you can edit it later (this behavior is subject to change in the future).

Automatic alerts are created with severity HIGH.

The following automatic alerts are supported:

  1. State alerts - an alert is fired when a pipeline run fails.
  2. Run duration alerts - an alert is fired when the run duration is too long or too short. By default, this determination is made by looking at the previous 10 runs and deriving an acceptable range for what is considered normal (see the sketch after this list). You can edit these alerts to change both the number of previous runs considered in the calculation and the sensitivity of what counts as an anomaly.
  3. Schema change alerts - an alert is fired when the schema of a dataset has changed from the previous run, either through the addition or removal of one or more columns.
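
For intuition only, here is a minimal sketch of how an acceptable duration range could be derived from the previous 10 runs. Databand's actual anomaly model is not documented here; the mean-and-standard-deviation rule and the sensitivity value below are illustrative assumptions:

from statistics import mean, stdev

def acceptable_range(previous_durations, sensitivity=3.0):
    # Derive (low, high) duration bounds from past runs; a larger
    # sensitivity widens the band, i.e. fewer runs count as anomalies.
    mu = mean(previous_durations)
    sigma = stdev(previous_durations)
    return max(0.0, mu - sensitivity * sigma), mu + sensitivity * sigma

# Durations (in seconds) of the previous 10 runs of a pipeline.
history = [310, 295, 330, 305, 290, 315, 300, 320, 298, 312]
low, high = acceptable_range(history)
current_run = 640
if not low <= current_run <= high:
    print(f"Run duration {current_run}s is outside [{low:.0f}s, {high:.0f}s]")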

Configure Slack Receiver

You will be prompted to configure alert receivers if none are configured yet. Alert receivers are the destinations where alerts are sent. This step is optional; if no receiver is configured, alerts will appear only in the Databand UI.

To learn more about how the Slack receiver is configured, check this guide.

Connect and Validate the Syncer

If your mode is Amazon MWAA, Google Composer, or OnPrem with 'Monitor as DAG' enabled, clicking 'Continue' opens a dialog window with a ready-to-use JSON file and instructions on how to use it to enable communication from Airflow to Databand. Complete the steps in Setting up Airflow Integration, and then click 'Test Connection' to confirm that the connection was set up properly.

Using Airflow Connection to Configure Airflow Monitor and SDK

To configure the Airflow monitor and SDK, you can create an Airflow connection with the ID dbnd_config and use its JSON 'Extra' field to provide configuration parameters, as shown below.

The JSON contains the following fields:

Mandatory Parameters
Replace the values in <...> with the following information:

Optional Parameters

  • <dag_ids> - list of specific DAG IDs that you want to track
  • <number_of_airflow_log_head_bytes> - number of bytes to collect from DAG log heads
  • <number_of_airflow_log_tail_bytes> - number of bytes to collect from DAG log tails
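
For illustration only, the connection's JSON might be structured like the sketch below. The section names and the mandatory keys shown (a Databand URL and access token) are assumptions based on the databand-core.cfg mapping described in the note below; treat the ready-to-use JSON generated in the syncer dialog as the authoritative version:

{
  "core": {
    "databand_url": "<your_databand_url>",
    "databand_access_token": "<your_access_token>"
  },
  "airflow_monitor": {
    "dag_ids": "<dag_ids>",
    "number_of_airflow_log_head_bytes": "<number_of_airflow_log_head_bytes>",
    "number_of_airflow_log_tail_bytes": "<number_of_airflow_log_tail_bytes>"
  }
}

The connection itself can be created through the Airflow UI (Admin > Connections) or with the Airflow CLI.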

📘 dbnd_config modification

The JSON object contains the minimum parameters required to establish communication between Airflow and Databand.

In general, the JSON keys map to the parameters found in databand-core.cfg. See our Configuration documentation on how you can extend this object with additional parameters.

However, you should NOT modify the values of the following fields manually: dag_ids, number_of_airflow_log_head_bytes, number_of_airflow_log_tail_bytes, track_source_code.

These fields get their values automatically from the values you provide in the syncer dialog. Even when you edit an existing syncer, the Databand monitor will sync them to dbnd_config in Airflow automatically.

Extras

Airflow HTTP communication

If you are using Airflow version 2.1.0, 2.1.1, or 2.1.2, please verify that the apache-airflow-providers-http package is installed, or consider upgrading your Airflow.
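
If the package is missing, it can typically be installed with pip (or added through your managed environment's package list):

pip install apache-airflow-providers-http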

Databand Monitor DAG Memory Guard

The monitor DAG automatically limits the amount of memory it can consume; the default limit is 8 GB. If the monitor consumes more memory than the guard allows, for any reason, it automatically stops itself.

To change the limit, pass the guard_memory parameter to get_monitor_dag, set to the maximum number of bytes the monitor may consume. For example, the following limits memory consumption to 5 GB:

from airflow_monitor.monitor_as_dag import get_monitor_dag
dag = get_monitor_dag(guard_memory=5 * 1024 * 1024 * 1024)  # 5 GB, specified in bytes
