Databand provides monitoring, alerting, and analytics functionality that helps you track the health and reliability of your Airflow DAGs.
Databand helps you to:
- Proactively identify unusually long-running processes
- View out-of-the-box management dashboards
- Run comparisons to identify the root cause of issues
Databand allows you to monitor multiple Airflow instances, providing a centralized tracking system for company-wide DAGs.
The Airflow integration works by tracking data from the Airflow database and syncing Airflow data with metadata observed from tasks and other system levels. You can check what exactly is tracked, and how it can be configured, in the Data Collection Cheat Sheet.
To set up the integration, you will need the following information about your Airflow environment:
- Airflow UI URL
- Location of Airflow
To find this information on managed Airflow environments, refer to our reference guides in the table below:
Amazon Managed Workflows (MWAA)
Amazon Managed Workflows is a managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.
Google Cloud Composer
Cloud Composer is a managed Apache Airflow service that helps you create, schedule, monitor and manage workflows.
To fully integrate Databand with your Airflow environment:
- Make sure you have enabled communication between Apache Airflow and Databand (from Apache Airflow to Databand for DAG mode)
- Install `dbnd-airflow-auto-tracking` on your Airflow cluster (scheduler and workers). NOTE: Installing new Python packages on managed Airflow environments will trigger an automatic restart of the Airflow scheduler.
- Create the `databand_airflow_monitor` DAG in Airflow. Create a new file `databand_airflow_monitor.py` with the following DAG definition and add it to your project DAGs:

```python
from airflow_monitor.monitor_as_dag import get_monitor_dag

# This DAG is used by Databand to monitor your Airflow installation.
dag = get_monitor_dag()
```
- Deploy your new DAG and enable it in Airflow UI.
- In the Databand web application (https://yourcompanyname.databand.ai/), click your user profile picture and open the 'Settings' page.
- In 'Settings', go to the 'Airflow Syncers' page. This page shows your active Airflow integrations.
- Click the 'Add' button to create an integration with a new Airflow instance.
- Configure the Syncer with the information about your Airflow environment.
- Airflow Mode - Select your environment (Amazon MWAA, Google Composer, OnPrem).
If you are using on-premises Airflow, and your Airflow main page URL ends with:
- Monitor as DAG - If you are in DAG mode, make sure you click the toggle to enable this option.
- Airflow URL - Airflow webserver URL
- Syncer Name - A user-provided label to identify the Airflow instance
- External URL - The external URL if you are in on-premises mode
- DAG IDs to Sync - Specific DAG IDs to sync from your Airflow environment. Leaving this field empty will sync all DAGs to Databand.
- Include source code - Specifies whether or not to send source code of the DAGs to Databand.
- Include logs collection - Specifies whether or not to send logs from DAGs to Databand.
- Bytes to collect from the head of the log - The amount of data, in KB, to collect from the head of each DAG run log. This field appears only when 'Include logs collection' is enabled. The default value is 8 KB, and 8096 KB is the maximum allowed.
- Bytes to collect from the tail of the log - The amount of data, in KB, to collect from the end of each DAG run log. This field appears only when 'Include logs collection' is enabled. The default value is 8 KB, and 8096 KB is the maximum allowed.
- Environment - Internal name of your Airflow environment
- DagRun Page Size - Number of DAG runs to sync every 10 seconds
- DagRun Start Time Window - Number of days of historical data to pull from your Airflow DB
If your mode is Amazon MWAA, Google Composer, or OnPrem with Monitor as DAG enabled, after clicking 'Continue', you will see a dialog window with a ready-to-use JSON file and instructions on how to use it to enable communication from Airflow to Databand. This process is also explained in a section below.
At this step, you can configure automatic alerts on your pipelines. A standalone alert definition will be created for each pipeline in the Airflow instance, and you will be able to edit it later (subject to change in a future release).
Automatic alerts are created with severity HIGH.
The following automatic alerts are supported:
- State alerts - an alert is fired when a pipeline run fails.
- Run duration alerts - an alert is fired when the run duration is too long or too short. By default, this determination is made by looking at the previous 10 runs and applying an acceptable range for what is considered normal. You can edit these alerts to change both the number of previous runs to consider in the calculation and the sensitivity of what is considered an anomaly.
- Schema change alerts - an alert is fired when the schema of a dataset has changed from the previous run, either through the addition or removal of one or more columns.
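The run-duration check described above can be pictured roughly as follows. The windowed mean/standard-deviation rule, the function name, and the `sensitivity` knob are illustrative assumptions for intuition only, not Databand's actual anomaly model:

```python
import statistics

def duration_out_of_range(history, new_duration, sensitivity=3.0):
    """Flag a run whose duration falls outside
    mean +/- sensitivity * stdev of the previous runs."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    low, high = mean - sensitivity * stdev, mean + sensitivity * stdev
    return not (low <= new_duration <= high)

# Ten previous runs of roughly 60 s each; a 300 s run is anomalous.
previous = [58, 61, 59, 62, 60, 57, 63, 60, 59, 61]
duration_out_of_range(previous, 300)   # → True
duration_out_of_range(previous, 60)    # → False
```

Editing the alert corresponds to changing the size of `history` (how many previous runs are considered) and `sensitivity` (how wide the acceptable range is).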
You will be prompted to configure alert receivers if none are configured yet. Alert receivers are destinations where alerts are sent. This step is optional; if no receiver is configured, alerts will appear only in the Databand UI.
To learn more about how the Slack receiver is configured, check this guide.
If your mode is Amazon MWAA, Google Composer, or On-Prem with Monitor as DAG enabled, after clicking 'Continue', you will see a dialog window with a ready-to-use JSON file and instructions on how to use it to enable communication from Airflow to Databand. Complete the steps in Setting Up Airflow Integration, and then click "Test Connection" to confirm the connection was set up properly.
To configure the Airflow monitor and SDK, create an Airflow connection with the ID `dbnd_config` and use its JSON 'Extra' field to provide configuration parameters as shown below.
The JSON contains the following fields:
Replace the values in <...> with the following information:
- <databand_url> - your environment URL, e.g. https://yourcompanyname.databand.ai
- <access_token> - a Databand access token (see Create Access Token)
- <airflow_syncer_name> - name of Airflow Syncer created above
- <dag_ids> - list of specific DAG IDs that you want to track
- <number_of_airflow_log_head_bytes> - number of bytes to collect from DAG log heads
- <number_of_airflow_log_tail_bytes> - number of bytes to collect from DAG log tails
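As an illustrative sketch only, the JSON in the connection's Extra field might look like the following, using the placeholders listed above. The exact section names follow `databand-core.cfg` and may differ between Databand versions, so prefer the ready-made JSON generated by the syncer dialog:

```json
{
  "core": {
    "databand_url": "<databand_url>",
    "databand_access_token": "<access_token>"
  },
  "airflow_monitor": {
    "syncer_name": "<airflow_syncer_name>",
    "dag_ids": "<dag_ids>",
    "number_of_airflow_log_head_bytes": "<number_of_airflow_log_head_bytes>",
    "number_of_airflow_log_tail_bytes": "<number_of_airflow_log_tail_bytes>"
  }
}
```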
The JSON object contains the minimum parameters required to establish communication between Airflow and Databand.
In general, the JSON keys map to the parameters found in databand-core.cfg. See our Configuration documentation on how you can extend this object with additional parameters.
However, you should NOT modify the values of the following fields manually: dag_ids, number_of_airflow_log_head_bytes, number_of_airflow_log_tail_bytes, track_source_code.
These fields get their values automatically from the values you provide in the syncer dialog.
Even when you edit an existing syncer, the Databand monitor will sync them to dbnd_config in Airflow automatically.
If you are using Airflow version 2.1.0, 2.1.1, or 2.1.2, verify that the `apache-airflow-providers-http` package is installed, or consider upgrading your Airflow.
When running the monitor DAG, it automatically limits the amount of memory it can consume. The default limit is 8 GB. If the monitor consumes more memory than the guard allows for any reason, it will stop itself automatically.

To change this limit, pass the `guard_memory` parameter to the `get_monitor_dag` function and set it to the maximum number of bytes the monitor can consume. For example, the following limits memory consumption to 5 GB:

```python
dag = get_monitor_dag(guard_memory=5 * 1024 * 1024 * 1024)
```
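For intuition, a memory guard of this kind can be sketched with the standard-library `resource` module. This is a hypothetical illustration of the mechanism, not Databand's actual implementation, and the KB-vs-bytes unit of `ru_maxrss` is platform-dependent:

```python
import resource

def make_memory_guard(max_bytes):
    """Return a callable that raises MemoryError once the process's
    peak resident set size exceeds max_bytes."""
    def check():
        # On Linux, ru_maxrss is reported in kilobytes (bytes on macOS).
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
        if rss > max_bytes:
            raise MemoryError(f"monitor exceeded {max_bytes} bytes (rss={rss})")
        return rss
    return check

# Same 5 GB budget as the guard_memory example; check() is a no-op
# while consumption stays under the limit.
guard = make_memory_guard(5 * 1024 * 1024 * 1024)
guard()
```

A real guard would call such a check periodically from the monitor loop and shut the process down cleanly when it trips.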