Tracking Pipeline Metadata

Use DBND to track a wide range of pipeline metadata, without ever changing the way you run your pipelines.

DBND tracking allows you to track pipelines and associated data, both holistically and atomically. With seamless integrations, it is easy to track and create alerts for metadata, pipeline performance, query resource usage, and custom metrics.

Enabling Tracking

After installing DBND, enable metadata tracking by exporting the following environment variable:

export DBND__TRACKING=True
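If exporting the variable in a shell is not convenient (for example, in a notebook), the same setting can be sketched from Python. This assumes DBND reads DBND__TRACKING from the process environment, so the assignment is placed before any DBND import. That placement is a cautious assumption, not something this page states.

```python
import os

# Equivalent of `export DBND__TRACKING=True`, but scoped to this
# process and any child processes it starts.
# Assumption: set it before importing or calling DBND.
os.environ["DBND__TRACKING"] = "True"
```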

Methods of Use

Once tracking is enabled, you can use functions like log_dataframe to track Pandas or Spark dataframes, and log_metric to track custom metrics or key performance indicators in your tasks. With these tracking methods, histograms, statistics, previews, and more can be observed directly in your CLI or, if you are using Apache Airflow, in your Airflow logs.
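As a minimal sketch of the two calls named above, the following task logs a Pandas dataframe and one custom metric. The function clean_orders, the sample data, and the names "cleaned_orders" and "rows_dropped" are hypothetical. The call shape (a name followed by a value) is assumed, and the fallback definitions exist only so the sketch runs where DBND is not installed.

```python
import pandas as pd

try:
    from dbnd import log_dataframe, log_metric
except ImportError:
    # Fallback so the sketch still runs without DBND installed.
    # These stand-ins do nothing and are NOT part of DBND.
    def log_dataframe(key, value):
        pass

    def log_metric(key, value):
        pass


def clean_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and report the result (hypothetical task)."""
    cleaned = orders.dropna()
    # Track the resulting dataframe under a name of our choosing.
    log_dataframe("cleaned_orders", cleaned)
    # Track a custom metric: how many rows the cleaning step removed.
    log_metric("rows_dropped", len(orders) - len(cleaned))
    return cleaned


# Hypothetical sample data: the second row has a missing amount.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.5, None, 20.0]})
cleaned = clean_orders(orders)
```

The task returns its result exactly as it would without tracking; the logging calls only report on the values passing through.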

Tracking plugins extend the native tracking features of orchestrators such as Airflow and Azkaban, data lake providers such as Snowflake and Redshift, and more. For a full list of plugins, visit Installing DBND Plugins.

Using DBND logging functions to log dataframes, custom metrics, and resource usages.

While these logging methods provide a more atomic approach to tracking pipelines, metadata, and data, you can also integrate tracking holistically to minimize code overhead.

DBND can also track tasks and functions in detail through track_functions and track_module_functions. With these, you can track the inputs and outputs of your functions, modules, and scripts.

track_functions in an ETL Pipeline

For example, in a pipeline with three tasks (Extract, Transform, and Load), you can use track_functions to select the specific functions you wish to track.
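A minimal sketch of that selection follows, assuming track_functions accepts the function objects to track as positional arguments. The extract, transform, and load bodies and the sample rows are hypothetical, and the fallback definition exists only so the sketch runs where DBND is not installed.

```python
try:
    from dbnd import track_functions
except ImportError:
    # Fallback so the sketch still runs without DBND installed.
    # This stand-in does nothing and is NOT part of DBND.
    def track_functions(*functions):
        pass


def extract():
    """Return raw rows (hypothetical source data)."""
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.0"}]


def transform(rows):
    """Convert the amount field from text to a number."""
    return [{"id": row["id"], "amount": float(row["amount"])} for row in rows]


def load(rows):
    """Pretend to write the rows somewhere; return how many were written."""
    return len(rows)


# Select only the functions we want tracked; load is left untracked.
track_functions(extract, transform)


def run_pipeline():
    return load(transform(extract()))
```

The function bodies and the way the pipeline is invoked stay unchanged. When every function in a module should be tracked, track_module_functions is the counterpart mentioned above; it is assumed here to take the module object rather than individual functions.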

A Note on Pipeline Metadata

Pipeline metadata is the range of metadata used for tracking and monitoring active data processes. In other words, it includes any system, application, graph process, or data-level information that is broadly relevant to the normal functioning of your data pipelines. This includes:

  • Job Runtime Information
  • Application Logs
  • Task Function Statuses
  • Performance Metrics
  • Data Quality Metrics
  • Input/Output
  • Intermediate Results
  • Data Lineage
  • System Resources

DBND tracks this information and places it in the context of your pipeline and task definitions. For any given pipeline, you can see where issues are coming from; from a system-wide perspective, you can see which pipelines are the source of problems.
