Getting Started with DBND Tracking
Use DBND to track a wide range of pipeline metadata, without ever changing the way you run your pipelines.
DBND tracking allows you to track pipelines and associated data, both holistically and atomically. With seamless integrations, it is easy to track and create alerts for metadata, pipeline performance, query resource usage, and custom metrics.
This Python quickstart walks you through some basic Databand capabilities, using a Python script as an example.
Pipeline metadata describes a comprehensive range of metadata used for tracking and monitoring, uniquely relevant to active data processes. In other words, pipeline metadata includes any system, application, graph process, or data level information that’s broadly relevant to the normal functioning of your data pipelines. This includes:
- Job Runtime Information
- Application Logs
- Task Function Statuses
- Data Quality Metrics
- Input/Output
- Data Lineage
DBND tracks this information and places it in the context of your pipeline and task definitions. For any given pipeline, you can see where issues originate; across the whole system, you can see which pipelines are causing problems.
Methods of Use
Once tracking is enabled, you can use functions such as log_dataset_op to track Pandas or Spark dataframes and log_metric to track custom metrics or key performance indicators in your tasks. With these tracking methods, you can view histograms, statistics, previews, and more directly in your CLI or, if you are using Apache Airflow, in your Airflow logs.
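As a rough illustration of the custom-metric logging described above, the sketch below computes a few data quality values and reports each one through log_metric. It assumes that the dbnd package exposes log_metric(key, value); the summarize_orders function, the metric names, and the print-based fallback are hypothetical and exist only so the example runs without DBND installed. log_dataset_op is not shown here.

```python
# Sketch only: assumes the dbnd package exposes log_metric(key, value).
# The fallback below is a hypothetical stand-in so the example still runs
# when dbnd is not installed; it is not part of DBND.
try:
    from dbnd import log_metric
except ImportError:
    def log_metric(key, value):
        print(f"metric {key}={value}")


def summarize_orders(amounts, report=log_metric):
    """Compute simple data quality metrics and report each one.

    `amounts` is a list of numbers, where None marks a missing value.
    `report` is any callable taking (key, value); it defaults to log_metric.
    """
    metrics = {
        "row_count": len(amounts),
        "null_count": sum(1 for a in amounts if a is None),
        "total_amount": sum(a for a in amounts if a is not None),
    }
    for key, value in metrics.items():
        report(key, value)
    return metrics


if __name__ == "__main__":
    summarize_orders([19.5, None, 5.0, 12.25])
```

Passing the reporting function as a parameter keeps the metric logic independent of the tracking backend, so the same function can be exercised in a unit test with a stub in place of log_metric.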
Tracking plugins extend the native tracking features of orchestrators such as Apache Airflow and Azkaban, data lake providers such as Snowflake and Redshift, and more. For a full list of plugins, visit Installing DBND Plugins.
While these logging methods provide a more atomic approach to tracking pipelines, metadata, and data, you can also integrate tracking holistically to minimize code overhead.
Current Integrations
The current list of integrations available for tracking with Databand includes:
Supported Languages
Tracking Spark
- AWS EMR
- GCP Dataproc
- Local Spark
- Databricks
- Qubole
- Any Spark server via Apache Livy
Tracking Databases
Tracking Airflow
- Integrating with Standard Apache Airflow Cluster
- Integrating Amazon Managed Workflows Airflow
- Integrating Google Cloud Composer
- Integrating Astronomer
Tracking Other Orchestrators
Tracking Other Trackers