Databand Concept Overview

Databand is an observability solution for data engineers and DataOps teams.

Most data engineers use a variety of tools to run their pipelines (Airflow, Spark, Snowflake, BigQuery, etc.). When working across all these systems, you need deep visibility across DAGs, data flows, and levels of infrastructure to make sure pipelines are reliable, to detect issues that lead to SLA (Service Level Agreement) misses, and to identify problems in data quality. A DAG (Directed Acyclic Graph) is a collection of all the tasks that you want to run, organized in a way that reflects their relationships and dependencies.

Databand tracks your pipeline metadata, providing granular visibility into data processes so that it is easier to identify issues, enhance team productivity, and optimize performance & costs. The Databand application includes an intuitive web-based user interface, monitoring & alerting functionality, as well as tools for debugging & lifecycle management.

Use Cases

  • Platform Operations: monitor pipelines for early warning signals of failures or missed SLAs
  • Data Quality: track quality statistics and changes in data sets and pipeline outputs
  • Resource Control: keep tabs on resource consumption and spending levels

Benefits

  • Connect and compare metrics, logs, and traces from all data processes
  • Streamline analysis of your pipelines and the health of your system
  • Access alerts and notifications on issues that require immediate attention
  • Compare trends on data and code changes to identify the root cause of problems fast
  • Focus on optimal code configurations and scheduled data operations

System Components

Databand includes the following major components:

  • SDK - a Python library and CLI that engineers use to create pipelines and collect metadata from runs. You can install it with pip in an environment of your choice.

  • Metadata Store - a database that stores the pipeline definitions and metadata the system needs to execute, version, and reproduce pipelines. These definitions include paths to data inputs and outputs, code logic (tasks), environment configurations, and other artifacts necessary for execution.

  • Application - a web UI that provides monitoring and observability on runs and projects.

Key Concepts

Logging methods

The logging methods of DBND, the Databand SDK, are called from your Python or Java code to track application and data metrics, such as dataset statistics, column profiles, or custom performance KPIs.
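
As a rough illustration of the kind of data metrics such logging calls capture, the sketch below builds a simple column profile with the Python standard library. It does not use the DBND API: `record_metric`, `profile_column`, and the in-memory `METRICS` store are hypothetical names invented for this example.

```python
import statistics

# Hypothetical in-memory metric store; a real tracker would send
# these values to a metadata store instead.
METRICS = {}


def record_metric(key, value):
    """Record a single named metric (illustrative stand-in, not a DBND call)."""
    METRICS[key] = value


def profile_column(name, values):
    """Record basic statistics for one column, treating None as missing."""
    present = [v for v in values if v is not None]
    record_metric(f"{name}.count", len(values))
    record_metric(f"{name}.null_count", len(values) - len(present))
    if present:
        record_metric(f"{name}.mean", statistics.mean(present))
        record_metric(f"{name}.min", min(present))
        record_metric(f"{name}.max", max(present))


# Example column with one missing value.
profile_column("price", [10.0, None, 14.0, 12.0])
```

Comparing such a profile from one execution to the next is one way to notice a sudden change in null counts or value ranges.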

Tasks

In DBND, a task is a function annotated with a DBND decorator (@task). The decorator is what allows DBND to track the task. A task can also be run with the dbnd run command in the CLI or from Python code.
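
To show the general idea of tracking a function through a decorator, here is a minimal stdlib-only sketch. It is not the DBND implementation of @task: the `tracked_task` decorator and the `TASK_LOG` list are hypothetical, and a real tracker would report to a metadata store rather than a Python list.

```python
import functools
import time

# Hypothetical in-memory record of task executions.
TASK_LOG = []


def tracked_task(func):
    """Wrap a function so each call records its name, status, and duration."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        status = "failed"
        try:
            result = func(*args, **kwargs)
            status = "success"
            return result
        finally:
            # Runs on success and on failure, so every call is recorded.
            TASK_LOG.append(
                {
                    "task": func.__name__,
                    "status": status,
                    "seconds": time.perf_counter() - started,
                }
            )

    return wrapper


@tracked_task
def clean(rows):
    """Drop rows with a missing value."""
    return [r for r in rows if r is not None]


cleaned = clean([1, None, 3])
```

The function body stays ordinary Python; only the decorator adds the bookkeeping, which is why decorating existing code is a low-friction way to make it observable.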

Pipelines

A pipeline is a sequence of tasks wired together. Machine learning teams usually use pipelines for running data ingestion, aggregations, and model training.
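
The wiring of tasks into a pipeline can be sketched with the standard library alone. The example below assumes three hypothetical tasks (`ingest`, `aggregate`, `report`) and a hand-written dependency map; it illustrates dependency-ordered execution in general, not how DBND builds or runs pipelines. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def ingest():
    """Pretend to load raw rows."""
    return [3, 1, 2]


def aggregate(ingest):
    """Sum the ingested rows."""
    return sum(ingest)


def report(ingest, aggregate):
    """Summarize the upstream outputs."""
    return f"{len(ingest)} rows, total {aggregate}"


# Each task maps to the set of upstream tasks whose outputs it consumes.
DEPENDENCIES = {
    "ingest": set(),
    "aggregate": {"ingest"},
    "report": {"ingest", "aggregate"},
}
TASKS = {"ingest": ingest, "aggregate": aggregate, "report": report}


def run_pipeline():
    """Execute tasks in dependency order, passing upstream outputs by name."""
    outputs = {}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        inputs = {dep: outputs[dep] for dep in DEPENDENCIES[name]}
        outputs[name] = TASKS[name](**inputs)
    return outputs


results = run_pipeline()
```

Because the dependency map has no cycles, a valid execution order always exists, which is the property the term DAG refers to.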

DAGs

DAG is a synonym for pipeline, and the two terms are often used interchangeably. In the DBND documentation, the term DAG denotes an Airflow pipeline.

Runs

A run is an execution of a pipeline or a task.

Parameters

Parameters are the defined, changeable properties of a task or pipeline, such as data inputs and model weights.
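
One way to picture how parameters relate to runs is a record that snapshots the parameter values used for each execution. The sketch below is an assumption-laden illustration, not DBND's data model: `TrainParams`, `start_run`, the field names, and the placeholder path are all hypothetical.

```python
import uuid
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TrainParams:
    """Hypothetical parameters of a training task."""

    input_path: str = "data/train.csv"  # placeholder path, not a real dataset
    learning_rate: float = 0.1
    epochs: int = 5


def start_run(params):
    """Return a run record that snapshots the parameter values used."""
    return {"run_id": uuid.uuid4().hex, "params": asdict(params)}


# Two runs of the same task that differ in a single parameter.
default_run = start_run(TrainParams())
tuned_run = start_run(replace(TrainParams(), learning_rate=0.01))
```

Storing the parameter snapshot alongside each run is what makes it possible to compare two runs and see which change in inputs or settings explains a change in results.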

Environments

Environments define the location where pipelines are executed (for example, local machine, Docker, Kubernetes, Spark cluster) and the metadata store where artifacts are persisted (i.e., input source, system folders, temporary folders, output destination).

