Databand Overview

Databand is an observability solution for data engineers and DataOps teams.

Most data engineers use a variety of tools to run their pipelines (Airflow, Python, Spark, Snowflake, BigQuery, etc.). When working across all these systems, you need deep visibility across DAGs, data flows, and levels of infrastructure to make sure pipelines are reliable, to detect issues that lead to SLA (Service Level Agreement) misses, and to identify problems in data quality. A DAG (Directed Acyclic Graph) is a collection of all the tasks that you want to run, organized in a way that reflects their relationships and dependencies.

Databand can track, alert on, and help you investigate problems in data quality, integrity, and access. It provides this visibility by collecting usage and profiling information about your datasets, and by letting you define custom data quality metrics that are sent to Databand's tracking system. In addition, Databand tracks your pipeline metadata, providing granular visibility into data processes so that it is easier to identify issues, enhance team productivity, and optimize performance and costs. The Databand application includes a web-based user interface, monitoring and alerting functionality, and tools for debugging and lifecycle management.

Use Cases

  • Platform Operations: monitor pipelines for early warning signals of failures or missed SLAs
  • Data Quality: track quality statistics and changes in data sets and pipeline outputs

Benefits

  • Connect and compare metrics, logs, and traces from all data processes
  • Streamline analysis of your pipelines and overall system health
  • Access alerts and notifications on issues that require immediate attention
  • Compare trends on data and code changes to identify the root cause of problems fast
  • Focus on optimal code configurations and scheduled data operations

Automatically Extracted Metadata

When you integrate Databand with your pipelines, Databand can automatically gather metadata about your data sets in use and store that info for analysis. Examples include:

  • Data schema info and previews from DataFrames, SQL queries, and file types like CSV or Parquet
  • Data distributions, profiles, and histograms
  • User access information (who is running processes on a given data set or file)
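
To make the kinds of metadata listed above more concrete, here is a minimal sketch in plain Python of what a schema and profile for a CSV data set might look like. It is an illustration only, not Databand's implementation: the function names `profile_csv` and `histogram`, the sample data, and the shape of the resulting dictionary are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def histogram(numbers, bins):
    """Count values into equal-width bins between the min and max."""
    low, high = min(numbers), max(numbers)
    if low == high:
        return [len(numbers)]  # all values identical: a single bin
    width = (high - low) / bins
    counts = [0] * bins
    for x in numbers:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts


def profile_csv(text, bins=4):
    """Hypothetical profiler: row count, per-column type, nulls, distribution."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        present = [r[name] for r in rows if r[name] not in (None, "")]
        column = {"null_count": len(rows) - len(present)}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = None  # at least one value is not numeric
        if numbers:
            column["type"] = "numeric"
            column["histogram"] = histogram(numbers, bins)
        else:
            column["type"] = "string"
            column["top_values"] = Counter(present).most_common(3)
        profile["columns"][name] = column
    return profile


# Made-up sample data, for illustration only.
sample = "id,amount,country\n1,10.0,US\n2,,DE\n3,30.0,US\n4,40.0,\n"
print(profile_csv(sample, bins=3))
```

A profile like this, captured on every run, is what makes it possible to compare data sets over time and notice changes such as a rising null count or a shifted distribution.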

System Components

The application contains services for storing, analyzing, visualizing, and alerting on pipeline metadata. Pipeline metadata includes various run information, like job durations and errors, and data quality metrics, like data counts and completeness.
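
As a rough illustration of the pipeline metadata described above (run duration, errors, data counts, and completeness), the following plain-Python sketch records one metadata record per run. The names `tracked_run` and `completeness`, the record fields, and the in-memory list used as a store are hypothetical and are not part of the Databand SDK.

```python
import time
from contextlib import contextmanager


@contextmanager
def tracked_run(job_name, store):
    """Record state, error, and duration of one run into `store` (a list)."""
    record = {"job": job_name, "state": "success", "error": None}
    start = time.perf_counter()
    try:
        yield record  # the caller can attach data quality metrics here
    except Exception as exc:
        record["state"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_seconds"] = time.perf_counter() - start
        store.append(record)


def completeness(rows, column):
    """Fraction of rows with a non-empty value in `column`."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


# Made-up data and job name, for illustration only.
runs = []
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
with tracked_run("load_users", runs) as run:
    run["row_count"] = len(rows)
    run["email_completeness"] = completeness(rows, "email")

print(runs[0])
```

Because the record is written in the `finally` clause, a failed run is stored with its error and duration as well, which is the information an alert on failures or missed SLAs would need.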

The application stack contains the following components:

  • SDK - a Python library and CLI that engineers use to create pipelines and collect metadata from runs. You can install it with pip in an environment of your choice.

  • Metadata Store - a database that stores pipeline definitions and metadata that enables the system to execute, version, and reproduce pipelines. These definitions include paths to data inputs and outputs, code logic (tasks), environment configurations, and other artifacts necessary for execution.

  • Application - a web UI that provides monitoring and observability on runs and projects. It includes an alert engine and an anomaly detection system.

Databand's application interfaces with the open-source library (DBND). DBND reports pipeline
metadata to the application for storage, visualization, analysis, and alerting.

This section provides overviews of platform functionality, including monitoring trends and errors.

