Dataset Logging

Logging dataset operations using Databand

Introduction

Logging your datasets through Databand captures metadata about the operations your code performs. With this metadata, you can gain key insights into your data and monitor the success or failure of your dataset operations.

Ways to Log Datasets

Data in Motion

Overview

Databand allows you to monitor your data in motion through minor changes to your existing code. In most cases, only a few additional lines of code are required to integrate our SDK. Logged dataset operations are accessible through the Affected Datasets tab within each run of your pipelines. Additionally, you can use the Datasets page to view historical operations for each of your datasets.
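
As a rough illustration of what "a few additional lines" around existing code can look like, the sketch below wraps a read in a context manager that records the dataset path, the operation type, and whether the operation succeeded. It uses only the Python standard library. `OperationRecord`, `log_operation`, and `LOGGED_OPERATIONS` are hypothetical names invented for this example and are not part of the Databand SDK; see the pages under Instructions and Examples for the SDK's actual calls.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class OperationRecord:
    """Metadata for one dataset operation (hypothetical structure)."""
    path: str
    op_type: str  # "read" or "write"
    success: bool = False
    row_count: Optional[int] = None
    error: Optional[str] = None


# In-memory stand-in for wherever operation records would be reported.
LOGGED_OPERATIONS: List[OperationRecord] = []


@contextmanager
def log_operation(path: str, op_type: str) -> Iterator[OperationRecord]:
    """Record one dataset operation, including whether it succeeded."""
    if op_type not in ("read", "write"):
        raise ValueError(f"unsupported operation type: {op_type!r}")
    record = OperationRecord(path=path, op_type=op_type)
    try:
        yield record
        record.success = True  # the wrapped block finished without raising
    except Exception as exc:
        record.error = repr(exc)  # keep the failure reason, then re-raise
        raise
    finally:
        LOGGED_OPERATIONS.append(record)  # log successes and failures alike


# Existing code gains two lines: the `with` statement and the row count.
with log_operation("s3://example-bucket/orders.csv", "read") as op:
    rows = [{"id": 1}, {"id": 2}]  # stand-in for the existing read logic
    op.row_count = len(rows)
```

Because the record is appended in the `finally` clause, a failed read or write is still logged, which is what makes success and failure monitoring possible.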

Metadata Logged

The metadata captured as part of logged datasets includes:

  • dataset path: The URI that is associated with the dataset being logged
  • operation type: Whether the operation was a read or a write
  • schema: The column names and data types of your dataset
  • operation volume: The number of records that were read or written as part of the operation
  • data preview: Rows taken from the head of the dataset to provide sample data
  • column-level stats: Statistical characteristics of each column in the logged dataset
  • histograms: Graphs showing the frequency and distribution of the column values in your dataset

The following table provides an overview of the metadata that is logged for each type of dataset object:

Trends in an operation's metadata metrics are shown per dataset on the Datasets page, or per run on that run's Affected Datasets tab.
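
To make the metadata fields listed above more concrete, here is a minimal sketch that derives each of them from a small in-memory dataset represented as a list of dictionaries. It is not how Databand computes these values. The function name `summarize_dataset`, the output keys, and the choice of statistics are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def summarize_dataset(
    path: str, op_type: str, rows: List[Dict[str, Any]], preview_size: int = 2
) -> Dict[str, Any]:
    """Build an illustrative metadata summary for one dataset operation."""
    columns = list(rows[0].keys()) if rows else []
    # Schema: column names mapped to the type name seen in the first row.
    schema = {col: type(rows[0][col]).__name__ for col in columns}

    column_stats: Dict[str, Dict[str, Any]] = {}
    histograms: Dict[str, Dict[Any, int]] = {}
    for col in columns:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        stats: Dict[str, Any] = {
            "null_count": len(values) - len(non_null),
            "distinct_count": len(set(non_null)),
        }
        # Numeric columns also get min, max, and mean.
        is_numeric = bool(non_null) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in non_null
        )
        if is_numeric:
            stats.update(min=min(non_null), max=max(non_null), mean=mean(non_null))
        column_stats[col] = stats
        # Histogram: how often each distinct value occurs in the column.
        histograms[col] = dict(Counter(non_null))

    return {
        "dataset_path": path,
        "operation_type": op_type,
        "schema": schema,
        "operation_volume": len(rows),
        "data_preview": rows[:preview_size],  # rows from the head of the dataset
        "column_stats": column_stats,
        "histograms": histograms,
    }


orders = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 30.0},
    {"id": 3, "country": "US", "amount": None},
]
summary = summarize_dataset("s3://example-bucket/orders.csv", "read", orders)
```

For the three sample rows, `summary["operation_volume"]` is 3 and `summary["histograms"]["country"]` counts two `"US"` values and one `"DE"` value.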

Instructions and Examples

Tracking Datasets in Memory

Tracking Cursor Functions

Data at Rest

Overview

Databand also lets you track data at rest in BigQuery. To enable this, you create a service account in GCP that allows Databand to track your tables at set intervals. Unlike data in motion, data at rest appears only on the Datasets page in your Databand environment.

Metadata Logged

The following attributes are captured when monitoring data at rest in BigQuery:

  • table path: The combination of project, dataset, and table name
  • operation type: Whether an operation was a read or a write
  • operation volume: The number of records that were read or written as part of an operation

Historical trends are also available for your BigQuery data at rest. In addition to the Daily Rows Written & Read and Daily Data Operations trends that are tracked for datasets in motion, the Total Rows Over Time trend is also tracked for your monitored BigQuery tables.
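
As a sketch of how daily trends such as rows written and read and operation counts could be derived from individual operation records, the example below groups operations by day. The record layout, the function name `daily_trends`, and the output keys are assumptions for illustration; the page does not describe how Databand computes these trends internally.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# One operation: (day, operation type, number of records). Hypothetical layout.
Operation = Tuple[date, str, int]


def daily_trends(operations: List[Operation]) -> Dict[date, Dict[str, int]]:
    """Aggregate rows read, rows written, and operation counts per day."""
    trends: Dict[date, Dict[str, int]] = defaultdict(
        lambda: {"rows_read": 0, "rows_written": 0, "operations": 0}
    )
    for day, op_type, record_count in operations:
        bucket = trends[day]
        if op_type == "read":
            bucket["rows_read"] += record_count
        elif op_type == "write":
            bucket["rows_written"] += record_count
        else:
            raise ValueError(f"unsupported operation type: {op_type!r}")
        bucket["operations"] += 1
    # Return a plain dict ordered by day so the result reads as a time series.
    return dict(sorted(trends.items()))


table_operations = [
    (date(2024, 1, 2), "read", 50),
    (date(2024, 1, 1), "write", 100),
    (date(2024, 1, 1), "read", 100),
    (date(2024, 1, 2), "write", 25),
    (date(2024, 1, 2), "write", 25),
]
trends_by_day = daily_trends(table_operations)
```

A trend like Total Rows Over Time would need periodic snapshots of the table's size rather than per-operation counts, so it is left out of this sketch.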

Blending BigQuery data in motion with data at rest

Databand will attempt to blend your BigQuery data in motion with monitored tables at rest when possible. This is done by comparing the unique combination of project, dataset, and table name for a given table. When a dataset operation tracked via manual dataset logging is determined to be the same as an operation monitored in BigQuery, Databand will present the BigQuery job ID alongside the pipeline name and run ID.
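
The matching idea described above can be sketched as reducing each path to a (project, dataset, table) key and comparing keys. The path formats accepted here (`bigquery://project/dataset/table`, `project.dataset.table`, and `project:dataset.table`) and the names `TableKey`, `parse_table_key`, and `blend` are assumptions for illustration. Databand's actual matching logic is not documented on this page.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional


class TableKey(NamedTuple):
    project: str
    dataset: str
    table: str


def parse_table_key(path: str) -> Optional[TableKey]:
    """Reduce a BigQuery-style path to (project, dataset, table), else None."""
    scheme, separator, rest = path.partition("://")
    if separator:
        if scheme.lower() != "bigquery":
            return None  # some other storage system, e.g. s3://
    else:
        rest = path  # no scheme: treat the whole string as the table path
    parts = [part for part in re.split(r"[./:]", rest) if part]
    if len(parts) != 3:
        return None
    return TableKey(*parts)


def blend(
    in_motion_paths: Iterable[str], monitored_tables: Iterable[str]
) -> Dict[str, TableKey]:
    """Map each in-motion path to the monitored table it refers to, if any."""
    monitored = {parse_table_key(table) for table in monitored_tables}
    monitored.discard(None)
    matches: Dict[str, TableKey] = {}
    for path in in_motion_paths:
        key = parse_table_key(path)
        if key is not None and key in monitored:
            matches[path] = key
    return matches


matches = blend(
    ["bigquery://my-project/sales/orders", "s3://example-bucket/orders.csv"],
    ["my-project.sales.orders", "my-project.sales.customers"],
)
```

Here only the `bigquery://` path matches a monitored table; the `s3://` path is ignored because it cannot be reduced to a BigQuery table key.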

Instructions and Examples

BigQuery

