Dataset Logging

Logging dataset operations using Databand.

The dataset_op_logger context enables logging of dataset operations that your code is performing. This allows you to gain key insights into your data and to monitor the success or failure of your dataset operations. This information is displayed in the Affected Datasets section within each Run view.

Requirements for Dataset Logging

To report a dataset operation to Databand, you need to provide the following:

  • The path for your dataset. Provide the full URI (as complete as possible) with the schema, host, region, etc.
  • The type of operation. Specify whether the operation is a read or a write.
  • The dataframe (Spark, pandas, or other) that contains the records being read or written in your operation. This is what gives Databand visibility into your operation so that metadata such as row counts, schemas, and histograms can be collected.

Choosing the Right URI

In order for Databand to track your datasets from different sources, you should provide URI paths in a consistent manner. This helps ensure that a dataset affected by multiple operations is identified as the same dataset across tasks (e.g., a file written by one task and then read downstream by a different task). Some recommendations for good URI formats are below:

Filesystem / GCS / S3

Provide the full URI.
Example: s3://bucket/key/dataset.csv

BigQuery

Specify the region, project, dataset, and table.
Example: bigquery://region/project/dataset/table

Snowflake

Specify the relevant parts of the hostname along with the database, schema, and table.
Example: snowflake://name.region.cloud/database/schema/table

🚧

Note

URIs are case-sensitive, so two paths that differ only in letter case will be identified by Databand as two different datasets.
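
One way to keep URIs consistent (including casing) across tasks is to construct them in a single helper and reuse that helper everywhere. The sketch below only illustrates the pattern; the helper and the bucket name are hypothetical and not part of dbnd:

# Illustrative helper (not a dbnd API): build dataset URIs in one place so
# every task refers to the same dataset with exactly the same string.
def s3_dataset_uri(bucket: str, *key_parts: str) -> str:
    return f"s3://{bucket}/" + "/".join(key_parts)

# The writer task and the downstream reader task call the same helper,
# so casing and formatting cannot drift between them.
uri = s3_dataset_uri("my-bucket", "daily", "dataset.csv")  # s3://my-bucket/daily/dataset.csv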

Using dataset_op_logger

You can log datasets by using the dataset_op_logger context manager (with dataset_op_logger(...) as logger). With this approach, you don't need to specify whether the operation was successful because its status is tracked automatically inside the context.

Example:

from dbnd import dataset_op_logger

# Read example
with dataset_op_logger(path, "read") as logger:
    df = read(path, ...)
    logger.set(data=df)

# Write example
df = Dataframe(data)
with dataset_op_logger(path, "write") as logger:
    write(df, ...)
    logger.set(data=df)

Once inside the dataset_op_logger context, use the logger.set method to provide the dataframe that was either produced by the read operation or used in the write operation.
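
For example, a minimal pandas-based sketch might look like the following (the S3 paths are illustrative, and reading them directly with pandas assumes s3fs is installed):

import pandas as pd

from dbnd import dataset_op_logger

# Read: log the dataframe produced by the read operation.
with dataset_op_logger("s3://bucket/key/input.csv", "read") as logger:
    df = pd.read_csv("s3://bucket/key/input.csv")
    logger.set(data=df)

# Write: log the dataframe that is being written.
with dataset_op_logger("s3://bucket/key/output.csv", "write") as logger:
    df.to_csv("s3://bucket/key/output.csv", index=False)
    logger.set(data=df)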

❗️

Dataframe Support

By default, dbnd supports pandas dataframe logging.

To enable pySpark dataframe logging, install the dbnd-spark PyPI library on your client. No additional import statements are required in your Python script.

For other dataframe support, see the Custom Dataframe Logging section below.

It is crucial that you wrap only operation-related code in the dataset_op_logger context. Anything that is not related to reading or writing your dataset should be placed outside the context so that unrelated errors do not incorrectly report your dataset operation as failed.

Good Example:

with dataset_op_logger("location://path/to/value.csv", "read"):
    value = read_from()
    logger.set(data=value)
    # Read is successful

unrelated_func()

Bad Example:

with dataset_op_logger("location://path/to/value.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    # Read is successful
    unrelated_func()
    # If unrelated_func raises an exception, a failed read operation is reported to Databand.

Histogram Reporting

Histograms for the columns of your datasets can be logged using the dataset logging feature. To enable histogram reporting, set with_histograms=True. Your histograms, along with a few key data profiling metrics, will be accessible within the Databand UI.
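
A minimal sketch, assuming with_histograms is passed as a parameter to dataset_op_logger and using an illustrative S3 path:

import pandas as pd

from dbnd import dataset_op_logger

# with_histograms=True reports column histograms and profiling metrics
# for this operation in addition to the usual dataset metadata.
with dataset_op_logger("s3://bucket/key/dataset.csv", "read", with_histograms=True) as logger:
    df = pd.read_csv("s3://bucket/key/dataset.csv")
    logger.set(data=df)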

Supported Dataset Types

All dataset logging features are fully supported with pandas and Spark dataframes; however, if your application is running in a lightweight environment, you have the option of providing a list of dictionaries.

When fetching data from an external API, you will often get data in the following form:

data = [
  {
    "Name": "Name 1",
    "ID": 1,
    "Information": "Some information"
  },
  {
    "Name": "Name 2",
    "ID": 2,
    "Information": "Other information"
  },
  ...
]

Providing this list of dictionaries as the data argument of logger.set allows you to report its schema and volume:

with dataset_op_logger("http://some/api/response.json", "read"):
    logger.set(data=data)

Volume is determined by the length of the list; in our example, the volume will be 2.

Schema is determined by flattening the dictionaries; in our example, the schema will be Name: str, ID: int, Information: str.
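
As a rough illustration only (this is not dbnd's actual implementation), the reported values for the flat example above correspond to:

data = [
    {"Name": "Name 1", "ID": 1, "Information": "Some information"},
    {"Name": "Name 2", "ID": 2, "Information": "Other information"},
]

volume = len(data)  # 2
schema = {key: type(value).__name__ for key, value in data[0].items()}
# {"Name": "str", "ID": "int", "Information": "str"}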

Viewing Affected Datasets in the UI

The approach described above allows you to log the datasets you interact with by informing Databand that you are reading or writing data. These interactions are displayed in the Databand UI in the 'Affected Datasets' section within each run of your pipeline.

To access the Affected Datasets section:

  • Go to the Runs page
  • Click on the DAG run that contains the datasets you are logging
  • Navigate to the 'Affected Datasets' tab

Support for Custom Dataframe Types

By default, Databand's dataset logging supports only pandas and Spark dataframes. You can enable support for other dataframe types by implementing a DataFrameValueType and registering it with dbnd.

For example, you might want to log a Koalas dataframe. Koalas implements the pandas API on top of Spark.

To support Koalas, copy the code snippet below and make sure it runs when your system starts up:

import databricks.koalas as ks

from targets.values import DataFrameValueType, register_value_type


class KoalasValueType(DataFrameValueType):
    # The Python type this value type handles, and the name reported for it.
    type = ks.DataFrame
    type_str = "KoalasDataFrame"


register_value_type(KoalasValueType())
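
Once the value type is registered, a Koalas dataframe can be logged like any other supported dataframe. A minimal sketch (the path is illustrative):

import databricks.koalas as ks

from dbnd import dataset_op_logger

# With KoalasValueType registered, the dataset logger can collect metadata
# (schema, row counts) from Koalas dataframes.
with dataset_op_logger("s3://bucket/key/dataset.csv", "read") as logger:
    kdf = ks.read_csv("s3://bucket/key/dataset.csv")
    logger.set(data=kdf)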

🚧

Important

Histograms are currently not supported for custom dataframe types.

