Dataset Logging

Logging dataset operations using Databand.

The dataset_op_logger context manager and the log_dataset_op function enable logging of the dataset operations that your code performs. These operations are displayed in the Affected Datasets section within each Run view.

Prerequisites for Enabling Dataset Logging

In order to report dataset operations to Databand, you need to provide the following:

  • The path of the dataset you’re working with. Provide the full URI (as complete as possible), including the scheme, host, region info, etc.

🚧

URI info

If you are using S3 or GCS, you need to provide the full URI (for example, s3://bucket/path/to/file.csv rather than just the object key).

  • Specify the operation type performed on the dataset. The operation type can be “read” or “write”.
  • (Optional) Inform Databand whether the operation was successful. The context manager syntax tracks this for you automatically; if you use the functional syntax, you also need to report whether the operation failed.
  • Provide Databand with a dataframe (pandas, Spark, or other) that contains the data relevant to the operation. For example, if you read rows from a table, pass the resulting dataframe to Databand so that it has visibility into the operation: how many rows you’ve read, the schema, histograms, etc.

How to Report Dataset Operations

There are two possible approaches to logging datasets:

  • Context manager - using the with statement and dataset_op_logger (recommended).
  • Functional - using the log_dataset_op function (not recommended).

After you’ve implemented one of these approaches, you’ll be able to log datasets that you’re interacting with and monitor their data.

Using dataset_op_logger (recommended)

You can log datasets by using the with statement and dataset_op_logger.
With this approach, you don’t need to specify if the operation was successful because its status is already being tracked inside this context.

Example:

import pandas as pd

from dbnd import dataset_op_logger

# Read dataset example
with dataset_op_logger(path, "read") as logger:
    df = read(path, ...)  # your own read logic
    logger.set(data=df)

# Write dataset example
df = pd.DataFrame(data)
with dataset_op_logger(path, "write") as logger:
    write(df, ...)  # your own write logic
    logger.set(data=df)
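
For a concrete, end-to-end illustration, here is a minimal, self-contained sketch that assumes pandas is installed and uses hypothetical local CSV paths:

import pandas as pd

from dbnd import dataset_op_logger

path = "/data/users.csv"  # hypothetical path

# Read: report the dataframe produced by the read operation
with dataset_op_logger(path, "read") as logger:
    users = pd.read_csv(path)
    logger.set(data=users)

out_path = "/data/users_copy.csv"  # hypothetical path

# Write: report the dataframe used in the write operation
with dataset_op_logger(out_path, "write") as logger:
    users.to_csv(out_path, index=False)
    logger.set(data=users)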

You need to provide Databand with the following info:

  • path
  • operation type (read or write)

Once inside the dataset_op_logger context, use the set method to provide the dataframe either produced by the read operation or used in the write operation.

❗️

Dataframe Support

By default, dbnd supports pandas DF logging.

To enable pySpark DF logging, install the dbnd-spark PyPI library on your client. No additional import statements are required in your Python script.

For other dataframe support, see the Support for Custom Dataframe Types section below.
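
For example, here is a minimal sketch of pySpark dataframe logging, assuming dbnd-spark is installed and a SparkSession is available (the bucket path is hypothetical):

from pyspark.sql import SparkSession

from dbnd import dataset_op_logger

spark = SparkSession.builder.getOrCreate()

path = "s3://my-bucket/events/latest.parquet"  # hypothetical full URI
with dataset_op_logger(path, "read") as logger:
    events = spark.read.parquet(path)
    logger.set(data=events)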

It’s crucial that you wrap only operation-related code with dataset_op_logger. If unrelated code inside the context raises an exception, the operation is reported to Databand as failed even though the read or write itself succeeded. Keep anything that isn’t part of reading or writing outside the context.

Here are good and bad examples of how to use dataset_op_logger within this approach.

Good Example:

with dataset_op_logger("location://path/to/value.csv", "read"):
    value = read_from()
    logger.set(data=value)
    # Read is successful

unrelated_func()

Bad Example:

with dataset_op_logger("location://path/to/value.csv", "read") as logger:
    value = read_from()
    logger.set(data=value)
    # Read is successful
    unrelated_func()
    # If unrelated_func raises an exception, a failed read operation is reported to Databand.

Using log_dataset_op (reference only)

You can log datasets by using log_dataset_op.
You will need to provide:

  • path
  • operation type (read or write)
  • dataframe
  • (optional, recommended) report preview
  • (optional, recommended) report schema
  • (optional, recommended) report histogram

Example:

from dbnd import log_dataset_op, task
from dbnd._core.constants import DbndDatasetOperationType  # enum import path may vary across dbnd versions


@task()
def task_with_log_datasets():
    log_dataset_op(
        "/path/to/value.csv",
        DbndDatasetOperationType.read,
        data=pandas_data_frame,
        with_preview=True,
        with_schema=True,
    )

📘

Note

Instead of DbndDatasetOperationType.read, you can use the strings "write" or "read".

Histogram Reporting

Histograms and other metrics can also be logged and reported as a part of this feature. To enable histogram reporting, use with_histograms=True.
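
For example, here is a sketch of the functional call shown above with histogram reporting enabled, using the string form of the operation type (pandas_data_frame is the dataframe from the earlier example):

from dbnd import log_dataset_op

log_dataset_op(
    "/path/to/value.csv",
    "read",
    data=pandas_data_frame,
    with_preview=True,
    with_schema=True,
    with_histograms=True,
)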

Viewing Affected Datasets in the UI

The approaches described above let you inform Databand that you are interacting with a dataset, whether reading from it or writing to it.

These interactions are displayed in Databand in the ‘Affected Datasets’ section within each Run view.

To access it:

  • Go to the Runs tab
  • Click on the DAG that contains the datasets you are logging
  • Navigate to the ‘Affected Datasets’ tab

Support for Custom Dataframe Types

By default, log_dataset_op supports only pandas and Spark dataframes.
You can enable support for other dataframe types by implementing a DataFrameValueType and registering it with dbnd.

Let’s say you want to use log_dataset_op with a Koalas dataframe. Koalas provides the pandas API on top of Spark DataFrames.
In order to support it, copy this code snippet and make sure it runs when your system starts up:

import databricks.koalas as ks

from targets.values import DataFrameValueType, register_value_type


class KoalasValueType(DataFrameValueType):
    # Map the Koalas DataFrame class to a dbnd value type
    type = ks.DataFrame
    type_str = "KoalasDataFrame"


# Register the custom value type so dbnd can recognize and log Koalas dataframes
register_value_type(KoalasValueType())
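
Once the value type is registered, dataset_op_logger (or log_dataset_op) can be used with Koalas dataframes in the same way as with pandas. A minimal sketch, assuming the registration above has already run (the path is hypothetical):

import databricks.koalas as ks

from dbnd import dataset_op_logger

path = "s3://my-bucket/users.parquet"  # hypothetical full URI
with dataset_op_logger(path, "read") as logger:
    users = ks.read_parquet(path)  # returns a Koalas DataFrame
    logger.set(data=users)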

🚧

Important

Histograms are currently not supported for custom dataframe types.

