The `dataset_op_logger` context manager and the `log_dataset_op` method enable logging of the dataset operations that your code performs. These operations are displayed in the Affected Datasets section within each Run view.
To report a dataset operation to Databand, you need to provide the following:
- The path of the dataset you're working with. Provide the full URI (as complete as possible), including the scheme, host, region info, etc. If you are using S3 or GCS, you need to provide the full URI (see the example paths after this list).
- The operation type performed on the dataset. The operation type can be either "read" or "write".
- (Optional) Whether the operation was successful. Prefer the context manager syntax over the functional syntax; the context manager tracks this automatically, while with the functional syntax you also need to inform Databand if the operation has failed.
- A dataframe (Spark or other) that contains only the data relevant to the operation. For example, if you read some rows from a table, provide the resulting dataframe to Databand so that it has visibility into the operation: how many rows you've read, the schema, histograms, etc.
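For example, fully qualified dataset paths might look like the following (the bucket names, account, and object paths below are hypothetical):

```python
# Hypothetical examples of fully qualified dataset URIs: include the scheme,
# bucket/host, and, where applicable, region or account information.
s3_path = "s3://my-company-datalake/warehouse/orders/2022-01-01/orders.parquet"
gcs_path = "gs://my-company-datalake/warehouse/orders/2022-01-01/orders.parquet"
warehouse_path = "snowflake://my-account.eu-west-1/ANALYTICS/PUBLIC/ORDERS"  # illustrative layout only
```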
There are two possible approaches to logging datasets:
- Context manager - using the `dataset_op_logger` context (recommended).
- Functional - using a specific `log_dataset_op` function (not recommended).
After you’ve implemented one of these approaches, you’ll be able to log datasets that you’re interacting with and monitor their data.
You can log datasets by using the `dataset_op_logger` context manager. With this approach, you don't need to specify whether the operation was successful because its status is already tracked inside this context.
```python
from dbnd import dataset_op_logger

# Read dataset example
with dataset_op_logger(path, "read") as logger:
    df = read(path, ...)
    logger.set(data=df)

# Write dataset example
df = Dataframe(data)
with dataset_op_logger(path, "write") as logger:
    write(df, ...)
    logger.set(data=df)
```
You need to provide Databand with the following info:
- dataset path
- operation type ("read" or "write")
Once inside the `dataset_op_logger` context, use the `set` command to provide the dataframe that was either produced by the read operation or used in the write operation.
`dbnd` supports pandas DataFrame logging out of the box.
To enable PySpark DataFrame logging, install the `dbnd-spark` PyPI library on your client. No additional import statements are required in your Python script.
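For example, once `dbnd-spark` is installed, a Spark DataFrame can be passed to `logger.set` just like a pandas one. A minimal sketch (the SparkSession setup and the dataset path are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from dbnd import dataset_op_logger

spark = SparkSession.builder.getOrCreate()

path = "s3://my-bucket/events/2022-01-01/"  # hypothetical dataset path
with dataset_op_logger(path, "read") as logger:
    spark_df = spark.read.parquet(path)  # read the dataset with Spark
    logger.set(data=spark_df)            # dbnd-spark lets Databand profile the Spark DataFrame
```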
For other dataframe support, see the Custom Dataframe Support section below.
It's crucial that you wrap only operation-related code with the context manager. Don't do anything beyond reading or writing inside this context; otherwise, unrelated failures will be reported to Databand as failed dataset operations. Anything that's not related to reading or writing the dataset should be placed outside of it.
Here are good and bad examples of how to use `dataset_op_logger` within this approach.
with dataset_op_logger("location://path/to/value.csv", "read"): value = read_from() logger.set(data=value) # Read is successful unrelated_func()
with dataset_op_logger("location://path/to/value.csv", "read") as logger: value = read_from() logger.set(data=value) # Read is successful unrelated_func() # If unrelated_func raises an exception, a failed read operation is reported to Databand.
You can log datasets by using the `log_dataset_op` function.
You will need to provide:
- dataset path
- operation type (`DbndDatasetOperationType.read` or `DbndDatasetOperationType.write`)
- dataframe that contains the data of the operation
- (optional, recommended) report preview
- (optional, recommended) report schema
- (optional, recommended) report histogram
```python
@task()
def task_with_log_datasets():
    log_dataset_op(
        "/path/to/value.csv",
        DbndDatasetOperationType.read,
        data=pandas_data_frame,
        with_preview=True,
        with_schema=True,
    )
```
Instead of `DbndDatasetOperationType.read` or `DbndDatasetOperationType.write`, you can use the strings "read" and "write".
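Unlike the context manager, the functional approach requires you to report a failed operation yourself. A minimal sketch, assuming `log_dataset_op` accepts a `success` flag (the `read` helper and the path are placeholders):

```python
from dbnd import log_dataset_op

path = "/path/to/value.csv"
try:
    pandas_data_frame = read(path)  # your own read logic (placeholder)
except Exception:
    # Assumption: passing success=False reports a failed read operation to Databand.
    log_dataset_op(path, "read", success=False)
    raise
else:
    log_dataset_op(path, "read", data=pandas_data_frame, with_preview=True, with_schema=True)
```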
Histograms and other metrics can also be logged and reported as a part of this feature. To enable histogram reporting, use the `with_histograms` parameter.
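For example, extending the call above (assuming `with_histograms` follows the same pattern as `with_preview` and `with_schema`):

```python
log_dataset_op(
    "/path/to/value.csv",
    "read",
    data=pandas_data_frame,
    with_preview=True,
    with_schema=True,
    with_histograms=True,  # enable histogram reporting for this operation
)
```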
The approaches described above allow you to inform Databand that you are interacting with a dataset by reading from or writing to it. These interactions are displayed in Databand in the 'Affected Datasets' section within each Run view.
To access it:
- Go to the Runs tab
- Click on the DAG that contains the datasets you are logging
- Navigate to the ‘Affected Datasets’ tab
`log_dataset_op` supports only pandas and PySpark dataframes out of the box. However, you can enable support for other dataframe types by implementing a `DataFrameValueType` and registering it with `register_value_type`.
Let's say you want to use Koalas, which provides the pandas API for Spark DataFrames. To support it, copy this code snippet and make sure it runs when your system starts up:
```python
import databricks.koalas as ks

from targets.values import DataFrameValueType, register_value_type


class KoalasValueType(DataFrameValueType):
    type = ks.DataFrame
    type_str = "KoalasDataFrame"


register_value_type(KoalasValueType())
```
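Once the value type is registered, a Koalas dataframe can be logged the same way as any other supported dataframe. A minimal sketch (the dataset path is a placeholder):

```python
import databricks.koalas as ks
from dbnd import dataset_op_logger

path = "s3://my-bucket/users.csv"  # hypothetical dataset path
with dataset_op_logger(path, "read") as logger:
    kdf = ks.read_csv(path)   # returns a databricks.koalas.DataFrame
    logger.set(data=kdf)      # recognized via the registered KoalasValueType
```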
Histograms are currently not supported within custom dataframes.