

Histogram reporting in Databand.

Logging Histograms and Statistics

Histograms that include data profiling information can be automatically fetched from Pandas DataFrames, Spark DataFrames, and from warehouses such as Amazon Redshift and PostgreSQL. For more information, see Histograms.

The log_dataset_op function provides two advanced logging options: statistics and histograms.
To enable them, set the with_stats and with_histograms parameters to True:

from dbnd import log_dataset_op


Calculating statistics and histograms on large datasets can take a long time, because it requires analyzing the data itself. DBND therefore lets you specify which columns to analyze.

The following options are available for both the with_stats and with_histograms parameters:

  • Iterable[str] - calculate only for the columns whose names appear in the iterable (list, tuple, etc.)
  • str - a comma-delimited list of column names to calculate for
  • True - calculate for all columns in the data frame
  • False - do not calculate at all (the default)

Enabling Histograms for Python Functions Tracking

You can enable histogram tracking for individual tasks by using one of the following methods:

  • Add a decorator with histogram tracking enabled to your task functions:
    @task(<parameter name>=parameter[DataFrame](log_histograms=True)) for a task input, or @task(result=output.prod_immutable[DataFrame](log_histograms=True)) for a task output.

  • Add the following line to your task code:
    log_dataframe("<dataset name>", <dataframe>, with_histograms=True)

[histogram] Configuration Section Parameter Reference

  • spark_parquet_cache_dir - enables pre-caching of Spark DataFrames as .parquet files stored under spark_temp_dir.
  • spark_cache_dataframe - determines whether to cache the whole DataFrame during histogram calculation.
  • spark_cache_dataframe_column - enables caching of the numerical columns of the DataFrame during histogram calculation.
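In configuration-file form, these parameters live under the [histogram] section of your DBND configuration. The values below are illustrative:

```ini
[histogram]
# Directory used to pre-cache DataFrames as .parquet files
spark_parquet_cache_dir = /tmp/dbnd_histogram_cache
# Cache the whole DataFrame during histogram calculation
spark_cache_dataframe = True
# Cache the numerical columns of the DataFrame during histogram calculation
spark_cache_dataframe_column = True
```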
