
Histograms

Histogram reporting in Databand.

Logging Histograms and Statistics

Histograms that include data profiling information can be automatically collected from Pandas DataFrames, Spark DataFrames, and from warehouses such as Amazon Redshift and PostgreSQL.

The log_dataframe function provides two advanced logging options: statistics and histograms.
To enable these options, set the with_stats and with_histograms parameters to True:

log_dataframe(
    "key",
    data=pandas_df,
    with_stats=True,
    with_histograms=True,
)

Calculating statistics and histograms can take a long time on large datasets, since it requires analyzing the data itself. DBND therefore lets you specify which columns you want to analyze.

Both the with_histograms and with_stats parameters accept the following values (see the example after this list):

  • Iterable[str] - calculate only for the columns whose names appear in the iterable (list, tuple, etc.)
  • str - a comma-delimited string of column names
  • True - calculate for all columns in the dataframe
  • False - do not calculate; this is the default
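
For example, a few ways to limit the calculation to specific columns; the dataset name, dataframe, and column names here are illustrative:

log_dataframe("customers", customers_df, with_histograms=["age", "income"])  # Iterable[str]
log_dataframe("customers", customers_df, with_histograms="age,income")       # comma-delimited string
log_dataframe("customers", customers_df, with_stats=True)                    # statistics for all columns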

The LogDataRequest class can be used for more flexible selection, such as calculating histograms only for boolean columns. LogDataRequest has the following attributes:

  • include_columns - list of column names to include
  • exclude_columns - list of column names to exclude
  • include_all_boolean, include_all_numeric, include_all_string - select all boolean, numeric, and/or string columns respectively.


Here is an example of using the LogDataRequest:

log_dataframe("customers_data", data,
                  with_histograms=LogDataRequest(include_all_numeric=True,
                                                   exclude_columns=["name", "phone"]))

Alternatively, you can use the following helper methods:

LogDataRequest.ALL()
LogDataRequest.ALL_STRING()
LogDataRequest.ALL_NUMERIC()
LogDataRequest.ALL_BOOLEAN()
LogDataRequest.NONE()
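
For example, using one of these shortcuts (the dataset name and dataframe are illustrative):

log_dataframe("customers_data", data, with_histograms=LogDataRequest.ALL_NUMERIC())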

Histogram Visualization

Example histogram showing the data distribution for a column named "AvgVolume"

When histogram reporting is enabled, Databand will collect the following metrics across columns of a data table or dataframe:

  • Mean
  • Quartiles
  • Max
  • Min
  • Column type
  • Standard deviation
  • Distinct counts
  • Null count
  • Non-null count

Example chart showing the trend of distinct counts in a date column across pipeline runs

For more information, see Histograms.

Enabling Histograms for Function Tracking

There are two ways to enable histogram logging:

  1. Enable Databand to log and display histograms in the configuration. To do this, set the [tracking]/log_histograms and [features_flags]/ui_histograms flags to true (see the configuration sketch after this list). For more information about these options, see SDK Configuration.

  2. Enable histogram tracking in individual tasks. You can do this by using one of the following methods:

  • Add a decorator with histogram tracking enabled to your task functions:
    @task(<parameter name>=parameter[DataFrame](log_histograms=True)) for a task input, or @task(result=output.prod_immutable[DataFrame](log_histograms=True)) for a task output (see the sketch after this list).

  • Add the following call to your task code:
    log_dataframe("key", data, with_histograms=True)
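
For option 1, the flags named above might look like this in your DBND configuration file (the section placement follows the names listed above and is an assumption):

[tracking]
log_histograms = True

[features_flags]
ui_histograms = True

For option 2, here is a minimal sketch of a task decorated with histogram tracking enabled for both its input and output; the task name and transformation are hypothetical:

from dbnd import task, parameter, output
from pandas import DataFrame

@task(
    raw_data=parameter[DataFrame](log_histograms=True),
    result=output.prod_immutable[DataFrame](log_histograms=True),
)
def prepare_customers(raw_data: DataFrame) -> DataFrame:
    # Histograms are reported for both the raw_data input and the returned result.
    return raw_data.dropna()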

Optimizing Histograms on Spark

When calling log_dataframe with a dataframe that is not optimized for histogram calculation (e.g., read from CSV and not cached), we recommend setting spark_parquet_cache_dir, which will save the dataframe to a temporary parquet file and use it to calculate histograms.

Another option is to cache the dataframe by setting spark_cache_dataframe=True, which can improve performance if the dataframe fits in memory.

[histogram]
spark_parquet_cache_dir = "hdfs://tmp/"
spark_cache_dataframe = False
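
As a sketch, logging histograms for a Spark DataFrame read from CSV (the path, session setup, and dataset name are illustrative) could look like this:

from pyspark.sql import SparkSession
from dbnd import log_dataframe

spark = SparkSession.builder.getOrCreate()

# A DataFrame read straight from CSV is not optimized for histogram calculation,
# so the [histogram] settings above control how DBND caches it before profiling.
customers_df = spark.read.csv("hdfs://data/customers.csv", header=True, inferSchema=True)

log_dataframe("customers", customers_df, with_stats=True, with_histograms=True)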
