Histograms that include data profiling information can be automatically fetched from Pandas DataFrames, Spark DataFrames and from warehouses like Amazon Redshift and PostgreSQL.
When histogram reporting is enabled, Databand will collect the following metrics across columns of a data table or dataframe:
- Column type
- Standard deviation
- Distinct count
- Null count
- Non-null count
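To make these metrics concrete, here is a minimal plain-Python sketch that computes them for a single column. It is only an illustration and not Databand's implementation: the `profile_column` helper is hypothetical, the type label is simply the Python type name, and the choice of sample standard deviation is an assumption.

```python
import statistics
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]]) -> Dict[str, Any]:
    """Compute the per-column metrics listed above (hypothetical helper)."""
    non_null = [v for v in values if v is not None]

    # Use a single type name when all non-null values share one type.
    type_names = {type(v).__name__ for v in non_null}
    if not type_names:
        column_type = "unknown"
    elif len(type_names) == 1:
        column_type = next(iter(type_names))
    else:
        column_type = "mixed"

    # Standard deviation only makes sense for fully numeric columns
    # with at least two values (sample standard deviation is assumed here).
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) >= 2 and len(numeric) == len(non_null):
        std = statistics.stdev(numeric)
    else:
        std = None

    return {
        "type": column_type,
        "std": std,
        "distinct_count": len(set(non_null)),
        "null_count": len(values) - len(non_null),
        "non_null_count": len(non_null),
    }


# Profile each column of a small in-memory table.
table = {"price": [1, 2, None, 2, 3], "label": ["a", None, "b"]}
for name, column in table.items():
    print(name, profile_column(column))
```

In practice Databand collects these values for you once histogram reporting is enabled; the sketch only shows what each metric means.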
There are two ways to enable histogram logging:
- Enable Databand to log and display histograms in the configuration. To do this, set the corresponding option to true in databand-system.cfg. For more information about these options, see Configuration options in the databand-system.cfg.
- Enable histogram tracking in individual tasks. You can do this by using one of the following methods:
  - Add a decorator with histogram tracking enabled to your task functions: @task(<parameter name>=parameter[DataFrame](log_histograms=True)) for task input, or @task(result=output.prod_immutable[DataFrame](log_histograms=True)) for task output.
  - Add a log_dataframe call with histograms enabled to your task code.
When you call log_dataframe with a dataframe that is not optimized for histogram calculation (for example, one read from CSV and not cached), we recommend setting spark_parquet_cache_dir, which saves the dataframe to a temporary Parquet file and uses that file to calculate histograms.
Another option is to cache the dataframe by setting spark_cache_dataframe=True, which can improve performance if the dataframe fits in memory.
[histogram]
spark_parquet_cache_dir = "hdfs://tmp/"
spark_cache_dataframe = False
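As a quick sanity check on the fragment above, the following sketch reads it with Python's standard configparser module. This is only an assumption about the file being INI-style; Databand's own configuration loader may handle quoting and types differently.

```python
import configparser

# The [histogram] fragment from above, written one option per line.
CONFIG_TEXT = """
[histogram]
spark_parquet_cache_dir = "hdfs://tmp/"
spark_cache_dataframe = False
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

section = parser["histogram"]
# configparser keeps the surrounding double quotes, so strip them here.
cache_dir = section.get("spark_parquet_cache_dir").strip('"')
# getboolean accepts values such as "False", "true", "yes", and "0".
cache_in_memory = section.getboolean("spark_cache_dataframe")

print(f"parquet cache dir: {cache_dir}")
print(f"cache dataframe in memory: {cache_in_memory}")
```

With these values, histograms would be calculated from the temporary Parquet file rather than from an in-memory cache.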