Pandas DataFrame

Tracking Pandas DataFrame using DBND.

This guide showcases how DBND can be used for tracking important data metrics of a Pandas DataFrame. We will use Pandas to compute data metrics and log them with DBND.

Before you can follow the steps in this guide, you need to install dbnd and pandas. You will also need SQLAlchemy if you want to read the data directly from the database:

pip install dbnd pandas
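To read from a PostgreSQL database with SQLAlchemy, something like the following should cover it (psycopg2-binary is just one common PostgreSQL driver choice, an assumption about your setup):

pip install sqlalchemy psycopg2-binary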

The database used will be a local PostgreSQL database containing the following data:

[DataFrame preview]

The table columns are:

  • index - The index of the DataFrame.
  • invoice_ID - Unique ID generated for each transaction.
  • transaction_amt - Amount exchanged in the transaction.
  • transaction_time - POSIX timestamp of transaction.
  • transaction_fulfilled - Whether the transaction has been fulfilled.

Logging Metrics with Pandas

As with the previous SQL example, first you need to establish a connection to your database with SQLAlchemy and use this connection to extract the table from the database.
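The example script below reads the same data from a CSV export, but extracting the table directly from the database could look roughly like this (the connection string and the transactions table name are placeholders for your own setup):

# Python 3.6.8
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string for a local PostgreSQL database
engine = create_engine("postgresql://user:password@localhost:5432/transactions_db")

# read the whole table into a Pandas DataFrame
transactions_df = pd.read_sql_table("transactions", con=engine)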

You can use DBND's log_dataframe function to track the DataFrame, its shape, and meaningful statistics, and to automatically generate histograms (if those options are turned on).

For a complete set of options available in log_dataframe, visit the Metrics page.

From here, you can use Pandas to find the number of transactions with an amount of at least 5000 and their mean amount. You can use log_dataframe again to quickly track a sub-DataFrame containing only the rows where transaction_amt is greater than or equal to 5000.

# Python 3.6.8
from dbnd import log_metric, log_dataframe
import pandas as pd

transactions_df = pd.read_csv("data/example.csv")

# log the full dataframe
log_dataframe("Full table", transactions_df, with_histograms=False, with_preview=False)

# compute metrics for large transactions (amount >= 5000)
minimum_amount = 5000
large_transactions_df = transactions_df[transactions_df['transaction_amt'] >= minimum_amount]
num_large_transactions = large_transactions_df.shape[0]
avg_large_transaction = large_transactions_df['transaction_amt'].mean()

# log the metrics and the filtered sub-dataframe
log_metric("Number of large transactions(>= 5000)", num_large_transactions)
log_metric("mean large transactions", avg_large_transaction)
log_dataframe("Large transactions (>= 5000)", large_transactions_df, with_preview=False)

Executing the Scripts

🚧

Reminder

Make sure that you have enabled tracking for the script, as described in Tracking Python.

Use Python to execute the script normally:

python tracking_sql_df_based.py

Output from tracking Pandas

The output of the script is descriptive of the data because log_dataframe was used instead of tracking the rows as a list. The metrics that were logged are identical to the ones logged in the first script, but they were computed with different tools. The output is shown below:

==================== 
= Running Databand!
 TRACKERS   : ['console']
 
INFO  tracking_sql_df_based.py__5851eb4702 - Metrics logged:
    Full table.shape0=100
    Full table.shape1=5
    Full table.schema={'type': 'DataFrame', 'columns': ['index', 'invoice_ID', 'transaction_amt', 'transaction_time', 'transaction_fulfilled'], 'shape': (100, 5), 'dtypes': {'index': 'int64', 'invoice_ID': 'object', 'transaction_amt': 'float64', 'transaction_time': 'float64', 'transaction_fulfilled': 'int64'}, 'size': 500}
INFO  tracking_sql_df_based.py__5851eb4702 - Metric logged: Number of large transactions(>= 5000)=59
INFO  tracking_sql_df_based.py__5851eb4702 - Metric logged: mean large transactions=9234.756779661016
INFO  tracking_sql_df_based.py__5851eb4702 - Metrics logged:
    Large transactions (>= 5000).shape0=59
    Large transactions (>= 5000).shape1=5
    Large transactions (>= 5000).schema={'type': 'DataFrame', 'columns': ['index', 'invoice_ID', 'transaction_amt', 'transaction_time', 'transaction_fulfilled'], 'shape': (59, 5), 'dtypes': {'index': 'int64', 'invoice_ID': 'object', 'transaction_amt': 'float64', 'transaction_time': 'float64', 'transaction_fulfilled': 'int64'}, 'size': 295}
  
...
INFO  tracking_sql_df_based.py__5851eb4702 - Histogram logged: Large transactions (>= 5000).transaction_amt
INFO  tracking_sql_df_based.py__5851eb4702 - ###############################################################################
INFO  tracking_sql_df_based.py__5851eb4702 -                                                        count  bin
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  6  5024.09           
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                        2  5441.297500000001 
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 1  5858.505          
INFO  tracking_sql_df_based.py__5851eb4702 -                                                           0  6275.7125         
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                        2  6692.92           
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ            5  7110.1275000000005
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     4  7527.335          
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  7944.5425         
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  6  8361.75           
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                        2  8778.9575         
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  9196.165          
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  9613.3725         
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 1  10030.58          
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  10447.7875        
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                        2  10864.994999999999
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                     4  11282.2025        
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  11699.41          
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                                 1  12116.6175        
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              3  12533.825         
INFO  tracking_sql_df_based.py__5851eb4702 - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ            5  12951.0325   

...
==================== 
= Your run has been successfully executed!
 TRACKERS   : ['console']

Logging Histograms and Statistics

By using the log_dataframe function, you can enable its advanced logging options: statistics and histograms.
To enable these options, set the with_histograms and with_stats parameters to True:

log_dataframe(
    "data",
    pandas_df,
    with_stats=True,
    with_histograms=True,
)

Calculating statistics and histograms can take a long time on large datasets, since it requires analyzing the data. DBND therefore provides a way of specifying which columns you want to analyze.

The following options are available for both the with_histograms and with_stats parameters (see the example after the list):

  • Iterable[str] - calculate for the columns whose names appear in an iterable (list, tuple, etc.)
  • str - a comma-delimited list of column names
  • True - calculate for all columns within a data frame
  • False - do not calculate; this is the default behavior
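
For example, both column-selection forms could be passed like this (the column names are the ones from the transactions table above and are only illustrative):

# analyze only the listed columns, passed as an iterable of names
log_dataframe("data", pandas_df, with_stats=["transaction_amt", "transaction_time"])

# the same selection, passed as a comma-delimited string
log_dataframe("data", pandas_df, with_histograms="transaction_amt,transaction_time")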

LogDataRequest can be used for more flexible options, such as calculating histograms only for boolean columns. LogDataRequest has the following attributes:

  • include_columns - list of column names to include
  • exclude_columns - list of column names to exclude
  • include_all_boolean, include_all_numeric, include_all_string - select all boolean, numeric, and/or string columns respectively.


Here is an example of using the LogDataRequest:

from dbnd import task, log_dataframe, LogDataRequest
import pandas as pd

@task
def process_customers_data(data: pd.DataFrame = "s3://pipelines/customers_data.csv") -> pd.DataFrame:
    log_dataframe("customers_data", data,
                  with_histograms=LogDataRequest(include_all_numeric=True,
                                                 exclude_columns=["name", "phone"]))
    return data

Alternatively, you can use the following helper methods:

LogDataRequest.ALL()
LogDataRequest.ALL_STRING()
LogDataRequest.ALL_NUMERIC()
LogDataRequest.ALL_BOOLEAN()
LogDataRequest.NONE()
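
For instance, these helpers can be passed directly as parameter values; a minimal sketch reusing the customers_data example above:

log_dataframe("customers_data", data,
              with_stats=LogDataRequest.ALL_NUMERIC(),
              with_histograms=LogDataRequest.NONE())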

