Tracking Quickstart

After installing Databand, try our quickstart example to learn how to track various metrics in your code.

In this tutorial, you will learn how to:

  • Log data, metrics, outputs, and other important details for tracking your data pipeline health in a simple ML workflow with Pandas DataFrames
  • Use DBND's histogram feature to quickly visualize DataFrame metadata in your CLI

The tutorial uses a Python script that trains a model to predict disease progression from a healthcare dataset.

Installing DBND

Before starting with this tutorial, install dbnd, scikit-learn, and pandas using the following command:

pip install dbnd scikit-learn pandas

Step 1. Loading and Tracking the Dataset

Sci-Kit Learn provides many datasets for experimentation. Among them is the Diabetes dataset. The Diabetes dataset has ten features:

  • age (age)
  • sex (sex)
  • body mass index (bmi)
  • average blood pressure (bp)
  • six blood serum measurements (s1 - s6) collected from 442 patients.

The target (target) is a quantitative measurement of disease progression one year after the indicators were collected.

Note that in this dataset, each of the ten feature variables has been mean centered and scaled by the standard deviation times n_samples (i.e., the sum of squares of each column totals 1).

First, load this dataset, then split it into a training set and testing set for our model:

# Python 3.6.8
from sklearn import datasets
from sklearn.model_selection import train_test_split
from pandas import DataFrame, Series
from typing import Tuple

def prepare_data() -> Tuple[DataFrame, DataFrame]:
    ''' load dataset from sklearn. split into training and testing sets'''
    raw_data = datasets.load_diabetes()

    # create a pandas DataFrame from sklearn dataset
    df = DataFrame(raw_data['data'], columns=raw_data['feature_names'])
    df['target'] = Series(raw_data['target'])

    # split the data into training and testing sets 
    training_data, testing_data = train_test_split(df, test_size=0.25)

    return training_data, testing_data

As is, there is no good method to quickly understand this data. Are there biases in the data? What is the statistical composition? To answer these questions, you can track statistics, metadata, and visualize your data composition with DBND logging.

Integrating DataFrame Tracking

# Python 3.6.8
from sklearn import datasets
from sklearn.model_selection import train_test_split
from pandas import DataFrame, Series
import logging 
from typing import Tuple
from dbnd import log_dataframe, log_metric

logging.basicConfig(level=logging.INFO)

def prepare_data() -> Tuple[DataFrame, DataFrame]:
    ''' load dataset from sklearn. split into training and testing sets'''
    raw_data = datasets.load_diabetes()

    # create a pandas DataFrame from sklearn dataset
    df = DataFrame(raw_data['data'], columns=raw_data['feature_names'])
    df['target'] = Series(raw_data['target'])

    # split the data into training and testing sets 
    training_data, testing_data = train_test_split(df, test_size=0.25)

    # use DBND logging features to log DataFrames with histograms
    log_dataframe("training data", training_data,with_histograms=True,
                with_schema=True, with_stats=True)
    log_dataframe("testing_data", testing_data) 

    # use DBND logging features to log the mean of s1
    log_metric("mean s1", training_data['s1'].mean())

    return training_data, testing_data

log_dataframe("training data", training_data, with_histograms=True, with_schema=True, with_stats=True) allows us to track the Pandas DataFrame object.

By setting our parameters with_histograms=True, with_schema=True, and with_stats=True, you capture important statistics and metadata describing the DataFrame. The data displayed by the histogram can be further customized.
See also Tracking Histograms for more customization options.

Integrating Metric Tracking

log_metric allows logging meaningful metrics of our data. For example, if you want to find the mean of the s1 column, you can add log_metric("mean s1", training_data['s1'].mean()). log_metric can also be used to log larger data structures, such as a numpy array or Pandas Series.

Step 2. Training a Model and Tracking Metrics

Now that the data is ready, you can start to train and test a LinearRegression model to predict patient disease progression.

# Python 3.6.8
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from pandas import DataFrame, Series
import logging 
from typing import Tuple
from dbnd import log_dataframe
from sklearn.linear_model import LinearRegression

logging.basicConfig(level=logging.INFO)

def prepare_data() -> Tuple[DataFrame, DataFrame]:
    ''' load dataset from sklearn. split into training and testing sets'''
    ...

def train_model(training_data: DataFrame) -> LinearRegression:
    ''' train a linear regression model '''
    model = LinearRegression()

    # train a linear regression model 
    model.fit(training_data.drop('target', axis=1), training_data['target'])
    return model    

def test_model(model: LinearRegression, testing_data:DataFrame) -> str:
    ''' test the model, output mean squared error and r2 score'''
    testing_x = testing_data.drop('target', axis=1)
    testing_y = testing_data['target']
    predictions = model.predict(testing_x)
    
    mse = mean_squared_error(testing_y, predictions)
    r2_score = model.score(testing_x, testing_y)

    return f"MSE: {mse}, R2: {r2_score}"

For an ML workflow like this, it's good practice to log our model's performance metrics. In addition, it is helpful to log important attributes of our model. For example, in a LinearRegression model you may want to evaluate the coefficients and intercept of the model and/or log them.

Tracking More Metrics

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from pandas import DataFrame, Series
import logging 
from typing import Tuple
from dbnd import log_dataframe, log_metric
from sklearn.linear_model import LinearRegression

logging.basicConfig(level=logging.INFO)

def prepare_data() -> Tuple[DataFrame, DataFrame]:
    ''' load dataset from sklearn. split into training and testing sets'''
    raw_data = datasets.load_diabetes()

    # create a pandas DataFrame from sklearn dataset
    df = DataFrame(raw_data['data'], columns=raw_data['feature_names'])
    df['target'] = Series(raw_data['target'])

    # split the data into training and testing sets 
    training_data, testing_data = train_test_split(df, test_size=0.25)

    # use DBND logging features to log DataFrames with histograms
    log_dataframe("training data", training_data,with_histograms=True,
                with_schema=True, with_stats=True)
    log_dataframe("testing_data", testing_data) 

    # use DBND logging features to log the mean of s1
    log_metric("mean s1", training_data['s1'].mean())

    return training_data, testing_data

def train_model(training_data: DataFrame) -> LinearRegression:
    ''' train a linear regression model '''
    model = LinearRegression()

    # train a linear regression model 
    model.fit(training_data.drop('target', axis=1), training_data['target'])

    # use DBND log crucial details about the regression model with log_metric:
    log_metric('model intercept', model.intercept_) # logging a numeric
    log_metric('coefficients', model.coef_) # logging an np array
    return model    

def test_model(model: LinearRegression, testing_data:DataFrame) -> str:
    ''' test the model, output mean squared error and r2 score'''
    testing_x = testing_data.drop('target', axis=1)
    testing_y = testing_data['target']
    predictions = model.predict(testing_x)
    mse = mean_squared_error(testing_y, predictions)
    r2_score = model.score(testing_x, testing_y)

    # use DBND log_metric to capture important model details:
    log_metric('mean squared error:', mse)
    log_metric('r2 score', r2_score)

    return f"MSE: {mse}, R2: {r2_score}"

if __name__ == '__main__':
    training_set, testing_set = prepare_data()
    model = train_model(training_set)
    metrics = test_model(model, testing_set)

DBND’s log_metric function can handle a variety of data types and data structures. In this case, you are logging a float for the model's intercept and a numpy array for the model's coefficients. After testing our model, you also log pertinent performance metrics such as the mean squared error and R2 score in the same manner.

📘

Tracking Functions

DBND allows you to track functions from different file locations. Visit Tracking Functions for more information.

Step 3. Running the Script

Before running the script, you need to enable tracking of Python scripts by exporting the DBND__TRACKING environment variable:

$export DBND__TRACKING=True

Now, you can execute the Python script normally:

$python linear_regression.py

The output will be all of the tracked data, including histograms, DataFrame statistics, and metrics.

==================== 
= Running Databand!
 TRACKERS   : ['console']
...

INFO  linear_regression.py__d98fd8fb0a - Histogram logged: training data.target
INFO  linear_regression.py__d98fd8fb0a - ###############################################################################
INFO  linear_regression.py__d98fd8fb0a - ███████                                                   5  25.0              
INFO  linear_regression.py__d98fd8fb0a - ████████████████████████████████████                     23  41.05             
INFO  linear_regression.py__d98fd8fb0a - ██████████████████████████████████████████████████       32  57.1              
INFO  linear_regression.py__d98fd8fb0a - █████████████████████████████████████                    24  73.15             
INFO  linear_regression.py__d98fd8fb0a - ███████████████████████████████████████████████████████  35  89.2              
INFO  linear_regression.py__d98fd8fb0a - ██████████████████████████                               17  105.25            
INFO  linear_regression.py__d98fd8fb0a - ██████████████████████████████████                       22  121.30000000000001
INFO  linear_regression.py__d98fd8fb0a - ████████████████████████████████████████                 26  137.35000000000002
INFO  linear_regression.py__d98fd8fb0a - ████████████████████                                     13  153.4             
INFO  linear_regression.py__d98fd8fb0a - ███████████████████████████████████████                  25  169.45000000000002
INFO  linear_regression.py__d98fd8fb0a - ██████████████████████████                               17  185.5             
INFO  linear_regression.py__d98fd8fb0a - ████████████████████                                     13  201.55            
INFO  linear_regression.py__d98fd8fb0a - ███████████████████████                                  15  217.60000000000002
INFO  linear_regression.py__d98fd8fb0a - ████████████████████████████                             18  233.65            
INFO  linear_regression.py__d98fd8fb0a - ███████████████████████                                  15  249.70000000000002
INFO  linear_regression.py__d98fd8fb0a - ██████████████████████████                               17  265.75            
INFO  linear_regression.py__d98fd8fb0a - ███████                                                   5  281.8             
INFO  linear_regression.py__d98fd8fb0a - ██████                                                    4  297.85            
INFO  linear_regression.py__d98fd8fb0a - ███                                                       2  313.90000000000003
INFO  linear_regression.py__d98fd8fb0a - ████                                                      3  329.95

...

INFO linear_regression.py__d98fd8fb0a - Metric logged: mean s1=0.00016865
INFO linear_regression.py__d98fd8fb0a - Metric logged: model intercept=154.34022510700007
INFO  linear_regression.py__d98fd8fb0a - Metric logged: coefficients=[ -10.29867519 -268.24113834  522.7893521   310.80055726 -908.43288433
  517.96198994  120.04400416  248.71644549  805.18877544   52.55633119]
INFO linear_regression.py__d98fd8fb0a -Metric logged: mean squared error:=2572.1155664746416
INFO  linear_regression.py__d98fd8fb0a - Metric logged: r2 score=0.49160156291981305
INFO linear_regression.py__d98fd8fb0a - Task linear_regression.py has been completed!
INFO linear_regression.py__d98fd8fb0a - Task dbnd_driver__d9b2d516f3 has been completed!
INFO  linear_regression.py__d98fd8fb0a -

==================== 
= Your run has been successfully executed!
 TRACKERS   : ['console']

📘

Enabling Tracking

For more details on how to enable tracking, visit the Tracking Python Scripts page.

Privacy Considerations (Sending Source Code)

If you don’t want to send or store your sensitive code in the cloud (Databand’s database) for privacy or security reasons, there is a config option available that allows you to disable (or enable) this option.

To disable sending your source code into the cloud, add track_source_code = False to your .cfg file in the [tracking] section:

[tracking]
track_source_code = False

If you would like to enable sending source code, set track_source_code=True.


Did this page help you?