In this tutorial, you will learn how to perform the following tasks:
- Create a pipeline for use with your data.
- Run a pipeline and learn how to view the results.
- Modify your pipeline so that future runs are easy to reconfigure.
After you have successfully installed DBND, you can create a new task, run it, and change its inputs.
Navigate to the root directory of your DBND project (the directory that contains the project.cfg file) via the command line.
In the root directory, create a new file called example.py and open it using your text editor/IDE. Let's add a simple Python function with a DBND @task decorator:
from dbnd import task


@task
def calculate_alpha(alpha: float = 0.5) -> float:
    alpha += 0.1
    return alpha
- Save your file.
- Now that you have created a task, run it. To run the task via the CLI, use:
dbnd run example.calculate_alpha
- A snippet similar to the following appears near the bottom of the command-line output:
PARAMS: :
  Name    Kind    Type   Format   Source   -= Value =-
  alpha   param   float           default  0.5
  result  output  float  .pickle           /Users/name/databand/data/dev/2021-08-02/calculate_alpha/calculate_alpha_d869b45637/result.txt :='0.6'
[Text omitted for brevity]
====================
= Your run has been successfully executed!
DBND allows you to change your task inputs from the CLI. Changing these values affects the output of your tasks.
- In your task, the calculate_alpha function has a default value for the alpha parameter. You can override it using the CLI:
dbnd run example.calculate_alpha --set-root alpha=0.4
- The results displayed should include the following snippet:
PARAMS: :
  Name    Kind    Type   Format   Source   -= Value =-
  alpha   param   float           ctor     0.4
  result  output  float  .pickle           /Users/name/databand/data/dev/2021-08-02/calculate_alpha/calculate_alpha_d869b45637/result.txt :='0.5'
= ====================
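The values reported in the two runs can be sanity-checked in plain Python. The sketch below deliberately leaves out the @task decorator so that it runs without DBND installed; it only mirrors the function body from example.py and is not a substitute for dbnd run.

```python
import math


def calculate_alpha(alpha: float = 0.5) -> float:
    # Same body as the task in example.py, without the @task decorator.
    alpha += 0.1
    return alpha


# Default input, as in the first run.
default_result = calculate_alpha()
# Overridden input, as in the run that sets alpha=0.4.
override_result = calculate_alpha(alpha=0.4)

# Compare with a tolerance because these are floating-point values.
print(math.isclose(default_result, 0.6), math.isclose(override_result, 0.5))
```

Both comparisons should hold, matching the 0.6 and 0.5 values in the result rows of the two snippets.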
In this section, you will create a pipeline using some example code.
A pipeline is a series of tasks wired together to run in some order.
In the following sections, you will work with a pipeline that produces an ML model that predicts wine quality.
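To make the idea of tasks wired together concrete, here is a minimal plain-Python sketch with no DBND involved. Each function stands in for one task, and the pipeline function passes the output of one step to the next. All names and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory rows standing in for a real input file.
SAMPLE_ROWS: List[Dict[str, float]] = [
    {"acidity": 7.4, "quality": 5},
    {"acidity": 7.8, "quality": 6},
    {"acidity": 6.9, "quality": 7},
    {"acidity": 7.1, "quality": 6},
]


def load_rows() -> List[Dict[str, float]]:
    # Step 1: input data.
    return list(SAMPLE_ROWS)


def split_rows(
    rows: List[Dict[str, float]], train_fraction: float = 0.75
) -> Tuple[List[Dict[str, float]], List[Dict[str, float]]]:
    # Step 2: split into training and validation parts, preserving order.
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mean_quality(rows: List[Dict[str, float]]) -> float:
    # Step 3: a trivial "model" that just averages the quality column.
    return sum(row["quality"] for row in rows) / len(rows)


def toy_pipeline(train_fraction: float = 0.75) -> Tuple[float, int]:
    # The pipeline only wires the steps together; it does no work itself.
    rows = load_rows()
    training_rows, validation_rows = split_rows(rows, train_fraction)
    baseline = mean_quality(training_rows)
    return baseline, len(validation_rows)


baseline, validation_count = toy_pipeline()
print(baseline, validation_count)
```

The order of execution follows from the data dependencies: the split needs the loaded rows, and the average needs the training part of the split.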
The following is the initial workflow code before any modifications with DBND. The process includes the following tasks:
Task 1: Input Data - loading a data file containing the wines and their attributes
Task 2: Data Preparation - splitting the data set into distinct training and validation sets
Task 3: Model Training - creating an ElasticNet model based on the training data set
Task 4: Model Validation - testing the model with test data to create performance metrics
In addition to these four tasks, the workflow includes two input parameters that have been hardcoded.
import logging

import numpy as np
import pandas
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

from dbnd_examples.data import data_repo

logging.basicConfig(level=logging.INFO)


def training_script():
    # load data
    raw_data = pandas.read_csv(data_repo.wines)

    # split data into training and validation sets
    train_df, validation_df = train_test_split(raw_data)

    # create hyperparameters and model
    alpha = 0.5
    l1_ratio = 0.2
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    # axis is passed by keyword; recent pandas versions reject a positional axis
    lr.fit(train_df.drop(["quality"], axis=1), train_df[["quality"]])

    # validation
    validation_x = validation_df.drop(["quality"], axis=1)
    validation_y = validation_df[["quality"]]
    prediction = lr.predict(validation_x)
    rmse = np.sqrt(mean_squared_error(validation_y, prediction))
    mae = mean_absolute_error(validation_y, prediction)
    r2 = r2_score(validation_y, prediction)
    logging.info("%s,%s,%s", rmse, mae, r2)
    return lr


training_script()
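The validation step above reports three metrics: RMSE, MAE, and R2. As a rough guide to what those numbers mean, the following standard-library sketch computes them by hand for a few made-up quality scores. The function name and the sample values are hypothetical, and the sketch assumes the actual values are not all identical (otherwise R2 is undefined).

```python
import math
from typing import Sequence, Tuple


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Root mean squared error: penalizes large errors more heavily.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # Mean absolute error: average size of the error.
    mae = sum(abs(e) for e in errors) / n

    # R2: share of the variance in the actual values explained by the model.
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return rmse, mae, r2


# Made-up quality scores and predictions, for illustration only.
rmse, mae, r2 = regression_metrics([5.0, 6.0, 7.0], [5.5, 6.0, 6.0])
print(rmse, mae, r2)
```

Lower RMSE and MAE indicate a better fit, while R2 closer to 1 indicates that the model explains more of the variation in quality.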
Using DBND in the workflow involves the following steps:
Step 1. Functionalize the parts of the code that you want to modularize into tasks
Step 2. Assign @task decorators to define each function as a task
Step 3. Use the @pipeline decorator to define the structure of the pipeline
# python3.6
import logging
from typing import Tuple

import numpy as np
from pandas import DataFrame
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

from dbnd import log_dataframe, log_metric, pipeline, task

logging.basicConfig(level=logging.INFO)


@task(result="training_set, validation_set")
def prepare_data(raw_data: DataFrame) -> Tuple[DataFrame, DataFrame]:
    train_df, validation_df = train_test_split(raw_data)
    return train_df, validation_df


@task
def train_model(
    training_set: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
) -> ElasticNet:
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    # axis is passed by keyword; recent pandas versions reject a positional axis
    lr.fit(training_set.drop(["quality"], axis=1), training_set[["quality"]])
    return lr


@task
def validate_model(model: ElasticNet, validation_dataset: DataFrame) -> str:
    log_dataframe("validation", validation_dataset)

    validation_x = validation_dataset.drop(["quality"], axis=1)
    validation_y = validation_dataset[["quality"]]

    prediction = model.predict(validation_x)
    rmse = np.sqrt(mean_squared_error(validation_y, prediction))
    mae = mean_absolute_error(validation_y, prediction)
    r2 = r2_score(validation_y, prediction)

    log_metric("rmse", rmse)
    log_metric("mae", mae)
    log_metric("r2", r2)

    return "%s,%s,%s" % (rmse, mae, r2)


@pipeline(result=("model", "validation"))
def predict_wine_quality(
    raw_data: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
):
    training_set, validation_set = prepare_data(raw_data=raw_data)

    model = train_model(
        training_set=training_set, alpha=alpha, l1_ratio=l1_ratio
    )
    validation = validate_model(model=model, validation_dataset=validation_set)

    return model, validation
Pay attention to the following changes in the modified code above:
- Each task has a @task decorator with input and output parameters.
- The pipeline has a @pipeline decorator with defined input and output parameters.
- The result keyword in a decorator defines the names of the task's outputs.
- The input data (in this case, a CSV file) is no longer a part of the code. Instead, it is passed in as a parameter that you can change for each run.
- All variable parameters have been defined as pipeline attributes.
- Logging and metrics APIs have been added to the validate_model task to track performance.
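To illustrate what naming a task's outputs means, the sketch below uses a small plain-Python stand-in. The named_outputs decorator is hypothetical and is not DBND's implementation; it only shows the general idea of attaching names to the elements of a returned tuple so that later steps can refer to them by name.

```python
from functools import wraps
from typing import Callable, Dict, Tuple


def named_outputs(names: str) -> Callable:
    """Hypothetical stand-in: label each element of a returned tuple."""
    keys = [name.strip() for name in names.split(",")]

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs) -> Dict[str, object]:
            values = func(*args, **kwargs)
            if len(values) != len(keys):
                raise ValueError(
                    f"expected {len(keys)} outputs, got {len(values)}"
                )
            # Pair every output with its declared name.
            return dict(zip(keys, values))

        return wrapper

    return decorator


@named_outputs("training_set, validation_set")
def split_in_half(rows: list) -> Tuple[list, list]:
    middle = len(rows) // 2
    return rows[:middle], rows[middle:]


outputs = split_in_half([1, 2, 3, 4])
print(outputs)
```

In the DBND code on this page, the same comma-separated naming style appears in @task(result="training_set, validation_set"), while the pipeline uses a tuple of names.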
Using the text editor/IDE of your choice, begin by creating a new Python module. You can give it any name you like; the commands on this page assume that it is called predict_wine_quality.py.
Copy and save the DBND code provided on this page into your module.
Create a new file called wine.csv and copy in the sample data we've provided.
Once you've prepared your data file, you can run your pipeline with the CLI:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine.csv
Note that this command assumes that the module and the data file are saved in your DBND project directory; otherwise, you will need to provide full file names.
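Because the pipeline code refers to a quality column, it can help to confirm that the data file has that column before you start a run. The following is a minimal standard-library sketch. It assumes a comma-delimited file with a header row, which may not match your actual data file. The helper name, the acidity and ph columns, and the demo values are all hypothetical.

```python
import csv
import os
import tempfile
from typing import List


def missing_columns(csv_path: str, required: List[str]) -> List[str]:
    """Return the required column names that are absent from the header row."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


# Demonstration on a throwaway file with made-up columns and values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "wine.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["acidity", "quality"])
        writer.writerow([7.4, 5])

    missing = missing_columns(demo_path, ["quality"])
    missing_other = missing_columns(demo_path, ["quality", "ph"])

print(missing, missing_other)
```

An empty list means every required column was found in the header.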
You can run a pipeline with different input data and/or parameters by specifying a different data file or parameter as part of the dbnd run command.
To re-run your pipeline with a different data file, duplicate your original wine.csv file and call it wine2.csv.
In the command line, run:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine2.csv
DBND will run the pipeline using the new data file.
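If you prefer to script the duplication step rather than copy the file by hand, the sketch below shows one way to do it with the standard library. The helper name is hypothetical, and the demonstration writes made-up contents to a temporary directory so that it does not touch your project files.

```python
import os
import shutil
import tempfile


def duplicate_data_file(source: str, target: str) -> str:
    """Copy source to target, refusing to overwrite an existing file."""
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists")
    return shutil.copyfile(source, target)


# Demonstration in a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    original = os.path.join(tmp_dir, "wine.csv")
    with open(original, "w") as handle:
        handle.write("acidity,quality\n7.4,5\n")

    copy_path = duplicate_data_file(original, os.path.join(tmp_dir, "wine2.csv"))
    with open(copy_path) as handle:
        copied_text = handle.read()

print(os.path.basename(copy_path))
```

In a real project you would call duplicate_data_file("wine.csv", "wine2.csv") from the project directory and then edit the copy as needed.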
You can change the parameter values for a run by specifying new values as part of the dbnd run command. For example, if you wanted to set the values for alpha and l1_ratio to 0.7, you would run the following:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine.csv --set l1_ratio=0.7 --set alpha=0.7
DBND will run the pipeline using the values provided as the new input parameters.
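When you want to try several parameter combinations, typing each command by hand becomes tedious. The sketch below only builds the command lines and prints them; it does not execute anything. It uses just the --set flag shown on this page. The helper name and the parameter values in the grid are hypothetical.

```python
import itertools
import shlex
from typing import Dict, List

PIPELINE = "predict_wine_quality.predict_wine_quality"


def build_run_command(params: Dict[str, object]) -> List[str]:
    """Build a dbnd run argument list with one --set flag per parameter."""
    command = ["dbnd", "run", PIPELINE]
    for name, value in params.items():
        command += ["--set", f"{name}={value}"]
    return command


# Made-up parameter grid, for illustration only.
alphas = [0.3, 0.7]
l1_ratios = [0.2, 0.7]

commands = [
    build_run_command({"raw_data": "wine.csv", "l1_ratio": l1_ratio, "alpha": alpha})
    for alpha, l1_ratio in itertools.product(alphas, l1_ratios)
]

for command in commands:
    # shlex.join quotes arguments where needed so the line can be pasted into a shell.
    print(shlex.join(command))
```

Each printed line can be pasted into a terminal. To run the commands from Python instead, each argument list could be passed to subprocess.run(command, check=True).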