Orchestration Quickstart Tutorial
After installing Databand, try our quickstart example to learn how to build pipelines.
In this tutorial, you will learn how to perform the following tasks:
- Create a pipeline for use with your data.
- Run a pipeline and view the results.
- Modify your pipeline so that you can easily change future runs.
Creating a Task
After you have successfully installed DBND, you can create a new task, run it, and change its inputs.
Step 1: Create a Task
1. Navigate to the root directory of your DBND project (the root directory is the one containing the project.cfg file) via the command line.
2. In the root directory, create a new file called example.py and open it using your text editor/IDE. Let's add a simple Python function with a DBND @task decorator:
from dbnd import task

@task
def calculate_alpha(alpha: float = 0.5) -> float:
    alpha += 0.1
    return alpha
3. Save your file.
Step 2: Run a Task
- Now that you have created a task, let's run it. To run the task via the CLI, use:
dbnd run example.calculate_alpha
- Near the bottom of the command-line output, you will see a snippet similar to the following:
PARAMS:
Name     Kind     Type    Format    Source     -= Value =-
alpha    param    float             default    0.5
result   output   float   .pickle              /Users/name/databand/data/dev/2021-08-02/calculate_alpha/calculate_alpha_d869b45637/result.txt ='0.6'
[Text omitted for brevity]
====================
= Your run has been successfully executed!
Step 3: Iterate Task Inputs
DBND allows you to change your task inputs from the CLI. Changing these values affects the output of your tasks.
- In your task, the calculate_alpha function has a default value for the alpha parameter. You can override it using the CLI:
dbnd run example.calculate_alpha --set-root alpha=0.4
- The results displayed should include the following snippet:
PARAMS:
Name     Kind     Type    Format    Source    -= Value =-
alpha    param    float             ctor      0.4
result   output   float   .pickle             /Users/name/databand/data/dev/2021-08-02/calculate_alpha/calculate_alpha_d869b45637/result.txt ='0.5'
====================
= Your run has been successfully executed!
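If you prefer to iterate from Python rather than the CLI, note that (to our understanding, and worth verifying on your DBND version) functions decorated with @task remain callable as plain Python functions. A minimal sketch:

from example import calculate_alpha

# Plain Python calls, no dbnd run involved (assumes the decorated
# function stays callable as a regular function).
print(calculate_alpha())           # 0.6 with the default alpha of 0.5
print(calculate_alpha(alpha=0.4))  # 0.5, matching the CLI override above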
Creating a Pipeline
In this section, you will create a pipeline using some example code.
Pipelines 101
A pipeline is a series of tasks wired together to run in some order.
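For intuition, here is a minimal sketch of that idea using the same @task and @pipeline decorators covered below. The function names are invented for this example, and we assume @pipeline can be used without arguments, just as @task is used above:

from dbnd import task, pipeline

@task
def add_one(x: int) -> int:
    return x + 1

@task
def double(x: int) -> int:
    return x * 2

@pipeline
def add_then_double(x: int = 1) -> int:
    # double() runs after add_one() because it consumes add_one()'s output
    return double(x=add_one(x=x))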
In the following sections, you will work with a pipeline that produces an ML model for predicting wine quality.
Step 1: Look at the Initial Code Workflow
The following is the initial workflow code before any modifications with DBND. The process includes the following tasks:
Task 1: Input Data - loading a data file containing the wines and their attributes
Task 2: Data Preparation - splitting the data set into distinct training and validation sets
Task 3: Model Training - creating an ElasticNet model based on the training data set
Task 4: Model Validation - testing the model with test data to create performance metrics
In addition to these four tasks, the workflow includes two input parameters that have been hardcoded.
import pandas
import numpy as np
import logging
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from dbnd_examples.data import data_repo

logging.basicConfig(level=logging.INFO)

def training_script():
    # load data
    raw_data = pandas.read_csv(data_repo.wines)

    # split data into training and validation sets
    train_df, validation_df = train_test_split(raw_data)

    # create hyperparameters and model
    alpha = 0.5
    l1_ratio = 0.2
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    lr.fit(train_df.drop(["quality"], axis=1), train_df[["quality"]])

    # validation
    validation_x = validation_df.drop(["quality"], axis=1)
    validation_y = validation_df[["quality"]]
    prediction = lr.predict(validation_x)
    rmse = np.sqrt(mean_squared_error(validation_y, prediction))
    mae = mean_absolute_error(validation_y, prediction)
    r2 = r2_score(validation_y, prediction)
    logging.info("%s,%s,%s", rmse, mae, r2)
    return lr

training_script()
Step 2: Adapt the Workflow with DBND
Using DBND in the workflow involves the following steps:
Step 1. Functionalize the parts of the code that you want to modularize into tasks.
Step 2. Assign @task decorators to define each function as a task.
Step 3. Use the @pipeline decorator to define the structure of the pipeline.
# python3.6
import logging
from typing import Tuple

import numpy as np
from pandas import DataFrame
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

from dbnd import task, pipeline, log_dataframe, log_metric

logging.basicConfig(level=logging.INFO)

@task(result="training_set, validation_set")
def prepare_data(raw_data: DataFrame) -> Tuple[DataFrame, DataFrame]:
    train_df, validation_df = train_test_split(raw_data)
    return train_df, validation_df

@task
def train_model(
    training_set: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
) -> ElasticNet:
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    lr.fit(training_set.drop(["quality"], axis=1), training_set[["quality"]])
    return lr

@task
def validate_model(model: ElasticNet, validation_dataset: DataFrame) -> str:
    log_dataframe("validation", validation_dataset)
    validation_x = validation_dataset.drop(["quality"], axis=1)
    validation_y = validation_dataset[["quality"]]
    prediction = model.predict(validation_x)
    rmse = np.sqrt(mean_squared_error(validation_y, prediction))
    mae = mean_absolute_error(validation_y, prediction)
    r2 = r2_score(validation_y, prediction)
    log_metric("rmse", rmse)
    log_metric("mae", mae)
    log_metric("r2", r2)
    return "%s,%s,%s" % (rmse, mae, r2)

@pipeline(result=("model", "validation"))
def predict_wine_quality(
    raw_data: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
):
    training_set, validation_set = prepare_data(raw_data=raw_data)
    model = train_model(training_set=training_set, alpha=alpha, l1_ratio=l1_ratio)
    validation = validate_model(model=model, validation_dataset=validation_set)
    return model, validation
Pay attention to the following changes in the modified code above:
- Each task has a @task decorator with input and output parameters.
- The pipeline has a @pipeline decorator with defined input and output parameters.
- The result keyword in the decorator defines names for the outputs of the task.
- The input data (in this case, a CSV file) is no longer part of the code. Instead, it is passed in and is adjustable as a parameter.
- All variable parameters have been defined as pipeline attributes.
- Logging and metrics APIs have been added to the validate_model task to track performance.
Step 3: Create the Wine Quality Pipeline
1. Using the text editor/IDE of your choice, create a new Python module. You can give it any name you like (e.g., predict_wine_quality).
2. Copy the DBND code provided on this page into your module and save it.
Running a Pipeline with Data
1. Create a new file called wine.csv and copy in the sample data we've provided.
2. Once you've prepared your data file, run your pipeline from the CLI:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine.csv
Note that this command assumes the module and data file are saved in your DBND project directory; otherwise, you will need to provide the full paths.
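If you don't have the sample data handy, or you want to smoke-test the pipeline without the CLI, the following sketch builds a small stand-in DataFrame and calls the pipeline function directly (again assuming, as above, that decorated functions stay callable as plain Python). Only the quality column is required by the tasks; the other column names here are invented:

import numpy as np
import pandas as pd
from predict_wine_quality import predict_wine_quality  # the module you created above

# Hypothetical stand-in data: numeric features plus the "quality" target
# column that prepare_data/train_model/validate_model expect.
rng = np.random.default_rng(42)
stand_in = pd.DataFrame(
    {
        "alcohol": rng.uniform(8, 15, size=100),   # invented feature
        "acidity": rng.uniform(4, 16, size=100),   # invented feature
        "quality": rng.integers(3, 9, size=100),   # required target column
    }
)

model, validation = predict_wine_quality(raw_data=stand_in)
print(validation)  # the "rmse,mae,r2" string returned by validate_model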
Running a Pipeline with Varying Inputs
You can run a pipeline with different input data and/or parameters by specifying a different data file or parameter values as part of the dbnd run command.
Re-Run Your Pipeline with a Different Data File
To re-run your pipeline with a different data file, duplicate your original wine.csv file and name it wine2.csv.
In the command line, run:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine2.csv
DBND will run the pipeline using the new data file.
Re-Run Your Pipeline with Different Parameter Values
You can change the parameter values for a run by specifying new values as part of the dbnd run command. For example, if you wanted to set the values for alpha and l1_ratio to 0.7, you would run the following:
dbnd run predict_wine_quality.predict_wine_quality --set raw_data=wine.csv --set l1_ratio=0.7 --set alpha=0.7
DBND will run the pipeline using the values provided as the new input parameters.
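If you find yourself overriding the same values on every run, DBND can also pick up parameter values from configuration files such as project.cfg. As an assumption worth verifying against your DBND version, the sketch below uses a section named after the task to set its defaults:

[predict_wine_quality]
alpha = 0.7
l1_ratio = 0.7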