In DBND, a task is a modular unit of computation, usually a data transformation with an input and an output.
To create a task, you need to decorate your function with the
@task decorator. Under the hood, the
task object is created, and the function parameters are mapped to the task parameters, and the function returns value to task outputs.
from dbnd import task from pandas import DataFrame @task def prepare_data(data: DataFrame) -> DataFrame: return data
A pipeline is a collection of tasks wired together to run in a specific order. Pipelines define tasks and data flow. Pipelines encapsulate an execution plan that can run as a typical python function.
from dbnd import task, pipeline from pandas import DataFrame @task def calculate_alpha(data: DataFrame) -> object: pass @pipeline def prepare_data_pipeline(data: DataFrame): prepared_data = prepare_data(data) alpha = calculate_alpha(prepared_data) return alpha
The DBND runtime consists of two stages:
- Dependency resolution
- Task execution.
Pipeline.band (or pipeline function) will run at the first stage - dependency resolution at the very beginning of your run.
Task.run will be called in the second phase.
When DBND "reads" pipelines - during the dependency resolution time and before the execution time, it substitutes the future output for
Targets signal that DBND should resolve this parameter during execution.
DBND task controls how inputs and outputs are loaded and saved.
from dbnd import task from pandas import DataFrame @task def prepare_data(data: DataFrame) -> DataFrame: data["new_column"] = 5 return data
By default, DBND detects input and output data types based on Python typing. This enables a lot of out of the box functionality, such as loading dataframes from different file formats and automatically serializing the outputs.
Let's consider an example that demonstrates how tasks or pipelines can be run on "development" and "production" data.
$ dbnd run example.prepare_data --set data=myfile.csv $ dbnd run example.prepare_data --set data=s3://my_backet/myfile.json
The first command here will:
- Load data from a local CSV file
- Inject it into the
- Run the
The DataFrame returned by this task will be serialized into a CSV file according to the system's default behavior.
The second command will:
- Run the same task, but this time loading data from a different JSON file located in AWS S3.
How to change parameter value?
Updated 8 months ago