
Task Inputs

On task inputs in Databand.

Every task parameter in DBND can represent an input or an output. When a DBND task starts, DBND automatically loads the data from the location defined as the task input into the proper in-memory format. When the task ends, DBND persists the task outputs to the desired file system.

We'll use this prepare_data task to describe how inputs and outputs work.

from dbnd import task
from pandas import DataFrame

@task
def prepare_data(data: DataFrame, key: str) -> str:
    # Look up the requested column (key) in the loaded DataFrame
    result = data.get(key)
    return result

Loading Data into Tasks

If you need to quickly test pipelines on different sets of data, you can easily load new data into a task.

When prepare_data runs, DBND loads the specified file into the data DataFrame. By default, the file format is resolved from the file extension, so .csv and .parquet files both work.

Running prepare_data from the command line:

dbnd run prepare_data --set data=dbnd-examples/data/wine_quality.csv.gz
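The command above passes a .csv.gz file, so resolving the format by extension has to look past the compression suffix. The following is a minimal sketch of that idea in plain Python, not DBND's actual implementation; resolve_format and both lookup tables are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables for illustration; DBND's real registry may differ.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".zip"}
FORMAT_BY_SUFFIX = {".csv": "csv", ".parquet": "parquet", ".json": "json"}


def resolve_format(path: str) -> str:
    """Guess the file format from the extension, ignoring a compression suffix."""
    suffixes = [s.lower() for s in PurePosixPath(path).suffixes]
    # Drop a trailing compression suffix such as ".gz" before the lookup.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in FORMAT_BY_SUFFIX:
        raise ValueError(f"cannot resolve a format for {path!r}")
    return FORMAT_BY_SUFFIX[suffixes[-1]]


print(resolve_format("dbnd-examples/data/wine_quality.csv.gz"))
print(resolve_format("data.parquet"))
```

A path with an unrecognized extension raises ValueError in this sketch, which is the situation where you would state the type explicitly instead of relying on the extension.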

Running prepare_data as part of a pipeline:


Multiple sources can be loaded as a single DataFrame. See the fetch_data task in the 'Predict Wine Quality' example for how partitioned data is dynamically loaded into a task input as one DataFrame.

If you want to work with a file directly, without DBND autoloading it for you, use DBND's Target type or Python's Path type to represent file-system-independent paths:

from dbnd import task
from targets.types import Path

@task
def read_data(path: Path) -> int:
    # Open the file ourselves and close it when done; DBND only passes the path
    with open(path, "r") as f:
        num_of_lines = len(f.readlines())
    return num_of_lines
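To try the same logic outside DBND, the sketch below substitutes the standard library's pathlib.Path for targets.types.Path and uses a temporary file as input. count_lines is a hypothetical stand-in for read_data that runs as a plain function; the sample file name and contents are made up for the example.

```python
import tempfile
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count the lines of a text file without reading it all into memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)


# Write a small three-line sample file and count its lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_text("a\nb\nc\n")
    line_count = count_lines(sample)

print(line_count)
```

Iterating over the file object avoids holding every line in memory, which matters once the input is larger than a toy file.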

When a task input is a string, you must explicitly indicate on the command line whether a value is a path to load or a literal string:

from dbnd import task

@task
def prepare_data(data: str) -> str:
    return data

# value is a path
dbnd run prepare_data --set [email protected]/path_to_file_to_load

# value is a plain string
dbnd run prepare_data --set data=my_string_value

The @task decorator can also be used to configure how data is loaded. For example, if your input file is tab-delimited, you can configure it as follows:

from dbnd import task, parameter
from pandas import DataFrame
from targets.target_config import FileFormat

@task(data=parameter[DataFrame].csv.load_options(FileFormat.csv, sep="\t"))
def prepare_data(data: DataFrame) -> DataFrame:
    data["new_column"] = 5
    return data

This way, you can pass parameters through to the pandas read_csv method.
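To see what the sep option does on the pandas side, here is a small standalone sketch that calls read_csv directly. It assumes pandas is installed; the in-memory sample content and its column names are hypothetical and stand in for a real tab-delimited input file.

```python
import io

import pandas as pd

# Hypothetical tab-delimited content standing in for an input file.
raw = "name\tscore\nalice\t1\nbob\t2\n"

# sep="\t" is the same keyword argument given to load_options in the task above.
data = pd.read_csv(io.StringIO(raw), sep="\t")

# Mirror the body of prepare_data: add a constant column.
data["new_column"] = 5

print(list(data.columns))
```

Without sep="\t", read_csv would split on commas and treat each row of this sample as a single column.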

To specify that your input is of a specific type, regardless of its file extension, use:

dbnd run prepare_data --set --prepare-data-data--target csv