Orchestrating Pipelines

Orchestrating pipelines with the help of Databand's SDK.

Databand's SDK ("DBND") enables data engineers to orchestrate dynamic pipelines. DBND is an advanced framework that allows you to manage your data flows and run them on supported orchestrators. You can use DBND by defining pipelines with DBND decorators and executing pipelines with DBND's runtime.

Pipeline frameworks today do not provide an easy abstraction layer that lets engineers focus cleanly on their business logic while maintaining high development agility. DBND solves this problem.

DBND provides the abstraction layer that enables you to:

  • Decouple pipeline code from data & compute infrastructure, making it easier to iterate
  • Enable adaptive/dynamic pipelines (so you don’t write as much hardcoded logic to define different pipeline behaviors)
  • Track everything that you run, including performance metrics, data flow, and data quality information

Engineers should not have to worry about the surrounding noise, like passing data between pipeline tasks or running tasks on different compute engines. That overhead slows down team agility by requiring redundant code. Databand solves this with an agile framework for pipeline development.

Data tasks and pipelines

With DBND, you can define data tasks and pipelines ("DAGs") at the function level, which simplifies the process of building and iterating on your pipelines.

The code snippet below is a Python function that takes some input parameters and returns two DataFrames. Using the @task decorator makes the function a task in DBND. The decorator also tells DBND to manage and track the function's inputs and outputs:

from typing import Tuple

from pandas import DataFrame
from sklearn.model_selection import train_test_split

from dbnd import task

# define a function with a decorator
@task
def prepare_data(raw_data: DataFrame) -> Tuple[DataFrame, DataFrame]:
    train_df, validation_df = train_test_split(raw_data)

    return train_df, validation_df
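
Because the decorator wraps a plain Python function, the task remains directly callable, which is handy during development. A minimal sketch, assuming a local CSV file (wine.csv is a placeholder name):

import pandas as pd

# call the decorated function like any other Python function
raw = pd.read_csv("wine.csv")
train_df, validation_df = prepare_data(raw_data=raw)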

Decorators

Decorators are powerful. They offer a lot of capability in a small package, acting as function wrappers for your pipelines while letting you write code the way you want to.

We can define a pipeline by wiring together any number of tasks:

from pandas import DataFrame
from sklearn.linear_model import ElasticNet

from dbnd import pipeline, task

@task
def train_model(
    training_set: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
) -> ElasticNet:
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    # drop the label column from the features before fitting
    lr.fit(training_set.drop(["quality"], axis=1), training_set[["quality"]])
    return lr
 
@pipeline
def predict_wine_quality(
    raw_data: DataFrame,
    alpha: float = 0.5,
    l1_ratio: float = 0.5,
):
    training_set, validation_set = prepare_data(raw_data=raw_data)

    model = train_model(
        training_set=training_set, alpha=alpha, l1_ratio=l1_ratio
    )
     
    return model, validation_set

When you wire pipelines this way, DBND automates passing outputs between tasks, making it easier to iterate on pipelines with new data files and parameters.
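
For example, assuming the pipeline above lives in a module that DBND can discover, you could run it and override its parameters from the command line (the bucket and file names are placeholders):

$ dbnd run predict_wine_quality --set predict_wine_quality.raw_data=s3://my_bucket/wine.csv --set predict_wine_quality.alpha=0.7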

Data portability

With DBND, you can manage single points of connection to data infrastructure. Since DBND takes care of passing outputs between tasks, you can centrally manage data set connections and run pipelines with new data files as needed.

Data portability is especially useful for pipeline development. You can easily inject specific inputs during debugging and testing. Through a simple CLI switch, you can run the same pipeline with a data file from Amazon S3 or a data file from your local file system.

$ dbnd run prepare_data --set prepare_data.raw_data=s3://my_bucket/myfile.json
$ dbnd run prepare_data --set prepare_data.raw_data=datafile.csv

Using this approach, you can read and write data from any supported data source (local file system, AWS S3, GCP GS, Azure Blob Storage, HDFS, etc.) and inject parameters and new data from the environment, the command line, and configuration providers/files.
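
As a sketch of the configuration-file option, DBND reads INI-style configuration files (for example, a project.cfg) with a section per task; the bucket and file names below are placeholders:

[prepare_data]
raw_data = s3://my_bucket/myfile.json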

For information about task configuration, see Task Configuration.

Data versioning

With DBND, you can report and reuse pipeline data inputs and outputs. This is helpful for tracking information about your data sets, pipeline statuses, and, in machine learning use cases, your model performance.

Tracking your data inputs and outputs, with the ability to capture full versions of your data sets, makes it possible to reproduce pipeline runs. If a pipeline fails or you need to change the task code, you can test iterations on consistent data.
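
For example, a follow-on validation task can report metrics and data so each run's results are tracked alongside its inputs. A minimal sketch using DBND's log_metric and log_dataframe helpers (validate_model is a hypothetical task name):

from pandas import DataFrame
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

from dbnd import log_dataframe, log_metric, task

@task
def validate_model(model: ElasticNet, validation_set: DataFrame) -> float:
    # score the model and report the data set and metric to DBND's tracker
    predictions = model.predict(validation_set.drop(["quality"], axis=1))
    rmse = mean_squared_error(validation_set[["quality"]], predictions) ** 0.5
    log_dataframe("validation_set", validation_set)
    log_metric("rmse", rmse)
    return rmse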

To learn more about data versioning, see Inputs and Outputs.

Compute Engine

DBND provides a way to dynamically switch between local and remote Spark execution and between local and Kubernetes execution. You can run your driver locally or on Kubernetes.
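
As an illustration, switching execution targets can be a CLI flag away; the environment names below are assumptions about what is defined in your DBND configuration:

$ dbnd run predict_wine_quality --env local
$ dbnd run predict_wine_quality --env gcp_k8s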

Not Another Orchestration Framework

DBND is not another orchestration library. There are many existing solutions, and they are improving quickly. DBND solves problems for data engineering teams that need to go beyond creating scheduled DAGs - teams looking to adopt agile development methodologies and build a full lifecycle around their projects.

In fact, we encourage you to use DBND along with a production orchestration system like Airflow.

