Databand's SDK ("DBND") enables data engineers to orchestrate dynamic pipelines. DBND is an advanced framework that allows you to manage your data flows and run them on supported orchestrators. You can use DBND by defining pipelines with DBND decorators and executing pipelines with DBND's runtime.
Many pipeline frameworks today do not provide an easy abstraction layer that lets engineers focus on their business logic while maintaining high development agility. DBND is designed to close this gap.
DBND provides the abstraction layer that enables you to:
- Decouple pipeline code from data & compute infrastructure, making it easier to iterate
- Enable adaptive/dynamic pipelines (so you don’t write as much hardcoded logic to define different pipeline behaviors)
- Track everything that you run, including performance metrics, data flow, and data quality information
Engineers should not have to worry so much about the surrounding noise, like passing data between tasks of the pipeline or running tasks on different compute engines. That noise slows teams down by requiring more redundant code and overhead. Databand addresses this with an agile framework for pipeline development.
With DBND, you can define data tasks and pipelines ("DAGs") at the function level, which simplifies the process of building and iterating on your pipelines.
The code snippet below is a Python function that takes some input parameters and returns two DataFrames. Using the @task decorator makes the function a task in DBND. The decorator also tells DBND to manage and track the function's inputs and outputs:
    from dbnd import task
    from pandas import DataFrame
    from sklearn.model_selection import train_test_split
    from typing import Tuple

    # define a function with a decorator
    @task
    def prepare_data(raw_data: DataFrame) -> Tuple[DataFrame, DataFrame]:
        train_df, validation_df = train_test_split(raw_data)
        return train_df, validation_df
Decorators are powerful: they offer a lot of capability in a small package. They act as function wrappers for your pipelines, so you can write code the way you want to.
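To make the "function wrapper" idea concrete, here is a minimal plain-Python sketch of a decorator that records each call's inputs and output. It is an illustration only, not DBND's implementation: the names `tracked`, `CALL_LOG`, and `split_rows` are hypothetical, and a real framework would persist this information rather than keep it in a list.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical in-memory log; a real framework would persist this elsewhere.
CALL_LOG: List[Dict[str, Any]] = []


def tracked(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap func so that every call records its name, inputs, and output."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        CALL_LOG.append(
            {"task": func.__name__, "args": args, "kwargs": kwargs, "result": result}
        )
        return result

    return wrapper


@tracked
def split_rows(rows: List[int], ratio: float = 0.75) -> Tuple[List[int], List[int]]:
    """Split rows into a leading 'train' part and a trailing 'validation' part."""
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


train, validation = split_rows([1, 2, 3, 4])
```

The body of `split_rows` contains only business logic; the bookkeeping lives entirely in the wrapper, which is the separation the decorator approach is meant to provide.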
We can define a pipeline by wiring together any number of tasks:
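    from dbnd import task, pipeline
    from pandas import DataFrame


    @task
    def acquire_key(data: DataFrame) -> str:
        # Illustrative stand-in (undefined in the original snippet):
        # use the first column name as the key.
        return str(data.columns[0])


    @task
    def prepare_data(data: DataFrame, key: str) -> DataFrame:
        # Keep only the column selected by `key`.
        result = data[[key]]
        return result


    @pipeline
    def prepare_data_pipeline(raw_data: DataFrame):
        # Wire the tasks: the output of acquire_key feeds prepare_data.
        key = acquire_key(raw_data)
        treated_data = prepare_data(raw_data, key)
        return treated_data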
When you wire a pipeline, DBND automates passing outputs between tasks, making it easier to iterate on pipelines with new data files and parameters.
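As a rough illustration of what "passing outputs between tasks" involves, the sketch below runs plain functions in order and feeds each one the values it names as parameters. It does not use DBND and is not how DBND works internally; the `run_steps` helper and its match-by-parameter-name rule are assumptions made for this example.

```python
import inspect
from typing import Any, Callable, Dict, List


def run_steps(steps: List[Callable[..., Any]], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order; each step's result is stored under the step's name."""
    values = dict(inputs)
    for step in steps:
        # Look up every parameter of the step among the values produced so far.
        wanted = inspect.signature(step).parameters
        kwargs = {name: values[name] for name in wanted}
        values[step.__name__] = step(**kwargs)
    return values


def acquire_key(raw_data: Dict[str, List[int]]) -> str:
    # Hypothetical rule: pick the alphabetically first key.
    return sorted(raw_data)[0]


def prepare_data(raw_data: Dict[str, List[int]], acquire_key: str) -> List[int]:
    # Receives the output of acquire_key through the parameter of the same name.
    return raw_data[acquire_key]


results = run_steps([acquire_key, prepare_data], {"raw_data": {"b": [3], "a": [1, 2]}})
```

Neither function reads or writes intermediate data itself; the runner owns that step, so swapping the input only requires changing the `inputs` dictionary.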
With DBND, you can manage single points of connection to data infrastructure. Since DBND takes care of passing outputs between tasks, you can centrally manage data set connections and run pipelines with new data files as needed.
Data portability is especially useful for pipeline development. You can easily inject specific inputs during debugging and testing. Through a simple CLI switch, you can run the same pipeline with a data file from Amazon S3 or a data file from your local file system.
    $ dbnd run prepare_data --set prepare_data.raw_data=s3://my_bucket/myfile.json
    $ dbnd run prepare_data --set prepare_data.raw_data=datafile.csv
Using this approach, you can read and write data from sources such as the local file system, AWS S3, GCP GS, Azure Blob, and HDFS. You can also inject parameters and new data from the environment, the command line, and configuration providers or files.
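The idea behind this portability can be sketched in plain Python: task code receives a location string, and a single dispatch point picks a reader based on the URI scheme. This is an assumption-laden illustration, not DBND's mechanism. `READERS`, `load`, and `count_records` are hypothetical names, and only a local JSON reader is registered; an S3 reader would be added to the same registry.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict
from urllib.parse import urlparse


def read_local(uri: str) -> Any:
    """Read JSON from a plain path or a file:// URI."""
    path = urlparse(uri).path if uri.startswith("file://") else uri
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Single point of connection: one reader per URI scheme.
# A reader for "s3" (not implemented here) would be registered the same way.
READERS: Dict[str, Callable[[str], Any]] = {"": read_local, "file": read_local}


def load(uri: str) -> Any:
    """Pick a reader from the URI scheme and return the loaded data."""
    scheme = urlparse(uri).scheme
    try:
        reader = READERS[scheme]
    except KeyError:
        raise ValueError(f"no reader registered for scheme {scheme!r}") from None
    return reader(uri)


def count_records(raw_data_uri: str) -> int:
    """Task logic that never needs to know where the data lives."""
    return len(load(raw_data_uri))


with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "datafile.json")
    with open(local_path, "w", encoding="utf-8") as handle:
        json.dump([{"id": 1}, {"id": 2}], handle)
    local_count = count_records(local_path)
```

Because `count_records` only sees a URI, pointing it at a different storage system is a matter of changing the argument, provided a matching reader is registered.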
For information about task configuration, see Task Configuration.
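To show how `task.parameter=value` overrides of the shape used in the commands above could be applied on top of defaults, here is a small plain-Python sketch. The parsing rules and the helper names `parse_overrides` and `resolve` are assumptions for illustration; they do not describe DBND's actual configuration parser.

```python
from typing import Dict, List


def parse_overrides(pairs: List[str]) -> Dict[str, Dict[str, str]]:
    """Turn strings like 'task.param=value' into {task: {param: value}}."""
    overrides: Dict[str, Dict[str, str]] = {}
    for pair in pairs:
        target, equals, value = pair.partition("=")
        task_name, dot, param = target.partition(".")
        if not (equals and dot and task_name and param):
            raise ValueError(f"expected task.param=value, got {pair!r}")
        overrides.setdefault(task_name, {})[param] = value
    return overrides


def resolve(task_name: str, defaults: Dict[str, str],
            overrides: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Overrides for the named task win over that task's defaults."""
    return {**defaults, **overrides.get(task_name, {})}


overrides = parse_overrides(["prepare_data.raw_data=s3://my_bucket/myfile.json"])
params = resolve("prepare_data", {"raw_data": "datafile.csv", "key": "id"}, overrides)
```

Only the first `=` and the first `.` before it are treated as separators, so values such as URIs with dots and slashes pass through unchanged.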
With DBND, you can report and reuse pipeline data inputs and outputs. This is helpful for tracking information about your data sets, your pipeline statuses, and, in a machine learning project, your model performance.
Tracking your data inputs and outputs, with the ability to capture full versions of your data sets, makes it possible to reproduce pipeline runs. If a pipeline fails or you need to change the task code, you can test iterations on consistent data.
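One lightweight way to check that two runs saw the same input is to record a content fingerprint alongside each run. The sketch below does this with a SHA-256 hash in plain Python. It is a swapped-in technique for illustration: it stores a hash rather than a full copy of the data set, and `fingerprint`, `record_run`, and `run_log` are hypothetical names, not DBND APIs.

```python
import hashlib
import json
from typing import Any, Dict, List

# Hypothetical run history kept in memory for this example.
run_log: List[Dict[str, str]] = []


def fingerprint(records: List[Dict[str, Any]]) -> str:
    """Hash a canonical JSON form so that dictionary key order does not matter."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_run(task_name: str, records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Store which task ran and the fingerprint of the data it received."""
    entry = {"task": task_name, "input_sha256": fingerprint(records)}
    run_log.append(entry)
    return entry


first = record_run("prepare_data", [{"id": 1, "value": 2.5}])
second = record_run("prepare_data", [{"value": 2.5, "id": 1}])  # same data, new key order
changed = record_run("prepare_data", [{"id": 1, "value": 3.0}])  # different data
```

Matching fingerprints indicate that a rerun after a code change used the same input, while a differing fingerprint flags that the data, not only the code, has changed.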
In DBND, a task is a function annotated with a DBND decorator (@task). Through the decorator, the task can be tracked and run using the dbnd run command in the CLI or from Python code.
A pipeline is a sequence of tasks wired together. Machine learning teams usually use pipelines for running data ingestion, aggregations, and model training.
DAG is a synonym for pipeline, and the two terms are often used interchangeably. In the DBND documentation, the term DAG denotes an Airflow pipeline.
A run is an execution of a pipeline or a task.
Parameters are all the defined and changeable properties of a task or pipeline, for example, data input and model weights.
Environments define the location where pipelines are executed (for example, local machine, Docker, Kubernetes, Spark cluster) and the metadata store where artifacts are persisted (i.e., input source, system folders, temporary folders, output destination).
Supported storage systems include:
- Hadoop Distributed File System (HDFS)
- Amazon Simple Storage Service (S3)
- Google Storage (GS)
Supported Spark engines include:
- Local Spark
- AWS EMR
- GCP DataProc
- Any Spark server via Apache Livy
DBND is not another orchestration library. There are many existing solutions that are improving quite quickly. DBND solves problems for data engineering teams that need to go beyond creating scheduled DAGs - those teams who are looking to adopt agile development methodologies and build a full lifecycle around their projects.
In fact, we encourage you to use DBND along with a production orchestration system like Airflow.