Runtime Environment Configuration
Post-installation configuration steps that you need to perform before you can start using DBND.
Environments Overview
DBND provides out-of-the-box environments that you need to configure before you can start running your pipelines.
DBND supports the following environment types:
- Persistency - Local file system, AWS S3, Google Storage, Azure Blob Store, and HDFS
- Spark - Local Spark, Amazon EMR, Google DataProc, Databricks, Qubole, and Livy
- Docker Engine - Local Docker, AWS Batch, Kubernetes.
You can also create custom environments and engines.
Main Configuration
The environments parameter in the [core] section specifies the list of environments enabled and available for the project. Possible values include local, gcp, aws, and azure; local is enabled by default.
[core]
environments = ['local', 'gcp']
The following sections describe the environment types supported by DBND:
Local
In the default local environment setup, the configuration works as follows:
- The persistent metadata store for task inputs/outputs, metrics, and runtime execution information is the local file system under $DBND_HOME/data.
- Python tasks run as processes on the local machine.
- Spark tasks run locally through the spark_submit command, provided you have local Spark in place.
- Docker tasks run in containers on your local Docker engine.
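For example, if you want the local environment to keep its data somewhere other than $DBND_HOME/data, you can override its root location. This is a minimal sketch, assuming a hypothetical /data/dbnd directory; the root parameter is described in the reference below.
[local]
# Assumed directory for this example - replace with your own data location
root = /data/dbnd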
[local] Configuration Section Parameter Reference
- env_label - Set the environment type to be used, e.g. dev, int, prod.
- production - Indicates that the environment is production.
- conn_id - Set the cloud connection settings.
- root - Determine the main data output location.
- local_engine - Set which engine will be used for local execution.
- remote_engine - Set the remote engine for the execution of driver/tasks.
- submit_driver - Enable submitting the driver to remote_engine.
- submit_tasks - Enable submitting tasks to the remote engine one by one.
- spark_config - Determine the Spark configuration settings.
- spark_engine - Set the cluster engine to be used, e.g. local, emr (AWS), dataproc (GCP), etc.
- hdfs - Set the HDFS cluster configuration settings.
- beam_config - Set the Apache Beam configuration settings.
- beam_engine - Set the Apache Beam cluster engine, e.g. local or dataflow.
- docker_engine - Set the Docker job engine, e.g. docker or aws_batch.
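As a hedged illustration of how these parameters combine, the following sketch overrides a few of them for the local environment; the label and engine values here are assumptions for this example rather than required settings.
[local]
env_label = dev
# Run Spark tasks with local spark_submit and Docker tasks on the local Docker engine
spark_engine = local
docker_engine = docker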
Google Cloud Platform (GCP), out-of-the-box
The Spark engine is preset for Google DataProc. To set up a GCP environment, you will need to provide a GS bucket as a root path for the metadata store and the Airflow connection ID with cloud authentication info.
See Setting up GCP Environment.
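A minimal sketch of such a [gcp] section, assuming a hypothetical gs://my-dbnd-bucket bucket and Airflow's default google_cloud_default connection ID; the linked topic covers the authoritative setup steps.
[gcp]
# Assumed bucket and connection ID - replace with your own values
root = gs://my-dbnd-bucket/dbnd
conn_id = google_cloud_default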
Amazon Web Services (AWS), out-of-the-box
The Spark engine is preset for Amazon EMR. To set up an AWS environment, you need to provide an S3 bucket as a root path for the metadata store and the Airflow connection ID with cloud authentication information.
See Setting Up an AWS Environment.
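Along the same lines, a hedged sketch of an [aws] section, assuming a hypothetical s3://my-dbnd-bucket bucket and Airflow's default aws_default connection ID:
[aws]
# Assumed bucket and connection ID - replace with your own values
root = s3://my-dbnd-bucket/dbnd
conn_id = aws_default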
Microsoft Azure (Azure), out-of-the-box
The Spark engine is preset for Databricks. To set up an Azure environment, you need to provide an Azure Blob Store bucket as a root path for the metadata store and the Airflow connection ID with cloud authentication information.
See Setting Up an Azure Environment.
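Similarly, a hedged sketch of an [azure] section; the blob store URL format and connection ID below are placeholders, and the linked topic describes the exact values to use.
[azure]
# Placeholder values - replace with your blob store root and Airflow connection ID
root = https://<storage-account>.blob.core.windows.net/<container>/dbnd
conn_id = <your_azure_connection_id>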
Custom environment
You can define a custom environment from scratch or inherit settings from an existing environment. You can create custom environments for managing dev/staging/production lifecycles, which normally involves switching data and execution locations.
See Setting up a Custom Environment (Extending Configurations).
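For instance, here is a hedged sketch of a staging environment that inherits from the out-of-the-box aws environment and only switches the data location; the _from inheritance key, the section name, and the bucket are assumptions for illustration, and the linked topic describes the supported mechanism in detail. Any new environment also has to be listed in the environments parameter of the [core] section.
[aws_staging]
# Assumed section name and bucket - everything else is inherited from [aws]
_from = aws
env_label = staging
root = s3://my-dbnd-staging-bucket/dbnd

[core]
environments = ['local', 'aws_staging']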
To use out-of-the-box environments, set up one or more of the environments described in the referenced topics.