Spark on GCP Dataproc

Configuring Google Dataproc to use Spark with Databand.

You can run your jobs on a Google Dataproc cluster. This requires an Airflow connection to GCP, which you can create with the Airflow CLI:

airflow connections --add \
--conn_id google_cloud_default \
--conn_type google_cloud_platform \
--conn_extra "{\"extra__google_cloud_platform__key_path\": \"<PATH/TO/KEY.json>\", \"extra__google_cloud_platform__project\": \"<PROJECT_ID>\"}"
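Hand-escaping the JSON passed to `--conn_extra` is error-prone. One way to generate it safely is a short standard-library sketch (the key path and project ID below are placeholders):

```python
import json

# Placeholder values -- substitute your own service-account key path and project ID.
conn_extra = {
    "extra__google_cloud_platform__key_path": "/path/to/key.json",
    "extra__google_cloud_platform__project": "my-project-id",
}

# Print a correctly escaped JSON string to paste after --conn_extra.
print(json.dumps(conn_extra))
```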

Then adjust your dbnd configuration:

[gcp]
_type = dbnd_gcp.env.GcpEnvConfig
dbnd_local_root = ${DBND_HOME}/data/dbnd
root = gs://<YOUR-BUCKET>
conn_id = google_cloud_default
spark_engine = dataproc

[dataproc]
region = <REGION>
zone = <ZONE>
num_workers = 0
master_machine_type = n1-standard-1
worker_machine_type = n1-standard-1
cluster = <CLUSTER-NAME>
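As a quick sanity check before running anything, the fragment can be parsed as standard INI. The sketch below assumes the settings live under `[gcp]` and `[dataproc]` sections (matching the env name and the `spark_engine` value) and uses placeholder values:

```python
import configparser

# The configuration fragment above, with placeholder values filled in.
# Section names are an assumption: [gcp] for the env, [dataproc] for the engine.
config_text = """
[gcp]
_type = dbnd_gcp.env.GcpEnvConfig
root = gs://my-bucket
conn_id = google_cloud_default
spark_engine = dataproc

[dataproc]
region = us-central1
zone = us-central1-a
num_workers = 0
master_machine_type = n1-standard-1
worker_machine_type = n1-standard-1
cluster = my-dataproc-cluster
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# num_workers = 0 means a single-node cluster (master only, no separate workers).
print(parser.getint("dataproc", "num_workers"))  # -> 0
print(parser.get("gcp", "spark_engine"))         # -> dataproc
```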

Once your configuration is ready, you can run jobs on the cluster:

dbnd run prepare_data --env gcp
