Spark on Databricks

Configuration for using Databricks as your Spark engine with Databand (DBND).

Configuring the submission (DBND driver) machine

  1. Uncomment the [databricks] section in databand-core.cfg.
  2. Set cluster_id to your Databricks cluster ID.
  3. Set cloud_type = aws or cloud_type = azure.
  4. Configure your environment to use databricks as its spark_engine. Example:

[aws_databricks]
_type = aws
spark_engine = databricks
# ... additional configuration related to your environment
  5. Create the databricks_default Airflow connection, replacing the placeholders with your workspace URI and access token:
dbnd-airflow connections --delete --conn_id databricks_default
dbnd-airflow connections --add \
--conn_id databricks_default \
--conn_type databricks \
--conn_host <YOUR DATABRICKS CLUSTER URI> \
--conn_extra "{\"token\": \"<YOUR ACCESS TOKEN>\", \"host\": \"<YOUR DATABRICKS CLUSTER URI>\"}"
  6. Set the tracker configuration as follows (a consolidated configuration sketch follows this list):
tracker = ['console', 'file']
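
Putting the steps together, a minimal databand-core.cfg could look like the sketch below. The cluster ID is a placeholder, and the [aws_databricks] environment name simply matches the example above; keep any other settings your file already has.

[core]
tracker = ['console', 'file']

[databricks]
cluster_id = <YOUR DATABRICKS CLUSTER ID>
cloud_type = aws

[aws_databricks]
_type = aws
spark_engine = databricks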

πŸ“˜

Getting the Databricks Cluster ID

  • API: https://<CLUSTER_IP>/api/2.0/clusters/list (see the example below)
  • UI: Clusters -> Advanced Options -> Tags -> ClusterId
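
For example, you can list clusters and their IDs from a terminal. A minimal sketch, assuming you authenticate with the same personal access token and workspace URI used for the Airflow connection above:

curl -s -H "Authorization: Bearer <YOUR ACCESS TOKEN>" \
  https://<YOUR DATABRICKS CLUSTER URI>/api/2.0/clusters/list

Each entry in the response contains a cluster_id field.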

Configuring the Databricks Cluster


You can configure DBND to spin up a new cluster for every job or to use an existing cluster (the default behavior). In either case, you need to install and configure the DBND package on the cluster.

For an existing cluster, run the following script as a Databricks notebook. For a new cluster, make this script part of your Cluster Init script.

# databand home is /home/ubuntu
export SLUGIFY_USES_TEXT_UNIDECODE=yes
export PIP_EXTRA_INDEX_URL=https://<pypi_user>:<password>@pypi.databand.ai/simple
pip install databand[databricks,airflow,docker,spark,aws]
dbnd airflow-db-init
echo -e "[core]\ntracker = ['console', 'file']" > /home/ubuntu/project.cfg
# for Databricks on AWS (used for S3 access)
dbnd-airflow connections --delete --conn_id aws_default
# in case you are using Azure Blob storage
dbnd-airflow connections --add --conn_id=azure_blob_storage_default --conn_login=<blobname> --conn_type=wasb --conn_password=<AZURE_BLOB_KEY>
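
For the new-cluster flow, one way to wire this in is to upload the script to DBFS with the Databricks CLI and point the cluster's Init Scripts setting at it. A sketch, assuming the script above is saved locally as setup_dbnd.sh (the file name and DBFS path are assumptions; any DBFS path works):

# upload the setup script to DBFS (path is an assumption)
databricks fs cp setup_dbnd.sh dbfs:/databricks/init-scripts/setup_dbnd.sh

Then reference dbfs:/databricks/init-scripts/setup_dbnd.sh under the cluster's Advanced Options -> Init Scripts.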

Run a sanity check to validate your setup:

dbnd run databand_examples.tool_spark.word_count_inline.word_count_inline --set text=s3://databand-playground/demo/customer_b.csv --env aws
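
If your environment targets Azure instead, the same sanity check applies with your Azure environment name and an input file reachable from the cluster; a sketch, where both placeholders are yours to fill in:

dbnd run databand_examples.tool_spark.word_count_inline.word_count_inline --set text=<PATH TO YOUR INPUT FILE> --env <YOUR AZURE ENVIRONMENT NAME>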
