
Installing DBND on Spark Cluster

General Installation

Make sure that the Databand server URL (DBND__CORE__DATABAND_URL) is accessible from your Spark cluster.

Main Configuration

The following environment variables should be defined in your Spark context.

  • DBND_HOME - Databand work folder
  • DBND__CORE__DATABAND_URL - a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server Access Token
  • DBND__ENABLE__SPARK_CONTEXT_ENV=True
  • DBND__TRACKING=True - enables Databand tracking
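
The variables above can be set in a plain shell snippet before the Spark context starts (a minimal sketch; the URL and token values are placeholders you must replace with your environment-specific settings):

```shell
# Placeholder values -- replace with your environment-specific settings.
export DBND_HOME="$(pwd)"                                           # Databand work folder
export DBND__CORE__DATABAND_URL="https://databand.example.com"      # placeholder URL
export DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_YOUR_TOKEN"  # placeholder token
export DBND__ENABLE__SPARK_CONTEXT_ENV=True
export DBND__TRACKING=True
```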

[Option A] via Bootstrap

Add the following commands to your cluster initialization script:

#!/bin/bash -x

python -m pip install databand[spark]
echo "export DBND__CORE__DATABAND_URL=http://<databand-url>" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__CORE__DATABAND_ACCESS_TOKEN=<databand-access-token>" | tee -a /usr/lib/spark/conf/spark-env.sh

echo "export DBND_HOME=`pwd`" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__ENABLE__SPARK_CONTEXT_ENV=True" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__NO_TABLES=True" | tee -a /usr/lib/spark/conf/spark-env.sh

Be sure to replace <databand-url> and <databand-access-token> with your environment-specific values.

[Option B] Via Spark CLI

An alternative approach is to add these variables to the environment of your Spark application. For spark-submit scripts, pass them with spark.env configuration properties:

spark-submit \
    --conf "spark.env.dbnd.tracking=True" \
    --conf "spark.env.dbnd.tracking.log_value_preview=True" \
    --conf "spark.env.dbnd.run.job_name=spark_pipeline" \
    --conf "spark.env.dbnd.core.databand_url=https://tracker.databand.ai" \
    --conf "spark.env.dbnd.core.databand_access_token=1234567890" \
    <your-application>

Python Integration

If you are going to use the Python integration (PySpark), you need to make the DBND Python libraries available on your Spark cluster by installing them on the Spark master and workers at bootstrap.
You can install the DBND library as part of your bootstrap script:

#!/bin/bash -x

DBND_VERSION=[ADD YOUR DBND VERSION HERE]
sudo python3 -m pip install databand[spark]==${DBND_VERSION}

As an alternative approach, you can provide the dbnd package as an extra package for your job (for example, via spark-submit's --py-files). We do not recommend this for production use.
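
After the bootstrap has run, you can sanity-check that the package is importable on a node (a minimal sketch; assumes python3 is on the PATH):

```shell
# Report whether the dbnd package is importable; never fails the script itself.
if python3 -c "import dbnd" 2>/dev/null; then
    echo "dbnd is installed"
else
    echo "dbnd is NOT installed"
fi
```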

EMR Cluster

You can define these environment variables during cluster setup, or add them to your bootstrap script as described in the "Main Configuration" section:

"Configurations": [
    {
      "Classification": "spark-env",
      "Properties": {

      },
      "Configurations": [
        {
          "Classification": "export",
          "Properties": { 
            "DBND__CORE__DATABAND_URL": "DATABAND_SERVICE_URL",
            "DBND__CORE__DATABAND_ACCESS_TOKEN": "DATABAND_SERVICE_TOKEN",
            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True"
          }
        }
      ]
    }
  ]
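
If you create the cluster with the AWS CLI, the same spark-env block can be passed from a file (a sketch; the cluster name, release label, instance settings, and the configurations.json and bootstrap.sh paths are placeholder assumptions, not values from this guide):

```shell
# Hypothetical cluster creation that applies the spark-env block above from a file
# and runs the bootstrap script described earlier.
aws emr create-cluster \
    --name "dbnd-tracking-cluster" \
    --release-label emr-6.10.0 \
    --applications Name=Spark \
    --configurations file://configurations.json \
    --bootstrap-actions Path=s3://<your-bucket>/bootstrap.sh \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles
```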

Installing DBND Agent into your Spark Application (optional)

Download the DBND agent and add spark.driver.extraJavaOptions=-javaagent:/<path-to-agent>/dbnd-agent-0.xx.x-all.jar to your cluster configuration.
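
For example, the agent can be attached at submit time (a sketch; the jar path and the application name are placeholders for your own agent location and job):

```shell
# Attach the DBND Java agent to the Spark driver JVM (paths are placeholders).
spark-submit \
    --conf "spark.driver.extraJavaOptions=-javaagent:/opt/dbnd/dbnd-agent-0.xx.x-all.jar" \
    <your-application>
```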
