Installing on Spark Cluster

General Installation

Make sure that the Databand Server is accessible from your Spark Cluster.
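For example, you can quickly verify connectivity from a cluster node. A minimal sketch (the URL below is a placeholder; replace it with your Databand server address):

# Minimal connectivity check (sketch): confirm the Databand server is
# reachable from a cluster node. DATABAND_URL is a placeholder.
import urllib.request

DATABAND_URL = "https://databand.example.com"

try:
    with urllib.request.urlopen(DATABAND_URL, timeout=10) as response:
        print(f"Databand server reachable, HTTP status {response.status}")
except Exception as err:
    print(f"Databand server is NOT reachable: {err}")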

JVM Integration

The following environment variables should be defined in your Spark context.

  • DBND__CORE__DATABAND_URL - a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server Access Token
  • DBND__TRACKING=True
Please see Installing JVM SDK and Agent for detailed information on all available parameters. Your cluster must have the Databand .jars available in order to use the Listener and other features; see the sketch below and the Cluster Bootstrap section.
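If you cannot edit the cluster-wide spark-env.sh, one way to pass these variables per job on a YARN cluster is through the spark.yarn.appMasterEnv.* and spark.executorEnv.* configuration keys. A minimal sketch that builds a spark-submit command from Python; the server URL, token, class, and jar names are placeholders:

# Sketch: pass Databand env vars to a JVM Spark job on YARN via spark-submit.
import subprocess

env_vars = {
    "DBND__TRACKING": "True",
    "DBND__CORE__DATABAND_URL": "https://databand.example.com",   # placeholder
    "DBND__CORE__DATABAND_ACCESS_TOKEN": "<your-access-token>",   # placeholder
}

cmd = ["spark-submit"]
for name, value in env_vars.items():
    # propagate each variable to the driver (YARN application master) and the executors
    cmd += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
    cmd += ["--conf", f"spark.executorEnv.{name}={value}"]
cmd += ["--class", "com.example.MySparkJob", "my-job.jar"]        # placeholder job
subprocess.run(cmd, check=True)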

Python Integration

The following environment variables should be defined in your Spark context.

  • DBND__CORE__DATABAND_URL - a Databand server URL
  • DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server Access Token
  • DBND__TRACKING=True
  • DBND__ENABLE__SPARK_CONTEXT_ENV=True
You should install dbnd-spark; see Installing Python SDK, the sketch below, and the bootstrap example in the Cluster Bootstrap section.
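Once dbnd-spark is installed and the variables above are set, you can report metrics and datasets from inside the job. A minimal sketch of a tracked PySpark job, using log_metric and log_dataframe from the dbnd package (the input path and with_histograms flag are illustrative):

from dbnd import log_dataframe, log_metric
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dbnd_tracking_example").getOrCreate()

df = spark.read.csv("s3://my-bucket/input.csv", header=True)  # placeholder input

log_dataframe("input_data", df, with_histograms=True)  # report dataset schema/stats to Databand
log_metric("row_count", df.count())                    # report a custom metric

spark.stop()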

Cluster Bootstrap

Most Spark clusters support bootstrap scripts (EMR, Dataproc, and others).

Configure Tracking Properties on a Spark Cluster with a Bootstrap Script

Add the following commands to your cluster initialization script:

#!/bin/bash -x

DBND_VERSION=REPLACE_WITH_DBND_VERSION

# Configure Databand tracking (works only for generic clusters/Dataproc, not for EMR)
echo "export DBND__TRACKING=True" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__ENABLE__SPARK_CONTEXT_ENV=True" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL" | tee -a /usr/lib/spark/conf/spark-env.sh
echo "export DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN" | tee -a /usr/lib/spark/conf/spark-env.sh

# if you use the Listener/Agent, download the Databand Agent, which includes all required jars
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/

# install the Databand Python package together with Spark support
python -m pip install databand[spark]==${DBND_VERSION}
  • Be sure to replace REPLACE_WITH_DBND_VERSION, REPLACE_WITH_DATABAND_URL, and REPLACE_WITH_DATABAND_TOKEN with your environment-specific values.
  • Install the dbnd packages on the Spark master and Spark workers by running pip install databand[spark], either in the bootstrap script or manually.
  • Make sure you don't install "dbnd-airflow" on the cluster.

Spark Clusters

EMR Cluster

Setting Databand Configuration via Environment Variables

Define the environment variables in the API call or in the EmrCreateJobFlowOperator Airflow operator. Alternatively, provide these variables through the AWS console when creating a new cluster. EMR clusters do not support defining environment variables in a bootstrap script.

from airflow.hooks.base_hook import BaseHook
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator  # Airflow 1.x path

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        ...
        "Configurations": [
            {
                "Classification": "spark-env",
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {
                            "DBND__TRACKING": "True",
                            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                            "DBND__CORE__DATABAND_URL": databand_url,
                            "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
                        },
                    }
                ],
            }
        ]
    }
    ...
)

Installing Databand on the Cluster

Because EMR clusters have built-in support for bootstrap scripts, follow the Cluster Bootstrap section above to install the Python and JVM integrations.

Databricks Cluster

Setting Databand Configuration via Environment Variables

In the cluster configuration screen, click 'Edit' >> 'Advanced Options' >> 'Spark'.
In the Environment Variables section, declare the configuration variables listed below, replacing REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific values (an API-based alternative is sketched after the list):

  • DBND__TRACKING="True"
  • DBND__ENABLE__SPARK_CONTEXT_ENV="True"
  • DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
  • DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"
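If you create job clusters through the API instead of the UI, the same variables can be set in the cluster spec. A minimal sketch using the spark_env_vars field of a DatabricksSubmitRunOperator new_cluster definition (values are placeholders):

DatabricksSubmitRunOperator(
    ...
    json={
        "new_cluster": {
            ...
            "spark_env_vars": {
                "DBND__TRACKING": "True",
                "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
                "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
            },
        },
    },
)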

Install Python DBND library in Databricks cluster

Under the Libraries tab of your cluster's configuration:

  • Click 'Install New'
  • Choose the PyPI option
  • Enter databand[spark]==REPLACE_WITH_DBND_VERSION as the Package name
  • Click 'Install'

Install Python DBND library for a specific Airflow Operator

| Do not use this mode in production; use it only for trying out DBND in a specific task.
Ensure that the dbnd library is installed on the Databricks cluster by adding databand[spark] to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the example below:

DatabricksSubmitRunOperator(
    ...
    json={"libraries": [
        {"pypi": {"package": "databand[spark]==REPLACE_WITH_DBND_VERSION"}},
    ]},
)

Tracking Scala/Java Spark Jobs

Download the DBND Agent and place it in your DBFS working folder.

To enable tracking of Spark applications with automatic dataset logging, add ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners (this mode works only when the DBND agent is enabled).

Use the following configuration of the Databricks job to enable Databand Java Agent with automatic dataset tracking:

spark_operator = DatabricksSubmitRunOperator(
    json={
        ...
        "new_cluster": {
            ...
            "spark_conf": {
                "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
            },
        },
        ...
    },
)

Make sure that you have uploaded the agent to /dbfs/apps/ first.
For more configuration options, see the Databricks Runs Submit API documentation.

Google Cloud Dataproc Cluster

Cluster Setup

You can define environment variables during cluster setup, or add them to your bootstrap script as described in the Cluster Bootstrap section above:

from airflow.hooks.base_hook import BaseHook
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator  # Airflow 1.x path

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]

cluster_create = DataprocClusterCreateOperator(
    ...
    properties={
        "spark-env:DBND__TRACKING": "True",
        "spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
        "spark-env:DBND__CORE__DATABAND_URL": databand_url,
        "spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    ...
)

You can install Databand PySpark support via the same operator:

cluster_create = DataprocClusterCreateOperator(
    ...
    properties={
        "dataproc:pip.packages": "dbnd-spark==REPLACE_WITH_DATABAND_VERSION",
    },
    ...
)

Because Dataproc clusters have built-in support for bootstrap scripts, you can also enable the Python and JVM integrations via bootstrap, as described in the Cluster Bootstrap section above.

Tracking Python Spark Jobs

Use the following configuration of the PySpark Dataproc job to enable the Databand Spark query listener with automatic dataset tracking:

pyspark_operator = DataProcPySparkOperator(
    ...
    dataproc_pyspark_jars=["gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
    dataproc_pyspark_properties={
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
    },
    ...
)
  • You should upload your jar to Google Cloud Storage first; see the sketch below.
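For example, a minimal sketch of uploading the agent jar with the google-cloud-storage client (the bucket and paths are placeholders; gsutil cp works just as well):

# Sketch: upload the Databand agent jar to Google Cloud Storage.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-dataproc-artifacts")  # placeholder bucket
blob = bucket.blob("jars/dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar")
blob.upload_from_filename("dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")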

See the list of all supported operators and additional information in the Tracking Subprocess/Remote Tasks section.

Next Steps

See the Tracking Python section for implementing DBND within your PySpark jobs, and the Tracking Spark/JVM Applications section for Spark/JVM jobs.

