Installing on Spark Cluster
General Installation
Make sure that the Databand Server is accessible from your Spark Cluster.
JVM Integration
The following environment variables should be defined in your Spark context.
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server Access Token
- DBND__TRACKING=True
Please see Installing JVM SDK and Agent for detailed information on all available parameters. Your cluster must have the Databand .jars available in order to use the Listener and other features; see below.
Python Integration
The following environment variables should be defined in your Spark context.
- DBND__CORE__DATABAND_URL - a Databand server URL
- DBND__CORE__DATABAND_ACCESS_TOKEN - a Databand server Access Token
- DBND__TRACKING=True
- DBND__ENABLE__SPARK_CONTEXT_ENV=True
You should install dbnd-spark. See Installing Python SDK and the bootstrap example below.
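Once these variables are set on the cluster and dbnd-spark is installed, tracking is picked up from the environment. Below is a minimal illustrative sketch of a PySpark job reporting a metric to Databand; the application name, input path, and metric name are placeholders, not part of the official examples:
from pyspark.sql import SparkSession
from dbnd import log_metric

spark = SparkSession.builder.appName("dbnd_tracking_example").getOrCreate()

# Placeholder input path -- replace with a dataset in your environment
df = spark.read.csv("s3://<your-bucket>/input.csv", header=True)

# Reported to Databand when DBND__TRACKING=True is set in the Spark context
log_metric("input_row_count", df.count())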
Spark Clusters
Most clusters support setting up Spark environment variables via cluster metadata or via bootstrap scripts. Below are provider-specific instructions.
EMR Cluster
Setting Databand Configuration via Environment Variables
You need to define the environment variables in the API call or in the EmrCreateJobFlowOperator
Airflow operator. Alternatively, you can provide all of these variables via the AWS UI when you create a new cluster. An EMR cluster doesn't have a way of defining environment variables in the bootstrap. Please consult the official EMR documentation on Spark Configuration if you use a custom operator or create the cluster outside Airflow.
from airflow.hooks.base_hook import BaseHook
# Import EmrCreateJobFlowOperator from the module matching your Airflow version,
# e.g. the Amazon provider package or the contrib operators in Airflow 1.x.

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]
emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        "Name": "<EMR Cluster Name>",
        # ...
        "Configurations": [
            {
                "Classification": "spark-env",
                "Configurations": [
                    {
                        "Classification": "export",
                        "Properties": {
                            "DBND__TRACKING": "True",
                            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                            "DBND__CORE__DATABAND_URL": databand_url,
                            "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
                        },
                    }
                ],
            }
        ],
    },
    # ...
)
Installing Databand on Cluster
Since EMR clusters support bootstrap actions, the following snippet can be used to install the Python and JVM integrations:
#!/usr/bin/env bash
DBND_VERSION=REPLACE_WITH_DBND_VERSION
sudo python3 -m pip install pandas==1.2.0 pydeequ==1.0.1 databand[spark]==${DBND_VERSION}
sudo wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
Add this script to your cluster's bootstrap actions list. For more details, please follow the bootstrap actions documentation.
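For illustration, such a bootstrap action could be wired into the same EmrCreateJobFlowOperator call shown above; the script name and S3 path below are placeholders for wherever you upload the script:
emr_create_job_flow = EmrCreateJobFlowOperator(
    job_flow_overrides={
        # ...
        "BootstrapActions": [
            {
                "Name": "install-databand",
                "ScriptBootstrapAction": {
                    # Placeholder: upload the bootstrap script above to your own bucket
                    "Path": "s3://<your-bucket>/bootstrap/install_databand.sh",
                },
            }
        ],
    },
    # ...
)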
Databricks Cluster
Setting Databand Configuration via Environment Variables
In the cluster configuration screen, click 'Edit' >> 'Advanced Options' >> 'Spark'.
Inside the Environment Variables section, declare the configuration variables listed below. Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific information:
DBND__TRACKING="True"
DBND__ENABLE__SPARK_CONTEXT_ENV="True"
DBND__CORE__DATABAND_URL="REPLACE_WITH_DATABAND_URL"
DBND__CORE__DATABAND_ACCESS_TOKEN="REPLACE_WITH_DATABAND_TOKEN"
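If you launch jobs on ephemeral job clusters through Airflow rather than a static cluster, the same variables can be set on the job cluster via the spark_env_vars field of the Databricks cluster spec. This is a sketch; the values are placeholders:
DatabricksSubmitRunOperator(
    # ...
    json={
        # ...
        "new_cluster": {
            # ...
            "spark_env_vars": {
                "DBND__TRACKING": "True",
                "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
                "DBND__CORE__DATABAND_URL": "REPLACE_WITH_DATABAND_URL",
                "DBND__CORE__DATABAND_ACCESS_TOKEN": "REPLACE_WITH_DATABAND_TOKEN",
            },
        },
    },
)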
Install Python DBND library in Databricks cluster
Under the Libraries tab of your cluster's configuration:
- Click 'Install New'
- Choose the PyPI option
- Enter databand[spark]==REPLACE_WITH_DBND_VERSION as the Package name
- Click 'Install'
Install Python DBND library for specific Airflow Operator
Do not use this mode in production; use it only for trying out DBND in a specific task.
Ensure that the dbnd library is installed on the Databricks cluster by adding databand[spark] to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the example below:
DatabricksSubmitRunOperator(
    # ...
    json={
        "libraries": [
            {"pypi": {"package": "databand[spark]==REPLACE_WITH_DBND_VERSION"}},
        ]
    },
)
Tracking Scala/Java Spark Jobs
Download DBND Agent and place it into your DBFS working folder.
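As an illustration, assuming you have downloaded the agent jar to the driver's local disk, it can be copied to DBFS from a notebook; the paths below are placeholders:
# Run inside a Databricks notebook; adjust the placeholder paths to your environment
dbutils.fs.cp(
    "file:/tmp/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar",
    "dbfs:/apps/dbnd-agent-REPLACE_WITH_DBND_VERSION-all.jar",
)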
To configure tracking of Spark applications with automatic dataset logging, add ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners (this mode works only if the DBND agent has been enabled).
Use the following configuration of the Databricks job to enable Databand Java Agent with automatic dataset tracking:
spark_operator = DatabricksSubmitRunOperator(
    json={
        # ...
        "new_cluster": {
            # ...
            "spark_conf": {
                "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
            },
        },
        # ...
    }
)
Make sure that you have published the agent to /dbfs/apps/ first.
For more configuration options, see the Databricks Runs Submit API documentation.
GoogleCloud DataProc Cluster
Cluster Setup
You can define environment variables during the cluster setup or add these variables to your bootstrap as described in Installing on Spark Cluster:
from airflow.hooks.base_hook import BaseHook
# Import DataprocClusterCreateOperator from the module matching your Airflow version,
# e.g. airflow.contrib.operators.dataproc_operator in Airflow 1.x.

dbnd_config = BaseHook.get_connection("dbnd_config").extra_dejson
databand_url = dbnd_config["core"]["databand_url"]
databand_access_token = dbnd_config["core"]["databand_access_token"]
cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "spark-env:DBND__TRACKING": "True",
        "spark-env:DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
        "spark-env:DBND__CORE__DATABAND_URL": databand_url,
        "spark-env:DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    # ...
)
You can install Databand PySpark support via the same operator:
cluster_create = DataprocClusterCreateOperator(
    # ...
    properties={
        "dataproc:pip.packages": "dbnd-spark==REPLACE_WITH_DATABAND_VERSION",
    },
    # ...
)
Dataproc clusters support initialization actions. The following script installs the Databand libraries and sets up the environment variables required for tracking:
#!/usr/bin/env bash
DBND_VERSION=REPLACE_WITH_DBND_VERSION
# to use conda-provided python instead of system one
export PATH=/opt/conda/default/bin:${PATH}
python3 -m pip install pydeequ==1.0.1 databand[spark]==${DBND_VERSION}
DBND__CORE__DATABAND_ACCESS_TOKEN=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_ACCESS_TOKEN)
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=${DBND__CORE__DATABAND_ACCESS_TOKEN} >> /usr/lib/spark/conf/spark-env.sh"
DBND__CORE__DATABAND_URL=$(/usr/share/google/get_metadata_value attributes/DBND__CORE__DATABAND_URL)
sh -c "echo DBND__CORE__DATABAND_URL=${DBND__CORE__DATABAND_URL} >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
Note that variables like the access token and tracker URL should be passed to the initialization action via cluster metadata properties. Please refer to the official Dataproc documentation for details.
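For illustration, the metadata keys read by the script above could be supplied through the same cluster-create operator; the metadata argument shown here is a sketch, so check your operator version for the exact parameter name:
cluster_create = DataprocClusterCreateOperator(
    # ...
    metadata={
        "DBND__CORE__DATABAND_URL": databand_url,
        "DBND__CORE__DATABAND_ACCESS_TOKEN": databand_access_token,
    },
    # ...
)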
Tracking Python Spark Jobs
Use the following configuration of the PySpark DataProc job to enable Databand Spark Query Listener with automatic dataset tracking:
pyspark_operator = DataProcPySparkOperator(
    # ...
    dataproc_pyspark_jars=["gs://.../dbnd-agent-REPLACE_WITH_DATABAND_VERSION-all.jar"],
    dataproc_pyspark_properties={
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
    },
    # ...
)
- You should publish the agent jar to Google Storage first.
See the list of all supported operators and extra information in the Tracking Subprocess/Remote Tasks section.
Cluster Bootstrap
If you are using a custom cluster installation, you have to install the Databand packages and agent, and configure the environment variables for tracking.
Add the following commands to your cluster initialization script:
#!/bin/bash -x
DBND_VERSION=REPLACE_WITH_DBND_VERSION
# Configure your Databand Tracking Configuration (works only for generic cluster/dataproc, not for EMR)
sh -c "echo DBND__TRACKING=True >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__ENABLE__SPARK_CONTEXT_ENV=True >> /usr/lib/spark/conf/spark-env.sh"
# if you use Listeners/Agent, download Databand Agent which includes all jars
wget https://repo1.maven.org/maven2/ai/databand/dbnd-agent/${DBND_VERSION}/dbnd-agent-${DBND_VERSION}-all.jar -P /home/hadoop/
# install Databand Python package together with Spark support
python -m pip install databand[spark]==${DBND_VERSION}
- Install the dbnd packages on the Spark master and Spark workers by running pip install databand[spark] at bootstrap or manually.
- Make sure you don't install "dbnd-airflow" on the cluster.
How to provide Databand credentials via cluster bootstrap
If your cluster type supports configuring environment variables via a bootstrap script, you can use your bootstrap script to define the Databand credentials at the cluster level:
sh -c "echo DBND__CORE__DATABAND_URL=REPLACE_WITH_DATABAND_URL >> /usr/lib/spark/conf/spark-env.sh"
sh -c "echo DBND__CORE__DATABAND_ACCESS_TOKEN=REPLACE_WITH_DATABAND_TOKEN >> /usr/lib/spark/conf/spark-env.sh"
- Be sure to replace REPLACE_WITH_DATABAND_URL and REPLACE_WITH_DATABAND_TOKEN with your environment-specific information.
Databand Agent Path and Query Listener configuration for Spark Operators
Databand can automatically alter the spark-submit command for a variety of Spark operators, inject the agent jar into the classpath, and enable the Query Listener. The following options can be configured in the dbnd_config Airflow connection:
{
    "tracking_spark": {
        "query_listener": true,
        "agent_path": "/home/hadoop/dbnd-agent-latest-all.jar",
        "jar_path": null
    }
}
- query_listener — enables the Databand Spark Query Listener for auto-capturing dataset operations from Spark jobs.
- agent_path — path to the Databand Java Agent FatJar. If provided, Databand will include this agent in the Spark job via the spark.driver.extraJavaOptions configuration option. The agent is required if you want to track Java/Scala jobs annotated with @Task. The agent has to be placed on the cluster's local filesystem to function properly.
- jar_path — path to the Databand Java Agent FatJar. If provided, Databand will include the jar in the Spark job via the spark.jars configuration option. The jar can be placed on a local filesystem as well as on an S3/GCS/DBFS path.
Properties can be configured via environment variables or .cfg files. Please refer to the SDK Configuration for details.
Next Steps
See the Tracking Python section for implementing dbnd within your PySpark jobs. See the Tracking Spark/JVM Applications section for Spark/JVM jobs.