
Installing DBND on Databricks Spark Cluster

How to track Databricks jobs.

Cluster Configuration — Setting DBND Context

The following steps will ensure that your Databricks cluster has the proper dbnd context for collecting and reporting metadata to Databand.

Setting Databand Configuration via Environment Variables

In the cluster configuration screen, click 'Edit' > 'Advanced Options' > 'Spark'.
In the Environment Variables section, declare the configuration variables listed below. Be sure to replace <databand-url> and <databand-access-token> with your environment-specific information:

  • DBND__TRACKING="True"
  • DBND__CORE__DATABAND_URL="<databand-url>"
  • DBND__CORE__DATABAND_ACCESS_TOKEN="<databand-access-token>"
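
If you create clusters through the Databricks Jobs or Clusters API rather than the UI, the same variables can be set in the spark_env_vars field of the cluster spec. A minimal sketch (the runtime version and instance type are placeholders):

# Cluster spec fragment for the Databricks API; spark_env_vars injects
# the same DBND variables that the UI steps above set.
new_cluster = {
    "spark_version": "6.5.x-scala2.11",  # match your runtime
    "node_type_id": "m5a.large",         # placeholder instance type
    "num_workers": 1,
    "spark_env_vars": {
        "DBND__TRACKING": "True",
        "DBND__CORE__DATABAND_URL": "<databand-url>",
        "DBND__CORE__DATABAND_ACCESS_TOKEN": "<databand-access-token>",
    },
}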

Install Python DBND library in Databricks cluster

Under the Libraries tab of your cluster's configuration:

  • Click 'Install New'
  • Choose the PyPI option
  • Enter dbnd as the Package name
  • Click 'Install'
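
To verify the installation, import the library from a notebook attached to the cluster. A quick sanity check (assuming dbnd exposes a __version__ attribute; a clean import alone already confirms the package is present):

# Run in a notebook attached to the cluster.
import dbnd
print(dbnd.__version__)  # assumed attribute; an ImportError means the library is missing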

Install Python DBND library for specific Airflow Operator

| Do not use this mode in production; use it only for trying out DBND in a specific task.
Ensure that the dbnd library is installed on the Databricks cluster by adding dbnd to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the example below:

# Import path for the Airflow Databricks provider package; older Airflow
# versions ship this operator under airflow.contrib instead.
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# The libraries parameter installs dbnd from PyPI on the job cluster.
params = {
    "libraries": [
        {"pypi": {"package": "dbnd"}},
    ],
    # ... plus the rest of your run definition (new_cluster, a task, etc.)
}

DatabricksSubmitRunOperator(
    task_id="process_customer_data",
    json=params,
)

See the Tracking Python section for implementing dbnd within your Databricks Python jobs.
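
As a preview, here is a minimal sketch of in-script tracking with the dbnd Python API (log_metric and log_dataframe are dbnd functions; the DataFrame and metric names are illustrative):

# Inside a Databricks Python job; assumes the DBND__* environment
# variables from the cluster configuration above are in place.
from dbnd import log_dataframe, log_metric

def process(df):
    log_dataframe("input_data", df)  # reports schema and shape to Databand
    result = df.dropna()
    log_metric("rows_after_dropna", result.count())
    return result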

Tracking Scala/Java Spark Jobs

Download the DBND Agent and place it in your DBFS working folder.

Use the following Databricks job configuration to enable the Databand Java Agent:

"new_cluster": {
    "spark_conf": {
        "spark.driver.extraJavaOptions": "-javaagent:/dbfs/<path_to_agent>/dbnd-agent-0.xx.x-all.jar"
    },
}

Automatic dataset logging

To enable automatic dataset logging for tracked Spark applications, add ai.databand.spark.DbndSparkQueryExecutionListener to spark.sql.queryExecutionListeners. This mode works only when the DBND agent is enabled:

"new_cluster": {
    "spark_conf": {
        "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
        "spark.driver.extraJavaOptions": "-javaagent:/dbfs/<path_to_agent>/dbnd-agent-0.xx.x-all.jar"
    },
}
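
With the listener registered, reads and writes that go through the Spark SQL engine are reported to Databand without any code changes. A sketch of a PySpark job whose I/O the listener would capture (paths and column names are placeholders):

# No dbnd imports required: DbndSparkQueryExecutionListener picks up
# Spark SQL reads and writes automatically.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("s3://<bucket>/input/data.csv", header=True)  # captured as a dataset read
report = df.groupBy("borough").count()
report.write.parquet("s3://<bucket>/output/report/")  # captured as a dataset write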

The full operator will then look like this:

boroughs_and_agencies_report = DatabricksSubmitRunOperator(
    task_id="boroughs_and_agencies_report",
    json={
        "name": "boroughs_and_agencies_report",
        "new_cluster": {
            "spark_version": "6.5.x-scala2.11",
            "node_type_id": "m5a.large",
            "aws_attributes": {
                "availability": "SPOT_WITH_FALLBACK",
                "ebs_volume_count": 1,
                "ebs_volume_type": "GENERAL_PURPOSE_SSD",
                "ebs_volume_size": 100,
                "instance_profile_arn": "arn:aws:iam::xxxxx:instance-profile/dev_databricks_ec2",
            },
            "num_workers": 1,
            "spark_conf": {
                "spark.sql.queryExecutionListeners": "ai.databand.spark.DbndSparkQueryExecutionListener",
                "spark.driver.extraJavaOptions": "-javaagent:/dbfs/apps/dbnd-agent-0.xx.x-all.jar",
            },
        },
        "max_retries": 1,
        "spark_python_task": {
            "python_file": "s3://playground/scripts/boroughs_and_agencies_report.py",
            "parameters": [S3_BUCKET, S3_KEY, "{{ ds }}"],
        },
    },
)

Please see Installing DBND on Spark Cluster for more information. For more configuration options, see the Databricks Runs Submit API documentation.

