Tracking Databricks

How to track Databricks jobs.

Cluster Configuration - Setting DBND Context

The following steps will ensure your Databricks cluster has the proper dbnd context for collecting and reporting metadata to Databand.

Installing the DBND Library on the Databricks Cluster

Under the Libraries tab of your cluster's configuration:

  • Click 'Install New'
  • Choose the PyPI option
  • Enter dbnd as the Package name
  • Click 'Install'
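
Alternatively, for a quick test you can install the library from a notebook attached to the cluster. Note that this notebook-scoped install (a minimal sketch below) applies only to the current notebook session; the cluster-level install above makes dbnd available to every job on the cluster.

%pip install dbnd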

Setting Databand Configuration via Environment Variables

In the cluster configuration screen, click 'Edit' > 'Advanced Options' > 'Spark'.

Inside the Environment Variables section, declare the configuration variables listed below. Be sure to replace <databand-url> and <databand-access-token> with your environment-specific values (a quick way to verify them is sketched after this list):

  • DBND__TRACKING="True"
  • DBND__CORE__DATABAND_URL="<databand-url>"
  • DBND__CORE__DATABAND_ACCESS_TOKEN="<databand-access-token>"
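
Once the cluster restarts with these settings, a quick sanity check (a minimal sketch; run it from any notebook or job attached to the cluster) is to read the variables back from the environment:

import os

# These should echo the values declared in the cluster's Spark environment variables.
print(os.environ.get("DBND__TRACKING"))
print(os.environ.get("DBND__CORE__DATABAND_URL"))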

Calling dbnd_tracking() Context

dbnd_tracking() uses the configuration created above to instantiate the Databand context during execution. Any processes executed inside the dbnd_tracking() context will be displayed in Databand for monitoring.

dbnd_tracking() accepts a name parameter which will be used to identify the Databricks job in the Pipelines screen of your Databand application. In the example below, we use databricks_test as our pipeline name.

from random import randint
from time import sleep

import pandas as pd

from dbnd import log_metric, dataset_op_logger, dbnd_tracking


def execute():
    # Log a few sample metrics, one per iteration.
    for i in range(randint(0, 10)):
        sleep(randint(0, 10))
        log_metric(f"iteration_{i}", i)

    # Log a dataset read operation.
    with dataset_op_logger("databricks://test/load/read", "read") as logger:
        data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
        read_df = pd.DataFrame.from_dict(data, orient='index')
        logger.set(data=read_df)

    # Log a dataset write operation.
    with dataset_op_logger("databricks://test/load/write", "write") as logger:
        data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
        write_df = pd.DataFrame.from_dict(data, orient='index', columns=['A', 'B', 'C', 'D'])
        logger.set(data=write_df)


with dbnd_tracking("databricks_test"):
    execute()

📘 BEST PRACTICE - dbnd_tracking()

Call your core control flow logic within the dbnd_tracking() context. In the example above, we invoke the execute() function within the dbnd_tracking() context, which calls the rest of our processing logic.

See our Tracking Python section for implementing dbnd within your Databricks Python jobs.

Tracking Databricks from Airflow DAGs

Ensure that the dbnd library is installed on the Databricks cluster by adding dbnd to the libraries parameter of the DatabricksSubmitRunOperator, as shown in the example below:

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

params = {
    "libraries": [
        {"pypi": {"package": "dbnd"}},
    ],
    # ... the rest of the run specification (cluster, task, etc.) goes here
}

DatabricksSubmitRunOperator(
    task_id="process_customer_data",
    json=params,
)
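
For context, a fuller run specification might look roughly like the sketch below. This is only an illustration: the runtime version, node type, worker count, and script path are hypothetical placeholders, and only the libraries entry (plus the Databand environment variables) is what this guide actually requires.

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

params = {
    "run_name": "process_customer_data",
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder node type
        "num_workers": 2,
        "spark_env_vars": {
            "DBND__TRACKING": "True",
            "DBND__CORE__DATABAND_URL": "<databand-url>",
            "DBND__CORE__DATABAND_ACCESS_TOKEN": "<databand-access-token>",
        },
    },
    "libraries": [
        {"pypi": {"package": "dbnd"}},
    ],
    "spark_python_task": {
        "python_file": "dbfs:/scripts/process_customer_data.py",  # placeholder script location
    },
}

DatabricksSubmitRunOperator(
    task_id="process_customer_data",
    json=params,
)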

Tracking Java/Scala Spark Jobs

After injecting Databand Java Agent into the Databricks runtime, Databand will track any Java/Scala job submitted to Databricks automatically. No specific Databricks configuration is necessary.
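
As a rough illustration only (the exact agent jar and supported options are described in the Tracking JVM Applications section, and the DBFS path below is a hypothetical placeholder), injecting the agent typically means pointing the driver JVM at the agent jar in the cluster's Spark config:

spark.driver.extraJavaOptions -javaagent:/dbfs/databand/dbnd-agent-all.jar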

Please see the Tracking JVM Applications section for more details.

For more configuration options, see the Databricks Runs Submit API documentation.

