Tracking JVM Applications

A guide to enabling tracking for JVM applications.

Databand provides a set of Java libraries for tracking JVM-specific applications, such as Spark jobs written in Scala or Java.
Follow this guide to start tracking JVM applications.

Step 1: Add the Databand libraries to your application

Maven:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>ai.databand</groupId>
      <artifactId>dbnd-client</artifactId>
      <version>0.xx.x</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Gradle:

// add Databand libraries
dependencies {
    implementation('ai.databand:dbnd-client:0.xx.x')
}

SBT:

val databandVersion = "0.xx.x"

libraryDependencies ++= Seq(
  "ai.databand" % "dbnd-client" % databandVersion
)

Step 2: Configure the environment variables

Add the Databand tracking URL to the environment variables available to your application:

  1. Set dbnd.core.databand_url=https://tracker.databand.ai.
  2. If your code is not running inside Airflow, opt in to tracking by setting the env variable dbnd.tracking=True.
  3. For Spark, pass the variables through spark.env:
spark-submit \
    --conf "spark.driver.extraJavaOptions=-javaagent:dbnd-agent-0.xx.x-all.jar" \
    --conf "spark.env.dbnd.tracking=True" \
    --conf "spark.env.dbnd.tracking.log_value_preview=True" \
    --conf "spark.env.dbnd.run.job_name=spark_pipeline" \
    --conf "spark.env.dbnd.core.databand_url=https://tracker.databand.ai"

Step 3: Track pipeline metadata

The sections below describe different options available for tracking pipeline metadata.

Logging Metrics

You can log any custom metrics that are important for pipeline and data observability. Examples include custom metrics for data quality information, like data counts or null counts, and custom KPIs particular to your data.

To log string and numeric values, use the ai.databand.log.DbndLogger.logMetric() method:
DbndLogger.logMetric("data", data);
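For example, here is a minimal sketch of a method that logs a row count and a null count for a Spark Dataset; the method name, metric keys, and the customer_id column are illustrative, not part of the Databand API:

import ai.databand.log.DbndLogger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

//...

    public void profileCustomers(Dataset<Row> data) {
        // illustrative data-quality metrics: total rows and null keys
        long rowCount = data.count();
        long nullIds = data.filter(data.col("customer_id").isNull()).count();
        DbndLogger.logMetric("row_count", rowCount);
        DbndLogger.logMetric("null_customer_ids", nullIds);
    }

//...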

Tracking pipeline functions with annotations

If you have a more complex pipeline structure, or you want to present your pipeline functions as separate tasks with their own metadata, add annotations to your pipeline code.
Annotating a method both enables input/output tracking for that method and links the annotated methods together visually.

Mark the methods that you want to track with the @Task annotation:

Scala:

import ai.databand.annotations.Task

object ScalaSparkPipeline {
  @Task
  def main(args: Array[String]): Unit = {
    // init code
    // ...
    // task 1
    val imputed = unitImputation(rawData, columnsToImpute, 10)
    // task 2
    val clean = dedupRecords(imputed, keyColumns)
    // task 3
    val report = createReport(clean)
  }
    
  @Task
  protected def unitImputation(rawData: DataFrame, columnsToImpute: Array[String], value: Int): DataFrame = {
    // ...
  }
  
  @Task
  protected def dedupRecords(data: Dataset[Row], keyColumns: Array[String]): DataFrame = {
    // ...
  }

  @Task
  protected def createReport(data: Dataset[Row]): Dataset[Row] = {
    // ...
  }
}

Java:

import ai.databand.annotations.Task;

public class ProcessDataSpark {

    @Task
    public void processCustomerData(String inputFile, String outputFile) {
        // setup code...
        // task 1
        Dataset<Row> imputed = unitImputation(rawData, columnsToImpute, 10);
        // task 2
        Dataset<Row> clean = dedupRecords(imputed, keyColumns);
        // task 3
        Dataset<Row> report = createReport(clean);
        // ...
    }

    @Task
    protected Dataset<Row> unitImputation(Dataset<Row> rawData, String[] columnsToImpute, int value) {
        // ...
    }

    @Task
    protected Dataset<Row> dedupRecords(Dataset<Row> data, String[] keyColumns) {
        // ...
    }

    @Task
    protected Dataset<Row> createReport(Dataset<Row> data) {
        // ...
    }
}

To use annotations and track the flow of tasks, the Databand Java agent must instrument your application, so include the agent in the application startup script:

  1. Download the Java agent library using the following link template, replacing 0.xx.x with the DBND version you are running:
    https://repo1.maven.org/maven2/ai/databand/dbnd-agent/0.xx.x/dbnd-agent-0.xx.x-all.jar
  2. Add the Java agent to the JVM startup parameters: -javaagent:/<path-to-agent>/dbnd-agent-0.xx.x-all.jar (see the example after this list).
  3. For spark-submit, use spark.driver.extraJavaOptions: spark-submit --conf "spark.driver.extraJavaOptions=-javaagent:/<path-to-agent>/dbnd-agent-0.xx.x-all.jar" <other submit params>
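
For a plain (non-Spark) JVM application, a minimal sketch of the startup command; the agent path and jar name are illustrative:

java -javaagent:/opt/databand/dbnd-agent-0.xx.x-all.jar -jar my-pipeline.jar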

Configuring Logging

Databand supports log size limits and head/tail logging. The following properties control this behavior:

  • DBND__LOG__PREVIEW_HEAD_BYTES specifies how many bytes to fetch from the head of the log
  • DBND__LOG__PREVIEW_TAIL_BYTES specifies how many bytes to fetch from the tail of the log
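
For example, when submitting a Spark job you can pass these properties the same way as the tracking variables in Step 2; this is a minimal sketch and the byte values are illustrative:

spark-submit \
    --conf "spark.env.DBND__LOG__PREVIEW_HEAD_BYTES=32768" \
    --conf "spark.env.DBND__LOG__PREVIEW_TAIL_BYTES=32768" \
    <other submit params>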

Logging Dataframes

To log dataframes, including a generated preview of the data, enable data preview tracking by setting the dbnd.tracking.log_value_preview env variable to True.

You can log Spark dataframes along with histograms by calling the ai.databand.log.DbndLogger.logDataframe() method:
DbndLogger.logDataframe("data", result, true)

Customizing Histogram Logging

You can customize histogram logging by passing a HistogramRequest object to the ai.databand.log.DbndLogger.logDataframe() method:

import ai.databand.log.HistogramRequest;
import ai.databand.log.DbndLogger;

//...

    @Task("create_report")
    public void createReport(Dataset<?> df) {
        // include all string columns
        DbndLogger.logDataframe("report_input", df, HistogramRequest.ALL_STRING);
        // include some columns
        HistogramRequest req = new HistogramRequest()
            .includeColumns("name", "created_at");
        DbndLogger.logDataframe("report_input", df, req);
    }
         
//...

Logging Dataset Operations - BETA FEATURE

📘 BETA Release

Databand's Java SDK now supports dataset logging as a BETA feature. While our team has tested the internals of this implementation, there may be small differences in the user experience of dataset logging when compared with the Python SDK implementation.

Please contact Databand with any questions or concerns if you experience unexpected results in your Databand UI.

Databand provides the ability to track your dataset operations. To do so, use DbndLogger.logDatasetOperation():

import ai.databand.log.DbndLogger;
import ai.databand.schema.DatasetOperationType;

//...

    @Task("ingest_data")
    public void ingestData(String path) {
        Dataset<Row> data = sql.read().json(path);
        DbndLogger.logDatasetOperation(path, DatasetOperationType.READ, data);
    }
         
//...
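
The same call can record the output side of a pipeline. Below is a minimal sketch that logs a write operation; it assumes DatasetOperationType.WRITE is available alongside READ, and the task name, method, and output path are illustrative:

import ai.databand.log.DbndLogger;
import ai.databand.schema.DatasetOperationType;

//...

    @Task("store_report")
    public void storeReport(Dataset<Row> report, String outputPath) {
        // persist the report, then log the dataset write against the same path
        report.write().parquet(outputPath);
        DbndLogger.logDatasetOperation(outputPath, DatasetOperationType.WRITE, report);
    }

//...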

For more details, see Dataset Logging.

Step 4: Further steps - tracking Spark metrics and data quality with Deequ

When you have successfully configured your application tracking, you can proceed to advanced topics such as tracking Spark metrics and monitoring data quality with Deequ.

