Tracking Spark/JVM Applications

Databand provides several integration options to cover a wide range of Spark applications. Since there are many ways to deploy a Spark application, choose the integration that is best suited to your setup. Currently, Databand supports Python and Scala/Java Spark applications. Databand collects Spark-specific metadata, such as job metrics, as well as Spark execution logs. It can also provide visibility into your Spark execution in the context of your broader pipelines or orchestration system.

Tracking PySpark

The most basic form of tracking requires adding the dbnd package to your application and using the @task decorator or dbnd_tracking():

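from dbnd import log_metric, task
from pyspark.sql import SparkSession
from operator import add

@task
def calculate_counts(input_file, output_file):
    spark = SparkSession.builder.appName("PythonWordCount").getOrCreate()
    # Read the input file as an RDD of plain text lines
    lines = spark.read.text(input_file).rdd.map(lambda r: r[0])
    # Split each line into words and sum the occurrences of each word
    counts = (
        lines.flatMap(lambda x: x.split(" ")).map(lambda x: (x, 1)).reduceByKey(add)
    )
    output = counts.collect()

    # Report the number of distinct words as a Databand metric
    log_metric("counts", len(output))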
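
To make clear what value the "counts" metric carries, the same computation can be sketched in plain Python without Spark. This is only an illustration: count_words and record_metric are hypothetical names, and record_metric is a stand-in for any callable with the (key, value) shape of log_metric.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def count_words(
    lines: Iterable[str], record_metric: Callable[[str, int], None]
) -> Dict[str, int]:
    # Local equivalent of flatMap -> map -> reduceByKey in the Spark example.
    counts = Counter(word for line in lines for word in line.split(" "))
    # The reported value is the number of distinct words,
    # not the total number of words in the input.
    record_metric("counts", len(counts))
    return dict(counts)


# Collect metrics in a dict instead of sending them to a tracker.
recorded: Dict[str, int] = {}
result = count_words(["a b a", "b c"], recorded.__setitem__)
print(recorded)
```

In a tracked application, the callable passed as record_metric would be log_metric itself.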

For more details and advanced configuration please proceed to the Tracking PySpark guide.

Tracking Scala/Java Spark applications

To get insights into your Scala/Java application, add the dbnd-client JAR to your application and use DbndLogger.logMetric, DbndLogger.logDatasetOperation, and the other DbndLogger methods:

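import org.apache.spark.sql.{DataFrame, SparkSession}
// DbndLogger and the READ operation type are provided by the dbnd-client JAR

object GenerateReports {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("example").getOrCreate
    val path = "/data/daily_data"
    val df: DataFrame = spark.read.option("header", "true").csv(path)
    DbndLogger.logDatasetOperation(path, READ, df)
  }
}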

For the full guide, see Tracking Spark (Scala/Java).

Automatic Tracking of Dataset operations via DBND Listener

Databand can capture Spark I/O operations. The Databand Query Execution Listener captures every read/write operation performed by Spark and tracks the following metrics as a dataset operation: data path, schema, and row count. See Dataset Tracking for more details. For installation details, see Installing JVM SDK and Agent.

Data sources supported by the Databand Query Execution Listener:

  • Regular files on any storage (Filesystem/S3/GCS/any other)
  • Hive tables
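
As a rough picture of what a dataset operation contains, the sketch below builds a record with the three tracked metrics (data path, schema, row count) for a small CSV read. It is plain Python, not the listener itself: DatasetOperation and describe_csv_read are hypothetical names and do not reflect Databand's internal data model.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict


@dataclass
class DatasetOperation:
    # Hypothetical record shape, used only for illustration.
    path: str
    operation: str  # "read" or "write"
    schema: Dict[str, str]
    row_count: int


def describe_csv_read(path: str, text: str) -> DatasetOperation:
    # Parse CSV text with a header row; all columns are treated as strings.
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    schema = {name: "string" for name in (reader.fieldnames or [])}
    return DatasetOperation(path, "read", schema, len(rows))


op = describe_csv_read("/data/daily_data", "id,amount\n1,10\n2,20\n")
print(op)
```

With the listener installed, this information is collected automatically, without any such code in the application.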

Track your data quality by using the Deequ library

Deequ is a library for measuring data quality, built on top of Spark. It provides a handy DSL for "unit-testing" your data.
Databand can capture any metrics produced during Deequ profiling. Histograms generated by Deequ during profiling are also reported to Databand.
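
To give a sense of the kind of values involved, the sketch below computes two typical profiling results for a single column in plain Python: completeness (the fraction of non-null values) and a histogram of value counts. It does not use Deequ or Spark, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def completeness(values: Sequence[Optional[str]]) -> float:
    # Fraction of entries that are not null.
    if not values:
        raise ValueError("completeness is undefined for an empty column")
    return sum(v is not None for v in values) / len(values)


def histogram(values: Sequence[Optional[str]]) -> Dict[str, int]:
    # Count occurrences of each non-null value (nulls are skipped here).
    return dict(Counter(v for v in values if v is not None))


column = ["a", None, "b", "a"]
print(completeness(column))
print(histogram(column))
```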

For PySpark, see Tracking PySpark; for JVM Spark, see Tracking Spark (Scala/Java).

Deployment-specific Guides

Please see Installing JVM SDK and Agent.

Databand also supports advanced tracking for the following cluster types:
