Submitting Deequ Metrics

How to integrate Deequ with Databand for measuring data quality.

Integrating with Deequ

Deequ is a library for measuring data quality, built on top of Spark. Databand provides the ability to capture any metrics produced during Deequ profiling. Histograms generated by Deequ during profiling are also reported to Databand.

To use Deequ, first add the deequ and dbnd-api-deequ dependencies to your project:

Maven:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.amazon.deequ</groupId>
      <artifactId>deequ</artifactId>
      <version>x.x.x-spark-x.x</version>
    </dependency>
    <dependency>
      <groupId>ai.databand</groupId>
      <artifactId>dbnd-api-deequ</artifactId>
      <version>0.xx.x</version>
    </dependency>
  </dependencies>
</dependencyManagement>

Gradle:

// add Deequ and Databand libraries
dependencies {
  implementation 'com.amazon.deequ:deequ:x.x.x-spark-x.x'
  implementation 'ai.databand:dbnd-api-deequ:0.xx.x'
}

SBT:

libraryDependencies ++= Seq(
  "com.amazon.deequ" % "deequ" % "x.x.x-spark-x.x",
  "ai.databand" % "dbnd-api-deequ" % "0.xx.x"
)

Note on Scala/Spark compatibility

The Databand library is Scala/Spark-agnostic and works with any combination of Scala and Spark versions. The Deequ version, however, must match the Scala and Spark versions you run. Refer to the Deequ documentation and pick the exact version from the list of available versions.

Databand uses a custom metrics repository, DbndMetricsRepository, and a custom result key, DbndResultKey. You need to add both explicitly in your code:

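// Databand imports; the locations of @Task and DbndResultKey are assumed here
import ai.databand.annotations.Task
import ai.databand.deequ.{DbndMetricsRepository, DbndResultKey}
// Deequ and Spark imports
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.profiles.ColumnProfilerRunner
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository
import org.apache.spark.sql.{Dataset, Row}

@Task
protected def dedupRecords(data: Dataset[Row], keyColumns: Array[String]): Dataset[Row] = {
    val dedupedData = data.dropDuplicates(keyColumns)
    // custom metrics repository that wraps an in-memory repository
    val metricsRepo = new DbndMetricsRepository(new InMemoryMetricsRepository)
    // capture dataset verification results
    VerificationSuite()
        .onData(dedupedData)
        .addCheck(
            Check(CheckLevel.Error, "Dedup testing")
                .isUnique("name")
                .isUnique("id")
                .isComplete("name")
                .isComplete("id")
                .isPositive("score"))
        .useRepository(metricsRepo)
        .saveOrAppendResult(new DbndResultKey("dedupedData"))
        .run()
    // use the same metrics repository to capture dataset profiling results
    ColumnProfilerRunner()
        .onData(dedupedData)
        .useRepository(metricsRepo)
        .saveOrAppendResult(new DbndResultKey("dedupedData"))
        .run()
    // return the deduplicated dataset, as the declared return type requires
    dedupedData
}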

If you already use a metrics repository, you can wrap it in Databand's DbndMetricsRepository, just as the example above wraps an in-memory repository with new DbndMetricsRepository(new InMemoryMetricsRepository). Databand first submits the metrics to the wrapped repository and then to the Databand tracker.
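
A minimal sketch of wrapping an existing repository. It assumes that repository is Deequ's FileSystemMetricsRepository and that DbndMetricsRepository accepts any Deequ MetricsRepository (the example above only shows it wrapping InMemoryMetricsRepository). The object name, method name, and path parameter are hypothetical.

```scala
import ai.databand.deequ.DbndMetricsRepository
import com.amazon.deequ.repository.MetricsRepository
import com.amazon.deequ.repository.fs.FileSystemMetricsRepository
import org.apache.spark.sql.SparkSession

object WrappedRepositoryExample {
  // Hypothetical helper: builds a repository that keeps writing metrics
  // to the file-based repository you already use and, through the Databand
  // wrapper, also reports them to the Databand tracker.
  def buildRepository(spark: SparkSession, metricsPath: String): MetricsRepository = {
    // The repository you already have (assumed here to be file-based).
    val existingRepo = FileSystemMetricsRepository(spark, metricsPath)
    // Assumption: the wrapper accepts any Deequ MetricsRepository.
    new DbndMetricsRepository(existingRepo)
  }
}
```

The returned repository can be passed to useRepository(...) in place of metricsRepo in the example above.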

To distinguish metric keys, use the special DbndResultKey. We recommend naming your checks and profiles so that you can clearly tell them apart in Databand's monitoring UI.
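
One way to keep key names distinguishable is to derive them from the dataset name and the kind of run. The sketch below is only an illustration. The "<dataset>_<kind>" convention and the ResultKeys helper are hypothetical, not part of Databand, and DbndResultKey is assumed to live in the ai.databand.deequ package and to take a single name argument, as in the example above.

```scala
import ai.databand.deequ.DbndResultKey

object ResultKeys {
  // Hypothetical naming convention: "<dataset>_<kind>".
  def keyName(dataset: String, kind: String): String = s"${dataset}_$kind"

  // Key for verification results of the given dataset.
  def verificationKey(dataset: String): DbndResultKey =
    new DbndResultKey(keyName(dataset, "verification"))

  // Key for column profiling results of the given dataset.
  def profileKey(dataset: String): DbndResultKey =
    new DbndResultKey(keyName(dataset, "profile"))
}
```

With this helper, the verification run would call saveOrAppendResult(ResultKeys.verificationKey("dedupedData")) and the profiling run would call saveOrAppendResult(ResultKeys.profileKey("dedupedData")).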

