Tracking EMR

This page covers the steps required to install and configure DBND reporting from AWS EMR.

  1. Install the DBND library as part of your EMR bootstrap script:
#!/bin/bash -x
set -e
mkdir /home/hadoop/databand && cd /home/hadoop/databand
sudo python3 -m pip install "databand[spark]"
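If the script above is saved to S3, it can be attached at cluster creation. Below is a minimal sketch using the AWS CLI; the bucket, script path, cluster name, release label, and instance settings are placeholders to replace with your own:
# Upload the bootstrap script
aws s3 cp install_dbnd.sh s3://<my-bucket>/bootstrap/install_dbnd.sh
# Create the cluster with the script attached as a bootstrap action
aws emr create-cluster \
    --name "dbnd-tracking" \
    --release-label emr-6.3.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://<my-bucket>/bootstrap/install_dbnd.sh,Name=InstallDBND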
  2. Define environment variables during cluster setup:
"Configurations": [
    {
      "Classification": "spark-env",
      "Properties": {

      },
      "Configurations": [
        {
          "Classification": "export",
          "Properties": {
            "PYSPARK_PYTHON": "/usr/bin/python3",
            "DBND_HOME": "/home/hadoop/databand",
            "DBND__CORE__DATABAND_URL": "http://<databand_host>:8081",
            "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
            "DBND__NO_TABLES:"True"
          }
        }
      ]
    }
  ]
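When creating the cluster from the AWS CLI, this block can be passed at creation time as well. A sketch, assuming the array above (the value of "Configurations") is saved locally as emr_configurations.json; the file name and cluster options are assumptions:
aws emr create-cluster \
    --configurations file://emr_configurations.json \
    --bootstrap-actions Path=s3://<my-bucket>/bootstrap/install_dbnd.sh,Name=InstallDBND \
    --release-label emr-6.3.0 --applications Name=Spark \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles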
  3. Make sure that the host in DBND__CORE__DATABAND_URL is accessible from within the EMR cluster's security group.

  4. To report logs from EMR back to the DBND tracking service, add one more environment variable, DBND__OVERRIDE_AIRFLOW_LOG_SYSTEM_FOR_TRACKING, set to True (see the snippet after the note below).

🚧 Note: The existing logging configuration will be overridden.
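With this variable added, the export classification from step 2 becomes:
{
  "Classification": "export",
  "Properties": {
    "PYSPARK_PYTHON": "/usr/bin/python3",
    "DBND_HOME": "/home/hadoop/databand",
    "DBND__CORE__DATABAND_URL": "http://<databand_host>:8081",
    "DBND__ENABLE__SPARK_CONTEXT_ENV": "True",
    "DBND__NO_TABLES": "True",
    "DBND__OVERRIDE_AIRFLOW_LOG_SYSTEM_FOR_TRACKING": "True"
  }
}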

  5. Pass the Airflow context metadata as part of your spark-submit command. This can be done as part of an EMR step or through Apache Livy (see the Livy sketch after the command below).
spark-submit --master yarn \
    --conf spark.env.AIRFLOW_CTX_DAG_ID=my_dag \
    --conf spark.env.AIRFLOW_CTX_EXECUTION_DATE=2020-01-01 \
    --conf spark.env.AIRFLOW_CTX_TASK_ID=mytask \
    --name myjob \
    s3://<my-bucket>/myscript.py s3://<input_path> s3://<output_path>
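The equivalent submission through Apache Livy's batches REST API might look like the sketch below; the Livy endpoint is an assumption (8998 is Livy's default port):
# Submit the same job as a Livy batch, passing the Airflow context via spark conf
curl -X POST http://<livy_host>:8998/batches \
  -H "Content-Type: application/json" \
  -d '{
        "file": "s3://<my-bucket>/myscript.py",
        "name": "myjob",
        "args": ["s3://<input_path>", "s3://<output_path>"],
        "conf": {
          "spark.env.AIRFLOW_CTX_DAG_ID": "my_dag",
          "spark.env.AIRFLOW_CTX_EXECUTION_DATE": "2020-01-01",
          "spark.env.AIRFLOW_CTX_TASK_ID": "mytask"
        }
      }'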
