Spark on AWS EMR

Using Spark on AWS EMR with Databand.

Configuring AWS EMR

Set spark_engine to emr
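
For example, in a DBND project configuration file this could look like the following sketch (the [run] section shown here is an assumption; place the setting wherever your project keeps its DBND configuration):

[run]
spark_engine=emr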

Define connection:

$ dbnd-airflow connections --delete --conn_id spark_emr
$ dbnd-airflow connections --add \
    --conn_id spark_emr \
    --conn_type docker \
    --conn_host local

Define the cluster ID in the EMR configuration:

cluster_id=<EMR cluster ID>
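
For illustration, the EMR-related settings could be grouped as follows (the [emr] section name is an assumption, and j-XXXXXXXXXXXXX only shows the usual shape of an EMR cluster ID; use your own cluster's value):

[emr]
cluster_id=j-XXXXXXXXXXXXX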

πŸ“˜ EMR Steps

By default, DBND uses EMR Steps to submit Spark jobs to the EMR cluster.

Supporting Inline Spark Tasks on EMR

To support inline Spark tasks, install the DBND package on your EMR cluster nodes and set Python 3 as the default Spark Python version.
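
For reference, an inline task is a regular Python function that DBND runs as part of your pipeline code on the cluster, rather than as a separately packaged Spark application. A minimal sketch is shown below; it assumes the generic @task decorator from the dbnd package and a standard PySpark session (your project may use a Spark-specific decorator from the dbnd-spark plugin instead), and the S3 path is a placeholder:

from dbnd import task
from pyspark.sql import SparkSession

@task
def count_lines(input_path="s3://my-bucket/data.txt"):
    # Reuse (or create) the Spark session available on the EMR driver.
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.text(input_path)
    return df.count()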

Setting Python 3 as the default Spark Python version (EMR configuration):

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "DBND_HOME": "/home/hadoop/databand"
        }
      }
    ]
  }
]
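
This JSON can be applied when the cluster is created, for example by saving it to a file and passing it to the AWS CLI (the file name, cluster sizing, and release label below are only illustrative):

aws emr create-cluster \
    --name "databand-spark" \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://spark-python3.json

The same JSON can also be entered as software settings when creating the cluster in the EMR console.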

To install DBND, add the following commands to your cluster as a bootstrap action:

mkdir /home/hadoop/databand && cd /home/hadoop/databand
python3 -m pip install --user databand[spark]
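
If you prefer to keep the installation as a standalone bootstrap script, a sketch might look like this (the script name is illustrative, and the quotes around databand[spark] only protect the brackets from the shell):

#!/bin/bash
# install-databand.sh - bootstrap action that installs DBND on each EMR node
set -e
mkdir -p /home/hadoop/databand
cd /home/hadoop/databand
python3 -m pip install --user 'databand[spark]'

Upload the script to S3 and reference it with the --bootstrap-actions option (e.g. Path=s3://<your-bucket>/install-databand.sh) when creating the cluster.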
