Spark Configuration

How to configure DBND for running Spark tasks.

Every Spark task has a spark_engine parameter that controls what Spark engine is used and a spark_config parameter that controls generic Spark configuration.

You can set global values for all Spark tasks in the pipeline through your environment configuration.

For example, the local environment uses the local spark_submit engine by default, while the aws environment uses emr.
You can override the default spark_engine in the configuration or for any given run of the Spark task or pipeline.

Configure DBND Spark Engine

Note

To use remote Spark engines, you must have dbnd-airflow installed.

To install dbnd-airflow, run pip install dbnd-airflow.

DBND supports the following Spark Engines:

Configuring Spark Configuration

Regardless of the engine type, numerous parameters control the submission of a Spark job. The most common ones set the amount of memory and CPU per job or add JAR/EGG files. Each Spark task has a SparkConfig object associated with it, and you can change these parameters through that object.

The SparkConfig object can be mutated in the following ways:

  1. Add values to the [spark] section of a configuration file. This affects all Spark tasks that run:
jars = ['${DBND_HOME}/databand_examples/tool_spark/spark_jvm/target/lib/jsoup-1.11.3.jar']
main_jar = ${DBND_HOME}/databand_examples/tool_spark/spark_jvm/target/ai.databand.examples-1.0-SNAPSHOT.jar
driver_memory = 2.5g
  2. Override config values as part of the task definition:
from dbnd_spark import SparkTask, SparkConfig
from dbnd import parameter, output

class PrepareData(SparkTask):
    text = parameter.data  # input data (parameter type assumed)
    counters = output

    main_class = "org.predict_wine_quality.PrepareData"
    # overrides values of the SparkConfig object
    defaults = {SparkConfig.driver_memory: "2.5g", "spark.executor_memory": "1g"}

    def application_args(self):
        return [self.text, self.counters]
  3. From the command line:
dbnd run PrepareData --set spark.executor_memory=2.5g --extend spark.conf='{"spark.driver.memoryOverhead": "4G"}'

Supported parameters:

spark.main_jar: Main application jar.
spark.driver_classpath: Additional, driver-specific classpath settings.
spark.jars: Additional jars to upload and place in the executor classpath.
spark.py_files: Additional Python files used by the job; can be .zip, .egg or .py.
spark.files: Additional files to upload to the executor running the job, separated by a comma. Files are placed in the working directory of each executor; for example, serialized objects.
spark.packages: A comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths.
spark.exclude_packages: A comma-separated list of Maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'.
spark.repositories: A comma-separated list of additional remote repositories to search for the Maven coordinates given with 'packages'.
spark.conf: Arbitrary Spark configuration properties. See Extending Values.
spark.num_executors: Number of executors to launch.
spark.total_executor_cores: (Standalone & Mesos only) Total cores for all executors.
spark.executor_cores: (Standalone & YARN only) Number of cores per executor.
spark.executor_memory: Memory per executor (e.g. 1000M, 2G) (Default: 1G).
spark.driver_memory: Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G).
spark.driver_cores: (Livy only) Number of cores in the driver.
spark.keytab: Full path to the file that contains the keytab.
spark.principal: The name of the Kerberos principal used for the keytab.
spark.env_vars: Environment variables for spark-submit. Supports yarn and k8s modes.
spark.verbose: Whether to pass the verbose flag to the spark-submit process for debugging.
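
To make the relationship between these parameters and a spark-submit call more concrete, here is a minimal sketch that turns a dictionary of [spark] values into a spark-submit argument list. The helper name build_spark_submit_args, the mapping table, and the app.jar file name are illustrative assumptions and not DBND's actual implementation; only the flags themselves are standard spark-submit options.

```python
# Illustrative only: a hypothetical helper, not part of DBND.
# Maps a few [spark] parameters to the standard spark-submit flags they resemble.
FLAG_BY_PARAM = {
    "jars": "--jars",
    "py_files": "--py-files",
    "files": "--files",
    "packages": "--packages",
    "num_executors": "--num-executors",
    "executor_cores": "--executor-cores",
    "executor_memory": "--executor-memory",
    "driver_memory": "--driver-memory",
}


def build_spark_submit_args(spark_params, main_jar, app_args=()):
    """Return a spark-submit argument list for the given parameters."""
    args = ["spark-submit"]
    for name, flag in FLAG_BY_PARAM.items():
        value = spark_params.get(name)
        if value is None:
            continue
        # List values (e.g. jars) are joined with commas, as spark-submit expects.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        args += [flag, str(value)]
    # Arbitrary Spark properties are passed as repeated --conf key=value pairs.
    for key, value in sorted(spark_params.get("conf", {}).items()):
        args += ["--conf", f"{key}={value}"]
    if spark_params.get("verbose"):
        args.append("--verbose")
    args.append(main_jar)
    args.extend(app_args)
    return args


example = build_spark_submit_args(
    {
        "driver_memory": "2.5g",
        "executor_memory": "1g",
        "conf": {"spark.driver.memoryOverhead": "4G"},
    },
    main_jar="app.jar",  # hypothetical jar name
)
print(" ".join(example))
```

The sketch covers only a subset of the parameters listed above; engines other than local spark_submit may apply the same settings differently.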

To override the configuration of a specific task, use --set TASK_NAME.task_config="{ 'spark' : { 'PARAMETER' : VALUE } }". For example:

--set PrepareData.task_config="{ 'spark' : {'num_executors' : 5} }"
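
If you generate such overrides from a script, a small helper can assemble the --set argument for you. The sketch below uses a hypothetical function, task_config_override, which is not part of DBND. It assumes the value can be written as a Python-style dictionary literal, as in the example above.

```python
def task_config_override(task_name, section, params):
    """Build the CLI arguments for a per-task config override (hypothetical helper)."""
    # repr() of a nested dict yields a single-quoted literal such as
    # {'spark': {'num_executors': 5}}, matching the style used on this page.
    value = repr({section: params})
    return ["--set", f"{task_name}.task_config={value}"]


override = task_config_override("PrepareData", "spark", {"num_executors": 5})
print(override)
```

Passing the result as list items to a process launcher, for example subprocess.run(["dbnd", "run", "PrepareData", *override]), avoids having to quote the dictionary for the shell.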

Alternatively, you can edit the configuration from inside your code or use environment variables. For details, see Defaults for Engines and Nested Tasks.

Q: Is it possible to edit py_files from CLI, without editing the project.cfg? Which environment variables can I set to change this configuration?

A: You can always set specific configuration values for each run:
dbnd run ….. --set spark.py_files=s3://…

Q: How can I use multiple clusters in the same pipeline?

A: You can change the configuration of a specific task via --set prepare_data.spark_engine=another_engine. Make sure the engine is defined in your configuration. You can also change the task_config of a pipeline, so that all of its internal tasks use a specific spark_engine. For example: --set prepare_data.task_config="{ 'some_qubole_engine' : {'cluster_label' : 'another_label'} }"
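
When several tasks need different engines, the per-task overrides can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: engine_overrides is a hypothetical helper, and my_pipeline and train_model are made-up names. Only the --set TASK.spark_engine=ENGINE form comes from the answer above.

```python
import shlex


def engine_overrides(engine_by_task):
    """Return --set arguments assigning a spark_engine to each task."""
    args = []
    for task_name, engine in engine_by_task.items():
        args += ["--set", f"{task_name}.spark_engine={engine}"]
    return args


# Hypothetical pipeline, task and engine names, for illustration only.
command = ["dbnd", "run", "my_pipeline"] + engine_overrides(
    {"prepare_data": "another_engine", "train_model": "emr"}
)
print(shlex.join(command))
```

Each engine named here must still have a definition in your configuration.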
