Spark Inputs and Outputs (DataFrame)

Configuring Spark DataFrameReader/Writer objects in Databand.

Spark inline tasks load Spark DataFrames automatically by supplying the right configuration to Spark DataFrameReader/DataFrameWriter objects.

Note that the physical read is performed directly by Spark; DBND maps data loading/saving onto standard PySpark calls. When running on Spark, DBND supports the s3 and gs protocols only if your Spark instance supports direct reads from these storage types.

By default, data is loaded and saved using the default read/write options set by Spark.

CSV files are the only exception; their default read options are:

  • header = True
  • inferSchema = True

The default CSV write configuration is header=True.

To override these options, use the save_options and load_options methods of the parameter builder, as in the example below:

from dbnd_spark import spark_task
from dbnd import output, parameter
from targets.target_config import FileFormat
from pyspark.sql import DataFrame

@spark_task(result=output.save_options(FileFormat.csv, header=True)[DataFrame])
def prepare_data(
    data=parameter.load_options(FileFormat.csv, header=False, sep="\t")[DataFrame]
):
    return data

Supported file formats include:

  • FileFormat.txt
  • FileFormat.parquet
  • FileFormat.json

How to use your own code to read/write Spark

If you prefer to read or write Spark DataFrames yourself, see the following example:

from dbnd_spark import spark_task, get_spark_session
from dbnd import parameter
from targets.types import PathStr

@spark_task
def prepare_data(data_path=parameter[PathStr]):
    df = get_spark_session().read.format("csv").options(header=False, sep="\t").load(data_path)
    return df

Known limitations

  • For Azure Blob Store, your Spark instance should support the wasb protocol. DBND will translate the HTTPS protocol to wasb for Spark tasks automatically.
  • You cannot use DBND functionality for Spark DataFrame read/write if you need the basePath parameter or changes to the overwrite mode.
