Spark eXperiments

$ $DEV/apache-github/spark/bin/spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
Using Scala version 2.12.18

scala> :quit

Spark official resources

Build and test Spark

Latest Spark 4.0.0-SNAPSHOT built from source using Maven:

$ cd $DEV/apache-github/spark/
$ ./build/mvn -DskipTests clean package
Build Spark submodules using the mvn -pl option, for example:
$ ./build/mvn -pl :spark-streaming_2.12 clean install
Or build using SBT:
$ ./build/sbt package

Testing with SBT

$ ./build/sbt
sbt> core/test
sbt> testOnly org.apache.spark.scheduler.DAGSchedulerSuite
sbt> testOnly *DAGSchedulerSuite
sbt> testOnly org.apache.spark.scheduler.*
sbt> testOnly *DAGSchedulerSuite -- -z "[SPARK-3353]"
$ build/sbt "core/testOnly *DAGSchedulerSuite -- -z SPARK-3353"
To see test logs:
$ cat core/target/unit-tests.log
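
Each command typed at the sbt prompt above can also be run as a one-shot invocation from the shell, as the SPARK-3353 line does. A minimal sketch of that mapping follows; the helper name sbt_test_only is hypothetical, and it only prints the command rather than running it:

```shell
# Hypothetical helper: prints (does not run) the one-shot build/sbt command
# equivalent to typing "<project>/testOnly <suite> -- -z <filter>" at the sbt prompt.
sbt_test_only() {
  project=$1       # sbt project, e.g. core
  suite=$2         # suite class name or glob, e.g. *DAGSchedulerSuite
  filter=${3:-}    # optional substring of the test name, passed after "-- -z"
  if [ -n "$filter" ]; then
    printf 'build/sbt "%s/testOnly %s -- -z %s"\n' "$project" "$suite" "$filter"
  else
    printf 'build/sbt "%s/testOnly %s"\n' "$project" "$suite"
  fi
}

# Quote the glob so the shell passes it to sbt unexpanded.
sbt_test_only core '*DAGSchedulerSuite' SPARK-3353
```

The whole sbt command is wrapped in double quotes so that sbt receives it as a single argument.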

Testing with Maven

To run individual Scala tests:
$ build/mvn test \
-Dtest=none -DwildcardSuites=org.apache.spark.scheduler.DAGSchedulerSuite
To run individual Java tests:
$ build/mvn test \
-DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite
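
The two Maven forms mirror each other: a Scala suite is selected with -DwildcardSuites while -Dtest=none switches the Java tests off, and a Java suite is selected with -Dtest while -DwildcardSuites=none switches the Scala suites off. A small sketch of that symmetry, with the test phase given once; the helper name mvn_single_test is hypothetical, and it only prints the command:

```shell
# Hypothetical helper: prints (does not run) the Maven command for one suite.
mvn_single_test() {
  kind=$1    # "scala" or "java"
  suite=$2   # fully qualified suite class name
  case "$kind" in
    scala) printf 'build/mvn test -Dtest=none -DwildcardSuites=%s\n' "$suite" ;;
    java)  printf 'build/mvn test -DwildcardSuites=none -Dtest=%s\n' "$suite" ;;
    *)     echo "unknown kind: $kind (expected scala or java)" >&2; return 1 ;;
  esac
}

mvn_single_test scala org.apache.spark.scheduler.DAGSchedulerSuite
mvn_single_test java  org.apache.spark.streaming.JavaAPISuite
```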

Project notes: running with various versions of Spark, Scala and Log4j

To compile, test, assemble and package for all supportedScalaVersions (2.12 and 2.13), run:

sbt> clean;+test:compile;+test;+assemblies;+package

Run with the latest SNAPSHOT version of Spark and modified


$ cp $DEV/apache-github/spark/conf/ \
and change "rootLogger.level = info" to "rootLogger.level = warn".

Execute spark-submit:

$ cd /Users/owhite/dev/whitechno-github/spica/iskra

$ $DEV/apache-github/spark/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \
~~~ Spark 4.0.0-SNAPSHOT (Scala 2.12.18, Java 1.8.0_381, Mac OS X 12.6.8) on local[4] with 4 cores ~~~
	applicationId=local-1691369604766, deployMode=client, isLocal=true
	uiWebUrl at

Execute spark-submit using Spark's default log4j profile

No control over Log4j

For Log4j 1.2, Spark's default log4j profile: org/apache/spark/ See Spark 3.2.1 Logging and Utils.setLogLevel.

$ $DEV/spark-bin/spark-3.2.1-bin-hadoop2.7/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \

For Log4j 2.0, Spark's default log4j profile: org/apache/spark/ See Spark 3.3.0 Logging and Utils.setLogLevel.

$ $DEV/spark-bin/spark-3.3.0-bin-hadoop2/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \
$ $DEV/spark-bin/spark-3.3.0-bin-hadoop3-scala2.13/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \

Execute spark-submit with --driver-java-options

This is the FIRST preferred method of controlling Log4j.

This works only in local client mode. In standalone and cluster mode additional spark-submit settings are needed.

Use -Dlog4j.configuration=file:$DEV/spark-bin/conf/ with Log4j 1.2.

Run locally:

$ $DEV/spark-bin/spark-3.2.1-bin-hadoop2.7/bin/spark-submit \
  --master local[4] \
  --driver-java-options \
"-Dlog4j.configuration=file:simple-spark-submit/spark-submit-conf/" \
  --class "iskra.SimpleApp" \

Use -Dlog4j.configurationFile=file:$DEV/spark-bin/conf/ with Log4j 2.0

Run locally:

$ $DEV/spark-bin/spark-3.3.0-bin-hadoop3-scala2.13/bin/spark-submit \
  --master local[4] \
  --driver-java-options \
"-Dlog4j.configurationFile=file:simple-spark-submit/spark-submit-conf/" \
  --class "iskra.SimpleApp" \
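
The only difference between the two local runs above is the system property name: log4j.configuration for Log4j 1.2 and log4j.configurationFile for Log4j 2.0. A minimal sketch that picks the property from the Spark version, assuming every release before 3.3.0 uses Log4j 1.2 and every release from 3.3.0 on uses Log4j 2.0; the helper name log4j_java_option is hypothetical:

```shell
# Hypothetical helper: prints the Log4j system property for a Spark version.
# Assumption: Spark < 3.3.0 ships Log4j 1.2, Spark >= 3.3.0 ships Log4j 2.0.
log4j_java_option() {
  spark_version=$1             # e.g. 3.2.1 or 4.0.0-SNAPSHOT
  config_path=$2               # path to the Log4j configuration file
  major=${spark_version%%.*}   # text before the first dot
  rest=${spark_version#*.}     # text after the first dot
  minor=${rest%%.*}            # text between the first and second dot
  if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 3 ]; }; then
    printf '%s%s\n' '-Dlog4j.configurationFile=file:' "$config_path"
  else
    printf '%s%s\n' '-Dlog4j.configuration=file:' "$config_path"
  fi
}

log4j_java_option 3.2.1 /tmp/demo.properties
log4j_java_option 3.3.0 /tmp/demo.properties
```

The printed string is what would be passed to --driver-java-options; /tmp/demo.properties is a placeholder path.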

Run on a Spark standalone cluster in client deploy mode:

$ $DEV/spark-bin/spark-3.3.0-bin-hadoop3-scala2.13/bin/spark-submit \
  --master spark://Olegs-MacBook-Pro.local:7077 \
  --driver-java-options \
"-Dlog4j.configurationFile=file:simple-spark-submit/spark-submit-conf/" \
  --class "iskra.SimpleApp" \

Run on a Spark standalone cluster in cluster deploy mode:

$ $DEV/spark-bin/spark-3.3.0-bin-hadoop3-scala2.13/bin/spark-submit \
  --master spark://Olegs-MacBook-Pro.local:7077 \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=\
-Dlog4j.configurationFile=file:$PWD/simple-spark-submit/spark-submit-conf/" \
  --conf "spark.executor.extraJavaOptions=\
-Dlog4j.configurationFile=file:$PWD/simple-spark-submit/spark-submit-conf/" \
  --class "iskra.SimpleApp" \

Note that the file needs to exist locally on all the nodes. To satisfy that condition, either upload the file to a location available to all nodes (such as HDFS) or, when using deploy-mode client, let the driver read it from the local file system. Alternatively, upload a custom Log4j configuration file with spark-submit by adding it to the --files list of files to be uploaded with the application. Something like this:

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf "" \
    --conf "" \
    --files "/absolute/path/to/your/" \
    --class com.github.atais.Main \

Note that files uploaded to the cluster with --files are available in the root directory, so there is no need to add any path to the file name in the configuration URI. Files listed in --files must be provided with absolute paths. The file: prefix in the configuration URI is mandatory.
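
The three rules in the note (absolute path in --files, bare file name in the URI, mandatory file: prefix) can be sketched as a small argument builder. The helper name files_log4j_args is hypothetical. The example above leaves its --conf values empty, so the use of spark.driver.extraJavaOptions and spark.executor.extraJavaOptions with the Log4j 2.0 property here is an assumption borrowed from the standalone cluster example:

```shell
# Hypothetical helper: prints spark-submit arguments for shipping a Log4j
# configuration file with --files. It prints only; nothing is submitted.
files_log4j_args() {
  config=$1
  case "$config" in
    /*) ;;   # absolute path, as --files requires
    *)  echo "error: --files needs an absolute path, got: $config" >&2; return 1 ;;
  esac
  name=${config##*/}   # bare file name: uploaded files land in the root dir
  printf '%s\n' "--files $config"
  # Assumption: same property names as in the standalone cluster example.
  printf '%s\n' "--conf spark.driver.extraJavaOptions=-Dlog4j.configurationFile=file:$name"
  printf '%s\n' "--conf spark.executor.extraJavaOptions=-Dlog4j.configurationFile=file:$name"
}

files_log4j_args /opt/conf/demo.properties
```

A relative path is rejected up front, because the mistake is otherwise easy to miss until the job is already running on the cluster.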

Create conf dir with and

This directory can be used with an overridden SPARK_CONF_DIR environment variable.

For Log4j 1.2:

$ cp $DEV/spark-bin/spark-3.2.1-bin-hadoop3.2/conf/ \

and change "log4j.rootCategory=INFO, console" to "log4j.rootCategory=ERROR, console".

For Log4j 2.0:

$ cp $DEV/spark-bin/spark-3.3.0-bin-hadoop2/conf/ \

and change "rootLogger.level = info" to "rootLogger.level = error".
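
Both edits above can be scripted instead of done by hand. The sketch below works on throwaway copies in a temporary directory; the file names demo-log4j1.properties and demo-log4j2.properties and the helper edit_in_place are hypothetical stand-ins, not the real configuration file names:

```shell
# Hypothetical helper: apply a sed expression to a file and replace it.
# A temp file is used instead of "sed -i", whose syntax differs between
# GNU sed and the BSD sed shipped with macOS.
edit_in_place() {
  sed "$1" "$2" > "$2.new" && mv "$2.new" "$2"
}

# Throwaway copies, so no real conf directory is touched.
workdir=$(mktemp -d)
printf 'log4j.rootCategory=INFO, console\n' > "$workdir/demo-log4j1.properties"
printf 'rootLogger.level = info\n'          > "$workdir/demo-log4j2.properties"

# Log4j 1.2: INFO -> ERROR
edit_in_place 's/^log4j\.rootCategory=INFO, console$/log4j.rootCategory=ERROR, console/' \
  "$workdir/demo-log4j1.properties"

# Log4j 2.0: info -> error
edit_in_place 's/^rootLogger\.level = info$/rootLogger.level = error/' \
  "$workdir/demo-log4j2.properties"

cat "$workdir/demo-log4j1.properties" "$workdir/demo-log4j2.properties"
```

The patterns are anchored with ^ and $ so that only the root logger line is rewritten and other loggers keep their levels.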

Execute spark-submit with overridden SPARK_CONF_DIR environment variable

This is the SECOND preferred method of controlling Log4j.

$ export SPARK_CONF_DIR=$DEV/spark-bin/conf
$ $DEV/spark-bin/spark-3.2.1-bin-hadoop2.7/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \
$ export SPARK_CONF_DIR=$DEV/spark-bin/conf
$ $DEV/spark-bin/spark-3.3.0-bin-hadoop3-scala2.13/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \
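
An export keeps SPARK_CONF_DIR set for the rest of the shell session. When the override is wanted for a single run only, the variable can instead be put in front of the command. A minimal sketch of the difference, where sh -c stands in for the spark-submit call and /tmp/demo-conf is a placeholder directory:

```shell
# Start from a clean state for the demonstration.
unset SPARK_CONF_DIR

# One-off override: the variable exists only in the environment of this
# single command. "sh -c ..." is a stand-in for spark-submit.
SPARK_CONF_DIR=/tmp/demo-conf sh -c 'echo "child sees:  $SPARK_CONF_DIR"'

# The calling shell is unaffected, so later runs use the default conf dir.
echo "parent sees: ${SPARK_CONF_DIR:-<unset>}"
```

This avoids a stale SPARK_CONF_DIR silently affecting a later run against a different Spark version.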

Execute spark-submit with modified in its default conf location

This is the THIRD preferred method of controlling Log4j.

$ cp $DEV/spark-bin/spark-3.2.1-bin-hadoop3.2-scala2.13/conf/ \

and change "log4j.rootCategory=INFO, console" to "log4j.rootCategory=ERROR, console".

$ $DEV/spark-bin/spark-3.2.1-bin-hadoop3.2-scala2.13/bin/spark-submit \
  --master local[4] \
  --class "iskra.SimpleApp" \

Spark releases

github tags
Sonatype | Maven Central Repository

  • 3.4 both Scala 2.12 (Hadoop 2.7 and 3.3) and Scala 2.13 (Hadoop 3.3)
    • 3.4.1 - Jun 19, 2023
    • 3.4.0 - Apr 06, 2023
  • 3.3 both Scala 2.12 (Hadoop 2.7 and 3.3) and Scala 2.13 (Hadoop 3.3)
    • 3.3.2 - Feb 10, 2023
    • 3.3.1 - Oct 14, 2022
    • 3.3.0 - Jun 09, 2022 (first version with log4j 2.0)
  • 3.2 both Scala 2.12 (Hadoop 2.7 and 3.3) and Scala 2.13 (Hadoop 3.3)
    • 3.2.4 - Apr 09, 2023
    • 3.2.3 - Nov 14, 2022
    • 3.2.2 - Jul 11, 2022
    • 3.2.1 - Jan 19, 2022 (last version with log4j 1.2)
    • 3.2.0 - Oct 06, 2021
  • 3.1 Scala 2.12
    • 3.1.3 - Feb 06, 2022 (Hadoop 2.7 and 3.2)
    • 3.1.2 - May 23, 2021
    • 3.1.1 - Feb 21, 2021
    • 3.1.0 - Jan 05, 2021
  • 3.0 Scala 2.12
    • 3.0.3 - Jun 14, 2021 (Hadoop 2.7 and 3.2)
    • 3.0.2 - Feb 19, 2021
    • 3.0.1 - Aug 27, 2020
    • 3.0.0 - Jun 05, 2020
  • 2.4 Scala 2.11 (Hadoop 2.7.3)
    • 2.4.8 - May 09, 2021
    • 2.4.7 - Sep 07, 2020
    • 2.4.6 - May 29, 2020
    • 2.4.5 - Feb 02, 2020

Downloaded pre-built Spark packages

Download Apache Spark

  • 3.4 (Scala 2.12 and 2.13)
    • 3.4.1 - Jun 23, 2023 (Hadoop 3.3 only)
      • Scala 2.12 and Hadoop 3.3.4
      • Scala 2.13 and Hadoop 3.3.4
  • 3.3 (Scala 2.12 and 2.13)
    • 3.3.2 - Feb 17, 2023
    • 3.3.1 - Oct 25, 2022
      • Scala 2.12 and Hadoop 2.7.4
      • Scala 2.12 and Hadoop 3.3.2
      • Scala 2.13 and Hadoop 3.3.2
    • 3.3.0 - Jun 16, 2022 (first version with log4j 2.0)
      • Scala 2.12 and Hadoop 2.7.4
      • Scala 2.12 and Hadoop 3.3.2
      • Scala 2.13 and Hadoop 3.3.2
  • 3.2 (Scala 2.12 and 2.13)
    • 3.2.4 - Apr 13, 2023
    • 3.2.1 - Jan 26, 2022 (last version with log4j 1.2)
      • Scala 2.12 and Hadoop 2.7.4
      • Scala 2.12 and Hadoop 3.3.1
      • Scala 2.13 and Hadoop 3.3.1
    • 3.2.0 - Oct 13, 2021
      • Scala 2.12 and Hadoop 3.3.1
      • Scala 2.13 and Hadoop 3.3.1
  • 3.1 (Scala 2.12)
    • 3.1.3 - Feb 18, 2022
      • Hadoop 2.7.4
      • Hadoop 3.2.0
    • 3.1.2 - Jun 01, 2021 (Hadoop 3.2.0)
    • 3.1.1 - Mar 02, 2021 (Hadoop 2.7.4)
  • 3.0 (Scala 2.12)
    • 3.0.2 - Feb 19, 2021
    • 3.0.1
    • 3.0.0 (Hadoop 2.7.4)
  • 2.4 (Scala 2.11)
    • 2.4.8 - May 17, 2021 (Hadoop 2.7.3)
    • 2.4.7 - Sep 12, 2020 (Hadoop 2.7.3)


Hadoop Releases Archive

  • 3.3
    • 3.3.4 - Aug 08, 2022
    • 3.3.3 - May 17, 2022
    • 3.3.2 - Mar 03, 2022 (Spark 3.3.0)
    • 3.3.1 - Jun 15, 2021 (Spark 3.2.0)
    • 3.3.0 - Jul 14, 2020
  • 3.2
    • 3.2.4 - Jul 22, 2022
    • 3.2.3 - Mar 28, 2022
    • 3.2.2 - Jan 09, 2021
    • 3.2.1 - Sep 22, 2019
    • 3.2.0 - Jan 16, 2019 (stable) (Spark 3.1.2)
  • 3.1
    • 3.1.4 - Aug 03, 2020
    • 3.1.3 - Oct 21, 2019
    • 3.1.2 - Feb 06, 2019
    • 3.1.1 - Aug 08, 2018 (stable)
    • 3.1.0 - Apr 06, 2018
  • 3.0
    • 3.0.3 - May 31, 2018
    • 3.0.2 - Apr 21, 2018
    • 3.0.1 - Mar 25, 2018
    • 3.0.0 - Dec 13, 2017
  • 2.10
    • 2.10.2 - May 31, 2022
    • 2.10.1 - Sep 21, 2020
    • 2.10.0 - Oct 29, 2019 (stable)
  • 2.9
    • 2.9.2 - Nov 19, 2018
    • 2.9.1 - May 03, 2018
    • 2.9.0 - Dec 17, 2017
  • 2.8
    • 2.8.5 - Sep 15, 2018
    • 2.8.4 - May 15, 2018
    • 2.8.3 - Dec 12, 2017
    • 2.8.2 - Oct 24, 2017
    • 2.8.1 - Jun 08, 2017
    • 2.8.0 - Mar 22, 2017
  • 2.7
    • 2.7.7 - May 31, 2018
    • 2.7.6 - Apr 16, 2018
    • 2.7.5 - Dec 14, 2017
    • 2.7.4 - Aug 04, 2017 (Spark 3.0.0)
    • 2.7.3 - Aug 26, 2016 (Spark 2.4)
    • 2.7.2 - Jan 25, 2016
    • 2.7.1 - Jul 06, 2015 (stable)
    • 2.7.0 - Apr 21, 2015


Maven Releases History

  • 3.9.0 - 2023-01-24
  • 3.8.7 - 2022-12-24
  • 3.8.6 - 2022-06-06
  • 3.8.5 - 2022-03-05
  • 3.8.4 - 2021-11-14
  • 3.8.3 - 2021-09-27
  • 3.8.2 - 2021-08-04
  • 3.8.1 - 2021-04-04
  • 3.6.3 - 2019-11-25
  • 3.6.2 - 2019-08-27
  • 3.6.1 - 2019-04-04
  • 3.6.0 - 2018-10-24

Spark and scalatest

Spark's own codebase provides good examples and best practices for using scalatest to unit-test Spark code. In particular, the spark-sql module has the following test setup.

SharedSparkSession and SharedSparkSessionBase: suites extending the trait SharedSparkSession share resources (e.g. SparkSession) across their tests. That trait initializes the Spark session in its beforeAll() implementation. SharedSparkSessionBase is a helper trait for SQL test suites where all tests share a single TestSparkSession.

TestSparkSession and TestSQLContext: TestSparkSession is the session used for all tests in a SharedSparkSession suite. By default, the underlying org.apache.spark.SparkContext runs in local mode with the default test configurations.

Example of a test suite: class DatasetSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlanHelper

Spark and Log4j

Spark submit, provided dependencies and assembly packages

Other resources


Configuring Spark for optimization.






