This is the code repository for Mastering Apache Spark 2.x - Second Edition, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.
Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality, such as graph processing, machine learning, stream processing, and SQL. This book aims to take your basic knowledge of Spark to the next level by teaching you how to extend Spark's functionality and implement your data flows and machine learning or deep learning programs on top of the platform.
All of the code is organized into folders, one per chapter. Each folder name identifies the chapter by number, for example, Chapter02.
The code will look like the following:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
You will need the following to work with the examples in this book:
- A laptop or PC with at least 6 GB main memory running Windows, macOS, or Linux
- VirtualBox 5.1.22 or above
- Hortonworks HDP Sandbox V2.6 or above
- Eclipse Neon or above
- Maven
- Eclipse Maven Plugin
- Eclipse Scala Plugin
- Eclipse Git Plugin