Learn and master the art of framing data analysis problems as Spark problems through over 15 hands-on examples, and then scale them up to run on cloud computing services.
In this repo, I built and ran more than 15 real examples of increasing complexity and studied each one myself.
- Learn the concepts of Spark's Resilient Distributed Datasets (RDDs)
- Develop and run Spark jobs quickly using Python
- Translate complex analysis problems into iterative or multi-stage Spark scripts
- Scale up to larger data sets using Amazon's Elastic MapReduce service
- Understand how Hadoop YARN distributes Spark across computing clusters
- Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX
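To give a feel for what "framing a problem as a Spark problem" can look like, here is a minimal word-count sketch in PySpark. It is not taken from the examples in this repo: the function names (`tokenize`, `word_count`, `run_local`), the sample lines, and the `local[*]` master setting are all assumptions chosen for illustration, and it assumes `pyspark` is installed.

```python
import re


def tokenize(line):
    """Split a line into lowercase words (hypothetical helper for this sketch)."""
    return re.findall(r"[a-z']+", line.lower())


def word_count(sc, lines):
    """Count words with RDD transformations: flatMap -> map -> reduceByKey."""
    rdd = sc.parallelize(lines)               # distribute the input as an RDD
    counts = (
        rdd.flatMap(tokenize)                 # one record per word
           .map(lambda word: (word, 1))       # key/value pairs
           .reduceByKey(lambda a, b: a + b)   # sum the counts per word
    )
    return sorted(counts.collect())           # bring results back to the driver


def run_local():
    """Run the sketch on a local Spark context (assumes pyspark is installed)."""
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("WordCountSketch")
    sc = SparkContext(conf=conf)
    try:
        sample = ["Spark makes big data simple", "Big data, big clusters"]
        for word, count in word_count(sc, sample):
            print(word, count)
    finally:
        sc.stop()                             # always release the context
```

Calling `run_local()` runs the job on your own machine. On a cluster the master is normally supplied by `spark-submit` rather than hard-coded, so the `setMaster` call would likely be dropped there.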