From 96af233501208738207189d8e756ee41e6b27a77 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:07:00 -0400
Subject: [PATCH 1/4] apache-spark

Get started with Apache Spark.

---
 README.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b4bc1e7..0cfb128 100644
--- a/README.md
+++ b/README.md

-# sparkTutorial
-Project source code for James Lee's Apache Spark with Java course.
-Check out the full list of DevOps and Big Data courses that James and Tao teach:
-https://www.level-up.one/courses/

# SPARK TUTORIAL

### INTRODUCTION TO APACHE SPARK
Apache Spark is a fast, in-memory data processing engine that lets data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets.

##### SPEED
Speed is critical when processing large datasets: it makes the difference between exploring data interactively and waiting minutes or hours.
- Spark runs computations in memory.
- Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
- It enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster on disk, than MapReduce.

##### GENERALITY
- A general programming model that lets developers write an application by composing arbitrary operators.
- Spark makes it easy to combine different processing models seamlessly in the same application.
- Example:
  - Classify data with the Spark machine learning library.
  - Ingest data from a source via Spark Streaming.
  - Query the resulting data in real time through Spark SQL.

### ENVIRONMENT NOTES
- Spark is built on top of the Scala programming language, which compiles to Java bytecode.
- Our Spark examples use Java 8 features.

Using IntelliJ IDEA, go to the folder where the code is cloned, then run:
- gradlew idea

RDD: Resilient Distributed Dataset

From ed1417569113ba3d882716e445d7d01be7408b0b Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:21:52 -0400
Subject: [PATCH 2/4] RDD

Basics of Resilient Distributed Datasets.

---
 README.md | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 0cfb128..77dff39 100644
--- a/README.md
+++ b/README.md

### RDD: Resilient Distributed Dataset
The RDD is the core object we work with when developing Spark applications.

##### What is a dataset?
A dataset is basically a collection of data; it can be a list of strings, a list of integers, or even rows in a relational database.
- RDDs can contain any type of object, including user-defined classes.
- An RDD is simply an encapsulation around a very large dataset. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
- Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
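The snippets in the remaining sections assume an already-created `JavaSparkContext` named `sc`. Below is a minimal sketch of how such a context might be set up; the class name, application name, and `local[*]` master URL are illustrative assumptions, not taken from the course code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextSetup {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process using all available cores,
        // which is convenient for working through these examples; on a
        // real cluster this would be the cluster manager's URL instead.
        SparkConf conf = new SparkConf()
                .setAppName("sparkTutorial")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... create RDDs, apply transformations, and launch actions here ...

        sc.close();
    }
}
```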
##### What can we do with RDDs?
RDDs offer two types of operations: TRANSFORMATIONS and ACTIONS.

###### TRANSFORMATIONS
- Apply a function to the data in an RDD to create a new RDD.
- One of the most common transformations is `filter`, which returns a new RDD containing a subset of the data in the original RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));
```

###### ACTIONS
- Compute a result based on an RDD.
- One of the most popular actions is `first`, which returns the first element in an RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
String firstLine = lines.first();
```

##### Spark RDD general workflow
- Generate initial RDDs from external data.
- Apply transformations.
- Launch actions.

From c4525e78ae1b73712b98e69fefffe9f2ac1319c8 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:35:43 -0400
Subject: [PATCH 3/4] rdd-basics

How to create an RDD.

---
 README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/README.md b/README.md
index 77dff39..886a5fe 100644
--- a/README.md
+++ b/README.md

#### CREATE RDDs
##### How to create an RDD
- Take an existing collection in your program and pass it to SparkContext's `parallelize` method.
```java
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
```
- All the elements in the collection are copied to form a distributed dataset that can be operated on in parallel.
- Very handy for creating an RDD with little effort.
- NOT practical for large datasets, since the entire collection must first fit in memory on a single machine.
- Alternatively, load RDDs from external storage by calling the `textFile` method on SparkContext.
```java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
```
- The external storage is usually a distributed file system such as **Amazon S3** or **HDFS**.
- There are other data sources that can be integrated with Spark and used to create RDDs, including JDBC, Cassandra, and Elasticsearch.

From 02b36e12e8d67d73d82a0e4e09f879f7c46bc2a6 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:39:44 -0400
Subject: [PATCH 4/4] rdd-basics

Map and filter transformations.

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 886a5fe..8755d7c 100644
--- a/README.md
+++ b/README.md

#### MAP AND FILTER TRANSFORMATIONS
##### Transformations
- Transformations are operations on RDDs that return a new RDD.
- The two most common transformations are **filter** and **map**; see the sketch after this list.
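As referenced above, here is a small sketch that strings `map` and `filter` together with an action, following the general workflow (generate an RDD, transform it, launch an action). Reusing `in/uppercase.text` and computing line lengths are illustrative assumptions, not part of the original course code.

```java
// Generate: create an initial RDD from external data.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");

// Transform: filter keeps a subset of the elements, and map applies a
// function to each element -- here, mapping each line to its length.
JavaRDD<Integer> lineLengths = lines
        .filter(line -> !line.isEmpty())
        .map(String::length);

// Act: transformations are lazy, so nothing executes until an action
// such as count() triggers the actual computation.
long count = lineLengths.count();
```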