From 96af233501208738207189d8e756ee41e6b27a77 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:07:00 -0400
Subject: [PATCH 1/4] apache-spark

Get started with Apache Spark.

---
 README.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b4bc1e7..0cfb128 100644
--- a/README.md
+++ b/README.md

-# sparkTutorial
-Project source code for James Lee's Apache Spark with Java course.
-Check out the full list of DevOps and Big Data courses that James and Tao teach:
-https://www.level-up.one/courses/

# SPARK TUTORIAL

### INTRODUCTION TO APACHE SPARK
Apache Spark is a fast, in-memory data processing engine that lets data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets.

##### SPEED
Speed is critical when processing large datasets: it makes the difference between exploring data interactively and waiting minutes or hours.
- Spark runs computations in memory.
- Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
- It enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster on disk, than MapReduce.

##### GENERALITY
- A general programming model that lets developers write an application by composing arbitrary operators.
- Spark makes it easy to combine different processing models seamlessly in the same application.
- Example:
  - Classify data with the Spark machine learning library.
  - Ingest data from a source via Spark Streaming.
  - Query the resulting data in real time through Spark SQL.

### ENVIRONMENT NOTES
- Spark is built on top of the Scala programming language, which compiles to Java bytecode.
- Our Spark examples use Java 8 features.

Using IntelliJ IDEA, go to the folder where the code is cloned, then run:
- gradlew idea

RDD: Resilient Distributed Dataset

From ed1417569113ba3d882716e445d7d01be7408b0b Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:21:52 -0400
Subject: [PATCH 2/4] RDD

Basics of Resilient Distributed Datasets.

---
 README.md | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 0cfb128..77dff39 100644
--- a/README.md
+++ b/README.md

### RDD: Resilient Distributed Dataset
The RDD is the core object we work with when developing Spark applications.

##### What is a dataset?
A dataset is basically a collection of data; it can be a list of strings, a list of integers, or even rows in a relational database.
- RDDs can contain any type of object, including user-defined classes.
- An RDD is simply an encapsulation around a very large dataset. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
- Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
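The snippets in the remaining sections assume an already-created `JavaSparkContext` named `sc`. Below is a minimal sketch of how such a context might be set up; the class name, application name, and `local[*]` master URL are illustrative assumptions, not taken from the course code.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextSetup {
    public static void main(String[] args) {
        // "local[*]" runs Spark in-process using all available cores,
        // which is convenient for working through these examples; on a
        // real cluster this would be the cluster manager's URL instead.
        SparkConf conf = new SparkConf()
                .setAppName("sparkTutorial")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... create RDDs, apply transformations, and launch actions here ...

        sc.close();
    }
}
```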
##### What can we do with RDDs?
RDDs offer two types of operations: TRANSFORMATIONS and ACTIONS.

###### TRANSFORMATIONS
- Apply a function to the data in an RDD to create a new RDD.
- One of the most common transformations is `filter`, which returns a new RDD containing a subset of the data in the original RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));
```

###### ACTIONS
- Compute a result based on an RDD.
- One of the most popular actions is `first`, which returns the first element in an RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
String firstLine = lines.first();
```

##### Spark RDD general workflow
- Generate initial RDDs from external data.
- Apply transformations.
- Launch actions.

From c4525e78ae1b73712b98e69fefffe9f2ac1319c8 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:35:43 -0400
Subject: [PATCH 3/4] rdd-basics

How to create an RDD.

---
 README.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/README.md b/README.md
index 77dff39..886a5fe 100644
--- a/README.md
+++ b/README.md

#### CREATE RDDs
##### How to create an RDD
- Take an existing collection in your program and pass it to SparkContext's `parallelize` method.
```java
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
```
- All the elements in the collection are copied to form a distributed dataset that can be operated on in parallel.
- Very handy for creating an RDD with little effort.
- NOT practical for large datasets, since the entire collection must first fit in memory on a single machine.
- Alternatively, load RDDs from external storage by calling the `textFile` method on SparkContext.
```java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
```
- The external storage is usually a distributed file system such as **Amazon S3** or **HDFS**.
- There are other data sources that can be integrated with Spark and used to create RDDs, including JDBC, Cassandra, and Elasticsearch.

From 02b36e12e8d67d73d82a0e4e09f879f7c46bc2a6 Mon Sep 17 00:00:00 2001
From: Franco Arratia Lopez
Date: Tue, 30 Apr 2019 12:39:44 -0400
Subject: [PATCH 4/4] rdd-basics

Map and filter transformations.

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 886a5fe..8755d7c 100644
--- a/README.md
+++ b/README.md

#### MAP AND FILTER TRANSFORMATIONS
##### Transformations
- Transformations are operations on RDDs that return a new RDD.
- The two most common transformations are **filter** and **map**; see the sketch after this list.
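As referenced above, here is a small sketch that strings `map` and `filter` together with an action, following the general workflow (generate an RDD, transform it, launch an action). Reusing `in/uppercase.text` and computing line lengths are illustrative assumptions, not part of the original course code.

```java
// Generate: create an initial RDD from external data.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");

// Transform: filter keeps a subset of the elements, and map applies a
// function to each element -- here, mapping each line to its length.
JavaRDD<Integer> lineLengths = lines
        .filter(line -> !line.isEmpty())
        .map(String::length);

// Act: transformations are lazy, so nothing executes until an action
// such as count() triggers the actual computation.
long count = lineLengths.count();
```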