# SPARK TUTORIAL

### INTRODUCTION TO APACHE SPARK
Apache Spark is a fast, in-memory data processing engine that lets data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets.

##### SPEED
Speed is critical when processing large datasets: it is the difference between exploring data interactively and waiting minutes or hours.
- Runs computations in memory.
- Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
- Enables applications in Hadoop clusters to run up to 100 times faster in memory, and up to 10 times faster on disk, than MapReduce.
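
As a minimal sketch of the in-memory point above (assuming a ```JavaSparkContext``` named ```sc```, created as shown later in this tutorial), an RDD can be explicitly persisted in memory so repeated actions avoid recomputation:
```java
// Assumes an existing JavaSparkContext named sc, created as shown later
// in this tutorial. StorageLevel is org.apache.spark.storage.StorageLevel.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");

// Ask Spark to keep this RDD in memory once it has been computed.
lines.persist(StorageLevel.MEMORY_ONLY());

long first = lines.count();  // first action reads the file and fills the cache
long again = lines.count();  // second action is served from memory
```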

##### GENERALITY
- A general programming model that enables developers to write an application by composing arbitrary operators.
- Spark makes it easy to combine different processing models seamlessly in the same application.
- Example:
  - Classifying data through Spark's machine learning library.
  - Streaming data from a source via Spark Streaming.
  - Querying the resulting data in real time through Spark SQL.


### ENVIRONMENT NOTES
- Spark is built on top of the Scala programming language, which compiles to Java bytecode.
- Our Spark examples use Java 8 features.


If you are using IntelliJ IDEA, go to the folder where the code is cloned and run:
- ```gradlew idea```

### RDD: Resilient Distributed Dataset
The RDD is the core object we will use when developing Spark applications.

##### What is a dataset?
A dataset is basically a collection of data; it can be a list of strings, a list of integers or even a number of rows in a relational database.
- RDDs can contain any type of object, including user-defined classes (see the sketch after this list).
- An RDD is simply an encapsulation around a very large dataset. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.
- Under the hood, Spark will automatically distribute the data contained in RDDs across your cluster and parallelize the operations you perform on them.
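
As a hedged sketch of the user-defined-class point above (the ```Person``` class and sample data are illustrative, not part of the course code, and the ```parallelize``` method is covered below):
```java
// Illustrative user-defined class; it must be Serializable so Spark can
// ship instances across the cluster.
public class Person implements java.io.Serializable {
    public final String name;
    public final int age;
    public Person(String name, int age) { this.name = name; this.age = age; }
}

// Assumes an existing JavaSparkContext named sc, as in the other snippets.
List<Person> people = Arrays.asList(new Person("Ann", 34), new Person("Bob", 28));
JavaRDD<Person> personRdd = sc.parallelize(people);
```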

##### What can we do with RDDs?
RDDs offer two types of operations: TRANSFORMATIONS and ACTIONS.
###### TRANSFORMATIONS.
- Apply a function to the data in an RDD to create a new RDD.
- One of the most common transformations is ```filter```, which returns a new RDD containing a subset of the data in the original RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));
```
###### ACTIONS.
- Compute a result based on an RDD.
- One of the most popular actions is ```first```, which returns the first element in an RDD.
```java
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
String firstLine = lines.first();
```
##### Spark RDD General Workflow
- Generate initial RDDs from external data.
- Apply transformations.
- Launch actions.
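
Putting the three steps together as a minimal, runnable sketch (the class name, app name, and ```local[*]``` master are illustrative choices, not from the course code):
```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WorkflowExample {
    public static void main(String[] args) {
        // Illustrative configuration: run locally with as many threads as cores.
        SparkConf conf = new SparkConf().setAppName("workflow").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Generate an initial RDD from external data.
        JavaRDD<String> lines = sc.textFile("in/uppercase.text");

        // 2. Apply a transformation (lazy: nothing is computed yet).
        JavaRDD<String> linesWithFriday = lines.filter(line -> line.contains("Friday"));

        // 3. Launch an action, which triggers the actual computation.
        long count = linesWithFriday.count();
        System.out.println("Lines containing Friday: " + count);

        sc.close();
    }
}
```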

#### CREATE RDDs
##### How to create an RDD
- Take an existing collection in your program and pass it to SparkContext's parallelize method.
```java
List<Integer> inputIntegers = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
```
- All the elements in the collection will then be copied to form a distributed dataset that can be operated on in parallel.
- Very handy to create an RDD with little effort.
- NOT practical for large datasets, since the entire collection must first fit in memory on a single machine.
- Load RDDs from external storage by calling the ```textFile``` method on SparkContext.
```java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("in/uppercase.text");
```
- The external storage is usually a distributed file system such as **Amazon S3** or **HDFS**.
- There are other data sources that can be integrated with Spark and used to create RDDs, including JDBC, Cassandra, and Elasticsearch.
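
The same ```textFile``` call accepts distributed file system URIs; a hedged sketch (the host names, bucket, and paths below are illustrative placeholders):
```java
// Assumes an existing JavaSparkContext named sc. The URIs below are
// illustrative placeholders; reading from S3 also requires the Hadoop
// AWS connector and credentials to be configured.
JavaRDD<String> hdfsLines = sc.textFile("hdfs://namenode:8020/data/uppercase.text");
JavaRDD<String> s3Lines = sc.textFile("s3a://my-bucket/data/uppercase.text");
```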

#### MAP AND FILTER TRANSFORMATION
##### Transformations
- Transformations are operations on RDDs that return a new RDD.
- The two most common transformations are **filter** (shown above) and **map**, sketched below.
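
As a sketch of ```map``` consistent with the snippets above (assuming the same ```sc``` and input file), ```map``` applies a function to every element and returns a new RDD of the results:
```java
// Assumes the same JavaSparkContext sc and input file as the earlier snippets.
JavaRDD<String> lines = sc.textFile("in/uppercase.text");

// map applies a function to every element and returns a new RDD of the
// results; here each line is mapped to its length.
JavaRDD<Integer> lineLengths = lines.map(line -> line.length());
```
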
Check out the full list of DevOps and Big Data courses that James and Tao teach.

https://www.level-up.one/courses/