In this project, I analyze bike rental data with a regression tree. The code shows how to implement a Spark ML pipeline; a minimal sketch of such a pipeline is included after the file list below.
- hour.csv
The bike rental data to be analyzed.
- rdd.py
Analysis using the RDD API.
- dataframe.py
Analysis using the DataFrame API, which makes it easier to inspect the columns of the data.
- bike_dataframe.ipynb
The same DataFrame analysis as a notebook, which makes it easy to read the code alongside its results.
- DS_Bin.scala
Analysis using the Dataset API, implemented in Scala.
- DS_Bin.zip
Everything needed to build the Scala version: DS_Bin.scala together with Build.sbt.
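As a rough illustration of the pipeline idea, the sketch below reads hour.csv into a DataFrame, assembles feature columns, and fits a DecisionTreeRegressor. The feature columns, train/test split, and tree parameters here are assumptions for illustration only; the actual choices live in dataframe.py and bike_dataframe.ipynb.

```python
# Minimal sketch of a Spark ML regression-tree pipeline on hour.csv.
# Feature columns and parameters below are assumptions, not the exact ones used in this repo.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("bike-regression-tree").getOrCreate()

# hour.csv has a header row; infer numeric column types.
df = spark.read.csv("hour.csv", header=True, inferSchema=True)

# Assumed feature columns from the bike-sharing dataset; "cnt" is the label (rental count).
feature_cols = ["season", "mnth", "hr", "holiday", "weekday", "workingday",
                "weathersit", "temp", "atemp", "hum", "windspeed"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
tree = DecisionTreeRegressor(featuresCol="features", labelCol="cnt", maxDepth=10)
pipeline = Pipeline(stages=[assembler, tree])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

rmse = RegressionEvaluator(labelCol="cnt", predictionCol="prediction",
                           metricName="rmse").evaluate(predictions)
print("RMSE on the test set:", rmse)

spark.stop()
```

Submitting a script like this with spark-submit prints the test-set RMSE of the fitted tree.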
Install the required packages, download Hadoop and Spark (the archives are unpacked after the SSH setup below), and set up passwordless SSH for Hadoop:
$ sudo add-apt-repository ppa:webupd8team/java
$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ sudo apt-get update
$ sudo apt install oracle-java8-installer ssh python3-pip sbt
$ pip3 --no-cache-dir install numpy pandas
$ wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
$ wget http://ftp.twaren.net/Unix/Web/apache/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
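The downloaded archives are assumed to be unpacked to the /usr/local/hadoop and /usr/local/spark paths used in the configuration below; one way to do that (adjust paths as you prefer) is:
$ sudo tar -xzf hadoop-3.1.1.tar.gz -C /usr/local
$ sudo mv /usr/local/hadoop-3.1.1 /usr/local/hadoop
$ sudo tar -xzf spark-2.3.2-bin-hadoop2.7.tgz -C /usr/local
$ sudo mv /usr/local/spark-2.3.2-bin-hadoop2.7 /usr/local/spark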
Edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh and set:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
Append the following lines to ~/.bashrc (and reload it afterwards, as shown below):
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
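Then reload the shell configuration so the new variables take effect:
$ source ~/.bashrc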
Run the Python analyses with spark-submit:
$ spark-submit <your py file>
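For example, assuming hour.csv is in the location the scripts expect (the working directory or HDFS):
$ spark-submit rdd.py
$ spark-submit dataframe.py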
For the Scala version, build the jar with sbt and then submit it:
$ sbt package
$ spark-submit target/scala-2.11/<your jar file>