Version 2.0
1. Start Spark from the command line; the parameters are the input/output file paths.
2. Use Spark to train a machine learning model and make predictions (see the sketch below).
3. Save the results as JSON to the output path.
4. Support different machine learning models (KMeans, Logistic Regression).
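For reference, a minimal sketch of what such a driver (PySpark.py) could look like; the LibSVM input format, the argument order (train, test, output), and the choice of LogisticRegressionWithLBFGS with numClasses=3 are illustrative assumptions, not necessarily the actual implementation:

import sys
import json

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    # Command-line parameters: train data path, test data path, output path.
    train_path, test_path, output_path = sys.argv[1:4]
    sc = SparkContext(appName="PySparkML")

    # Load LibSVM-formatted data as RDDs of LabeledPoint.
    train = MLUtils.loadLibSVMFile(sc, train_path)
    test = MLUtils.loadLibSVMFile(sc, test_path)

    # Train a multiclass logistic regression model; KMeans from
    # pyspark.mllib.clustering could be swapped in here instead.
    model = LogisticRegressionWithLBFGS.train(train, numClasses=3)

    # Predict on the test set and write one JSON record per example.
    test.map(lambda p: json.dumps({"label": p.label,
                                   "prediction": int(model.predict(p.features))})) \
        .saveAsTextFile(output_path)
    sc.stop()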
INSTRUCTIONS:
- Download Spark from http://spark.apache.org/downloads.html (version 1.5.2); please choose the appropriate package type according to your Hadoop version.
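e.g., for the package prebuilt against Hadoop 2.6 (adjust to your cluster's Hadoop version; this archive URL is an assumption based on Apache's release archive):
wget http://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -xzf spark-1.5.2-bin-hadoop2.6.tgz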
- To build Spark and its example programs, run: build/mvn -DskipTests clean package
- Install Python 2.7 (Python 3 should also be supported).
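To confirm which interpreter is installed on the cluster:
python --version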
- Log in to the cluster: ssh [email protected] (password: ask teammates)
- Copy files to the cluster's local filesystem:
scp SOURCE_FILE_PATH [email protected]:/home/honeycomb/SparkTeam
e.g.:
scp /Users/jacobliu/PySpark.py [email protected]:/home/honeycomb/SparkTeam
- Put files into HDFS:
hdfs dfs -put LOCAL_FILE_PATH HDFS_FILE_PATH
e.g.:
hdfs dfs -put /home/honeycomb/SparkTeam/sample_multiclass_classification_data_test.txt /user/honeycomb/sparkteam/input
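To verify the upload, list the HDFS directory:
hdfs dfs -ls /user/honeycomb/sparkteam/input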
- Put PySpark.py and the train/test datasets into HDFS, then run:
YOUR_SPARK_PATH/bin/spark-submit PySpark.py YOUR_TRAIN_DATA_PATH YOUR_TEST_DATA_PATH YOUR_OUTPUT_PATH
e.g.:
/bin/spark-submit /home/honeycomb/SparkTeam/PySpark.py /user/honeycomb/sparkteam/input/sample_multiclass_classification_data.txt /user/honeycomb/sparkteam/input/sample_multiclass_classification_data_test.txt /home/honeycomb/SparkTeam
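Spark writes the results as part files under the output path (assuming the driver saves with saveAsTextFile, as in the sketch above); to inspect them:
hdfs dfs -cat YOUR_OUTPUT_PATH/part-*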
Resources:
- Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
- Hadoop Versions: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
- Python API Docs: https://spark.apache.org/docs/1.5.2/api/python/index.html
- Machine Learning Library (MLlib) Guide: http://spark.apache.org/docs/latest/mllib-guide.html