Version 2.0
1. Start Spark from the command line; the parameters are the input/output file paths.
2. Use Spark to train a machine learning model and make predictions (see the sketch below).
3. Save the results as JSON to the output path.
4. Support different machine learning models (KMeans, Logistic Regression).
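For reference, a minimal sketch of what such a driver (PySpark.py) could look like; the LibSVM input format, the argument order (train, test, output), and the choice of LogisticRegressionWithLBFGS with numClasses=3 are illustrative assumptions, not necessarily the actual implementation:

import sys
import json

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    # Command-line parameters: train data path, test data path, output path.
    train_path, test_path, output_path = sys.argv[1:4]
    sc = SparkContext(appName="PySparkML")

    # Load LibSVM-formatted data as RDDs of LabeledPoint.
    train = MLUtils.loadLibSVMFile(sc, train_path)
    test = MLUtils.loadLibSVMFile(sc, test_path)

    # Train a multiclass logistic regression model; KMeans from
    # pyspark.mllib.clustering could be swapped in here instead.
    model = LogisticRegressionWithLBFGS.train(train, numClasses=3)

    # Predict on the test set and write one JSON record per example.
    test.map(lambda p: json.dumps({"label": p.label,
                                   "prediction": int(model.predict(p.features))})) \
        .saveAsTextFile(output_path)
    sc.stop()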
INSTRUCTIONS:
- Download Spark from http://spark.apache.org/downloads.html (version 1.5.2); please choose the appropriate package type according to your Hadoop version.
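e.g., for the package prebuilt against Hadoop 2.6 (adjust to your cluster's Hadoop version; this archive URL is an assumption based on Apache's release archive):
wget http://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -xzf spark-1.5.2-bin-hadoop2.6.tgz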
- To build Spark and its example programs, run: build/mvn -DskipTests clean package
- Install Python 2.7 (Python 3 should also be supported).
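To confirm which interpreter is installed on the cluster:
python --version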
- Log in to the cluster: ssh [email protected] (password: ask teammates)
- Copy files to the cluster's local filesystem:
scp SOURCE_FILE_PATH [email protected]:/home/honeycomb/SparkTeam
e.g.:
scp /Users/jacobliu/PySpark.py [email protected]:/home/honeycomb/SparkTeam
- Put files into HDFS:
hdfs dfs -put LOCAL_FILE_PATH HDFS_FILE_PATH
e.g.:
hdfs dfs -put /home/honeycomb/SparkTeam/sample_multiclass_classification_data_test.txt /user/honeycomb/sparkteam/input
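To verify the upload, list the HDFS directory:
hdfs dfs -ls /user/honeycomb/sparkteam/input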
- Put PySpark.py and the train/test datasets into HDFS, then run:
YOUR_SPARK_PATH/bin/spark-submit PySpark.py YOUR_TRAIN_DATA_PATH YOUR_TEST_DATA_PATH YOUR_OUTPUT_PATH
e.g.:
/bin/spark-submit /home/honeycomb/SparkTeam/PySpark.py /user/honeycomb/sparkteam/input/sample_multiclass_classification_data.txt /user/honeycomb/sparkteam/input/sample_multiclass_classification_data_test.txt /home/honeycomb/SparkTeam
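Spark writes the results as part files under the output path (assuming the driver saves with saveAsTextFile, as in the sketch above); to inspect them:
hdfs dfs -cat YOUR_OUTPUT_PATH/part-*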
Resources:
- Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html
- Hadoop Versions: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
- Python API Docs: https://spark.apache.org/docs/1.5.2/api/python/index.html
- Machine Learning Library (MLlib) Guide: http://spark.apache.org/docs/latest/mllib-guide.html