Merge pull request Intel-bigdata#507 from mpjlu/fixReadme
fix readme
Meng, Peng authored Oct 26, 2017
2 parents df9c276 + 1073235 commit b06a7c1
Showing 1 changed file with 31 additions and 9 deletions: README.md

There are 19 workloads in HiBench in total. The workloads are divided into 6 categories.

**Machine Learning:**

1. Bayesian Classification (Bayes)

This workload benchmarks Naive Bayesian Classification implemented in Spark MLlib. The workload uses automatically generated documents whose words follow the Zipfian distribution. The dictionary used for text generation is the default Linux word file /usr/share/dict/linux.words. (An illustrative Spark MLlib training sketch appears after this list.)

2. K-means clustering (Kmeans)

This workload tests K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) implemented in Spark MLlib. The input data set is generated by GenKMeansDataset based on uniform and Gaussian distributions.

3. Logistic Regression (LR)

This workload benchmarks Logistic Regression (LR) implemented in Spark MLlib with the LBFGS optimizer. The input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three kinds of data: categorical, continuous, and binary.

4. Alternating Least Squares (ALS)

This workload benchmarks the Alternating Least Squares (ALS) algorithm implemented in Spark MLlib. The input data set is generated by RatingDataGenerator for a product recommendation system.

5. Gradient Boosting Tree (GBT)

This workload benchmarks Gradient Boosting Tree (GBT) implemented in Spark MLlib. The input data set is generated by GradientBoostingTreeDataGenerator.

6. Linear Regression (LiR)

This workload benchmarks Linear Regression (LiR) implemented in Spark MLlib with the SGD optimizer. The input data set is generated by LinearRegressionDataGenerator.

7. Latent Dirichlet Allocation (LDA)

This workload benchmarks Latent Dirichlet Allocation (LDA) implemented in Spark MLlib. The input data set is generated by LDADataGenerator.

8. Principal Components Analysis (PCA)

This workload benchmarks Principal Components Analysis (PCA) implemented in Spark MLlib. The input data set is generated by PCADataGenerator.

9. Random Forest (RF)

This workload benchmarks Random Forest (RF) implemented in Spark MLlib. The input data set is generated by RandomForestDataGenerator.

10. Support Vector Machine (SVM)

This workload benchmarks Support Vector Machine (SVM) implemented in Spark MLlib. The input data set is generated by SVMDataGenerator.

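The ML workloads above follow a common pattern: a HiBench data generator produces the input data set, and the corresponding Spark MLlib algorithm is benchmarked on it. For orientation, below is a minimal Scala sketch of how the models behind the Bayes and Kmeans workloads are typically trained with the Spark MLlib RDD-based API. It is illustrative only, not the HiBench implementation: the object name, input paths, data formats, and parameter values (lambda, k, maxIterations) are placeholder assumptions.

```scala
// Illustrative sketch only -- not HiBench code. Paths, formats, and
// parameter values below are placeholders chosen for the example.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

object MLWorkloadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MLWorkloadSketch"))

    // Bayes: train a Naive Bayes classifier on labeled points
    // (HiBench's real input comes from its Zipfian document generator).
    val bayesData = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/bayes_input") // placeholder path
    val bayesModel = NaiveBayes.train(bayesData, lambda = 1.0)

    // Kmeans: cluster dense feature vectors
    // (HiBench's real input comes from GenKMeansDataset).
    val points = sc.textFile("hdfs:///tmp/kmeans_input")                  // placeholder path
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()
    val kmeansModel = KMeans.train(points, k = 10, maxIterations = 5)

    println(s"Naive Bayes classes: ${bayesModel.labels.length}, " +
      s"K-means cost: ${kmeansModel.computeCost(points)}")
    sc.stop()
  }
}
```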
**SQL:**
