
Commit 00c72d2

BenFradet authored and srowen committed
[SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python.

Author: BenFradet <[email protected]>

Closes apache#10411 from BenFradet/SPARK-12247.
1 parent 827ed1c commit 00c72d2

File tree

10 files changed: +431 -298 lines changed


data/mllib/als/sample_movielens_movies.txt (-100)

This file was deleted.

docs/_data/menu-ml.yaml (+2)

@@ -6,5 +6,7 @@
   url: ml-classification-regression.html
 - text: Clustering
   url: ml-clustering.html
+- text: Collaborative filtering
+  url: ml-collaborative-filtering.html
 - text: Advanced topics
   url: ml-advanced.html

docs/ml-collaborative-filtering.md (+148, new file)

@@ -0,0 +1,148 @@
---
layout: global
title: Collaborative Filtering - spark.ml
displayTitle: Collaborative Filtering - spark.ml
---

* Table of contents
{:toc}

## Collaborative filtering

[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering) is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. `spark.ml` currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. `spark.ml` uses the [alternating least squares (ALS)](http://dl.acm.org/citation.cfm?id=1608614) algorithm to learn these latent factors. The implementation in `spark.ml` has the following parameters (a brief configuration sketch follows the list):

* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
* *rank* is the number of latent factors in the model (defaults to 10).
* *maxIter* is the maximum number of iterations to run (defaults to 10).
* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for *implicit feedback* data (defaults to `false`, which means using *explicit feedback*).
* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the *baseline* confidence in preference observations (defaults to 1.0).
* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
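
As an illustration only (an editorial sketch, not code from this commit), these parameters map onto the builder API roughly as follows. Every value shown is just the documented default, and `setNumBlocks` is assumed here to set both the user and item block counts:

{% highlight scala %}
import org.apache.spark.ml.recommendation.ALS

// Hypothetical sketch: each documented parameter set explicitly to its default.
val als = new ALS()
  .setNumBlocks(10)        // numBlocks: partitions used to parallelize computation
  .setRank(10)             // rank: number of latent factors
  .setMaxIter(10)          // maxIter: maximum number of iterations
  .setRegParam(1.0)        // regParam: regularization parameter
  .setImplicitPrefs(false) // explicit-feedback variant
  .setAlpha(1.0)           // alpha: baseline confidence (implicit variant only)
  .setNonnegative(false)   // no nonnegativity constraints on the factors
{% endhighlight %}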

### Explicit vs. implicit feedback

The standard approach to matrix factorization based collaborative filtering treats the entries in the user-item matrix as *explicit* preferences given by the user to the item, for example, users giving ratings to movies.

It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views, clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22). Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data as numbers representing the *strength* in observations of user actions (such as the number of clicks, or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of confidence in observed user preferences, rather than explicit ratings given to items. The model then tries to find latent factors that can be used to predict the expected preference of a user for an item.
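
As a point of reference (an editorial gloss following the cited paper, not text from this commit), the implicit-feedback variant typically converts each observed strength `\(r_{ui}\)` into a binary preference `\(p_{ui}\)` and a confidence weight `\(c_{ui}\)`:

`\[
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise,} \end{cases}
\qquad
c_{ui} = 1 + \alpha r_{ui},
\]`

so the *alpha* parameter above controls how fast confidence grows with the strength of an observation.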

### Scaling of the regularization parameter

We scale the regularization parameter `regParam` in solving each least squares problem by the number of ratings the user generated in updating user factors, or the number of ratings the product received in updating product factors. This approach is named "ALS-WR" and discussed in the paper "[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)". It makes `regParam` less dependent on the scale of the dataset, so we can apply the best parameter learned from a sampled subset to the full dataset and expect similar performance.
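
Schematically (again an editorial gloss rather than part of this commit), with user factors `\(x_u\)`, item factors `\(y_i\)`, observed ratings `\(r_{ui}\)` over the set `\(K\)` of known pairs, and `\(n_u\)`, `\(n_i\)` counting the ratings of user `\(u\)` and item `\(i\)`, the ALS-WR objective weights the regularizer per factor:

`\[
\min_{x,y} \sum_{(u,i) \in K} \left( r_{ui} - x_u^T y_i \right)^2
+ \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2 \right),
\]`

where `\(\lambda\)` corresponds to `regParam`.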

## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

In the following example, we load rating data from the [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row consisting of a user, a movie, a rating and a timestamp. We then train an ALS model which assumes, by default, that the ratings are explicit (`implicitPrefs` is `false`). We evaluate the recommendation model by measuring the root-mean-square error of rating prediction.

Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.ml.recommendation.ALS) for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
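
The evaluation step in particular follows a pattern like the sketch below. This is an editorial illustration rather than the example's exact code; it assumes a `predictions` DataFrame obtained from `model.transform(test)` with `rating` and `prediction` columns:

{% highlight scala %}
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Compute the root-mean-square error over held-out predictions.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
{% endhighlight %}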

If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can set `implicitPrefs` to `true` to get better results:

{% highlight scala %}
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
{% endhighlight %}
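
When `implicitPrefs` is enabled, the *alpha* parameter described earlier also applies; as a purely illustrative addition, it could be tuned by appending, for example, `.setAlpha(0.01)` to the builder chain above.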

</div>
<div data-lang="java" markdown="1">
93+
94+
In the following example, we load rating data from the
95+
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
96+
consisting of a user, a movie, a rating and a timestamp.
97+
We then train an ALS model which assumes, by default, that the ratings are
98+
explicit (`implicitPrefs` is `false`).
99+
We evaluate the recommendation model by measuring the root-mean-square error of
100+
rating prediction.
101+
102+
Refer to the [`ALS` Java docs](api/java/org/apache/spark/ml/recommendation/ALS.html)
103+
for more details on the API.
104+
105+
{% include_example java/org/apache/spark/examples/ml/JavaALSExample.java %}
106+
107+
If the rating matrix is derived from another source of information (i.e. it is
108+
inferred from other signals), you can set `implicitPrefs` to `true` to get
109+
better results:
110+
111+
{% highlight java %}
112+
ALS als = new ALS()
113+
.setMaxIter(5)
114+
.setRegParam(0.01)
115+
.setImplicitPrefs(true)
116+
.setUserCol("userId")
117+
.setItemCol("movieId")
118+
.setRatingCol("rating");
119+
{% endhighlight %}
120+
121+
</div>
122+
123+
<div data-lang="python" markdown="1">
124+
125+
In the following example, we load rating data from the
126+
[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
127+
consisting of a user, a movie, a rating and a timestamp.
128+
We then train an ALS model which assumes, by default, that the ratings are
129+
explicit (`implicitPrefs` is `False`).
130+
We evaluate the recommendation model by measuring the root-mean-square error of
131+
rating prediction.
132+
133+
Refer to the [`ALS` Python docs](api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS)
134+
for more details on the API.
135+
136+
{% include_example python/ml/als_example.py %}
137+
138+
If the rating matrix is derived from another source of information (i.e. it is
139+
inferred from other signals), you can set `implicitPrefs` to `True` to get
140+
better results:
141+
142+
{% highlight python %}
143+
als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
144+
userCol="userId", itemCol="movieId", ratingCol="rating")
145+
{% endhighlight %}
146+
147+
</div>
148+
</div>

docs/mllib-collaborative-filtering.md (+15, -15)
@@ -31,17 +31,18 @@ following parameters:
 ### Explicit vs. implicit feedback
 
 The standard approach to matrix factorization based collaborative filtering treats
-the entries in the user-item matrix as *explicit* preferences given by the user to the item.
+the entries in the user-item matrix as *explicit* preferences given by the user to the item,
+for example, users giving ratings to movies.
 
 It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
 clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
-from
-[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
-Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
-as a combination of binary preferences and *confidence values*. The ratings are then related to the
-level of confidence in observed user preferences, rather than explicit ratings given to items. The
-model then tries to find latent factors that can be used to predict the expected preference of a
-user for an item.
+from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
+Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
+as numbers representing the *strength* in observations of user actions (such as the number of clicks,
+or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
+confidence in observed user preferences, rather than explicit ratings given to items. The model
+then tries to find latent factors that can be used to predict the expected preference of a user for
+an item.
 
 ### Scaling of the regularization parameter
 
@@ -50,9 +51,8 @@ the number of ratings the user generated in updating user factors,
 or the number of ratings the product received in updating product factors.
 This approach is named "ALS-WR" and discussed in the paper
 "[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
-It makes `lambda` less dependent on the scale of the dataset.
-So we can apply the best parameter learned from a sampled subset to the full dataset
-and expect similar performance.
+It makes `lambda` less dependent on the scale of the dataset, so we can apply the
+best parameter learned from a sampled subset to the full dataset and expect similar performance.
 
 ## Examples
 
@@ -64,11 +64,11 @@ We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.rec
 method which assumes ratings are explicit. We evaluate the
 recommendation model by measuring the Mean Squared Error of rating prediction.
 
-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for details on the API.
+Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}
 
-If the rating matrix is derived from another source of information (e.g., it is inferred from
+If the rating matrix is derived from another source of information (i.e. it is inferred from
 other signals), you can use the `trainImplicit` method to get better results.
 
 {% highlight scala %}
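
As an editorial aside (not part of this diff), a minimal sketch of the `trainImplicit` call referenced above; it assumes a `spark-shell` session where `sc` is the active `SparkContext`, and the data and parameter values are purely illustrative:

{% highlight scala %}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy implicit-feedback data: (user, product, observation strength),
// e.g. click counts rather than explicit ratings.
val observations = sc.parallelize(Seq(
  Rating(0, 0, 4.0), Rating(0, 1, 1.0), Rating(1, 1, 3.0)))

val rank = 10
val numIterations = 10
val lambda = 0.01 // regularization parameter
val alpha = 0.01  // governs baseline confidence in the observations
val model = ALS.trainImplicit(observations, rank, numIterations, lambda, alpha)
{% endhighlight %}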
@@ -85,7 +85,7 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a
 calling `.rdd()` on your `JavaRDD` object. A self-contained application example
 that is equivalent to the provided example in Scala is given below:
 
-Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for details on the API.
+Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.
 
 {% include_example java/org/apache/spark/examples/mllib/JavaRecommendationExample.java %}
 </div>
@@ -99,7 +99,7 @@ Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.rec
 
 {% include_example python/mllib/recommendation_example.py %}
 
-If the rating matrix is derived from other source of information (i.e., it is inferred from other
+If the rating matrix is derived from other source of information (i.e. it is inferred from other
 signals), you can use the trainImplicit method to get better results.
 
 {% highlight python %}

docs/mllib-guide.md (+1)

@@ -71,6 +71,7 @@ We list major functionality from both below, with links to detailed guides.
 * [Extracting, transforming and selecting features](ml-features.html)
 * [Classification and regression](ml-classification-regression.html)
 * [Clustering](ml-clustering.html)
+* [Collaborative filtering](ml-collaborative-filtering.html)
 * [Advanced topics](ml-advanced.html)
 
 Some techniques are not available yet in spark.ml, most notably dimensionality reduction
