Journal of Machine Learning Research
A Study of Mixture Models for Collaborative Filtering
Rong Jin
RONG@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213, USA
Luo Si
LSI@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213, USA
Chengxiang Zhai
CZHAI@CS.UIUC.EDU
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
Alex G. Hauptmann
ALEX@CS.CMU.EDU
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213, USA
Abstract
Collaborative filtering is a very useful general technique for exploiting the preference patterns of a
group of users to predict the utility of items for a particular user. Three different components
need to be modeled in a collaborative filtering problem: the users, the items, and the ratings.
Even though previous research on applying probabilistic models to collaborative filtering has
shown promising results, there has been no systematic study of how to model each of the three
components and their interactions. In this paper, we conduct a broad and systematic study of
different mixture models for collaborative filtering. We discuss general issues related to using a
mixture model for collaborative filtering, and propose three desirable properties that any graphic
model for the task should satisfy. Using these properties, we thoroughly examine five different
mixture (probabilistic) models: Bayesian Clustering (BC), the Aspect Model (AM), the Flexible
Mixture Model (FMM), the Joint Mixture Model (JMM), and the Decoupled Model (DM). We
compare these models both analytically and experimentally. Experiments over two datasets of
movie ratings under several different configurations show that, in general, whether a model
satisfies the proposed properties tends to correlate with the model's performance. In particular,
the DM model, which satisfies all three proposed properties, outperforms the other mixture
models as well as some other existing approaches to collaborative filtering. Our study shows that
graphic models are powerful tools for modeling collaborative filtering, but careful design of the
model is necessary to achieve good performance.
Keywords: collaborative filtering, graphic model, probabilistic model
1 Introduction
The rapid growth of information on the Internet demands intelligent information agents
that can sift through all the available information and find what is most valuable to us. These
©2003 Jin, Si and Zhai
JIN, SI AND ZHAI
intelligent systems can be categorized into two classes: Collaborative Filtering (CF) and
Content-based recommending. The difference between them is that collaborative filtering utilizes
only the ratings of training users to predict ratings for test users, while content-based
recommendation systems rely on the contents of items for predictions. Therefore, collaborative
filtering systems have an advantage in environments where the contents of items are not
available, either because of privacy concerns or because the contents are difficult for a computer
to analyze. In this paper, we focus only on the collaborative filtering problem.
Most collaborative filtering methods fall into two categories: memory-based algorithms and
model-based algorithms [Breese et al. 1998]. Memory-based algorithms store the rating examples
of users in a training database and, in the prediction phase, predict a test user's ratings based on
the corresponding ratings of the training users who are similar to the test user. In contrast,
model-based algorithms build models that explain the training examples well and predict the
ratings of test users using the estimated models. Both types of approaches have been shown to be
effective for collaborative filtering.
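As a point of contrast with the model-based approach studied in this paper, a memory-based predictor can be sketched in a few lines. This is an illustrative sketch only: the dense user-item matrix with 0 marking "unrated", the Pearson similarity, and the top-`k` cutoff are assumptions made for the example, not a method taken from the text.

```python
import numpy as np

def predict_memory_based(train, test_user, item, k=2):
    """Predict a rating as the similarity-weighted average of the k most
    similar training users' ratings of `item`.  Rows of `train` are users,
    columns are items, and 0 marks a missing rating (assumed layout)."""
    sims = []
    for row in train:
        mask = (row > 0) & (test_user > 0)          # co-rated items only
        if mask.sum() < 2 or row[item] == 0:
            continue                                 # not enough overlap
        a, b = row[mask], test_user[mask]
        denom = a.std() * b.std()
        sim = 0.0 if denom == 0 else float(np.corrcoef(a, b)[0, 1])
        sims.append((sim, row[item]))
    sims.sort(reverse=True)                          # most similar first
    top = sims[:k]
    if not top:
        return 0.0
    w = np.array([s for s, _ in top])
    r = np.array([v for _, v in top])
    return float(np.sum(w * r) / (np.abs(w).sum() + 1e-12))
```

Note that at prediction time this scans the whole training database, which is exactly the efficiency drawback of memory-based methods that the model-based approaches below avoid.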
In general, all collaborative filtering approaches assume that users with similar "tastes" rate
items similarly, and the idea of clustering is exploited in all approaches, either explicitly or
implicitly. Compared with memory-based approaches, model-based approaches provide a more
principled way of performing clustering and are often much more efficient in terms of
computation cost at prediction time. The basic idea of a model-based approach is to cluster
items and/or training users into classes explicitly, and to predict the ratings of a test user using
the ratings of the classes that best fit the test user and/or the items to be rated. Several different
probabilistic models have been proposed and studied in previous work (e.g., [Breese et al.
1998; Hofmann & Puzicha 1998; Pennock et al. 2000; Popescul et al. 2001; Ross & Zemel 2002;
Si et al. 2003; Jin et al. 2003; Hofmann 2003]). These models have succeeded in capturing
user/item similarities through probabilistic clustering in one way or another, and have all been
shown to be quite promising. Most of these methods can be represented in the graphical model
framework.
However, there has been no systematic study and comparison of all the graphic models
proposed for collaborative filtering. There are both theoretical and empirical reasons for such a
study: (1) Theoretically, different models make different assumptions, and we need to understand
the differences and connections among these models in terms of their underlying assumptions. (2)
Empirically, these models were evaluated with different experimental settings in previous
studies; it would be useful to see how they compare with each other under the same
experimental settings. Moreover, a systematic study is necessary for explaining why some models
tend to perform better than others.
In this paper, we conduct a systematic study of a large subset of graphic models – mixture
models – for collaborative filtering. Mixture models are quite natural for modeling similarities
among users, items, and ratings. In general, three components need to be modeled carefully:
the users, the items, and the ratings. Not only do we need to cluster each component into a
small number of groups, but we must also model the interactions between the components
appropriately. We propose three desirable properties that a reasonable graphic model for
collaborative filtering should satisfy: (1) separate clustering of users and items; (2) the flexibility
for a user/item to be in multiple clusters; and (3) the decoupling of user preferences and rating
patterns.
We thoroughly analyze five different mixture models, including Bayesian Clustering (BC), the
Aspect Model (AM), the Flexible Mixture Model (FMM), the Joint Mixture Model (JMM), and the
Decoupled Model (DM), based on the three proposed properties. We also compare these models
empirically. Experiments over two datasets of movie ratings under several different
configurations show that, in general, the fulfillment of the proposed properties tends to be
positively correlated with the model's performance. In particular, the DM model, which satisfies
all three proposed properties, outperforms all the other mixture models as well as some
other existing approaches to collaborative filtering. Our study shows that graphic models are
powerful tools for modeling collaborative filtering, but careful design of the model is necessary to
achieve good performance.
The rest of the paper is organized as follows: Section 2 gives a general discussion of using graphic
models for collaborative filtering and presents the three desirable properties that any graphic
model should satisfy. In Section 3, we present and examine five different mixture models in terms
of their connections and differences. We discuss model estimation and rating prediction in
Sections 4 and 5. Empirical studies are presented in Section 6. Conclusions and future
work are discussed in Section 7.
2 Graphic Models for Collaborative Filtering

2.1 Problem definition
We first introduce some notation and formulate the problem of collaborative filtering
in terms of graphic models. Let $X = \{x_1, x_2, \ldots, x_M\}$ be a set of items, $Y = \{y_1, y_2, \ldots, y_N\}$ be a set of
users, and $\{1, \ldots, R\}$ be the range of ratings. In collaborative filtering, we have available a training
database consisting of the ratings of some items by some users, which we denote by
$\{(x_{(1)}, y_{(1)}, r_{(1)}), \ldots, (x_{(L)}, y_{(L)}, r_{(L)})\}$, where a tuple $(x_{(i)}, y_{(i)}, r_{(i)})$ means that user $y_{(i)}$ gives item $x_{(i)}$ a
rating of $r_{(i)}$. Our task is to predict the rating $r$ of an unrated item $x$ by a user $y$ based on all the
training information. To cast the problem in terms of graphic models, we treat each tuple
$(x_{(i)}, y_{(i)}, r_{(i)})$ as an observation of three random variables – $x$, $y$, and $r$. Through the training
database, we are able to model the interaction between the three random variables. There are
three possible choices of likelihood that we can maximize for the training data: $p(r|x,y)$, $p(r,x|y)$,
and $p(r,x,y)$. Even though these quantities are strongly related, namely
$$p(r, x \mid y) = \frac{p(r, x, y)}{\sum_{r', x'} p(r', x', y)} \qquad \text{and} \qquad p(r \mid x, y) = \frac{p(r, x \mid y)}{\sum_{r'} p(r', x \mid y)},$$
maximizing the likelihood of the data under each choice explains a different aspect of the data. For the first choice, i.e., $p(r|x,y)$,
we believe that the only meaningful part of the data is the observed ratings, and the selection of
items and users is purely random. The second choice differs from the first in that it explains not
only the observed ratings but also why each user selects a particular subset of movies to rate.
Because each user rates only a small subset of all movies, it can be useful to
model why a particular subset of movies is selected by a particular user. As a consequence,
movies that have been rated by many users will have a significantly higher impact on the
estimation of the model than movies rated by only a few users. The third choice, i.e., $p(r,x,y)$,
simply models the joint distribution of the three random variables. Under this choice, the
model is also concerned with the behavior of users, for example, why some users rate many movies
while others rate only a few. As a result, users with many ratings will account for more in the
model than users with only a few. Based on the above discussion, we can see that the choice
of likelihood to maximize on the training dataset can have a significant impact on the final
estimate. It is easy to see that most existing probabilistic approaches to collaborative filtering
fall into one of these three cases. For example, the personality diagnosis method is a special case
of the first, where we assume a Gaussian distribution for $p(r|x,y)$ and directly compute $p(r|x,y)$
by performing a Bayesian average over the models of all users. The aspect model can be regarded as
a special case of the third, where we use a mixture model for $p(r,x,y)$. In this paper, we focus
on the second and third cases and systematically examine the different choices of mixture models.
We compare these models both analytically and experimentally.
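The normalization identities relating the three likelihood choices can be checked numerically on a toy joint distribution; the array shapes and random values below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# A toy joint distribution p(r, x, y): R=3 ratings, M=4 items, N=2 users.
rng = np.random.default_rng(0)
p_rxy = rng.random((3, 4, 2))
p_rxy /= p_rxy.sum()                     # normalise into a distribution

# p(r, x | y) = p(r, x, y) / sum_{r', x'} p(r', x', y)
p_rx_given_y = p_rxy / p_rxy.sum(axis=(0, 1), keepdims=True)

# p(r | x, y) = p(r, x | y) / sum_{r'} p(r', x | y)
p_r_given_xy = p_rx_given_y / p_rx_given_y.sum(axis=0, keepdims=True)

# Each conditional is properly normalised over the variables it predicts.
assert np.allclose(p_rx_given_y.sum(axis=(0, 1)), 1.0)
assert np.allclose(p_r_given_xy.sum(axis=0), 1.0)
```

The two assertions make the difference between the choices concrete: the first conditional sums to one over all (rating, item) pairs for each user, the second over ratings alone for each (item, user) pair.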
2.2 Major issues in designing a graphic model
In general, in order to model the similarity among users, the items and the ratings, we need to
cluster each component into a number of groups and to model the interactions between different
components appropriately. More specifically, the following three important issues must be
addressed:
Property 1: How should we model user similarity and item similarity? Generally,
users and items come from different sets of concepts, and the two sets of concepts are
coupled with each other through the rating information. Therefore, we believe that a good
clustering model should explicitly model both classes of users and classes of items, and be able to
leverage their correlation. This means that the choice of latent variables in a graphic model
should allow for separate, yet coupled, modeling of user similarity and item similarity. However,
this may lead to complex clustering models that are hard to estimate accurately with the
available data. We will compare several different clustering models.
Property 2: Should a user or an item be allowed to belong to multiple clusters? Since a user
can have diverse interests and an item may have multiple aspects, it is intuitively desirable to
allow both items and users to be in multiple classes simultaneously. However, such a model may
be too flexible to leverage user and item similarity effectively. We will compare models with
different assumptions.
Property 3: How can we capture the variance in rating patterns among users with the
same preferences over items? One common deficiency of most existing models is that they
assume users with similar interests rate items similarly; but the rating
pattern of a user is determined not only by his/her interests but also by his/her rating strategy or habit.
For example, some users are more "tolerant" than others, and therefore their ratings of items tend
to be higher even when they share very similar tastes in items. We will study how to
model such variance in a graphic model.
3 Mixture Models for Collaborative Filtering
In this section, we discuss a variety of possible mixture models and examine their
assumptions about user and item clustering and whether they address the variances in rating
patterns.
3.1 Bayesian Clustering (BC)
The basic idea of Bayesian Clustering (BC) is to assume that the same type of users rate
items similarly, so users can be grouped into a set of user classes according to their
ratings of items. Formally, given a user class $C$, the preferences for the various items,
expressed as ratings, are independent, and the joint probability of user class $C$ and the ratings of
items can be written in the standard naïve Bayes formulation:

$$P(C, r_1, r_2, \ldots, r_M) = P(C) \prod_{i=1}^{M} P(r_i \mid C) \qquad (1)$$
Then, the joint probability for the rating pattern of user $y$, i.e., $\{R_y(x_1), R_y(x_2), \ldots, R_y(x_M)\}$, can be
expanded as:

$$P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M)) = \sum_{C} P(C) \prod_{i \in X(y)} P(R_y(x_i) \mid C) \qquad (2)$$
As indicated by Equation (2), this model first selects a user class $C$ from the distribution
$P(C)$ and then rates all the items using that same class. In other words, this model
assumes that a single user class applies to the ratings of all the items, and therefore excludes the
case where a user belongs to multiple user classes, with different classes applied to the ratings of
different items. The parameters $P(r|C)$ can be learned automatically using the Expectation-Maximization (EM) algorithm. More details of this model can be found in (Breese et al. 1998).
As seen from Equations (1) and (2), the Bayesian Clustering approach models the
rating information directly, and no explicit clustering of users and items is performed. Moreover,
this approach models the joint probability of all ratings of a single user, i.e.,
$P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M))$, which is equivalent to assuming that a single user
can only belong to a single cluster. According to the three criteria mentioned in the previous
section, Bayesian Clustering is the simplest mixture model: there is no clustering of
users and items; each user is assumed to be of a single cluster; and there is no separation of
preference patterns and rating patterns. Figure 1 illustrates the basic idea of Bayesian Clustering.

Figure 1: Graphical model representation for Bayesian Clustering.
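Equation (2) can be evaluated in log space in a few lines. This is a minimal sketch, not the authors' implementation: `log_prior[c]` and `log_p[c, i, r]` are hypothetical toy parameter tables, and the log-sum-exp step is there only for numerical stability.

```python
import numpy as np

def bc_log_likelihood(ratings, log_prior, log_p):
    """Log of Equation (2): log sum_C P(C) prod_i P(R_y(x_i) | C).
    `ratings` lists one user's observed (item, rating) pairs;
    log_prior[c] = log P(C=c) and log_p[c, i, r] = log P(r_i = r | C=c)
    are hypothetical toy parameters."""
    ll = log_prior.copy()            # running log P(C) + sum_i log P(r_i | C)
    for i, r in ratings:
        ll = ll + log_p[:, i, r]
    m = ll.max()                     # log-sum-exp over the user classes
    return float(m + np.log(np.exp(ll - m).sum()))
```

Summing per-class log terms before exponentiating keeps the product over many items from underflowing, which matters once a user has rated more than a handful of items.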
3.2 Aspect Model (AM)
The aspect model is a probabilistic latent space model, which models individual preferences as a
convex combination of preference factors (Hofmann & Puzicha 1999). A latent class variable
$z \in Z = \{z_1, z_2, \ldots, z_K\}$ is associated with each observed pair of a user and an item. The aspect
model assumes that users and items are independent of each other given the latent class
variable. Thus, the probability of each observation pair $(x, y)$ (i.e., item-user pair) is calculated as
follows:

$$P(x, y) = \sum_{z \in Z} P(z) P(x \mid z) P(y \mid z) \qquad (3)$$

where $P(z)$ is the class prior probability, and $P(x|z)$ and $P(y|z)$ are class-dependent distributions for items
and users, respectively. Intuitively, this model means that the preference pattern of a user is
modeled by a combination of typical preference patterns, which are represented by the
distributions $P(z)$, $P(x|z)$, and $P(y|z)$.
There are two ways to incorporate the rating information $r$ into the basic aspect model,
expressed in Equations (4) and (5), respectively:

$$P(x_{(l)}, y_{(l)}, r_{(l)}) = \sum_{z \in Z} P(z) P(x_{(l)} \mid z) P(y_{(l)} \mid z) P(r_{(l)} \mid z) \qquad (4)$$

$$P(x_{(l)}, y_{(l)}, r_{(l)}) = \sum_{z \in Z} P(z) P(x_{(l)} \mid z) P(y_{(l)} \mid z) P(r_{(l)} \mid z, x_{(l)}) \qquad (5)$$

The corresponding graphical models are shown in Figure 2. The second model, in
Equation (5), has to estimate the conditional probability $P(r_{(l)} \mid z, x_{(l)})$, which has a large
parameter space and may not be estimated reliably.

Figure 2: Graphical models for the two extensions of the aspect model that capture rating values.
Unlike the Bayesian Clustering algorithm, where only the rating information is modeled, the
aspect model is able to model the users and the items with the conditional probabilities $P(y|z)$ and
$P(x|z)$. Moreover, unlike the Bayesian Clustering algorithm, which models the joint probability
$P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M))$ directly, the aspect model models the joint probability $P(x,y,r)$. By
doing so, the aspect model gives each item the freedom to select the appropriate user class for its
rating, while in the Bayesian Clustering algorithm the same user class is used to rate all the items.
However, the aspect model introduces only a single set of hidden variables for items, users, and
ratings. This essentially encodes the clustering of users, the clustering of items, and the
correlation between them together; no separation of the clustering of users and items is
attempted. Moreover, user preferences and rating patterns are not separated either. Therefore,
according to the criteria stated in Section 2.2, the aspect model is still a preliminary model: it
models users and items in a simple way but without separating the clustering of users and
items; it allows a user or an item to be in multiple clusters; and it makes no attempt to
model preference patterns and rating patterns separately.
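Equation (4) is a straightforward sum over the latent class; the following sketch assumes hypothetical toy parameter tables whose columns are normalized, and is not the authors' implementation.

```python
import numpy as np

def am_joint(x, y, r, p_z, p_x_z, p_y_z, p_r_z):
    """P(x, y, r) under Equation (4): sum_z P(z) P(x|z) P(y|z) P(r|z).
    p_x_z[x, z], p_y_z[y, z], p_r_z[r, z] are hypothetical toy tables
    whose columns each sum to one."""
    return float(np.sum(p_z * p_x_z[x] * p_y_z[y] * p_r_z[r]))

# Sanity check: with normalised tables the joint sums to one over all triples.
rng = np.random.default_rng(1)
K, M, N, R = 2, 3, 2, 2
p_z = rng.random(K); p_z /= p_z.sum()
p_x_z = rng.random((M, K)); p_x_z /= p_x_z.sum(axis=0)
p_y_z = rng.random((N, K)); p_y_z /= p_y_z.sum(axis=0)
p_r_z = rng.random((R, K)); p_r_z /= p_r_z.sum(axis=0)
total = sum(am_joint(x, y, r, p_z, p_x_z, p_y_z, p_r_z)
            for x in range(M) for y in range(N) for r in range(R))
print(round(total, 6))   # 1.0
```

The single latent variable `z` appearing in all three conditionals is exactly the point criticized in the text: user clustering, item clustering, and their correlation are all forced through one index.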
3.3 Joint Mixture Model (JMM) and Flexible Mixture Model (FMM)
In this section, we examine two graphic models for collaborative filtering that are different from
BC and AM in that they are able to separately model the classes of users and items.
Recall that the general goal of graphic models for collaborative filtering is to simplify the
joint probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$, i.e., the likelihood for a user $y$ to rate a set of items $\{x_i\}_{i=1}^{M}$
with ratings $\{R_y(x_i)\}_{i=1}^{M}$. By extending the ideas of the Bayesian Clustering algorithm and the
Aspect Model (AM), we can have two different treatments of this joint probability:
1) In the first treatment, we follow the spirit of the Bayesian Clustering algorithm and
expand the joint probability as:

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \qquad (11)$$

where the variable $z_y$ stands for the class of user $y$. As indicated in Equation (11), to estimate
this probability we first choose the user class $z_y$ according to the distribution
$P(z_y|y)$, and then compute the likelihood of rating every item using the same user class $z_y$.
Similar to the Bayesian Clustering algorithm, this model assumes that each user
belongs to only a single user class. For easy reference, we call this model the Joint Mixture Model, or
JMM.
2) The second choice for estimating $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ is to first expand the joint probability
into a product of likelihoods $P(x_i, R_y(x_i) \mid y)$ over items, and then expand each likelihood
over the hidden variable for the user class, i.e.,

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid y) = \prod_{i=1}^{M} \sum_{z_y} P(x_i, R_y(x_i) \mid z_y) P(z_y \mid y) \qquad (12)$$

Unlike the previous approach, where the same user class is used to rate all the items, in this approach
each item has its own freedom in choosing a user class. For easy reference, we call this model the
Flexible Mixture Model, or FMM (Si et al., 2003).
Comparing Equation (12) to Equation (11), the difference between the two models is the
order of the product and the sum. As a result of this different order, the
FMM model allows each item to choose an appropriate user class for its rating, while the JMM
model forces every item to use the same user class. Mapping to the three
issues discussed in Section 2.2, we can see that for the first issue, both models are able to model
the clustering of users and items separately. For the second issue, the FMM model allows each
user to be in multiple clusters, while the JMM model restricts each user to a single cluster.
However, neither model makes any attempt to model the difference between rating
patterns and preference patterns.
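The order-of-operations difference between Equations (11) and (12) can be made concrete on toy numbers. In this sketch, `p_xr_zy[zy, x, r]` is a hypothetical table standing in for $P(x_i, R_y(x_i) \mid z_y)$; the values are illustrative only.

```python
import numpy as np

def jmm_likelihood(user_obs, p_zy, p_xr_zy):
    """Equation (11): the sum over the user class sits OUTSIDE the product
    over items -- one class must explain all of the user's ratings."""
    per_class = p_zy.copy()
    for x, r in user_obs:
        per_class = per_class * p_xr_zy[:, x, r]
    return float(per_class.sum())

def fmm_likelihood(user_obs, p_zy, p_xr_zy):
    """Equation (12): the sum over the user class sits INSIDE the product --
    each item may pick its own user class."""
    ll = 1.0
    for x, r in user_obs:
        ll *= float(np.sum(p_zy * p_xr_zy[:, x, r]))
    return ll

# Two sharply specialised classes: class 0 explains item 0, class 1 item 1.
p_zy = np.array([0.5, 0.5])
p_xr_zy = np.zeros((2, 2, 2))
p_xr_zy[0, 0, 1], p_xr_zy[1, 0, 1] = 0.9, 0.1
p_xr_zy[0, 1, 1], p_xr_zy[1, 1, 1] = 0.1, 0.9
obs = [(0, 1), (1, 1)]
print(jmm_likelihood(obs, p_zy, p_xr_zy))   # ≈ 0.09
print(fmm_likelihood(obs, p_zy, p_xr_zy))   # ≈ 0.25
```

With these numbers no single class explains both ratings well, so JMM assigns the data a much lower likelihood than FMM, which lets each item pick the class that suits it.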
The key component of these two models is the estimation of $P(x_i, R_y(x_i) \mid z_y)$. Estimating
$P(x_i, R_y(x_i) \mid z_y)$ directly as part of our parameter space can lead to a severe sparse-data
problem, because the number of different $P(x_i, R_y(x_i) \mid z_y)$ can be quite large. To alleviate
the sparse-data problem, we can further introduce a class variable $z_x$ for item $x_i$ and rewrite
$P(x_i, R_y(x_i) \mid z_y)$ as:

$$P(x_i, R_y(x_i) \mid z_y) = \sum_{z_x} P(x_i, R_y(x_i), z_x \mid z_y) = \sum_{z_x} P(z_x) P(x_i, R_y(x_i) \mid z_x, z_y) = \sum_{z_x} P(z_x) P(x_i \mid z_x) P(R_y(x_i) \mid z_x, z_y) \qquad (13)$$

As indicated by Equation (13), the probability $P(x_i, R_y(x_i) \mid z_y)$ is decomposed into a sum of products
of three terms: $P(z_x)$, $P(x_i \mid z_x)$, and $P(R_y(x_i) \mid z_x, z_y)$. $P(z_x)$ is the class
prior for item class $z_x$; $P(x_i \mid z_x)$ is the likelihood for item $x_i$ to be in item class $z_x$; and
$P(R_y(x_i) \mid z_x, z_y)$ is the likelihood for user class $z_y$ to rate item class $z_x$ with rating category $R_y(x_i)$. Clearly, by
introducing a small number of classes for items, we substantially decrease the number of parameters
for $P(x_i, R_y(x_i) \mid z_y)$, from $M \times R \times |Z_y|$ to $|Z_x|(1 + M + |Z_y| \times R)$, where $|Z_y|$ and $|Z_x|$
stand for the numbers of classes for users and items respectively, and $M$ and $R$ stand for the
number of items and rating categories.
In summary, both models contain the following four sets of parameters:
1) $P(z_y \mid y)$: the likelihood of assigning user $y$ to user class $z_y$;
2) $P(z_x)$: the class prior for item class $z_x$;
3) $P(x_i \mid z_x)$: the likelihood of generating item $x_i$ from item class $z_x$;
4) $P(r \mid z_x, z_y)$: the likelihood of user class $z_y$ rating item class $z_x$ with category $r$.
The total number of parameters is $|Z_x|(1 + M + |Z_y| \times R) + N \times |Z_y|$, where $N$ stands for the number of
training users. The diagrams of the corresponding graphic models are displayed in Figure 3.
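The parameter-count reduction from Equation (13) is easy to verify numerically; the sizes below are illustrative assumptions, not the paper's experimental settings.

```python
# Parameter-count arithmetic for Equation (13), with hypothetical sizes.
M, R, N = 1000, 5, 400     # items, rating levels, training users (assumed)
Zy, Zx = 10, 20            # numbers of user and item classes (assumed)

direct = M * R * Zy                  # modelling P(x_i, r | z_y) directly
factored = Zx * (1 + M + Zy * R)     # after introducing the item class z_x
total = factored + N * Zy            # plus P(z_y | y) for each training user

print(direct, factored, total)       # 50000 21020 25020
```

Even with these modest sizes the factored form cuts the rating-model parameters by more than half; the gap widens as the item count grows, since the $M \times R \times |Z_y|$ term grows with all three factors at once.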
Figure 3: Graphical model representation for the Joint Mixture Model and the Flexible
Mixture Model. Diagram (a) represents the joint mixture model (JMM) and diagram (b) the
flexible mixture model (FMM).
3.4 Decoupled Model for Rating and Preference Patterns (DM)
To explicitly account for the fact that users with similar interests may have very different rating
patterns, we extend the FMM model by introducing two hidden variables $Z_P$ and $Z_R$, with $Z_P$ for the
preference patterns of users and $Z_R$ for their rating patterns. The graphic representation of this
model is displayed in Figure 4. As in the previous models, the hidden variable $Z_x$ represents the
class of items. However, unlike the mixture models discussed in the previous subsection, where
users are modeled by a single class variable $Z_y$, in this new model users are clustered from two
different perspectives: the clustering of preference patterns, represented by the hidden
variable $Z_P$, and the clustering of rating patterns or habits, represented by the hidden variable $Z_R$. Furthermore, a
random variable $Z_{pref}$ is introduced to indicate whether or not the class of items $Z_x$ is preferred by
the class of users $Z_P$ who presumably share similar preferences for items. As indicated in the graphic
representation in Figure 4, this random variable connects the variables $Z_P$, $Z_x$, and the
rating $R$. According to Section 2.2, this new model addresses all three issues: it models
the clustering of users and items separately; it allows each user to be in multiple clusters; and it
models the difference between preference patterns and rating patterns.

Figure 4: Graphical model representation for the decoupled model (DM).

Similar to the treatment in the FMM model, the probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ can be estimated by
expanding it into a product of likelihoods $P(x_i, R_y(x_i) \mid y)$ over items, as illustrated in
Equation (12). Combining the class variables $Z_P$, $Z_R$, and $Z_{pref}$, we can express the likelihood
$P(x_i, R_y(x_i) \mid y)$ as:
$$P(x_i, R_y(x_i) \mid y) = \sum_{z_P, z_R, z_x} P(z_P \mid y) P(z_R \mid y) P(z_x) P(x_i \mid z_x) \sum_{z_{pref} \in \{0,1\}} P(z_{pref} \mid z_P, z_x) P(R_y(x_i) \mid z_R, z_{pref}) \qquad (14)$$
There are in total six different sets of parameters in this model:
1) $P(z_P \mid y)$: the likelihood of assigning user $y$ to preference class $z_P$;
2) $P(z_R \mid y)$: the likelihood of assigning user $y$ to rating class $z_R$;
3) $P(z_x)$: the class prior for item class $z_x$;
4) $P(x_i \mid z_x)$: the likelihood of generating item $x_i$ from item class $z_x$;
5) $P(z_{pref} \mid z_P, z_x)$: the probability for preference class $z_P$ to favor item class $z_x$;
6) $P(R_y(x_i) \mid z_R, z_{pref})$: the likelihood for rating class $z_R$ to rate an item as $R_y(x_i)$ given the
preference condition $z_{pref}$.
The total number of parameters is $|Z_x|(1 + M + |Z_P| \times |Z_{pref}|) + N(|Z_R| + |Z_P|) + R \times |Z_R| \times |Z_{pref}|$,
where $|Z_P|$, $|Z_R|$, $|Z_{pref}|$, and $|Z_x|$ stand for the numbers of preference classes, rating classes,
preference levels, and item classes. For easy reference, we call this model the Decoupled Model, or DM.
Comparing Figure 4 to Figure 3, the two hidden variables $Z_R$ and $Z_{pref}$ are introduced to
account for the difference between the underlying preferences of users and their surface rating patterns.
In particular, the rating variable $R$ is determined jointly by the hidden variables $Z_R$ and $Z_{pref}$. In
other words, the rating of an item is influenced not only by whether the user likes the item
(i.e., $Z_{pref}$) but also by the specific rating pattern of the user (i.e., $Z_R$). Therefore, even if a user
appears to like a certain type of item, the rating value can still be low if he has a very "tough"
rating criterion.
We can further generalize this model by allowing the hidden variable $Z_{pref}$ to take multiple
values instead of being binary. For example, we can have three preference levels,
with zero for no preference, one for slight preference, and two for strong preference. In our
experiments, we set the number of preference levels to the number of rating categories.
In this case, the flexible mixture model stated in Equation (13) can be viewed as a special case of
Equation (14) with the conditional probability $P(R_y(x_i) \mid z_R, z_{pref})$ set to a delta
function $\delta(R_y(x_i), z_{pref})$, which is one only when the rating $R_y(x_i)$ equals the preference level
$z_{pref}$ and zero otherwise.
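Equation (14) nests a binary sum over $Z_{pref}$ inside the sum over the three class variables; the following sketch spells out that nesting. All parameter tables are hypothetical toy values, not estimates from the paper.

```python
import numpy as np

def dm_item_likelihood(x, r, y, params):
    """P(x_i, R_y(x_i) | y) under the decoupled model, Equation (14).
    Hypothetical parameter tables, with shapes:
      p_zp_y[zP, y], p_zr_y[zR, y], p_zx[zx], p_x_zx[x, zx],
      p_pref[zpref, zP, zx], p_r[r, zR, zpref]   (zpref binary)."""
    p_zp_y, p_zr_y, p_zx, p_x_zx, p_pref, p_r = params
    total = 0.0
    for zp in range(p_zp_y.shape[0]):
        for zr in range(p_zr_y.shape[0]):
            for zx in range(p_zx.shape[0]):
                # inner sum over the binary preference indicator z_pref
                inner = sum(p_pref[zpref, zp, zx] * p_r[r, zr, zpref]
                            for zpref in (0, 1))
                total += (p_zp_y[zp, y] * p_zr_y[zr, y]
                          * p_zx[zx] * p_x_zx[x, zx] * inner)
    return total
```

With all tables normalized, summing this quantity over every (item, rating) pair for a fixed user returns one, which is a quick check that the sums are nested correctly.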
3.5 Summary and Comparison
In the above four sections, we discussed five different mixture models of increasing
complexity. The Bayesian Clustering approach models only the rating information and
makes no attempt to cluster either users or items. The aspect model is able to model the clustering
of users and items explicitly; however, with only a single hidden variable, the clustering of users
and the clustering of items are not separated. Another advantage of the aspect model over Bayesian Clustering is that
the aspect model allows each user to be in multiple clusters, while Bayesian Clustering
does not. As improvements over Bayesian Clustering and the aspect model, the joint mixture model
(JMM) and the flexible mixture model (FMM) are introduced with an emphasis on separating the
clustering of users from the clustering of items; they differ in whether a single user is allowed to be in
multiple clusters. The decoupled model adds two hidden variables, for preference patterns
and rating patterns, to address the issue that users with similar tastes may have different
rating patterns. Table 1 lists the properties of each model with respect to the three issues
discussed in Section 2.2. According to Table 1, both the mixture models and the decoupled model
satisfy more properties than the Bayesian clustering approach and the aspect model. As a
result, they give a better description of the training data and achieve better prediction
performance. On the other hand, as a tradeoff, satisfying more properties means increasing
the model complexity, which can degrade the accuracy of prediction when only a
small amount of training data is available. As will be observed in the later experiments, with a large number of
training users, models satisfying more properties usually perform better than models satisfying
fewer; however, when the number of training users is small, the simpler model may perform
even better.
Model                          Property 1   Property 2   Property 3
Bayesian Clustering (BC)       -            -            -
Aspect Model (AM)              -            x            -
Joint Mixture Model (JMM)      x            -            -
Flexible Mixture Model (FMM)   x            x            -
Decoupled Model (DM)           x            x            x

Table 1: Properties of five different mixture models for collaborative filtering. Property 1
corresponds to the separation of clustering for users and items; Property 2 corresponds to the
freedom for a single user to be in multiple clusters; Property 3 corresponds to capturing the
difference between preference patterns and rating patterns.
4 Model Estimation

4.1 General approach -- EM Algorithms
In general, all these mixture models can be estimated using the EM algorithm. Here we give some
detail on the derivation of the EM algorithm for the Joint Mixture Model (JMM), which is
slightly more complicated than the other models.
To train the optimal parameters for this model, the EM algorithm can be employed to
maximize the log-likelihood of the training data, i.e.,

$$L = \sum_{y} \log P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) \qquad (15)$$

The expression for $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ is obtained by putting Equations (11) and (13) together, i.e.,

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} \sum_{z_x} P(z_x) P(x_i \mid z_x) P(R_y(x_i) \mid z_x, z_y) \qquad (16)$$
The Expectation Maximization (EM) algorithm (Demspter, et. al., 1977) is usually used to
optimize the above objective function. The EM algorithm alternates between the expectation
steps and maximization steps. In the expectation step, the posterior probabilities of the latent
variables P( z y | {xi , R y ( xi )}iM1 , y ) and P( z ix | {xi , R y ( xi )}iM1 , y, z y ) are computed as follows:
$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \dfrac{P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}{\sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}$    (17)

$P(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y) = \dfrac{P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y)}{\sum_{z_x^i} P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y)}$    (18)
In the maximization step, the model parameters will be updated according to the estimated
posteriors as follows:
$P(z_y \mid y) = P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y)$    (19)

$P(z_x) = \dfrac{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}{\sum_{z_x} \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}$    (20)

$P(x \mid z_x) = \dfrac{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}$    (21)

$P(r \mid z_x, z_y) = \dfrac{\sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)\, \delta(R_y(x_i), r)}{\sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}$    (22)
Since the EM solution to the objective in Equation (15) is not obvious, the derivation of the EM updating
Equations (17)-(22) is presented in Appendix A.
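To make the E-step concrete, the posteriors of Equations (17) and (18) can be sketched in a few lines of NumPy. The toy dimensions, the randomly initialized parameters, and the example user profile below are all hypothetical stand-ins, not the trained models used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_zx, n_zy, n_r = 6, 2, 2, 5   # hypothetical toy sizes

def normalize(a, axis=0):
    return a / a.sum(axis=axis, keepdims=True)

# randomly initialized JMM parameters for a single user y
p_zy = normalize(rng.random(n_zy))                   # P(z_y | y)
p_zx = normalize(rng.random(n_zx))                   # P(z_x)
p_x_zx = normalize(rng.random((n_items, n_zx)))      # P(x | z_x), columns sum to 1
p_r_zxzy = normalize(rng.random((n_r, n_zx, n_zy)))  # P(r | z_x, z_y)

items = np.array([0, 2, 3])     # items rated by user y
ratings = np.array([4, 1, 3])   # the corresponding rating categories

# P(x_i, R_y(x_i) | z_y) = sum_{z_x} P(z_x) P(x_i | z_x) P(R_y(x_i) | z_x, z_y)
lik = np.einsum('k,ik,ikl->il', p_zx, p_x_zx[items], p_r_zxzy[ratings])

# Eq. (17): posterior over the user class z_y
post_zy = normalize(p_zy * lik.prod(axis=0))

# Eq. (18): posterior over z_x for every rated item, for each fixed z_y
joint = p_zx[None, :, None] * p_x_zx[items][:, :, None] * p_r_zxzy[ratings]
post_zx = joint / joint.sum(axis=1, keepdims=True)   # shape (M, |Z_x|, |Z_y|)
```

The M-step of Equations (19)-(22) then accumulates these posteriors over all users and items and renormalizes each parameter.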
4.2 Smoothing in the EM Algorithms
The EM algorithm is notorious for its tendency to find undesirable local optima. In this
section, we discuss two techniques that have the potential to avoid unfavorable solutions. Again,
we use the JMM as the example; the other models follow analogously.
The first technique for avoiding unfavorable local optima is the annealed EM
algorithm (AEM) (Hofmann & Puzicha, 1998), which is an EM algorithm with regularization. In
order to prevent the posteriors from becoming skewed at the early stage of the EM iterations, we
introduce a variable ‘b’. In the annealed EM algorithm, the posteriors for the JMM model are
rewritten as:
$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \dfrac{\left[ P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \right]^b}{\sum_{z_y} \left[ P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \right]^b}$    (24)

$P(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y) = \dfrac{\left[ P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y) \right]^b}{\sum_{z_x^i} \left[ P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y) \right]^b}$    (25)
As an analogy to a true annealing process, the variable ‘b’ corresponds to the inverse of the so-called
‘temperature’. At first the temperature is set to be infinitely high, i.e., ‘b’ is set to zero.
In this case, the training data are ignored and all the posteriors are simply uniform distributions. Then,
by decreasing the ‘temperature’, or increasing the value of ‘b’, we let the training data
play a more and more important role in the estimation of the posteriors, and the estimated
posteriors move further and further away from the uniform distribution. When ‘b’ is increased
to 1, we return to the posterior expressions of the normal EM algorithm as stated in Equations
(17) and (18). Thus, by slowly increasing the value of ‘b’, we prevent the estimated posteriors from
becoming skewed at the early stage of the EM iterations and may have a better chance of
finding a useful local optimum.
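The annealing step amounts to exponentiating the unnormalized posterior by ‘b’ and renormalizing, which can be sketched in a few self-contained lines (the example weights are arbitrary):

```python
import numpy as np

def tempered_posterior(log_weights, b):
    """Raise an unnormalized posterior to the inverse temperature b and
    renormalize, as in Eqs. (24)-(25): b = 0 yields a uniform distribution,
    b = 1 recovers the ordinary EM posterior."""
    z = b * np.asarray(log_weights, dtype=float)   # work in log space for stability
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

scores = np.log([0.7, 0.2, 0.1])          # arbitrary unnormalized posterior
print(tempered_posterior(scores, 0.0))    # uniform: all entries 1/3
print(tempered_posterior(scores, 1.0))    # the ordinary posterior, ~[0.7 0.2 0.1]
```

Intermediate values of ‘b’ interpolate smoothly between these two extremes, which is exactly what the slow schedule from 0 to 1 exploits.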
The second smoothing strategy is to introduce an appropriate model prior and maximize the
posterior (MAP) instead of the likelihood (MLE). Let L(D|M) stand for the log-likelihood of the
training data (Equation (15)) and P(M) stand for the prior of model M. Instead of maximizing
the likelihood of the training data L(D|M) as we did in the last subsection, we can maximize the
log-posterior of the model M, which can be written as L(D|M) + log P(M). For the convenience of
computation, we use the following Dirichlet prior (with a uniform mean) for model M:
$P(M \mid a, b, c, d) \propto \left[ \prod_{z_x} P(z_x) \right]^a \left[ \prod_{i, z_x} P(x_i \mid z_x) \right]^b \left[ \prod_{y, z_y} P(z_y \mid y) \right]^c \left[ \prod_{z_x, z_y, r} P(r \mid z_x, z_y) \right]^d$    (26)
It is not difficult to prove that the EM updating equations for the JMM model using MAP are:

$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \dfrac{c + P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}{\sum_{z_y} \left[ c + P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \right]}$    (19')

$P(z_x) = \dfrac{a + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}{\sum_{z_x} \left[ a + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y) \right]}$    (20')

$P(x \mid z_x) = \dfrac{b + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}{Mb + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}$    (21')

$P(r \mid z_x, z_y) = \dfrac{d + \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)\, \delta(R_y(x_i), r)}{Rd + \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}$    (22')

where M is the number of items and R is the number of rating categories.
The details can be found in Appendix B.
Comparing the above equations to the original EM updating equations, the only difference
is that maximizing the posterior (MAP) introduces Laplacian smoothing into the
updating equations for the model parameters, which essentially gives each parameter a nonzero
initialization and therefore prevents the model parameters from becoming skewed.
As with the other mixture models, both the annealed EM algorithm and the MAP approach
will be applied to avoid unfavorable local optima.
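The effect of the Dirichlet prior on each M-step update is just an additive pseudo-count; a minimal sketch (the counts and the prior strength below are arbitrary illustrations):

```python
import numpy as np

def map_multinomial_update(expected_counts, pseudo_count):
    """M-step for one multinomial parameter under a uniform Dirichlet prior:
    every expected count from the E-step is offset by the same pseudo-count,
    as in Eqs. (19')-(22'), so no parameter can collapse to zero."""
    smoothed = np.asarray(expected_counts, dtype=float) + pseudo_count
    return smoothed / smoothed.sum()

counts = np.array([9.0, 1.0, 0.0])            # expected counts from the E-step
print(map_multinomial_update(counts, 0.0))    # MLE: a zero count stays zero
print(map_multinomial_update(counts, 0.5))    # MAP: every entry is nonzero
```

With the pseudo-count set to zero this reduces to the plain MLE update, which is why the MAP equations contain the original updates as a special case.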
5 Rating Prediction
To predict the ratings of items by a test user $y^t$, in general we need to compute the probability
distribution over all the related latent class variables and sum over all the latent variable values.
In addition to the ratings by the training users, ratings over a small number of items by the test user
are given, and these are used to discover the distribution of the related latent class variables for the
test user. Let $D_{train}$ and $D_{test}$ stand for the rating data of the training users and the test user,
respectively. Let $\{h_i\}_{i=1}^{m}$ be the hidden variables and $M_{test} = \{P(h_i \mid y^t)\}_{i=1}^{m}$ be the parameter space
related to the test user; $M_{train}$ represents the parameter space related to the training
users. In order to predict the rating of an item x by the test user $y^t$, we need to compute the
likelihood $P(R_{y^t}(x) \mid D_{train}, D_{test})$, which can be expanded over the model spaces $M_{train}$ and $M_{test}$:
$P(R_{y^t}(x) \mid D_{train}, D_{test}) = \sum_{M_{test}} \sum_{M_{train}} P(R_{y^t}(x) \mid M_{test}, M_{train})\, P(M_{train} \mid D_{train})\, P(M_{test} \mid M_{train}, D_{test})$
$\qquad\qquad \approx P(R_{y^t}(x) \mid M^*_{test}, M^*_{train})\, P(M^*_{train} \mid D_{train})\, P(M^*_{test} \mid M^*_{train}, D_{test})$    (27)
where $M^*_{train}$ and $M^*_{test}$ stand for the optimal models that maximize the likelihoods $P(M_{train} \mid D_{train})$
and $P(M_{test} \mid M^*_{train}, D_{test})$, respectively. In the above expression, we approximate the average over the
model space with a simple computation on the optimal model. The optimal model $M^*_{test}$ is
derived by applying the EM algorithm to maximize the likelihood of the rating data of the test user $D_{test}$. As
an example, for the JMM model, the parameter space related to the test user is $M_{test} = \{P(z_y \mid y^t)\}$ and
the optimal $P(z_y \mid y^t)$ is computed by simply maximizing the likelihood of the rating data by the test
user.
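For the JMM case just described, the fold-in step can be sketched as a tiny EM loop that re-estimates only $P(z_y \mid y^t)$ while the trained parameters stay fixed. All sizes and parameter values below are hypothetical stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_zx, n_zy, n_r = 6, 2, 2, 5

def normalize(a, axis=0):
    return a / a.sum(axis=axis, keepdims=True)

# stand-ins for the trained parameters M_train (held fixed during fold-in)
p_zx = normalize(rng.random(n_zx))
p_x_zx = normalize(rng.random((n_items, n_zx)))
p_r_zxzy = normalize(rng.random((n_r, n_zx, n_zy)))

given_items = np.array([0, 1])     # items the test user has already rated
given_ratings = np.array([4, 2])

# fold-in: EM over M_test = {P(z_y | y_t)} only
p_zy = np.full(n_zy, 1.0 / n_zy)
lik = np.einsum('k,ik,ikl->il', p_zx, p_x_zx[given_items], p_r_zxzy[given_ratings])
for _ in range(50):
    p_zy = normalize(p_zy * lik.prod(axis=0))   # E-step posterior = new P(z_y | y_t)

# predict the rating distribution for an unseen item x
x = 5
pred = normalize(np.einsum('l,k,rkl->r', p_zy, p_zx * p_x_zx[x], p_r_zxzy))
expected_rating = (np.arange(n_r) * pred).sum()
```

The same scheme applies to the other models; only the identity of the test-user parameters $M_{test}$ changes.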
6 Experiments
In this section, we present experimental results in order to address the following five issues:
1) Is modeling users and items separately important to collaborative filtering? In
Section 3, we proposed two mixture models for collaborative filtering. Unlike previous
probabilistic models for collaborative filtering, the proposed mixture models introduce two
different class variables for modeling users and items separately, and are able to cluster users
and items simultaneously. In this experiment, we compare them to both the Aspect
Model (AM) and the Bayesian Clustering algorithm (BC) to see whether modeling users and items
separately is effective for collaborative filtering.
2) Would it be beneficial to allow a user/item to belong to multiple clusters? The difference
between the two proposed mixture models is that the joint mixture model (JMM) assumes that each user
belongs to a single user class, while the flexible mixture model (FMM) allows each user to be
of multiple user classes. By comparing these two models, we are able to see which
assumption is more appropriate for collaborative filtering.
3) Which smoothing technique is more effective? In Section 4.2, we discussed two
different methods for smoothing the EM algorithm: an annealed EM algorithm
(AEM) and a MAP approach. Both approaches try to prevent the parameter
estimates from becoming skewed at the early stage of the EM iterations. In this
experiment, we compare the effectiveness of the two smoothing methods for
collaborative filtering.
4) Would modeling the distinction between the preferences and ratings help improve the
performance? In order to see the effectiveness of the ‘DM’ model, we compare it to the
flexible mixture model (FMM), which is essentially the same model as ‘DM’ except that the
‘DM’ model uses two sets of class variables to describe the preference patterns and rating
patterns of each user while the FMM model only uses one set of class variables.
5) How effective are the proposed models compared to other existing approaches? In this
experiment, we compare these mixture models with other graphic models and with
memory-based approaches. In previous studies, model-based approaches have tended to show
mixed results when compared with memory-based approaches (Breese et al., 1998). It is
thus interesting to see if our models, which decouple the preference patterns from the rating
patterns, can outperform memory-based approaches.
                              MovieRating   EachMovie
Number of Users               500           2000
Number of Items               1000          1682
Avg. # of rated Items/User    87.7          129.6
Number of Ratings             5             6

Table 2: Characteristics of MovieRating and EachMovie.
Two datasets of movie ratings are used in our experiments, i.e., ‘MovieRating’¹ and
‘EachMovie’². Specifically, we extracted a subset of 2,000 users with more than 40 ratings from
‘EachMovie’, since evaluation based on users with few ratings can be unreliable. The global
statistics of these two datasets as used in our experiments are summarized in Table 2.
A major challenge in collaborative filtering applications is for the system to operate
effectively when it has not yet acquired a large amount of training data (i.e., the so-called “cold
start” problem). To test our algorithms in such a challenging and realistic scenario, we vary the
number of training users from a small value to a large value. To get a better sense of the sparseness of
the training data, we introduce a measurement called ‘movie coverage’, which is computed by
multiplying the number of training users by the average number of movies rated by each user
and dividing by the total number of movies in the dataset. In other words, the ‘movie
coverage’ measures the average number of times that each movie has been rated in the training data.
In particular, we consider three different cases of training data:
1) Small Training. For this case, we use only the rating information of the first 20 users as the
training data for both datasets, and the remaining users as testing users. The ‘movie coverage’
is only 1.8 for the ‘MovieRating’ dataset and 1.5 for the ‘EachMovie’ dataset, i.e.,
on average each movie is rated by fewer than 2 training users.
2) Medium Training. For this case, we use the rating information of the first 100 users as
training users for the ‘MovieRating’ dataset and the first 200 users for the ‘EachMovie’ dataset. The
‘movie coverage’ for this case is 8.8 and 15.4, respectively, which is substantially larger than in the case of
small training.
3) Large Training. For this case, we use the rating information of the first 200 users as training
users for the ‘MovieRating’ dataset and the first 400 users for the ‘EachMovie’ dataset. The ‘movie
coverage’ for this case is 17.7 and 30.8, respectively.
¹ http://www.cs.usyd.edu.au/~irena/movie_data.zip
² http://research.compaq.com/SRC/eachmovie
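The ‘movie coverage’ figures above follow directly from the statistics in Table 2:

```python
def movie_coverage(n_train_users, avg_rated_per_user, n_items):
    """Average number of times each movie is rated in the training data."""
    return n_train_users * avg_rated_per_user / n_items

# MovieRating: 1000 items, 87.7 rated items per user on average
print(round(movie_coverage(20, 87.7, 1000), 1))     # 1.8 (small training)
print(round(movie_coverage(100, 87.7, 1000), 1))    # 8.8 (medium training)
# EachMovie: 1682 items, 129.6 rated items per user on average
print(round(movie_coverage(200, 129.6, 1682), 1))   # 15.4 (medium training)
print(round(movie_coverage(400, 129.6, 1682), 1))   # 30.8 (large training)
```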
Going from ‘small training’ to ‘medium training’ to ‘large training’, we increase the
‘movie coverage’, i.e., the average number of times that each movie has been rated in the training
dataset, from less than 2 to around 20 or 30. With this large variation in training data, we are able
to examine the robustness of the learning procedure. The other dimension to be examined in this
experiment is the robustness of the models with respect to the number of given items rated by the
test user. In this experiment, we examine our models against test users with 5, 10, and 20 given
items. By varying the number of given items, we can test the robustness of the prediction
procedure.
For the mixture models, namely the joint mixture model (JMM) and the flexible mixture
model (FMM), the numbers of classes for users and items, i.e., |Zy| and |Zx|, are set to 10 and
20, respectively. For the decoupled model (DM), the numbers of classes for items and users are the same as for the
mixture models, and the number of classes for rating patterns, i.e., |ZR|, is set to 3. For the
previously studied models, we use similar numbers: the number of clusters in the
Bayesian Clustering algorithm (BC) is set to 10 and the number of classes in the Aspect Model
(AM) is set to 20. We tried a few other values and found that all turned out with similar
performance.
For evaluation, we look at the mean absolute deviation of the predicted ratings from the
actual ratings on items that users in the test set have actually rated, i.e.,

$S_{y_0} = \frac{1}{m_{y_0}} \sum_{x \in \tilde{X}(y_0)} \left| R_{y_0}(x) - \hat{R}_{y_0}(x) \right|$    (28)
where $\hat{R}_{y_0}(x)$ is the predicted rating on item x by user $y_0$, $R_{y_0}(x)$ is the actual rating on item x by
user $y_0$, and $m_{y_0}$ is the number of test items that have been rated by the test user $y_0$. We refer to this
measure as the mean absolute error (MAE) in the rest of this paper. There are other
measures, such as the Receiver Operating Characteristic (ROC) as a decision-support accuracy
measure (Breese et al., 1998) and the normalized MAE. But since MAE has been the most
commonly used metric and has been reported in most previous research (Breese et al., 1998;
Herlocker et al., 1999; Melville et al., 2002; SWAMI, 2000; Pennock et al., 2000), we chose it
as the evaluation measure in our experiments to make our results more comparable.
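As a minimal sketch, Equation (28) for a single test user amounts to:

```python
def mean_absolute_error(actual, predicted):
    """Eq. (28): mean absolute deviation of the predicted ratings from the
    actual ratings over the test items rated by one user."""
    assert len(actual) == len(predicted) > 0
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# three test items with actual ratings 4, 2, 5 and hypothetical predictions
print(mean_absolute_error([4, 2, 5], [3.5, 2.0, 3.0]))   # (0.5 + 0.0 + 2.0) / 3
```

The numbers reported in the tables below are averages of this quantity over all test users.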
6.1 Experiment with Mixture Models
In this experiment, we address the first two questions listed at the
beginning of this section, namely whether modeling users and items separately is important to
collaborative filtering and whether it is beneficial to allow a user/item to belong to multiple
clusters. The results for all three types of training data and three different numbers of given items
are listed in Tables 3 and 4.
Several interesting observations can be drawn from Tables 3 and 4:
1) According to Tables 3 and 4, the flexible mixture model (FMM) performs substantially better
than the joint mixture model (JMM) in most of the configurations, except on
‘MovieRating’ when the number of training users is only 20. In the next experiment, where
smoothing methods are applied to regularize the EM algorithm, we will see that the FMM
model is able to outperform the JMM model substantially even in this single case. The only
difference between these two models is whether a user should be treated as of multiple
user types instead of a single one. The fact that the FMM model outperforms the JMM model
indicates that a user should belong to multiple user classes and each item should be allowed
to choose its own appropriate user class for rating. This hypothesis is further confirmed by the
fact that the Aspect Model is able to perform better than the Bayesian Clustering approach in
most cases (except on the EachMovie dataset when the number of training users is 400).
2) Based on Tables 3 and 4, the FMM is able to perform substantially better than the Bayesian
Clustering algorithm (BC) and the Aspect Model (AM) in most cases, except in the case of
small training. In the next experiment, we will see that with an appropriate smoothing technique,
the FMM model is able to outperform both approaches even in the case of small training. One
important difference between the proposed model and these two models is that FMM models
users and items separately with two different sets of hidden variables, while the Bayesian
Clustering algorithm and the Aspect Model (AM) only introduce a single set of class
variables for describing the rating information of users on different items. Therefore, the fact
that the FMM model performs better than the two previously studied models indicates that
modeling users and items separately is effective for collaborative filtering.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          FMM         1.000     0.994      0.990
            JMM         0.990     0.968      0.920
            BC          1.10      1.09       1.08
            AM          0.982     0.976      0.958
100         FMM         0.823     0.822      0.817
            JMM         0.868     0.868      0.854
            BC          0.968     0.946      0.941
            AM          0.882     0.856      0.836
200         FMM         0.804     0.801      0.799
            JMM         0.840     0.837      0.831
            BC          0.949     0.942      0.912
            AM          0.891     0.850      0.818

Table 3: MAE of the proposed mixture models compared to the Bayesian
Clustering algorithm (BC) and the Aspect Model (AM) on the
‘MovieRating’ dataset. A smaller value means a better performance.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          FMM         1.31      1.31       1.30
            JMM         1.38      1.37       1.36
            BC          1.46      1.45       1.44
            AM          1.28      1.24       1.23
200         FMM         1.08      1.06       1.05
            JMM         1.17      1.15       1.15
            BC          1.25      1.22       1.17
            AM          1.27      1.18       1.14
400         FMM         1.06      1.05       1.04
            JMM         1.10      1.09       1.09
            BC          1.17      1.15       1.14
            AM          1.28      1.19       1.16

Table 4: MAE of the proposed mixture models compared to the Bayesian
Clustering algorithm (BC) and the Aspect Model (AM) on the
‘EachMovie’ dataset. A smaller value means a better performance.
6.2 Experiments with Smoothing Methods
In Section 4.2, we discussed two different methods for smoothing the EM algorithms. The first
is the annealed EM algorithm (AEM), which controls the convergence rate of parameter
estimation by slowly increasing the value of ‘b’ in Equations (24)-(25). In our experiments, we increase
‘b’ from 0 to 1 with a step size of 0.1. We run three EM iterations for every value of ‘b’
and ten iterations when ‘b’ is set to 1. The second smoothing strategy is to run an EM algorithm that
maximizes the posterior (MAP) instead of the likelihood of the training data (MLE). As indicated
in Equations (19')-(22'), this method amounts to Laplacian smoothing in the estimation of
parameters. The parameters ‘a’, ‘b’, ‘c’, and ‘d’ are set as follows:
$a = \frac{\sum_y |X(y)|}{10000\,|Z_x|}, \quad b = \frac{\sum_y |X(y)|}{10000\,M\,|Z_x|}, \quad c = \frac{\sum_y |X(y)|}{10000\,N\,|Z_y|}, \quad d = \frac{\sum_y |X(y)|}{10000\,R\,|Z_x|\,|Z_y|}$
where |X(y)| stands for the number of items rated by user ‘y’. Since the previous experiment
has shown that the FMM model is substantially better than the JMM model, we focus on these two models here.
The results for the FMM model with the two smoothing methods on the ‘MovieRating’
and ‘EachMovie’ datasets are presented in Tables 5 and 6, and the results for the JMM
model with the smoothing methods on the two datasets are presented in Tables 7 and 8.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          AEM         1.000     0.994      0.990
            MAP         0.881     0.877      0.870
100         AEM         0.823     0.822      0.817
            MAP         0.821     0.820      0.813
200         AEM         0.804     0.801      0.799
            MAP         0.797     0.786      0.781

Table 5: MAE for the flexible mixture model (FMM) using different
smoothing methods on the ‘MovieRating’ dataset. ‘AEM’ stands for the
annealed EM algorithm and ‘MAP’ stands for the EM algorithm that
maximizes the posterior. A smaller value means a better performance.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          AEM         1.31      1.31       1.30
            MAP         1.23      1.22       1.22
200         AEM         1.08      1.06       1.05
            MAP         1.08      1.05       1.04
400         AEM         1.06      1.05       1.04
            MAP         1.06      1.04       1.03

Table 6: MAE for the flexible mixture model (FMM) using different
smoothing methods on the ‘EachMovie’ dataset. ‘AEM’ stands for the
annealed EM algorithm and ‘MAP’ stands for the EM algorithm that
maximizes the posterior. A smaller value means a better performance.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          AEM         0.990     0.968      0.920
            MAP         0.986     0.963      0.920
100         AEM         0.868     0.868      0.854
            MAP         0.864     0.863      0.854
200         AEM         0.840     0.837      0.831
            MAP         0.837     0.833      0.831

Table 7: MAE for the joint mixture model (JMM) using different
smoothing methods on the ‘MovieRating’ dataset. ‘AEM’ stands for the
annealed EM algorithm and ‘MAP’ stands for the EM algorithm that
maximizes the posterior. A smaller value means a better performance.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          AEM         1.38      1.37       1.36
            MAP         1.37      1.35       1.34
200         AEM         1.17      1.15       1.15
            MAP         1.17      1.15       1.14
400         AEM         1.10      1.10       1.09
            MAP         1.10      1.09       1.09

Table 8: MAE for the joint mixture model (JMM) using different
smoothing methods on the ‘EachMovie’ dataset. ‘AEM’ stands for the
annealed EM algorithm and ‘MAP’ stands for the EM algorithm that
maximizes the posterior. A smaller value means a better performance.
Three observations can be drawn from Tables 5-8:
1) According to Tables 5-8, the proposed MAP approach (i.e., maximizing the posterior) is able to
outperform the annealed EM algorithm for both the JMM model and the FMM model in all the
cases. As a matter of fact, if we compare the results using AEM in Tables 5-8 to the results
without smoothing in Tables 3 and 4, we can see that the AEM algorithm only achieves the
same performance as the original EM algorithm in all the cases. These two facts indicate that the MAP
approach is an effective method for collaborative filtering, while the AEM algorithm does not
have any impact on the performance of the mixture models.
2) With a more careful examination of Tables 5 and 6, we can see that the MAP smoothing
method is able to improve the performance of the FMM substantially when the number of
training users is small (i.e., 20 for both ‘MovieRating’ and ‘EachMovie’). However, the
improvement becomes very modest when the number of training users becomes large (i.e., 100
and 200 for ‘MovieRating’, and 200 and 400 for ‘EachMovie’). This is consistent with the
spirit of Bayesian statistics, namely that the model prior is important and useful only when the
amount of training data is small. When the amount of training data is sufficient, the effect of the
model prior becomes negligible.
3) In the previous experiment, the Aspect Model was the winner in the case of small training. With
the help of appropriate smoothing, the FMM model is able to perform better than the Aspect
Model even in the case of small training. This fact again indicates that the smoothing method
is able to effectively alleviate the difficulty caused by sparse data.
Due to the success of the MAP smoothing method, it is used for the remaining experiments.
6.3 Experiments with DM
Compared to the other four models, the decoupled model introduced in Section 3 is able to
address the distinction between preferences and ratings by explicitly or implicitly modeling the
preference patterns and rating patterns of users separately. In this experiment, we answer the
question of whether modeling the distinction between preferences and ratings helps improve
the performance. The results for the DM model on the ‘MovieRating’ and ‘EachMovie’ datasets are
listed in Tables 9 and 10, together with the results for the FMM model (copied from Tables 5 and 6),
because the FMM model is a close relative of the DM model and differs from it only in
the lack of modeling for rating patterns. By comparing the performance of the DM model to that
of the FMM model, we are able to see whether introducing separate hidden variables for preferences and
ratings is effective for collaborative filtering.
By comparing the ‘DM’ model to its baseline peer, i.e., the FMM model, we can see that the
DM model outperforms the FMM model in all the cases. These two models have exactly the same
setup except that the DM model introduces the extra hidden nodes ZR and Zpref in order to account
for the variance in the rating behavior among users with similar interests. Although the
difference appears to be insignificant in some cases, it is interesting to note that as the number
of given rated items increases, the gap between ‘DM’ and the baseline model also increases. This
may suggest that when there are only a small number of items with given ratings, it is rather
difficult to determine the type of rating pattern for the test user. As the number of given items
increases, this ambiguity decreases quickly, and therefore the advantage of the ‘DM’ model
over the FMM model becomes clearer. Indeed, it is a bit surprising that even with only five
rated items and only a couple of hundred users the ‘DM’ model still slightly improves the
performance, as ‘DM’ has many more parameters to learn than the baseline model. We suspect
that the skewed distribution of ratings among items, i.e., a few items accounting for a large number
of ratings, may have helped.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          DM          0.874     0.871      0.860
            FMM         0.881     0.877      0.870
100         DM          0.814     0.810      0.799
            FMM         0.821     0.820      0.813
200         DM          0.790     0.777      0.761
            FMM         0.797     0.786      0.781

Table 9: MAE for the flexible mixture model (FMM) and the decoupled
model (DM) on the ‘MovieRating’ dataset. A smaller value means a better
performance.
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          DM          1.20      1.18       1.17
            FMM         1.23      1.22       1.22
200         DM          1.07      1.04       1.03
            FMM         1.08      1.05       1.04
400         DM          1.05      1.03       1.02
            FMM         1.06      1.04       1.03

Table 10: MAE for the flexible mixture model (FMM) and the decoupled
model (DM) on the ‘EachMovie’ dataset. A smaller value means a better
performance.
6.4 Experiments with the Comparison to Other Approaches
In this subsection, we compare all five mixture models to memory-based approaches for
collaborative filtering, including Personality Diagnosis (PD), the Vector Similarity method (VS),
and the Pearson Correlation Coefficient method (PCC). In the following, we first briefly
introduce the three memory-based approaches and then present the empirical results.
6.4.1 Memory-based Methods for Collaborative Filtering
Memory-based algorithms store the rating examples of training users and predict a test user’s
ratings based on the corresponding ratings of the users in the training database that are similar to
the test user. Three commonly used methods will be compared in this experiment. They are:
• Pearson Correlation Coefficient (PCC)
According to (Resnick et al., 1994), the Pearson Correlation Coefficient method predicts the rating
of a test user $y_0$ on item x as:

$\hat{R}_{y_0}(x) = \bar{R}_{y_0} + \frac{\sum_{y \in Y} w_{y_0, y} \left( R_y(x) - \bar{R}_y \right)}{\sum_{y \in Y} w_{y_0, y}}$

where the coefficient $w_{y_0, y}$ is computed as

$w_{y_0, y} = \frac{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} (R_y(x) - \bar{R}_y)(R_{y_0}(x) - \bar{R}_{y_0})}{\sqrt{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} (R_y(x) - \bar{R}_y)^2} \sqrt{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} (R_{y_0}(x) - \bar{R}_{y_0})^2}}$
• Vector Similarity (VS)
This method is very similar to the PCC method except that the correlation coefficient $w_{y_0, y}$ is
computed as:

$w_{y_0, y} = \frac{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} R_y(x)\, R_{y_0}(x)}{\sqrt{\sum_{x \in \tilde{X}(y)} R_y(x)^2} \sqrt{\sum_{x \in \tilde{X}(y_0)} R_{y_0}(x)^2}}$
• Personality Diagnosis (PD)
In the Personality Diagnosis model, the observed rating of the test user $y^t$ on an item x is
assumed to be drawn from an independent normal distribution whose mean is the true rating
$R^{True}_{y^t}(x)$:

$P(R_{y^t}(x) \mid R^{True}_{y^t}(x)) \propto e^{-(R_{y^t}(x) - R^{True}_{y^t}(x))^2 / 2\sigma^2}$

where the standard deviation $\sigma$ is set to the constant 1 in our experiments. Then, the probability of
generating the observed rating values of the test user by any user y in the training database can be
written as:

$P(R_{y^t} \mid R_y) = \prod_{x \in X(y^t)} e^{-(R_y(x) - R_{y^t}(x))^2 / 2\sigma^2}$

The likelihood for the test user $y^t$ to rate an unseen item x as category r is computed as:

$P(R_{y^t}(x) = r) \propto \sum_y P(R_{y^t} \mid R_y)\, e^{-(R_y(x) - r)^2 / 2\sigma^2}$
The final predicted rating for item ‘x’ by the test user will be the rating category ‘r’ with the
highest likelihood $P(R_{y^t}(x) = r)$. Empirical studies have shown that the PD method is able to
outperform several other approaches for collaborative filtering (Pennock et al., 2000).
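A brute-force sketch of PD with users as {item: rating} dicts; the toy training database below is hypothetical:

```python
import math

def pd_predict(test_ratings, train_users, item, categories, sigma=1.0):
    """Personality Diagnosis sketch: the score of rating category r for
    `item` sums, over training users y who rated `item`, the Gaussian
    likelihood of the test user's observed ratings under y's profile
    times exp(-(R_y(item) - r)^2 / (2 * sigma^2))."""
    best_r, best_score = None, -1.0
    for r in categories:
        score = 0.0
        for ry in train_users:
            if item not in ry:
                continue
            lik = 1.0
            for x, rt in test_ratings.items():
                if x in ry:
                    lik *= math.exp(-((ry[x] - rt) ** 2) / (2 * sigma ** 2))
            score += lik * math.exp(-((ry[item] - r) ** 2) / (2 * sigma ** 2))
        if score > best_score:
            best_r, best_score = r, score
    return best_r   # rating category with the highest likelihood

train = [{1: 5, 2: 4, 3: 5}, {1: 2, 2: 1, 3: 1}]   # hypothetical training profiles
test = {1: 5, 2: 4}                                # the test user's given ratings
print(pd_predict(test, train, item=3, categories=range(1, 6)))   # 5
```

The test user's profile matches the first training user, so the prediction follows that user's rating of the unseen item.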
6.4.2 Empirical Results
Training    Algorithm   5 Items   10 Items   20 Items
Users Size              Given     Given      Given
20          PCC         0.912     0.840      0.812
            VS          0.912     0.840      0.812
            PD          0.888     0.882      0.875
            AM          0.982     0.976      0.958
            BC          1.10      1.09       1.08
            DM          0.874     0.871      0.860
            FMM         0.881     0.877      0.870
            JMM         0.986     0.963      0.920
100         PCC         0.881     0.832      0.809
            VS          0.859     0.834      0.823
            PD          0.839     0.826      0.818
            AM          0.882     0.856      0.836
            BC          0.968     0.946      0.941
            DM          0.814     0.810      0.799
            FMM         0.821     0.820      0.813
            JMM         0.864     0.863      0.854
200         PCC         0.878     0.828      0.801
            VS          0.862     0.950      0.854
            PD          0.835     0.816      0.806
            AM          0.891     0.850      0.818
            BC          0.949     0.942      0.912
            DM          0.790     0.777      0.761
            FMM         0.797     0.786      0.781
            JMM         0.837     0.833      0.831

Table 11: MAE for eight different models on the ‘MovieRating’ dataset,
including the Pearson Correlation Coefficient approach (PCC), the Vector
Similarity approach (VS), the Personality Diagnosis approach (PD), the
Aspect Model (AM), the Bayesian Clustering approach (BC), the decoupled
model (DM), the flexible mixture model (FMM), and the joint mixture
model (JMM). A smaller value means a better performance.
The results are shown in Tables 11 and 12. The proposed models ‘DM’ and ‘FMM’ are
substantially better than all existing methods for collaborative filtering, including both memory-based
and model-based approaches, except for the case when the number of training users is only 20, in which
the memory-based methods perform substantially better than all model-based approaches. The overall
success of the DM model and the FMM model suggests that, compared with memory-based approaches,
graphic models are not only advantageous in principle, but also empirically superior due to their
capability of capturing the distinction between preference patterns and rating patterns in a principled way.
The fact that the memory-based approaches outperform the model-based approaches in the
case of small training data can be explained by the fact that the amount of training data is actually
smaller than the number of parameters required by the models. When there are only 20 training
users, the number of rating examples is small (about 1,700 for the ‘MovieRating’ dataset
and 2,500 for the ‘EachMovie’ dataset), while the number of parameters is much larger (over 20,000 for
the ‘MovieRating’ dataset and over 30,000 for the ‘EachMovie’ dataset).
Therefore, when there are only 20 training users, the amount of training data is not sufficient for
creating a reliable and effective model for collaborative filtering. Based on this analysis, we can
see that the performance of model-based approaches usually depends strongly on the availability
of training data. When the amount of training data is small, it is better to use memory-based
approaches, since no reliable model can be trained on a small amount of training data.
Training
Users Size
5 Items
10 Items
20 Items
Given
Given
Given
PCC
1.26
1.19
1.18
VS
1.24
1.19
1.17
PD
1.25
1.24
1.23
AM
1.28
1.24
1.23
BC
1.46
1.45
1.44
DM
1.20
1.18
1.17
FMM
1.23
1.22
1.22
JMM
1.37
.135
1.34
PCC
1.22
1.16
1.13
VS
1.25
1.24
1.26
PD
1.19
1.16
1.15
AM
1.27
1.18
1.14
BC
1.25
1.22
1.17
DM
1.07
1.04
1.03
FMM
1.08
1.05
1.04
JMM
1.17
1.15
1.14
PCC
1.22
1.16
1.13
VS
1.32
1.33
1.37
PD
1.18
1.16
1.15
AM
1.28
1.19
1.16
BC
1.17
1.15
1.14
DM
1.05
1.03
1.02
FMM
1.06
1.04
1.03
JMM
1.10
1.09
1.09
Table 12: MAE for eight different models on the 'EachMovie' dataset: the
Pearson Correlation Coefficient approach (PCC), the Vector Similarity
approach (VS), the Personality Diagnosis approach (PD), the Aspect Model
(AM), the Bayesian Clustering approach (BC), the Decoupled Model (DM),
the Flexible Mixture Model (FMM), and the Joint Mixture Model (JMM). A
smaller value means better performance.
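For reference, the MAE reported throughout these tables is the average absolute deviation between predicted and actual ratings; a minimal sketch:

```python
def mean_absolute_error(predicted, actual):
    """MAE over paired predicted/actual ratings; smaller is better."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# e.g. predictions off by 1, 2, and 0 rating points give an MAE of 1.0
mae = mean_absolute_error([4, 1, 5], [3, 3, 5])
```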
7. Conclusions and Future Work
In this paper, we focus on two important issues in collaborative filtering:
1) Should users and items be modeled separately, and should a user be allowed to belong to
multiple user classes?
2) How can we address the fact that users with similar tastes can have very different rating
behaviors?
Two mixture models are proposed to address the first problem: the flexible mixture model
(FMM) assumes that each user can belong to multiple user classes, whereas the joint mixture
model (JMM) forces each user into a single user class. For the second question, two
preference-based models are proposed: the decoupled model (DM) avoids the variance in rating
patterns by decoupling the rating patterns from the preference patterns, whereas the model for
preferred ordering (MP) tries to achieve a similar effect by modeling the relative orderings of
items instead of the absolute values of ratings.
Empirical results with mixture models show that the FMM model is consistently better than
the JMM model under the MAE measure, which indicates that a user is better modeled as a
mixture of multiple user types rather than a single one. Meanwhile, the success of the FMM
model over previously studied models such as the Bayesian Clustering algorithm and the Aspect
Model implies that it is better to model users and items separately. Empirical results with
preference-based models show that the DM model performs consistently better than the MP
model, which is somewhat expected given the way the MP model is designed. Furthermore, the
experiments confirmed that decoupling rating patterns from preference patterns is important for
collaborative filtering, and that modeling such a decoupling in a graphic model leads to improved
performance. Comparison with other methods for collaborative filtering indicates that the
proposed method is superior, suggesting the advantages of graphic models for collaborative filtering.
The idea of modeling preferences has also been explored in other related work (Ha &
Haddawy, 1998; Freund et al., 1998; Cohen et al., 1999). We plan to further explore this direction
by considering all these different approaches and using a more appropriate evaluation criterion,
such as one based on inconsistent orderings. We also believe that the decoupling problem we
addressed may represent a more general need for modeling "noise" in similar problems, such as
gene microarray data analysis in bioinformatics. We plan to explore a more general framework
for all these similar problems.
Acknowledgements
This work was supported in part by the Advanced Research and Development Activity (ARDA)
under contract number MDA908-00-C-0037 and the National Science Foundation under
Cooperative Agreement No. IRI-9817496.
Appendix A: Proof of EM Updating Equations for Joint Mixture Model
In Section 3.1, we show the EM updating equations for the joint mixture model (JMM). Here
comes the proof of those Equations.
The goal of the iterative procedure is to maximize the log-likelihood of the training data, i.e.,

$$L = \sum_y \log P\big(\{x_i, R_y(x_i)\}_{i=1}^{M} \,\big|\, y\big) = \sum_y \log \sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} \sum_{z_x} P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y) \qquad \text{(a1)}$$
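The log-likelihood of Equation (a1) can be evaluated directly on toy data. The following sketch is a minimal illustration with hypothetical distributions (stored as plain lists indexed by class), not the experimental configuration.

```python
import math

def jmm_log_likelihood(users, p_zy_given_y, p_zx, p_x_given_zx, p_r_given_zx_zy):
    """Evaluate Eq. (a1): for each user y, sum over user classes z_y of
    P(z_y|y) times the product over rated items of
    sum_{z_x} P(z_x) P(x_i|z_x) P(R_y(x_i)|z_x, z_y)."""
    log_lik = 0.0
    for y, ratings in users.items():            # ratings: list of (item, rating)
        per_user = 0.0
        for zy, p_zy in enumerate(p_zy_given_y[y]):
            prod = p_zy
            for x, r in ratings:
                prod *= sum(p_zx[zx] * p_x_given_zx[zx][x] * p_r_given_zx_zy[zx][zy][r]
                            for zx in range(len(p_zx)))
            per_user += prod
        log_lik += math.log(per_user)
    return log_lik

# Toy setup: one user, two items, two classes on each side, binary ratings,
# with every distribution uniform except P(z_x).
uniform = [0.5, 0.5]
L = jmm_log_likelihood(
    users={0: [(0, 1), (1, 0)]},
    p_zy_given_y={0: uniform},
    p_zx=[0.6, 0.4],
    p_x_given_zx=[uniform, uniform],
    p_r_given_zx_zy=[[uniform, uniform], [uniform, uniform]],
)
# Each per-item mixture collapses to 1/4, so L = log(1/16) here.
```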
Let $\Theta$ stand for the model of the current iteration, i.e.,
$\Theta = (\{P(z_x)\}, \{P(x \mid z_x)\}, \{P(z_y \mid y)\}, \{P(r \mid z_x, z_y)\})$, and $\Theta'$ stand for the model obtained from the
previous iteration, i.e., $\Theta' = (\{P'(z_x)\}, \{P'(x \mid z_x)\}, \{P'(z_y \mid y)\}, \{P'(r \mid z_x, z_y)\})$. Following the spirit of the EM
algorithm, the goal is to maximize the difference between the log-likelihoods of two consecutive
iterations, which can be written as:
$$L(\Theta) - L(\Theta') = \sum_y \log \frac{\sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} \sum_{z_x} P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{\sum_{z_y} P'(z_y \mid y) \prod_{i=1}^{M} \sum_{z_x} P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} \qquad \text{(a2)}$$
By defining

$$P\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) = \frac{P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}{\sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)} \qquad \text{(a3)}$$

we can rewrite Equation (a2) as:
L( )  L( ' )
M


P( z y | y )  P( z x ) P( xi | z x ) P( R y ( xi ) | z x , z y ) 



z
i

1
x
  log  P' ( z y | {xi , R y ( xi )}iM1 , y )

M
y
z y
P' ( z y | y )  P' ( z x ) P' ( xi | z x ) P' ( R y ( xi ) | z x , z y ) 


i 1 z x


M


P
(
z
|
y
)
 P( z x ) P( xi | z x ) P( R y ( xi ) | z x , z y ) 

y



i 1 z x
   P' ( z y | {xi , R y ( xi )}iM1 , y ) log 

M
y zy
 P' ( z y | y )  P' ( z x ) P' ( xi | z x ) P' ( R y ( xi ) | z x , z y ) 


i 1 z x



   P' ( z y | {xi , R y ( xi )}iM1 , y ) log P( z y | y )
(a4)

y zy

M
  P' ( z y | {xi , R y ( xi )}i 1 ,
y zy
  P( z x ) P( xi | z x ) P( R y ( xi ) | z x , z y ) 
 z

y ) log  x

i
  P' ( z x ) P' ( xi | z x ) P' ( R y ( xi ) | z x , z y ) 
 zx


   P' ( z y | {xi , R y ( xi )}iM1 , y ) log P' ( z y | y )

y zy
The second step in the above equation uses Jensen's inequality. The final expression has three
terms: the first term contains only the parameter $P(z_y \mid y)$. Therefore, by setting the derivative of
the expression with respect to $P(z_y \mid y)$ to zero (subject to the normalization constraint), we obtain
the updating equation for $P(z_y \mid y)$ listed in Equation (18). The last term does not contain any
model parameters of the current iteration and can simply be ignored. The second term is the
most sophisticated one; we can simplify it by applying Jensen's inequality again, i.e.,
$$\begin{aligned}
&\sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \log \frac{\sum_{z_x} P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{\sum_{z_x} P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} \\
&\geq \sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \sum_{z_x} P'\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) \log \frac{P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} \\
&= \sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \sum_{z_x} P'\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) \log P(z_x) \\
&\quad + \sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \sum_{z_x} P'\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) \log P(x_i \mid z_x) \\
&\quad + \sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \sum_{z_x} P'\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) \log P(R_y(x_i) \mid z_x, z_y) \\
&\quad - \sum_y \sum_{z_y} P'\big(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y\big) \sum_{i=1}^{M} \sum_{z_x} P'\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) \log \big(P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)\big)
\end{aligned} \qquad \text{(a5)}$$
where $P\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big)$ is defined as

$$P\big(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\big) = \frac{P(z_x^i)\, P(x_i \mid z_x^i)\, P(R_y(x_i) \mid z_x^i, z_y)}{\sum_{z_x^i} P(z_x^i)\, P(x_i \mid z_x^i)\, P(R_y(x_i) \mid z_x^i, z_y)} \qquad \text{(a6)}$$
In the last expression of Equation (a5), we have four terms in the sum. The first three terms
involve the three parameters $P(z_x)$, $P(x_i \mid z_x)$, and $P(r \mid z_x, z_y)$, respectively. Therefore, by setting
the derivatives of Equation (a5) with respect to each of these parameters to zero, we obtain the
updating equations listed in Equations (19)-(21). The last term in Equation (a5) has nothing to do
with the parameters of the current iteration and can therefore be ignored.
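Both lower bounds in this appendix rest on the same fact: since log is concave, the log of a convex combination is at least the convex combination of the logs. A quick numeric check with arbitrary (hypothetical) weights and positive values:

```python
import math

weights = [0.2, 0.3, 0.5]          # any distribution: nonnegative, sums to 1
values = [2.0, 0.5, 1.5]           # any positive values

lhs = math.log(sum(w * v for w, v in zip(weights, values)))
rhs = sum(w * math.log(v) for w, v in zip(weights, values))
# Jensen's inequality for the concave log guarantees lhs >= rhs
```

In the derivation above, the weights are the previous iteration's posteriors $P'(\cdot)$, which is exactly what makes the bound tight at $\Theta = \Theta'$.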
Appendix B: Proof of EM Updating Equations for the MAP Smoothing Method
In Section 3.3, we show that by introducing a Dirichlet prior on the parameters, we can maximize
the posterior of the model instead of its likelihood, which results in the set of updating
Equations (25')-(28'). In this appendix, we prove that those EM updating equations are correct.
For simplicity, we give the proof only for the flexible mixture model (FMM); the proof for the
joint mixture model (JMM) is almost identical. Putting Equations (23) and (32) together, we have
the logarithm of the posterior for the FMM model expressed as:
$$Q = \sum_y \sum_i \log\Big( \sum_{z_x, z_y} P(z_y \mid y)\, P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y) \Big) + a \sum_{z_x} \log P(z_x) + b \sum_i \sum_{z_x} \log P(x_i \mid z_x) + c \sum_y \sum_{z_y} \log P(z_y \mid y) + d \sum_r \sum_{z_x, z_y} \log P(r \mid z_x, z_y) \qquad \text{(b1)}$$
As in the proof presented in Appendix A, $\Theta$ stands for the model of the current iteration, i.e.,
$\Theta = (\{P(z_x)\}, \{P(x \mid z_x)\}, \{P(z_y \mid y)\}, \{P(r \mid z_x, z_y)\})$, and $\Theta'$ stands for the model obtained from the
previous iteration, i.e., $\Theta' = (\{P'(z_x)\}, \{P'(x \mid z_x)\}, \{P'(z_y \mid y)\}, \{P'(r \mid z_x, z_y)\})$. The difference in the
logarithm of the posteriors between two consecutive iterations can be written as:
$$\begin{aligned}
Q(\Theta) - Q(\Theta') &= \sum_y \sum_i \log \frac{\sum_{z_x, z_y} P(z_y \mid y)\, P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{\sum_{z_x, z_y} P'(z_y \mid y)\, P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} \\
&\quad + a \sum_{z_x} \log \frac{P(z_x)}{P'(z_x)} + b \sum_i \sum_{z_x} \log \frac{P(x_i \mid z_x)}{P'(x_i \mid z_x)} + c \sum_y \sum_{z_y} \log \frac{P(z_y \mid y)}{P'(z_y \mid y)} + d \sum_r \sum_{z_x, z_y} \log \frac{P(r \mid z_x, z_y)}{P'(r \mid z_x, z_y)} \\
&= \sum_y \sum_i \log \sum_{z_x, z_y} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big)\, \frac{P(z_y \mid y)\, P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{P'(z_y \mid y)\, P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} + (\text{the same prior terms}) \\
&\geq \sum_y \sum_i \sum_{z_x, z_y} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big) \log \frac{P(z_y \mid y)\, P(z_x)\, P(x_i \mid z_x)\, P(R_y(x_i) \mid z_x, z_y)}{P'(z_y \mid y)\, P'(z_x)\, P'(x_i \mid z_x)\, P'(R_y(x_i) \mid z_x, z_y)} + (\text{the same prior terms}) \\
&= \sum_{z_x} \Big( a + \sum_{y, i, z_y} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big) \Big) \log \frac{P(z_x)}{P'(z_x)} \\
&\quad + \sum_i \sum_{z_x} \Big( b + \sum_{y, z_y} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big) \Big) \log \frac{P(x_i \mid z_x)}{P'(x_i \mid z_x)} \\
&\quad + \sum_y \sum_{z_y} \Big( c + \sum_{i, z_x} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big) \Big) \log \frac{P(z_y \mid y)}{P'(z_y \mid y)} \\
&\quad + \sum_r \sum_{z_x, z_y} \Big( d + \sum_{y, i} P'\big(z_x, z_y \mid x_i, y, R_y(x_i)\big)\, \delta(r, R_y(x_i)) \Big) \log \frac{P(r \mid z_x, z_y)}{P'(r \mid z_x, z_y)}
\end{aligned} \qquad \text{(b2)}$$
In the last expression of the above equation, all the parameters are decoupled into four
independent terms. Therefore, the updating equations can be obtained by simply setting the
derivatives of the last expression with respect to each of the four parameters
$P(z_x)$, $P(x \mid z_x)$, $P(z_y \mid y)$, and $P(r \mid z_x, z_y)$ to zero, which results in Equations (25')-(28').
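The net effect of the Dirichlet prior terms in (b2) on each M-step update is pseudo-count smoothing: the hyperparameter is added to the expected counts before normalization. A generic sketch (the counts and pseudo-count value below are hypothetical):

```python
def map_multinomial_update(expected_counts, pseudo_count):
    """MAP update for a multinomial under a symmetric Dirichlet prior:
    add the pseudo-count to each expected count, then normalize."""
    total = sum(expected_counts) + pseudo_count * len(expected_counts)
    return [(c + pseudo_count) / total for c in expected_counts]

# A class with zero expected count still receives nonzero probability:
probs = map_multinomial_update([8.0, 2.0, 0.0], pseudo_count=1.0)
# probs = [9/13, 3/13, 1/13]
```

This is what keeps rarely observed ratings or classes from being assigned zero probability when training data is sparse.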
References
Breese, J. S., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for
collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in
Artificial Intelligence.
Cohen, W., Schapire, R., & Singer, Y. (1998). Learning to order things. In Advances in Neural
Information Processing Systems 10, Denver, CO, 1997. MIT Press.
O'Connor, M., & Herlocker, J. (2001). Clustering items for collaborative filtering. In
Proceedings of the SIGIR-2001 Workshop on Recommender Systems, New Orleans, LA.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, B39: 1-38.
Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (1998). An efficient boosting algorithm for
combining preferences. In Proceedings of ICML 1998.
Ha, V., & Haddawy, P. (1998). Toward case-based preference elicitation: Similarity measures
on preference structures. In Proceedings of UAI 1998.
Hofmann, T., & Puzicha, J. (1999). Latent class models for collaborative filtering. In
Proceedings of the International Joint Conference on Artificial Intelligence.
Hofmann, T., & Puzicha, J. (1998). Statistical models for co-occurrence data (Technical report).
Artificial Intelligence Laboratory Memo 1625, M.I.T.
Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000). Collaborative filtering by
personality diagnosis: A hybrid memory- and model-based approach. In Proceedings of
the Sixteenth Conference on Uncertainty in Artificial Intelligence.
Popescul, A., Ungar, L. H., Pennock, D. M., & Lawrence, S. (2001). Probabilistic models for
unified collaborative and content-based recommendation in sparse-data environments. In
Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open
architecture for collaborative filtering of netnews. In Proceedings of the ACM 1994
Conference on Computer Supported Cooperative Work.
Ross, D. A., & Zemel, R. S. (2002). Multiple-cause vector quantization. In Advances
in Neural Information Processing Systems 15 (NIPS-15).
Ueda, N., & Nakano, R. (1998). Deterministic annealing EM algorithm. Neural
Networks, 11(2): 271-282.
Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for
performing collaborative filtering. In Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999).
Melville, P., Mooney, R. J., & Nagarajan, R. (2002). Content-boosted collaborative filtering for
improved recommendations. In Proceedings of the Eighteenth National Conference on
Artificial Intelligence (AAAI 2002).
SWAMI: A framework for collaborative filtering algorithm development and evaluation. In
Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR 2000).
Si, L., & Jin, R. (2003). Product space mixture model for collaborative filtering. In Proceedings
of the Twentieth International Conference on Machine Learning (ICML 2003).
Jin, R., Si, L., & Zhai, C. (2003). Preference-based graphic models for collaborative filtering. In
Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI 2003).
Hofmann, T. (2003). Gaussian latent semantic models for collaborative filtering. In Proceedings
of the 26th Annual International ACM SIGIR Conference.