Journal of Machine Learning Research

A Study of Mixture Models for Collaborative Filtering

Rong Jin RONG@CS.CMU.EDU
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

Luo Si LSI@CS.CMU.EDU
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

Chengxiang Zhai CZHAI@CS.UIUC.EDU
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

Alex G. Hauptmann ALEX@CS.CMU.EDU
Department of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

Abstract

Collaborative filtering is a very useful general technique for exploiting the preference patterns of a group of users to predict the utility of items for a particular user. Three components need to be modeled in a collaborative filtering problem: the users, the items, and the ratings. Although previous research on applying probabilistic models to collaborative filtering has shown promising results, there has been no systematic study of how to model each of the three components and their interactions. In this paper, we conduct a broad and systematic study of mixture models for collaborative filtering. We discuss general issues related to using a mixture model for collaborative filtering and propose three desirable properties that a graphical model for this task should satisfy. Using these properties, we thoroughly examine five different mixture (probabilistic) models: Bayesian Clustering (BC), the Aspect Model (AM), the Flexible Mixture Model (FMM), the Joint Mixture Model (JMM), and the Decoupled Model (DM). We compare these models both analytically and experimentally. Experiments over two datasets of movie ratings under several different configurations show that, in general, whether a model satisfies the proposed properties tends to correlate with the model's performance. In particular, the DM model, which satisfies all three properties, outperforms the other mixture models as well as several existing approaches to collaborative filtering. Our study shows that graphical models are powerful tools for modeling collaborative filtering, but careful design of the model is necessary to achieve good performance.

Keywords: collaborative filtering, graphical models, probabilistic models

1 Introduction

The rapid growth of information on the Internet demands intelligent information agents that can sift through all the available information and find what is most valuable to us. These intelligent systems can be categorized into two classes: collaborative filtering (CF) and content-based recommending. The difference between them is that collaborative filtering uses only the ratings of training users to predict ratings for test users, while content-based recommendation systems rely on the contents of items for prediction. Therefore, collaborative filtering systems have an advantage in environments where the contents of items are not available, either because of privacy concerns or because the contents are difficult for a computer to analyze. In this paper, we focus only on the collaborative filtering problem. Most collaborative filtering methods fall into two categories: memory-based algorithms and model-based algorithms (Breese et al., 1998).
Memory-based algorithms store the rating examples of users in a training database; in the prediction phase, they predict a test user's ratings based on the corresponding ratings of the training users who are similar to the test user. In contrast, model-based algorithms build models that explain the training examples well and predict the ratings of test users using the estimated models. Both types of approaches have been shown to be effective for collaborative filtering. In general, all collaborative filtering approaches assume that users with similar "tastes" rate items similarly, and the idea of clustering is exploited in all approaches either explicitly or implicitly. Compared with memory-based approaches, model-based approaches provide a more principled way of performing clustering and are also often much more efficient in terms of computational cost at prediction time. The basic idea of a model-based approach is to cluster items and/or training users into classes explicitly and to predict the ratings of a test user using the ratings of the classes that best fit the test user and/or the items to be rated. Several different probabilistic models have been proposed and studied in previous work (e.g., Breese et al., 1998; Hofmann & Puzicha, 1998; Pennock et al., 2000; Popescul et al., 2001; Ross & Zemel, 2002; Si et al., 2003; Jin et al., 2003; Hofmann, 2003). These models capture user/item similarities through probabilistic clustering in one way or another, and all have been shown to be quite promising. Most of these methods can be represented in the graphical model framework. However, there has been no systematic study and comparison of the graphical models proposed for collaborative filtering. There are both theoretical and empirical reasons for such a study: (1) theoretically, different models make different assumptions, and we need to understand the differences and connections among these models in terms of their underlying assumptions; (2) empirically, these models were evaluated with different experimental settings in previous studies, so it would be useful to see how they compare with each other under the same experimental settings. Moreover, a systematic study is necessary for explaining why some models tend to perform better than others. In this paper, we conduct a systematic study of a large subset of graphical models, namely mixture models, for collaborative filtering. Mixture models are quite natural for modeling similarities among users, items, and ratings. In general, three components need to be modeled carefully: the users, the items, and the ratings. We need not only to cluster each component into a small number of groups but also to model the interactions between the components appropriately. We propose three desirable properties that a reasonable graphical model for collaborative filtering should satisfy: (1) separate clustering of users and items; (2) flexibility for a user/item to be in multiple clusters; (3) decoupling of user preferences and rating patterns. We thoroughly analyze five different mixture models, including Bayesian Clustering (BC), the Aspect Model (AM), the Flexible Mixture Model (FMM), the Joint Mixture Model (JMM), and the Decoupled Model (DM), based on the three proposed properties. We also compare these models empirically.
Experiments over two datasets of movie ratings under several different configurations show that, in general, the fulfillment of the proposed properties tends to be positively correlated with a model's performance. In particular, the DM model, which satisfies all three properties, outperforms the other mixture models as well as several existing approaches to collaborative filtering. Our study shows that graphical models are powerful tools for modeling collaborative filtering, but careful design of the model is necessary to achieve good performance.

The rest of the paper is organized as follows. Section 2 gives a general discussion of using graphical models for collaborative filtering and presents the three desirable properties that such a model should satisfy. In Section 3, we present and examine five different mixture models in terms of their connections and differences. We discuss model estimation and rating prediction in Sections 4 and 5. Empirical studies are presented in Section 6. Conclusions and future work are discussed in Section 7.

2 Graphical Models for Collaborative Filtering

2.1 Problem definition

We first introduce some notation and formulate the problem of collaborative filtering in terms of graphical models. Let $X = \{x_1, x_2, \ldots, x_M\}$ be a set of items, $Y = \{y_1, y_2, \ldots, y_N\}$ be a set of users, and $\{1, \ldots, R\}$ be the range of ratings. In collaborative filtering, we have available a training database consisting of the ratings of some items by some users, which we denote by $\{(x_{(1)}, y_{(1)}, r_{(1)}), \ldots, (x_{(L)}, y_{(L)}, r_{(L)})\}$, where a tuple $(x_{(i)}, y_{(i)}, r_{(i)})$ means that user $y_{(i)}$ gives item $x_{(i)}$ a rating of $r_{(i)}$. Our task is to predict the rating $r$ of an unrated item $x$ by a user $y$ based on all the training information. To cast the problem in terms of graphical models, we treat each tuple $(x_{(i)}, y_{(i)}, r_{(i)})$ as an observation of three random variables: $x$, $y$, and $r$. Through the training database, we are able to model the interaction among the three random variables. There are three possible choices of likelihood that we can maximize for the training data: $p(r \mid x, y)$, $p(r, x \mid y)$, and $p(r, x, y)$. Even though these quantities are strongly related, namely $p(r, x \mid y) = p(r, x, y) / \sum_{r', x'} p(r', x', y)$ and $p(r \mid x, y) = p(r, x \mid y) / \sum_{r'} p(r', x \mid y)$, maximizing the different likelihoods explains different aspects of the data. With the first choice, $p(r \mid x, y)$, we assume that the only meaningful part of the data is the observed ratings, and that the selection of items and users is purely random. The second choice differs from the first in that it explains not only the observed ratings but also why each user selects a particular subset of movies to rate. Because only a small subset of all movies is rated by each user, it can be useful to model why a particular subset of movies is selected by a particular user. As a consequence, movies that have been rated by many users will have a significantly larger impact on the estimation of the model than movies rated by only a few users. The third choice, $p(r, x, y)$, simply models the joint distribution of the three random variables. Under this choice, the model is also concerned with the behavior of users, for example, why some users rate many movies while others rate only a few. As a result, users with many ratings will carry more weight in the model than users with only a few ratings.
Based on the above discussion, we can see that the choice of likelihood to maximize over the training dataset can have a significant impact on the final estimate. It is easy to see that most existing probabilistic approaches to collaborative filtering fall into one of these three cases. For example, the personality diagnosis method is a special case of the first, where a Gaussian distribution is assumed for $p(r \mid x, y)$, and $p(r \mid x, y)$ is computed directly by performing a Bayesian average over the models of all users. The aspect model can be regarded as a special case of the third, where a mixture model is used for $p(r, x, y)$. In this paper, we focus on the second and third cases and systematically examine the different choices of mixture models. We compare these models both analytically and experimentally.

2.2 Major issues in designing a graphical model

In general, in order to model the similarity among the users, the items, and the ratings, we need to cluster each component into a number of groups and to model the interactions between the components appropriately. More specifically, the following three important issues must be addressed:

Property 1: How should we model user similarity and item similarity? Generally, users and items come from different sets of concepts, and the two sets of concepts are coupled with each other through the rating information. Therefore, we believe that a good clustering model should explicitly model both classes of users and classes of items and should be able to leverage their correlation. This means that the choice of latent variables in our graphical model should allow for separate, yet coupled, modeling of user similarity and item similarity. However, this may lead to complex clustering models that are hard to estimate accurately with the available data. We will compare several different clustering models.

Property 2: Should a user or an item be allowed to belong to multiple clusters? Since a user can have diverse interests and an item may have multiple aspects, it is intuitively desirable to allow both items and users to be in multiple classes simultaneously. However, such a model may be too flexible to leverage user and item similarity effectively. We will compare models with different assumptions.

Property 3: How can we capture the variance in rating patterns among users with the same preferences over items? One common deficiency of most existing models is the assumption that users with similar interests rate items similarly; in reality, the rating pattern of a user is determined not only by his/her interests but also by his/her rating strategy or habit. For example, some users are more "tolerant" than others, and therefore their ratings of items tend to be higher even when they share very similar tastes. We will study how to model such variance in a graphical model.

3 Mixture Models for Collaborative Filtering

In this section, we discuss a variety of possible mixture models and examine their assumptions about user and item clustering and whether they address the variance in rating patterns.

3.1 Bayesian Clustering (BC)

The basic idea of Bayesian Clustering (BC) is to assume that users of the same type rate items similarly, and thus users can be grouped into a set of user classes according to their ratings of items.
Formally, given a user class $C$, the preferences for the various items, expressed as ratings, are assumed to be independent, and the joint probability of the user class $C$ and the ratings of items can be written in the standard naive Bayes form:

$$P(C, r_1, r_2, \ldots, r_M) = P(C) \prod_{i=1}^{M} P(r_i \mid C) \qquad (1)$$

The joint probability of the rating pattern of user $y$, i.e., $\{R_y(x_1), R_y(x_2), \ldots, R_y(x_M)\}$, can then be expanded as:

$$P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M)) = \sum_C P(C) \prod_{i \in X(y)} P(R_y(x_i) \mid C) \qquad (2)$$

As indicated by Equation (2), this model first selects a user class $C$ from the distribution $P(C)$ and then rates all the items using the same selected class $C$. In other words, this model assumes that a single user class is applied to the ratings of all the items, and it therefore rules out the case where a user belongs to multiple user classes and different user classes are applied to the ratings of different items. The parameters $P(r \mid C)$ can be learned automatically using the Expectation-Maximization (EM) algorithm. More details of this model can be found in (Breese et al., 1998). As seen from Equations (1) and (2), the Bayesian Clustering approach models the rating information directly, and no explicit clustering of users and items is performed. Moreover, this approach models the joint probability of all ratings of a single user, i.e., $P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M))$, which amounts to the assumption that a single user can only belong to a single cluster. According to the three properties introduced in the previous section, Bayesian Clustering is the simplest mixture model: there is no clustering of users and items, each user is assumed to belong to a single cluster, and there is no separation of preference patterns and rating patterns. Figure 1 illustrates the basic idea of Bayesian Clustering.

Figure 1: Graphical model representation for Bayesian Clustering.

3.2 Aspect Model (AM)

The aspect model is a probabilistic latent space model, which models individual preferences as a convex combination of preference factors (Hofmann & Puzicha, 1999). A latent class variable $z \in Z = \{z_1, z_2, \ldots, z_K\}$ is associated with each observed pair of a user and an item. The aspect model assumes that users and items are independent of each other given the latent class variable. Thus, the probability of each observed pair $(x, y)$ (i.e., an item-user pair) is calculated as follows:

$$P(x, y) = \sum_{z \in Z} P(z) P(x \mid z) P(y \mid z) \qquad (3)$$

where $P(z)$ is the class prior probability, and $P(x \mid z)$ and $P(y \mid z)$ are the class-conditional distributions over items and users, respectively. Intuitively, this model means that the preference pattern of a user is modeled by a combination of typical preference patterns, which are represented by the distributions $P(z)$, $P(x \mid z)$, and $P(y \mid z)$. There are two ways to incorporate the rating information $r$ into the basic aspect model, expressed in Equations (4) and (5), respectively:

$$P(x_{(l)}, y_{(l)}, r_{(l)}) = \sum_{z \in Z} P(z) P(x_{(l)} \mid z) P(y_{(l)} \mid z) P(r_{(l)} \mid z) \qquad (4)$$

$$P(x_{(l)}, y_{(l)}, r_{(l)}) = \sum_{z \in Z} P(z) P(x_{(l)} \mid z) P(y_{(l)} \mid z) P(r_{(l)} \mid z, x_{(l)}) \qquad (5)$$

The corresponding graphical models are shown in Figure 2.
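To make Equation (4) concrete, here is a minimal sketch, in Python with NumPy, of how the joint probability of one (item, user, rating) observation could be computed from the class prior and the class-conditional distributions. All array names, sizes, and the random parameter values are illustrative stand-ins, not part of the model above.

```python
import numpy as np

# Minimal sketch of the aspect model in Equation (4).
K, M, N, R = 20, 1682, 500, 5              # latent classes, items, users, rating levels

rng = np.random.default_rng(0)
def normalize(a, axis):                     # helper: make slices sum to one
    return a / a.sum(axis=axis, keepdims=True)

P_z   = normalize(rng.random(K), 0)         # P(z)
P_x_z = normalize(rng.random((K, M)), 1)    # P(x | z)
P_y_z = normalize(rng.random((K, N)), 1)    # P(y | z)
P_r_z = normalize(rng.random((K, R)), 1)    # P(r | z)

def joint_prob(x, y, r):
    """P(x, y, r) = sum_z P(z) P(x|z) P(y|z) P(r|z)."""
    return np.sum(P_z * P_x_z[:, x] * P_y_z[:, y] * P_r_z[:, r - 1])

print(joint_prob(x=10, y=3, r=4))
```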
The second model, in Equation (5), has to estimate the conditional probability $P(r_{(l)} \mid z, x_{(l)})$, which has a large parameter space and may not be estimated reliably.

Figure 2: Graphical models for the two extensions of the aspect model that capture rating values.

Unlike the Bayesian Clustering algorithm, where only the rating information is modeled, the aspect model is able to model the users and the items with the conditional probabilities $P(y \mid z)$ and $P(x \mid z)$. Moreover, unlike the Bayesian Clustering algorithm, which models the joint probability $P(R_y(x_1), R_y(x_2), \ldots, R_y(x_M))$ directly, the aspect model models the joint probability $P(x, y, r)$. By doing so, the aspect model gives each item the freedom to select the appropriate user class for its rating, whereas in the Bayesian Clustering algorithm the same user class is used to rate all the items. However, the aspect model introduces only a single set of hidden variables for items, users, and ratings. This encodes the clustering of users, the clustering of items, and the correlation between them together, and no attempt is made to separate the clustering of users from the clustering of items. Moreover, user preferences and rating patterns are not separated either. Therefore, according to the criteria stated in Section 2.2, the aspect model is still a preliminary model: it models the users and items in a simple way without separating the clustering of users and items; it allows a user and an item to be in multiple clusters; and it makes no attempt to model preference patterns and rating patterns separately.

3.3 Joint Mixture Model (JMM) and Flexible Mixture Model (FMM)

In this section, we examine two graphical models for collaborative filtering that differ from BC and AM in that they are able to model the classes of users and items separately. Recall that the general goal of a graphical model for collaborative filtering is to model the joint probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$, i.e., the likelihood for a user $y$ to rate a set of items $\{x_i\}_{i=1}^{M}$ with ratings $\{R_y(x_i)\}_{i=1}^{M}$. By extending the ideas of the Bayesian Clustering algorithm and the Aspect Model (AM), we obtain two different treatments of this joint probability:

1) In the first treatment, we follow the spirit of the Bayesian Clustering algorithm and expand the joint probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ as:

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \qquad (11)$$

where the variable $z_y$ stands for the class of user $y$. As indicated by Equation (11), to estimate the probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$, we first choose the user class $z_y$ according to the distribution $P(z_y \mid y)$ and then compute the likelihood of rating every item using the same user class $z_y$. Similar to the Bayesian Clustering algorithm, this model assumes that each user belongs to only a single user class. For easy reference, we call this model the Joint Mixture Model, or JMM.

2) The second treatment of $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ first expands the joint probability into a product of likelihoods $P(x_i, R_y(x_i) \mid y)$ over the items and then expands each likelihood $P(x_i, R_y(x_i) \mid y)$ over the hidden variable for the user class, i.e.,

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid y) = \prod_{i=1}^{M} \sum_{z_y} P(x_i, R_y(x_i) \mid z_y) P(z_y \mid y) \qquad (12)$$

Unlike the first treatment, where the same user class is used to rate all the items, in this treatment each item is free to choose its own user class.
For easy reference, we call this model the Flexible Mixture Model, or FMM (Si et al., 2003). Comparing Equation (12) to Equation (11), the difference between the two models lies in the order of the product and the sum. As a result of this different ordering, the FMM model allows each item to choose the appropriate user class for its rating, whereas the JMM model forces every item to use the same user class. Mapping back to the three issues discussed in Section 2.2, we can see that, for the first issue, both models are able to model the clustering of users and the clustering of items separately. For the second issue, the FMM model allows each user to be in multiple clusters, whereas the JMM model forces each user to belong to a single cluster. However, neither of the two models makes any attempt to model the difference between rating patterns and preference patterns.

The key component of these two models is the estimation of $P(x_i, R_y(x_i) \mid z_y)$. Directly estimating $P(x_i, R_y(x_i) \mid z_y)$ as part of the parameter space can lead to a severe sparse data problem, because the number of distinct $P(x_i, R_y(x_i) \mid z_y)$ entries can be quite large. In order to alleviate the sparse data problem, we further introduce a class variable $z_x$ for item $x_i$ and rewrite $P(x_i, R_y(x_i) \mid z_y)$ as:

$$P(x_i, R_y(x_i) \mid z_y) = \sum_{z_x} P(x_i, R_y(x_i), z_x \mid z_y) = \sum_{z_x} P(z_x) P(x_i, R_y(x_i) \mid z_x, z_y) = \sum_{z_x} P(z_x) P(x_i \mid z_x) P(R_y(x_i) \mid z_x, z_y) \qquad (13)$$

As indicated by Equation (13), the probability $P(x_i, R_y(x_i) \mid z_y)$ is decomposed into a sum of products of three terms: $P(z_x)$, $P(x_i \mid z_x)$, and $P(R_y(x_i) \mid z_x, z_y)$. The probability $P(z_x)$ is the class prior for item class $z_x$, $P(x_i \mid z_x)$ is the likelihood for item $x_i$ to be in item class $z_x$, and $P(R_y(x_i) \mid z_x, z_y)$ is the likelihood for user class $z_y$ to rate item class $z_x$ with rating category $R_y(x_i)$. Clearly, by introducing a small number of item classes, we can decrease the number of parameters for $P(x_i, R_y(x_i) \mid z_y)$ substantially, from $M \cdot R \cdot |Z_y|$ to $|Z_x|(1 + M + |Z_y| \cdot R)$, where $|Z_y|$ and $|Z_x|$ are the numbers of classes for users and items, respectively, and $M$ and $R$ are the numbers of items and rating categories. In summary, both models contain the following four sets of parameters:

1) $P(z_y \mid y)$: the likelihood of assigning user $y$ to the user class $z_y$;
2) $P(z_x)$: the class prior for item class $z_x$;
3) $P(x_i \mid z_x)$: the likelihood of generating item $x_i$ from the item class $z_x$;
4) $P(r \mid z_x, z_y)$: the likelihood of user class $z_y$ giving item class $z_x$ the rating category $r$.

The total number of parameters is $|Z_x|(1 + M + |Z_y| \cdot R) + N \cdot |Z_y|$, where $N$ is the number of training users. The corresponding graphical models are shown in Figure 3.

Figure 3: Graphical model representation for the Joint Mixture Model and the Flexible Mixture Model. Diagram (a) represents the joint mixture model (JMM) and diagram (b) the flexible mixture model (FMM).
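The contrast between Equations (11) and (12) is easy to see in code: the JMM sums over the user class after taking the product over items, whereas the FMM takes the product over items of per-item sums. The following sketch assumes the parameter tables of Equation (13) are already estimated; all names, sizes, and values are illustrative stand-ins rather than the paper's actual setup.

```python
import numpy as np

# Sketch contrasting Equations (11) and (12), given the decomposition of Equation (13).
Ky, Kx, M, R = 10, 20, 1682, 5
rng = np.random.default_rng(1)
def normalize(a, axis): return a / a.sum(axis=axis, keepdims=True)

P_zy_y   = normalize(rng.random(Ky), 0)            # P(z_y | y) for one user y
P_zx     = normalize(rng.random(Kx), 0)            # P(z_x)
P_x_zx   = normalize(rng.random((Kx, M)), 1)       # P(x | z_x)
P_r_zxzy = normalize(rng.random((Kx, Ky, R)), 2)   # P(r | z_x, z_y)

def p_item_rating_given_zy(x, r):
    """Equation (13): P(x_i, R_y(x_i) | z_y), returned for every z_y."""
    return np.einsum('k,k,kj->j', P_zx, P_x_zx[:, x], P_r_zxzy[:, :, r - 1])

def jmm_likelihood(rated):                          # Eq (11): sum over z_y of a product
    per_item = np.array([p_item_rating_given_zy(x, r) for x, r in rated])  # (items, Ky)
    return np.sum(P_zy_y * np.prod(per_item, axis=0))

def fmm_likelihood(rated):                          # Eq (12): product over items of a sum
    per_item = np.array([p_item_rating_given_zy(x, r) for x, r in rated])
    return np.prod(per_item @ P_zy_y)

rated = [(3, 4), (42, 5), (7, 2)]                   # (item index, rating) pairs for user y
print(jmm_likelihood(rated), fmm_likelihood(rated))
```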
3.4 Decoupled Model for Rating and Preference Patterns (DM)

To explicitly account for the fact that users with similar interests may have very different rating patterns, we extend the FMM model by introducing two hidden variables, $Z_P$ and $Z_R$, with $Z_P$ representing the preference patterns of users and $Z_R$ their rating patterns. The graphical representation of this model is displayed in Figure 4. Similar to the previous models, the hidden variable $Z_x$ represents the class of items. However, unlike the mixture models discussed in the previous subsection, where users are modeled by a single class variable $Z_y$, in this new model users are clustered from two different perspectives: the clustering of preference patterns, represented by the hidden variable $Z_P$, and the clustering of rating patterns or habits, represented by the hidden variable $Z_R$. Furthermore, a random variable $Z_{pref}$ is introduced to indicate whether or not the item class $Z_x$ is preferred by the user class $Z_P$, whose members presumably share similar preferences over items. As indicated by the graphical representation in Figure 4, this random variable connects the variables $Z_P$, $Z_x$, and the rating $R$. According to Section 2.2, this new model addresses all three issues: it models the clustering of users and items separately; it allows each user to be in multiple clusters; and it models the difference between preference patterns and rating patterns.

Figure 4: Graphical model representation for the decoupled model (DM).

Similar to the treatment in the FMM model, the probability $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ can be estimated by expanding it into a product of likelihoods $P(x_i, R_y(x_i) \mid y)$ over the items, as illustrated in Equation (12). Combining the class variables $Z_P$, $Z_R$, and $Z_{pref}$, we can express the likelihood $P(x_i, R_y(x_i) \mid y)$ as:

$$P(x_i, R_y(x_i) \mid y) = \sum_{z_P, z_R, z_x} \sum_{z_{pref} \in \{0,1\}} P(z_P \mid y) P(z_R \mid y) P(z_x) P(x_i \mid z_x) P(z_{pref} \mid z_P, z_x) P(R_y(x_i) \mid z_R, z_{pref}) \qquad (14)$$

In total, there are six sets of parameters in this model:

1) $P(z_P \mid y)$: the likelihood of assigning user $y$ to the preference class $z_P$;
2) $P(z_R \mid y)$: the likelihood of assigning user $y$ to the rating class $z_R$;
3) $P(z_x)$: the class prior for item class $z_x$;
4) $P(x_i \mid z_x)$: the likelihood of generating item $x_i$ from the item class $z_x$;
5) $P(z_{pref} \mid z_P, z_x)$: the probability that preference class $z_P$ favors item class $z_x$;
6) $P(R_y(x_i) \mid z_R, z_{pref})$: the likelihood for rating class $z_R$ to give rating $R_y(x_i)$ given the preference condition $z_{pref}$.

The total number of parameters is $|Z_x|(1 + M + |Z_P| \cdot |Z_{pref}|) + N(|Z_R| + |Z_P|) + R \cdot |Z_R| \cdot |Z_{pref}|$, where $|Z_P|$, $|Z_R|$, $|Z_{pref}|$, and $|Z_x|$ stand for the numbers of classes for preferences, rating patterns, preference levels, and items, respectively. For easy reference, we call this model the Decoupled Model, or DM. Comparing Figure 4 to Figure 3, the two hidden variables $Z_R$ and $Z_{pref}$ are introduced to account for the difference between the underlying preferences of users and their surface rating patterns. In particular, the rating variable $R$ is determined jointly by the hidden variables $Z_R$ and $Z_{pref}$. In other words, the rating of an item is influenced not only by whether the user likes the item (i.e., $Z_{pref}$) but also by the user's specific rating pattern (i.e., $Z_R$). Therefore, even if a user appears to like a certain type of item, the rating value can still be low if the user has a very "tough" rating criterion. We can further enrich this model by allowing the hidden variable $Z_{pref}$ to take multiple values instead of being binary. For example, we can have three different preference levels, with zero for no preference, one for slight preference, and two for strong preference.
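As a rough illustration of Equation (14), the sketch below evaluates the decoupled-model likelihood of a single (item, rating) observation for one user by summing over the four hidden variables. The parameter tables are randomly filled stand-ins and the variable names are ours, not the paper's.

```python
import numpy as np

# Minimal sketch of the decoupled model likelihood in Equation (14), for one user y.
Kp, Kr, Kx, M, R, L = 10, 3, 20, 1682, 5, 2     # L = number of preference levels (binary here)
rng = np.random.default_rng(2)
def normalize(a, axis): return a / a.sum(axis=axis, keepdims=True)

P_zp_y = normalize(rng.random(Kp), 0)           # P(z_P | y)
P_zr_y = normalize(rng.random(Kr), 0)           # P(z_R | y)
P_zx   = normalize(rng.random(Kx), 0)           # P(z_x)
P_x_zx = normalize(rng.random((Kx, M)), 1)      # P(x | z_x)
P_pref = normalize(rng.random((Kp, Kx, L)), 2)  # P(z_pref | z_P, z_x)
P_r    = normalize(rng.random((Kr, L, R)), 2)   # P(r | z_R, z_pref)

def dm_item_likelihood(x, r):
    """P(x_i, R_y(x_i) | y): sum over z_P, z_R, z_x, z_pref of the factored terms."""
    return np.einsum('p,r,k,pkl,rl->', P_zp_y, P_zr_y, P_zx * P_x_zx[:, x],
                     P_pref, P_r[:, :, r - 1])

print(dm_item_likelihood(x=42, r=5))
```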
In our experiments, we set the number of preference levels equal to the number of rating categories. In this case, the flexible mixture model stated in Equation (13) can be viewed as a special case of Equation (14) in which the conditional probability $P(R_y(x_i) \mid z_R, z_{pref})$ is set to a delta function $\delta(R_y(x_i), z_{pref})$, which is one only when the rating $R_y(x_i)$ is equal to the preference level $z_{pref}$ and zero otherwise.

3.5 Summary and Comparison

In the above four subsections, we discussed five different mixture models of increasing complexity. The Bayesian Clustering approach models the rating information directly and makes no attempt to cluster either users or items. The aspect model is able to model the clustering of users and items explicitly; however, with only a single hidden variable, the clustering of users and the clustering of items are not separated. Another flexibility of the aspect model over Bayesian Clustering is that the aspect model allows each user to be in multiple clusters, while Bayesian Clustering does not. As an improvement over Bayesian Clustering and the aspect model, the joint mixture model (JMM) and the flexible mixture model (FMM) are introduced with an emphasis on separating the clustering of users from the clustering of items; they differ in whether a single user is allowed to be in multiple clusters. The decoupled model has two additional hidden variables, for preference patterns and rating patterns, in order to address the issue that users with similar tastes may have different rating patterns. Table 1 lists the properties of each model with respect to the three issues discussed in Section 2.2.

According to Table 1, both the JMM/FMM mixture models and the decoupled model satisfy more of the properties than the Bayesian Clustering approach and the aspect model. As a result, they can give a better description of the training data and achieve better prediction performance. On the other hand, as a tradeoff, satisfying more properties requires increasing the model complexity, which can degrade the accuracy of prediction when only a small amount of training data is available. As will be observed in the later experiments, with a large number of training users, models satisfying more properties usually perform better than models satisfying fewer; however, when the number of training users is small, a simpler model may perform even better.

                                Property 1   Property 2   Property 3
Bayesian Clustering (BC)            -            -            -
Aspect Model (AM)                   -            x            -
Joint Mixture Model (JMM)           x            -            -
Flexible Mixture Model (FMM)        x            x            -
Decoupled Model (DM)                x            x            x

Table 1: Properties of the five mixture models for collaborative filtering. Property 1 corresponds to separate clustering of users and items; Property 2 corresponds to the freedom for a single user to be in multiple clusters; Property 3 corresponds to capturing the difference between preference patterns and rating patterns.

4 Model Estimation

4.1 General approach -- EM algorithms

In general, all of these mixture models can be estimated using the EM algorithm. Here we give some detail on the derivation of the EM algorithm for the Joint Mixture Model (JMM), which is slightly more complicated than the other models.
To train the parameters of this model, the EM algorithm can be employed to maximize the log-likelihood of the training data, i.e.,

$$L = \sum_y \log P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) \qquad (15)$$

The expression for $P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y)$ is obtained by putting Equations (11) and (13) together, i.e.,

$$P(\{x_i, R_y(x_i)\}_{i=1}^{M} \mid y) = \sum_{z_y} P(z_y \mid y) \prod_{i=1}^{M} \sum_{z_x} P(z_x) P(x_i \mid z_x) P(R_y(x_i) \mid z_x, z_y) \qquad (16)$$

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is usually used to optimize the above objective function. The EM algorithm alternates between expectation steps and maximization steps. In the expectation step, the posterior probabilities of the latent variables, $P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y)$ and $P(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)$, are computed as follows:

$$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \frac{P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}{\sum_{z_y'} P(z_y' \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y')} \qquad (17)$$

$$P(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y) = \frac{P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y)}{\sum_{z_x^i} P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y)} \qquad (18)$$

In the maximization step, the model parameters are updated according to the estimated posteriors as follows:

$$P(z_y \mid y) = P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) \qquad (19)$$

$$P(z_x) = \frac{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}{\sum_{z_x'} \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x' \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)} \qquad (20)$$

$$P(x \mid z_x) = \frac{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}{\sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)} \qquad (21)$$

$$P(r \mid z_x, z_y) = \frac{\sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(R_y(x_i), r)}{\sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)} \qquad (22)$$

where $\delta(\cdot, \cdot)$ equals one when its two arguments are equal and zero otherwise. Since the EM solution to the objective in Equation (15) is not obvious, the derivation of the EM updating equations (17)-(22) is presented in Appendix A.

4.2 Smoothing in the EM Algorithms

The EM algorithm is notorious for its tendency to find undesirable local optima. In this section, we discuss two techniques that have the potential to avoid unfavorable solutions. Again, we use the JMM as the example; the other models are treated in the same way.

The first technique for avoiding unfavorable local optima is the annealed EM algorithm (AEM) (Hofmann & Puzicha, 1998), which is an EM algorithm with regularization. In order to prevent the posteriors from becoming skewed at the early stage of the EM iterations, we introduce a variable $b$. In the annealed EM algorithm, the posteriors for the JMM model are rewritten as:

$$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \frac{\left[ P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y) \right]^b}{\sum_{z_y'} \left[ P(z_y' \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y') \right]^b} \qquad (24)$$

$$P(z_x^i \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y) = \frac{\left[ P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y) \right]^b}{\sum_{z_x^i} \left[ P(z_x^i) P(x_i \mid z_x^i) P(R_y(x_i) \mid z_x^i, z_y) \right]^b} \qquad (25)$$

As an analogy to a true annealing process, the variable $b$ corresponds to the inverse of the so-called "temperature". At first, the temperature is set to be infinitely high, i.e., $b$ is set to zero. In this case, the training data are ignored and all the posteriors are simply uniform distributions. Then, by decreasing the "temperature", or increasing the value of $b$, we let the training data play an increasingly important role in the estimation of the posteriors, so that the estimated posteriors move further and further away from the uniform distribution. When $b$ reaches 1, we recover the posterior expressions of the normal EM algorithm, as stated in Equations (17) and (18). Thus, by slowly increasing the value of $b$, we prevent the estimated posteriors from becoming skewed at the early stage of the EM iterations and may have a better chance of finding a useful local optimum.
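The annealed posteriors of Equations (24) and (25) differ from the standard E-step only in that the unnormalized posterior is raised to the power $b$ before normalization. The following small sketch illustrates the effect of $b$ on the posterior over user classes; working in log space is an implementation choice of ours, and the prior and likelihood values are made up.

```python
import numpy as np

# Minimal sketch of the annealed E-step of Equation (24).
def annealed_posterior(prior, per_item_likelihoods, b):
    """prior: P(z_y | y), shape (Ky,); per_item_likelihoods: P(x_i, R_y(x_i) | z_y), shape (items, Ky)."""
    log_unnorm = np.log(prior) + np.log(per_item_likelihoods).sum(axis=0)
    log_unnorm *= b                          # b = 0: uniform posterior; b = 1: standard EM posterior
    log_unnorm -= log_unnorm.max()           # stabilize before exponentiating
    post = np.exp(log_unnorm)
    return post / post.sum()

prior = np.array([0.5, 0.3, 0.2])
lik = np.array([[0.01, 0.20, 0.05],          # item 1
                [0.10, 0.02, 0.08]])         # item 2
for b in (0.0, 0.5, 1.0):
    print(b, annealed_posterior(prior, lik, b))
```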
The second smoothing strategy is to introduce an appropriate model prior and maximize the posterior (MAP) instead of the likelihood (MLE). Let $L(D \mid M)$ stand for the log-likelihood of the training data (Equation (15)) and $P(M)$ stand for the prior of model $M$. Instead of maximizing the likelihood of the training data $L(D \mid M)$ as in the previous subsection, we can maximize the posterior of the model $M$, whose logarithm can be written as $L(D \mid M) + \log P(M)$. For computational convenience, we use the following Dirichlet prior (with a uniform mean) for model $M$:

$$P(M \mid a, b, c, d) \propto \prod_{z_x} P(z_x)^a \prod_{i, z_x} P(x_i \mid z_x)^b \prod_{y, z_y} P(z_y \mid y)^c \prod_{z_x, z_y, r} P(r \mid z_x, z_y)^d \qquad (26)$$

It is not difficult to show that the EM updating equations for the JMM model under MAP estimation become:

$$P(z_y \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y) = \frac{c + P(z_y \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y)}{\sum_{z_y'} \left[ c + P(z_y' \mid y) \prod_{i=1}^{M} P(x_i, R_y(x_i) \mid z_y') \right]} \qquad (19')$$

$$P(z_x) = \frac{a + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)}{\sum_{z_x'} \left[ a + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x' \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y) \right]} \qquad (20')$$

$$P(x \mid z_x) = \frac{b + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(x, x_i)}{M b + \sum_{z_y} \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)} \qquad (21')$$

$$P(r \mid z_x, z_y) = \frac{d + \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)\, \delta(R_y(x_i), r)}{R d + \sum_y \sum_i P(z_x^i = z_x \mid \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y)\, P(z_y \mid y)} \qquad (22')$$

The details can be found in Appendix B. Comparing the above equations to the original EM updating equations, the only difference is that, by maximizing the posterior (MAP), we introduce Laplace smoothing into the updating equations for the model parameters, which essentially gives a nonzero initialization to each parameter and therefore prevents the parameter estimates from becoming skewed. As in the discussion above for the mixture models, both the annealed EM algorithm and the MAP approach will be applied to avoid unfavorable local optima.
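In code, the MAP updates of Equations (19')-(22') amount to adding a pseudo-count to the expected counts before normalizing, as in the hedged sketch below; the expected-count array and the prior strength are illustrative stand-ins.

```python
import numpy as np

# Minimal sketch of the MAP-style M-step of Equations (19')-(22'): expected counts are
# smoothed with a Dirichlet pseudo-count before normalization (Laplace smoothing).
# expected_counts stands in for sum_y sum_i P(z_x | ...) P(z_y | y) delta(R_y(x_i), r),
# indexed by (z_x, z_y, r).
def map_update(expected_counts, pseudo_count, axis):
    """Normalize expected counts along `axis`, adding `pseudo_count` to every cell."""
    smoothed = expected_counts + pseudo_count
    return smoothed / smoothed.sum(axis=axis, keepdims=True)

expected_counts = np.random.default_rng(3).random((20, 10, 5))   # (|Z_x|, |Z_y|, R)
d = 0.05                                                         # prior strength, cf. Equation (26)
P_r_given_zx_zy = map_update(expected_counts, d, axis=2)         # Equation (22')
print(P_r_given_zx_zy.sum(axis=2)[:2, :2])                       # each slice now sums to one
```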
5 Rating Prediction

To predict the ratings of items by a test user $y^t$, in general we need to compute the probability distribution over all the related latent class variables and sum over all their values. In addition to the ratings by the training users, ratings of a small number of items by the test user are given; these are used to discover the distribution of the related latent class variables for the test user. Let $D_{train}$ and $D_{test}$ stand for the rating data of the training users and the test user, respectively. Let $\{h_i\}_{i=1}^{m}$ be the hidden variables and $M_{test} = \{P(h_i \mid y^t)\}_{i=1}^{m}$ be the parameter space related to the test user; $M_{train}$ represents the parameter space related to the training users. In order to predict the rating of an item $x$ by the test user $y^t$, we need to compute the likelihood $P(R_{y^t}(x) \mid D_{train}, D_{test})$, which can be expanded over the model spaces $M_{train}$ and $M_{test}$:

$$P(R_{y^t}(x) \mid D_{train}, D_{test}) = \sum_{M_{test}} \sum_{M_{train}} P(R_{y^t}(x) \mid M_{test}, M_{train})\, P(M_{train} \mid D_{train})\, P(M_{test} \mid M_{train}, D_{test})$$
$$\approx P(R_{y^t}(x) \mid M^*_{test}, M^*_{train})\, P(M^*_{train} \mid D_{train})\, P(M^*_{test} \mid M^*_{train}, D_{test}) \qquad (27)$$

where $M^*_{train}$ and $M^*_{test}$ stand for the optimal models that maximize the likelihoods $P(M_{train} \mid D_{train})$ and $P(M_{test} \mid M^*_{train}, D_{test})$, respectively. In the above expression, we approximate the average over the model space by a simple computation on the optimal model. The optimal model $M^*_{test}$ is derived by applying the EM algorithm to maximize the likelihood of the rating data of the test user, $D_{test}$. As an example, for the JMM model, the parameter space related to the test user is $M_{test} = \{P(z_y \mid y^t)\}$, and the optimal $P(z_y \mid y^t)$ is computed by simply maximizing the likelihood of the rating data of the test user.

6 Experiments

In this section, we present experimental results that address the following five questions:

1) Is modeling users and items separately important for collaborative filtering? In Section 3, we proposed two mixture models for collaborative filtering. Unlike previous probabilistic models for collaborative filtering, the proposed mixture models introduce two different class variables for modeling users and items separately, and they are able to cluster users and items simultaneously. In this experiment, we compare them to both the Aspect Model (AM) and the Bayesian Clustering algorithm (BC) to see whether modeling users and items separately is effective for collaborative filtering.

2) Is it beneficial to allow a user/item to belong to multiple clusters? The difference between the two proposed mixture models is that the joint mixture model (JMM) assumes that each user belongs to a single user class, while the flexible mixture model (FMM) allows each user to belong to multiple user classes. By comparing these two models, we can see which assumption is more appropriate for collaborative filtering.

3) Which smoothing technique is more effective? At the end of Section 4, we discussed two different methods for smoothing the EM algorithm: the annealed EM algorithm (AEM) and a MAP approach. Both approaches try to prevent the parameter estimates from becoming skewed at the early stage of the EM iterations. In this experiment, we compare the effectiveness of the two smoothing methods for collaborative filtering.

4) Does modeling the distinction between preferences and ratings help improve performance? To see the effectiveness of the DM model, we compare it to the flexible mixture model (FMM), which is essentially the same model except that the DM model uses two sets of class variables to describe the preference patterns and rating patterns of each user, whereas the FMM model uses only one set of class variables.

5) How effective are the proposed models compared to other existing models? In this experiment, we compare the mixture models with other graphical models and with memory-based approaches. In previous studies, model-based approaches tended to have mixed results when compared with memory-based approaches (Breese et al., 1998). It is thus interesting to see whether our models, which decouple the preference patterns from the rating patterns, can outperform memory-based approaches.

                              MovieRating   EachMovie
Number of Users                   500          2000
Number of Items                  1000          1682
Avg. # of rated Items/User        87.7         129.6
Number of Rating Categories         5             6

Table 2: Characteristics of the MovieRating and EachMovie datasets.
Two datasets of movie ratings are used in our experiments: 'MovieRating'1 and 'EachMovie'2. Specifically, we extracted a subset of 2,000 users with more than 40 ratings from 'EachMovie', since evaluation based on users with few ratings can be unreliable. The global statistics of the two datasets as used in our experiments are summarized in Table 2.

1 http://www.cs.usyd.edu.au/~irena/movie_data.zip
2 http://research.compaq.com/SRC/eachmovie

A major challenge in collaborative filtering applications is for the system to operate effectively when it has not yet acquired a large amount of training data (the so-called "cold start" problem). To test our algorithms in such a challenging and realistic scenario, we vary the number of training users from a small value to a large value. To get a better sense of the sparseness of the training data, we introduce a measure called 'movie coverage', which is computed by multiplying the number of training users by the average number of movies rated per user and dividing by the total number of movies in the dataset. In other words, the 'movie coverage' measures the average number of times each movie has been rated in the training data. In particular, we consider three different amounts of training data:

1) Small Training. We use only the rating information of the first 20 users as training data for both datasets and the remaining users as test users. The 'movie coverage' in this case is only 1.8 for the 'MovieRating' dataset and 1.5 for the 'EachMovie' dataset, i.e., on average each movie is rated by fewer than two training users.

2) Medium Training. We use the rating information of the first 100 users as training users for the 'MovieRating' dataset and of the first 200 users for the 'EachMovie' dataset. The 'movie coverage' in this case is 8.8 and 15.4, respectively, which is substantially larger than in the small training case.

3) Large Training. We use the rating information of the first 200 users as training users for the 'MovieRating' dataset and of the first 400 users for the 'EachMovie' dataset. The 'movie coverage' in this case is 17.7 and 30.8, respectively.

Going from 'small training' to 'medium training' to 'large training', we increase the 'movie coverage', i.e., the average number of times each movie has been rated in the training data, from less than 2 to around 20 or 30. With this large variation in training data, we can assess the robustness of the learning procedure. The other dimension examined in this experiment is the robustness of the models with respect to the number of given items rated by the test user. We examine our models against test users with 5, 10, and 20 given items. By varying the number of given items, we can test the robustness of the prediction procedure.

For the mixture models, namely the joint mixture model (JMM) and the flexible mixture model (FMM), the numbers of classes for users and items, i.e., $|Z_y|$ and $|Z_x|$, are set to 10 and 20, respectively. For the decoupled model (DM), the numbers of classes for items and users are the same as for the mixture models, and the number of classes for rating patterns, i.e., $|Z_R|$, is set to 3. For the previously studied models, we use similar numbers: the number of clusters in the Bayesian Clustering algorithm (BC) is set to 10 and the number of classes in the Aspect Model (AM) is set to 20. We tried a few other values and found that they all gave similar performance.
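As a quick check of the 'movie coverage' measure defined above, the snippet below recomputes it for the three training sizes from the dataset statistics in Table 2; the results agree with the coverage values quoted above up to small rounding differences.

```python
# 'Movie coverage' = (number of training users * average number of rated items per user)
#                    / total number of items.
def movie_coverage(n_train_users, avg_rated_per_user, n_items):
    return n_train_users * avg_rated_per_user / n_items

for users in (20, 100, 200):                       # MovieRating training sizes
    print('MovieRating', users, round(movie_coverage(users, 87.7, 1000), 1))
for users in (20, 200, 400):                       # EachMovie training sizes
    print('EachMovie', users, round(movie_coverage(users, 129.6, 1682), 1))
```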
For evaluation, we look at the mean absolute deviation of the predicted ratings from the actual ratings on items that the users in the test set have actually rated, i.e.,

$$S_{y_0} = \frac{1}{m_{y_0}} \sum_{x \in \tilde{X}(y_0)} \left| R_{y_0}(x) - \hat{R}_{y_0}(x) \right| \qquad (28)$$

where $\hat{R}_{y_0}(x)$ is the predicted rating of item $x$ by user $y_0$, $R_{y_0}(x)$ is the actual rating of item $x$ by user $y_0$, and $m_{y_0}$ is the number of test items that have been rated by the test user $y_0$. We refer to this measure as the mean absolute error (MAE) in the rest of this paper. There are other measures, such as the Receiver Operating Characteristic (ROC) as a decision-support accuracy measure (Breese et al., 1998) and the normalized MAE. But since MAE has been the most commonly used metric and has been reported in most previous research (Breese et al., 1998; Herlocker et al., 1999; Melville et al., 2002; SWAMI, 2000; Pennock et al., 2000), we chose it as the evaluation measure in our experiments to make our results more comparable.
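A minimal sketch of the MAE of Equation (28) for a single test user; the rating values below are made up.

```python
import numpy as np

# Mean absolute error of Equation (28) for one test user.
def mae(actual, predicted):
    """Mean absolute deviation over the items the test user actually rated."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(actual - predicted).mean()

print(mae(actual=[4, 2, 5, 3], predicted=[3.6, 2.4, 4.1, 3.0]))   # 0.425
```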
6.1 Experiments with Mixture Models

In this experiment, we address the first two questions listed at the beginning of this section, namely whether modeling users and items separately is important for collaborative filtering and whether it is beneficial to allow a user/item to belong to multiple clusters. The results for all three types of training data and three different numbers of given items are listed in Tables 3 and 4. Several interesting observations can be drawn from Tables 3 and 4:

1) The flexible mixture model (FMM) performs substantially better than the joint mixture model (JMM) in most configurations, except on the 'MovieRating' dataset when the number of training users is only 20. In the next experiment, where smoothing methods are applied to regularize the EM algorithm, we will see that the FMM model outperforms the JMM model substantially even in this single case. The only difference between the two models is whether a user is treated as belonging to multiple user classes or to a single one. The fact that the FMM model outperforms the JMM model indicates that a user should be allowed to belong to multiple user classes and that each item should be allowed to choose its own appropriate user class for its rating. This hypothesis is further confirmed by the fact that the aspect model performs better than the Bayesian Clustering approach in most cases (except on the EachMovie dataset when the number of training users is 400).

2) The FMM performs substantially better than the Bayesian Clustering algorithm (BC) and the Aspect Model (AM) in most cases, except in the case of small training. In the next experiment, we will see that with an appropriate smoothing technique, the FMM model outperforms both approaches even in the case of small training. One important difference between the proposed model and these two models is that the FMM models users and items separately with two different sets of hidden variables, whereas the Bayesian Clustering algorithm and the Aspect Model (AM) introduce only a single set of class variables for describing the rating information of users on different items. Therefore, the fact that the FMM model performs better than the two previously studied models indicates that modeling users and items separately is effective for collaborative filtering.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          FMM             1.000            0.994            0.990
                 JMM             0.990            0.968            0.920
                 BC              1.10             1.09             1.08
                 AM              0.982            0.976            0.958
    100          FMM             0.823            0.822            0.817
                 JMM             0.868            0.868            0.854
                 BC              0.968            0.946            0.941
                 AM              0.882            0.856            0.836
    200          FMM             0.804            0.801            0.799
                 JMM             0.840            0.837            0.831
                 BC              0.949            0.942            0.912
                 AM              0.891            0.850            0.818

Table 3: MAE of the proposed mixture models compared to the Bayesian Clustering algorithm (BC) and the Aspect Model (AM) on the 'MovieRating' dataset. A smaller value means better performance.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          FMM             1.31             1.31             1.30
                 JMM             1.38             1.37             1.36
                 BC              1.46             1.45             1.44
                 AM              1.28             1.24             1.23
    200          FMM             1.08             1.06             1.05
                 JMM             1.17             1.15             1.15
                 BC              1.25             1.22             1.17
                 AM              1.27             1.18             1.14
    400          FMM             1.06             1.05             1.04
                 JMM             1.10             1.09             1.09
                 BC              1.17             1.15             1.14
                 AM              1.28             1.19             1.16

Table 4: MAE of the proposed mixture models compared to the Bayesian Clustering algorithm (BC) and the Aspect Model (AM) on the 'EachMovie' dataset. A smaller value means better performance.

6.2 Experiments with Smoothing Methods

In Section 4, we discussed two different methods for smoothing the EM algorithm. The first is the annealed EM algorithm (AEM), which controls the convergence rate of the parameter estimation by slowly increasing the value of $b$ in Equations (24)-(25). In the experiments, we increase $b$ from 0 to 1 with a step size of 0.1. We run three EM iterations for every value of $b$ and ten iterations when $b$ reaches 1. The second smoothing strategy is to run an EM algorithm that maximizes the posterior (MAP) instead of the likelihood of the training data (MLE). As indicated in Equations (19')-(22'), this method amounts to Laplace smoothing in the estimation of the parameters. The parameters $a$, $b$, $c$, and $d$ are set as follows:

$$a = \frac{\sum_y |X(y)|}{10000\,|Z_x|}, \qquad b = \frac{\sum_y |X(y)|}{10000\,M\,|Z_x|}, \qquad c = \frac{\sum_y |X(y)|}{10000\,N\,|Z_y|}, \qquad d = \frac{\sum_y |X(y)|}{10000\,R\,|Z_x|\,|Z_y|}$$

where $|X(y)|$ is the number of items rated by user $y$. Finally, recall that the previous experiment showed that the FMM model is substantially better than the JMM model. The results for the FMM model with the two different smoothing methods on the 'MovieRating' and 'EachMovie' datasets are presented in Tables 5 and 6, and the corresponding results for the JMM model are presented in Tables 7 and 8.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          AEM             1.000            0.994            0.990
                 MAP             0.881            0.877            0.870
    100          AEM             0.823            0.822            0.817
                 MAP             0.821            0.820            0.813
    200          AEM             0.804            0.801            0.799
                 MAP             0.797            0.786            0.781

Table 5: MAE for the flexible mixture model (FMM) using different smoothing methods on the 'MovieRating' dataset. 'AEM' stands for the annealed EM algorithm and 'MAP' for the EM algorithm maximizing the posterior. A smaller value means better performance.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          AEM             1.31             1.31             1.30
                 MAP             1.23             1.22             1.22
    200          AEM             1.08             1.06             1.05
                 MAP             1.08             1.05             1.04
    400          AEM             1.06             1.05             1.04
                 MAP             1.06             1.04             1.03

Table 6: MAE for the flexible mixture model (FMM) using different smoothing methods on the 'EachMovie' dataset. 'AEM' stands for the annealed EM algorithm and 'MAP' for the EM algorithm maximizing the posterior. A smaller value means better performance.
Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          AEM             0.990            0.968            0.920
                 MAP             0.986            0.963            0.920
    100          AEM             0.868            0.868            0.854
                 MAP             0.864            0.863            0.854
    200          AEM             0.840            0.837            0.831
                 MAP             0.837            0.833            0.831

Table 7: MAE for the joint mixture model (JMM) using different smoothing methods on the 'MovieRating' dataset. 'AEM' stands for the annealed EM algorithm and 'MAP' for the EM algorithm maximizing the posterior. A smaller value means better performance.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          AEM             1.38             1.37             1.36
                 MAP             1.37             1.35             1.34
    200          AEM             1.17             1.15             1.15
                 MAP             1.17             1.15             1.14
    400          AEM             1.10             1.10             1.09
                 MAP             1.10             1.09             1.09

Table 8: MAE for the joint mixture model (JMM) using different smoothing methods on the 'EachMovie' dataset. 'AEM' stands for the annealed EM algorithm and 'MAP' for the EM algorithm maximizing the posterior. A smaller value means better performance.

Three observations can be drawn from Tables 5-8:

1) The MAP approach (i.e., maximizing the posterior) outperforms the annealed EM algorithm for both the JMM model and the FMM model in all cases. In fact, if we compare the AEM results in Tables 5-8 to the results without smoothing in Tables 3 and 4, we see that the AEM algorithm only achieves the same performance as the original EM algorithm in all cases. These two facts indicate that the MAP approach is an effective method for collaborative filtering, whereas the AEM algorithm has no noticeable impact on the performance of the mixture models.

2) A more careful examination of Tables 5 and 6 shows that the MAP smoothing method improves the performance of the FMM substantially when the number of training users is small (i.e., 20 for both 'MovieRating' and 'EachMovie'), but the improvement becomes very modest when the amount of training data becomes large (i.e., 100 and 200 for 'MovieRating', and 200 and 400 for 'EachMovie'). This is consistent with the spirit of Bayesian statistics, namely that the model prior is important and useful only when the amount of training data is small; when the amount of training data is sufficient, the effect of the model prior becomes negligible.

3) In the previous experiment, the aspect model was the winner in the case of small training. With the help of appropriate smoothing, the FMM model performs better than the aspect model even in the case of small training. This again indicates that the smoothing method can effectively alleviate the difficulty caused by sparse data.

Because of the success of the MAP smoothing method, it is used in the remaining experiments.

6.3 Experiments with DM

Compared to the other four models, the decoupled model introduced in Section 3 addresses the distinction between preferences and ratings by modeling the preference patterns and rating patterns of users separately. In this experiment, we answer the question of whether modeling the distinction between preferences and ratings helps improve performance. The results for the DM model on the 'MovieRating' and 'EachMovie' datasets are listed in Tables 9 and 10, together with the results for the FMM model (copied from Tables 5 and 6), because the FMM model is a close relative of the DM model and differs from it only in the lack of modeling of rating patterns.
By comparing the performance of the DM model to that of the FMM model, we can see whether introducing separate hidden variables for preferences and ratings is effective for collaborative filtering. Comparing the DM model to its baseline peer, the FMM model, we see that the DM model outperforms the FMM model in all cases. These two models have exactly the same setup except that the DM model introduces the extra hidden nodes $Z_R$ and $Z_{pref}$ to account for the variance in rating behavior among users with similar interests. Although the difference appears to be insignificant in some cases, it is interesting to note that as the number of given rated items increases, the gap between the DM model and the baseline model also increases. This may suggest that when there are only a small number of items with given ratings, it is rather difficult to determine the rating pattern of the test user; as the number of given items increases, this ambiguity decreases quickly, and the advantage of the DM model over the FMM model becomes clearer. Indeed, it is a bit surprising that even with only five rated items and only a couple of hundred training users, the DM model still slightly improves the performance, since DM has many more parameters to learn than the baseline model. We suspect that the skewed distribution of ratings among items, i.e., a few items accounting for a large number of ratings, may have helped.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          DM              0.874            0.871            0.860
                 FMM             0.881            0.877            0.870
    100          DM              0.814            0.810            0.799
                 FMM             0.821            0.820            0.813
    200          DM              0.790            0.777            0.761
                 FMM             0.797            0.786            0.781

Table 9: MAE for the flexible mixture model (FMM) and the decoupled model (DM) on the 'MovieRating' dataset. A smaller value means better performance.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
     20          DM              1.20             1.18             1.17
                 FMM             1.23             1.22             1.22
    200          DM              1.07             1.04             1.03
                 FMM             1.08             1.05             1.04
    400          DM              1.05             1.03             1.02
                 FMM             1.06             1.04             1.03

Table 10: MAE for the flexible mixture model (FMM) and the decoupled model (DM) on the 'EachMovie' dataset. A smaller value means better performance.

6.4 Comparison to Other Approaches

In this subsection, we compare all five mixture models to memory-based approaches for collaborative filtering, including the Personality Diagnosis method (PD), the Vector Similarity method (VS), and the Pearson Correlation Coefficient method (PCC). In the following, we first briefly introduce the three memory-based approaches and then present the empirical results.

6.4.1 Memory-based Methods for Collaborative Filtering

Memory-based algorithms store the rating examples of training users and predict a test user's ratings based on the corresponding ratings of the users in the training database that are similar to the test user. Three commonly used methods are compared in this experiment. They are:
Pearson Correlation Coefficient (PCC). According to Resnick et al. (1994), the Pearson Correlation Coefficient method predicts the rating of a test user y_0 on item x as

\[
\hat{R}_{y_0}(x) \;=\; \bar{R}_{y_0} \;+\; \frac{\sum_{y \in Y} w_{y_0,y}\,\bigl(R_y(x) - \bar{R}_y\bigr)}{\sum_{y \in Y} \bigl|w_{y_0,y}\bigr|},
\]

where the coefficient w_{y_0,y} is computed as

\[
w_{y_0,y} \;=\; \frac{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} \bigl(R_y(x) - \bar{R}_y\bigr)\bigl(R_{y_0}(x) - \bar{R}_{y_0}\bigr)}
{\sqrt{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} \bigl(R_y(x) - \bar{R}_y\bigr)^2}\;\sqrt{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} \bigl(R_{y_0}(x) - \bar{R}_{y_0}\bigr)^2}}.
\]

Vector Similarity (VS). This method is very similar to the PCC method except that the coefficient w_{y_0,y} is computed as

\[
w_{y_0,y} \;=\; \frac{\sum_{x \in \tilde{X}(y) \cap \tilde{X}(y_0)} R_y(x)\,R_{y_0}(x)}
{\sqrt{\sum_{x \in \tilde{X}(y)} R_y(x)^2}\;\sqrt{\sum_{x \in \tilde{X}(y_0)} R_{y_0}(x)^2}}.
\]

Personality Diagnosis (PD). In the personality diagnosis model, the observed rating of the test user y^t on an item x is assumed to be drawn from an independent normal distribution whose mean is the true rating R^{True}_{y^t}(x):

\[
P\bigl(R_{y^t}(x) \,\big|\, R^{True}_{y^t}(x)\bigr) \;\propto\; e^{-\bigl(R_{y^t}(x) - R^{True}_{y^t}(x)\bigr)^2 / 2\sigma^2},
\]

where the standard deviation \sigma is set to the constant 1 in our experiments. The probability of generating the observed ratings of the test user from any user y in the training database can then be written as

\[
P\bigl(R_{y^t} \,\big|\, R_y\bigr) \;\propto\; \prod_{x \in \tilde{X}(y) \cap \tilde{X}(y^t)} e^{-\bigl(R_y(x) - R_{y^t}(x)\bigr)^2 / 2\sigma^2}.
\]

The likelihood for the test user y^t to rate an unseen item x with category r is computed as

\[
P\bigl(R_{y^t}(x) = r\bigr) \;\propto\; \sum_{y} P\bigl(R_{y^t} \,\big|\, R_y\bigr)\, e^{-\bigl(R_y(x) - r\bigr)^2 / 2\sigma^2}.
\]

The final predicted rating of item x by the test user is the rating category r with the highest likelihood P(R_{y^t}(x) = r). Empirical studies have shown that the PD method is able to outperform several other approaches for collaborative filtering (Pennock et al., 2000).
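Before turning to the empirical comparison, here is a small, self-contained sketch of the PCC prediction rule described above (an illustration under our own data layout, not the authors' code; the helper names and the dictionary representation of ratings are assumptions):

```python
import numpy as np

def pcc_weight(r_a, r_b):
    """Pearson correlation between two users over their commonly rated items.

    r_a, r_b map item id -> rating.  Returns 0 when the overlap is too small
    or the denominator vanishes.
    """
    common = sorted(set(r_a) & set(r_b))
    if len(common) < 2:
        return 0.0
    a = np.array([r_a[x] for x in common], dtype=float) - np.mean([r_a[x] for x in common])
    b = np.array([r_b[x] for x in common], dtype=float) - np.mean([r_b[x] for x in common])
    denom = np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def pcc_predict(test_ratings, train_ratings, item):
    """Predict the test user's rating of `item` from the weighted, mean-offset
    ratings of training users who rated that item."""
    test_mean = np.mean(list(test_ratings.values()))
    num, den = 0.0, 0.0
    for r_y in train_ratings:              # one rating dict per training user
        if item not in r_y:
            continue
        w = pcc_weight(test_ratings, r_y)
        num += w * (r_y[item] - np.mean(list(r_y.values())))
        den += abs(w)
    return test_mean + num / den if den > 0 else test_mean

# Tiny illustrative example (ratings are made up):
train = [{1: 5, 2: 3, 3: 4}, {1: 2, 2: 5, 3: 1}]
test = {1: 4, 2: 3}
print(pcc_predict(test, train, item=3))
```

The VS method would only change pcc_weight (raw instead of mean-centered ratings, normalized by each user's full rating vector), and PD would replace the correlation weights with the Gaussian likelihoods given above.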
6.4.2 Empirical Results

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
20               PCC         0.912           0.840            0.812
                 VS          0.912           0.840            0.812
                 PD          0.888           0.882            0.875
                 AM          0.982           0.976            0.958
                 BC          1.10            1.09             1.08
                 DM          0.874           0.871            0.860
                 FMM         0.881           0.877            0.870
                 JMM         0.986           0.963            0.920
100              PCC         0.881           0.832            0.809
                 VS          0.859           0.834            0.823
                 PD          0.839           0.826            0.818
                 AM          0.882           0.856            0.836
                 BC          0.968           0.946            0.941
                 DM          0.814           0.810            0.799
                 FMM         0.821           0.820            0.813
                 JMM         0.864           0.863            0.854
200              PCC         0.878           0.828            0.801
                 VS          0.862           0.950            0.854
                 PD          0.835           0.816            0.806
                 AM          0.891           0.850            0.818
                 BC          0.949           0.942            0.912
                 DM          0.790           0.777            0.761
                 FMM         0.797           0.786            0.781
                 JMM         0.837           0.833            0.831

Table 11: MAE for eight different models on the 'MovieRating' dataset: the Pearson Correlation Coefficient approach (PCC), the Vector Similarity approach (VS), the Personality Diagnosis approach (PD), the Aspect Model (AM), the Bayesian Clustering approach (BC), the decoupled model (DM), the flexible mixture model (FMM), and the joint mixture model (JMM). A smaller value means better performance.

The results are shown in Tables 11 and 12. The proposed DM and FMM models are substantially better than the other methods for collaborative filtering, including both memory-based and model-based approaches, except when the number of training users is only 20, in which case the memory-based approaches perform substantially better than all model-based approaches. The overall success of the DM and FMM models suggests that, compared with the memory-based approaches, graphic models are not only advantageous in principle but also empirically superior, owing to their ability to capture the distinction between preference patterns and rating patterns in a principled way.

The fact that the memory-based approaches outperform the model-based approaches in the case of small training data can be explained by the observation that the amount of training data is then actually smaller than the number of parameters required by the models. When there are only 20 training users, the number of observed ratings is much less than 10,000 (about 1,700 for the 'MovieRating' dataset and 2,500 for the 'EachMovie' dataset), while the number of parameters exceeds 20,000 for all the models (over 20,000 for the 'MovieRating' dataset and over 30,000 for the 'EachMovie' dataset). Therefore, with only 20 training users, the amount of training data is not sufficient to estimate a reliable and effective model for collaborative filtering. Based on this analysis, we can see that the performance of model-based approaches depends strongly on the availability of training data: when the amount of training data is small, it is better to use memory-based approaches, since no reliable model can be estimated from such a small amount of data.

Training Users   Algorithm   5 Items Given   10 Items Given   20 Items Given
20               PCC         1.26            1.19             1.18
                 VS          1.24            1.19             1.17
                 PD          1.25            1.24             1.23
                 AM          1.28            1.24             1.23
                 BC          1.46            1.45             1.44
                 DM          1.20            1.18             1.17
                 FMM         1.23            1.22             1.22
                 JMM         1.37            1.35             1.34
200              PCC         1.22            1.16             1.13
                 VS          1.25            1.24             1.26
                 PD          1.19            1.16             1.15
                 AM          1.27            1.18             1.14
                 BC          1.25            1.22             1.17
                 DM          1.07            1.04             1.03
                 FMM         1.08            1.05             1.04
                 JMM         1.17            1.15             1.14
400              PCC         1.22            1.16             1.13
                 VS          1.32            1.33             1.37
                 PD          1.18            1.16             1.15
                 AM          1.28            1.19             1.16
                 BC          1.17            1.15             1.14
                 DM          1.05            1.03             1.02
                 FMM         1.06            1.04             1.03
                 JMM         1.10            1.09             1.09

Table 12: MAE for eight different models on the 'EachMovie' dataset: the Pearson Correlation Coefficient approach (PCC), the Vector Similarity approach (VS), the Personality Diagnosis approach (PD), the Aspect Model (AM), the Bayesian Clustering approach (BC), the decoupled model (DM), the flexible mixture model (FMM), and the joint mixture model (JMM). A smaller value means better performance.

7 Conclusions and Future Work

In this paper, we focus on two important issues in collaborative filtering: 1) whether users and items should be modeled separately, and whether users should be allowed to belong to multiple user classes; and 2) how to address the fact that users with similar tastes can have very different rating behaviors.

Two mixture models are proposed to address the first issue: the flexible mixture model (FMM) assumes that each user can belong to multiple user classes, whereas the joint mixture model (JMM) restricts each user to a single user class. For the second issue, two preference-based models are studied: the decoupled model (DM) removes the variance in rating patterns by decoupling the rating patterns from the preference patterns, whereas the model for preferred ordering (MP) tries to achieve a similar effect by modeling the relative orderings of items instead of the absolute rating values. Empirical results with the mixture models show that the FMM is consistently better than the JMM by the MAE measure, which indicates that a user is better modeled as belonging to multiple user classes rather than a single one. Meanwhile, the success of the FMM over previously studied models, such as the Bayesian Clustering algorithm and the Aspect Model, implies that it is beneficial to model users and items separately. Empirical results with the preference-based models show that the DM model performs consistently better than the MP model, which is somewhat expected given the way the MP model is designed.
Furthermore, the experiments confirm that the decoupling of rating patterns and preference patterns is important for collaborative filtering, and that modeling this decoupling in a graphic model leads to improved performance. Comparison with other methods for collaborative filtering indicates that the proposed models are superior, suggesting clear advantages of graphic models for collaborative filtering.

The idea of modeling preferences has also been explored in related work (Ha & Haddawy 1998; Freund et al. 1998; Cohen et al. 1999). We plan to explore this direction further by considering all these different approaches and by using a more appropriate evaluation criterion, such as one based on inconsistent orderings. We also believe that the decoupling problem addressed here may represent a more general need for modeling "noise" in similar problems, such as gene microarray data analysis in bioinformatics. We plan to explore a more general framework for these related problems.

Acknowledgements

This work was supported in part by the Advanced Research and Development Activity (ARDA) under contract number MDA908-00-C-0037 and by the National Science Foundation under Cooperative Agreement No. IRI-9817496.

Appendix A: Proof of the EM Updating Equations for the Joint Mixture Model

In Section 3.1, we present the EM updating equations for the joint mixture model (JMM). Here we prove those equations. The goal of the iterative procedure is to maximize the log-likelihood of the training data, i.e.,

\[
L(\theta) \;=\; \sum_{y} \log P\bigl(\{x_i, R_y(x_i)\}_{i=1}^{M} \,\big|\, y\bigr)
\;=\; \sum_{y} \log \sum_{z_y} P(z_y \,|\, y) \prod_{i=1}^{M} \sum_{z_x} P(z_x)\, P(x_i \,|\, z_x)\, P\bigl(R_y(x_i) \,\big|\, z_x, z_y\bigr).
\tag{a1}
\]

Let \theta denote the model of the current iteration, i.e., \theta = (\{P(z_x)\}, \{P(x \,|\, z_x)\}, \{P(z_y \,|\, y)\}, \{P(r \,|\, z_x, z_y)\}), and let \theta' denote the model obtained from the previous iteration, i.e., \theta' = (\{P'(z_x)\}, \{P'(x \,|\, z_x)\}, \{P'(z_y \,|\, y)\}, \{P'(r \,|\, z_x, z_y)\}). Following the spirit of the EM algorithm, the goal is to maximize the difference between the log-likelihoods of two consecutive iterations, which can be written as

\[
L(\theta) - L(\theta') \;=\; \sum_{y} \log
\frac{\sum_{z_y} P(z_y \,|\, y) \prod_{i=1}^{M} \sum_{z_x} P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{\sum_{z_y} P'(z_y \,|\, y) \prod_{i=1}^{M} \sum_{z_x} P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)}.
\tag{a2}
\]

By defining

\[
P\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)
\;=\;
\frac{P(z_y \,|\, y) \prod_{i=1}^{M} P\bigl(x_i, R_y(x_i) \,\big|\, z_y\bigr)}
{\sum_{z_y} P(z_y \,|\, y) \prod_{i=1}^{M} P\bigl(x_i, R_y(x_i) \,\big|\, z_y\bigr)},
\tag{a3}
\]

we can rewrite Equation (a2) as

\[
\begin{aligned}
L(\theta) - L(\theta')
&= \sum_{y} \log \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)\,
\frac{P(z_y \,|\, y) \prod_{i} \sum_{z_x} P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)\, \sum_{z_y} P'(z_y \,|\, y) \prod_{i} \sum_{z_x} P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)} \\
&\geq \sum_{y} \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)\, \log
\frac{P(z_y \,|\, y) \prod_{i} \sum_{z_x} P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{P'(z_y \,|\, y) \prod_{i} \sum_{z_x} P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)} \\
&= \sum_{y} \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)\, \log P(z_y \,|\, y) \\
&\quad + \sum_{y} \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr) \sum_{i} \log
\frac{\sum_{z_x} P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{\sum_{z_x} P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)} \\
&\quad - \sum_{y} \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr)\, \log P'(z_y \,|\, y).
\end{aligned}
\tag{a4}
\]

The second step in the above derivation uses Jensen's inequality. In the final expression, we have three terms; the first term contains only the parameter P(z_y | y).
Therefore, by setting the derivative of this expression with respect to P(z_y | y) to zero (subject to the normalization constraint), we obtain the updating equation for P(z_y | y) as listed in Equation (18). The last term does not contain any model parameters of the current iteration, so it can simply be ignored. The second term is the most complicated one. We can simplify it by applying Jensen's inequality once more:

\[
\begin{aligned}
&\sum_{y} \sum_{z_y} P'\bigl(z_y \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y\bigr) \sum_{i} \log
\frac{\sum_{z_x^i} P(z_x^i)\, P(x_i \,|\, z_x^i)\, P(R_y(x_i) \,|\, z_x^i, z_y)}
{\sum_{z_x^i} P'(z_x^i)\, P'(x_i \,|\, z_x^i)\, P'(R_y(x_i) \,|\, z_x^i, z_y)} \\
&\;\geq\; \sum_{y} \sum_{z_y} P'(z_y \,|\, \cdot) \sum_{i} \sum_{z_x^i}
P'(z_x^i \,|\, \cdot)\, \log
\frac{P(z_x^i)\, P(x_i \,|\, z_x^i)\, P(R_y(x_i) \,|\, z_x^i, z_y)}
{P'(z_x^i)\, P'(x_i \,|\, z_x^i)\, P'(R_y(x_i) \,|\, z_x^i, z_y)} \\
&\;=\; \sum_{y} \sum_{z_y} P'(z_y \,|\, \cdot) \sum_{i} \sum_{z_x^i} P'(z_x^i \,|\, \cdot)\, \log P(z_x^i)
\;+\; \sum_{y} \sum_{z_y} P'(z_y \,|\, \cdot) \sum_{i} \sum_{z_x^i} P'(z_x^i \,|\, \cdot)\, \log P(x_i \,|\, z_x^i) \\
&\qquad +\; \sum_{y} \sum_{z_y} P'(z_y \,|\, \cdot) \sum_{i} \sum_{z_x^i} P'(z_x^i \,|\, \cdot)\, \log P(R_y(x_i) \,|\, z_x^i, z_y)
\;-\; \sum_{y} \sum_{z_y} P'(z_y \,|\, \cdot) \sum_{i} \sum_{z_x^i} P'(z_x^i \,|\, \cdot)\, \log \bigl[P'(z_x^i)\, P'(x_i \,|\, z_x^i)\, P'(R_y(x_i) \,|\, z_x^i, z_y)\bigr],
\end{aligned}
\tag{a5}
\]

where P'(z_y | \cdot) and P'(z_x^i | \cdot) abbreviate P'(z_y | \{x_i, R_y(x_i)\}_{i=1}^{M}, y) and P'(z_x^i | \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y), respectively, and P(z_x^i | \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y) is defined as

\[
P\bigl(z_x^i \,\big|\, \{x_i, R_y(x_i)\}_{i=1}^{M}, y, z_y\bigr)
\;=\;
\frac{P(z_x^i)\, P(x_i \,|\, z_x^i)\, P(R_y(x_i) \,|\, z_x^i, z_y)}
{\sum_{z_x^i} P(z_x^i)\, P(x_i \,|\, z_x^i)\, P(R_y(x_i) \,|\, z_x^i, z_y)}.
\tag{a6}
\]

In the last expression of Equation (a5), we have four terms. The first three terms involve the parameters P(z_x), P(x_i | z_x), and P(r | z_x, z_y), respectively. Setting the derivatives of Equation (a5) with respect to each of these three parameters to zero (subject to the normalization constraints) yields the updating equations listed in Equations (19)-(21). The last term does not involve the parameters of the current iteration and can therefore be ignored.
Appendix B: Proof of the EM Updating Equations for the MAP Smoothing Method

In Section 3.3, we show that by introducing a Dirichlet prior on the parameters we can maximize the posterior of the model instead of its likelihood, which results in the set of updating Equations (25')-(28'). In this appendix, we prove that those EM updating equations are correct. For simplicity, we give the proof only for the flexible mixture model (FMM); the proof for the joint mixture model (JMM) is almost identical. Putting Equations (23) and (32) together, the logarithm of the posterior for the FMM model can be expressed as

\[
\begin{aligned}
Q(\theta) \;=\; & \sum_{y} \sum_{i} \log \sum_{z_x, z_y} P(z_y \,|\, y)\, P(z_x)\, P(x_i \,|\, z_x)\, P\bigl(R_y(x_i) \,\big|\, z_x, z_y\bigr) \\
& + a \sum_{z_x} \log P(z_x)
\;+\; b \sum_{i} \sum_{z_x} \log P(x_i \,|\, z_x)
\;+\; c \sum_{y} \sum_{z_y} \log P(z_y \,|\, y)
\;+\; d \sum_{r} \sum_{z_x, z_y} \log P(r \,|\, z_x, z_y).
\end{aligned}
\tag{b1}
\]

As in Appendix A, \theta denotes the model of the current iteration, i.e., \theta = (\{P(z_x)\}, \{P(x \,|\, z_x)\}, \{P(z_y \,|\, y)\}, \{P(r \,|\, z_x, z_y)\}), and \theta' denotes the model obtained from the previous iteration, i.e., \theta' = (\{P'(z_x)\}, \{P'(x \,|\, z_x)\}, \{P'(z_y \,|\, y)\}, \{P'(r \,|\, z_x, z_y)\}). The difference between the logarithms of the posteriors of two consecutive iterations can be written as

\[
\begin{aligned}
Q(\theta) - Q(\theta') \;=\;
& \sum_{y} \sum_{i} \log
\frac{\sum_{z_x, z_y} P(z_y \,|\, y)\, P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{\sum_{z_x, z_y} P'(z_y \,|\, y)\, P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)} \\
& + a \sum_{z_x} \log \frac{P(z_x)}{P'(z_x)}
\;+\; b \sum_{i} \sum_{z_x} \log \frac{P(x_i \,|\, z_x)}{P'(x_i \,|\, z_x)}
\;+\; c \sum_{y} \sum_{z_y} \log \frac{P(z_y \,|\, y)}{P'(z_y \,|\, y)}
\;+\; d \sum_{r} \sum_{z_x, z_y} \log \frac{P(r \,|\, z_x, z_y)}{P'(r \,|\, z_x, z_y)} \\
\;\geq\;
& \sum_{y} \sum_{i} \sum_{z_x, z_y} P'\bigl(z_x, z_y \,\big|\, x_i, y, R_y(x_i)\bigr)\, \log
\frac{P(z_y \,|\, y)\, P(z_x)\, P(x_i \,|\, z_x)\, P(R_y(x_i) \,|\, z_x, z_y)}
{P'(z_y \,|\, y)\, P'(z_x)\, P'(x_i \,|\, z_x)\, P'(R_y(x_i) \,|\, z_x, z_y)} \\
& + a \sum_{z_x} \log \frac{P(z_x)}{P'(z_x)}
\;+\; b \sum_{i} \sum_{z_x} \log \frac{P(x_i \,|\, z_x)}{P'(x_i \,|\, z_x)}
\;+\; c \sum_{y} \sum_{z_y} \log \frac{P(z_y \,|\, y)}{P'(z_y \,|\, y)}
\;+\; d \sum_{r} \sum_{z_x, z_y} \log \frac{P(r \,|\, z_x, z_y)}{P'(r \,|\, z_x, z_y)} \\
\;=\;
& \sum_{z_x} \Bigl(a + \sum_{y, i, z_y} P'\bigl(z_x, z_y \,\big|\, x_i, y, R_y(x_i)\bigr)\Bigr) \log \frac{P(z_x)}{P'(z_x)}
\;+\; \sum_{i, z_x} \Bigl(b + \sum_{y, z_y} P'\bigl(z_x, z_y \,\big|\, x_i, y, R_y(x_i)\bigr)\Bigr) \log \frac{P(x_i \,|\, z_x)}{P'(x_i \,|\, z_x)} \\
& + \sum_{y, z_y} \Bigl(c + \sum_{i, z_x} P'\bigl(z_x, z_y \,\big|\, x_i, y, R_y(x_i)\bigr)\Bigr) \log \frac{P(z_y \,|\, y)}{P'(z_y \,|\, y)}
\;+\; \sum_{r, z_x, z_y} \Bigl(d + \sum_{y, i} P'\bigl(z_x, z_y \,\big|\, x_i, y, R_y(x_i)\bigr)\, \delta\bigl(r, R_y(x_i)\bigr)\Bigr) \log \frac{P(r \,|\, z_x, z_y)}{P'(r \,|\, z_x, z_y)},
\end{aligned}
\tag{b2}
\]

where the inequality uses Jensen's inequality with the posterior weights P'(z_x, z_y | x_i, y, R_y(x_i)), and \delta(r, R_y(x_i)) equals 1 if r = R_y(x_i) and 0 otherwise. In the last expression, all the parameters are decoupled into four independent terms. Therefore, the updating equations are obtained by setting the derivatives of the last expression with respect to each of the four parameters P(z_x), P(x | z_x), P(z_y | y), and P(r | z_x, z_y) to zero (subject to the normalization constraints), which results in Equations (25')-(28').
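To connect the derivation above to something executable, the following is a compact sketch of an EM loop for an FMM-style model in which the Dirichlet hyperparameters a, b, c, d enter the M-step as pseudo-counts, the form the updates take once the derivatives of (b2) are set to zero. This is our own illustration under assumed data structures (a list of (user, item, rating) index triples) and random initialization, not the authors' implementation.

```python
import numpy as np

def fmm_map_em(triples, n_users, n_items, n_ratings, Kx=2, Ky=2,
               a=1.0, b=1.0, c=1.0, d=1.0, n_iter=50, seed=0):
    """EM with MAP (pseudo-count) smoothing for an FMM-style model; a sketch only."""
    rng = np.random.default_rng(seed)
    # Random initialization, normalized along the proper axes.
    p_zx = rng.random(Kx);                   p_zx /= p_zx.sum()
    p_x_zx = rng.random((n_items, Kx));      p_x_zx /= p_x_zx.sum(axis=0)
    p_zy_y = rng.random((n_users, Ky));      p_zy_y /= p_zy_y.sum(axis=1, keepdims=True)
    p_r_z = rng.random((n_ratings, Kx, Ky)); p_r_z /= p_r_z.sum(axis=0)

    for _ in range(n_iter):
        # Expected counts, seeded with the Dirichlet pseudo-counts (the MAP part).
        c_zx = np.full(Kx, a)
        c_x_zx = np.full((n_items, Kx), b)
        c_zy_y = np.full((n_users, Ky), c)
        c_r_z = np.full((n_ratings, Kx, Ky), d)

        for (y, x, r) in triples:
            # E-step: joint posterior over (z_x, z_y) for this observation.
            q = (p_zx[:, None] * p_x_zx[x][:, None]
                 * p_zy_y[y][None, :] * p_r_z[r])        # shape (Kx, Ky)
            q /= q.sum()
            # Accumulate expected counts.
            c_zx += q.sum(axis=1)
            c_x_zx[x] += q.sum(axis=1)
            c_zy_y[y] += q.sum(axis=0)
            c_r_z[r] += q

        # M-step: renormalize the smoothed counts.
        p_zx = c_zx / c_zx.sum()
        p_x_zx = c_x_zx / c_x_zx.sum(axis=0)
        p_zy_y = c_zy_y / c_zy_y.sum(axis=1, keepdims=True)
        p_r_z = c_r_z / c_r_z.sum(axis=0)

    return p_zx, p_x_zx, p_zy_y, p_r_z
```

Setting a = b = c = d = 0 in this sketch recovers plain (unsmoothed) EM updates, which is one way to see why the pseudo-counts only matter when the expected counts themselves are small.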
References

Breese, J. S., Heckerman, D., & Kadie, C. (1998). Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence.

Cohen, W., Schapire, R., & Singer, Y. (1998). Learning to Order Things. In Advances in Neural Information Processing Systems 10, Denver, CO, 1997. MIT Press.

O'Connor, M., & Herlocker, J. (2001). Clustering Items for Collaborative Filtering. In Proceedings of the SIGIR-2001 Workshop on Recommender Systems, New Orleans, LA.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (1998). An Efficient Boosting Algorithm for Combining Preferences. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998).

Ha, V., & Haddawy, P. (1998). Toward Case-Based Preference Elicitation: Similarity Measures on Preference Structures. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI 1998).

Hofmann, T., & Puzicha, J. (1999). Latent Class Models for Collaborative Filtering. In Proceedings of the International Joint Conference on Artificial Intelligence.

Hofmann, T., & Puzicha, J. (1998). Statistical Models for Co-occurrence Data. Technical Report, Artificial Intelligence Laboratory Memo 1625, M.I.T.

Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000). Collaborative Filtering by Personality Diagnosis: A Hybrid Memory- and Model-Based Approach. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence.

Popescul, A., Ungar, L. H., Pennock, D. M., & Lawrence, S. (2001). Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work.

Ross, D. A., & Zemel, R. S. (2002). Multiple-cause Vector Quantization. In NIPS-15: Advances in Neural Information Processing Systems 15.

Ueda, N., & Nakano, R. (1998). Deterministic Annealing EM Algorithm. Neural Networks, 11(2):271-282.

Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An Algorithmic Framework for Performing Collaborative Filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999).

Melville, P., Mooney, R. J., & Nagarajan, R. (2002). Content-Boosted Collaborative Filtering for Improved Recommendations. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI 2002).

SWAMI: A Framework for Collaborative Filtering Algorithm Development and Evaluation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000).

Si, L., & Jin, R. (2003). Product Space Mixture Model for Collaborative Filtering. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003).

Jin, R., Si, L., & Zhai, C. X. (2003). Preference-based Graphic Models for Collaborative Filtering. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI 2003).

Hofmann, T. (2003). Gaussian Latent Semantic Models for Collaborative Filtering. In Proceedings of the 26th Annual International ACM SIGIR Conference.