CS395T Data Mining: A Statistical Learning Perspective
Term Project Report I
Team members: Hui Wu, Qi Li
Topic: (2) The Netflix Prize --- Co-clustering and Local Regression

1 Background
Recommender systems are a computer-based intelligent technique for dealing with the problem of information and product overload. They analyze patterns of user interest in items and products to provide personalized services that match users' interests in most business domains, benefiting both the user and the merchant [4]. Content-based approaches and collaborative filtering (CF) are the two dominant strategies for recommender systems [1]. While content-based strategies require gathering large amounts of external information that might not be available or easy to collect, collaborative filtering, which relies only on users' past ratings of items, has become the preferred approach for researchers. Many collaborative filtering techniques in the literature are based on correlation criteria [7,8] or matrix factorization [9,10]. The correlation-based techniques adopt similarity measures, such as Pearson correlation [7] and cosine similarity [8], to determine a neighborhood of similar users for each user and predict the user's rating for an item as a weighted average of the ratings of the neighbors. These techniques have much reduced coverage because they cannot detect item synonymy [1]. The matrix factorization approaches include SVD-based [9] and NNMF-based [10] filtering techniques that predict the unknown ratings from a low-rank approximation of the original ratings matrix. These techniques treat users and items symmetrically and handle item synonymy and sparsity [1], but their computation is intensive.

1.1 Netflix problem
Netflix, the largest online DVD rental service, has held a contest named the 'Netflix Prize' to improve its movie recommendation system. The recommendation problem in this system is essentially the problem of estimating ratings for the items that have not been seen by a user. The data set consists of about 2 GB of CSV files containing 100 million ratings made by roughly 480K (anonymous) users on a library of almost 18K movies. The goal is to make sense of the data and produce a roughly 20 MB file of predictions for a set of ratings not contained in the sample; these are compared against the users' actual ratings to test the quality of the estimation. Many talented researchers have been attracted to this problem because of its great challenges: while 100 million ratings may sound like a lot, there would be 8.5 billion ratings if every user rated every movie, which means only a little over 1% of a complete sample is provided, and we have to make good predictions of users' ratings from such a sparse data set.

2 Co-clustering and local regression
In this project, we employ a collaborative filtering approach based on the weighted Bregman co-clustering algorithm [3], which involves simultaneous clustering of users and items. Co-clustering is the problem of simultaneously clustering the rows and columns of a data matrix. Unlike general clustering, which seeks similar rows or columns, co-clustering seeks "blocks" (or "co-clusters") of rows and columns that are inter-related [3]. Co-clustering is desirable over traditional "single-sided" clustering from a number of perspectives [3]:
1. "Simultaneous clustering of row and column clusters is informative and digestible.
Co-clustering provides compressed representations that are easily interpretable while preserving most of the information contained in the original data, which makes it valuable to a large class of statistical data analysis applications." [3]
2. "A row (or column) clustering can be thought of as dimensionality reduction along the rows (or columns). Simultaneous clustering along rows and columns reduces dimensionality along both axes, thus leading to a statistical problem with a dramatically smaller number of parameters and hence a much more compact representation for subsequent analysis." [3]
3. "As the size of data matrices increases, so does the need for scalable clustering algorithms. Single-sided, geometric clustering algorithms such as k-means and its variants have computation time proportional to m*n*k per iteration, where m is the number of rows, n is the number of columns and k is the number of row clusters. Co-clustering algorithms based on a similar iterative process involve optimizing over a smaller number of parameters, and can relax this dependence to (m*k*l + n*k*l), where m, n and k are the number of rows, the number of columns and the number of row clusters, and l is the number of column clusters. Since the number of row and column clusters is usually much lower than the original number of rows and columns, co-clustering can lead to a substantial reduction in the running time." [3]

3 Problem definition [1]
$U = \{u_i\}_{i=1}^{m}$ is the set of users, with $|U| = m$. $P = \{p_j\}_{j=1}^{n}$ is the set of items, with $|P| = n$. $A$ is the $m \times n$ ratings matrix, where $A_{ij}$ is the rating of user $u_i$ for item $p_j$. $W$ is the $m \times n$ matrix holding the confidence of the ratings in $A$; we assume $W_{ij} = 1$ when the rating is known and $W_{ij} = 0$ otherwise. $\rho:\{1,\dots,m\}\to\{1,\dots,k\}$ and $\gamma:\{1,\dots,n\}\to\{1,\dots,l\}$ denote the user and item clusterings, where $k$ and $l$ are the numbers of user and item clusters. $\hat{A}$ is the approximation matrix for $A$.

Two definitions of the approximation matrix $\hat{A}$:
1. $\hat{A}_{ij} = A^{COC}_{gh}$
2. $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$

Definition 1 is the simplest but still intuitive approximation matrix for $A$, while Definition 2 incorporates the biases of individual users and items by including the terms (user average - user-cluster average) and (item average - item-cluster average) in addition to the co-cluster average. Here $\rho(i) = g$ and $\gamma(j) = h$; $A^{R}_{i}$ and $A^{C}_{j}$ are the average ratings of user $u_i$ and item $p_j$, and $A^{COC}_{gh}$, $A^{RC}_{g}$ and $A^{CC}_{h}$ are the average ratings of the corresponding co-cluster, user cluster and item cluster, respectively [1]:
$$A^{COC}_{gh} = \frac{\sum_{i':\rho(i')=g}\sum_{j':\gamma(j')=h} A_{i'j'}}{\sum_{i':\rho(i')=g}\sum_{j':\gamma(j')=h} W_{i'j'}}, \qquad A^{R}_{i} = \frac{\sum_{j'=1}^{n} A_{ij'}}{\sum_{j'=1}^{n} W_{ij'}}, \qquad A^{C}_{j} = \frac{\sum_{i'=1}^{m} A_{i'j}}{\sum_{i'=1}^{m} W_{i'j}},$$
$$A^{RC}_{g} = \frac{\sum_{i':\rho(i')=g}\sum_{j'=1}^{n} A_{i'j'}}{\sum_{i':\rho(i')=g}\sum_{j'=1}^{n} W_{i'j'}}, \qquad A^{CC}_{h} = \frac{\sum_{i'=1}^{m}\sum_{j':\gamma(j')=h} A_{i'j'}}{\sum_{i'=1}^{m}\sum_{j':\gamma(j')=h} W_{i'j'}}.$$

Using $\hat{A}$, we can treat the prediction of unknown ratings as a co-clustering problem in which we seek the user and item clustering $(\rho,\gamma)$ that minimizes the sum-squared residue [1]
$$\min_{(\rho,\gamma)} \sum_{i=1}^{m}\sum_{j=1}^{n} W_{ij}\,(A_{ij} - \hat{A}_{ij})^{2}.$$

4 Algorithm Design
In this project, we mainly base our algorithm on papers [1], [2], [3] and [4].

4.1 Batch iteration
Algorithm 1: batch iteration using $\hat{A}_{ij} = A^{COC}_{gh}$.
Algorithm 3: batch iteration using $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$.
The algorithms operate in a batch fashion in the sense that at each iteration the column clustering $\gamma$ is updated only after determining the nearest column cluster for every column of $A$ (and likewise for the rows). The algorithm iterates until the decrease in the objective function becomes small, as governed by the tolerance factor $\tau$.
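For concreteness, the following is a minimal sketch of how the summary statistics of Section 3 and the Definition 2 prediction can be computed. It assumes a dense numpy representation with zeros for unrated entries; the function and variable names (cocluster_averages, predict, rho, gamma) are ours and not part of [1].

    import numpy as np

    def cocluster_averages(A, W, rho, gamma, k, l):
        # A: m x n ratings (0 where unrated); W: m x n indicator of known ratings
        # rho[i] in {0,...,k-1}: user (row) cluster; gamma[j] in {0,...,l-1}: item (column) cluster
        eps = 1e-12                                   # guards against empty clusters
        A_R = A.sum(axis=1) / (W.sum(axis=1) + eps)   # user averages A^R_i
        A_C = A.sum(axis=0) / (W.sum(axis=0) + eps)   # item averages A^C_j
        R = np.eye(k)[rho]                            # m x k one-hot row-cluster indicators
        C = np.eye(l)[gamma]                          # n x l one-hot column-cluster indicators
        A_COC = (R.T @ A @ C) / (R.T @ W @ C + eps)   # k x l co-cluster averages A^COC_gh
        A_RC = (R.T @ A).sum(axis=1) / ((R.T @ W).sum(axis=1) + eps)  # user-cluster averages A^RC_g
        A_CC = (A @ C).sum(axis=0) / ((W @ C).sum(axis=0) + eps)      # item-cluster averages A^CC_h
        return A_R, A_C, A_COC, A_RC, A_CC

    def predict(i, j, rho, gamma, A_R, A_C, A_COC, A_RC, A_CC):
        # Definition 2: co-cluster average corrected by the user and item biases
        g, h = rho[i], gamma[j]
        return A_COC[g, h] + (A_R[i] - A_RC[g]) + (A_C[j] - A_CC[h])

In practice the Netflix matrix is far too large and sparse for dense m x n arrays, so the same sums would be accumulated from a sparse triplet representation; the sketch only illustrates the definitions.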
4.2 Incremental algorithms
Algorithm 2: incremental algorithm using $\hat{A}_{ij} = A^{COC}_{gh}$.
Algorithm 4: incremental algorithm using $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$.
The incremental algorithms move individual columns (rows) between column (row) clusters whenever such a move decreases the objective function. Each invocation of the incremental procedure attempts such a move for every row and column of the data matrix, which effectively helps escape poor local minima and avoids empty clusters. A ping-pong approach is used to invoke the batch and incremental algorithms: the incremental local search refines the clustering produced by the batch algorithms and triggers further runs of the batch algorithms. The whole process continues until the objective function converges.

Algorithm 1
Input: rating matrix $A$, non-zeros matrix $W$, number of row clusters $k$, number of column clusters $l$.
Output: co-clustering $(\rho,\gamma)$.
    Initialize $\rho$ and $\gamma$; compute $A^{COC}$
    objval $\leftarrow \sum_{i,j} W_{ij}(A_{ij}-\hat{A}_{ij})^2$; $\Delta \leftarrow 1$; $\tau \leftarrow 10^{-2}\|A\|^2$
    While $\Delta > \tau$
        Foreach $1 \le j \le n$: $\gamma(j) \leftarrow \arg\min_{1\le h\le l} \sum_{i=1}^{m} W_{ij}\,(A_{ij} - A^{COC}_{\rho(i)h})^2$
        Update $A^{COC}$
        Foreach $1 \le i \le m$: $\rho(i) \leftarrow \arg\min_{1\le g\le k} \sum_{j=1}^{n} W_{ij}\,(A_{ij} - A^{COC}_{g\gamma(j)})^2$
        Update $A^{COC}$
        oldobj $\leftarrow$ objval; objval $\leftarrow \sum_{i,j} W_{ij}(A_{ij}-\hat{A}_{ij})^2$; $\Delta \leftarrow$ |oldobj - objval|

Algorithm 2 (incremental algorithm, for columns)
Input: $A$, $W$, $l$ and $\gamma$.
Output: column clustering $\gamma$.
    $\tau \leftarrow 10^{-5}\|A\|^2$ {adjustable parameter}
    Foreach $1 \le j \le n$, with $h_0 = \gamma(j)$:
        For $h = 1$ to $l$, $h \ne h_0$: compute the decrease in the objective obtained by moving column $j$ from $h_0$ to $h$,
        $\delta_j(h) \leftarrow \sum_{g=1}^{k}\Big[\sum_{i\in g}\sum_{j'\in h_0} W_{ij'}(A_{ij'}-A^{COC}_{gh_0})^2 - \sum_{i\in g}\sum_{j'\in \tilde{h}_0} W_{ij'}(A_{ij'}-\tilde{A}^{COC}_{gh_0})^2 + \sum_{i\in g}\sum_{j'\in h} W_{ij'}(A_{ij'}-A^{COC}_{gh})^2 - \sum_{i\in g}\sum_{j'\in \tilde{h}} W_{ij'}(A_{ij'}-\tilde{A}^{COC}_{gh})^2\Big]$   (*)
        where $\tilde{h}_0$, $\tilde{h}$ and $\tilde{A}^{COC}$ denote the clusters and co-cluster averages after the tentative move.
    {find the best column to move along with the best cluster}
    $(j^*, c^*) \leftarrow \arg\max_{(j,h)} \delta_j(h)$
    If $\delta_{j^*}(c^*) > \tau$ then $\gamma(j^*) \leftarrow c^*$; update $A^{COC}$

Algorithm 3
Input: rating matrix $A$, non-zeros matrix $W$, number of row clusters $k$, number of column clusters $l$.
Output: co-clustering $(\rho,\gamma)$.
    Initialize $\rho$ and $\gamma$; compute $A^{COC}$, $A^{RC}$, $A^{CC}$, $A^{R}$ and $A^{C}$
    objval $\leftarrow \sum_{i,j} W_{ij}(A_{ij}-\hat{A}_{ij})^2$; $\Delta \leftarrow 1$; $\tau \leftarrow 10^{-2}\|A\|^2$ {adjustable parameters}
    While $\Delta > \tau$
        Foreach $1 \le j \le n$: $\gamma(j) \leftarrow \arg\min_{1\le h\le l} \sum_{i=1}^{m} W_{ij}\big(A_{ij} - A^{COC}_{\rho(i)h} - (A^{R}_{i}-A^{RC}_{\rho(i)}) - (A^{C}_{j}-A^{CC}_{h})\big)^2$
        Update $A^{COC}$, $A^{CC}$
        Foreach $1 \le i \le m$: $\rho(i) \leftarrow \arg\min_{1\le g\le k} \sum_{j=1}^{n} W_{ij}\big(A_{ij} - A^{COC}_{g\gamma(j)} - (A^{R}_{i}-A^{RC}_{g}) - (A^{C}_{j}-A^{CC}_{\gamma(j)})\big)^2$
        Update $A^{COC}$, $A^{RC}$
        oldobj $\leftarrow$ objval; objval $\leftarrow \sum_{i,j} W_{ij}(A_{ij}-\hat{A}_{ij})^2$; $\Delta \leftarrow$ |oldobj - objval|

Algorithm 4 (incremental algorithm, for columns)
Input: $A$, $W$, $l$ and $\gamma$.
Output: column clustering $\gamma$.
    $\tau \leftarrow 10^{-5}\|A\|^2$ {adjustable parameter}
    Foreach $1 \le j \le n$, with $h_0 = \gamma(j)$:
        For $h = 1$ to $l$, $h \ne h_0$:
        $\delta_j(h) \leftarrow \sum_{g=1}^{k}\Big[\sum_{i\in g}\sum_{j'\in h_0} W_{ij'}(A_{ij'}-\hat{A}_{ij'}(g,h_0))^2 - \sum_{i\in g}\sum_{j'\in \tilde{h}_0} W_{ij'}(A_{ij'}-\tilde{A}_{ij'}(g,h_0))^2 + \sum_{i\in g}\sum_{j'\in h} W_{ij'}(A_{ij'}-\hat{A}_{ij'}(g,h))^2 - \sum_{i\in g}\sum_{j'\in \tilde{h}} W_{ij'}(A_{ij'}-\tilde{A}_{ij'}(g,h))^2\Big]$   (*)
        where $\hat{A}_{ij}(g,h)$ is the Definition 2 approximation built from co-cluster $(g,h)$ and $\tilde{A}_{ij}(g,h)$ is the same approximation after the tentative move (with updated $A^{COC}$ and $A^{CC}$).
    {find the best column to move along with the best cluster}
    $(j^*, c^*) \leftarrow \arg\max_{(j,h)} \delta_j(h)$
    If $\delta_{j^*}(c^*) > \tau$ then $\gamma(j^*) \leftarrow c^*$; update $A^{COC}$, $A^{CC}$

4.3 Simplifying the computation in these algorithms
Algorithm 1: the computational time per iteration is $O((nl+mk)\,|W_{gh}|)$, where $|W_{gh}|$ is the number of non-zeros in $A_{gh}$.
Algorithm 3 can be implemented efficiently by pre-computing the invariant parts of the update cost functions [1]. Let
$$A^{temp1}_{ij} = A_{ij} - A^{R}_{i} - A^{C}_{j}, \qquad A^{temp2}_{ih} = \frac{\sum_{j':\gamma(j')=h} A^{temp1}_{ij'}}{\sum_{j':\gamma(j')=h} W_{ij'}} + A^{CC}_{h}, \qquad A^{temp3}_{gj} = \frac{\sum_{i':\rho(i')=g} A^{temp1}_{i'j}}{\sum_{i':\rho(i')=g} W_{i'j}} + A^{RC}_{g}.$$
Minimizing the row update cost function is then equivalent to minimizing
$$\sum_{j=1}^{n}\big(A^{temp2}_{i\gamma(j)} - A^{COC}_{g\gamma(j)} + A^{RC}_{g}\big)^{2},$$
and minimizing the column update cost function is equivalent to minimizing
$$\sum_{i=1}^{m}\big(A^{temp3}_{\rho(i)j} - A^{COC}_{\rho(i)h} + A^{CC}_{h}\big)^{2}.$$
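As an illustration of the batch updates, the following is a minimal sketch of a single pass of Algorithm 1 (Definition 1, without the bias terms), under the same dense-numpy assumptions as the earlier sketch. It simply recomputes the co-cluster averages rather than using the pre-computed temp matrices, and the names batch_pass and coc are ours.

    import numpy as np

    def batch_pass(A, W, rho, gamma, k, l):
        # One batch pass of Algorithm 1: reassign every column, update A^COC,
        # then reassign every row, update A^COC again, and report the objective.
        # rho, gamma: integer arrays holding the current cluster assignments.
        m, n = A.shape
        eps = 1e-12

        def coc(rho, gamma):
            R = np.eye(k)[rho]
            C = np.eye(l)[gamma]
            return (R.T @ A @ C) / (R.T @ W @ C + eps)

        A_COC = coc(rho, gamma)
        # column update: gamma(j) <- argmin_h sum_i W_ij (A_ij - A^COC_{rho(i),h})^2
        for j in range(n):
            costs = [np.sum(W[:, j] * (A[:, j] - A_COC[rho, h]) ** 2) for h in range(l)]
            gamma[j] = int(np.argmin(costs))
        A_COC = coc(rho, gamma)
        # row update: rho(i) <- argmin_g sum_j W_ij (A_ij - A^COC_{g,gamma(j)})^2
        for i in range(m):
            costs = [np.sum(W[i, :] * (A[i, :] - A_COC[g, gamma]) ** 2) for g in range(k)]
            rho[i] = int(np.argmin(costs))
        A_COC = coc(rho, gamma)
        objective = np.sum(W * (A - A_COC[rho][:, gamma]) ** 2)
        return rho, gamma, A_COC, objective

The outer loop would repeat such passes, tracking the decrease of the objective against the tolerance factor $\tau$ as in the pseudocode above.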
Each iteration of Algorithm 2 can be sped up by performing the computations in a different manner. We use column reassignment to show how the simplification works for incremental Algorithm 2.

Assume column $j_c$ is originally assigned to cluster $h_0$ ($\gamma(j_c) = h_0$) and is reassigned to cluster $h$ during the incremental step ($\tilde{\gamma}(j_c) = h$). Then the changes $h_0 \to \tilde{h}_0$ and $h \to \tilde{h}$ are the only changes that contribute to the change in the objective $\sum_{i,j} W_{ij}(A_{ij}-\hat{A}_{ij})^2$. (Column clusters in the current clustering are denoted by $h$, $h_0$, column clusters in the new clustering by $\tilde{h}$, $\tilde{h}_0$, and we write $i \in g$ for $\rho(i)=g$ and $j \in h$ for $\gamma(j)=h$.)

For each co-cluster with $\rho(i)=g$, the change $h_0 \to \tilde{h}_0$ contributes
$$\sum_{i\in g}\sum_{j\in h_0} W_{ij}(A_{ij}-A^{COC}_{gh_0})^2 - \sum_{i\in g}\sum_{j\in \tilde{h}_0} W_{ij}(A_{ij}-\tilde{A}^{COC}_{gh_0})^2$$
$$= \sum_{i\in g}\sum_{j\in \tilde{h}_0} W_{ij}\big[(A_{ij}-A^{COC}_{gh_0})^2 - (A_{ij}-\tilde{A}^{COC}_{gh_0})^2\big] + \sum_{i\in g} W_{ij_c}(A_{ij_c}-A^{COC}_{gh_0})^2$$
$$= \Big(\sum_{i\in g}\sum_{j\in \tilde{h}_0} W_{ij}\Big)\big(A^{COC}_{gh_0}-\tilde{A}^{COC}_{gh_0}\big)^2 + \sum_{i\in g} W_{ij_c}(A_{ij_c}-A^{COC}_{gh_0})^2,$$
where the last step uses the fact that the weighted mean of $A_{ij}$ over $i\in g$, $j\in\tilde{h}_0$ is exactly $\tilde{A}^{COC}_{gh_0}$. For the change $h \to \tilde{h}$ we get analogously
$$\sum_{i\in g}\sum_{j\in h} W_{ij}(A_{ij}-A^{COC}_{gh})^2 - \sum_{i\in g}\sum_{j\in \tilde{h}} W_{ij}(A_{ij}-\tilde{A}^{COC}_{gh})^2 = -\Big(\sum_{i\in g}\sum_{j\in h} W_{ij}\Big)\big(A^{COC}_{gh}-\tilde{A}^{COC}_{gh}\big)^2 - \sum_{i\in g} W_{ij_c}(A_{ij_c}-\tilde{A}^{COC}_{gh})^2.$$
Since
$$A^{COC}_{gh_0} = \frac{\sum_{i\in g}\sum_{j\in h_0} A_{ij}}{\sum_{i\in g}\sum_{j\in h_0} W_{ij}} \quad\text{and}\quad \tilde{A}^{COC}_{gh_0} = \frac{\sum_{i\in g}\sum_{j\in h_0} A_{ij} - \sum_{i\in g} A_{ij_c}}{\sum_{i\in g}\sum_{j\in h_0} W_{ij} - \sum_{i\in g} W_{ij_c}},$$
we have
$$\Big(\sum_{i\in g}\sum_{j\in h_0} W_{ij} - \sum_{i\in g} W_{ij_c}\Big)\tilde{A}^{COC}_{gh_0} = \Big(\sum_{i\in g}\sum_{j\in h_0} W_{ij}\Big)A^{COC}_{gh_0} - \sum_{i\in g} A_{ij_c} \qquad (1)$$
and, similarly,
$$\Big(\sum_{i\in g}\sum_{j\in h} W_{ij} + \sum_{i\in g} W_{ij_c}\Big)\tilde{A}^{COC}_{gh} = \Big(\sum_{i\in g}\sum_{j\in h} W_{ij}\Big)A^{COC}_{gh} + \sum_{i\in g} A_{ij_c}. \qquad (2)$$
Thus, if we store $A^{COC}$, $A$ and $W$ in main memory, we can compute $\tilde{A}^{COC}_{gh_0}$ and $\tilde{A}^{COC}_{gh}$ in a faster manner. The computational time of step (*) in Algorithm 2 is then $O(kl\,|W_{gh}|)$, where $|W_{gh}|$ is the number of non-zeros in $A_{gh}$. Since the Netflix rating matrix $A$ is sparse, it is possible to speed up Algorithm 2 in this manner.

Iterations of Algorithm 4 can be sped up in a similar manner; we again use column reassignment to show the simplification. For the co-cluster with $\rho(i)=g$, the change $h_0 \to \tilde{h}_0$ contributes
$$\sum_{i\in g}\sum_{j\in h_0} W_{ij}\big(A_{ij}-A^{COC}_{gh_0}-(A^{R}_{i}-A^{RC}_{g})-(A^{C}_{j}-A^{CC}_{h_0})\big)^2 - \sum_{i\in g}\sum_{j\in \tilde{h}_0} W_{ij}\big(A_{ij}-\tilde{A}^{COC}_{gh_0}-(A^{R}_{i}-A^{RC}_{g})-(A^{C}_{j}-\tilde{A}^{CC}_{h_0})\big)^2.$$
Because a column move leaves $A^{R}_{i}$, $A^{RC}_{g}$ and $A^{C}_{j}$ unchanged, every residual in the co-cluster shifts by the same constant $(\tilde{A}^{COC}_{gh_0}-A^{COC}_{gh_0}) + (A^{CC}_{h_0}-\tilde{A}^{CC}_{h_0})$, so this difference (and the analogous one for $h \to \tilde{h}$) again expands into terms built from the stored statistics $A^{COC}$, $A^{CC}$, $A^{R}$, $A^{RC}$, $A^{C}$ and the ratings and weights of the moved column $j_c$. Since
$$A^{CC}_{h_0} = \frac{\sum_{i=1}^{m}\sum_{j\in h_0} A_{ij}}{\sum_{i=1}^{m}\sum_{j\in h_0} W_{ij}} \quad\text{and}\quad \tilde{A}^{CC}_{h_0} = \frac{\sum_{i=1}^{m}\sum_{j\in h_0} A_{ij} - \sum_{i=1}^{m} A_{ij_c}}{\sum_{i=1}^{m}\sum_{j\in h_0} W_{ij} - \sum_{i=1}^{m} W_{ij_c}},$$
we have
$$\Big(\sum_{i=1}^{m}\sum_{j\in h_0} W_{ij} - \sum_{i=1}^{m} W_{ij_c}\Big)\tilde{A}^{CC}_{h_0} = \Big(\sum_{i=1}^{m}\sum_{j\in h_0} W_{ij}\Big)A^{CC}_{h_0} - \sum_{i=1}^{m} A_{ij_c} \qquad (3)$$
and, similarly,
$$\Big(\sum_{i=1}^{m}\sum_{j\in h} W_{ij} + \sum_{i=1}^{m} W_{ij_c}\Big)\tilde{A}^{CC}_{h} = \Big(\sum_{i=1}^{m}\sum_{j\in h} W_{ij}\Big)A^{CC}_{h} + \sum_{i=1}^{m} A_{ij_c}. \qquad (4)$$
We can therefore calculate $\tilde{A}^{CC}_{h_0}$, $\tilde{A}^{CC}_{h}$, $\tilde{A}^{COC}_{gh_0}$ and $\tilde{A}^{COC}_{gh}$ according to formulas (1)-(4). Also, since $A^{COC}$, $A^{RC}$, $A^{CC}$, $A^{R}$ and $A^{C}$ are stored in main memory, the computational time of step (*) in Algorithm 4 is $O(kl\,|W_{gh}|)$, where $|W_{gh}|$ is the number of non-zeros in $A_{gh}$. Since the Netflix rating matrix $A$ is sparse, it is possible to speed up Algorithm 4 in this manner.
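To illustrate how identities (1)-(4) are used, the following is a minimal sketch of the bookkeeping for a single column move. It keeps the numerators and denominators of $A^{COC}$ and $A^{CC}$ as running sums so that only the affected entries change; the names (move_column, coc_num, coc_den, cc_num, cc_den) are ours, and the dense-numpy representation is again only for illustration.

    import numpy as np

    def move_column(A, W, rho, k, gamma, jc, h_new, coc_num, coc_den, cc_num, cc_den):
        # Update the statistics behind A^COC (= coc_num / coc_den, shape k x l) and
        # A^CC (= cc_num / cc_den, length l) when column jc moves from h0 = gamma[jc]
        # to h_new. Only the entries (g, h0), (g, h_new) of A^COC and h0, h_new of
        # A^CC change, which is exactly what identities (1)-(4) express.
        h0 = gamma[jc]
        if h_new == h0:
            return
        a_col = np.bincount(rho, weights=A[:, jc], minlength=k)  # sum over i in g of A_{i,jc}
        w_col = np.bincount(rho, weights=W[:, jc], minlength=k)  # sum over i in g of W_{i,jc}
        # identities (1)-(2): shift column jc's mass between co-clusters (g, h0) and (g, h_new)
        coc_num[:, h0] -= a_col
        coc_den[:, h0] -= w_col
        coc_num[:, h_new] += a_col
        coc_den[:, h_new] += w_col
        # identities (3)-(4): the same shift for the column-cluster statistics
        cc_num[h0] -= a_col.sum()
        cc_den[h0] -= w_col.sum()
        cc_num[h_new] += a_col.sum()
        cc_den[h_new] += w_col.sum()
        gamma[jc] = h_new

With these running sums in memory, the values $\tilde{A}^{COC}_{gh_0}$, $\tilde{A}^{COC}_{gh}$, $\tilde{A}^{CC}_{h_0}$ and $\tilde{A}^{CC}_{h}$ needed by step (*) can be read off without rescanning the affected co-clusters, which is the point of the simplification.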
4.4 Other issues
Use a spectral approximation for initialization. Test whether removing global effects before co-clustering is helpful.

REFERENCES
[1] T. George and S. Merugu. A Scalable Collaborative Filtering Framework Based on Co-clustering. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), November 2005.
[2] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum Sum-Squared Residue Co-clustering of Gene Expression Data. In Proceedings of the Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.
[3] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. In KDD, pages 509-514, 2004.
[4] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. In ICDM, 2007.
[5] J. S. Breese, D. Heckerman, and C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In UAI, pages 43-52, 1998.
[6] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. In WWW, pages 285-295, 2001.
[7] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proc. of ACM Conf. on Computer Supported Cooperative Work, pages 175-186, 1994.
[8] U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth". In Proc. of ACM Conf. on Human Factors in Computing Systems, volume 1, pages 210-217, 1995.
[9] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of Dimensionality Reduction in Recommender Systems: A Case Study. In WebKDD Workshop, 2000.
[10] N. Srebro and T. Jaakkola. Weighted Low-Rank Approximation. In ICML, pages 720-728, 2003.