CS395T Data Mining: A Statistical Learning Perspective
Term Project Report I
Topic (2): The Netflix Prize --- Co-clustering and Local Regression
Team members: Hui Wu, Qi Li
1 Background
Recommender systems are computer-based intelligent techniques for dealing with information and product overload. They analyze patterns of user interest in items and products in order to provide personalized services that match the user's interests in most business domains, benefiting both the user and the merchant [4].
The content-based approach and Collaborative Filtering (CF) are the two dominant strategies for recommender systems [1]. While content-based strategies require gathering large amounts of external information that may not be available or easy to collect, collaborative filtering relies only on users' past ratings of items, which makes it the preferred choice for many researchers.
Many collaborative filtering techniques in the literature are based on correlation criteria [7,8] or on matrix factorization [9,10]. The correlation-based techniques adopt similarity measures, such as Pearson correlation [7] and cosine similarity [8], to determine a neighborhood of similar users for each user and predict the user's rating of an item as a weighted average of the neighbors' ratings. These techniques have much reduced coverage because they cannot detect item synonymy [1].
The matrix factorization approaches are SVD-based [9] and NNMF-based [10] filtering techniques that predict the unknown ratings from a low-rank approximation of the original ratings matrix. These techniques treat users and items symmetrically and handle both item synonymy and sparsity [1], but their computation is intensive.
1.1 Netflix problem
Netflix, the largest online DVD rental service, has held a contest named the 'Netflix Prize' to improve its movie recommendation system. At its core, the recommendation problem in this system is the problem of estimating a user's ratings for items that the user has not yet seen.
The data consist of about 2 GB of CSV files containing roughly 100 million ratings made by about 480,000 (anonymous) users on a library of almost 18,000 movies. The goal is to make sense of these data and produce a roughly 20 MB file of predictions for a set of ratings not contained in the sample; these predictions are then compared against the users' actual ratings to evaluate the estimates.
Many talented researchers have been attracted to this problem because of its challenges: while 100 million ratings may sound like a lot, there would be roughly 8.5 billion ratings if every user rated every movie, which means that only a little over 1% of the complete matrix is provided, and we have to make good predictions of users' ratings from such a sparse dataset.
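To make this scale concrete, the following is a minimal sketch (not the project's actual loader; the `triples` input, the 0-based re-indexing, and the function name are assumptions for illustration) of how such ratings can be held in memory as a SciPy sparse matrix A together with the 0/1 indicator matrix W used throughout this report.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_matrices(triples, n_users, n_movies):
    """triples: iterable of (user_idx, movie_idx, rating) with 0-based indices."""
    users, movies, ratings = map(np.array, zip(*triples))
    # A holds the known ratings (zeros elsewhere); W marks which entries are known.
    A = csr_matrix((ratings.astype(np.float64), (users, movies)),
                   shape=(n_users, n_movies))
    W = csr_matrix((np.ones_like(ratings, dtype=np.float64), (users, movies)),
                   shape=(n_users, n_movies))
    return A, W
```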
2 Co-clustering and local regression
In this project, we employ a collaborative filtering approach based on the weighted Bregman co-clustering algorithm [3], which involves simultaneous clustering of users and items. Co-clustering is the problem of simultaneously clustering the rows and columns of a data matrix. Unlike ordinary clustering, which seeks similar rows or columns, co-clustering seeks “blocks” (or “co-clusters”) of rows and columns that are inter-related [3].
Co-clustering is desirable over traditional “single-sided” clustering from a number of perspectives [3]:
1. “Simultaneous clustering of row and column clusters is informative and digestible.
Co-clustering provides compressed representations that are easily interpretable while preserving
most of the information contained in the original data, which makes it valuable to a large class of
statistical data analysis applications”. [3]
2. “A row (or column) clustering can be thought of as dimensionality reduction along the rows (or columns). Simultaneous clustering along rows and columns reduces dimensionality along both axes, thus leading to a statistical problem with a dramatically smaller number of parameters and hence a much more compact representation for subsequent analysis”. [3]
3. “As the size of data matrices increases, so does the need for scalable clustering algorithms. Single-sided, geometric clustering algorithms such as k-means and its variants have computation time proportional to m*n*k per iteration, where m is the number of rows, n is the number of columns and k is the number of row clusters. Co-clustering algorithms based on a similar iterative process involve optimizing over a smaller number of parameters, and can relax this dependence to (m*k*l + n*k*l), where m, n, k and l are the number of rows, columns, row clusters and column clusters, respectively. Since the number of row and column clusters is usually much lower than the original number of rows and columns, co-clustering can lead to a substantial reduction in running time”. [3]
3 Problem definition [1]
Let $U = \{u_i\}_{i=1}^{m}$ be the set of users, with $|U| = m$, and let $P = \{p_j\}_{j=1}^{n}$ be the set of items, with $|P| = n$. $A$ is the $m \times n$ ratings matrix and $A_{ij}$ is the rating of user $u_i$ for item $p_j$. $W$ is an $m \times n$ matrix corresponding to the confidence of the ratings in $A$; we assume $W_{ij} = 1$ when the rating is known and $W_{ij} = 0$ otherwise. $\rho : \{1,\dots,m\} \to \{1,\dots,k\}$ and $\gamma : \{1,\dots,n\} \to \{1,\dots,l\}$ denote the user and item clusterings, where $k$ and $l$ are the numbers of user and item clusters. $\hat{A}$ is the approximation matrix for $A$.

Two definitions of the approximation matrix $\hat{A}$ (writing $g = \rho(i)$ and $h = \gamma(j)$):

1. $\hat{A}_{ij} = A^{COC}_{gh}$

2. $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$

Definition 1 is the simplest but still intuitive approximation of $A$, while definition 2 incorporates the biases of the individual users and items by adding the terms (user average - user-cluster average) and (item average - item-cluster average) to the co-cluster average.

When $\rho(i) = g$ and $\gamma(j) = h$, $A^{R}_{i}$ and $A^{C}_{j}$ are the average ratings of user $u_i$ and item $p_j$, and $A^{COC}_{gh}$, $A^{RC}_{g}$ and $A^{CC}_{h}$ are the average ratings of the corresponding co-cluster, user cluster and item cluster, respectively [1]:

$$A^{COC}_{gh} = \frac{\sum_{i'|\rho(i')=g}\ \sum_{j'|\gamma(j')=h} A_{i'j'}}{\sum_{i'|\rho(i')=g}\ \sum_{j'|\gamma(j')=h} W_{i'j'}}, \qquad
A^{RC}_{g} = \frac{\sum_{i'|\rho(i')=g}\ \sum_{j'=1}^{n} A_{i'j'}}{\sum_{i'|\rho(i')=g}\ \sum_{j'=1}^{n} W_{i'j'}}, \qquad
A^{CC}_{h} = \frac{\sum_{i'=1}^{m}\ \sum_{j'|\gamma(j')=h} A_{i'j'}}{\sum_{i'=1}^{m}\ \sum_{j'|\gamma(j')=h} W_{i'j'}},$$

$$A^{R}_{i} = \frac{\sum_{j'=1}^{n} A_{ij'}}{\sum_{j'=1}^{n} W_{ij'}}, \qquad
A^{C}_{j} = \frac{\sum_{i'=1}^{m} A_{i'j}}{\sum_{i'=1}^{m} W_{i'j}}.$$

Using $\hat{A}$, we can treat the prediction of unknown ratings as a co-clustering problem in which we seek the user and item clustering $(\rho, \gamma)$ that minimizes the sum-squared residue [1]

$$\min_{(\rho,\gamma)}\ \sum_{i=1}^{m}\sum_{j=1}^{n} W_{ij}\,\big(A_{ij} - \hat{A}_{ij}\big)^2 .$$
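For concreteness, here is a small dense-matrix sketch of these summary averages and of the weighted sum-squared residue under definition 2. The variable names are ours, `rho` and `gamma` are 0-based integer label arrays, unknown entries of A are stored as zeros (consistent with W), and a real implementation for the Netflix data would work on sparse matrices instead.

```python
import numpy as np

def summaries(A, W, rho, gamma, k, l):
    """Dense sketch of the averages A^R, A^C, A^RC, A^CC, A^COC."""
    eps = 1e-12                                              # guards against empty clusters
    m, n = A.shape
    Ru = np.zeros((m, k)); Ru[np.arange(m), rho] = 1.0       # row-cluster membership
    Cv = np.zeros((n, l)); Cv[np.arange(n), gamma] = 1.0     # column-cluster membership
    A_R   = A.sum(1) / (W.sum(1) + eps)                      # user averages
    A_C   = A.sum(0) / (W.sum(0) + eps)                      # item averages
    A_RC  = (Ru.T @ A).sum(1) / ((Ru.T @ W).sum(1) + eps)    # user-cluster averages
    A_CC  = (A @ Cv).sum(0) / ((W @ Cv).sum(0) + eps)        # item-cluster averages
    A_COC = (Ru.T @ A @ Cv) / (Ru.T @ W @ Cv + eps)          # co-cluster averages
    return A_R, A_C, A_RC, A_CC, A_COC

def objective(A, W, rho, gamma, A_R, A_C, A_RC, A_CC, A_COC):
    """Weighted sum-squared residue under definition 2."""
    A_hat = (A_COC[np.ix_(rho, gamma)]
             + (A_R - A_RC[rho])[:, None]
             + (A_C - A_CC[gamma])[None, :])
    return float(np.sum(W * (A - A_hat) ** 2))
```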
4 Algorithm Design
In this project, we mainly base our algorithm on the following papers: [1], [2], [3] and [4].
4.1 Batch iteration
Algorithm 1: batch iteration using $\hat{A}_{ij} = A^{COC}_{gh}$

Algorithm 3: batch iteration using $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$

• The algorithms operate in a batch fashion in the sense that at each iteration the column clustering γ is updated only after determining the nearest column cluster for every column of A (and likewise for the rows and ρ).
• The algorithms iterate until the decrease in the objective function becomes small, as governed by the tolerance factor τ.
4.2 Incremental algorithms
Algorithm 2: incremental algorithm using $\hat{A}_{ij} = A^{COC}_{gh}$

Algorithm 4: incremental algorithm using $\hat{A}_{ij} = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$

• Incrementally move columns (rows) between column (row) clusters whenever such a move leads to a decrease in the objective function. Each invocation of the incremental procedure tries to perform such a move for every row and column of the data matrix.
• This helps the search escape from poor local minima and avoids empty clusters.
A ping-pong approach is used to invoke the batch and incremental algorithms alternately: the incremental local search refines the clustering produced by the batch algorithms and in turn triggers further runs of the batch algorithms. The whole process continues until the objective function converges; a minimal driver sketch of this ping-pong loop is given below.
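The sketch below shows only this control flow; `batch_pass`, `incremental_pass` and `objective` are placeholders for Algorithms 1/3, Algorithms 2/4 and the weighted sum-squared residue, and their signatures are our assumption.

```python
def train(A, W, rho, gamma, batch_pass, incremental_pass, objective,
          tol=1e-6, max_rounds=50):
    """Ping-pong driver: alternate batch and incremental passes until the
    objective stops decreasing (relative tolerance tol)."""
    prev = objective(A, W, rho, gamma)
    for _ in range(max_rounds):
        rho, gamma = batch_pass(A, W, rho, gamma)        # Algorithm 1 or 3
        rho, gamma = incremental_pass(A, W, rho, gamma)  # Algorithm 2 or 4
        cur = objective(A, W, rho, gamma)
        if abs(prev - cur) <= tol * max(prev, 1.0):      # objective has converged
            break
        prev = cur
    return rho, gamma
```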
Algorithm 1
Input: rating matrix A, non-zeros matrix W, # row clusters k, # column clusters l
Output: co-clustering (ρ, γ)

Initialize ρ and γ
Compute $A^{COC}$
objval ← $\sum_{i}\sum_{j} W_{ij}(A_{ij} - \hat{A}_{ij})^2$
∆ ← 1; τ ← $10^{-2}\,\|A\|^2$
While ∆ > τ
    Foreach 1 ≤ j ≤ n
        γ(j) ← $\arg\min_{1 \le h \le l}\ \sum_{i=1}^{m} W_{ij}\,(A_{ij} - A^{COC}_{\rho(i)h})^2$
    Update $A^{COC}$
    Foreach 1 ≤ i ≤ m
        ρ(i) ← $\arg\min_{1 \le g \le k}\ \sum_{j=1}^{n} W_{ij}\,(A_{ij} - A^{COC}_{g\gamma(j)})^2$
    Update $A^{COC}$
    oldobj ← objval; objval ← $\sum_{i}\sum_{j} W_{ij}(A_{ij} - \hat{A}_{ij})^2$
    ∆ ← |oldobj − objval|
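A dense sketch of the column sweep above (our variable names; the row sweep is symmetric, and $A^{COC}$ must be recomputed between the two sweeps as in the pseudocode):

```python
import numpy as np

def update_columns(A, W, rho, gamma, A_COC):
    """One batch column sweep of Algorithm 1: assign each column j to the
    column cluster h minimizing sum_i W_ij (A_ij - A^COC_{rho(i), h})^2."""
    n = A.shape[1]
    for j in range(n):
        err = (A[:, [j]] - A_COC[rho, :]) ** 2      # m x l squared errors per candidate h
        cost = (W[:, [j]] * err).sum(axis=0)        # length-l cost vector for column j
        gamma[j] = int(np.argmin(cost))
    return gamma
```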
Algorithm 2: incremental algorithm (for columns)
Input: A, W, l, current clusterings ρ and γ
Output: column clustering γ

τ ← $10^{-5}\,\|A\|^2$   {* Adjustable parameters}
Foreach 1 ≤ j ≤ n, with current cluster h₀ = γ(j)
    For h = 1 to l, h ≠ h₀
        compute the decrease in the objective obtained by moving column j from h₀ to h:   (*)
        $$\delta_j(h) = \sum_{g=1}^{k}\Big[\sum_{i\in\hat g}\sum_{j'\in\hat h_0} W_{ij'}\big(A_{ij'} - A^{COC}_{gh_0}\big)^2 - \sum_{i\in\hat g}\sum_{j'\in\tilde h_0} W_{ij'}\big(A_{ij'} - A^{COC}_{g\tilde h_0}\big)^2 + \sum_{i\in\hat g}\sum_{j'\in\hat h} W_{ij'}\big(A_{ij'} - A^{COC}_{gh}\big)^2 - \sum_{i\in\hat g}\sum_{j'\in\tilde h} W_{ij'}\big(A_{ij'} - A^{COC}_{g\tilde h}\big)^2\Big],$$
        where $i \in \hat g$ means $\rho(i) = g$, $j \in \hat h$ means $\gamma(j) = h$, and $\tilde h_0$, $\tilde h$ denote the column clusters $h_0$ and $h$ after column j has been moved from $h_0$ to $h$.
{find the best column to move along with the best cluster}
(j*, h*) ← $\arg\max_{(j,h)}\ \delta_j(h)$
If $\delta_{j^*}(h^*) > τ$ then γ(j*) ← h*
Update $A^{COC}$
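A deliberately naive sketch of this step (our helper names; the assumed `objective` recomputes $A^{COC}$ and the weighted sum-squared residue from scratch for every candidate move, which is exactly the cost that Section 4.3 removes):

```python
import numpy as np

def incremental_columns(A, W, rho, gamma, l, objective, tau):
    """Try every single-column move and commit the best one if its gain exceeds tau."""
    base = objective(A, W, rho, gamma)
    best_gain, best_j, best_h = 0.0, -1, -1
    for j in range(A.shape[1]):
        h0 = gamma[j]
        for h in range(l):
            if h == h0:
                continue
            gamma[j] = h                               # tentative move
            gain = base - objective(A, W, rho, gamma)  # delta_j(h)
            if gain > best_gain:
                best_gain, best_j, best_h = gain, j, h
            gamma[j] = h0                              # undo the tentative move
    if best_j >= 0 and best_gain > tau:
        gamma[best_j] = best_h                         # commit the single best move
    return gamma
```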
Algorithm 3
Input: rating matrix A, non-zeros matrix W, # row clusters k, # column clusters l
Output: co-clustering (ρ, γ)

Initialize ρ and γ
Compute $A^{COC}$, $A^{RC}$, $A^{CC}$, $A^{R}$ and $A^{C}$
objval ← $\sum_{i}\sum_{j} W_{ij}(A_{ij} - \hat{A}_{ij})^2$
∆ ← 1; τ ← $10^{-2}\,\|A\|^2$   {* Adjustable parameters}
While ∆ > τ
    Foreach 1 ≤ j ≤ n
        γ(j) ← $\arg\min_{1 \le h \le l}\ \sum_{i=1}^{m} W_{ij}\,\big(A_{ij} - A^{COC}_{\rho(i)h} - A^{R}_{i} + A^{RC}_{\rho(i)} - A^{C}_{j} + A^{CC}_{h}\big)^2$
    Update $A^{COC}$, $A^{CC}$
    Foreach 1 ≤ i ≤ m
        ρ(i) ← $\arg\min_{1 \le g \le k}\ \sum_{j=1}^{n} W_{ij}\,\big(A_{ij} - A^{COC}_{g\gamma(j)} - A^{R}_{i} + A^{RC}_{g} - A^{C}_{j} + A^{CC}_{\gamma(j)}\big)^2$
    Update $A^{COC}$, $A^{RC}$
    oldobj ← objval; objval ← $\sum_{i}\sum_{j} W_{ij}(A_{ij} - \hat{A}_{ij})^2$
    ∆ ← |oldobj − objval|
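Once Algorithm 3 has converged, an unknown rating is predicted directly from the stored clusterings and averages. The sketch below shows this under definition 2; the fallback for a user or item unseen in training (e.g., a global or single-sided average) is our assumption and not part of the algorithm statement.

```python
def predict(i, j, rho, gamma, A_R, A_C, A_RC, A_CC, A_COC):
    """Predicted rating of user i for item j under definition 2."""
    g, h = rho[i], gamma[j]
    return A_COC[g, h] + (A_R[i] - A_RC[g]) + (A_C[j] - A_CC[h])
```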
Algorithm 4: incremental algorithm (for columns)
Input: A, W, l, current clusterings ρ and γ
Output: column clustering γ

τ ← $10^{-5}\,\|A\|^2$   {* Adjustable parameters}
Foreach 1 ≤ j ≤ n, with current cluster h₀ = γ(j)
    For h = 1 to l, h ≠ h₀
        compute the decrease in the objective obtained by moving column j from h₀ to h:   (*)
        $$\delta_j(h) = \sum_{g=1}^{k}\Big[\sum_{i\in\hat g}\sum_{j'\in\hat h_0} W_{ij'}\big(A_{ij'} - \hat{A}_{ij'}(g,h_0)\big)^2 - \sum_{i\in\hat g}\sum_{j'\in\tilde h_0} W_{ij'}\big(A_{ij'} - \hat{A}_{ij'}(g,\tilde h_0)\big)^2 + \sum_{i\in\hat g}\sum_{j'\in\hat h} W_{ij'}\big(A_{ij'} - \hat{A}_{ij'}(g,h)\big)^2 - \sum_{i\in\hat g}\sum_{j'\in\tilde h} W_{ij'}\big(A_{ij'} - \hat{A}_{ij'}(g,\tilde h)\big)^2\Big],$$
        where $\hat{A}_{ij}(g,h) = A^{COC}_{gh} + (A^{R}_{i} - A^{RC}_{g}) + (A^{C}_{j} - A^{CC}_{h})$ and $\tilde h_0$, $\tilde h$ are the clusters $h_0$ and $h$ after the move.
{find the best column to move along with the best cluster}
(j*, h*) ← $\arg\max_{(j,h)}\ \delta_j(h)$
If $\delta_{j^*}(h^*) > τ$ then γ(j*) ← h*
Update $A^{COC}$, $A^{CC}$
4.3 Simplifying the computation in these algorithms
Algorithm 1: the computational time is $O\big((nl + mk)\,W_{gh}\big)$, where $W_{gh}$ is the number of non-zeros in $A_{gh}$.
Algorithm 3 can be implemented efficiently by pre-computing the invariant parts of the update
cost functions [1].
Let $A^{temp1}_{ij} = A_{ij} - A^{R}_{i} - A^{C}_{j}$, and define

$$A^{temp2}_{ih} = \frac{\sum_{j'|\gamma(j')=h} W_{ij'}\,A^{temp1}_{ij'}}{\sum_{j'|\gamma(j')=h} W_{ij'}} + A^{CC}_{h}, \qquad
A^{temp3}_{gj} = \frac{\sum_{i'|\rho(i')=g} W_{i'j}\,A^{temp1}_{i'j}}{\sum_{i'|\rho(i')=g} W_{i'j}} + A^{RC}_{g}.$$

Minimizing the row update cost function is then equivalent to minimizing

$$\sum_{j=1}^{n} W_{ij}\,\big(A^{temp2}_{i\gamma(j)} - A^{COC}_{g\gamma(j)} + A^{RC}_{g}\big)^2 ,$$

and minimizing the column update cost function is equivalent to minimizing

$$\sum_{i=1}^{m} W_{ij}\,\big(A^{temp3}_{\rho(i)j} - A^{COC}_{\rho(i)h} + A^{CC}_{h}\big)^2 .$$
Each iteration of Algorithm 2 can also be sped up by organizing the computations differently. We use column reassignment to show the simplification for the incremental Algorithm 2. Assume column $j_c$ is originally assigned to cluster $h_0$ ($\gamma(j_c) = h_0$) and is reassigned to cluster $h$ during the incremental step ($\tilde\gamma(j_c) = h$); then the changes $h_0 \to \tilde h_0$ and $h \to \tilde h$ are the only changes that contribute to the objective $\sum_{i,j} W_{ij}(A_{ij} - \hat{A}_{ij})^2$. (Column clusters in the current clustering are denoted $h_0$ and $h$, the corresponding clusters in the new clustering are denoted $\tilde h_0$ and $\tilde h$, and we write $i \in \hat g$ for $\rho(i) = g$ and $j \in \hat h$ for $\gamma(j) = h$.)

For each co-cluster with $\rho(i) = g$, the change due to $h_0 \to \tilde h_0$ is

$$\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij}\big(A_{ij} - A^{COC}_{gh_0}\big)^2 - \sum_{i\in\hat g}\sum_{j\in\tilde h_0} W_{ij}\big(A_{ij} - A^{COC}_{g\tilde h_0}\big)^2$$
$$= \sum_{i\in\hat g}\sum_{j\in\tilde h_0} W_{ij}\Big[\big(A_{ij} - A^{COC}_{gh_0}\big)^2 - \big(A_{ij} - A^{COC}_{g\tilde h_0}\big)^2\Big] + \sum_{i\in\hat g} W_{ij_c}\big(A_{ij_c} - A^{COC}_{gh_0}\big)^2$$
$$= \Big(\sum_{i\in\hat g}\sum_{j\in\tilde h_0} W_{ij}\Big)\big(A^{COC}_{g\tilde h_0} - A^{COC}_{gh_0}\big)^2 + \sum_{i\in\hat g} W_{ij_c}\big(A_{ij_c} - A^{COC}_{gh_0}\big)^2$$

(using the fact that $A^{COC}_{g\tilde h_0}$ is the weighted mean of the entries in co-cluster $(g, \tilde h_0)$), and the change due to $h \to \tilde h$ is

$$\sum_{i\in\hat g}\sum_{j\in\hat h} W_{ij}\big(A_{ij} - A^{COC}_{gh}\big)^2 - \sum_{i\in\hat g}\sum_{j\in\tilde h} W_{ij}\big(A_{ij} - A^{COC}_{g\tilde h}\big)^2$$
$$= -\Big(\sum_{i\in\hat g}\sum_{j\in\hat h} W_{ij}\Big)\big(A^{COC}_{g\tilde h} - A^{COC}_{gh}\big)^2 - \sum_{i\in\hat g} W_{ij_c}\big(A_{ij_c} - A^{COC}_{g\tilde h}\big)^2 .$$

Since

$$A^{COC}_{g\tilde h_0} = \frac{\sum_{i\in\hat g}\sum_{j\in\tilde h_0} A_{ij}}{\sum_{i\in\hat g}\sum_{j\in\tilde h_0} W_{ij}} = \frac{\sum_{i\in\hat g}\sum_{j\in\hat h_0} A_{ij} - \sum_{i\in\hat g} A_{ij_c}}{\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij} - \sum_{i\in\hat g} W_{ij_c}}
\quad\text{and}\quad
A^{COC}_{gh_0} = \frac{\sum_{i\in\hat g}\sum_{j\in\hat h_0} A_{ij}}{\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij}},$$

we have

$$\Big(\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij} - \sum_{i\in\hat g} W_{ij_c}\Big)\, A^{COC}_{g\tilde h_0} = \Big(\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij}\Big)\, A^{COC}_{gh_0} - \sum_{i\in\hat g} A_{ij_c} \qquad (1)$$

and, similarly,

$$\Big(\sum_{i\in\hat g}\sum_{j\in\hat h} W_{ij} + \sum_{i\in\hat g} W_{ij_c}\Big)\, A^{COC}_{g\tilde h} = \Big(\sum_{i\in\hat g}\sum_{j\in\hat h} W_{ij}\Big)\, A^{COC}_{gh} + \sum_{i\in\hat g} A_{ij_c} . \qquad (2)$$

Thus, if we store $A^{COC}$, $A$ and $W$ in main memory, we can compute $A^{COC}_{g\tilde h_0}$ and $A^{COC}_{g\tilde h}$ quickly. As a result, the computational time of step (*) in Algorithm 2 is $O(lk\,W_{gh})$, where $W_{gh}$ is the number of non-zeros in $A_{gh}$. Since the Netflix rating matrix $A$ is a sparse matrix, it is possible to speed up Algorithm 2 in this manner.
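In an implementation, formulas (1) and (2) amount to keeping per-co-cluster rating sums and weights and adjusting only the two affected columns of those tables when a column moves. A sketch under that assumption (our data layout and names, with dense column access for brevity):

```python
import numpy as np

def move_column(jc, h, A, W, rho, gamma, coc_sum, coc_w, k):
    """Apply formulas (1) and (2): update the per-co-cluster rating sums and
    weights when column jc moves from its current cluster to cluster h, then
    return the refreshed co-cluster averages A^COC."""
    h0 = gamma[jc]
    col_sum = np.bincount(rho, weights=A[:, jc], minlength=k)  # sum_{i in g} A_{i,jc}
    col_w   = np.bincount(rho, weights=W[:, jc], minlength=k)  # sum_{i in g} W_{i,jc}
    coc_sum[:, h0] -= col_sum;  coc_w[:, h0] -= col_w          # leave h0, formula (1)
    coc_sum[:, h]  += col_sum;  coc_w[:, h]  += col_w          # join h, formula (2)
    gamma[jc] = h
    return coc_sum / np.maximum(coc_w, 1e-12)
```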
Each iteration of Algorithm 4 can be sped up in a similar manner. We again use column reassignment to show the simplification for the incremental Algorithm 4. For each co-cluster with $\rho(i) = g$, the changes due to $h_0 \to \tilde h_0$ and $h \to \tilde h$ are now

$$\sum_{i\in\hat g}\sum_{j\in\hat h_0} W_{ij}\big(A_{ij} - A^{COC}_{gh_0} - A^{R}_{i} + A^{RC}_{g} - A^{C}_{j} + A^{CC}_{h_0}\big)^2 - \sum_{i\in\hat g}\sum_{j\in\tilde h_0} W_{ij}\big(A_{ij} - A^{COC}_{g\tilde h_0} - A^{R}_{i} + A^{RC}_{g} - A^{C}_{j} + A^{CC}_{\tilde h_0}\big)^2$$

and

$$\sum_{i\in\hat g}\sum_{j\in\hat h} W_{ij}\big(A_{ij} - A^{COC}_{gh} - A^{R}_{i} + A^{RC}_{g} - A^{C}_{j} + A^{CC}_{h}\big)^2 - \sum_{i\in\hat g}\sum_{j\in\tilde h} W_{ij}\big(A_{ij} - A^{COC}_{g\tilde h} - A^{R}_{i} + A^{RC}_{g} - A^{C}_{j} + A^{CC}_{\tilde h}\big)^2 .$$

Expanding these differences exactly as in the Algorithm 2 case, every term reduces to a closed-form expression involving only the stored summary statistics ($A^{COC}$, $A^{RC}$, $A^{CC}$, $A^{R}$, $A^{C}$), the weights $W$, and the ratings of the reassigned column $j_c$. Since

$$A^{CC}_{\tilde h_0} = \frac{\sum_{i=1}^{m}\sum_{j\in\tilde h_0} A_{ij}}{\sum_{i=1}^{m}\sum_{j\in\tilde h_0} W_{ij}} = \frac{\sum_{i=1}^{m}\sum_{j\in\hat h_0} A_{ij} - \sum_{i=1}^{m} A_{ij_c}}{\sum_{i=1}^{m}\sum_{j\in\hat h_0} W_{ij} - \sum_{i=1}^{m} W_{ij_c}}
\quad\text{and}\quad
A^{CC}_{h_0} = \frac{\sum_{i=1}^{m}\sum_{j\in\hat h_0} A_{ij}}{\sum_{i=1}^{m}\sum_{j\in\hat h_0} W_{ij}},$$

we have

$$\Big(\sum_{i=1}^{m}\sum_{j\in\hat h_0} W_{ij} - \sum_{i=1}^{m} W_{ij_c}\Big)\, A^{CC}_{\tilde h_0} = \Big(\sum_{i=1}^{m}\sum_{j\in\hat h_0} W_{ij}\Big)\, A^{CC}_{h_0} - \sum_{i=1}^{m} A_{ij_c} \qquad (3)$$

and, similarly,

$$\Big(\sum_{i=1}^{m}\sum_{j\in\hat h} W_{ij} + \sum_{i=1}^{m} W_{ij_c}\Big)\, A^{CC}_{\tilde h} = \Big(\sum_{i=1}^{m}\sum_{j\in\hat h} W_{ij}\Big)\, A^{CC}_{h} + \sum_{i=1}^{m} A_{ij_c} . \qquad (4)$$

We can therefore compute $A^{CC}_{\tilde h_0}$, $A^{CC}_{\tilde h}$, $A^{COC}_{g\tilde h_0}$ and $A^{COC}_{g\tilde h}$ from formulas (1)-(4). Also, since $A^{COC}$, $A^{RC}$, $A^{CC}$, $A^{R}$ and $A^{C}$ are stored in main memory, the computational time of step (*) in Algorithm 4 is $O(lk\,W_{gh})$, where $W_{gh}$ is the number of non-zeros in $A_{gh}$. Since the Netflix rating matrix $A$ is a sparse matrix, it is possible to speed up Algorithm 4 in this manner.
4.4 Other issues
• Spectral approximation for the initialization of (ρ, γ).
• Test whether removing global effects before co-clustering is helpful (a minimal sketch of one such preprocessing step is given below).
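For the second item, one simple version of removing global effects, in the spirit of [4], would be to subtract the global mean plus per-user and per-item offsets and then co-cluster the residuals; the exact form below is our assumption, not a prescription from the papers.

```python
import numpy as np

def remove_global_effects(A, W, eps=1e-12):
    """Subtract the global mean and per-user / per-item offsets from the known
    ratings; the residual matrix is what would then be co-clustered."""
    mu = A.sum() / (W.sum() + eps)                                 # global mean rating
    user_off = ((A - mu * W).sum(1)) / (W.sum(1) + eps)            # per-user offset
    item_off = ((A - mu * W - user_off[:, None] * W).sum(0)) / (W.sum(0) + eps)
    resid = A - W * (mu + user_off[:, None] + item_off[None, :])   # unknown entries stay 0
    return resid, mu, user_off, item_off
```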
REFERENCES
[1] T. George and S. Merugu. A Scalable Collaborative Filtering Framework Based on Co-clustering. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), November 2005.
[2] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. In Proceedings of the Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.
[3] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation. In KDD, pages 509-514, 2004.
[4] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. In ICDM, 2007.
[5] J. S. Breese, D. Heckerman, and C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In UAI, pages 43-52, 1998.
[6] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-Based Collaborative Filtering Recommendation Algorithms. In WWW, pages 285-295, 2001.
[7] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[8] U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth". In Proceedings of the ACM Conference on Human Factors in Computing Systems, volume 1, pages 210-217, 1995.
[9] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of Dimensionality Reduction in Recommender Systems: A Case Study. In WebKDD Workshop, 2000.
[10] N. Srebro and T. Jaakkola. Weighted Low Rank Approximation. In ICML, pages 720-728, 2003.