The BellKor 2008 Solution
to the Netflix Prize
by
Leenarat Leelapanyalert
Netflix Dataset
• Over 100 million movie ratings with date-stamp
(100,480,507 ratings)
• M = 17,770 movies
• N = 480,189 customers
• 1 star = no interest, 5 stars = strong interest
• Dec 31, 1999 – Dec 31, 2005
The user–item matrix has N × M = 8,532,958,530 elements;
98.9% of the values are missing.
Netflix Competition
• About 4.2 million of the ~100 million ratings are set aside for evaluation
– Training set (contains the Probe set)
– Qualifying set (Quiz set & Test set)
• Scoring
– Show RMSE achieved on the Quiz set
– Best RMSE on the Test set → THE WINNER!!
Outline
• Necessary index letters
• Baseline predictors (with temporal effects) → adjust for the deviations of each user (rater, customer) and item (movie)
• Latent factor models (with temporal effects) → compare items and users by SVD
• Neighborhood models (with temporal effects) → compute the relationship between items (or users)
• Integrated models → combine latent factor models and neighborhood models together
• Extra: other methods → new ideas
– Shrinking towards recent actions
– Blending multiple solutions
Index Letters
• u, v → users, raters, or customers
• i, j → movies, or items
• $r_{ui}$ → the score by user u of movie i
• $\hat{r}_{ui}$ → predicted value of $r_{ui}$
• $t_{ui}$ → the time of rating $r_{ui}$
• K → the training set, for which $r_{ui}$ is known
• R(u) → all the items for which ratings by u are available
• R(i) → the set of users who rated item i
• N(u) → all items that can be used to estimate u's scores (implicit feedback)
Baseline Predictors (bui)
$b_{ui} = \mu + b_u + b_i$
µ → the overall average rating
bu → deviations of user u
bi → deviation of item i
Example: µ = 3.7, $b_u$(Simha) = −0.3, $b_i$(Titanic) = 0.5
$b_{ui}$ = 3.7 − 0.3 + 0.5 = 3.9 stars
Estimate Parameter (bu, bi) – Formula
$b_{ui} = \mu + b_u + b_i$
$b_i = \dfrac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda_1 + |R(i)|}$
$b_u = \dfrac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda_2 + |R(u)|}$
The regularization parameters (𝜆1,𝜆2) are
determined by validation on the Probe set.
In this case: 𝜆1 = 25, 𝜆2 = 10
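A minimal sketch of these two sequential estimates in Python, assuming the training set is given as (user, item, rating) triples (the container and function names are illustrative, not from the paper):

```python
from collections import defaultdict

def estimate_baselines(ratings, lam1=25.0, lam2=10.0):
    """Regularized item and user biases, estimated one after the other.

    ratings: list of (user, item, rating) triples (the training set K).
    Returns (mu, b_item, b_user).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # b_i = sum_{u in R(i)} (r_ui - mu) / (lambda1 + |R(i)|)
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        item_sum[i] += r - mu
        item_cnt[i] += 1
    b_item = {i: item_sum[i] / (lam1 + item_cnt[i]) for i in item_sum}

    # b_u = sum_{i in R(u)} (r_ui - mu - b_i) / (lambda2 + |R(u)|)
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        user_sum[u] += r - mu - b_item[i]
        user_cnt[u] += 1
    b_user = {u: user_sum[u] / (lam2 + user_cnt[u]) for u in user_sum}

    return mu, b_item, b_user
```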
Estimate Parameter (bu, bi) – The Least Squares Problem
$b_{ui} = \mu + b_u + b_i$
$\min_{b_*} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 \Big( \sum_u b_u^2 + \sum_i b_i^2 \Big)$
The first term fits the given ratings; the second term avoids overfitting by penalizing the magnitudes of the parameters.
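As a sketch, the same biases can also be fit by stochastic gradient descent on this regularized objective; the learning rate and the single regularization constant below are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

def fit_baselines_sgd(ratings, mu, lam=0.02, lr=0.005, epochs=20):
    """Fit b_u, b_i by SGD on sum (r - mu - b_u - b_i)^2 + lam*(b_u^2 + b_i^2)."""
    b_user = defaultdict(float)
    b_item = defaultdict(float)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_user[u] + b_item[i])
            # Move each bias along the negative gradient of its term.
            b_user[u] += lr * (err - lam * b_user[u])
            b_item[i] += lr * (err - lam * b_item[i])
    return b_user, b_item
```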
Time Change VS Baseline Predictors
• An item’s popularity may change over time
• Users change their baseline rating over time
$b_{ui} = \mu + b_u + b_i$
$b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui})$
bi(tui)
• We do not expect movie likeability to fluctuate
on a daily basis
• Time periods → Bins
• 30 bins
$b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}$
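A minimal sketch of the 30-bin time discretization for the item bias. The date range and bin count follow the slides; the helper names and equal-width binning are illustrative assumptions:

```python
from datetime import date

FIRST_DAY = date(1999, 12, 31)   # start of the rating period
LAST_DAY = date(2005, 12, 31)    # end of the rating period
NUM_BINS = 30

def time_bin(t: date) -> int:
    """Map a rating date to one of 30 equal-width bins, Bin(t)."""
    total_days = (LAST_DAY - FIRST_DAY).days
    offset = (t - FIRST_DAY).days
    return min(NUM_BINS - 1, offset * NUM_BINS // total_days)

def item_bias(i, t, b_item, b_item_bin):
    """b_i(t) = b_i + b_{i,Bin(t)}, with learned parameter dictionaries."""
    return b_item[i] + b_item_bin.get((i, time_bin(t)), 0.0)
```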
bu(tui)
• Unlike movies, user effects can change on a
daily basis
• Time deviation
$\mathrm{dev}_u(t) = \mathrm{sign}(t - t_u) \cdot |t - t_u|^{\beta}$
$t_u$ → the mean date of rating by user u
t → the date that user u rated the movie
β = 0.4 by validation on the Probe set
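A sketch of the time deviation with β = 0.4 as stated above, treating dates as day numbers (the function name is illustrative):

```python
def dev_u(t_days: float, mean_t_days: float, beta: float = 0.4) -> float:
    """dev_u(t) = sign(t - t_u) * |t - t_u|^beta, with dates given as day counts."""
    diff = t_days - mean_t_days
    sign = 1.0 if diff > 0 else (-1.0 if diff < 0 else 0.0)
    return sign * abs(diff) ** beta
```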
bu(tui)
$b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui})$
$b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t)$
• Suits gradual drifts well

bu(tui)
• How about sudden drifts?
– We found that users often give multiple ratings in a single day, so add a day-specific term $b_{u,t}$:
$b_u^{(3)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) + b_{u,t}$
• A user rates on 40 different days on average
• Thus $b_{u,t}$ requires about 40 parameters per user
Baseline Predictors
$b_{ui}(t) = \mu + \underbrace{b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}}}_{\text{user bias}} + \underbrace{b_i + b_{i,\mathrm{Bin}(t_{ui})}}_{\text{movie bias}}$
• Movie bias is not completely user-independent
$b_{ui}(t) = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + \big(b_i + b_{i,\mathrm{Bin}(t_{ui})}\big) \cdot c_u(t_{ui})$
• $c_u(t)$ → time-dependent scaling feature: $c_u(t) = c_u + c_{u,t}$
– $c_u$ → stable part
– $c_{u,t}$ → day-specific variable
RMSE = 0.9555
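Putting the pieces together, a sketch of the full time-aware baseline predictor. Each argument is a parameter value already looked up for one (user, item, day) triple; the example at the end reuses the slide's toy numbers with all temporal terms at their defaults:

```python
def baseline_bui(mu, b_u, alpha_u, dev_ut, b_u_day, b_i, b_i_bin, c_u, c_u_day):
    """Time-aware baseline:
    b_ui(t) = mu + b_u + alpha_u * dev_u(t) + b_{u,t} + (b_i + b_{i,Bin(t)}) * c_u(t),
    with c_u(t) = c_u + c_{u,t}.
    """
    user_bias = b_u + alpha_u * dev_ut + b_u_day
    movie_bias = (b_i + b_i_bin) * (c_u + c_u_day)
    return mu + user_bias + movie_bias

# With mu = 3.7, b_u = -0.3, b_i = 0.5 and the temporal terms at their defaults
# (dev = 0, day-specific biases = 0, c_u = 1, c_{u,t} = 0) this reduces to the
# static baseline of 3.9 stars from the earlier example.
print(round(baseline_bui(3.7, -0.3, 0.0, 0.0, 0.0, 0.5, 0.0, 1.0, 0.0), 2))  # 3.9
```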
Frequencies (additional)
• The number of ratings a user gave on a specific day is SIGNIFICANT
$f_{ui} = \lfloor \log_a F_{ui} \rfloor$
$F_{ui}$ → the overall number of ratings that user u gave on day $t_{ui}$
$b_{ui}(t) = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + \big(b_i + b_{i,\mathrm{Bin}(t_{ui})}\big) \cdot c_u(t_{ui}) + b_{i,f_{ui}}$
$b_{i,f}$ → the bias specific to item i at log-frequency f
RMSE: 0.9555 → 0.9278
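A sketch of the frequency feature. The log base a is a tuned constant in the original work; the value below is only a placeholder assumption:

```python
import math
from collections import Counter

def frequency_features(ratings_with_day, a=6.76):
    """f_ui = floor(log_a F_ui), where F_ui is the number of ratings user u
    gave on day t_ui.  ratings_with_day: list of (user, item, day) tuples.
    The base a = 6.76 is an assumed placeholder, tuned by validation in practice.
    """
    day_counts = Counter((u, d) for u, _, d in ratings_with_day)
    return {
        (u, i): int(math.log(day_counts[(u, d)], a))  # floor, since counts >= 1
        for u, i, d in ratings_with_day
    }
```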
Why Frequencies Work?
• Works poorly when combined with user–movie interaction terms
• Adds nothing when combined with user-related parameters
• Rating a lot in bulk → the ratings are not close to the actual watching day
– Positive approach
– Negative approach
• High frequencies (or bulk ratings) do not represent much change in people's taste, but mostly a biased selection of movies
Predicting Future Days
• The day-specific parameters should be set to their default values:
– $c_u(t_{ui}) = c_u$
– $b_{u,t} = 0$
• The transient temporal model doesn’t attempt
to capture future changes.
Latent Factor Models
• To transform both items and users to the
same latent factor space
– Obvious dimensions
• Comedy VS Drama
• Amount of action
• Orientation to children
– Less well defined dimensions
• Depth of character development
• Tool → SVD
Singular Value Decomposition (SVD)
• Factoring matrices into a series of linear
approximations that expose the underlying
structure of the matrix
Singular Value Decomposition (SVD) – A Small Example
Predicted Score = User Baseline Rating × Movie Average Score

Nine users rate three movies (A, B, C). In the idealized case every user gives the same score to all movies (Simha 4, Ateeq 5, Smith 3, Greg 4, Mcq 4, Ramin 4, Xiao 4, Wu 3, Riz 5), so the matrix factors exactly as the user-baseline column (4, 5, 3, 4, 4, 4, 4, 3, 5) times the movie row (1, 1, 1).

The actual ratings are noisier:

         A  B  C
Simha    4  4  5
Ateeq    4  5  5
Smith    3  3  2
Greg     4  5  4
Mcq      4  4  4
Ramin    3  5  4
Xiao     4  4  3
Wu       2  4  4
Riz      5  5  5

The best rank-1 approximation is still (user baseline) × (movie scale), with user baselines (4.34, 4.69, 2.66, 4.36, 4.00, 4.05, 3.66, 3.39, 5.00) and movie scales (0.91, 1.07, 1.00):

         A     B     C
Simha    3.95  4.64  4.34
Ateeq    4.27  5.02  4.69
Smith    2.42  2.85  2.66
Greg     3.97  4.67  4.36
Mcq      3.64  4.28  4.00
Ramin    3.69  4.33  4.05
Xiao     3.33  3.92  3.66
Wu       3.08  3.63  3.39
Riz      4.55  5.35  5.00

Subtracting these predictions from the actual ratings leaves a residual matrix (e.g. Simha: 0.05, −0.64, 0.66; Wu: −1.08, 0.37, 0.61), which is approximated by a second rank-1 term with movie factor (0.82, −0.20, −0.53), and its residual by a third term with movie factor (−0.21, 0.76, −0.62). Stacking the factors gives Ratings ≈ P · Q, with

P (user factors: baseline, 2nd, 3rd):
Simha    4.34  -0.18  -0.90
Ateeq    4.69  -0.38  -0.15
Smith    2.66   0.80   0.40
Greg     4.36   0.15   0.47
Mcq      4.00   0.35  -0.29
Ramin    4.05  -0.67   0.68
Xiao     3.66   0.89   0.33
Wu       3.39  -1.29   0.14
Riz      5.00   0.44  -0.36

Q (movie factors, one row per factor):
(0.91, 1.07, 1.00), (0.82, −0.20, −0.53), (−0.21, 0.76, −0.62)
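A short numpy sketch of the same idea: a truncated SVD of the example rating matrix. Keeping one component gives the rank-1 "baseline × scale" approximation above (up to scaling and sign of the two vectors), and each extra component shrinks the residuals:

```python
import numpy as np

# The 9 users x 3 movies example rating matrix from the slides.
R = np.array([
    [4, 4, 5],   # Simha
    [4, 5, 5],   # Ateeq
    [3, 3, 2],   # Smith
    [4, 5, 4],   # Greg
    [4, 4, 4],   # Mcq
    [3, 5, 4],   # Ramin
    [4, 4, 3],   # Xiao
    [2, 4, 4],   # Wu
    [5, 5, 5],   # Riz
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

for k in (1, 2, 3):
    # Rank-k approximation: keep the k largest singular values.
    R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    rmse = np.sqrt(np.mean((R - R_k) ** 2))
    print(f"rank-{k} approximation RMSE = {rmse:.4f}")
```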
Latent Factor Models
$\hat{r}_{ui} = b_{ui} + p_u^T q_i$
$p_u$ → user-factors vector
$q_i$ → item-factors vector
• Add implicit feedback
– Asymmetric-SVD:
$\hat{r}_{ui} = b_{ui} + q_i^T \Big( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)$
– SVD++:
$\hat{r}_{ui} = b_{ui} + q_i^T \Big( p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)$
60 factors → RMSE = 0.8966
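A sketch of the SVD++ prediction rule with already-learned parameters, using numpy vectors (the parameter containers and names are illustrative assumptions):

```python
import numpy as np

def svdpp_predict(b_ui, q_i, p_u, y_factors, implicit_items):
    """SVD++: r_hat = b_ui + q_i^T (p_u + |N(u)|^{-1/2} * sum_{j in N(u)} y_j).

    q_i, p_u: learned f-dimensional numpy factor vectors.
    y_factors: dict mapping item j -> its implicit-feedback factor y_j.
    implicit_items: the set N(u) of items with implicit feedback from u.
    """
    implicit_sum = np.zeros(len(p_u))
    for j in implicit_items:
        implicit_sum += y_factors[j]
    if implicit_items:
        implicit_sum /= np.sqrt(len(implicit_items))
    return b_ui + q_i @ (p_u + implicit_sum)
```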
Temporal Effects
• Time affects:
– Movie biases ($b_i$) – movies go in and out of popularity over time
– User biases ($b_u$) – users change their baseline ratings over time
– User preferences ($p_u$) – genre, perception of actors and directors, household
$\hat{r}_{ui} = b_{ui}(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)$
Temporal Effects
$b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t)$
$b_u^{(3)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) + b_{u,t}$
• The same way we treat the user bias, we can also treat the user preferences:
$p_u(t)^T = \big( p_{u1}(t), p_{u2}(t), \ldots, p_{uf}(t) \big)$
$p_{uk}^{(1)}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t)$,  k = 1, 2, …, f
$p_{uk}^{(3)}(t) = p_{uk}^{(1)}(t) + p_{uk,t}$,  k = 1, 2, …, f
RMSE
$\hat{r}_{ui} = b_{ui}(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)$
$p_{uk}^{(1)}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t)$
$p_{uk}^{(3)}(t) = p_{uk}^{(1)}(t) + p_{uk,t}$
f = 500 → RMSE = 0.8815
f = 500 → RMSE = 0.8841 !!
• Most accurate factor model (add frequencies):
$\hat{r}_{ui} = b_{ui}(t) + \big( q_i + q_{i,f_{ui}} \big)^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)$
f = 500 → RMSE = 0.8784
f = 2000 → RMSE = 0.8762
Neighborhood Models
• To compute the relationship between items
• Estimate a user's score for an item based on that user's ratings of similar items
The Similarity Measure
• The Pearson correlation coefficient, $\rho_{ij}$, shrunk by its support:
$s_{ij} \overset{\mathrm{def}}{=} \dfrac{n_{ij}}{n_{ij} + \lambda_2}\, \rho_{ij}$;  λ2 = 100
$s_{ij}$ – similarity between items i and j
$n_{ij}$ – the number of users that rated both i and j
• A weighted average of the ratings of neighboring items:
$\hat{r}_{ui} = b_{ui} + \dfrac{\sum_{j \in S^k(i;u)} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}}$
$S^k(i;u)$ – the set of k items rated by u which are most similar to i
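A sketch of this item–item kNN rule, assuming the shrunk similarities s_ij and baselines b_uj are already computed (container names and the dict layouts are illustrative assumptions):

```python
def knn_predict(u, i, b, ratings_by_user, sim, k=20):
    """r_hat_ui = b_ui + sum_{j in S^k(i;u)} s_ij (r_uj - b_uj) / sum s_ij.

    b[(u, j)]          -> baseline prediction b_uj
    ratings_by_user[u] -> dict {item j: rating r_uj}
    sim[(i, j)]        -> shrunk similarity s_ij
    """
    rated = ratings_by_user[u]
    # S^k(i;u): the k items rated by u that are most similar to i.
    neighbors = sorted(
        (j for j in rated if j != i and (i, j) in sim),
        key=lambda j: sim[(i, j)],
        reverse=True,
    )[:k]
    num = sum(sim[(i, j)] * (rated[j] - b[(u, j)]) for j in neighbors)
    den = sum(sim[(i, j)] for j in neighbors)
    return b[(u, i)] + (num / den if den > 0 else 0.0)
```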
Problem With The Model
$\hat{r}_{ui} = b_{ui} + \dfrac{\sum_{j \in S^k(i;u)} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}}$
• Isolates the relations between two items
• Fully relies on the neighbors, even if they are absent

Improving The Model
$\hat{r}_{ui} = b_{ui} + \sum_{j \in R(u)} (r_{uj} - b_{uj})\, w_{ij}$
• The $w_{ij}$'s are global weights, not user-specific
• Sums over all items rated by u
$\hat{r}_{ui} = b_{ui} + \sum_{j \in R(u)} (r_{uj} - b_{uj})\, w_{ij} + \sum_{j \in N(u)} c_{ij}$
• Uses not only what the user rated, but also what he did not rate
• $c_{ij}$ is expected to be high if j is predictive on i
Improving The Model
• The current model somewhat overemphasizes the dichotomy between heavy raters and those who rarely rate
• Moderate this behavior by normalization:
$\hat{r}_{ui} = b_{ui} + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj})\, w_{ij} + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} c_{ij}$
• 𝛼 = 0 → non-normalized rule – encourages greater deviations
• 𝛼 = 1 → fully normalized rule – eliminates the effect of the number of ratings
• In this case, 𝛼 = 0.5 (the exponent −1/2 above)
RMSE = 0.9002
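A sketch of the prediction rule with learned global weights w_ij and implicit weights c_ij, here with α = 0.5; how the weights themselves are learned (e.g. by gradient descent) is not shown, and the containers are illustrative:

```python
import math

def global_weights_predict(u, i, b, ratings_by_user, implicit_by_user, w, c):
    """r_hat_ui = b_ui + |R(u)|^{-1/2} sum_j (r_uj - b_uj) w_ij
                        + |N(u)|^{-1/2} sum_j c_ij

    w[(i, j)], c[(i, j)] -> learned global weights (treated as 0 if absent).
    """
    rated = ratings_by_user[u]        # R(u): {item j: rating r_uj}
    implicit = implicit_by_user[u]    # N(u): items with implicit feedback from u
    explicit_part = sum(
        (rated[j] - b[(u, j)]) * w.get((i, j), 0.0) for j in rated
    )
    implicit_part = sum(c.get((i, j), 0.0) for j in implicit)
    pred = b[(u, i)]
    if rated:
        pred += explicit_part / math.sqrt(len(rated))
    if implicit:
        pred += implicit_part / math.sqrt(len(implicit))
    return pred
```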
Improving The Model
• Reduce the model by pruning parameters:
$\hat{r}_{ui} = b_{ui} + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj})\, w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij}$
$S^k(i)$ – the set of k items most similar to i
$R^k(i;u) \overset{\mathrm{def}}{=} R(u) \cap S^k(i)$
$N^k(i;u) \overset{\mathrm{def}}{=} N(u) \cap S^k(i)$
k = 17,770 → RMSE = 0.8906
k = 2000 → RMSE = 0.9067
Integrated Models
• Baseline predictors + Factor models + Neighborhood models
$\hat{r}_{ui} = \mu + b_u^{(1)}(t) + b_i(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj})\, w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij}$
f = 170, k = 300 → RMSE = 0.8827
• To further improve accuracy, we use a more elaborate temporal model for the user bias, $b_u^{(3)}(t)$ in place of $b_u^{(1)}(t)$:
$\hat{r}_{ui} = \mu + b_u^{(3)}(t) + b_i(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj})\, w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij}$
f = 170, k = 300 → RMSE = 0.8786
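A minimal sketch of how the three components add up at prediction time; each term is assumed to come from a routine like the ones sketched earlier (baseline, factor part, neighborhood parts), so only the combination is shown:

```python
def integrated_predict(mu, user_bias_t, item_bias_t, factor_part,
                       neighborhood_explicit, neighborhood_implicit):
    """r_hat_ui = mu + b_u(t) + b_i(t)
                  + q_i^T (p_u(t) + |N(u)|^{-1/2} sum y_j)       (factor_part)
                  + |R^k(i;u)|^{-1/2} sum (r_uj - b_uj) w_ij      (explicit part)
                  + |N^k(i;u)|^{-1/2} sum c_ij                    (implicit part)
    Each argument is the already-computed value of one term for a single (u, i, t).
    """
    return (mu + user_bias_t + item_bias_t + factor_part
            + neighborhood_explicit + neighborhood_implicit)
```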
EXTRA: Shrinking Towards Recent Actions
• To correct $\hat{r}_{ui}$
• Shrink $\hat{r}_{ui}$ towards the average rating of u on day t
• The single-day effect is among the strongest temporal effects in the data
$\tilde{r}_{ui} = \dfrac{\alpha\, \hat{r}_{ui} + c_{ut}\, \bar{r}_{ut}}{\alpha + c_{ut}}$;  α = 8
$c_{ut} = n_{ut} \cdot \exp(-\beta\, V_{ut})$;  β = 11
$n_{ut}$ – the number of ratings u gave on day t
$\bar{r}_{ut}$ – the mean rating of u on day t
$V_{ut}$ – the variance of u's ratings on day t
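A sketch of the single-day shrinkage as reconstructed above, with α = 8 and β = 11 as on the slide; the exact functional form in the original report may differ slightly:

```python
import math

def shrink_towards_day_mean(pred, day_ratings, alpha=8.0, beta=11.0):
    """Shrink a model prediction towards the mean of the user's ratings on
    the same day: (alpha*pred + c*mean) / (alpha + c),
    with c = n * exp(-beta * variance).

    day_ratings: list of the ratings user u gave on day t.
    """
    n = len(day_ratings)
    if n == 0:
        return pred
    mean = sum(day_ratings) / n
    var = sum((r - mean) ** 2 for r in day_ratings) / n
    c = n * math.exp(-beta * var)
    return (alpha * pred + c * mean) / (alpha + c)
```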
Shrinking Towards Recent Actions
• A stronger correction accounts for periods longer than a single day
• And tries to characterize the recent user behavior on similar movies
$w^u_{ij} = s_{ij} \cdot \exp(-\gamma\, |t_{ui} - t_{uj}|)$;  γ – a time-decay constant
$n_{ui} = \sum_{j \in R(u)} w^u_{ij}$
$\bar{r}_{ui} = \dfrac{\sum_{j \in R(u)} w^u_{ij}\, r_{uj}}{\sum_{j \in R(u)} w^u_{ij}}$
$V_{ui} = \dfrac{\sum_{j \in R(u)} w^u_{ij}\, (r_{uj})^2}{\sum_{j \in R(u)} w^u_{ij}} - (\bar{r}_{ui})^2$
$c_{ui} = n_{ui} \cdot \exp(-\beta\, V_{ui})$
$\tilde{r}_{ui} = \dfrac{\hat{r}_{ui} + c_{ui}\, \bar{r}_{ui}}{1 + c_{ui}}$
Q&A