CS910: Foundations of Data Analytics
Recommender Systems

Graham Cormode
G.Cormode@warwick.ac.uk
Objectives

• To understand the concept of recommendation
• To see neighbour-based methods
• To see latent factor methods
• To see how recommender systems are evaluated

• Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems,
  Koren, Bell, Volinsky, IEEE Software
Recommendations

• A modern problem: a very large number of possible items
– Which item should I try next, based on my preferences?

• Arises in many different places:
– Recommendations of content: books, music, movies, videos...
– Recommendations of places to travel, hotels, restaurants
– Recommendations of food to eat, sites to visit
– Recommendations of articles to read: news, research, gossip

• Each person has different preferences/interests
– How to elicit and model these preferences?
– How to customize recommendations to a particular individual?
Recommendations in the Wild

[Figure: examples of recommendations from deployed services]
Recommender Systems

• Recommender systems: produce tailored recommendations
– Inputs: ratings of items by users
  • Possibly also: user profiles, item profiles
– Outputs: for a given user, a list of recommended items
  • Or, for a given (user, item) pair, a predicted rating

• Ratings can take many forms
– "Star rating" (out of 5)
– Binary rating (thumbs up/thumbs down)
– Likert scale (strongly like, like, neutral, dislike, strongly dislike)
– Comparisons: prefer X to Y

• Will use movie recommendation as a running example
Ratings Matrix

[Figure: an example ratings grid over Users 1–5 and Items 1–5, ...;
most cells are empty, a few hold ratings between 1 and 5, and '?'
marks the entries to be predicted]
• n × m matrix of ratings R, where r_{u,i} is the rating of user u for item i
• Typically, the matrix is large and sparse
• Thousands or millions of users (n) and thousands of items (m)
• Each user has rated only a few items
• Each item is rated by at most a small fraction of users
• Goal is to provide predictions p_{u,i} for certain (user, item) pairs
• Sometimes called "matrix completion" (see the sketch below)
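To make the setup concrete, here is a minimal sketch in Python/numpy of one way to hold such a matrix, with NaN marking unrated cells; the rating values are illustrative, not those of the original figure. Later sketches reuse this NaN convention.

import numpy as np

# Toy 5-user x 5-item ratings matrix; np.nan marks a missing rating.
R = np.array([
    [5.0,    np.nan, np.nan, 3.0,    np.nan],
    [np.nan, 1.0,    np.nan, np.nan, 2.0   ],
    [2.0,    np.nan, 5.0,    np.nan, np.nan],
    [np.nan, 3.0,    np.nan, 4.0,    np.nan],
    [np.nan, np.nan, 2.0,    1.0,    4.0   ],
])

# The observed (user, item) pairs; every other cell is a candidate
# for prediction -- the "matrix completion" view of the problem.
observed = list(zip(*np.where(~np.isnan(R))))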
Evaluating a recommender system

• Evaluation is similar to evaluating classifiers
– Break labeled data into training and test data
– For each test (user, item) pair, predict the user's score for the item
– Measure the difference, and aim to minimize it over the N tests

• Combine the differences into a single score to compare systems (see the sketch below)
– Most common: Root-Mean-Square-Error (RMSE) between p_{u,i} and r_{u,i}:
  RMSE = √( Σ_{(u,i)} (p_{u,i} − r_{u,i})² / N )
– Sometimes also use Mean Absolute Error (MAE): Σ_{(u,i)} |p_{u,i} − r_{u,i}| / N
– If recommendations are simply 'good' or 'bad', can use precision and recall
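Both error measures are direct to compute; a minimal sketch:

import numpy as np

def rmse(predicted, actual):
    """Root-Mean-Square-Error over N held-out (user, item) pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def mae(predicted, actual):
    """Mean Absolute Error over the same pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.mean(np.abs(predicted - actual))

print(rmse([4.2, 3.1, 2.5], [5, 3, 2]))  # ~0.55
print(mae([4.2, 3.1, 2.5], [5, 3, 2]))   # ~0.47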
Initial attempts

• Can we use existing methods: classification, regression etc.?
• Assume we have features for each user and each item:
– User: demographics, stated preferences
– Item: e.g. genre, director, actors

• Can treat as a classification problem: predict a score
– Train a classifier from examples

• Limitations of the classifier approach:
– Don't necessarily have user and item information
– Ignores what we do have: lots of ratings between users and items
  • Ratings are hard to use as features, unless everyone has rated a fixed set
Neighbourhood method

• Neighbourhood-based collaborative filtering
– Users "collaborate" to help recommend (filter) items
1. Find k other users K who are similar to target user u
– Possibly assign a weight w_{u,v} based on how similar they are
2. Combine the k users' (weighted) preferences
– Use these to make predictions for u

• Can use existing methods to measure similarity (see the sketch below)
– PMCC to measure correlation of ratings as w_{u,v}
– Cosine similarity of rating vectors
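A minimal sketch of the neighbour-finding step, assuming the NaN-coded ratings matrix R from earlier; the PMCC is computed only over items both users have rated:

import numpy as np

def pearson_weight(ru, rv):
    """PMCC between two users' rating rows (NaN = unrated),
    restricted to their co-rated items."""
    both = ~np.isnan(ru) & ~np.isnan(rv)
    if both.sum() < 2:
        return 0.0                    # too little overlap to judge
    x, y = ru[both], rv[both]
    sx, sy = x.std(), y.std()
    if sx == 0 or sy == 0:
        return 0.0                    # constant ratings: PMCC undefined
    return ((x - x.mean()) * (y - y.mean())).mean() / (sx * sy)

def k_nearest_users(R, u, k):
    """The k users most similar to user u, as (index, weight) pairs."""
    weights = [(v, pearson_weight(R[u], R[v]))
               for v in range(R.shape[0]) if v != u]
    weights.sort(key=lambda vw: vw[1], reverse=True)
    return weights[:k]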
Neighbourhood example (unweighted)

• 3 users like the same set of movies as Joe (exact match)
– All three like "Saving Private Ryan", so this is the top recommendation
Different Rating Scales

• Every user rates slightly differently
– Some consistently rate high, some consistently rate low
• Using PMCC avoids this effect when picking neighbours, but
  needs adjustment for making predictions
• Make an adjustment when computing a score (see the sketch below):
– Predict: p_{u,i} = r̄_u + ( Σ_{v∈K} (r_{v,i} − r̄_v) w_{u,v} ) / ( Σ_{v∈K} w_{u,v} )
– r̄_u : average rating for user u
– w_{u,v} : weight assigned to user v based on their similarity to u
  • E.g. the correlation coefficient value
– p_{u,i} computes the weighted deviation from each neighbour v's average
  score, and adds it onto u's average score
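A sketch of this prediction rule, reusing the k_nearest_users helper above; one small liberty taken here is dividing by Σ|w_{u,v}|, a common safeguard when correlation weights can be negative:

import numpy as np

def predict(R, u, i, k=3):
    """Mean-adjusted neighbourhood prediction p_{u,i}."""
    r_bar_u = np.nanmean(R[u])            # user u's average rating
    num, den = 0.0, 0.0
    for v, w in k_nearest_users(R, u, k):
        if np.isnan(R[v, i]):             # neighbour hasn't rated item i
            continue
        r_bar_v = np.nanmean(R[v])
        num += (R[v, i] - r_bar_v) * w    # weighted deviation from v's mean
        den += abs(w)
    return r_bar_u if den == 0 else r_bar_u + num / den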
Item-based Collaborative Filtering

• Often there are many more users than items
– E.g. only a few thousand movies available, but millions of users
– Comparing to all users can be slow

• Can do neighbourhood-based filtering using items (see the sketch below)
– Two items are similar if the users who rate both give them similar ratings
– Compute PMCC between the users rating them both as w_{i,j}
– Find the k most similar items J
– Compute a simple weighted average: p_{u,i} = Σ_{j∈J} r_{u,j} w_{i,j} / Σ_{j∈J} w_{i,j}
  • No adjustment by mean, as we assume no bias from items
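A sketch of the item-based prediction, assuming a hypothetical precomputed m × m item-item similarity matrix S (e.g. PMCC between the columns of R, computed offline):

import numpy as np

def predict_item_based(R, S, u, i, k=3):
    """Weighted average over the k items most similar to i
    that user u has actually rated; S[i, j] = w_{i,j}."""
    rated = [j for j in range(R.shape[1])
             if j != i and not np.isnan(R[u, j])]
    neighbours = sorted(rated, key=lambda j: S[i, j], reverse=True)[:k]
    num = sum(R[u, j] * S[i, j] for j in neighbours)
    den = sum(S[i, j] for j in neighbours)
    return np.nan if den == 0 else num / den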
Latent Factor Analysis

• We rejected methods based on features of items, since we
  could not guarantee they would be available
• Latent Factor Analysis tries to find "hidden" features from the
  rating matrix
– Factors might correspond to recognisable features like genre
– Other factors: child-friendly, comedic, light/dark
– More abstract: depth of character, quirkiness
– Could find factors that are hard to interpret
Latent Factor Example

[Figure: example items placed in a two-dimensional latent factor space]
Matrix Factorization

• Model each user and item as a vector of (inferred) factors
– Let q_i be the vector for item i, w_u the vector for user u
– The predicted rating p_{u,i} is then the dot product (w_u · q_i) (see the sketch below)

• How to learn the factors from the given data?
– Given ratings matrix R, try to express R as a product WQ
  • W is an n × f matrix of users and their latent factors
  • Q is an f × m matrix of items and their latent factors
– A matrix factorization problem: factor R into W × Q

• Can be solved by Singular Value Decomposition
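The prediction step is just a dot product; a tiny illustration with made-up factor vectors:

import numpy as np

w_u = np.array([0.9, 0.2])   # user u's latent factors (f = 2, illustrative)
q_i = np.array([1.5, 2.0])   # item i's latent factors (illustrative)

p_ui = w_u @ q_i             # predicted rating p_{u,i}
print(p_ui)                  # 0.9*1.5 + 0.2*2.0 = 1.75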
Singular Value Decomposition

• Given an m × n matrix M, decompose into M = U Σ Vᵀ, where:
– U is an m × m matrix with orthogonal columns [left singular vectors]
– Σ is a rectangular m × n diagonal matrix [singular values]
– Vᵀ is an n × n matrix with orthogonal rows [right singular vectors]

• The Singular Value Decomposition is highly structured
– The singular values are the square roots of the eigenvalues of MMᵀ
– The left (right) singular vectors are eigenvectors of MMᵀ (MᵀM)

• SVD can be used to give approximate representations (see the sketch below)
– Take the k largest singular values, set the rest to zero
– Picks out the k most important "directions"
– Gives the k latent factors to describe the data
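A minimal numpy sketch of rank-k truncation:

import numpy as np

M = np.random.rand(6, 4)            # a small complete matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                               # keep the k largest singular values
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# M_k is the best rank-k approximation to M in squared error
print(np.linalg.norm(M - M_k))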
SVD for recommender systems

• Textbook SVD doesn't work when the matrix has missing values!
– Could try to fill in the missing values somehow, then factor

• Instead, set up as an optimization problem:
– Learn length-k vectors q_i, w_u to solve:
  min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)²
– Minimize the squared error between the predicted and true values

• If we had a complete matrix, SVD would solve this problem
– Set W = U_k Σ_k^{1/2} and Q = Σ_k^{1/2} V_kᵀ
  • U_k, V_k are the singular vectors corresponding to the k largest singular values

• Additional problem: too much freedom (not enough ratings)
– Risk of overfitting the training data, failing to generalize
Regularization

• Regularization is a technique used in many places
– Here, avoid overfitting by penalizing large parameter values
– Achieve this by adding the size of the parameters to the optimization (see below):
  min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ(‖q_i‖₂² + ‖w_u‖₂²)
– ‖x‖₂² is the L2 (Euclidean) norm squared: the sum of squared values
– Effect is to push the values of q and w towards 0, to minimize complexity

• Many different forms of regularization:
– L2 regularization: add terms of the form ‖x‖₂²
– L1 regularization: terms of the form ‖x‖₁ (can give sparser solutions)

• The form of the regularization should fit the optimization
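A sketch of the regularized objective over the observed entries, using the NaN convention from earlier; lam stands for λ:

import numpy as np

def objective(R, W, Q, lam):
    """Regularized squared error: W is n x f (user factors),
    Q is f x m (item factors)."""
    total = 0.0
    for u, i in zip(*np.where(~np.isnan(R))):
        err = R[u, i] - W[u] @ Q[:, i]
        total += err ** 2 + lam * (Q[:, i] @ Q[:, i] + W[u] @ W[u])
    return total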
Solving the optimization: Gradient Descent

• How to solve min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ(‖q_i‖₂² + ‖w_u‖₂²)?
• Gradient Descent (see the sketch below)
– For each training example, find the error of the current prediction:
  e_{u,i} = r_{u,i} − q_i · w_u
– Modify the parameters by taking a step in the direction of the gradient:
  q_i ← q_i + γ (e_{u,i} w_u − λ q_i)   [derivative of target with respect to q]
  w_u ← w_u + γ (e_{u,i} q_i − λ w_u)   [derivative with respect to w]
– γ is a step-size parameter that controls the speed of descent

• Advantages and disadvantages of gradient descent
– ++ Fairly easy to implement: easy to compute the update at each step
– −− Can be slow: hard to parallelize
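A minimal sketch of these updates applied example-by-example (stochastic gradient descent), again using the NaN-coded matrix R; the hyperparameter values are illustrative:

import numpy as np

def sgd_factorize(R, f=10, lam=0.1, gamma=0.01, epochs=50):
    """Learn W (n x f) and Q (f x m) so that W[u] @ Q[:, i] ~ r_{u,i}."""
    n, m = R.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n, f))
    Q = rng.normal(scale=0.1, size=(f, m))
    pairs = list(zip(*np.where(~np.isnan(R))))
    for _ in range(epochs):
        rng.shuffle(pairs)
        for u, i in pairs:
            e = R[u, i] - W[u] @ Q[:, i]        # error e_{u,i}
            q_old = Q[:, i].copy()              # use pre-update q_i in w_u's step
            Q[:, i] += gamma * (e * W[u] - lam * Q[:, i])
            W[u] += gamma * (e * q_old - lam * W[u])
    return W, Q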
Solving the optimization: Least Squares

• How to solve min_{q,w} Σ_{(u,i)∈R} (r_{u,i} − q_i · w_u)² + λ(‖q_i‖₂² + ‖w_u‖₂²)?
• Reducing to Least Squares
– Suppose the values of w_u are fixed
– Then the goal is to minimize a function of the squares of the q_i s
– Solved by techniques from regression: (regularized) least squares

• The alternating least squares method (see the sketch below)
– Pretend the values of w_u are fixed, optimize the values of q_i
– Swap: pretend the values of q_i are fixed, optimize the values of w_u
– Repeat until convergence

• Can be slower than gradient descent on a single machine
– But can parallelize: compute each q_i independently
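A minimal sketch of alternating least squares; each inner solve is a standard regularized (ridge) least-squares problem, and the per-item and per-user solves are independent, which is what makes the method easy to parallelize:

import numpy as np

def als_factorize(R, f=10, lam=0.1, iters=20):
    """Alternate between solving for item factors Q and user factors W."""
    n, m = R.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n, f))
    Q = rng.normal(scale=0.1, size=(f, m))
    I = np.eye(f)
    for _ in range(iters):
        for i in range(m):                      # each q_i is independent
            users = np.where(~np.isnan(R[:, i]))[0]
            if len(users):
                A = W[users]                    # factors of users who rated i
                Q[:, i] = np.linalg.solve(A.T @ A + lam * I,
                                          A.T @ R[users, i])
        for u in range(n):                      # each w_u is independent
            items = np.where(~np.isnan(R[u]))[0]
            if len(items):
                B = Q[:, items].T               # factors of items u rated
                W[u] = np.linalg.solve(B.T @ B + lam * I,
                                       B.T @ R[u, items])
    return W, Q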
Adding biases

• Can generalize matrix factorization to incorporate other factors
– E.g. Fred always rates 1 star less than average
– E.g. Citizen Kane is rated 0.5 higher than other films on average

• These are not captured well by a model of the form (q_i · w_u)
– Explicitly modeling biases (intercepts) can give a better fit
– Model with biases: p_{u,i} = μ + b_i + b_u + (w_u · q_i)
  • μ : global average rating
  • b_i : bias for item i
  • b_u : rating bias from user u (similar to the neighbourhood method)

• Optimize the new error function in the same way (see the sketch below):
  min_{q,w,b} Σ_{(u,i)∈R} (r_{u,i} − μ − b_u − b_i − q_i · w_u)² + λ(‖q_i‖₂² + ‖w_u‖₂² + b_u² + b_i²)
– Can add more biases, e.g. to incorporate variation over time
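The biased model only changes the prediction and adds two bias updates to the SGD loop sketched earlier; a minimal sketch:

import numpy as np

def predict_with_biases(mu, b_item, b_user, W, Q, u, i):
    """p_{u,i} = mu + b_i + b_u + (w_u . q_i)."""
    return mu + b_item[i] + b_user[u] + W[u] @ Q[:, i]

# Inside the SGD loop, with e the error of the biased prediction,
# the biases take the analogous regularized gradient steps:
#   b_user[u] += gamma * (e - lam * b_user[u])
#   b_item[i] += gamma * (e - lam * b_item[i])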
Cold start problem: new items

• How to cope when new objects are added to the system?
– New users arrive, new movies are released: the "cold start" problem

• New item is created: no ratings, so it will never be recommended?
– Use attributes of the item (actors, genre) to give some score
– Randomly suggest it to users to get some ratings
Cold start problem: new users

• New users arrive: we have no idea what they like!
– Recommend globally popular items to them (Harry Potter…)
  • May not give much specific information about their tastes
– Encourage new users to rate some items before recommending
– Suggest items that are "divisive": try to maximize information
  • Tradeoff: "poor" recommendations may drive users away
Case Study: The Netflix Prize

• Netflix ran a competition from 2006 to 2009
– Netflix streams movies over the internet (and rents DVDs by mail)
– Users rate each movie on a 5-star scale
– Netflix makes recommendations of what to watch next

• Object of the competition: improve over current recommendations
– "Cinematch" algorithm: "uses straightforward linear models…"
– Prize: $1M to improve RMSE by 10%

• Training data: 100M dated ratings from 480K users on 18K movies
– Could submit predictions for the test data at most once per day
– Avoids stressing the servers, and attempts to elicit true answers
The Netflix Prize

https://www.youtube.com/watch?v=ImpV70uLxyw
Netflix prize factors

• Postscript: Netflix adopted some ideas, but not all
– "Explainability" of recommendations is an additional requirement
– Cost of fitting models and making predictions is also important
Recommender Systems Summary

• Introduced the concept of recommendation
• Saw neighbour-based methods
• Saw latent factor methods
• Understood how recommender systems are evaluated
– Netflix prize as a case study in applied recommender systems

• Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems,
  Koren, Bell, Volinsky, IEEE Software