CS910: Foundations of Data Analytics
Graham Cormode  G.Cormode@warwick.ac.uk

Recommender Systems

Objectives
 To understand the concept of recommendation
 To see neighbour-based methods
 To see latent factor methods
 To see how recommender systems are evaluated
 Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems. Koren, Bell, Volinsky, IEEE Computer, 2009

Recommendations
 A modern problem: a very large number of possible items
– Which item should I try next, based on my preferences?
 Arises in many different places:
– Recommendations of content: books, music, movies, videos...
– Recommendations of places to travel, hotels, restaurants
– Recommendations of food to eat, sites to visit
– Recommendations of articles to read: news, research, gossip
 Each person has different preferences/interests
– How to elicit and model these preferences?
– How to customize recommendations to a particular individual?

Recommendations in the Wild
[Figure: examples of recommendations on real websites]

Recommender Systems
 Recommender systems produce tailored recommendations
 Inputs: ratings of items by users
– Possibly also: user profiles, item profiles
 Outputs: for a given user, a list of recommended items
– Or, for a given (user, item) pair, a predicted rating
 Ratings can take many forms:
– "Star rating" (out of 5)
– Binary rating (thumbs up/thumbs down)
– Likert scale (strongly like, like, neutral, dislike, strongly dislike)
– Comparisons: prefer X to Y
 Will use movie recommendation as a running example

Ratings Matrix
[Figure: example ratings matrix with rows User 1–User 5, columns Item 1–Item 5..., star ratings 1–5 in some cells and "?" for entries to predict]
 n × m matrix of ratings R, where ru,i is the rating of user u for item i
 Typically, the matrix is large and sparse:
– Thousands or millions of users (n) and thousands of items (m)
– Each user has rated only a few items
– Each item is rated by at most a small fraction of users
 Goal is to provide predictions pu,i for certain (user, item) pairs
– Sometimes called "matrix completion"

Evaluating a Recommender System
 Evaluation is similar to evaluating classifiers:
– Break labeled data into training and test data
– For each test (user, item) pair, predict the user's score for the item
– Measure the difference, and aim to minimize it over the N test pairs
 Combine the differences into a single score to compare systems
– Most common: Root-Mean-Square Error (RMSE) between pu,i and ru,i:
  RMSE = √( ∑(u,i) (pu,i − ru,i)² / N )
– Sometimes also use Mean Absolute Error (MAE): MAE = ∑(u,i) |pu,i − ru,i| / N
– If recommendations are simply 'good' or 'bad', can use precision and recall
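To make these measures concrete, here is a minimal sketch in Python (my own illustration, not from the slides; the use of numpy and the function names are assumptions) computing RMSE and MAE from parallel lists of predicted and true ratings:

```python
import numpy as np

def rmse(predicted, actual):
    """Root-Mean-Square Error over N test (user, item) pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def mae(predicted, actual):
    """Mean Absolute Error over the same N test pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.mean(np.abs(predicted - actual))

# Example: predictions vs. true ratings for N = 4 test pairs
print(rmse([4.1, 2.5, 3.8, 1.2], [4, 3, 4, 1]))  # ~0.29
print(mae([4.1, 2.5, 3.8, 1.2], [4, 3, 4, 1]))   # 0.25
```

Note that RMSE penalizes large individual errors more heavily than MAE; RMSE is the measure used in the Netflix Prize discussed later.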
Initial Attempts
 Can we use existing methods: classification, regression, etc.?
 Assume we have features for each user and each item:
– User: demographics, stated preferences
– Item: e.g. genre, director, actors
 Can then treat recommendation as a classification problem: predict a score
– Train a classifier from example ratings
 Limitations of the classifier approach:
– We don't necessarily have user and item information
– It ignores what we do have: lots of ratings between users and items
– Ratings are hard to use as features, unless everyone has rated a fixed set of items

Neighbourhood Method
 Neighbourhood-based collaborative filtering
– Users "collaborate" to help recommend (filter) items
 1. Find k other users K who are similar to the target user u
– Possibly assign each a weight wu,v based on how similar they are
 2. Combine the k users' (weighted) preferences
– Use these to make predictions for u
 Can use existing methods to measure similarity:
– PMCC to measure the correlation of ratings, as wu,v
– Cosine similarity of rating vectors

Neighbourhood Example (unweighted)
 3 users like the same set of movies as Joe (exact match)
 All three like "Saving Private Ryan", so this is the top recommendation

Different Rating Scales
 Every user rates slightly differently
– Some consistently rate high, some consistently rate low
 Using PMCC avoids this effect when picking neighbours, but predictions still need adjustment
 Make an adjustment when computing a score:
– Predict: pu,i = r̄u + ( ∑v∈K (rv,i − r̄v) wu,v ) / ( ∑v∈K wu,v )
– r̄u: the average rating of user u
– wu,v: weight assigned to user v based on their similarity to u, e.g. the correlation coefficient value
– pu,i takes the weighted deviation of each neighbour's rating from that neighbour's average, and adds it onto u's average score (a code sketch appears after the next slide)

Item-based Collaborative Filtering
 Often there are many more users than items
– E.g. only a few thousand movies available, but millions of users
– Comparing to all users can be slow
 Can instead do neighbourhood-based filtering using items:
– Two items are similar if the users who rated both rated them similarly
– Compute the PMCC between the ratings from users who rated both, as wi,j
– Find the k most similar items J
– Compute a simple weighted average: pu,i = ∑j∈J wi,j ru,j / ∑j∈J wi,j
– No adjustment by the mean, as we assume no bias from items
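The following sketch illustrates the mean-adjusted, user-based prediction above. It assumes the ratings are a dense numpy array R with np.nan marking missing entries, and uses Pearson correlation (PMCC) for the weights wu,v; the function name, the restriction to positively correlated neighbours, and the fallback to the user's mean are my own illustrative choices, not prescribed by the slides.

```python
import numpy as np

def predict_user_based(R, u, i, k=2):
    """Mean-adjusted user-based prediction p[u,i] from the k most
    similar users who have rated item i.  R is an n x m array with
    np.nan marking missing ratings."""
    user_means = np.nanmean(R, axis=1)
    # PMCC between user u and each candidate v, over co-rated items
    weights = {}
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue                      # v must have rated item i
        both = ~np.isnan(R[u]) & ~np.isnan(R[v])
        if both.sum() >= 2:
            w = np.corrcoef(R[u, both], R[v, both])[0, 1]
            if not np.isnan(w) and w > 0:
                weights[v] = w
    # keep the k nearest neighbours by weight
    K = sorted(weights, key=weights.get, reverse=True)[:k]
    if not K:
        return user_means[u]              # fall back to u's average
    num = sum(weights[v] * (R[v, i] - user_means[v]) for v in K)
    den = sum(weights[v] for v in K)
    return user_means[u] + num / den
```

An item-based version would swap the roles of rows and columns and skip the mean adjustment, per the slide above.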
Latent Factor Analysis
 We rejected methods based on features of items, since we could not guarantee they would be available
 Latent factor analysis tries to find "hidden" features from the ratings matrix
– Factors might correspond to recognisable features like genre
– Other factors: child-friendly, comedic, light/dark
– More abstract: depth of character, quirkiness
– Could also find factors that are hard to interpret

Latent Factor Example
[Figure: movies and users placed in a space of latent factors]

Matrix Factorization
 Model each user and item as a vector of (inferred) factors
– Let qi be the vector for item i, and wu the vector for user u
– The predicted rating pu,i is then the dot product (wu · qi)
 How to learn the factors from the given data?
 Given ratings matrix R, try to express R as a product WQ
– W is an n × f matrix of users and their latent factors
– Q is an f × m matrix of items and their latent factors
– A matrix factorization problem: factor R into WQ
– Can be solved by Singular Value Decomposition

Singular Value Decomposition
 Given an m × n matrix M, decompose it as M = UΣVᵀ, where:
– U is an m × m matrix with orthogonal columns [left singular vectors]
– Σ is a rectangular m × n diagonal matrix [singular values]
– Vᵀ is an n × n matrix with orthogonal rows [right singular vectors]
 The Singular Value Decomposition is highly structured:
– The singular values are the square roots of the eigenvalues of MMᵀ
– The left (right) singular vectors are eigenvectors of MMᵀ (MᵀM)
 SVD can be used to give approximate representations:
– Keep the k largest singular values, set the rest to zero
– This picks out the k most important "directions"
– Gives the k latent factors that describe the data

SVD for Recommender Systems
 Textbook SVD doesn't work when the matrix has missing values!
– Could try to fill in the missing values somehow, then factor
 Instead, set up an optimization problem:
– Learn length-k vectors qi, wu to solve:
  minq,w ∑(u,i)∈R (ru,i − qi · wu)²
– Minimize the squared error between the predicted and true values
 If we had a complete matrix, SVD would solve this problem:
– Set W = Uk Σk^(1/2) and Q = Σk^(1/2) Vkᵀ, where Uk, Vk are the singular vectors corresponding to the k largest singular values
 Additional problem: too much freedom (not enough ratings)
– Risk of overfitting the training data, failing to generalize

Regularization
 Regularization is a technique used in many places
– Here, avoid overfitting by penalizing parameters for growing too large
 Achieve this by adding the size of the parameters to the optimization:
  minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)
– ǁxǁ² is the squared L2 (Euclidean) norm: the sum of the squared values
– The effect is to drive more values of q and w towards 0, to minimize complexity
 Many different forms of regularization:
– L2 regularization: add terms of the form ǁxǁ²
– L1 regularization: terms of the form ǁxǁ1 (can give sparser solutions)
– The form of the regularization should fit the optimization

Solving the Optimization: Gradient Descent
 How to solve minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)?
 Gradient descent:
– For each training example, find the error of the current prediction: eu,i = ru,i − qi · wu
– Modify the parameters by taking a step in the direction of the gradient:
  qi ← qi + γ (eu,i wu − λ qi)  [derivative of the target with respect to q]
  wu ← wu + γ (eu,i qi − λ wu)  [derivative with respect to w]
– γ is a step-size parameter that controls the speed of descent
 Advantages and disadvantages of gradient descent:
– ++ Fairly easy to implement: the update at each step is easy to compute
– −− Can be slow: hard to parallelize
– (A worked sketch follows after the biases slide below)

Solving the Optimization: Least Squares
 How to solve minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)?
 Reducing to least squares:
– Suppose the values of wu are fixed
– Then the goal is to minimize a function of the squares of the qi
– Solved by techniques from regression: (regularized) least squares
 The alternating least squares (ALS) method:
– Pretend the values of wu are fixed, optimize the values of qi
– Swap: pretend the values of qi are fixed, optimize the values of wu
– Repeat until convergence (a sketch of ALS also follows below)
– Can be slower than gradient descent on a single machine
– But it can be parallelized: each qi is computed independently

Adding Biases
 Can generalize matrix factorization to incorporate other effects
– E.g. Fred always rates 1 star less than average
– E.g. Citizen Kane is rated 0.5 higher than other films on average
– These are not captured well by a model of the form (qi · wu)
 Explicitly modeling biases (intercepts) can give a better fit
 Model with biases: pu,i = μ + bi + bu + (wu · qi)
– μ: global average rating
– bi: rating bias for item i
– bu: rating bias of user u (similar to the adjustment in the neighbourhood method)
 Optimize the new error function in the same way:
  minq,w,b ∑(u,i)∈R (ru,i − μ − bu − bi − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ² + bu² + bi²)
– Can add more biases, e.g. to incorporate variation over time
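The gradient updates above extend directly to the model with biases. Here is a minimal stochastic-gradient sketch, assuming the training data is a list of (u, i, r) triples; the hyperparameter values, the initialization, and the analogous update rules for bu and bi are my own illustrative assumptions, following the same pattern as the updates for qi and wu:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=10,
                  gamma=0.01, lam=0.1, epochs=20, seed=0):
    """Fit p[u,i] = mu + b_u + b_i + (w_u . q_i) by stochastic
    gradient descent on (u, i, r) training triples."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (n_users, f))      # user factors w_u
    Q = rng.normal(0, 0.1, (n_items, f))      # item factors q_i
    bu = np.zeros(n_users)                    # user biases
    bi = np.zeros(n_items)                    # item biases
    mu = np.mean([r for _, _, r in ratings])  # global average rating
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + bu[u] + bi[i] + W[u] @ Q[i])  # error e_{u,i}
            wu, qi = W[u].copy(), Q[i].copy()
            W[u] += gamma * (e * qi - lam * wu)         # gradient steps
            Q[i] += gamma * (e * wu - lam * qi)
            bu[u] += gamma * (e - lam * bu[u])
            bi[i] += gamma * (e - lam * bi[i])
    return mu, bu, bi, W, Q
```

A prediction for a test pair is then mu + bu[u] + bi[i] + W[u] @ Q[i]; dropping the bias terms recovers the plain objective from the earlier slides.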
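For comparison, here is a sketch of alternating least squares for the regularized (unbiased) objective: holding Q fixed, each row of W has a closed-form regularized least-squares solution over just the items that user has rated, and symmetrically for Q. The dense np.nan representation and the names are again assumptions:

```python
import numpy as np

def als_factorize(R, f=10, lam=0.1, iters=20, seed=0):
    """Alternating least squares for min over observed (u,i) of
    (r_ui - w_u . q_i)^2 + lam * (||w_u||^2 + ||q_i||^2).
    R is an n x m array with np.nan for missing ratings."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.normal(0, 0.1, (n, f))
    Q = rng.normal(0, 0.1, (m, f))
    I = np.eye(f)
    for _ in range(iters):
        for u in range(n):            # fix Q, solve for each w_u
            rated = ~np.isnan(R[u])
            A = Q[rated].T @ Q[rated] + lam * I
            W[u] = np.linalg.solve(A, Q[rated].T @ R[u, rated])
        for i in range(m):            # fix W, solve for each q_i
            rated = ~np.isnan(R[:, i])
            A = W[rated].T @ W[rated] + lam * I
            Q[i] = np.linalg.solve(A, W[rated].T @ R[rated, i])
    return W, Q                       # predict with (W @ Q.T)[u, i]
```

The per-user (and per-item) solves are independent of one another, which is what makes ALS easy to parallelize, as the slide notes.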
Cold Start Problem: New Items
 How to cope when new objects are added to the system?
– New users arrive, new movies are released: the "cold start" problem
 A new item has no ratings: will it never be recommended?
– Use attributes of the item (actors, genre) to give it some initial score
– Randomly suggest it to users to gather some ratings

Cold Start Problem: New Users
 New users arrive: we have no idea what they like!
 Recommend globally popular items to them (Harry Potter...)
– May not give much specific information about their tastes
 Encourage new users to rate some items before recommending
– Suggest items that are "divisive": try to maximize the information gained
– Tradeoff: "poor" recommendations may drive users away

Case Study: The Netflix Prize
 Netflix ran a competition from 2006 to 2009
– Netflix streams movies over the internet (and rents DVDs by mail)
– Users rate each movie on a 5-star scale
– Netflix makes recommendations of what to watch next
 Objective of the competition: improve on the current recommendations
– The incumbent "Cinematch" algorithm "uses straightforward linear models..."
– Prize: $1M to improve RMSE by 10%
 Training data: 100M dated ratings from 480K users on 18K movies
– Predictions for the test data could be submitted at most once per day
– This avoids stressing the servers and deters attempts to elicit the true answers

The Netflix Prize
https://www.youtube.com/watch?v=ImpV70uLxyw

Netflix Prize Factors
 Postscript: Netflix adopted some of the winning ideas, but not all
– "Explainability" of recommendations is an additional requirement
– The cost of fitting models and making predictions is also important

Recommender Systems Summary
 Introduced the concept of recommendation
 Saw neighbour-based methods
 Saw latent factor methods
 Understood how recommender systems are evaluated
– The Netflix Prize as a case study in applied recommender systems
 Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems. Koren, Bell, Volinsky, IEEE Computer, 2009