CS910: Foundations of Data Analytics
Graham Cormode  G.Cormode@warwick.ac.uk

Recommender Systems

Objectives
 To understand the concept of recommendation
 To see neighbour-based methods
 To see latent factor methods
 To see how recommender systems are evaluated
 Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems. Koren, Bell, Volinsky, IEEE Computer, 2009

Recommendations
 A modern problem: a very large number of possible items
– Which item should I try next, based on my preferences?
 Arises in many different places:
– Recommendations of content: books, music, movies, videos...
– Recommendations of places to travel, hotels, restaurants
– Recommendations of food to eat, sites to visit
– Recommendations of articles to read: news, research, gossip
 Each person has different preferences/interests
– How to elicit and model these preferences?
– How to customize recommendations to a particular individual?

Recommendations in the Wild
[Figure: examples of recommendations on real websites]

Recommender Systems
 Recommender systems produce tailored recommendations
 Inputs: ratings of items by users
– Possibly also: user profiles, item profiles
 Outputs: for a given user, a list of recommended items
– Or, for a given (user, item) pair, a predicted rating
 Ratings can take many forms:
– "Star rating" (out of 5)
– Binary rating (thumbs up/thumbs down)
– Likert scale (strongly like, like, neutral, dislike, strongly dislike)
– Comparisons: prefer X to Y
 Will use movie recommendation as a running example

Ratings Matrix
[Figure: example ratings matrix with rows User 1–User 5, columns Item 1–Item 5..., star ratings 1–5 in some cells and "?" for entries to predict]
 n × m matrix of ratings R, where ru,i is the rating of user u for item i
 Typically, the matrix is large and sparse:
– Thousands or millions of users (n) and thousands of items (m)
– Each user has rated only a few items
– Each item is rated by at most a small fraction of users
 Goal is to provide predictions pu,i for certain (user, item) pairs
– Sometimes called "matrix completion"

Evaluating a Recommender System
 Evaluation is similar to evaluating classifiers:
– Break labeled data into training and test data
– For each test (user, item) pair, predict the user's score for the item
– Measure the difference, and aim to minimize it over the N test pairs
 Combine the differences into a single score to compare systems
– Most common: Root-Mean-Square Error (RMSE) between pu,i and ru,i:
  RMSE = √( ∑(u,i) (pu,i − ru,i)² / N )
– Sometimes also use Mean Absolute Error (MAE): MAE = ∑(u,i) |pu,i − ru,i| / N
– If recommendations are simply 'good' or 'bad', can use precision and recall
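To make these measures concrete, here is a minimal sketch in Python (my own illustration, not from the slides; the use of numpy and the function names are assumptions) computing RMSE and MAE from parallel lists of predicted and true ratings:

```python
import numpy as np

def rmse(predicted, actual):
    """Root-Mean-Square Error over N test (user, item) pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def mae(predicted, actual):
    """Mean Absolute Error over the same N test pairs."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.mean(np.abs(predicted - actual))

# Example: predictions vs. true ratings for N = 4 test pairs
print(rmse([4.1, 2.5, 3.8, 1.2], [4, 3, 4, 1]))  # ~0.29
print(mae([4.1, 2.5, 3.8, 1.2], [4, 3, 4, 1]))   # 0.25
```

Note that RMSE penalizes large individual errors more heavily than MAE; RMSE is the measure used in the Netflix Prize discussed later.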
Initial Attempts
 Can we use existing methods: classification, regression, etc.?
 Assume we have features for each user and each item:
– User: demographics, stated preferences
– Item: e.g. genre, director, actors
 Can then treat recommendation as a classification problem: predict a score
– Train a classifier from example ratings
 Limitations of the classifier approach:
– We don't necessarily have user and item information
– It ignores what we do have: lots of ratings between users and items
– Ratings are hard to use as features, unless everyone has rated a fixed set of items

Neighbourhood Method
 Neighbourhood-based collaborative filtering
– Users "collaborate" to help recommend (filter) items
 1. Find k other users K who are similar to the target user u
– Possibly assign each a weight wu,v based on how similar they are
 2. Combine the k users' (weighted) preferences
– Use these to make predictions for u
 Can use existing methods to measure similarity:
– PMCC to measure the correlation of ratings, as wu,v
– Cosine similarity of rating vectors

Neighbourhood Example (unweighted)
 3 users like the same set of movies as Joe (exact match)
 All three like "Saving Private Ryan", so this is the top recommendation

Different Rating Scales
 Every user rates slightly differently
– Some consistently rate high, some consistently rate low
 Using PMCC avoids this effect when picking neighbours, but predictions still need adjustment
 Make an adjustment when computing a score:
– Predict: pu,i = r̄u + ( ∑v∈K (rv,i − r̄v) wu,v ) / ( ∑v∈K wu,v )
– r̄u: the average rating of user u
– wu,v: weight assigned to user v based on their similarity to u, e.g. the correlation coefficient value
– pu,i takes the weighted deviation of each neighbour's rating from that neighbour's average, and adds it onto u's average score (a code sketch appears after the next slide)

Item-based Collaborative Filtering
 Often there are many more users than items
– E.g. only a few thousand movies available, but millions of users
– Comparing to all users can be slow
 Can instead do neighbourhood-based filtering using items:
– Two items are similar if the users who rated both rated them similarly
– Compute the PMCC between the ratings from users who rated both, as wi,j
– Find the k most similar items J
– Compute a simple weighted average: pu,i = ∑j∈J wi,j ru,j / ∑j∈J wi,j
– No adjustment by the mean, as we assume no bias from items
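The following sketch illustrates the mean-adjusted, user-based prediction above. It assumes the ratings are a dense numpy array R with np.nan marking missing entries, and uses Pearson correlation (PMCC) for the weights wu,v; the function name, the restriction to positively correlated neighbours, and the fallback to the user's mean are my own illustrative choices, not prescribed by the slides.

```python
import numpy as np

def predict_user_based(R, u, i, k=2):
    """Mean-adjusted user-based prediction p[u,i] from the k most
    similar users who have rated item i.  R is an n x m array with
    np.nan marking missing ratings."""
    user_means = np.nanmean(R, axis=1)
    # PMCC between user u and each candidate v, over co-rated items
    weights = {}
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue                      # v must have rated item i
        both = ~np.isnan(R[u]) & ~np.isnan(R[v])
        if both.sum() >= 2:
            w = np.corrcoef(R[u, both], R[v, both])[0, 1]
            if not np.isnan(w) and w > 0:
                weights[v] = w
    # keep the k nearest neighbours by weight
    K = sorted(weights, key=weights.get, reverse=True)[:k]
    if not K:
        return user_means[u]              # fall back to u's average
    num = sum(weights[v] * (R[v, i] - user_means[v]) for v in K)
    den = sum(weights[v] for v in K)
    return user_means[u] + num / den
```

An item-based version would swap the roles of rows and columns and skip the mean adjustment, per the slide above.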
Latent Factor Analysis
 We rejected methods based on features of items, since we could not guarantee they would be available
 Latent factor analysis tries to find "hidden" features from the ratings matrix
– Factors might correspond to recognisable features like genre
– Other factors: child-friendly, comedic, light/dark
– More abstract: depth of character, quirkiness
– Could also find factors that are hard to interpret

Latent Factor Example
[Figure: movies and users placed in a space of latent factors]

Matrix Factorization
 Model each user and item as a vector of (inferred) factors
– Let qi be the vector for item i, and wu the vector for user u
– The predicted rating pu,i is then the dot product (wu · qi)
 How to learn the factors from the given data?
 Given ratings matrix R, try to express R as a product WQ
– W is an n × f matrix of users and their latent factors
– Q is an f × m matrix of items and their latent factors
– A matrix factorization problem: factor R into WQ
– Can be solved by Singular Value Decomposition

Singular Value Decomposition
 Given an m × n matrix M, decompose it as M = UΣVᵀ, where:
– U is an m × m matrix with orthogonal columns [left singular vectors]
– Σ is a rectangular m × n diagonal matrix [singular values]
– Vᵀ is an n × n matrix with orthogonal rows [right singular vectors]
 The Singular Value Decomposition is highly structured:
– The singular values are the square roots of the eigenvalues of MMᵀ
– The left (right) singular vectors are eigenvectors of MMᵀ (MᵀM)
 SVD can be used to give approximate representations:
– Keep the k largest singular values, set the rest to zero
– This picks out the k most important "directions"
– Gives the k latent factors that describe the data

SVD for Recommender Systems
 Textbook SVD doesn't work when the matrix has missing values!
– Could try to fill in the missing values somehow, then factor
 Instead, set up an optimization problem:
– Learn length-k vectors qi, wu to solve:
  minq,w ∑(u,i)∈R (ru,i − qi · wu)²
– Minimize the squared error between the predicted and true values
 If we had a complete matrix, SVD would solve this problem:
– Set W = Uk Σk^(1/2) and Q = Σk^(1/2) Vkᵀ, where Uk, Vk are the singular vectors corresponding to the k largest singular values
 Additional problem: too much freedom (not enough ratings)
– Risk of overfitting the training data, failing to generalize

Regularization
 Regularization is a technique used in many places
– Here, avoid overfitting by penalizing parameters for growing too large
 Achieve this by adding the size of the parameters to the optimization:
  minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)
– ǁxǁ² is the squared L2 (Euclidean) norm: the sum of the squared values
– The effect is to drive more values of q and w towards 0, to minimize complexity
 Many different forms of regularization:
– L2 regularization: add terms of the form ǁxǁ²
– L1 regularization: terms of the form ǁxǁ1 (can give sparser solutions)
– The form of the regularization should fit the optimization

Solving the Optimization: Gradient Descent
 How to solve minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)?
 Gradient descent:
– For each training example, find the error of the current prediction: eu,i = ru,i − qi · wu
– Modify the parameters by taking a step in the direction of the gradient:
  qi ← qi + γ (eu,i wu − λ qi)  [derivative of the target with respect to q]
  wu ← wu + γ (eu,i qi − λ wu)  [derivative with respect to w]
– γ is a step-size parameter that controls the speed of descent
 Advantages and disadvantages of gradient descent:
– ++ Fairly easy to implement: the update at each step is easy to compute
– −− Can be slow: hard to parallelize
– (A worked sketch follows after the biases slide below)

Solving the Optimization: Least Squares
 How to solve minq,w ∑(u,i)∈R (ru,i − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ²)?
 Reducing to least squares:
– Suppose the values of wu are fixed
– Then the goal is to minimize a function of the squares of the qi
– Solved by techniques from regression: (regularized) least squares
 The alternating least squares (ALS) method:
– Pretend the values of wu are fixed, optimize the values of qi
– Swap: pretend the values of qi are fixed, optimize the values of wu
– Repeat until convergence (a sketch of ALS also follows below)
– Can be slower than gradient descent on a single machine
– But it can be parallelized: each qi is computed independently

Adding Biases
 Can generalize matrix factorization to incorporate other effects
– E.g. Fred always rates 1 star less than average
– E.g. Citizen Kane is rated 0.5 higher than other films on average
– These are not captured well by a model of the form (qi · wu)
 Explicitly modeling biases (intercepts) can give a better fit
 Model with biases: pu,i = μ + bi + bu + (wu · qi)
– μ: global average rating
– bi: rating bias for item i
– bu: rating bias of user u (similar to the adjustment in the neighbourhood method)
 Optimize the new error function in the same way:
  minq,w,b ∑(u,i)∈R (ru,i − μ − bu − bi − qi · wu)² + λ(ǁqiǁ² + ǁwuǁ² + bu² + bi²)
– Can add more biases, e.g. to incorporate variation over time
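The gradient updates above extend directly to the model with biases. Here is a minimal stochastic-gradient sketch, assuming the training data is a list of (u, i, r) triples; the hyperparameter values, the initialization, and the analogous update rules for bu and bi are my own illustrative assumptions, following the same pattern as the updates for qi and wu:

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, f=10,
                  gamma=0.01, lam=0.1, epochs=20, seed=0):
    """Fit p[u,i] = mu + b_u + b_i + (w_u . q_i) by stochastic
    gradient descent on (u, i, r) training triples."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (n_users, f))      # user factors w_u
    Q = rng.normal(0, 0.1, (n_items, f))      # item factors q_i
    bu = np.zeros(n_users)                    # user biases
    bi = np.zeros(n_items)                    # item biases
    mu = np.mean([r for _, _, r in ratings])  # global average rating
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - (mu + bu[u] + bi[i] + W[u] @ Q[i])  # error e_{u,i}
            wu, qi = W[u].copy(), Q[i].copy()
            W[u] += gamma * (e * qi - lam * wu)         # gradient steps
            Q[i] += gamma * (e * wu - lam * qi)
            bu[u] += gamma * (e - lam * bu[u])
            bi[i] += gamma * (e - lam * bi[i])
    return mu, bu, bi, W, Q
```

A prediction for a test pair is then mu + bu[u] + bi[i] + W[u] @ Q[i]; dropping the bias terms recovers the plain objective from the earlier slides.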
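For comparison, here is a sketch of alternating least squares for the regularized (unbiased) objective: holding Q fixed, each row of W has a closed-form regularized least-squares solution over just the items that user has rated, and symmetrically for Q. The dense np.nan representation and the names are again assumptions:

```python
import numpy as np

def als_factorize(R, f=10, lam=0.1, iters=20, seed=0):
    """Alternating least squares for min over observed (u,i) of
    (r_ui - w_u . q_i)^2 + lam * (||w_u||^2 + ||q_i||^2).
    R is an n x m array with np.nan for missing ratings."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    W = rng.normal(0, 0.1, (n, f))
    Q = rng.normal(0, 0.1, (m, f))
    I = np.eye(f)
    for _ in range(iters):
        for u in range(n):            # fix Q, solve for each w_u
            rated = ~np.isnan(R[u])
            A = Q[rated].T @ Q[rated] + lam * I
            W[u] = np.linalg.solve(A, Q[rated].T @ R[u, rated])
        for i in range(m):            # fix W, solve for each q_i
            rated = ~np.isnan(R[:, i])
            A = W[rated].T @ W[rated] + lam * I
            Q[i] = np.linalg.solve(A, W[rated].T @ R[rated, i])
    return W, Q                       # predict with (W @ Q.T)[u, i]
```

The per-user (and per-item) solves are independent of one another, which is what makes ALS easy to parallelize, as the slide notes.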
Cold Start Problem: New Items
 How to cope when new objects are added to the system?
– New users arrive, new movies are released: the "cold start" problem
 A new item has no ratings: will it never be recommended?
– Use attributes of the item (actors, genre) to give it some initial score
– Randomly suggest it to users to gather some ratings

Cold Start Problem: New Users
 New users arrive: we have no idea what they like!
 Recommend globally popular items to them (Harry Potter...)
– May not give much specific information about their tastes
 Encourage new users to rate some items before recommending
– Suggest items that are "divisive": try to maximize the information gained
– Tradeoff: "poor" recommendations may drive users away

Case Study: The Netflix Prize
 Netflix ran a competition from 2006 to 2009
– Netflix streams movies over the internet (and rents DVDs by mail)
– Users rate each movie on a 5-star scale
– Netflix makes recommendations of what to watch next
 Objective of the competition: improve on the current recommendations
– The incumbent "Cinematch" algorithm "uses straightforward linear models..."
– Prize: $1M to improve RMSE by 10%
 Training data: 100M dated ratings from 480K users on 18K movies
– Predictions for the test data could be submitted at most once per day
– This avoids stressing the servers and deters attempts to elicit the true answers

The Netflix Prize
https://www.youtube.com/watch?v=ImpV70uLxyw

Netflix Prize Factors
 Postscript: Netflix adopted some of the winning ideas, but not all
– "Explainability" of recommendations is an additional requirement
– The cost of fitting models and making predictions is also important

Recommender Systems Summary
 Introduced the concept of recommendation
 Saw neighbour-based methods
 Saw latent factor methods
 Understood how recommender systems are evaluated
– The Netflix Prize as a case study in applied recommender systems
 Recommended reading:
– Recommender Systems (Encyclopedia of Machine Learning)
– Matrix Factorization Techniques for Recommender Systems. Koren, Bell, Volinsky, IEEE Computer, 2009