Abhinandan Das abhinandan@google.com
Mayur Datar - mayur@google.com
Ashutosh Garg - ashutosh@google.com
Shyam Rajaram rajaram1@ifp.uiuc.edu
Presented by: Aniket Zamwar zamwar@usc.edu
1
•
Map Reduce
•
Collaborative Filtering
•
Content Based Recommendation
•
Clustering Techniques - Pros/Cons
2
•
Scale of operation is very huge - order of several million news stories dynamically changing at high rate.
Presented with the click history for N users ( U = { u 1 ,u 2 ,...,u N } ) over M items ( S = {s 1 ,s 2 ,...,s M } ), and given a specific user ‘ u ’ with click history set
C u consisting of stories { s i1 , . . . , s i|Cu | } , recommend K stories to the user.
Strict timing constraints for recommendation engine to generate recommendations.
Google News
Personalization
4/18/13 3
4/18/13
•
Collaborative Clustering
•
Probabilistic Latent Semantic Indexing
•
Covisitation Counts
Google News
Personalization
4
•
Record User Queries and Clicks
•
Recommendations of News using user click history and click history of the community
4/18/13
Google News
Personalization
5
•
•
Collaborative Filtering Systems
‣ Memory-based Algorithms
‣ Prediction calculated as weighted average of the ratings given by other users weight is proportional to to “similarity” between users.
‣
‣ Model-based Algorithms
4/18/13
‣ Model the users based on their past ratings and use these models to predict ratings of unseen items.
•
Mix of memory based + model based systems
Google News
Personalization
6
4/18/13
•
Model based approach
•
Clustering Techniques: Probabilistic
Latent Semantic Indexing(PLSI) and min hash
•
Memory based approach
•
Item Covisitation
Google News
Personalization
7
•
Probabilistic clustering technique - assigns pairs of users to same cluster with probability proportional to overlab between the set of items the users have voted for.
•
Similarity calculated using Jaccard Coefficient
•
To Do: Given user u-i, compute similarity S(u-i, u-j) for all users u-j, and recommend stories to u-i voted by u-j with weight equal to S(u-i, u-j)
Issues: Real time not scalable, using hash table to find vote for specific user is also not feasible, offline computation is also not feasible
•
Locality Sensitive Hashing (LSH) comes for rescue
Google News
4/18/13
Personalization
8
4/18/13
•
Key Idea: Hash data points using several hash functions, such that for each hash function the probability of collision is much higher for objects which are close to each other.
•
Min-hashing technique is used to randomly permute the set of items (S) and for each user u-i compute its hash value h(u-i) as the index of first item under the permutation that belongs to user ’s item set Cu-i.
•
Min-hashing = probabilistic clustering where each hash bucket corresponds to a cluster, that puts two users together in the same cluster with probability equal to item set similarity S(u-i, u-j)
Google News
Personalization
9
•
Probabilistic Latent Semantic models
•
Models users and items as random variables relationship between users and items is learned by modeling joint distribution of users and items as mixed distribution
•
A hidden variable Z is used to define the relationship, it represents user communities and item communities.
Google News
Personalization
4/18/13 10
•
Covisitation is defined as event in which two stories are clicked by same user within a
4/18/13 certain time interval.
•
A graph whose nodes represent items and weighted edges represent time discounted number of covisitation instances.
•
For each user click the adjacency list representing graph is updated: for entry for each item in user history, new entry corresponding to clicked item is added if not there; if it is already there then the age discounted count is updated.
Google News
Personalization
11
•
Fetch user u-i ’s recent click history - limited to past few hours or days.
•
For each item s-i in click history of user, lookup the entry for pair (s-i, s) in adjacency list for s-i stored in Big Table.
•
The value stored in entry normalized by sum of all entries for s-i is stored to recommendation score.
•
Recommendation score is normalized to a value between 0 and 1 by linear scaling.
Google News
Personalization
4/18/13 12
•
Three Components:
•
Offline component to cluster users based on click history
•
Online servers:
•
Updating story and user statistics each time user clicks on news story
• Generating news story when user requests
•
Two Data Tables
•
User Table (UT) indexed by user-id, stores user click history and clustering information.
4/18/13
• Story Table (ST) indexed by story-id, stores real time click counts for every storystory and story-cluster pair.
Google News
Personalization
13
NSS f f b u e r
UT user click
NFE view personalized news page request
4/18/13
NPS c h c a e
Fetch
NFE: News Front End
NSS: News Statistics Server
NPS: News Personalization
Server
Google News
Personalization
Update Statistics
ST
Statistics
UT: User Table
ST: Story Table
Min Hashing
PLSI
Offline
Log
Analysis
14
4/18/13
•
Scalable collaboration of Content based and Collaborative clustering
•
Recommendation system using Min
Hashing and PLSI Algorithms
•
Scaling the algorithms by using Map
Reduce and Big Table representation for data
•
Using click events as vote for news
•
Dynamically providing the latest likely news that suits the interest of the user
Google News
Personalization
4/18/13
•
Depends a lot on User Clicks
•
User Clicks considered as positive vote
•
Does not say anything about negative vote
Google News
Personalization
16