Google News Personalization Scalable Online Collaborative Filtering

advertisement

Google News

Personalization

Scalable Online Collaborative Filtering

Abhinandan Das abhinandan@google.com

Mayur Datar - mayur@google.com

Ashutosh Garg - ashutosh@google.com

Shyam Rajaram rajaram1@ifp.uiuc.edu

Presented by: Aniket Zamwar zamwar@usc.edu

1

Already Studied in

Class

Map Reduce

Collaborative Filtering

Content Based Recommendation

Clustering Techniques - Pros/Cons

2

Problem Statement

Scale of operation is very huge - order of several million news stories dynamically changing at high rate.

Presented with the click history for N users ( U = { u 1 ,u 2 ,...,u N } ) over M items ( S = {s 1 ,s 2 ,...,s M } ), and given a specific user ‘ u ’ with click history set

C u consisting of stories { s i1 , . . . , s i|Cu | } , recommend K stories to the user.

Strict timing constraints for recommendation engine to generate recommendations.

Google News

Personalization

4/18/13 3

4/18/13

Approaches

Collaborative Clustering

Probabilistic Latent Semantic Indexing

Covisitation Counts

Google News

Personalization

4

Problem Setting

Record User Queries and Clicks

Recommendations of News using user click history and click history of the community

4/18/13

Google News

Personalization

5

Recommender

System

Collaborative Filtering Systems

‣ Memory-based Algorithms

‣ Prediction calculated as weighted average of the ratings given by other users weight is proportional to to “similarity” between users.

‣ Model-based Algorithms

4/18/13

‣ Model the users based on their past ratings and use these models to predict ratings of unseen items.

Mix of memory based + model based systems

Google News

Personalization

6

4/18/13

Algorithms

Model based approach

Clustering Techniques: Probabilistic

Latent Semantic Indexing(PLSI) and min hash

Memory based approach

Item Covisitation

Google News

Personalization

7

Min Hashing

Probabilistic clustering technique - assigns pairs of users to same cluster with probability proportional to overlab between the set of items the users have voted for.

Similarity calculated using Jaccard Coefficient

To Do: Given user u-i, compute similarity S(u-i, u-j) for all users u-j, and recommend stories to u-i voted by u-j with weight equal to S(u-i, u-j)

Issues: Real time not scalable, using hash table to find vote for specific user is also not feasible, offline computation is also not feasible

Locality Sensitive Hashing (LSH) comes for rescue

Google News

4/18/13

Personalization

8

4/18/13

Locality Sensitive

Hashing

Key Idea: Hash data points using several hash functions, such that for each hash function the probability of collision is much higher for objects which are close to each other.

Min-hashing technique is used to randomly permute the set of items (S) and for each user u-i compute its hash value h(u-i) as the index of first item under the permutation that belongs to user ’s item set Cu-i.

Min-hashing = probabilistic clustering where each hash bucket corresponds to a cluster, that puts two users together in the same cluster with probability equal to item set similarity S(u-i, u-j)

Google News

Personalization

9

PLSI

Probabilistic Latent Semantic models

Models users and items as random variables relationship between users and items is learned by modeling joint distribution of users and items as mixed distribution

A hidden variable Z is used to define the relationship, it represents user communities and item communities.

Google News

Personalization

4/18/13 10

Covisitation

Covisitation is defined as event in which two stories are clicked by same user within a

4/18/13 certain time interval.

A graph whose nodes represent items and weighted edges represent time discounted number of covisitation instances.

For each user click the adjacency list representing graph is updated: for entry for each item in user history, new entry corresponding to clicked item is added if not there; if it is already there then the age discounted count is updated.

Google News

Personalization

11

Covisitation based

Recommendation

Fetch user u-i ’s recent click history - limited to past few hours or days.

For each item s-i in click history of user, lookup the entry for pair (s-i, s) in adjacency list for s-i stored in Big Table.

The value stored in entry normalized by sum of all entries for s-i is stored to recommendation score.

Recommendation score is normalized to a value between 0 and 1 by linear scaling.

Google News

Personalization

4/18/13 12

System Setup

Three Components:

Offline component to cluster users based on click history

Online servers:

Updating story and user statistics each time user clicks on news story

• Generating news story when user requests

Two Data Tables

User Table (UT) indexed by user-id, stores user click history and clustering information.

4/18/13

• Story Table (ST) indexed by story-id, stores real time click counts for every storystory and story-cluster pair.

Google News

Personalization

13

System Components

NSS f f b u e r

UT user click

NFE view personalized news page request

4/18/13

NPS c h c a e

Fetch

NFE: News Front End

NSS: News Statistics Server

NPS: News Personalization

Server

Google News

Personalization

Update Statistics

ST

Statistics

UT: User Table

ST: Story Table

Min Hashing

PLSI

Offline

Log

Analysis

14

4/18/13

Pros

Scalable collaboration of Content based and Collaborative clustering

Recommendation system using Min

Hashing and PLSI Algorithms

Scaling the algorithms by using Map

Reduce and Big Table representation for data

Using click events as vote for news

Dynamically providing the latest likely news that suits the interest of the user

Google News

Personalization

4/18/13

Cons

Depends a lot on User Clicks

User Clicks considered as positive vote

Does not say anything about negative vote

Google News

Personalization

16

Download