P2P RECOMMENDER SYSTEMS: A (SMALL) SURVEY
Giulio Rossetti

Talk Outline
• Problem definition: What is a recommender system? Why recommender systems?
• Centralized RS: well-known families of approaches; collaborative filtering (the idea)
• Decentralized RS: why P2P? A (small) survey

What are Recommender Systems?
RSs are a class of information filtering systems that seek to predict:
• the rating or preference that a user would give to
• an item (such as music, books, or movies) or a social element (e.g. people or groups) they have not yet considered,
using a model built from the characteristics of
• the items (content-based approaches) or
• the user's social environment (collaborative filtering approaches).

Why recommender systems?
Nowadays the amount of information we retrieve has become enormous (Big Data). What we really need is a technology that can help us find resources of interest among the overwhelming amount of available data.
“[…] a personalized information filtering used to either predict whether a particular user will like a particular item (prediction problem) or to identify a set of N items that will be of interest to a certain user.”

Well-known families of approaches
• Random prediction: randomly choose items from the set of available ones and recommend them to the user.
• Frequent sequences: if a customer frequently rates items, his frequent rating pattern can be exploited to recommend other items (with similar ratings) to him.
• Collaborative filtering algorithms: require the recommendation seekers to express their preferences by rating items; the more users rate items (or categories), the more accurate the recommendations become.
• Content-based algorithms: attempt to recommend items that are similar to items the user liked in the past.

Centralized Approaches
Two main families of methodologies have been studied in recent years:
• User-based CF: CF algorithms that work on the assumption that each user belongs to a group of similarly behaving users.
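A minimal sketch of this user-based idea: predict a user's rating as the similarity-weighted average of the ratings given by similar users. The users, ratings and the naive overlap similarity below are all invented for illustration.

```python
# Toy user-based CF: predict a user's rating for an item as the
# similarity-weighted average of the ratings given by similar users.
# All names, ratings and the similarity choice are invented examples.

def predict(user, item, ratings, similarity):
    """ratings: {user: {item: score}}; similarity(u, v) -> float."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = similarity(user, other)
        if sim > 0:
            num += sim * their_ratings[item]
            den += sim
    return num / den if den else None

ratings = {
    "alice": {"matrix": 5, "dune": 3},
    "bob":   {"matrix": 4, "dune": 2},
    "carol": {"dune": 5},
}

def overlap(u, v):
    # naive similarity: Jaccard overlap of the sets of rated items
    a, b = set(ratings[u]), set(ratings[v])
    return len(a & b) / len(a | b)

print(predict("carol", "matrix", ratings, overlap))  # → 4.5
```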
The basis for the recommendation is the set of items liked by users: items are recommended based on user tastes, under the assumption that similar users (users with similar attributes) will be interested in the same items.
• Item-based CF: CF algorithms that look at the similarity between items to make a prediction. The idea is that users are most likely to purchase items similar to the ones they have already bought in the past; by analyzing the purchasing information, we can form an idea of what a user may want in the future.

P2P: Motivations
The need for efficient decentralized recommender systems has been appreciated for some time, both for the intrinsic advantages of decentralization and for the necessity of integrating recommender systems into P2P applications. The two main advantages are:
1. predictions can be distributed among all users, removing the need for a costly central server and enhancing scalability;
2. a decentralized recommender improves the privacy of the users, for there is no central entity owning and storing their private information.

P2P Recommender Systems: a small survey
• User-based CF: BuddyCast, kNN (Random Samples & T-MAN)
• P2PRec: a social-based recommender system
• SoCS: Social Graph Embedding
• Random Walks

User-Based Collaborative Filtering
R. Ormándi, I. Hegedűs and M. Jelasity
Node balancing issue: overlay topologies defined by node similarity often have highly unbalanced degree distributions (i.e. power-law).
Overlay management: how can the best possible overlay for computing recommendation scores be built and maintained (while taking care of bandwidth usage at the nodes)?
Desideratum: a minimal, uniform load from overlay management, even when the in-degree distribution of the expected overlay graph is unbalanced.
Approaches: BuddyCast, kNN (Random Sampling & T-MAN)

BuddyCast
• Each node's local view contains a full descriptor of the node's neighbors (i.e. their ratings).
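Since every view entry carries a neighbor's full rating descriptor, a peer can rank candidate items without touching the network. A minimal sketch, where the descriptor layout (peer id plus a rating dictionary) is an assumption for illustration:

```python
from collections import Counter

# Hypothetical sketch of the "local information" idea: each view entry
# holds a neighbor's full rating descriptor, so a peer can rank unseen
# items locally, with no extra network traffic. Data is invented.

def local_recommend(my_items, view, n=3):
    """view: list of (peer_id, {item: rating}) descriptors held locally."""
    scores = Counter()
    for _peer, neighbor_ratings in view:
        for item, rating in neighbor_ratings.items():
            if item not in my_items:
                scores[item] += rating
    return [item for item, _ in scores.most_common(n)]

view = [
    ("peer1", {"dune": 5, "blade": 2}),
    ("peer2", {"blade": 4, "alien": 1}),
]
print(local_recommend({"dune"}, view, n=2))  # → ['blade', 'alien']
```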
• Computing recommendations does not load the network (local-information approach).
• Load balancing:
  • Block list: if a node communicates with another peer, that peer is put on the block list for a few hours.
  • Candidate list: contains close peers for potential communication.
  • Random list: contains random samples from the network.
• For overlay maintenance, each node connects to the best node from the candidate list with probability α, or to a node from the random list with probability 1−α, and exchanges its buddy list with the selected peer.

kNN: Random Samples
• Every node has a local view of size k that contains node descriptors.
• Each node is initialized with k random samples from the network, which iteratively approximate the kNN graph.
• Convergence is based on an iterative random sampling process.
• Random nodes are inserted into the view (which is implemented as a bounded priority queue).
• The queue's priority is based on the similarity function provided by the recommender module.

kNN: T-MAN sampling
• The overlay is managed with the T-MAN algorithm.
• T-MAN periodically updates the node's view (of size k) by:
1. selecting a peer node to communicate with
2. exchanging its view with the peer
3. merging the two views and keeping the closest k descriptors
• Peer (communication) selection methods:
  • Global: selects a node from the whole network uniformly at random
  • View: selects a node from the view uniformly at random
  • Proportional: selects a node from the view, but with a non-uniform probability distribution
  • Best: selects the most similar node without any restriction

User-based CF: Observations
1. With unbalanced distributions it is not optimal to use the kNN (T-MAN Best) view; a more relaxed one can give better recommendation performance.
2. Overlay construction converges reasonably fast, both with random updates and with T-MAN.
3. T-MAN with Global selection is a good choice:
it has a fully uniform load distribution combined with an acceptable convergence speed, which is better than that of the random view update.

P2PRec: a social-based P2P recommender system
F. Draidi and E. Pacitti
The idea: recommend high-quality documents related to the query topics that are held by friends (or friends-of-friends, FOAF) who are experts on the topics related to the query.
Assumptions:
• each node represents a peer labelled with the contents it stores and its topics of interest;
• expertise is deduced from the contents stored by a user;
• the topics each peer is interested in are computed by analyzing the documents he holds;
• information about experts is disseminated by a semantic-based gossip algorithm that provides scalability, robustness and load balancing.

How P2PRec works
1. Latent Dirichlet Allocation (LDA) is used to automatically model the topics in the system:
  1. Training, at the global level: identification of the complete set of topics.
  2. Inference, at the local (node) level: extraction of the topics of interest for each user.
2. Dissemination of local information by a gossip algorithm:
  • FOAF descriptor: topics of interest, trust level.
  • At each gossip exchange, each user u checks its local view for relevant similar peers with respect to topics of interest and friendship networks; if one is found, a friendship request is issued.
3. Querying: a keyword query q is associated with a TTL and is routed recursively in a P2P top-k manner.

Social Graph Embedding
A. Kermarrec, V. Leroy and G. Trédan
A proximity metric between users enables the prediction of potentially relevant future relationships (link prediction).
SoCS (Social Coordinate System)
• A fully distributed algorithm that embeds a social graph in a Euclidean space.
• Nodes get assigned coordinates w.r.t. their social position.
• The community structure is preserved.
Force-based embedding (FBE): edges represent springs and nodes represent equally electrically charged particles.
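A single synchronous update step under this spring/charge analogy can be sketched as follows; the 2-D setting, the constants and the coordinates are illustrative assumptions, not values from the paper.

```python
import math

# One synchronous step of a toy 2-D force-based embedding: every pair of
# nodes repels (equal "charges", force ~ 1/d^2) and every edge pulls its
# endpoints together like a spring. Graph, constants and coordinates are
# illustrative assumptions only.

def fbe_step(pos, edges, spring=0.1, charge=0.5):
    forces = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    # repulsion between every pair of charged particles
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = charge / d ** 2
            forces[u][0] += f * dx / d
            forces[u][1] += f * dy / d
            forces[v][0] -= f * dx / d
            forces[v][1] -= f * dy / d
    # spring attraction along each edge
    for u, v in edges:
        dx = pos[v][0] - pos[u][0]
        dy = pos[v][1] - pos[u][1]
        forces[u][0] += spring * dx
        forces[u][1] += spring * dy
        forces[v][0] -= spring * dx
        forces[v][1] -= spring * dy
    # move every node along its net force (unit step size)
    return {v: (pos[v][0] + f[0], pos[v][1] + f[1]) for v, f in forces.items()}

pos = {"a": (0.0, 0.0), "b": (2.0, 0.0), "c": (1.0, 3.0)}
new_pos = fbe_step(pos, edges=[("a", "b")])
# linked nodes "a" and "b" drift together; unlinked "c" is only pushed away
```

Iterating `fbe_step` until positions stop changing is what the slide calls reaching an equilibrium.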
The edges (springs) attract the vertices they link, whereas the vertices (particles) repulse each other. The embedding is achieved once the system reaches an equilibrium.

SoCS Algorithm
Social neighbors: nodes that have close social positions. The graph neighbors and the social neighbors of a node are not necessarily the same.
Each node regularly updates its position in the social space:
1. it first gathers the positions of its graph and social neighbors;
2. using these positions, it computes the forces applied to it and derives its updated social position;
3. a gossip protocol provides the node with a list of its new social neighbors;
4. this list is then used to compute new positions.
Similarity metrics: SoCS recommends to a node its closest social neighbors that are not already graph neighbors.
• Common Neighbors, Jaccard, Adamic/Adar, Path Length, Katz…

SoCS Algorithm (2)
SoCS relies on gossip to discover the social neighbors. Each node runs a clustering algorithm (Neighbors Peer Sampling, NPS) in order to maintain and update its social-neighbor list. Gossip protocols have been shown to be cheap, robust against churn, and to converge quickly.

Decentralized Random Walks
A. Kermarrec, V. Leroy, A. Moin and C. Thraves
The application of random walks to decentralized environments differs from the centralized version:
• Centralized RS: random walks are used as a clustering mechanism (e.g. community discovery).
• Decentralized RS: community discovery is infeasible, since the knowledge each peer has of the P2P network is limited to its neighborhood.

Proposed Approach
1. Each peer is provided with a neighborhood composed of a small set of similar peers by means of an epidemic (gossip) protocol;
2. ratings for unknown items are estimated by a random walk on the neighborhood.
• Once peers have stabilized their neighborhood, they can calculate recommendations independently.
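The second phase can be sketched as a random walk over a toy neighborhood Markov chain, averaging the ratings the visited peers gave to the target item. The peers, transition probabilities and ratings below are invented for illustration.

```python
import random

# Toy sketch: given an already-stabilized neighborhood (phase 1), estimate
# a missing rating by walking the neighborhood as a Markov chain and
# averaging the ratings the visited peers gave to the target item.
# Peers, transition probabilities and ratings are invented.

def walk_estimate(start, item, transitions, ratings, steps=2000, seed=42):
    """transitions: {peer: [(neighbor, probability), ...]}."""
    rng = random.Random(seed)
    current, samples = start, []
    for _ in range(steps):
        peers, probs = zip(*transitions[current])
        current = rng.choices(peers, weights=probs)[0]
        if item in ratings.get(current, {}):
            samples.append(ratings[current][item])
    return sum(samples) / len(samples) if samples else None

transitions = {
    "me":    [("alice", 0.5), ("bob", 0.5)],
    "alice": [("me", 1.0)],
    "bob":   [("me", 1.0)],
}
ratings = {"alice": {"dune": 4}, "bob": {"dune": 2}}
estimate = walk_estimate("me", "dune", transitions, ratings)
# the estimate falls between bob's rating (2) and alice's rating (4)
```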
• Similarity measures: Pearson correlation, Jaccard.

Random Walks: observed properties
The users in the neighborhood are modeled as the vertices of a Markov-chain graph, and a random walk is applied on this graph.
• A Markov chain can be represented by a directed graph where the vertices are the states of the chain and the edges represent the transition probabilities from one state to another.
Results:
• The random walk works well when the data is so sparse that classic similarity measures fail to detect meaningful relations between users;
• increasing the neighborhood size increases the accuracy;
• decentralized user-based approaches perform better (lower complexity, higher precision) than their item-based counterparts in P2P recommender applications;
• cosine similarity performed better in decentralized item-based algorithms, while Pearson correlation worked better for decentralized user-based algorithms.

Conclusions
• P2P recommender systems are needed in order to overcome scalability and privacy issues.
• Several approaches were analyzed, each relying (to some extent) on a gossip algorithm to maintain and update the overlay network.
• Almost all the discussed approaches tackle the problem with a user-based similarity strategy, exploiting classical network-theory techniques:
  • unsupervised link prediction
  • community discovery
  • force-directed embedding

Bibliography
• D. Almazro and G. Shahatah. A survey paper on recommender systems (2010)
• F. Draidi and E. Pacitti. Demo of P2Prec: a social-based P2P recommendation system (2011)
• A. Kermarrec, V. Leroy, A. Moin and C. Thraves. Application of random walks to decentralized recommender systems (2010)
• A. Kermarrec, V. Leroy and G. Trédan. Distributed social graph embedding (2011)
• R. Ormándi, I. Hegedűs and M. Jelasity. Overlay management for fully distributed user-based collaborative filtering (2010)

…questions?