1st Gossple Workshop on Social Networking (December 2010)

Large-scale data sharing by exploiting gossiping
Esther Pacitti
SOPHIA ANTIPOLIS - MÉDITERRANÉE

Context: P2P Data Sharing
• We consider P2P online communities characterized by
  – Professional participants (researchers, engineers, support staff, etc.) who use web-scale collaboration in their workplace
  – A large scale of users and data (clouds, grids, internet)
• Example applications:
  – P2P Recommendation Systems
    • Useful for processing scientific workflows among participants’ peers
  – P2P Query Reformulation
    • Clinical case sharing among doctors or physicians
  – P2P CDN
• Projects:
  – ANR DataRing (2009-2012, P2P online communities)
  – Datluge (2010-2012, with UFRJ, Brazil, on P2P scientific workflows)

Motivations
[Figure labels: Chemistry, Materials Science and Physics; Bioinformatics; Computer Science]

P2PRec: document recommender
• Huge graph: G = (D, U, E, T), where
  – D is the set of shared documents
  – U is the set of users in the system
  – E is the set of edges between users, with an edge e(u, v) if users u and v are friends
  – T is the set of users’ topics of interest.
• Problem: given a query, recommend the most relevant documents
• Our approach
  – Reduce the search space by identifying relevant users
  – Identify relevant users
    • Users that store/download enough high-quality documents and become a kind of provider on specific topics
    • Recommended by trusted friends
• P2P overlay: Semantic Gossiping
  • Disseminates relevant users and their topics of interest

P2PRec*: document recommender
• Topics of interest
  – With respect to the documents a user stores
  – Extracted automatically
• Friendship network
  – Explicit friendship (may be leveraged with implicit friendship)
  – Expresses users’ trust
  – Implemented in FOAF files (friend-of-a-friend files, a machine-readable vocabulary serialized in RDF/XML)
• Keyword queries
  – Mapped to topics
  – Mostly related to the user’s topics of interest
• Measures to
  – Check the similarity of users with respect to their topics (Dice coefficient)
  – Estimate the relevance of a user
*Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as Inria report

Semantic Gossiping
[Figure: local views of u1 and u5 before and after a gossip exchange. u1’s FOAF file holds its topics (t1, t2) and its friends (link to u5). Each local-view entry pairs a user with its gossip information, e.g. u5: t1, t2; u6: t2, t3; u4: t1.]
• Users are compared on their topics with the Dice coefficient
• If the distance between u_u and u_v > τ, ask for friendship
• If friendship is accepted, add u_v to the FOAF file

Relevant Users
• Users’ topics of interest are automatically extracted using LDA*
  – by inspecting the documents’ topic vectors
• A user u is considered relevant on a topic t ∈ T_u if a percentage of its documents have high quality in topic t
• Each document doc at user u has
  – A rating given to doc: rate_doc
  – A topic vector (extracted using LDA)
    • V_doc = {w_doc,t1, ..., w_doc,td}
• doc is considered high quality in a topic t, written quality_t(doc, u),
  • if w_doc,t * rate_doc > a threshold value
• A user can be relevant in more than one topic
*Latent Dirichlet Allocation
(topic classifier)

Query Processing
• Implements recommendation
• Input: keywords
• Output:
  – Links to a set of good-quality documents. May include links to documents on a friend’s topics of interest (query expansion)
  – Popularity and similarity information
• Example: doctors studying the behavior of a gene X may be glad to learn about the diseases it can cause and to check some experimental data sets

Query Processing
[Figure: the requester issues query q with q.t = t1 and q.TTL = 2. The query is forwarded along FOAF links among peers u1 to u7 (e.g. u1: t1; u3: t2; u4: t1; u5: t1, t2; u6: t2, t3), with q.TTL decreasing from 2 to 0. Each contacted peer computes sim(doc, q) and returns recommended documents together with a summary of document similarity and classification information.]
1) Query q is mapped to a topic or topics T_q
2) Select the top-k friends in the FOAF file with respect to the query topics (cosine similarity)
3) Redirect the query
4) Repeat 2) and 3) recursively until the TTL expires

Conclusions P2PRec
• P2PRec (BDA2010)
  – Finds friends (relevant users on similar topics) while gossiping
  – Query processing exploits relevant users with respect to the query topics, recursively (FOAF friends)
  – Performance evaluation
    • Recall x Precision x Response Times
  – Limitation of LDA: it needs some centralization for training, but it is good enough to validate our general approach
  – However, there are other possibilities:
    • Ontology-based automatic annotation
    • This exists for biomedical documents

P2P Query Reformulation*
• P2P Data Management System (PDMS)
• Each peer has:
  – Its own schema (and data)
  – One or more mapping acquaintances, to/from which at least one mapping rule exists
• Goal: given a query, exploit mapping acquaintances as much as possible to enhance query responses.
[Figure: peers A and B, each with its own schema and data, connected by mapping Mb,a; query ?= Hospital(x, “San Francisco”)]
*Joint work with A. Bonifati, G. Summa, P.
Valduriez, to appear as Inria report

Concepts
[Figure: query Q = Hospital($X, “San Francisco”), posed at peer B, is rewritten ALONG mapping Mb,a into HealthCareInst($X, “San Francisco”, $Z) for peer A. Each peer holds a schema and data.]
Source schema:
  Hospital [0..*]: name, location
  Grant [0..*]: amount, institution, manager
  Doctor [0..*]: name, salary
Target schema:
  HealthCareInst [0..*]: name, city, id
  Grant [0..*]: amount, scientist
Mapping rule Mb,a: Hospital(x, y) ⇢ HealthCareInst(x, y, z)
  – BODY: Hospital(x, y)
  – HEAD: HealthCareInst(x, y, z)
  – Both sides are made of atoms

Mapping Relevance
• Each time a query is translated by exploiting a mapping, we get a Relevant Rewriting
• The relevance can be Forward (along) or Backward (against), depending on the matched side of the mapping
• Goal:
  – Collect as many rewritings as possible
  – Find the most interesting paths to take (avoid useless paths)
?= Hospital(x, “San Francisco”)
M1: Hospital(x, y) ⇢ HealthCareInst(x, y, z)
M2: Institution(x, y, z) ⇢ Hospital(x, y)

Problem
[Figure: query Q at peer B; mapping Mb,a is followed ALONG its direction and mapping Mc,a AGAINST it; peers A and D also appear]
1) How to choose the most relevant paths to undertake in the reformulation task?
2) Are there other peers in the network which can be contacted?
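The forward/backward distinction from the Mapping Relevance slide can be illustrated with a small Python sketch. It is not the authors' implementation: the `Atom` and `Mapping` classes, the function names, and the rule that a match requires the same label and the same number of arguments are all assumptions made for illustration. Variables are plain names and constants are strings that keep their quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    label: str
    args: tuple  # variables are plain names, constants keep their quotes


@dataclass(frozen=True)
class Mapping:
    body: tuple  # atoms on the source side of the rule
    head: Atom   # atom on the target side of the rule


def matches(atom, query):
    # Assumed match rule: same label and same number of arguments.
    return atom.label == query.label and len(atom.args) == len(query.args)


def relevance_direction(query, mapping):
    """Return "forward" if the query matches the body, "backward" if it
    matches the head, and None if the mapping is not relevant."""
    if any(matches(a, query) for a in mapping.body):
        return "forward"
    if matches(mapping.head, query):
        return "backward"
    return None


def rewrite(query, mapping):
    """Translate the query along (forward) or against (backward) the mapping."""
    direction = relevance_direction(query, mapping)
    if direction is None:
        return []
    if direction == "forward":
        matched = next(a for a in mapping.body if matches(a, query))
        targets = [mapping.head]
    else:
        matched = mapping.head
        targets = list(mapping.body)
    # Bind each mapping variable to the query term found at the same position.
    binding = dict(zip(matched.args, query.args))
    return [Atom(t.label, tuple(binding.get(v, v) for v in t.args)) for t in targets]


q = Atom("Hospital", ("x", '"San Francisco"'))
m1 = Mapping((Atom("Hospital", ("x", "y")),), Atom("HealthCareInst", ("x", "y", "z")))
m2 = Mapping((Atom("Institution", ("x", "y", "z")),), Atom("Hospital", ("x", "y")))

print(relevance_direction(q, m1), rewrite(q, m1))
print(relevance_direction(q, m2), rewrite(q, m2))
```

With the query and the mappings M1 and M2 from the slide, M1 is matched on its body (forward) and yields a HealthCareInst atom, while M2 is matched on its head (backward) and yields an Institution atom. Both are rewritings that can be propagated to the corresponding acquaintance.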
Acquaintances
• Gossiping acquaintances
  – Potential friends that dynamically appear in the local semantic view (LSV)
• Mapping acquaintance
  – There is at least one direct mapping towards it (friend)
  – Established manually
• Social acquaintance (FOAF friend)
  – No direct mapping towards it is needed
  – There are some common interests
  – Established explicitly

Our Approach
• Gossip to disseminate mapping-rule information in order to find friends
• Users’ topics of interest
  – are expressed according to the schema information or the topics of past queries
• Measures to
  – Compute the relevance of a mapping with respect to a query
  – Compute the similarity between users
• Exploit recursively (to translate a query)
  – Mapping acquaintances
  – Social acquaintances

Gossiping Acquaintances

Social Acquaintances
• Friend
  – Shares common topics of interest
• Interests
  – Formulated by queries
  – Elements of the peer’s schema
• Approach: use the semantic view to discover friends
[Figure: a peer’s schema and its past queries]
?= Hospital(x, “San Francisco”)
?= State(y, z, “California”)
?= Doctor(w, k)
?= Patology(“heart”, x)
………

Compute Relevance
Goal: given a query and a mapping rule, determine whether the mapping is relevant to the query
Method (Standard Match Semantics)
  – Atom label matching
  – Parameter compatibility
?= Hospital(x, “San Francisco”)
M1: Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
M2: Hospital(x, y, w) AND State(x, z) ⇢ HealthCareInst(x, y, z)
M3: Ospedale(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)

Compute Relevance
• AF-IMF measure, inspired by TF-IDF*
• AF (Atom Frequency)
  – Local measure, establishing the importance of the query atom in the current mapping
• IMF (Inverse Mapping Frequency)
  – Distributed measure, establishing the overall importance of the query atom
• The relevance of a mapping with respect to q is AF * IMF
*Term frequency-inverse document frequency

Compute Relevance (AF)
About the applied measure
  ◦ To increase the effectiveness of the measure we distinguish, again, Forward/Backward relevance
FORWARD MEASURE:
body
BACKWARD MEASURE: head
?= Hospital(x, “San Francisco”)
M1: Hospital(x, y) AND State(x, z) ⇢ HealthCareInst(x, y, z)
M2: Institution(x, y, z) ⇢ Hospital(x, y)

Compute Relevance (IMF)
• IMF requires a way to get a value for
  – The total number of mappings
  – The total number of mappings containing that atom
• To do that, we can inspect the semantic view of the peer
  – And also send inquiries to peers in the FOAF file

Translate-Query
• Compute the relevance of the local mappings with respect to Q
  – Choose the top-k mappings
  – Apply the translation semantics, along/against the mapping direction
  – Trigger Translate-Query on the mapping acquaintance, recursively (until TTL)
• Select the FOAF friends to be contacted
  – By looking at the best mapping summaries with respect to Q
  – Trigger Translate-Query on the social acquaintance, recursively (until TTL)

Performance Evaluation
• Baseline
  – No gossiping, original query propagated
• Baseline+
  – No gossiping, translated query propagated
• Baseline#
  – No gossiping, translated query propagated, local measure to sort mappings (AF only)
• Full-
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, no FOAF links (only local mappings)
• Full (P2PRec)
  – Gossiping, translated query propagated, AF-IMF measure to sort mappings, FOAF links exploited
[Figure: recall (0% to 100%) versus top-k mapping threshold for Baseline, Baseline+, Baseline#, Full- and Full. Caption: Effectiveness of AF-IMF, LSV and gossiping]

Conclusions P2P Query Reformulation
  – Gossiping is used to disseminate mapping-rule information
  – Relevant mappings are exploited recursively
    • Mapping acquaintances
    • Social acquaintances
  – Initial performance results:
    • Very good recall results (over 90%)
    • Linear scale-up
    • Trade-off between recall and response times
  – Previous work uses
    • DHTs or a centralized mediation model.
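The slides give AF-IMF only as the product AF * IMF, without formulas for either factor. The Python sketch below fills them in by analogy with TF-IDF, and both definitions are assumptions. AF is taken as the share of atoms on the inspected side of the mapping (body for the forward measure, head for the backward measure) that carry the query atom's label. IMF is taken as a smoothed logarithm of the total number of known mappings over the number of mappings containing that label. The mappings in `KNOWN` other than M1 and M2 are hypothetical, as are all function names.

```python
import math

# A mapping is modelled as (body_labels, head_label); only atom labels are kept.
M1 = (["Hospital", "State"], "HealthCareInst")
M2 = (["Institution"], "Hospital")
# Mappings visible in the peer's semantic view; the last two are hypothetical.
KNOWN = [M1, M2, (["Doctor"], "Physician"), (["State"], "Region")]


def atom_frequency(label, side):
    # AF (assumed form): share of atoms on the inspected side with this label.
    return side.count(label) / len(side) if side else 0.0


def inverse_mapping_frequency(label, known_mappings):
    # IMF (assumed form): smoothed log of total mappings over mappings
    # that contain the label in their body or head.
    containing = sum(
        1 for body, head in known_mappings if label in body or label == head
    )
    if containing == 0:
        return 0.0
    return math.log(1 + len(known_mappings) / containing)


def af_imf(label, mapping, known_mappings, direction):
    # Forward measure inspects the body, backward measure inspects the head.
    body, head = mapping
    side = body if direction == "forward" else [head]
    return atom_frequency(label, side) * inverse_mapping_frequency(label, known_mappings)


def top_k_mappings(label, local_mappings, known_mappings, k):
    # Score every local mapping in both directions and keep the k best.
    scored = []
    for mapping in local_mappings:
        for direction in ("forward", "backward"):
            score = af_imf(label, mapping, known_mappings, direction)
            if score > 0:
                scored.append((score, direction, mapping))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


for score, direction, mapping in top_k_mappings("Hospital", [M1, M2], KNOWN, 2):
    print(round(score, 3), direction, mapping)
```

Under these assumptions M2 ranks first for the query atom Hospital, because Hospital is the only atom in its head, while in the body of M1 it shares the weight with State. A different choice of AF or IMF would change the ranking, so the sketch only shows how the top-k step of Translate-Query could consume such scores.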
About Montpellier
• Best quality of life in France
• Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.)
• The University of Montpellier is part of the « opération campus »
• Soon we will have a direct TGV line to Barcelona (1 hour)