Decentralized Recommendation Protocols for File Sharing

1st Gossple Workshop on Social Networking (December 2010)
Large-scale data sharing by
exploiting gossiping
Esther Pacitti
SOPHIA ANTIPOLIS - MÉDITERRANÉE
Context: P2P Data Sharing
• We consider P2P online communities where participants can be
– Professionals (researchers, engineers, support staff, etc.) who
use web-scale collaboration in their workplace
– Large scale of users and data (clouds, grids, internet)
• Example of applications:
– P2P Recommendation Systems
• Useful for processing scientific workflows among participants’ peers
– P2P Query Reformulation
• Clinical case sharing among doctors or physicians
– P2P CDN
• Projects:
– ANR DataRing (2009-2012, P2P online communities )
– Datluge (2010-2012, with UFRJ, Brazil on P2P scientific
workflows)
MOTIVATIONS
Chemistry, Materials Science and Physics
Bioinformatics
Computer Science
P2PRec: document recommender
• Huge graph: G = (D, U, E, T), where
– D is the set of shared documents
– U is the set of users in the system
– E is the set of edges between the users such that there is an edge e(u,v) if
users u and v are friends
– T is the set of users’ topics of interest.
• Problem: Given a query, recommend the most relevant
documents
• Our approach
– Reduce the search space by identifying relevant users
– Identify relevant users
• Users who store/download enough high-quality documents and
become providers of a kind in specific topics
• Recommended by trusted friends
• P2P Overlay: Semantic-Gossiping
• Disseminate relevant users and their topics of interest
P2PRec*: document recommender
• Topics of interest
– With respect to the documents a user stores
– Extracted automatically
• Friendship network
– Explicit friendship (possibly leveraged with implicit friendship)
– Expresses users’ trust
– Implemented in FOAF files (Friend of a Friend files, machine-readable vocabulary
serialized in RDF/XML)
• Key-word Queries
– Mapped to topics
– Mostly related to the user’s topics of interest
• Measures to
– Check the similarity of users wrt their topics (Dice coefficient)
– Determine the relevance of a user
*Joint work with F. Draidi, P. Valduriez, B. Kemme, to appear as Inria report
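As an illustration of the user-similarity measure, here is a minimal sketch of the Dice coefficient over topic sets. It assumes each user's topics of interest are held as a plain set of labels; the function name `dice` and the convention for two empty sets are choices made for this sketch, not taken from the slides. The topic sets are the ones given for u1, u4, u5 and u6 in the gossiping example.

```python
def dice(topics_u, topics_v):
    """Dice coefficient of two topic sets: 2 * |A & B| / (|A| + |B|)."""
    total = len(topics_u) + len(topics_v)
    if total == 0:
        return 0.0  # convention chosen here for two empty topic sets
    return 2 * len(topics_u & topics_v) / total


# Topic sets of the users in the gossiping example.
topics = {
    "u1": {"t1", "t2"},
    "u4": {"t1"},
    "u5": {"t1", "t2"},
    "u6": {"t2", "t3"},
}

for other in ("u4", "u5", "u6"):
    print("u1", other, round(dice(topics["u1"], topics[other]), 3))
```

With these sets, u5 is the closest user to u1 (identical topics), u4 comes next and u6 shares only one topic out of four.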
Semantic-Gossiping
[Figure: local views of u1 and u5 before and after a gossip exchange. Each local-view entry holds a user and its gossip information (topics), e.g. u4: t1; u5: t1,t2; u6: t2,t3. u1’s FOAF file lists u1’s topics (t1,t2) and its friends (link to u5’s FOAF file, which holds u5’s topics).]
• If the distance between users u and v (Dice coefficient) > τ, ask for friendship
• If friendship is accepted, add v to the FOAF file
Relevant Users
• Users’ topics of interest are automatically
extracted using LDA*
– by inspecting the documents’ topic vectors
• A user u is considered relevant on a topic $t \in T_u$ if
a percentage of its documents have high
quality in topic t
• Each document doc at user u has
– A rate given to doc: $rate_{doc}$
– A topic vector (extracted using LDA)
• $V_{doc} = \{w_{doc,t_1}, \ldots, w_{doc,t_d}\}$
• doc is considered high quality in a topic t,
$quality_t(doc, u)$,
• if $w_{doc,t} \cdot rate_{doc}$ > a threshold value
• A user can be relevant in more than one
topic
*Latent Dirichlet Allocation (topic classifier)
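The relevance rule can be sketched as follows. This is only an illustration under stated assumptions: a document is a pair (topic vector, rate), the topic vector is a dict from topic label to weight, and the names `threshold` and `min_fraction` (the "percentage" of high-quality documents required) as well as all numeric values are hypothetical.

```python
def is_high_quality(weights, rate, topic, threshold):
    """quality_t(doc, u): the topic weight times the rate exceeds the threshold."""
    return weights.get(topic, 0.0) * rate > threshold


def relevant_topics(docs, threshold, min_fraction):
    """Topics on which a user is relevant.

    docs: list of (topic_vector, rate) pairs stored by the user.
    A topic qualifies when at least min_fraction of the user's
    documents are high quality in it (both parameters are assumptions).
    """
    if not docs:
        return set()
    all_topics = {t for weights, _ in docs for t in weights}
    result = set()
    for t in all_topics:
        good = sum(1 for w, r in docs if is_high_quality(w, r, t, threshold))
        if good / len(docs) >= min_fraction:
            result.add(t)
    return result


# Hypothetical documents: LDA-style topic weights and a rate from 1 to 5.
docs = [
    ({"t1": 0.8, "t2": 0.2}, 5),
    ({"t1": 0.6, "t2": 0.4}, 4),
    ({"t1": 0.1, "t2": 0.9}, 1),
]
print(relevant_topics(docs, threshold=2.0, min_fraction=0.5))
```

Here two of the three documents are high quality in t1 and none in t2, so the user is relevant on t1 only. The third document is strongly about t2 but poorly rated, which is why the rate is part of the product.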
Query Processing
• Implements Recommendation
• Input: Key words
• Output:
– Links to a set of good-quality documents. May include links to
documents on a friend’s topics of interest (query
expansion)
– Popularity and similarity info
• Example: doctors studying the behavior of a gene X may be glad to
learn about the diseases it can cause and check some
experimental data sets
Query Processing
[Figure: the requester issues query q with q.t = t1 and q.TTL = 2. The query is forwarded along FOAF links (u1’s FOAF file holds u1’s topics of interest and a link to u5’s FOAF file with u5’s topics), with q.TTL decreasing to 1 and then 0. Contacted users compute sim(doc,q) and return recommended documents together with a summary of document similarity and classification info.]
1) Query q is mapped to a topic or topics Tq
2) Select the top-k friends in the FOAF file wrt the query topics
(cosine similarity)
3) Redirect the query
4) Repeat 2) and 3) recursively until the TTL reaches 0
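The four steps above can be sketched as a recursive routing function. This is a simplified illustration, not the P2PRec implementation: the friendship graph and topic vectors are held in local dicts instead of FOAF files on remote peers, the function `route` only returns the set of users the query reaches, and dropping friends with zero similarity is an extra assumption of this sketch. The graph and topic weights are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse topic vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(user, query_topics, ttl, k, friends, topics, visited=None):
    """Forward the query to the top-k most similar friends until TTL is 0."""
    visited = set() if visited is None else visited
    visited.add(user)
    if ttl == 0:
        return visited
    scored = [(cosine(query_topics, topics[f]), f)
              for f in friends.get(user, []) if f not in visited]
    # Assumption of this sketch: friends with no topical overlap are skipped.
    scored = [(s, f) for s, f in scored if s > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    for _, friend in scored[:k]:
        if friend not in visited:
            route(friend, query_topics, ttl - 1, k, friends, topics, visited)
    return visited


# Hypothetical FOAF graph and topic vectors.
friends = {"u1": ["u4", "u5", "u6"], "u5": ["u7", "u3", "u2"]}
topics = {
    "u1": {"t3": 1.0}, "u2": {"t3": 1.0}, "u3": {"t2": 1.0},
    "u4": {"t1": 1.0}, "u5": {"t1": 1.0, "t2": 1.0},
    "u6": {"t2": 1.0, "t3": 1.0}, "u7": {"t1": 1.0},
}
reached = route("u1", {"t1": 1.0}, ttl=2, k=2, friends=friends, topics=topics)
print(sorted(reached))
```

With TTL 2 the query on t1 reaches u4 and u5 directly and u7 through u5, while u6, u3 and u2 are never contacted.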
Conclusions P2PRec
• P2PRec (BDA2010)
– Find friends (relevant users on similar topics) while gossiping
– Query processing exploits relevant users wrt the query
topics, recursively (FOAF friends)
– Perf. Evaluation
• Recall x Precision x Response Times
– Limitation of LDA: needs some centralization for training, but
good to validate our general approach
– However there are other possibilities:
• Ontology based automatic annotation
• This exists for biomedical documents
P2P Query Reformulation*
• P2P Data Management System (PDMS)
• Each peer has:
– Its own schema (and data)
– One or more mapping acquaintances, i.e. peers to/from which at
least one mapping rule exists
• Goal: Given a query, exploit mapping acquaintances as
much as possible to enhance query responses.
[Figure: peers A and B, each with its own schema and data, linked by the mapping Mb,a; example query ?= Hospital(x, “San Francisco”)]
*Joint work with A. Bonifati, G. Summa, P. Valduriez, to appear as Inria report
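Concepts
[Figure: query ?= Q posed at peer B and translated ALONG mapping Mb,a towards peer A; each peer has its own schema and data.]
• Query: Hospital($X, “San Francisco”)
• Rewriting: HealthCareInst($X, “San Francisco”, $Z)
• Source schema
– Hospital [0..*]: name, location
– Grant [0..*]: amount, institution, manager
– Doctor [0..*]: name, salary
• Target schema
– HealthCareInst [0..*]: name, city, id
– Grant [0..*]: amount, scientist
• MAPPING RULE Mb,a: Hospital(x, y) ⇢ HealthCareInst(x, y, z)
– BODY: Hospital(x, y)
– HEAD: HealthCareInst(x, y, z)
– Hospital(…) and HealthCareInst(…) are atoms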
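The rewriting on this slide, from Hospital($X, “San Francisco”) to HealthCareInst($X, “San Francisco”, $Z), can be reproduced with a small substitution routine. This is a minimal sketch assuming a single-atom body and head, atoms encoded as (label, args) tuples, and constants written as strings that keep their quotes; the function name `rewrite_along` is hypothetical, and renaming of unbound head variables to fresh names is omitted.

```python
def rewrite_along(query, body_atom, head_atom):
    """Translate a query atom along a mapping body ⇢ head.

    Atoms are (label, args) tuples. Returns the rewritten atom, or None
    when the query does not match the body (label or arity differs).
    """
    q_label, q_args = query
    b_label, b_args = body_atom
    if q_label != b_label or len(q_args) != len(b_args):
        return None
    # Bind each body variable to the term at the same position in the query.
    binding = dict(zip(b_args, q_args))
    h_label, h_args = head_atom
    # Head variables that do not occur in the body (here z) stay as they are.
    return (h_label, tuple(binding.get(arg, arg) for arg in h_args))


query = ("Hospital", ("x", '"San Francisco"'))
body = ("Hospital", ("x", "y"))
head = ("HealthCareInst", ("x", "y", "z"))
print(rewrite_along(query, body, head))
```

The constant bound to y in the body is carried over to the second position of the head, and z remains a free variable, as in the slide.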
Mapping Relevance
• Each time a query is translated by exploiting a mapping, we get a
Relevant Rewriting
• The relevance can be Forward (along) or Backward (against),
depending on the matched side of the mapping
• Goal:
– Collect as many rewritings as possible
– Find the most interesting paths to take (avoid useless paths)
?= Hospital(x, “San Francisco”)
M1 Hospital(x, y) ⇢ HealthCareInst(x, y, z)
M2 Institution(x, y, z) ⇢ Hospital(x, y)
Problem
[Figure: query ?= Q at peer B, with peers A, C and D; mappings Mb,a (ALONG) and Mc,a (AGAINST).]
1) How to choose the most relevant paths to
undertake in the reformulation task?
2) Are there other peers in the network which can be
contacted?
Acquaintances
• Gossiping acquaintances
– Potential friends that dynamically appear in the local
semantic view (LSV)
• Mapping acquaintance
– There is at least 1 direct mapping towards it (friend)
– Established manually
• Social acquaintance (FOAF friend)
– No direct mapping is needed towards it
– There are some common interests
– Established explicitly
Our Approach
• Gossip to disseminate mapping rules information to
find friends
• Users’ topics of interest
– are expressed according to the schema information or
past query topics
• Measures to
– Compute the relevance of a mapping wrt a query
– Compute similarity between users
• Exploit recursively (to translate a query)
– Mapping acquaintances
– Social acquaintances
Gossiping Acquaintances
Social Acquaintances
• Friend
– Share common topics of interest
• Interests
– Formulated by queries
– Elements of peer’s schema
• Approach: use the semantic view to discover friends
?= Hospital(x, “San Francisco”)
?= State( y, z, “California”)
?= Doctor( w, k)
?= Patology(“heart”, x)
………
Compute Relevance
Goal: Given a query and a mapping rule,
determine if the mapping is relevant to the query
Method (Standard Match Semantics)
– Atom label matching
– Parameter compatibility
?= Hospital(x, “San Francisco”)
M1 Hospital(x, y) AND State (x,z) ⇢ HealthCareInst(x, y, z)
M2 Hospital(x, y,w) AND State (x,z) ⇢ HealthCareInst(x, y, z)
M3 Ospedale(x,y) AND State (x,z) ⇢ HealthCareInst(x, y, z)
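A possible reading of the two matching criteria, applied to the query and the mappings M1 to M3 above, is sketched below. It assumes that "parameter compatibility" means at least equal arity, which is a simplification made for this sketch; the `Atom` and `Mapping` types and the function names are hypothetical. The fourth mapping is the Institution ⇢ Hospital rule from the Mapping Relevance slide, included to show a backward match.

```python
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    label: str
    args: tuple


class Mapping(NamedTuple):
    body: tuple  # atoms on the left of ⇢
    head: tuple  # atoms on the right of ⇢


def atom_matches(query_atom, mapping_atom):
    """Label matching plus a simplified parameter check (same arity)."""
    return (query_atom.label == mapping_atom.label
            and len(query_atom.args) == len(mapping_atom.args))


def matched_side(query_atom, mapping) -> Optional[str]:
    """'forward' if the body matches, 'backward' if the head does, else None."""
    if any(atom_matches(query_atom, a) for a in mapping.body):
        return "forward"
    if any(atom_matches(query_atom, a) for a in mapping.head):
        return "backward"
    return None


q = Atom("Hospital", ("x", '"San Francisco"'))
target = (Atom("HealthCareInst", ("x", "y", "z")),)
m1 = Mapping((Atom("Hospital", ("x", "y")), Atom("State", ("x", "z"))), target)
m2 = Mapping((Atom("Hospital", ("x", "y", "w")), Atom("State", ("x", "z"))), target)
m3 = Mapping((Atom("Ospedale", ("x", "y")), Atom("State", ("x", "z"))), target)
m4 = Mapping((Atom("Institution", ("x", "y", "z")),), (Atom("Hospital", ("x", "y")),))

for name, m in (("M1", m1), ("M2", m2), ("M3", m3), ("M4", m4)):
    print(name, matched_side(q, m))
```

Under these assumptions only M1 is relevant among the three mappings of this slide: M2 fails on the number of parameters and M3 on the atom label.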
Compute Relevance
• AF-IMF Measure, inspired by TF-IDF*
• AF (Atom Frequency)
– Local measure, establishing the importance of the
query atom in the current mapping
• IMF (Inverse Mapping Frequency)
– Distributed measure, establishing the overall
importance of the query atom
• Relevance of a mapping wrt q is AF * IMF
*term frequency-inverse document frequency
Compute Relevance (AF)

About the applied measure
◦ To increase the effectiveness of the measure we distinguish,
again, Forward/Backward relevance
– FORWARD MEASURE: computed on the mapping body
– BACKWARD MEASURE: computed on the mapping head
?= Hospital(x, “San Francisco”)
M1 Hospital(x, y) AND State (x,z) ⇢ HealthCareInst(x, y, z)
M2 Institution(x, y, z) ⇢ Hospital(x, y)
Compute Relevance (IMF)
• IMF requires a way to get a value for
– The total number of mappings
– The total number of mappings containing that
atom
• To do that, we can inspect the semantic
view of the peer
– Also by sending inquiries to peers in the FOAF
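The slides do not give the exact AF and IMF formulas, so the sketch below simply transposes the usual TF-IDF definitions: AF as the share of atoms on the matched side (body or head) that carry the query atom's label, and IMF as the logarithm of the total number of mappings over the number of mappings containing that atom. Both definitions, the function names and the counts in the example are assumptions made for illustration only.

```python
import math


def atom_frequency(query_label, side_labels):
    """AF (assumed form): share of atoms on one mapping side with this label."""
    if not side_labels:
        return 0.0
    return side_labels.count(query_label) / len(side_labels)


def inverse_mapping_frequency(total_mappings, mappings_with_atom):
    """IMF (assumed form): log of total mappings over mappings with the atom."""
    if mappings_with_atom == 0:
        return 0.0
    return math.log(total_mappings / mappings_with_atom)


def af_imf(query_label, side_labels, total_mappings, mappings_with_atom):
    """Relevance of a mapping wrt the query atom: AF * IMF."""
    return (atom_frequency(query_label, side_labels)
            * inverse_mapping_frequency(total_mappings, mappings_with_atom))


# Forward measure on the body of M1: Hospital(x, y) AND State(x, z).
# The counts 8 and 2 are hypothetical values read from a semantic view.
score = af_imf("Hospital", ["Hospital", "State"], total_mappings=8, mappings_with_atom=2)
print(round(score, 4))
```

As with TF-IDF, an atom that appears in most known mappings gets an IMF close to zero, so mappings matched only through very common atoms rank low.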
Translate-Query
• Compute Relevance on Local Mappings wrt Q
– Choose the TopK Mappings
– Apply the translation semantics, along/against the
mapping direction
– Trigger Translate-Query on the mapping
acquaintance, recursively (until TTL)
• Select FOAF friends to be contacted
– By looking at the best Mapping summaries wrt Q
– Trigger Translate-Query on the social
acquaintance, recursively (until TTL)
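The recursive structure of Translate-Query can be sketched in a heavily simplified form. The assumptions are: mappings are reduced to single labels (body label, head label, acquaintance), all peers live in one local dict, the relevance measure is passed in as a function, and the FOAF branch is left out. Peer D and the label Clinic are hypothetical, as is the constant relevance function used in the example.

```python
def translate_query(peer, label, ttl, k, mappings, relevance, seen=None):
    """Collect (peer, label) rewritings reachable within ttl hops.

    mappings: peer -> list of (body_label, head_label, acquaintance).
    relevance: function (label, mapping) -> score, standing in for AF-IMF.
    """
    seen = set() if seen is None else seen
    if (peer, label) in seen:
        return []
    seen.add((peer, label))
    results = [(peer, label)]
    if ttl == 0:
        return results
    candidates = []
    for mapping in mappings.get(peer, []):
        body, head, acquaintance = mapping
        if label == body:    # translate along the mapping direction
            candidates.append((relevance(label, mapping), head, acquaintance))
        elif label == head:  # translate against the mapping direction
            candidates.append((relevance(label, mapping), body, acquaintance))
    candidates.sort(key=lambda c: c[0], reverse=True)
    for _, new_label, acquaintance in candidates[:k]:  # TopK mappings
        results.extend(translate_query(acquaintance, new_label, ttl - 1,
                                       k, mappings, relevance, seen))
    return results


def constant_relevance(label, mapping):
    """Placeholder score so that the example stays deterministic."""
    return 1.0


mappings = {
    "B": [("Hospital", "HealthCareInst", "A"), ("Institution", "Hospital", "C")],
    "A": [("HealthCareInst", "Clinic", "D")],
}
print(translate_query("B", "Hospital", 2, 2, mappings, constant_relevance))
```

The `seen` set prevents the same rewriting from being processed twice at a peer, which matters once mapping paths form cycles.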
Performance Evaluation
• Baseline
– No gossiping, original query propagated
• Baseline+
– No gossiping, translated query
propagated
• Baseline#
– No gossiping, translated query
propagated, local measure to sort
mappings (by using AF only)
• Full-
– Gossiping, translated query propagated,
AF-IMF measure to sort mappings, no
FOAF links (only local mappings)
• Full (P2PRec)
– Gossiping, translated query
propagated, AF-IMF measure to sort
mappings, FOAF links exploited
[Figure: Recall (0% to 100%) as a function of the TopK Mapping Threshold for Baseline, Baseline+, Baseline#, Full- and Full. Caption: Effectiveness of AF-IMF, LSV and gossiping]
Conclusions
P2P Query Reformulation
– Gossiping is used to disseminate mapping rule
information
– Exploits relevant mappings recursively
• Mapping acquaintances
• Social acquaintances
– Initial Perf. Results:
• Very good recall results (over 90%)
• Linear scale-up
• Trade-off between Recall and Response Time
– Previous work uses
• DHTs or a centralized mediation model.
About Montpellier
Best quality of life in France
Important laboratories (LIRMM) and research institutes (INRA, CIRAD, etc.)
University of Montpellier is part of the « opération campus »
Soon we will have a direct TGV line to Barcelona (1 hour)