Peer-to-Peer Discovery of
Semantic Associations
Matthew Perry, Maciej Janik, Cartic Ramakrishnan,
Conrad Ibanez, Budak Arpinar, Amit Sheth
2nd International Workshop on Peer-to-Peer Knowledge Management,
San Diego, California, July 17, 2005
Semantic Discovery [1]
From: Finding things
To: Finding out about things
Relationships!
1. http://lsdis.cs.uga.edu/semdis
Semantic Associations
• Relationship-centric nature of Semantic
Web data models
• We can ask questions about the
relationships between objects
• How is entity A related to entity B?
• Applications
– National Security – Insider Threat [1]
– Improved Searching – BioPatentMiner [2]
1. B. Aleman-Meza, P. Burns, M. Eavenson, D. Palaniswami, A. Sheth, An Ontological Approach to the Document Access Problem of Insider Threat, Proceedings of the IEEE Intl. Conference on Intelligence and Security Informatics (ISI-2005), May 19-20, 2005.
2. Sougata Mukherjea, Bhuvan Bamba, BioPatentMiner: An Information Retrieval System for BioMedical Patents, VLDB 2004.
Semantic Associations
Define a set of operators ρ for querying complex relationships between entities (Semantic Associations) [1]
[Figure: example RDF graph with resources &r1, &r5 and &r6 and name literals “Matt”, “Perry”, “LSDIS Lab” and “The University of Georgia”; a ρ-path between resources is marked as the Semantic Association]
1. Adapted from: Kemafor Anyanwu and Amit Sheth, ρ-Queries: Enabling Querying for Semantic Associations on the Semantic Web, The Twelfth International World Wide Web Conference, Budapest, Hungary, pp. 690-699.
Uniqueness of Semantic Association Queries
• Simple query specification (only the two
endpoints)
• Doesn’t require extensive knowledge of
schema
ρ-path (A, B)
Difficult to express with existing Query Languages
SELECT ?startURI, ?property_1, ?endURI
FROM (?startURI ?property_1 ?endURI)
SELECT ?startURI, ?property_1, ?endURI
FROM (?endURI ?property_1 ?startURI)
SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
FROM (?startURI ?property_1 ?x)(?x ?property_2 ?endURI)
WHERE ?startURI ne ?x && ?endURI ne ?x
SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
FROM (?startURI ?property_1 ?x)(?endURI ?property_2 ?x)
WHERE ?startURI ne ?x && ?endURI ne ?x
SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
FROM (?x ?property_1 ?startURI)(?x ?property_2 ?endURI)
WHERE ?startURI ne ?x && ?endURI ne ?x
SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
FROM (?x ?property_1 ?startURI)(?endURI ?property_2 ?x)
WHERE ?startURI ne ?x && ?endURI ne ?x
RDQL: Find paths of length at most 2 from
startURI to endURI
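The six queries above cover every combination of edge directions for paths of length 1 and 2. A minimal sketch of why this does not scale: the helper below (a hypothetical name, not part of the original system) generates one list of triple patterns per direction assignment, so the number of queries grows as 2 + 4 + ... + 2^k.

```python
from itertools import product

def path_patterns(k):
    """Yield one list of triple patterns per edge-direction assignment,
    for every path length from 1 to k."""
    for length in range(1, k + 1):
        # Intermediate variables ?x1, ?x2, ... sit between the two endpoints.
        nodes = ["?startURI"] + [f"?x{i}" for i in range(1, length)] + ["?endURI"]
        for directions in product((True, False), repeat=length):
            triples = []
            for i, forward in enumerate(directions):
                s, o = (nodes[i], nodes[i + 1]) if forward else (nodes[i + 1], nodes[i])
                triples.append(f"({s} ?property_{i + 1} {o})")
            yield triples

for k in (1, 2, 3, 4):
    print(k, len(list(path_patterns(k))))
```

For k = 2 this yields the six patterns listed above; each extra hop doubles the number of patterns added.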
Why Semantic Associations in P2P?
• Data on the web by its nature is
distributed
• Knowledge will be stored in multiple stores
and multiple ontologies
• Search for semantic paths will have to
include many knowledge sources
• In the spirit of the Semantic Web
(collaborative knowledge discovery)
Contributions
• Super-Peer Architecture for Querying
Semantic Associations
• Knowledgebase Borders and Distances
between Borders
• Query Planning Algorithm based on
Knowledgebase Borders and Distances
Assumptions
• Pair-wise mapping of resources between
peers (solution to Entity Disambiguation /
Reference Reconciliation problem)
• We use URIs to solve Entity
Disambiguation problem
• Main focus is Query Planning over P2P
network
• Not concerned with fault tolerance, details
of network formation, etc. at this point
[Figure: example RDF schema and instance graph. Schema level: classes Passenger, Ticket, Flight, FFlyer, Payment, Customer, Bank, Account, CCard, Cash and Client, related by subClassOf (isA) and subPropertyOf, with properties such as number, for, purchased, paidby, holder, fflierno, ffid, FFNo and lname over String and float values. Instance level: resources &r1 to &r9 and &r11, linked to the schema by typeOf (instance), with literals such as “John”, “Smith”, “Bill”, “Jones” and ffid “XYZ123”.]
ρ-path Problem (k-hop limited)
• Given:
– An RDF instance graph G, vertices a and b in
G, an integer k
• Find:
– All simple, undirected paths p, with length less
than or equal to k, which connect a and b
Distributed ρ-path problem: Find all paths from
a start node to an end node over the distributed
RDF graphs
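The single-graph problem statement above can be sketched as a depth-limited search. This is an illustrative sketch, not the authors' implementation; the triples and the function name `rho_path` are assumptions made for the example.

```python
from collections import defaultdict

def rho_path(triples, a, b, k):
    """Return all simple, undirected paths from a to b with at most k edges.
    Each path is a list of (node, property, direction, next_node) steps."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o, "->"))  # traverse the triple subject to object
        adj[o].append((p, s, "<-"))  # traverse the same triple object to subject
    paths = []

    def extend(node, visited, path):
        if node == b and path:
            paths.append(list(path))
            return
        if len(path) == k:
            return  # hop limit reached
        for prop, nxt, direction in adj[node]:
            if nxt not in visited:  # keep the path simple
                visited.add(nxt)
                path.append((node, prop, direction, nxt))
                extend(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    extend(a, {a}, [])
    return paths

# Hypothetical toy graph: a and b share neighbor x, and are also linked via y and z.
triples = [("a", "p", "x"), ("b", "q", "x"),
           ("a", "r", "y"), ("y", "s", "z"), ("z", "t", "b")]
print(rho_path(triples, "a", "b", 2))
print(len(rho_path(triples, "a", "b", 3)))
```

With k = 2 only the path through x qualifies; raising k to 3 also admits the path through y and z.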
Knowledge bases - ontologies
What do we need?
• Efficiently explore node neighborhoods
• When to stop a search in one peer and
continue it in another
• Determine the search distance in each
peer
• Determine which peers to include in the
search
Approach
Peer: RDF data store (Sesame, BRAHMS); answers ρ-path queries and returns a subgraph
Super-Peer: no data store; responsible for Query Planning
[Figure: a ρ-path (a, b, k) query is turned into a ρ-plan, split into ρ-sub-plans across Super-Peers, and sent as ρ-path queries to Peers; each Peer holds a KB and returns a subgraph]
Knowledgebase Borders
[Figure: Peer 1 and Peer 2 knowledge bases; the nodes in their overlap are the border nodes of the Peer_1:Peer_2 border]
Distance Between Borders
[Figure: Peer 1, Peer 2 and Peer 3 with their border nodes (P1:P2, P1:P3, P2:P3) and the query end points Start and End]
dist (P1:P2, P1:P3) = 3
dist (P1:P2, P2:P3) = 1
dist (P1:P3, P2:P3) = 1
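A minimal sketch of how a peer could compute the minimum distance between two of its borders, assuming an undirected adjacency map of its local graph. The helper names and the toy data are assumptions; the slides do not specify the peers' actual procedure.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def border_distance(adj, border_a, border_b):
    """Smallest hop count between a node of border_a and a different node
    of border_b; math.inf when no such pair is connected."""
    best = math.inf
    for u in border_a:
        dist = bfs_distances(adj, u)
        for v in border_b:
            if v != u and v in dist:
                best = min(best, dist[v])
    return best

# Hypothetical local graph of one peer: a chain n1 - n2 - n3 - n4.
adj = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2", "n4"], "n4": ["n3"]}
print(border_distance(adj, {"n1"}, {"n4"}))  # borders three hops apart
print(border_distance(adj, {"n1"}, {"n1"}))  # single-node border to itself
```

Running one BFS per border node is the simplest option; a peer with large borders would likely want something cheaper.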
Query Planning Graph
• Directed Graph
• Node for each distinct border
• For each pair of connected borders, create
2 edges (one in each direction)
• Weight is the minimum of the minimum
distances (reported by peers)
– For example you can get from A:B to A:B:C
through either A or B
Query Planning Graph
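Borders: AB, AC, BC, ABC
Minimum Distances
dist (AB, BC) = 4
dist (AB, AC) = 3
dist (AB, ABC) = 2
dist (BC, AC) = 5
dist (BC, ABC) = 3
dist (AC, ABC) = 2
dist (AB, AB) = 3
dist (AC, AC) = 3
dist (BC, BC) = 2
dist (ABC, ABC) = ∞
[Figure: peers A, B and C with their overlaps, and the resulting query planning graph with one node per border (AB, AC, BC, ABC) and edges weighted by the distances above]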
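The construction rules for the Query Planning Graph can be sketched as follows, using the minimum distances from this slide as input. The `build_qpg` function and the dictionary layout are assumptions for illustration; when several peers report a distance for the same pair of borders, the smallest one is kept.

```python
import math

def build_qpg(reports):
    """reports: iterable of (border_1, border_2, distance), possibly repeated
    when more than one peer reports the same pair.
    Returns {border: {neighbor: weight}} with an edge in each direction."""
    qpg = {}
    for b1, b2, d in reports:
        qpg.setdefault(b1, {})
        qpg.setdefault(b2, {})
        if math.isinf(d):
            continue  # unreachable pair: no edge
        for src, dst in ((b1, b2), (b2, b1)):
            qpg[src][dst] = min(d, qpg[src].get(dst, math.inf))
    return qpg

reports = [
    ("AB", "BC", 4), ("AB", "AC", 3), ("AB", "ABC", 2),
    ("BC", "AC", 5), ("BC", "ABC", 3), ("AC", "ABC", 2),
    ("AB", "AB", 3), ("AC", "AC", 3), ("BC", "BC", 2),
    ("ABC", "ABC", math.inf),
]
qpg = build_qpg(reports)
print(qpg["AB"])
```

Self-loops such as AB to AB are kept because a path may leave a border and re-enter it through a different border node.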
Using the Query Planning Graph
Example Query: ρ-path (start, end, 10)
1) Find Start and End Points
2) Compute Distances to Borders
3) Add this Information to QPG
4) Find all paths from start to end (including cycles) <= k (10)
In this case 22 paths
5) Convert Set of Paths to Set of Queries
[Figure: peers A, B and C with the start and end nodes and their distances to the borders; the QPG over AB, AC, BC and ABC extended with temporary start and end nodes; example paths run through Peer_A:Peer_B, Peer_B:Peer_C and Peer_A:Peer_C]
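Step 4 above can be sketched as a bounded search over the QPG in which nodes may repeat. The graph and weights below are a hypothetical toy example, not the QPG from the slide, and the sketch assumes a path stops the first time it reaches the end node.

```python
def bounded_paths(qpg, start, end, k):
    """All walks from start to end with total weight <= k. Nodes may repeat.
    Assumes every edge weight is positive, so the search terminates."""
    results = []
    stack = [(start, [start], 0)]
    while stack:
        node, path, weight = stack.pop()
        if node == end:
            results.append((path, weight))
            continue  # assumption: stop extending once end is reached
        for nxt, w in qpg.get(node, {}).items():
            if weight + w <= k:
                stack.append((nxt, path + [nxt], weight + w))
    return results

# Hypothetical QPG: one border X with a self-loop between start and end.
toy_qpg = {"start": {"X": 2}, "X": {"X": 3, "end": 3}}
for path, weight in bounded_paths(toy_qpg, "start", "end", 10):
    print(weight, " -> ".join(path))
```

The self-loop on X is what produces a second path here; a third pass through it would exceed k = 10.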
Converting Paths to Queries
start –2→ A:B –3→ A:C –3→ end
• Each edge (pair of endpoints) represents a query
• For example, ρ-path (start, Peer_A:Peer_B, 2)
What is the correct hop-limit?
hop-limit = edge weight + (k – path weight)
With k = 10 and path weight 2 + 3 + 3 = 8, each edge gains 2 extra hops:
ρ-path (start, Peer_A:Peer_B, 4)
ρ-path (Peer_A:Peer_B, Peer_A:Peer_C, 5)
ρ-path (Peer_A:Peer_C, end, 5)
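The hop-limit formula can be checked against the example path with a short sketch. The function name and the tuple layout are assumptions for illustration.

```python
def hop_limits(path_edges, k):
    """path_edges: list of (from_node, to_node, weight) along one QPG path.
    Applies hop-limit = edge weight + (k - path weight) to every edge."""
    path_weight = sum(w for _, _, w in path_edges)
    slack = k - path_weight  # hops left over once the minimum path is paid for
    return [(u, v, w + slack) for u, v, w in path_edges]

example_path = [
    ("start", "Peer_A:Peer_B", 2),
    ("Peer_A:Peer_B", "Peer_A:Peer_C", 3),
    ("Peer_A:Peer_C", "end", 3),
]
for u, v, limit in hop_limits(example_path, 10):
    print(f"rho-path ({u}, {v}, {limit})")
```

Giving the whole slack to every edge means no single query cuts off a path that could still fit within k overall.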
Find the maximum hop-limit for each pair of end points
Pair                                       Hop-limit
(start, Peer_A:Peer_B)                     5
(start, Peer_A:Peer_B:Peer_C)              7
(start, Peer_B:Peer_C)                     8
(Peer_A:Peer_B, Peer_A:Peer_C)             5
(Peer_A:Peer_B, Peer_A:Peer_B:Peer_C)      5
(Peer_A:Peer_B, Peer_A:Peer_B)             3
(Peer_A:Peer_B, Peer_B:Peer_C)             6
(Peer_A:Peer_C, Peer_A:Peer_B:Peer_C)      3
(Peer_A:Peer_C, Peer_B:Peer_C)             6
(Peer_A:Peer_C, end)                       5
(Peer_B:Peer_C, end)                       8
(Peer_B:Peer_C, Peer_B:Peer_C)             6
(Peer_B:Peer_C, Peer_A:Peer_B:Peer_C)      5
(Peer_A:Peer_B:Peer_C, end)                6
Which Peer gets each query?
ρ-path (Peer_B:Peer_A, Peer_A:Peer_C, 5) → Peer_A
ρ-path (Peer_B:Peer_C, Peer_B:Peer_C, 5) → Peer_B and Peer_C
[Figure: Peer_A, Peer_B and Peer_C with the two example queries drawn inside the peers that receive them]
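The two examples above are consistent with a simple rule: a query goes to every peer that belongs to both of its endpoints. The sketch below implements that reading. The rule itself, the helper names and the placement of start and end are assumptions inferred from the slides, not a statement of the authors' algorithm.

```python
def members(name, endpoint_peers):
    """Peers that hold the given QPG node. A border name such as
    'Peer_A:Peer_B' lists its peers; start/end are looked up separately."""
    if name in endpoint_peers:
        return {endpoint_peers[name]}
    return set(name.split(":"))

def peers_for_query(src, dst, endpoint_peers):
    """Peers that contain both endpoints of a rho-path query."""
    return members(src, endpoint_peers) & members(dst, endpoint_peers)

# Assumed for illustration: start lies in Peer_B and end lies in Peer_C.
endpoint_peers = {"start": "Peer_B", "end": "Peer_C"}
print(peers_for_query("Peer_B:Peer_A", "Peer_A:Peer_C", endpoint_peers))
print(sorted(peers_for_query("Peer_B:Peer_C", "Peer_B:Peer_C", endpoint_peers)))
print(peers_for_query("start", "Peer_A:Peer_B", endpoint_peers))
```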
Final Query Plan
[Figure: the final query plan for Peer_A, Peer_B and Peer_C; each peer receives a list of ρ-path queries, each given as a FROM node, a TO node (a border, start or end) and a Hop Limit]
Query Execution at Peer
Input:
Set of Queries: { ρ-path ({uri, …}, {uri, …}, k), …}
Algorithm:
Graph Traversal of Main Memory representation
Bi-directional BFS
Results in a set of statements
Output:
Union of each set of statements
Query Execution at Peer
• Peer does not enumerate paths
• Returns a subgraph (set of triples)
• Benefits
– Eliminates redundant data transfer
– Saves computation time
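One possible reading of the peer-side step is sketched below: run a bounded BFS from the source set and another from the target set, then keep every triple that can lie on a connection of at most k hops. This is a sketch under those assumptions, not the peers' actual code; it returns a set of triples rather than enumerated paths, and the set may include triples that only lie on non-simple walks.

```python
from collections import deque

def bounded_bfs(adj, sources, limit):
    """Hop distance from the nearest source, explored up to `limit` hops."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def rho_subgraph(triples, sources, targets, k):
    """Triples usable on some undirected connection of length <= k between
    a source and a target."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    from_src = bounded_bfs(adj, sources, k)
    from_dst = bounded_bfs(adj, targets, k)
    keep = set()
    for s, p, o in triples:
        for u, v in ((s, o), (o, s)):  # the triple may be crossed either way
            if u in from_src and v in from_dst and from_src[u] + 1 + from_dst[v] <= k:
                keep.add((s, p, o))
    return keep

# Hypothetical local data for one peer.
triples = [("a", "p", "x"), ("b", "q", "x"), ("a", "r", "y"),
           ("y", "s", "z"), ("z", "t", "b"), ("x", "u", "w")]
queries = [({"a"}, {"b"}, 2), ({"a"}, {"b"}, 3)]
result = set().union(*(rho_subgraph(triples, s, t, k) for s, t, k in queries))
print(sorted(result))
```

Taking the union across queries is what removes the redundant transfer: a triple shared by many paths is sent once.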
Scalability: Multiple Super-Peers
Super-Peer/Super-Peer Borders
• Super-Peer_1:Super-Peer_2
• Super-Peer_1:Super-Peer_3
• Super-Peer_2:Super-Peer_3
Super-Peer/Peer Borders
• Peer_B:Super-Peer_2
• Peer_A:Super-Peer_3
• Peer_C:Super-Peer_3
[Figure: Super-Peer_1 with its peers Peer_A, Peer_B and Peer_C, bordering Super-Peer_2 and Super-Peer_3]
Integration of SP graph and Peer Graph
Super-Peer_1’s new Peer-Level QPG
[Figure: Super-Peer_1's peer-level QPG extended with super-peer borders; nodes A:B, A:C, B:C, A:B:C, A:SP3, B:SP2, C:SP3, SP1:SP2 and SP1:SP3, with edge weights between 0 and 5]
Query Planning Algorithm
[Figure: super-peers SP1, SP2 and SP3 with peers A, B, C, D and E; start and end are the query end points]
1) Find start and end points
2) Compute distances to borders
3) Add temporary information for endpoints (both peer and super-peer QPG)
4) Find all directed paths <= k connecting start to end in the Super-Peer QPG
[Figure: Super-Peer QPG with nodes start, end, SP1:SP2, SP1:SP3 and SP2:SP3 and weighted edges]
k = 10
start –6→ SP1/SP3 –2→ SP1/SP3 –2→ end
start –6→ SP1/SP3 –2→ end
start –3→ SP1/SP2 –6→ end
start –10→ end
5) Form a list of sub-query-plan requests for each super-peer
Super-Peer_1
FROM                        TO                          Hop-Limit
start                       end                         10
start                       Super-Peer_1:Super-Peer_2   4
Super-Peer_1:Super-Peer_2   end                         7
start                       Super-Peer_1:Super-Peer_3   8
Super-Peer_1:Super-Peer_3   Super-Peer_1:Super-Peer_3   2
Super-Peer_1:Super-Peer_3   end                         4
Super-Peer_3
FROM                        TO                          Hop-Limit
Super-Peer_1:Super-Peer_3   Super-Peer_1:Super-Peer_3   2
7) Each super-peer goes through the previous process on its peer QPG
to form a list of ρ-path queries for its peers
[Figure: the resulting ρ-path query lists for Peers A, B, C and E; each entry is given as a FROM node, a TO node (a peer border, a super-peer border, start or end) and a Hop Limit]
8) Querying peer now communicates directly with other peers to execute
the ρ-path queries
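The hop limits in the sub-query-plan requests of step 5 can be reproduced from the four super-peer paths of step 4, by applying hop-limit = edge weight + (k – path weight) to each path and keeping the maximum per pair of end points. A minimal sketch, with the function name and data layout as assumptions:

```python
def max_hop_limits(paths, k):
    """paths: list of paths, each a list of (from_node, to_node, weight).
    Returns the largest hop limit seen for each (from_node, to_node) pair."""
    best = {}
    for edges in paths:
        slack = k - sum(w for _, _, w in edges)
        for u, v, w in edges:
            best[(u, v)] = max(w + slack, best.get((u, v), 0))
    return best

# The four paths found in the Super-Peer QPG for k = 10.
sp_paths = [
    [("start", "SP1/SP3", 6), ("SP1/SP3", "SP1/SP3", 2), ("SP1/SP3", "end", 2)],
    [("start", "SP1/SP3", 6), ("SP1/SP3", "end", 2)],
    [("start", "SP1/SP2", 3), ("SP1/SP2", "end", 6)],
    [("start", "end", 10)],
]
for (u, v), limit in max_hop_limits(sp_paths, 10).items():
    print(f"FROM: {u}  TO: {v}  Hop-Limit: {limit}")
```

The pair (start, SP1/SP3) shows why the maximum matters: the first path allows 6 hops, the second allows 8, and the request carries 8.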
Conclusions and Future Work
• Presented a Query-Planning Algorithm for ρ-path queries over a distributed data set
• Problems
– Efficiently compute node neighborhoods
– How to continue searches across KBs
– How to check for the many possible cases
– How to determine search length in each KB
Conclusions and Future Work
• Future Work
– Performance Testing
– Effect of relative border size
– Different criteria for group formation
– How to accommodate other types of queries
Questions?
Computing Borders
Super-Peer maintains Sorted Map of URIs
• Peer Border
– Traverse new list and update Sorted Map
• Super Peer Border
– Don’t care about other URIs not in this group
– Keep total data transferred at a minimum
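A minimal sketch of how a super-peer could derive peer borders from its URI index. Python has no built-in sorted map, so this sketch uses a plain dict and sorts the keys when reading; the function names and the example URIs are hypothetical.

```python
def add_peer(index, peer, uris):
    """Record which peers hold each URI. index: {uri: set of peer names}."""
    for uri in uris:
        index.setdefault(uri, set()).add(peer)

def borders(index):
    """Group URIs held by more than one peer by the exact set of peers
    holding them, e.g. 'A:B' or 'A:B:C'."""
    result = {}
    for uri in sorted(index):
        peers = index[uri]
        if len(peers) > 1:
            result.setdefault(":".join(sorted(peers)), []).append(uri)
    return result

index = {}
add_peer(index, "A", ["http://example.org/r1", "http://example.org/r2"])
add_peer(index, "B", ["http://example.org/r2", "http://example.org/r3"])
add_peer(index, "C", ["http://example.org/r2", "http://example.org/r3"])
print(borders(index))
```

Adding a peer only touches the entries for that peer's own URIs, which matches the idea of traversing the new list and updating the map.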
Forming the Network
[Figure: new peer P New joining a network of super-peers SP1, SP2 and SP3 with existing peers P1 and P2]
1) P New: I want to join the network
2) Super-peers: I am a super-peer
3) P New broadcasts its list of URIs
4) SPs compute overlap, O(n log k) (maintain border information)
5) SPs send overlap count to new peer
6) New peer picks one super-peer (accept one, reject the others)
7) SP1 updates URI index
8) SP recomputes borders
9) SP to peers: Here are your permanent borders
10) Peers send minimum distances
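Step 4 can be sketched as one binary search per URI of the new peer against a super-peer's sorted URI list, which gives the O(n log k) cost for n new URIs and k indexed URIs. Choosing the super-peer with the largest overlap in step 6 is an assumption made here; the slides only say that the new peer picks one. All names and data are hypothetical.

```python
from bisect import bisect_left

def overlap_count(sorted_uris, new_uris):
    """Number of new_uris already present in a super-peer's sorted URI list."""
    count = 0
    for uri in new_uris:
        i = bisect_left(sorted_uris, uri)  # O(log k) lookup
        if i < len(sorted_uris) and sorted_uris[i] == uri:
            count += 1
    return count

def pick_super_peer(indexes, new_uris):
    """indexes: {super_peer: sorted list of URIs}. Returns (choice, counts).
    Assumption: the new peer prefers the super-peer with the largest overlap."""
    counts = {sp: overlap_count(uris, new_uris) for sp, uris in indexes.items()}
    return max(counts, key=counts.get), counts

indexes = {
    "SP1": sorted(["urn:r1", "urn:r2", "urn:r3"]),
    "SP2": sorted(["urn:r3", "urn:r9"]),
    "SP3": sorted(["urn:r7"]),
}
print(pick_super_peer(indexes, ["urn:r2", "urn:r3", "urn:r5"]))
```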
Computing Super-Peer Borders
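[Figure: super-peers SP1 and SP2 each hold a sorted list of URIs (A, B, C, E, G, H, J, K, L, M, R, S, U); the lists are compared by exchanging tuples of a super-peer name, URIs and a true/false flag to find the URIs they share]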
Super-Peer Level QPG
Super-Peer 1
Borders: AB, AC, BC, A/SP3, B/SP2, C/SP3
Minimum Distances
dist (AB, BC) = 4
dist (AB, AC) = 3
dist (AB, ABC) = 2
dist (BC, AC) = 5
dist (BC, ABC) = 3
dist (AC, ABC) = 2
dist (AC, A/SP3) = 3
dist (AB, A/SP3) = 4
dist (ABC, A/SP3) = 3
dist (AC, C/SP3) = 2
dist (BC, C/SP3) = 4
dist (ABC, C/SP3) = 2
dist (AB, B/SP2) = 2
dist (BC, B/SP2) = 2
dist (ABC, B/SP2) = 2
[Figure: Super-Peer 1 with peers A, B and C, bordering SuperPeer 2 and SuperPeer 3]