Peer-to-Peer Discovery of Semantic Associations
Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak Arpinar, Amit Sheth
2nd International Workshop on Peer-to-Peer Knowledge Management, San Diego, California, July 17, 2005

Semantic Discovery¹
From: finding things
To: finding out about things
Relationships!
1. http://lsdis.cs.uga.edu/semdis

Semantic Associations
• Relationship-centric nature of Semantic Web data models
• We can ask questions about the relationships between objects
• How is entity A related to entity B?
• Applications
  – National Security: Insider Threat¹
  – Improved Searching: Bio Patent Miner²
1. B. Aleman-Meza, P. Burns, M. Eavenson, D. Palaniswami, A. Sheth, An Ontological Approach to the Document Access Problem of Insider Threat, Proceedings of the IEEE Intl. Conference on Intelligence and Security Informatics (ISI-2005), May 19-20, 2005.
2. Sougata Mukherjea, Bhuvan Bamba, BioPatentMiner: An Information Retrieval System for BioMedical Patents, VLDB 2004.

Semantic Associations
Define a set of operators ρ for querying complex relationships between entities (Semantic Associations)¹
[Figure: example RDF graph illustrating a Semantic Association (ρ-path). Labels: resources &r1, &r5, &r6; property name; literals “Matt”, “Perry”, “The University of Georgia”, “LSDIS Lab”]
1. Adapted from: Kemafor Anyanwu and Amit Sheth, ρ-Queries: Enabling Querying for Semantic Associations on the Semantic Web, The Twelfth International World Wide Web Conference, Budapest, Hungary, pp. 690-699.
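
As a rough illustration of what a ρ-path operator asks for, the sketch below treats a handful of triples as an undirected graph and enumerates the simple paths that connect two resources. It is only a sketch under stated assumptions: the triples, the ex: property names, the function name rho_path and the max_hops parameter are all hypothetical and are not taken from the slides or from the ρ-Queries paper.

```python
from collections import defaultdict


def rho_path(triples, start, end, max_hops):
    """Enumerate simple, undirected paths of at most max_hops edges
    between start and end. Each path is a list of (node, property, node)."""
    adjacency = defaultdict(list)
    for subject, prop, obj in triples:
        adjacency[subject].append((prop, obj))
        adjacency[obj].append((prop, subject))  # edge direction is ignored

    results = []

    def extend(node, path, visited):
        if node == end:
            results.append(list(path))
            return
        if len(path) == max_hops:
            return
        for prop, neighbor in adjacency[node]:
            if neighbor not in visited:  # keep the path simple
                visited.add(neighbor)
                path.append((node, prop, neighbor))
                extend(neighbor, path, visited)
                path.pop()
                visited.remove(neighbor)

    extend(start, [], {start})
    return results


# Hypothetical data: resource names and properties are made up for illustration.
triples = [
    ("ex:r1", "ex:memberOf", "ex:r5"),
    ("ex:r5", "ex:partOf", "ex:r6"),
    ("ex:r1", "ex:studiesAt", "ex:r6"),
]

for path in rho_path(triples, "ex:r1", "ex:r6", max_hops=2):
    print(path)
```

The caller names only the two endpoints and a hop limit; no property names or schema knowledge appear in the query itself.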
Uniqueness of Semantic Association Queries
• Simple query specification (only the two endpoints)
• Doesn’t require extensive knowledge of schema
ρ-path (A, B)
Difficult to express with existing query languages
RDQL: find paths of length at most 2 from startURI to endURI
  SELECT ?startURI, ?property_1, ?endURI
  FROM (?startURI ?property_1 ?endURI)

  SELECT ?startURI, ?property_1, ?endURI
  FROM (?endURI ?property_1 ?startURI)

  SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
  FROM (?startURI ?property_1 ?x)(?x ?property_2 ?endURI)
  WHERE ?startURI ne ?x && ?endURI ne ?x

  SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
  FROM (?startURI ?property_1 ?x)(?endURI ?property_2 ?x)
  WHERE ?startURI ne ?x && ?endURI ne ?x

  SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
  FROM (?x ?property_1 ?startURI)(?x ?property_2 ?endURI)
  WHERE ?startURI ne ?x && ?endURI ne ?x

  SELECT ?startURI, ?property_1, ?x, ?property_2, ?endURI
  FROM (?x ?property_1 ?startURI)(?endURI ?property_2 ?x)
  WHERE ?startURI ne ?x && ?endURI ne ?x

Why Semantic Associations in P2P?
• Data on the web is distributed by nature
• Knowledge will be stored in multiple stores and multiple ontologies
• Search for semantic paths will have to include many knowledge sources
• In the spirit of the Semantic Web (collaborative knowledge discovery)

Contributions
• Super-Peer Architecture for Querying Semantic Associations
• Knowledgebase Borders and Distances between Borders
• Query Planning Algorithm based on Knowledgebase Borders and Distances

Assumptions
• Pair-wise mapping of resources between peers (solution to Entity Disambiguation / Reference Reconciliation problem)
• We use URIs to solve the Entity Disambiguation problem
• Main focus is Query Planning over the P2P network
• Not concerned with fault tolerance, details of network formation, etc.
at this point

[Figure: example RDF schema and instance graph. Schema labels: Passenger, Ticket, Flight, FFlyer, Payment, Customer, Client, Bank Account, CCard, Cash, String, float; properties number, for, purchased, paidby, holder, lname, ffid, fflierno, FFNo; relationships subClassOf (isA), subPropertyOf, typeOf (instance). Instance labels: &r1 to &r9, &r11, “John”, “Smith”, “Bill”, “Jones”, “XYZ123”]

ρ-path Problem (k-hop limited)
• Given:
  – An RDF instance graph G, vertices a and b in G, an integer k
• Find:
  – All simple, undirected paths p, with length less than or equal to k, which connect a and b

Distributed ρ-path problem: find all paths from a start node to an end node over the distributed RDF graphs
Knowledge bases: ontologies

What do we need?
• Efficiently explore node neighborhoods
• When to stop a search in one peer and continue it in another
• Determine the search distance in each peer
• Determine which peers to include in the search

Approach
• Peer: RDF data store (Sesame, BRAHMS); answers ρ-path (a, b, k); returns subgraph
• Super-Peer: no data store; responsible for Query Planning
[Figure: network of super-peers and peers, each peer holding a KB; messages exchanged are ρ-path queries, subgraphs, ρ-plans and ρ-sub-plans]

Knowledgebase Borders
[Figure: Peer 1 and Peer 2; their overlap is the Peer_1:Peer_2 border; border nodes marked]

Distance Between Borders
dist (P1:P2, P1:P3) = 3
dist (P1:P2, P2:P3) = 1
dist (P1:P3, P2:P3) = 1
[Figure: Peer 1, Peer 2, Peer 3 with borders P1:P2, P1:P3, P2:P3; legend: border node, query end point; Start and End marked]

Query Planning Graph
• Directed Graph
• Node for each distinct border
• For each pair of connected borders, create 2 edges (one in each direction)
• Weight is the minimum of the minimum distances (reported by peers)
  – For example, you can get from A:B to A:B:C through either A or B

Query Planning Graph
Borders: AB, AC, BC, ABC
Minimum Distances
  dist (AB, BC) = 4
  dist (AB, AC) = 3
  dist (AB, ABC) = 2
  dist (BC, AC) = 5
  dist (BC, ABC) = 3
  dist (AC, ABC) = 2
  dist (AB, AB) = 3
  dist (AC, AC) = 3
  dist (BC, BC) = 2
  dist (ABC, ABC) = ∞
[Figure: peers A, B, C and the resulting query planning graph over AB, AC, BC, ABC, with the edge weights listed above]

Using the Query Planning Graph
Example Query: ρ-path (start, end, 10)
1) Find start and end points
2) Compute distances to borders
3) Add this information to the QPG
4) Find all paths from start to end (including cycles) with weight <= k (10). In this case, 22 paths
5) Convert set of paths to set of queries
[Figure: peers A, B, C with start and end marked; QPG extended with start and end nodes; example paths overlaid]

Converting Paths to Queries
start – 2 – A:B – 3 – A:C – 3 – end
• Each edge (pair of endpoints) represents a query
• For example, ρ-path (start, Peer_A:Peer_B, 2)
What is the correct hop-limit?
hop-limit = edge weight + (k – path weight)
Here k = 10 and path weight = 2 + 3 + 3 = 8, so each edge gains 2 extra hops:
ρ-path (start, Peer_A:Peer_B, 4)
ρ-path (Peer_A:Peer_B, Peer_A:Peer_C, 5)
ρ-path (Peer_A:Peer_C, end, 5)

Find the maximum hop-limit for each pair of end points
Pair                                        Hop-limit
(start, Peer_A:Peer_B)                      5
(start, Peer_A:Peer_B:Peer_C)               7
(start, Peer_B:Peer_C)                      8
(Peer_A:Peer_B, Peer_A:Peer_C)              5
(Peer_A:Peer_B, Peer_A:Peer_B:Peer_C)       5
(Peer_A:Peer_B, Peer_A:Peer_B)              3
(Peer_A:Peer_B, Peer_B:Peer_C)              6
(Peer_A:Peer_C, Peer_A:Peer_B:Peer_C)       3
(Peer_A:Peer_C, Peer_B:Peer_C)              6
(Peer_A:Peer_C, end)                        5
(Peer_B:Peer_C, end)                        8
(Peer_B:Peer_C, Peer_B:Peer_C)              6
(Peer_B:Peer_C, Peer_A:Peer_B:Peer_C)       5
(Peer_A:Peer_B:Peer_C, end)                 6

Which Peer gets each query?
ρ-path (Peer_B:Peer_A, Peer_A:Peer_C, 5): Peer_A
ρ-path (Peer_B:Peer_C, Peer_B:Peer_C, 5): Peer_B and Peer_C
[Figure: Peer_A, Peer_B, Peer_C with the example queries overlaid]

Final Query Plan
[Table: queries for Peer_A, Peer_B and Peer_C, each listed as FROM border, TO border, Hop Limit]

Query Execution at Peer
Input: Set of Queries: { ρ-path ({uri, …}, {uri, …}, k), …}
Algorithm:
  Graph traversal of main-memory representation
  Bi-directional BFS
  Results in a set of statements
Output: Union of each set of statements

Query Execution at Peer
• Peer does not enumerate paths
• Returns a subgraph (set of triples)
• Benefits
  – Eliminates redundant data transfer
  – Saves computation time

Scalability: Multiple Super-Peers
Super-Peer/Super-Peer Borders
• Super-Peer_1:Super-Peer_2
• Super-Peer_1:Super-Peer_3
• Super-Peer_2:Super-Peer_3
Super-Peer/Peer Borders
• Peer_B:Super-Peer_2
• Peer_A:Super-Peer_3
• Peer_C:Super-Peer_3
[Figure: Super-Peer_1, Super-Peer_2, Super-Peer_3 and Peer_A, Peer_B, Peer_C]

Integration of SP graph and Peer Graph
Super-Peer_1’s new Peer-Level QPG
[Figure: weighted graph over nodes A:B, A:C, B:C, A:B:C, A:SP3, B:SP2, C:SP3, SP1:SP2, SP1:SP3]

Query Planning Algorithm
[Figure: super-peers SP1, SP2, SP3 with peers A, B, C, D, E; start and end marked]
1) Find start and end points
2) Compute distances to borders
3) Add temporary information for endpoints (both peer and super-peer QPG)
4) Find all directed paths <= k connecting start to end in the Super-Peer QPG
[Figure: Super-Peer QPG over nodes SP1:SP2, SP1:SP3, SP2:SP3, start, end]
   k = 10
   start – 6 – SP1/SP3 – 2 – SP1/SP3 – 2 – end
   start – 6 – SP1/SP3 – 2 – end
   start – 3 – SP1/SP2 – 6 – end
   start – 10 – end
5) Form a list of sub-query-plan requests for each super-peer
   Super-Peer_1
   FROM                         TO                           Hop-Limit
   start                        end                          10
   start                        Super-Peer_1:Super-Peer_2    4
   Super-Peer_1:Super-Peer_2    end                          7
   start                        Super-Peer_1:Super-Peer_3    8
   Super-Peer_1:Super-Peer_3    Super-Peer_1:Super-Peer_3    2
   Super-Peer_1:Super-Peer_3    end                          4

   Super-Peer_3
   FROM                         TO                           Hop-Limit
   Super-Peer_1:Super-Peer_3    Super-Peer_1:Super-Peer_3    2
7) Each super-peer goes through the previous process on its peer QPG to form a list of ρ-path queries for its peers
[Table: queries for Peer A, Peer B, Peer C and Peer E, each listed as FROM border, TO border, Hop Limit]
8) Querying peer now communicates directly with other peers to execute the ρ-path queries
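
The hop-limit rule from the "Converting Paths to Queries" slide (hop-limit = edge weight + (k – path weight), keeping the maximum per pair of end points) can be sketched as below. The function name hop_limits and the edge-list layout are assumptions made for illustration; the input is the set of four Super-Peer QPG paths listed under step 4 with k = 10, and pairs are treated as ordered (FROM, TO).

```python
def hop_limits(paths, k):
    """Return the maximum hop-limit per (from, to) pair.

    paths: list of paths, each a list of (from, to, weight) edges.
    For every path with weight <= k, each edge may use its own weight
    plus the slack (k - path weight) left over on that path.
    """
    limits = {}
    for path in paths:
        path_weight = sum(weight for _, _, weight in path)
        if path_weight > k:
            continue  # path cannot yield a result within k hops
        slack = k - path_weight
        for src, dst, weight in path:
            pair = (src, dst)
            limits[pair] = max(limits.get(pair, 0), weight + slack)
    return limits


# The four paths from step 4 (k = 10), written as edge lists.
paths = [
    [("start", "SP1:SP3", 6), ("SP1:SP3", "SP1:SP3", 2), ("SP1:SP3", "end", 2)],
    [("start", "SP1:SP3", 6), ("SP1:SP3", "end", 2)],
    [("start", "SP1:SP2", 3), ("SP1:SP2", "end", 6)],
    [("start", "end", 10)],
]

limits = hop_limits(paths, k=10)
for (src, dst), limit in sorted(limits.items()):
    print(f"FROM: {src}  TO: {dst}  Hop-Limit: {limit}")
```

With this input the computed limits are 10, 4, 7, 8, 2 and 4, which agree with the hop-limits shown for Super-Peer_1 in step 5.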
Conclusions and Future Work
• Presented a Query-Planning Algorithm for ρ-path queries over a distributed data set
• Problems
  – Efficiently compute node neighborhoods
  – How to continue searches across KBs
  – How to check for the many possible cases
  – How to determine search length in each KB

Conclusions and Future Work
• Future Work
  – Performance Testing
  – Effect of relative border size
  – Different criteria for group formation
  – How to accommodate other types of queries

Questions?

Computing Borders
Super-Peer maintains Sorted Map of URIs
• Peer Border
  – Traverse new list and update Sorted Map
• Super Peer Border
  – Don’t care about other URIs not in this group
  – Keep total data transferred at a minimum

Forming the Network
1) New peer: "I want to join the network"
2) Super-peers: "I am a super-peer"
3) New peer broadcasts its list of URIs
4) SPs compute overlap, O(n log k) (maintain new peer border information)
5) SPs send overlap count to the new peer
6) New peer picks one super-peer (accept / reject)
7) SP1 updates permanent URI index
8) SP recomputes borders
9) SP: "Here are your borders"
10) Peers send minimum distances
[Figure: super-peers SP1, SP2, SP3, peers P1, P2 and the new peer P New]

Computing Super-Peer Borders
[Figure: URIs held in SP1 and SP2 and the associated sorted-map entries]

Super-Peer Level QPG
Super-Peer 1
Borders: AB, AC, BC, A/SP3, B/SP2, C/SP3
Minimum Distances
  dist (AB, BC) = 4
  dist (AB, AC) = 3
  dist (AB, ABC) = 2
  dist (BC, AC) = 5
  dist (BC, ABC) = 3
  dist (AC, ABC) = 2
  dist (AC, A/SP3) = 3
  dist (AB, A/SP3) = 4
  dist (ABC, A/SP3) = 3
  dist (AC, C/SP3) = 2
  dist (BC, C/SP3) = 4
  dist (ABC, C/SP3) = 2
  dist (AB, B/SP2) = 2
  dist (BC, B/SP2) = 2
  dist (ABC, B/SP2) = 2
[Figure: peers A, B, C inside Super-Peer 1, next to Super-Peer 2 and Super-Peer 3]
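
The construction rule from the "Query Planning Graph" slide (one node per distinct border, two directed edges per connected pair, weight = minimum of the minimum distances reported by peers) might be sketched as follows. The function name build_qpg and the report format are assumptions; the per-peer values are hypothetical and were chosen only so that their minima agree with the AB, AC, BC, ABC distances listed above.

```python
def build_qpg(reports):
    """Build a directed, weighted query planning graph.

    reports: {peer: {(border_x, border_y): minimum distance inside that peer}}
    Returns {source border: {target border: weight}}, where weight is the
    smallest distance reported by any peer for that pair of borders.
    """
    qpg = {}
    for distances in reports.values():
        for (x, y), distance in distances.items():
            for src, dst in ((x, y), (y, x)):  # one edge in each direction
                edges = qpg.setdefault(src, {})
                edges[dst] = min(edges.get(dst, float("inf")), distance)
    return qpg


# Hypothetical per-peer reports (not from the slides).
reports = {
    "A": {("AB", "ABC"): 2, ("AB", "AC"): 3, ("AC", "ABC"): 2},
    "B": {("AB", "ABC"): 4, ("AB", "BC"): 4, ("BC", "ABC"): 3},
    "C": {("AC", "BC"): 5, ("AC", "ABC"): 6, ("BC", "ABC"): 5},
}

qpg = build_qpg(reports)
for src in sorted(qpg):
    for dst in sorted(qpg[src]):
        print(f"dist ({src}, {dst}) = {qpg[src][dst]}")
```

Here AB and ABC are connected through both peer A (distance 2) and peer B (distance 4), so the edge weight is 2 in both directions.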