MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine

Sebastian Michel, Max-Planck-Institut für Informatik, Saarbrücken, Germany, smichel@mpi-inf.mpg.de
Peter Triantafillou, University of Patras, Rio, Greece, peter@ceid.upatras.gr
Gerhard Weikum, Max-Planck-Institut für Informatik, Saarbrücken, Germany, weikum@mpi-inf.mpg.de

Middleware 2005, Grenoble, France

Vision
• Today: Web search is dominated by centralized engines (“to google”)
  – censorship?
  – single point of attack/abuse
  – coverage of the web?
• Ultimate goal: a “Distributed Google” to break information monopolies
• A P2P approach is best suited
  – large number of peers
  – exploits mostly idle resources
  – intellectual input of the user community

Challenges
• large-scale networks
  – 100,000 to 10,000,000 users
• large collections
  – > 10^10 documents
  – 1,000,000 terms
• high dynamics

Questions
• Network Organization
  – structured?
  – hierarchical?
  – unstructured?
• Data Placement
  – move data around?
  – data remains at the owner?
• Scalability?
• Query Routing/Execution
  – Routing indexes?
  – Message flooding?

Overview
• Motivation (Vision/Challenges/Questions)
• Introduction to IR and P2P Systems
• P2P-IR
• Minerva Infinity
• Network Organization
• Data Placement
• Query Processing
• Data Replication
• Experiments
• Conclusion

Information Retrieval Basics
[Figure: a document is mapped to its terms, each with its number of occurrences (term frequency), e.g. 5x, 7x, 4x]

Information Retrieval Basics (2)
Top-k Query Processing: find the k documents with the highest total score
Example index lists, one per query term:
  d51: 0.6, d12: 0.5, d14: 0.4, d52: 0.3, d44: 0.2, d28: 0.1
  d28: 0.7, d11: 0.6, d17: 0.1, ...
  d53: 0.8, d55: 0.6, d44: 0.4, d17: 0.3, d52: 0.1, ...
* e.g. Fagin’s threshold algorithm (TA) or a variant without random accesses
[Figure: a B+ tree on terms points to the index lists]
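The top-k processing over score-sorted index lists can be illustrated with a small sketch of a threshold-style algorithm. This is a minimal illustration, not the implementation used in MINERVA Infinity. The function name `threshold_algorithm`, the grouping of the slide's sample entries into three lists, and the treatment of the truncated lists ("...") as complete are all assumptions made for the example.

```python
import heapq


def threshold_algorithm(index_lists, k):
    """Return up to k (doc_id, total_score) pairs with the highest summed score.

    index_lists: one list per query term, each holding (doc_id, score)
    pairs sorted by descending score. Assumed complete for this sketch.
    """
    # Hypothetical random-access structure: one dict per index list.
    lookup = [dict(entries) for entries in index_lists]
    totals = {}
    max_depth = max(len(entries) for entries in index_lists)

    for depth in range(max_depth):
        threshold = 0.0
        # Sorted access: read one entry per list, round-robin.
        for entries in index_lists:
            if depth >= len(entries):
                continue  # exhausted list adds nothing to the threshold
            doc_id, score = entries[depth]
            threshold += score
            if doc_id not in totals:
                # Random accesses fetch the missing scores from all lists.
                totals[doc_id] = sum(t.get(doc_id, 0.0) for t in lookup)
        top = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # Stop once no unseen document can beat the current k-th score.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Sample data taken from the slide; the split into three lists is assumed.
SAMPLE_LISTS = [
    [("d51", 0.6), ("d12", 0.5), ("d14", 0.4),
     ("d52", 0.3), ("d44", 0.2), ("d28", 0.1)],
    [("d28", 0.7), ("d11", 0.6), ("d17", 0.1)],
    [("d53", 0.8), ("d55", 0.6), ("d44", 0.4), ("d17", 0.3), ("d52", 0.1)],
]

if __name__ == "__main__":
    for doc_id, total in threshold_algorithm(SAMPLE_LISTS, 2):
        print(doc_id, round(total, 2))
```

On this sample data the scan can stop at the fourth position of each list, before the first list is read to its end, because the sum of the scores at the current scan positions has dropped below the score of the second-best document found so far.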
Query Execution: usually some kind of threshold algorithm*:
– sequential scans over the index lists (round-robin)
– (random accesses to fetch missing scores)
– aggregate scores
– stop when the threshold is reached
Index lists hold (DocId: tf*idf) entries sorted by score.

P2P Systems
• Peer:
  – “one that is of equal standing with another” (source: Merriam-Webster Online Dictionary)
• Benefits:
  – no single point of failure
  – resource/data sharing
• Problems/Challenges:
  – authority/trust/incentives
  – high dynamics
  – …
• Applications:
  – File Sharing
  – IP Telephony
  – Web Search
  – Digital Libraries

Structured P2P Systems based on Distributed Hash Tables (DHTs)
• “structured” P2P networks
• provide one simple method: lookup: key -> peer
• CAN [SIGCOMM 2001]
• CHORD [SIGCOMM 2001]
• Pastry [Middleware 2001]
• P-Grid [CoopIS 2001]
• robustness to load skew, failures, and dynamics

Chord
• Peers and keys are mapped to the same cyclic ID space using a hash function
• Key k (e.g., hash(file name)) is assigned to the node with key p (e.g., hash(IP address)) such that $k \le p$ and there is no node $p'$ with $k \le p'$ and $p' < p$
[Figure: Chord ring with peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56 and keys k10, k24, k30, k38, k54]

Chord (2)
• Finger tables speed up the lookup process
• Each peer stores pointers to a few distant peers
• Lookup in O(log n) steps
[Figure: Lookup(54) on the Chord ring with key k54 and peers p1, p8, p14, p21, p32, p38, p42, p51, p56]

P2P-IR
• Share documents (e.g.
Web pages) in an efficient and scalable way
• Ranked retrieval
  – a simple DHT is insufficient

Possible Approaches
• Each peer is responsible for storing the COMPLETE index list for a subset of terms.
• Query Routing: DHT lookups
• Query Execution: distributed top-k [TPUT ’04, KLEE ’05]
Capacity overload of peers with highly frequent / popular terms (data load AND query load)
[Figure: ring of peers p1, p8, p14, p21, p32, p38, p42, p48, p51, p56]

Possible Approaches (2)
• Each peer has its own local index (e.g., created by web crawls)
• Distributed Directory: Term -> List of Peers
• Query Routing:
  1. DHT lookups
  2. Retrieve metadata
  3. Find the most promising peers
• Query Execution:
  – Send the complete query and merge the incoming results
Capacity overload of peers with
  – highly frequent terms
  – high-quality collections
[Figure: peers P1 to P6 connected through the distributed directory]

Minerva Infinity
• Idea:
  – assign (term, docId, score) triplets to the peers
    • order preserving
    • load balancing
  – hash(score) + hash(term) as offset
  – guarantee 100% recall

Hash Function
• Requirements:
  – Load balancing (to avoid overloading peers)
  – Order preserving (to make query processing (QP) work)
• One without the other is trivial ...
  – Load balancing: apply a pseudo-random hash function
  – Order preserving: $\frac{S - S_{min}}{S_{max} - S_{min}} \cdot N$
• Both together is challenging …

Hash Function (2)
• Assume an exponential score distribution
• Place the first half of the data on the first peer
• The next quarter on the next peer
• and so on …
[Figure: score axis from 1 to 0 …]

Term Index Networks (TINs)
• Reduce the number of hops during QP by reducing the number of peers that maintain the index list for a particular term
• Only a small subset of peers is used to store an index list.
[Figure: global network of peers containing the TINs A, B, and C]

How to Create/Find a TIN
• Use u Beacon-Peers to bootstrap the TIN for term T

  p = 1/u
  for i = 0 to i < n' do
      id = hash(t, i*p)
      if (i > 0)
          use hash(t, (i-1)*p) as a gateway to the TIN
      else
          node with id creates the TIN
  end for

• Beacon nodes act as gateways to the TIN
[Figure: global network with the TIN for term T]

Publish Data / Join a TIN
• Peer with id = hash(t, score) is not yet in the TIN for term t
• Randomly select a beacon node (beacon nodes act as gateways to the TIN)
• Call the join method
• Store the item (docId, t, score)

Query Processing
[Figure: 2-keyword query with a coordinator and data peers]
• Alternative: collect data and send it in one batch.

QP with Moving Coordinator
[Figure: 3-keyword query with data peers and a moving coordinator]

Data Replication
• Vertical: Replicate data inside a TIN via a ‘reverse’ communication.
[Figure: vertical replication, from peers holding 1, 2, 3 to peers each holding 123]
• Horizontal: Replicate complete TINs
[Figure: horizontal replication, several copies of the TINs A, B, and C in the global network]

Experiments
Test bed: 10,000 peers
Benchmarks:
• GOV: TREC .GOV collection + 50 TREC-2003 Web queries, e.g. “juvenile delinquency”
• XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. “juvenile delinquency youth minor crime law jurisdiction offense prevention”
• SCALABILITY: One query executed multiple times

Experiments: Metrics
• Network traffic (in KB)
• Query response time (in s)
  – network cost (150 ms RTT, 800 Kb/s data transfer rate)
  – local I/O cost (8 ms rotation latency + 8 MB/s transfer delay)
  – processing cost
• Number of hops

Scalability Experiment
• Measure the time for different query loads.
  – identical queries
  – inserted into a queue
[Figure: total execution time in seconds (log scale, 100 to 10,000,000) versus query load as queue size (log scale, 1 to 10,000); series: Minerva Infinity, no parallel processing]

Experiments: Results
[Figure: GOV benchmark, two panels: total time in seconds (0 to 1200) and total bandwidth in KB (0 to 60,000), each versus the number of query terms (2, 3, 4)]

Conclusion
• Novel architecture for P2P web search.
• High level of distribution both in data and processing.
• Novel algorithms to create the networks, place data, and execute queries.
• Support of two different data replication strategies.

Future Work
• Support of different score distributions
• Adapt TIN sizes to the actual load
• Different top-k query processing algorithms

Thank you for your attention