MINERVA Infinity: A Scalable Efficient Peer-to-Peer Search Engine Peter Triantafillou

advertisement
MINERVA Infinity:
A Scalable Efficient Peer-to-Peer
Search Engine
Sebastian Michel
Peter Triantafillou
Gerhard Weikum
Max-Planck-Institut für Informatik
Saarbrücken, Germany
smichel@mpi-inf.mpg.de
University of Patras
Rio, Greece
peter@ceid.upatras.gr
Max-Planck-Institut für Informatik
Saarbrücken, Germany
weikum@mpi-inf.mpg.de
Middleware 2005
Grenoble, France
Vision
• Today: Web Search is dominated
by centralized engines (“to google”)
- censorship?
- single point of attack/abuse
- coverage of the web?
• Ultimate goal: “Distributed Google” to
break information monopolies
• P2P approach best suitable
– large number of peers
– exploit mostly idle resources
– intellectual input of user community
MINERVA Infinity: A Scalable Efficient P2P Search Engine
2
Challenges
• large scale networks
– 100,000 to 10,000,000 users
• large collections
> 10^10 documents
– 1,000,000 terms
• high dynamics
MINERVA Infinity: A Scalable Efficient P2P Search Engine
3
Questions
• Network Organization
– structured?
– hierarchical?
– unstructured?
• Data Placement
– move data around?
– data remains at the owner?
• Scalability?
• Query Routing/Execution
– Routing indexes?
– Message flooding?
MINERVA Infinity: A Scalable Efficient P2P Search Engine
4
Overview
•
•
•
•
•
•
•
•
•
•
Motivation (Vision/Challenges/Questions)
Introduction to IR and P2P Systems
P2P- IR
Minerva Infinity
Network Organization
Data Placement
Query Processing
Data Replication
Experiments
Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine
5
Information Retrieval Basics
5x
7x
4x
Document
Terms
# of terms
(term frequency)
MINERVA Infinity: A Scalable Efficient P2P Search Engine
6
Information Retrieval Basics (2)
Top-k Query Processing: find k documents with
the highest total score
d51: 0.6
d12: 0.5
d14: 0.4
d52: 0.3
d44: 0.2
d28: 0.1
d28: 0.7
d11: 0.6
d17: 0.1
...
d53: 0.8
d55: 0.6
d44: 0.4
d17: 0.3
d52: 0.1
...
e.g. Fagin’s algorithm
TA or a variant without random accesses
B+ tree on terms
...
Query Execution: Usually using
some kind of threshold algorithm*:
- sequential scans over
the index lists (round-robin)
- (random accesses to fetch
missing scores)
- aggregate scores
- stop when the threshold is
reached
index lists with
(DocId: tf*idf)
sorted by Score
MINERVA Infinity: A Scalable Efficient P2P Search Engine
7
P2P Systems
• Peer:
– “one that is of equal standing with another”
(source: Merriam-Webster Online Dictionary )
• Benefits:
– no single point of failure
– resource/data sharing
• Problems/Challenges:
– authority/trust/incentives
– high dynamics
– …
• Applications:
–
–
–
–
File Sharing
IP Telephony
Web Search
Digital Libraries
MINERVA Infinity: A Scalable Efficient P2P Search Engine
8
Structured P2P Systems based on
Distributed Hash Tables (DHTs)
• “structured” P2P networks
• provide one simple method:
lookup:key->peer
•
•
•
•
CAN [SIGCOMM 2001]
CHORD [SIGCOMM 2001]
Pastry [Middleware 2001]
P-Grid [CoopIS 2001]
robustness to
load skew,
failures,
dynamics
MINERVA Infinity: A Scalable Efficient P2P Search Engine
9
Chord
• Peers and keys are mapped to
the same cyclic ID space using a p56
hash function
k54
p1
p8
k10
p51
• Key k (e.g., hash(file name))
is assigned to the node with p48
key p (e.g., hash(IP address))
such that k  p and there is
no node p‘ with k  p‘ and p‘<p p42
p38
k38
p14
p21
k24
p32
k30
MINERVA Infinity: A Scalable Efficient P2P Search Engine
10
Chord (2)
•
•
•
Using finger tables to speed
up lookup process
Store pointers to few distant
peers
Lookup in
O(log n) steps
Lookup(54)
k54
p56
p1
p8
p51
Chord Ring
p14
p42
p38
p32
p21
MINERVA Infinity: A Scalable Efficient P2P Search Engine
11
Overview
•
•
•
•
•
•
•
•
•
•
Motivation (Vision/Challenges/Questions)
Introduction to IR and P2P Systems
P2P- IR
Minerva Infinity
Network Organization
Data Placement
Query Processing
Data Replication
Experiments
Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine
12
P2P - IR
• Share documents (e.g. Web pages) in an
efficient and scalable way
• Ranked retrieval
– simple DHT is insufficient
MINERVA Infinity: A Scalable Efficient P2P Search Engine
13
Possible Approaches
• Each peer is responsible for storing the
COMPLETE index list for a subset of terms.
p1
p56
Query Routing: DHT lookups
Query Execution: Distributed Top-k
[TPUT ’04, KLEE ‘05]
p8
p51
p48
p14
p21
p42
p38
p32
capacity overload of peers with
highly frequent / popular terms
(data load AND query load)
MINERVA Infinity: A Scalable Efficient P2P Search Engine
14
Possible Approaches (2)
• Each peer has its own local index
(e.g., created by web crawls)
P2
P1
P3
Distributed Directory
Term  List of Peers
P6
Query Routing:
1. DHT lookups
P5
2. Retrieve Metadata
3. Find most promising peers
Query Execution:
- Send the complete Query
and merge the incoming results
P4
capacity overload of peers with
- highly frequent terms
- high-quality collections
MINERVA Infinity: A Scalable Efficient P2P Search Engine
15
Overview
•
•
•
•
•
•
•
•
•
•
Motivation (Vision/Challenges/Questions)
Introduction to IR and P2P Systems
P2P- IR
Minerva Infinity
Network Organization
Data Placement
Query Processing
Data Replication
Experiments
Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine
16
Minerva Infinity
• Idea:
– assign (term, docId, score)
triplets to the peers
• order preserving
• load balancing
– hash(score)+
hash(term) as offset
– guarantee 100% recall
MINERVA Infinity: A Scalable Efficient P2P Search Engine
17
Hash Function
• Requirements:
– Load balancing (to avoid overloading peers)
– Order preserving (to make the QP work)
• One without the other is trivial ...
– Load balancing: apply a pseudo random hash function
– Order preserving:
S-Smin
----------------Smax - Smin
* N
• Both together is challenging …
MINERVA Infinity: A Scalable Efficient P2P Search Engine
18
Hash Function (2)
•
•
•
•
Assume an exponential score distribution
Place the first half of the data to the first peer
The next quarter to the next peer
and so on …
1
0
…
MINERVA Infinity: A Scalable Efficient P2P Search Engine
19
Term Index Networks (TINs)
• Reduce # of hops during QP by reducing the
number of peers that maintain the index list for a
particular term
 Only a small subset of peers is used to store an
index list.
62
2
45
2
B
45
41
41
Global
Network
7
A
24
7
62
12
12
37
16
C
15
24
20
16
24
MINERVA Infinity: A Scalable Efficient P2P Search Engine
20
How to Create/Find a TIN
• Use u Beacon-Peers to bootstrap
the TIN for term T
Global
Network
p = 1/u
For i=0 to i<n‘ do
id = hash(t, i*p)
if (i>0) use hash(t,(i-1)*p)
as a gateway to the TIN
else node with id creates the TIN
End for
T
Beacon nodes act as gateways to the TIN
MINERVA Infinity: A Scalable Efficient P2P Search Engine
21
Publish Data / Join a TIN
• Peer with id = hash(t, score) not in the TIN for
term t
• Randomly select a beacon node
(Beacon nodes act as gateways to the TIN)
• Call the join method
• Store the item (docId, t, score)
MINERVA Infinity: A Scalable Efficient P2P Search Engine
22
Query Processing
Data Peers
1
Coordinator
1
2-keyword Query
Alternative: Collect data and send in one batch.
MINERVA Infinity: A Scalable Efficient P2P Search Engine
23
QP with Moving Coordinator
Data Peers
1
1
Coordinator
1
3-keyword Query
MINERVA Infinity: A Scalable Efficient P2P Search Engine
24
Data Replication
• Vertical: Replicate data inside a TIN via a ‘reverse’
communication.
1
2
3
123
123
123
• Horizontal: Replicate complete TINs
64
41
11
C
A
2
45
B
24
62
7
16
24
12
8
64
C
B
20
28
55
31
1
11
57
7
B
A
C
46
16
22
49
5
1
50
B
A
34
19
MINERVA Infinity: A Scalable Efficient P2P Search Engine
25
Experiments
Test bed:
10,000 peers
Benchmarks:
• GOV: TREC .GOV collection + 50 TREC-2003 Web
queries, e.g. juvenile delinquency
• XGOV: TREC .GOV collection + 50 manually expanded
queries, e.g. juvenile delinquency youth minor crime law
jurisdiction offense prevention
• SCALABILITY: One query executed multiple times
……….
MINERVA Infinity: A Scalable Efficient P2P Search Engine
26
Experiments: Metrics
Metrics
• Network traffic (in KB)
• Query response time (in s)
- network cost (150ms RTT,
800Kb/s data transfer rate)
- local I/O cost (8ms rotation latency
+ 8MB/s transfer delay)
- processing cost
• Number of Hops
MINERVA Infinity: A Scalable Efficient P2P Search Engine
27
Scalability Experiment
• Measure time for a different
query loads.
– identical queries
– inserted into a queue
10000000
Minerva Infinity
Total Execution Time
in Seconds
1000000
no parallel
processing
100000
10000
1000
100
1
10
100
1000
10000
Query Load: Queue Size
MINERVA Infinity: A Scalable Efficient P2P Search Engine
28
Experiments: Results
GOV
GOV
60000.00
Total Bandwidth in KB
Total Time in Seconds
1200
1000
800
600
400
200
50000.00
40000.00
30000.00
20000.00
10000.00
0.00
0
2
3
4
Number of Query Terms
2
3
4
Number of Query Terms
MINERVA Infinity: A Scalable Efficient P2P Search Engine
29
Conclusion
• Novel architecture for P2P web search.
• High level of distribution both in data and
processing.
• Novel algorithms to create the networks, place
data, and execute queries.
• Support of two different data replication
strategies.
MINERVA Infinity: A Scalable Efficient P2P Search Engine
30
Future Work
• Support of different score distributions
• Adapt TIN sizes to the actual load
• Different top-k query processing algorithms
MINERVA Infinity: A Scalable Efficient P2P Search Engine
31
Thank you for your attention
MINERVA Infinity: A Scalable Efficient P2P Search Engine
32
Download