Efficient and Self-tuning
Incremental Query Expansions
for Top-k Query Processing
Martin Theobald
Ralf Schenkel
Gerhard Weikum
Max-Planck Institute for Informatics
Saarbrücken
Germany
ACM SIGIR '05
An Initial Example…
TREC Robust Track ’04, hard query no. 363 (Aquaint news corpus)
“transportation tunnel disasters”
Expansion sets (term similarity weights):
  transportation (1.0): transit 0.9, highway 0.8, train 0.7, truck 0.6, metro 0.6, "rail car" 0.5, car 0.1, …
  tunnel (1.0): tube 0.9, underground 0.8, "Mont Blanc" 0.7, …
  disasters (1.0): catastrophe 0.9, accident 0.8, fire 0.7, flood 0.6, earthquake 0.6, "land slide" 0.5, …

Increased retrieval robustness: count only the best match per document and expansion set
Increased efficiency: top-k-style query evaluations; open scans on new terms only on demand
No threshold tuning
Expansion terms from relevance feedback, thesaurus lookups, Google top-10 snippets, etc.
Term similarities, e.g., Rocchio, Robertson & Sparck-Jones, concept similarities, or other correlation measures
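To make the best-match rule concrete, here is a tiny Python illustration of scoring one document against the three expansion sets above; the per-document local scores are invented for this sketch, only the expansion weights are taken from the example.

```python
# Expansion sets and similarity weights from the example above (truncated);
# the local scores s(t, d1) below are made up purely for illustration.
expansions = {
    "transportation": {"transit": 0.9, "highway": 0.8, "train": 0.7},
    "tunnel":         {"tube": 0.9, "underground": 0.8},
    "disasters":      {"accident": 0.8, "fire": 0.7},
}
local = {"train": 0.5, "highway": 0.4, "tube": 0.6, "fire": 0.9}  # s(t, d1), assumed

# Count only the best (sim-weighted) match per document and expansion set:
score = sum(
    max((sim * local.get(term, 0.0) for term, sim in group.items()), default=0.0)
    for group in expansions.values()
)
# transportation: max(0.8*0.4, 0.7*0.5) = 0.35
# tunnel:         0.9*0.6              = 0.54
# disasters:      0.7*0.9              = 0.63   =>  score = 1.52
```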
Outline
Computational model & background on top-k algorithms
Incremental Merge over inverted lists
Probabilistic candidate pruning
Phrase matching
Experiments & Conclusions
Computational Model
Vector space model with a Cartesian product space D1×…×Dm
and a data set D ⊆ D1×…×Dm ⊆ ℝ^m
Precomputed local scores s(ti,d)∈ Di for all d∈ D
e.g., tf*idf variations, probabilistic models (Okapi BM25), etc.
typically normalized to s(ti,d)∈ [0,1]
Monotonic score aggregation
aggr: (D1×…×Dm) × (D1×…×Dm) → ℝ+
e.g., sum, max, product (using sum over log sij ), cosine (using L2 norm)
Partial-match queries (aka. “andish”)
Non-conjunctive query evaluations
Weak local matches can be compensated
Access model
Disk-resident inverted index over large text corpus
Inverted lists sorted by decreasing local scores
Inexpensive sequential accesses to per-term lists: "getNextItem()"
More expensive random accesses: "getItemBy(docid)"
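A minimal in-memory Python sketch of this access interface; the method names getNextItem() and getItemBy(docid) are taken from the slide, while the class itself and its dict-based random-access path are assumptions made for illustration (the real lists are disk-resident).

```python
from typing import List, Optional, Tuple

class InvertedList:
    """One per-term inverted list, postings sorted by descending local score."""

    def __init__(self, postings: List[Tuple[str, float]]):
        self.postings = postings            # [(doc_id, s(t, d)), ...]
        self._pos = 0                       # cursor for sequential scans
        self._by_doc = dict(postings)       # doc_id -> score, for random access

    def getNextItem(self) -> Optional[Tuple[str, float]]:
        """Inexpensive sequential access: next (doc_id, score) pair, or None."""
        if self._pos >= len(self.postings):
            return None
        item = self.postings[self._pos]
        self._pos += 1
        return item

    def getItemBy(self, docid: str) -> Optional[float]:
        """More expensive random access: local score of docid (None if absent)."""
        return self._by_doc.get(docid)
```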
No-Random-Access (NRA) Algorithm
[Fagin et al., PODS '01; Balke et al., VLDB '00; Buckley & Lewit, SIGIR '85]

Example: query q = (transportation, tunnel, disaster) over a corpus d1, …, dn.
Each document carries precomputed local scores, e.g. s(t1,d1) = 0.7, …, s(tm,d1) = 0.2.
Inverted index, lists sorted by descending local score:
  transport: (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.7) (d88, 0.2) …
  tunnel:    (d64, 0.8) (d23, 0.6) (d10, 0.6) …
  disaster:  (d10, 0.7) (d78, 0.5) (d64, 0.4) (d99, 0.2) (d34, 0.1) …

NRA(q, L):
  scan all lists Li (i = 1..m) in parallel   // e.g., round-robin
    <d, s(ti,d)> = Li.getNextItem()
    E(d) = E(d) ∪ {i}
    highi = s(ti,d)
    worstscore(d) = ∑i∈E(d) s(ti,d)
    bestscore(d) = worstscore(d) + ∑i∉E(d) highi
    if worstscore(d) > min-k then
      add d to top-k
      min-k = min{ worstscore(d') | d' ∈ top-k }
    else if bestscore(d) > min-k then
      candidates = candidates ∪ {d}
    if max{ bestscore(d') | d' ∈ candidates } ≤ min-k then return top-k

Example run for k = 1, shown as [worstscore(d), bestscore(d)]:
  Scan depth 1: d78 [0.9, 2.4], d64 [0.8, 2.4], d10 [0.7, 2.4]
  Scan depth 2: d78 [1.4, 2.0], d23 [1.4, 1.9], d10 [0.7, 2.1]
  Scan depth 3: d10 [2.1, 2.1], d78 [1.4, 2.0], d23 [1.4, 1.8], d64 [1.2, 2.0]  → STOP
Naive Join-then-Sort needs in between O(mn) and O(mn log n) runtime.
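For concreteness, a compact Python sketch of the NRA loop above (sum aggregation over already-materialized in-memory lists, round-robin scans; bookkeeping is recomputed from scratch each round, which the real engine of course avoids).

```python
from collections import defaultdict

def nra(lists, k):
    """No-Random-Access top-k over m inverted lists.

    `lists[i]` is a list of (doc_id, score) pairs sorted by descending score.
    Returns the top-k documents as (doc_id, worstscore) pairs.
    """
    m = len(lists)
    high = [lst[0][1] if lst else 0.0 for lst in lists]   # current high_i bounds
    seen = defaultdict(dict)          # E(d): doc_id -> {list index: local score}
    top_k, worst = [], {}

    for depth in range(max((len(lst) for lst in lists), default=0)):
        for i, lst in enumerate(lists):                   # round-robin scan
            if depth < len(lst):
                doc, score = lst[depth]                   # sequential access
                seen[doc][i] = score
                high[i] = score

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}

        ranked = sorted(worst, key=worst.get, reverse=True)
        top_k = ranked[:k]                                # current top-k
        min_k = worst[top_k[-1]] if len(top_k) == k else 0.0

        # stop once no remaining candidate can still overtake the top-k
        candidate_best = [best[d] for d in seen if d not in top_k]
        if len(top_k) == k and (not candidate_best or max(candidate_best) <= min_k):
            break

    return [(d, worst[d]) for d in top_k]

# On the example index above, nra([transport, tunnel, disaster], k=1)
# terminates at scan depth 3 with [("d10", 2.1)].
```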
Outline
Computational model & background on top-k algorithms
Incremental Merge over inverted lists
Probabilistic candidate pruning
Phrase matching
Experiments & Conclusions
Dynamic & Self-tuning Query Expansions
Query: (transport, tunnel, ~disaster), where ~disaster is a virtual index list for the
expansion set exp(disaster) = {disaster, accident, fire, …}, built by incrementally
merging the corresponding inverted lists.

Best-match score aggregation for combined term similarities and local scores:
  score(d) := ∑i=1..m  max { sim(ti, tij) · s(tij, d) | tij ∈ exp(ti) }

Incrementally merge the inverted lists Li1, …, Lim' in descending order of local scores
Dynamically add lists into the set of active expansions exp(ti)
Only touch short prefixes of each list; there is no need to open all lists
Increased retrieval robustness & fewer topic drifts
Increased efficiency through fewer active expansions
No threshold tuning of term similarities in the expansions
Incremental Merge Operator
Expansion terms ~t = {t1, t2, t3} (from relevance feedback, thesaurus lookups, …)
Expansion similarities sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
(from correlation measures, large corpus statistics, …)
Index list meta data (e.g., histograms) provides the initial high-scores.
Incremental Merge is iteratively triggered by the top-k operator through sequential
accesses ("getNextItem()").

Example:
  t1 (sim 1.0): (d78, 0.9) (d23, 0.8) (d10, 0.8) (d1, 0.4) (d88, 0.3) …
  t2 (sim 0.9): (d64, 0.8) (d23, 0.8) (d10, 0.7) (d12, 0.2) (d78, 0.1) …
  t3 (sim 0.5): (d11, 0.9) (d78, 0.9) (d64, 0.7) (d99, 0.7) (d34, 0.6) …
Merged virtual list ~t with sim-weighted scores:
  (d78, 0.9) (d23, 0.8) (d10, 0.8) (d64, 0.72) (d23, 0.72) (d10, 0.63) (d11, 0.45) (d78, 0.45) (d1, 0.4) (d88, 0.3) …
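One possible way to realize the Incremental Merge operator as a lazy Python generator; this sketch assumes each expansion list is already available as an iterator of (doc_id, score) pairs in descending score order, whereas the real operator additionally uses the index meta data (initial high-scores) to open lists only on demand.

```python
import heapq

def incremental_merge(expansions):
    """Merge expansion lists into one virtual list ~t, lazily.

    `expansions` is a list of (sim, postings) pairs, where `postings` iterates
    over (doc_id, local_score) in descending score order. Yields
    (doc_id, sim * local_score) in globally descending order, reading only as
    much of each underlying list as the consumer actually requests.
    """
    heap = []   # entries: (-weighted_score, list index, doc_id, posting iterator)
    for idx, (sim, postings) in enumerate(expansions):
        it = iter(postings)
        for doc, score in it:
            heapq.heappush(heap, (-sim * score, idx, doc, it))
            break                      # only the head of each list is read
    while heap:
        neg, idx, doc, it = heapq.heappop(heap)
        yield doc, -neg
        sim = expansions[idx][0]
        for doc, score in it:          # refill from the same list, if any
            heapq.heappush(heap, (-sim * score, idx, doc, it))
            break

# The example lists above:
t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)]
t2 = [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)]
t3 = [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)]
merged = incremental_merge([(1.0, t1), (0.9, t2), (0.5, t3)])
# yields (d78, 0.9), (d23, 0.8), (d10, 0.8), (d64, 0.72), (d23, 0.72), (d10, 0.63), ...
```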
Outline
Computational model & background on top-k algorithms
Incremental Merge over inverted lists
Probabilistic candidate pruning
Phrase matching
Experiments & Conclusions
Probabilistic Candidate Pruning [Theobald, Schenkel & Weikum, VLDB ‘04]
For each physically stored index list Li
  Treat each s(ti,d) ∈ [0,1] as a random variable Si and consider P[ Si > δ | Si ≤ highi ]
  Approximate the local score distribution by an equi-width histogram with n buckets:
    freqi[k] = |{ d ∈ Li : s(ti,d) ∈ [k/n, (k+1)/n) }| / |Li|

For a virtual index list ~Li merged from Li1, …, Lim'
  Consider the max-distribution (assuming feature independence):
    P[ max{Si1, …, Sim'} > δ | Sil ≤ highil ] = 1 − ∏l=1..m' P[ Sil ≤ δ | Sil ≤ highil ]
  Alternatively, construct a meta histogram for the active expansions:
    ~freqi[k] = ( ∑l=1..m' |Lil| · freql[k] ) / ( |Li1| + … + |Lim'| )

For all d in the candidate queue
  Consider the convolution of the local score distributions to predict the aggregated score
  Drop d from the candidate queue if
    P[ ∑i∉E(d) Si > min-k − worstscore(d) | Si ≤ highi ] ≤ ε
  Return the current top-k if the candidate queue is empty
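A rough numpy sketch of this histogram-based pruning test; it is a simplified illustration that represents each bucket by its midpoint and omits the conditioning on Si ≤ highi (which would truncate and renormalize each histogram), so the function names and the exact bucketing are assumptions rather than the paper's implementation.

```python
import numpy as np

def score_histogram(scores, n_buckets=10):
    """Equi-width histogram of local scores in [0,1] (the freq_i[k] above)."""
    hist, _ = np.histogram(scores, bins=n_buckets, range=(0.0, 1.0))
    return hist / max(len(scores), 1)

def qualify_probability(histograms, gap, n_buckets=10):
    """P[ sum of the still-unseen scores > gap ], via convolution of the
    per-list histograms, with each bucket represented by its midpoint."""
    dist = np.array([1.0])
    for h in histograms:
        dist = np.convolve(dist, h)    # distribution over summed bucket indices
    # slot j of the convolved distribution stands for a score of (j + 0.5*m) / n
    values = (np.arange(len(dist)) + 0.5 * len(histograms)) / n_buckets
    return dist[values > gap].sum()

# Drop candidate d if its chance to still reach the top-k is negligible:
#   if qualify_probability(hists_of_unseen_lists, min_k - worstscore_d) <= eps: prune d
```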
Outline
Computational model & background on top-k algorithms
Incremental Merge over inverted lists
Probabilistic candidate pruning
Phrase matching
Experiments & Conclusions
Incremental Merge for Multidimensional Phrases
Query q = {undersea "fiber optic cable"}
Expansion similarities: sim("fiber optic cable", "fiber optic cable") = 1.0,
sim("fiber optic cable", "fiber optics") = 0.8

Nested Top-k operator iteratively prefetches & joins candidate items for each subquery
condition ("getNextItem()")
Propagates candidates in descending order of bestscore(d) values to provide monotonic
upper score bounds
Provides [worstscore(d), bestscore(d)] guarantees to the superordinate top-k operator
Top-level top-k operator performs the phrase tests only for the most promising items,
using random accesses to a term-to-position index
(Expensive predicates & minimal probes [Chang & Hwang, SIGMOD '02])
Single threshold condition for algorithm termination (candidate pruning at the
top-level queue only)

[Figure: a top-level Top-k operator over the inverted list for "undersea" and an
Incr.Merge operator that combines two Nested Top-k operators, one per phrase variant
("fiber optic cable" and "fiber optics"), each joining its per-term inverted lists;
phrase tests are performed by random accesses to a term-to-position index.]
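A small Python sketch of the phrase test executed via random accesses; the term-to-position index is modelled here as a plain dict from (term, doc_id) to the set of word offsets, which is an assumption made only for this illustration.

```python
def phrase_match(positions, phrase_terms, doc_id):
    """True iff the phrase terms occur at consecutive positions in doc_id.

    `positions` maps (term, doc_id) -> set of word offsets, a toy stand-in
    for the disk-resident term-to-position index.
    """
    starts = positions.get((phrase_terms[0], doc_id), set())
    return any(
        all(start + k in positions.get((term, doc_id), set())
            for k, term in enumerate(phrase_terms[1:], start=1))
        for start in starts
    )

# The top-level top-k operator treats this as an expensive predicate and probes
# it only for the most promising candidates, e.g.:
#   phrase_match(positions, ["fiber", "optic", "cable"], "d78")
```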
Outline
Computational model & background on top-k algorithms
Incremental Merge over inverted lists
Probabilistic candidate pruning
Phrase matching
Experiments & Conclusions
Experiments – Aquaint with Fixed Expansions
Aquaint corpus of English news articles (528,155 docs)
50 “hard” queries from TREC 2004 Robust track
WordNet expansions using a simple form of WSD
Okapi-BM25 model for local scores, Dice coefficients as term similarities
Fixed expansion technique (synonyms + first-order hyponyms)
                        #terms     #SA          #RA       runtime   memory      P@10   MAP    relPrec
                        (avg/max)                         [s]
Title-only baseline
  Join&Sort             2.5 / 4     2,305,637   –          –         –           –      –      –
  NRA baseline          2.5 / 4     1,439,815   0          9.4       432 KB      0.252  0.092  1.000
Static expansions
  Join&Sort             35 / 118   20,582,764   –          –         –           –      –      –
  NRA+Phrases, ε=0.0    35 / 118   18,258,834   210,531    245.0     37,355 KB   0.286  0.105  1.000
  NRA+Phrases, ε=0.1    35 / 118    3,622,686   49,783     79.6      5,895 KB    0.238  0.086  0.541
Dynamic expansions
  Incr.Merge, ε=0.0     35 / 118    7,908,666   53,050     159.1     17,393 KB   0.310  0.118  1.000
  Incr.Merge, ε=0.1     35 / 118    5,908,017   48,622     79.4      13,424 KB   0.298  0.110  0.786
Experiments – Aquaint with Fixed Expansions, cont’d
Probabilistic Pruning Performance
Incremental Merge vs. top-k with static expansions
Epsilon (0 ≤ ε ≤ 1) controls pruning aggressiveness
[Figure: two plots as a function of ε (0.0 to 1.0). Left: #sequential and #random
accesses (y-axis up to 20,000,000) for Incremental Merge vs. static expansion.
Right: relative precision (relPrec), P@10, and MAP for Incremental Merge vs. static
expansion.]
Conclusions & Ongoing Work
Increased efficiency
Incremental Merge vs. Join-then-Sort & top-k using static expansions
Very good precision/runtime ratio for probabilistic pruning
Increased retrieval robustness
Largely avoids topic drifts
Modeling of fine-grained semantic similarities
(Incremental Merge & Nested Top-k operators)
Scalability (see paper)
Large expansions (m < 876 terms per query) on Aquaint
Expansions for Terabyte collection (~25,000,000 docs)
Efficient support for XML-IR (INEX Benchmark)
Inverted lists for combined tag-term pairs
e.g., sec=mining
Efficiently supports child-or-descendant axis e.g., //article//sec//=mining
Vague content & structure queries (VCAS)
e.g., //article//~sec=~mining
TopX-Engine, VLDB ’05
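As a toy illustration of the tag-term pair idea (the key scheme and postings below are invented for this sketch, not TopX's actual storage layout), keeping one inverted list per (tag, term) pair lets a condition like sec=mining be answered from a single list:

```python
# Hypothetical in-memory stand-in for tag-term inverted lists.
tag_term_index = {
    ("sec", "mining"):     [("d17", 0.81), ("d42", 0.66)],   # made-up postings
    ("article", "mining"): [("d17", 0.74)],
}

# For //article//sec=mining, fetch the single list keyed by ("sec", "mining")
# (and check the //article ancestry) instead of joining a tag index with a term index.
postings = tag_term_index.get(("sec", "mining"), [])
```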
Thank you!