Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM
9/13/2007
Zhenyu (Victor) Liu
Software Engineer
Google Inc.
vicliu@google.com
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Metasearch – the problem

[Figure: a user poses the query “applied mathematics” to the Metasearch Engine, which forwards it to multiple underlying databases and returns merged search results]

Subproblems
- Database content modeling
  - How does a Metasearch engine “perceive” the content of each database?
- Database selection
  - Selectively issue the query to the “best” databases
- Query translation
  - Different databases have different query formats: “a+b” / “a AND b” / “title:a AND body:b” / etc.
- Result merging
  - Query “applied mathematics” returns top-10 results from both science.com and nature.com; how to present them?

Database content modeling and selection: a simplified example
- A “content summary” of each database
- Selection based on # of matching docs
- Assuming independence between words

Database 1 (total #: 10,000 docs):

  Word w        # of documents that use w    Pr(w)
  applied       4,000                        0.4
  mathematics   2,500                        0.25

Database 2 (total #: 60,000 docs):

  Word w        # of documents that use w    Pr(w)
  applied       200                          0.00333
  mathematics   300                          0.005

Estimated matches for “applied mathematics”:
  10,000 × 0.4 × 0.25 = 1,000 matching documents in Database 1
  > 60,000 × 0.00333 × 0.005 = 1 matching document in Database 2

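Below is a minimal sketch of the word-independence match-count estimate used in this example; the function name and data layout are illustrative, not from the talk.

```python
# Minimal sketch of the word-independence estimate from the example above.
# The content summaries and query come from the slide; the data layout is illustrative.

def estimated_matches(total_docs, doc_freq, query_words):
    """Estimate # of docs matching all query words, assuming word independence."""
    estimate = float(total_docs)
    for w in query_words:
        estimate *= doc_freq.get(w, 0) / total_docs  # Pr(w) = df(w) / |db|
    return estimate

db1 = {"total": 10_000, "df": {"applied": 4_000, "mathematics": 2_500}}
db2 = {"total": 60_000, "df": {"applied": 200, "mathematics": 300}}

query = ["applied", "mathematics"]
print(estimated_matches(db1["total"], db1["df"], query))  # 1000.0
print(estimated_matches(db2["total"], db2["df"], query))  # ~1.0
```
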
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Database content modeling
- Replicate the entire text database
  - most storage demanding
  - fully cooperative database
- Download part of a text database
  - more storage demanding
  - non-cooperative database
- Obtain a full content summary
  - less storage demanding
  - fully cooperative database
- Approximate the content summary via sampling
  - least storage demanding
  - non-cooperative database

Replicate the entire database
- E.g., www.google.com/patents, a replica of the entire USPTO patent document database

Download a non-cooperative database
- Objective: download as much as possible
- Basic idea: “probing” (querying with short queries) and downloading all results
- Practically, can only issue a fixed # of probes (e.g., 1000 queries per day)

[Figure: the Metasearch Engine issues probe queries such as “applied” and “mathematics” through the search interface of a text database]

Harder than the “set-coverage” problem
- All docs in a database db form the universe
- Each probe (e.g., “applied”, “mathematics”) corresponds to a subset
- Find the least # of subsets (probes) that covers db, or the max coverage with a fixed # of subsets (probes), assuming all docs are equal
  - NP-complete
  - Greedy algo. proved to be the best-possible P-time approximation algo.
- Here, the cardinality of each subset (# of matching docs for each probe) is unknown!

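For reference, here is a minimal sketch of the classic greedy max-coverage heuristic in the idealized setting where each probe’s result set is known in advance (which, as noted above, is not the case for a non-cooperative database); the data and names are illustrative.

```python
# Classic greedy max-coverage heuristic, shown for the idealized case where the
# result set of every probe is known. (In the metasearch setting those sets are
# unknown, which motivates the pseudo-greedy/adaptive methods on the next slides.)

def greedy_max_coverage(probe_results, budget):
    """probe_results: dict word -> set of doc ids it matches.
    Returns up to `budget` probe words chosen to maximize coverage."""
    covered, chosen = set(), []
    for _ in range(budget):
        # pick the probe with the largest gain over what is already covered
        word, gain = max(
            ((w, len(docs - covered)) for w, docs in probe_results.items()),
            key=lambda x: x[1],
        )
        if gain == 0:
            break
        chosen.append(word)
        covered |= probe_results[word]
    return chosen, covered

probes = {
    "applied": {1, 2, 3},
    "mathematics": {3, 4},
    "research": {5},
}
print(greedy_max_coverage(probes, budget=2))  # (['applied', 'mathematics'], {1, 2, 3, 4})
```
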
Pseudo-greedy algorithms [NPC05]
- Greedy set-coverage: choose the subset with the max “cardinality gain”
- When the cardinality of subsets is unknown:
  - Assume the cardinality of subsets is (proportionally) the same across databases
    - e.g., build a database with Web pages crawled from the Internet, rank single words according to their frequency
  - Start with certain “seed” queries, adaptively choose query words within the docs returned
    - Choice of probing words varies from database to database

An adaptive method
- D(wi) – the subset of docs returned by probing with word wi
- With w1, w2, …, wn already issued, choose the next probe word wn+1 as a word used by the docs in ∪i=1…n D(wi), maximizing the cardinality gain:

  wn+1 = argmax | D(wn+1) − ∪i=1…n D(wi) |

- Rewritten as |db|·Pr(wn+1) − |db|·Pr(wn+1 ∧ (w1 ∨ … ∨ wn))
  - Pr(w): prob. of w appearing in a doc of db

An adaptive method (cont’d)
- How to estimate P̃r(wn+1)?
- Zipf’s law: Pr(w) = α(R(w)+β)^(−γ), where R(w) is the rank of w in a descending order of Pr(w)
- Assuming the ranking of w1, w2, …, wn and the other words remains the same in the downloaded subset as in db
- Interpolate:

[Figure: single words ranked by Pr(w) in the downloaded documents; the known Pr(w) values of w1, w2, …, wn; a fitted Zipf’s-law curve; the interpolated P̃r(w) of the remaining words read off the curve]

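A minimal sketch of this fit-and-interpolate step, assuming scipy’s curve_fit as the fitting routine; the ranks and probabilities below are made-up illustrative inputs, not data from the talk.

```python
# Hedged sketch: fit Pr(w) = alpha * (rank + beta)^(-gamma) to the words whose
# Pr(w) is known (the probe words), then read an estimated Pr off the fitted
# curve for a remaining word, using its rank in the downloaded sample.
import numpy as np
from scipy.optimize import curve_fit

def zipf(rank, alpha, beta, gamma):
    return alpha * (rank + beta) ** (-gamma)

# ranks (in the downloaded sample) and known Pr(w) of the words already probed
known_ranks = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
known_pr = np.array([0.40, 0.25, 0.10, 0.05, 0.02])

params, _ = curve_fit(zipf, known_ranks, known_pr, p0=[0.5, 1.0, 1.0],
                      bounds=(0.0, np.inf))

# estimated Pr for an unprobed word that ranks 7th in the downloaded sample
print(zipf(7.0, *params))
```
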
Obtain an exact content summary
- C(db) for a database db
  - Statistics about the words in db, e.g., df – document frequency:

      w             df
      applied       4000
      mathematics   2500
      research      1000

- Standards and proposals for cooperative databases to follow to export C(db)
  - STARTS [GCM97]
    - Initiated by Stanford; attracted main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]
    - Initiated by Columbia U.

Approximate the content summary
- Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy
- Basic idea: probing and downloading sample docs [CC01] (sketched below)
- Example: df as the content-summary statistic
  1. Pick a single word as the query, probe the database
  2. Download a fraction of the results, e.g., top-k
  3. If the terminating condition is unsatisfied, go to 1
  4. Output <w, d̃f> based on the sample docs downloaded

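A minimal sketch of this sampling loop, assuming a toy search(word, k) interface and a fixed probe budget as the terminating condition; the names and the toy corpus are illustrative rather than taken from [CC01].

```python
# Sketch of query-based sampling: probe with single words, keep top-k results,
# and tally document frequencies over the sampled docs.
import random
from collections import Counter

def query_based_sampling(search, seed_words, k=4, max_probes=100):
    """Approximate df over a sample of docs obtained by single-word probes."""
    sample_docs = {}          # doc_id -> words of the doc
    df_tilde = Counter()      # word -> document frequency within the sample
    candidates = list(seed_words)
    for _ in range(max_probes):                    # terminate after a probe budget
        word = random.choice(candidates)           # 1. pick a single-word query
        for doc_id, words in search(word, k):      # 2. download top-k results
            if doc_id not in sample_docs:
                sample_docs[doc_id] = words
                df_tilde.update(set(words))
                candidates.extend(words)           # later probes come from the sample
    return df_tilde                                # 4. <w, df~> over the sampled docs

# toy "database" and search interface standing in for a real, non-cooperative one
corpus = {1: ["applied", "mathematics"], 2: ["applied", "research"], 3: ["mathematics"]}
def toy_search(word, k):
    return [(d, ws) for d, ws in corpus.items() if word in ws][:k]

print(query_based_sampling(toy_search, seed_words=["applied"], max_probes=10))
```
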
Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps’ law [Hea78]: |W| = K·n^β
  - n – # of words scanned
  - W – set of distinct words encountered
  - K – constant, typically in [10, 100]
  - β – constant, typically in [0.4, 0.6]
- Empirically verified [CC01]

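A quick numeric illustration of why the vocabulary grows much more slowly than the number of words scanned; the constants are arbitrary picks from the typical ranges above.

```python
# Heaps' law |W| = K * n^beta with illustrative constants from the typical ranges.
K, beta = 50, 0.5

for n in (10_000, 1_000_000, 100_000_000):      # words scanned
    print(n, int(K * n ** beta))                # distinct words encountered
# 10,000 words scanned      -> ~5,000 distinct words
# 100,000,000 words scanned -> ~500,000 distinct words
```
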
Estimate document frequency
- How to identify the d̃f of a word in the entire database?
  - w used as a query during sampling: its df is typically revealed in the search results
  - w’ that only appears in the sampled docs: need to estimate d̃f from the doc sample
- Apply Zipf’s law & interpolate [IG02]
  1. Rank w and w’ based on their frequency in the sample
  2. Curve-fit based on the true df of the words w
  3. Interpolate the estimated d̃f of w’ onto the fitted curve

What if db changes over time?
- So does its content summary C(db), and C̃(db) [INC05]
- Empirical study
  - 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence as the “change” measure, between the “latest” snapshot and the snapshot time t ago

[Figure: KL divergence vs. t – db does change! How do we model the change? When do we resample and get a new C̃(db)?]

Model the change
- KLdb(t) – the KL divergence between the current C̃(db) and the C̃(db, t) of time t ago
- T: the time when KLdb(t) exceeds a pre-specified threshold τ
- Applying principles of Survival Analysis (a small numeric check follows below)
  - Survival function Sdb(t) = 1 − Pr(T ≤ t)
  - Hazard function hdb(t) = −(dSdb(t)/dt) / Sdb(t)
  - How to compute hdb(t) and then Sdb(t)?

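A small numeric check of the survival/hazard relationship, using the Weibull form Sbase(t) = e^(−λt^γ) that appears in the training slides below; the λ and γ values here are arbitrary illustrative choices.

```python
# Numeric check that h(t) = -(dS/dt)/S for the Weibull survival function
# S(t) = exp(-lam * t**gamma); closed form: h(t) = lam * gamma * t**(gamma - 1).
import numpy as np

lam, gamma = 0.088, 0.8   # illustrative values

def S(t):
    return np.exp(-lam * t ** gamma)

t, eps = 5.0, 1e-6
h_numeric = -(S(t + eps) - S(t - eps)) / (2 * eps) / S(t)   # central difference
h_closed = lam * gamma * t ** (gamma - 1)

print(h_numeric, h_closed)   # the two agree to several decimal places
```
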
Learn the hdb(t) of database change
- Cox proportional hazards regression model
  - ln( hdb(t) ) = ln( hbase(t) ) + β1x1 + …, where each xi is a predictor variable
- Predictors
  - The pre-specified threshold τ
  - Web domain of db: “.com”, “.edu”, “.gov”, “.org”, “others” → 5 binary “domain variables”
  - ln( |db| )
  - avg KLdb(1 week) measured in the training period
  - …

Train the Cox model
- A stratified Cox model is applied
  - The domain variables didn’t satisfy the Cox proportional-hazards assumption
  - Stratify on each domain, i.e., a separate hbase(t) / Sbase(t) for each domain
- Training Sbase(t) for each domain
  - Assuming a Weibull distribution, Sbase(t) = e^(−λt^γ)

Training result
- γ ranges in (0.57, 1.08) → Sbase(t) is not an exponential distribution

[Figure: the fitted Sbase(t) curves vs. t]

Training result (cont’d)

  predictor             β value
  ln( |db| )            0.094
  avg KLdb(1 week)      6.762
  τ                     -1.305

- A larger db takes less time to have KLdb(t) exceed τ
- Databases that change faster during a short period are more likely to change later on

How to use the trained model?
- The model gives Sdb(t) → the likelihood that db “has not changed much”
- An update policy to periodically resample each db
  - Intuitively, maximize ∑db Sdb(t)
  - More precisely, maximize the time-averaged survival
    S̄ = lim t→∞ (1/t) ∫0→t [ ∑db Sdb(t) ] dt
- A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week
- Subject to practical constraints, e.g., a total update cap per week

Derive an optimal update policy
- Find {fdb} that maximizes S̄ under the constraint ∑db fdb = F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method (a numerical sketch follows below)
- Sample results:

  db                 λ       F = 4/week   F = 15/week
  tomshardware.com   0.088   1/46         1/5
  usps.com           0.023   1/34         1/12

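Below is only a hedged numerical sketch of this constrained maximization, not the talk’s derivation. It assumes that each resample resets a database’s survival clock, so the time-averaged survival under frequency f is f·∫0→1/f Sdb(t) dt, and it solves the problem with scipy instead of a hand-worked Lagrange multiplier; the Weibull parameters are illustrative.

```python
# Hedged sketch of the constrained update-frequency allocation under the
# assumptions stated in the lead-in above.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

dbs = {"tomshardware.com": (0.088, 1.0), "usps.com": (0.023, 1.0)}  # (lam, gamma), illustrative
F = 4.0  # total updates per week across all databases

def avg_survival(f, lam, gamma):
    # time-averaged survival when resampling every 1/f time units
    integral, _ = quad(lambda t: np.exp(-lam * t ** gamma), 0.0, 1.0 / f)
    return f * integral

def neg_total(freqs):
    return -sum(avg_survival(f, lam, g)
                for f, (lam, g) in zip(freqs, dbs.values()))

res = minimize(neg_total, x0=[F / len(dbs)] * len(dbs),
               bounds=[(1e-3, F)] * len(dbs),
               constraints=[{"type": "eq", "fun": lambda f: sum(f) - F}])
print(dict(zip(dbs, res.x)))   # update frequencies allocated per database
```
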
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Database selection
- Select the databases to issue a given query to
  - Necessary when the Metasearch engine does not have an entire replica of each database – most likely it only has a content summary
  - Reduces the query load in the entire system
- Formalization
  - Query q = <w1, …, wm>, databases db1, …, dbn
  - Rank the databases according to their “relevancy score” r(dbi, q) to query q

Relevancy score
- # of matching docs in db
- Similarity between q and the top docs returned by db
  - Typically vector-space similarity (dot-product) between q and a doc
  - Sum / Avg of the similarities of the top-k docs of each db, e.g., top-10
  - Sum / Avg of the similarities of the top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - Explicit relevance feedback
  - User click-behavior data

Estimating r(db,q)
- Typically, r(db, q) is unavailable
- Estimate r̃(db, q) based on C(db), or C̃(db)

Estimating r(db,q), example 1 [GGT99]
- r(db, q): # of matching docs in db
- Independence assumption: the query words w1, …, wm appear independently in db

  r̃(db, q) = |db| × ∏ wj∈q df(db, wj) / |db|

- df(db, wj): document frequency of wj in db – could be d̃f(db, wj) from C̃(db)

Estimating r(db,q), example 2 [GGT99]
- r(db, q) = ∑ {d∈db | sim(d, q) > l} sim(d, q)
  - d: a doc in db
  - sim(d, q): vector dot-product between d & q, with each word in d & q weighted by a common tf-idf weighting
  - l: a pre-specified threshold

Estimating r(db,q), example 2 (cont’d)
- Content summary C(db) required:
  - df(db, w): doc frequency
  - v̄(db, w) = ∑ d∈db (weight of w in d’s vector)
  - <v̄(db, w1), v̄(db, w2), …> – the “centroid” of the entire db viewed as a “cluster of doc vectors”

Estimating r(db,q), example 2 (cont’d)
- l = 0: the sum of all q-doc similarity values of db
  - r(db, q) = ∑ d∈db sim(d, q)
  - r̃(db, q) = r(db, q) = <v(q, w1), …> · <v̄(db, w1), v̄(db, w2), …>
    - v(q, w): weight of w in the query vector
- l > 0?

Estimating r(db,q), example 2 (cont’d)
- Assuming a uniform weight of w among all docs using w (see the sketch after this slide)
  - i.e., the weight of w in any doc = v̄(db, w) / df(db, w)
- Highly-correlated query words scenario
  - If df(db, wi) < df(db, wj), every doc using wi also uses wj
  - Words in q sorted s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm)
  - r̃(db, q) = ∑ i=1…p v(q, wi)·v̄(db, wi) + df(db, wp)·[ ∑ j=p+1…m v(q, wj)·v̄(db, wj) / df(db, wj) ],
    where p is determined by some criteria [GGT99]
- Disjoint query words scenario
  - No doc using wi uses wj
  - r̃(db, q) = ∑ {i=1…m | df(db, wi) > 0 ∧ v(q, wi)·v̄(db, wi)/df(db, wi) > l} v(q, wi)·v̄(db, wi)

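A minimal sketch of two of the cases above, the l = 0 centroid dot-product and the disjoint-query-words estimator, under the stated uniform-weight assumption; the dict-based content-summary layout, function names, and toy numbers are illustrative.

```python
# Sketch of two of the estimators described above, under the slide's
# uniform-weight assumption (weight of w in any doc = v_bar(db, w) / df(db, w)).

def r_tilde_sum_all(query_vec, centroid):
    """l = 0 case: r~(db, q) = <v(q, .)> dot <v_bar(db, .)>."""
    return sum(wq * centroid.get(w, 0.0) for w, wq in query_vec.items())

def r_tilde_disjoint(query_vec, centroid, df, l):
    """Disjoint-query-words case: keep only words whose per-doc contribution
    v(q, w) * v_bar(db, w) / df(db, w) exceeds the threshold l."""
    total = 0.0
    for w, wq in query_vec.items():
        if df.get(w, 0) > 0 and wq * centroid.get(w, 0.0) / df[w] > l:
            total += wq * centroid[w]
    return total

# toy content summary: document frequencies and centroid weights v_bar(db, w)
df = {"applied": 4000, "mathematics": 2500}
centroid = {"applied": 1200.0, "mathematics": 900.0}
q = {"applied": 1.0, "mathematics": 1.0}

print(r_tilde_sum_all(q, centroid))            # 2100.0
print(r_tilde_disjoint(q, centroid, df, l=0.2))
```
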
Estimating r(db,q), example 2 (cont’d)
- The ranking of databases based on r̃(db, q) was empirically evaluated in [GGT99]

A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An (observed) error distribution for each db
  - distribution of db1 ≠ distribution of db2
- Definition of error: relative

  err(db, q) = ( r(db, q) − r̃(db, q) ) / r̃(db, q)

Modeling the errors: a motivating experiment
- dbPMC: PubMedCentral, www.pubmedcentral.nih.gov
- Two query sets, Q1 and Q2 (healthcare related)
  - |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
- Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2

[Figure: the error probability distribution of err(dbPMC, q) over q ∈ Q1 closely resembles that over q ∈ Q2]

- Further verified through statistical tests (Pearson χ2)

Implications of the experiment
- On a text database
  - Similar error behavior among sample queries
  - Can sample a database and summarize the error behavior into an Error Distribution (ED)
  - Use the ED to predict the error for a future, unseen query
- Sampling-size study [LLC04]
  - A few hundred sample queries are good enough

From an Error Distribution (ED) to a Relevancy Distribution (RD)
- Database: db1. Query: qnew
  ① The existing estimation method gives r̃(db1, qnew) = 1000
  ② The ED for db1, from sampling: err(db1, qnew) = −50% / 0% / +50% with probabilities 0.4 / 0.5 / 0.1
  ③ By definition, r(db1, qnew) = ( err(db1, qnew) + 1 ) × r̃(db1, qnew)
  ④ A Relevancy Distribution (RD): r(db1, qnew) = 500 / 1000 / 1500 with probabilities 0.4 / 0.5 / 0.1

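A small sketch of the ED-to-RD conversion defined above; the ED probabilities are the ones read off the slide’s figure, and the function name is illustrative.

```python
# ED -> RD conversion per the definition above: r = (err + 1) * r_tilde.

def ed_to_rd(r_tilde, error_distribution):
    """error_distribution: dict mapping relative error -> probability."""
    return {(err + 1.0) * r_tilde: p for err, p in error_distribution.items()}

ed_db1 = {-0.5: 0.4, 0.0: 0.5, +0.5: 0.1}
print(ed_to_rd(1000, ed_db1))   # {500.0: 0.4, 1000.0: 0.5, 1500.0: 0.1}
```
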
RD-based selection
- Estimation-based: db1 > db2, since r̃(db1, qnew) = 1000 > r̃(db2, qnew) = 650
- RD-based: db1 < db2 ( Pr(db1 < db2) = 0.85 )

[Figure: db1’s ED (−50% / 0% / +50%) and r̃(db1, qnew) = 1000 give an RD over r(db1, qnew) = 500 / 1000 / 1500; db2’s ED (0% / +100%) and r̃(db2, qnew) = 650 give an RD over r(db2, qnew) = 650 / 1300; under these RDs, db2 is the better choice with probability 0.85]

Correctness metric
- Terminology
  - DBk: the k databases returned by some method
  - DBtopk: the actual answer
- How correct is DBk compared to DBtopk?
  - Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, 0 otherwise
  - Partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k
  - Cora(DBk) = Corp(DBk) for k = 1

Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

                                     k=1                     k=3              k=3
                                     Avg(Cora) = Avg(Corp)   Avg(Cora)        Avg(Corp)
  Estimation-based selection
  (term-independence estimator)      0.471                   0.301            0.699
  RD-based selection                 0.651 (+38.2%)          0.478 (+58.8%)   0.815 (+30.9%)

Probing to improve correctness
- RD-based selection picks db2:
  0.85 = Pr(db2 > db1) = Pr({db2} = DBtop1)
       = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1)
       = E[Cora({db2})]
- Probe dbi: contact dbi to obtain its exact relevancy
- After probing db1 and observing r(db1, q) = 500:
  E[Cora({db2})] = Pr(db2 > db1) = 1

[Figure: the RDs of db1 (500 / 1000 / 1500) and db2 (650 / 1300); once r(db1, q) = 500 is known, every possible relevancy of db2 exceeds it]

Computing the expected correctness
- Expected absolute correctness

  E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0)
               = Pr(Cora(DBk) = 1)
               = Pr(DBk = DBtopk)

- Expected partial correctness

  E[Corp(DBk)] = ∑ 0≤l≤k (l/k)·Pr( Corp(DBk) = l/k )
               = ∑ 0≤l≤k (l/k)·Pr( |DBk ∩ DBtopk| = l )

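A hedged sketch that estimates both expectations by Monte-Carlo sampling from each database’s RD; the RDs below are the toy numbers from the earlier slides, and the sampling approach is an illustrative stand-in for the analytic computation over the RDs.

```python
# Estimate E[Cor_a(DB_k)] and E[Cor_p(DB_k)] by sampling joint outcomes from the RDs.
import random

def sample_rd(rd):
    """rd: dict relevancy_value -> probability."""
    return random.choices(list(rd), weights=list(rd.values()))[0]

def expected_correctness(rds, chosen, k, trials=100_000):
    """rds: {db_name: RD}; chosen: the k databases some method returned."""
    abs_sum = part_sum = 0.0
    for _ in range(trials):
        scores = {db: sample_rd(rd) for db, rd in rds.items()}
        topk = set(sorted(scores, key=scores.get, reverse=True)[:k])
        abs_sum += 1.0 if set(chosen) == topk else 0.0
        part_sum += len(set(chosen) & topk) / k
    return abs_sum / trials, part_sum / trials

rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},
    "db2": {650: 0.1, 1300: 0.9},
}
print(expected_correctness(rds, chosen=["db2"], k=1))  # ~ (0.85, 0.85)
```
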
Adaptive probing algorithm: APro
- User-specified correctness threshold: t
- Maintain the RDs of the probed and unprobed databases
- If some DBk has E[Cor(DBk)] ≥ t, return that DBk; otherwise probe one more database and repeat

[Figure: flowchart of APro over databases db1, …, dbn, moving databases one by one from “unprobed” to “probed”]

Which database to probe?
- A greedy strategy (a sketch follows below):
  - The stopping condition: E[Cor(DBk)] ≥ t
  - Once probed, which database leads to the highest E[Cor(DBk)]?
  - Suppose we will probe db3:
    - if r(db3, q) = ra, max E[Cor(DBk)] = 0.85
    - if r(db3, q) = rb, max E[Cor(DBk)] = 0.8
    - if r(db3, q) = rc, max E[Cor(DBk)] = 0.9
  - Probe the database that leads to the largest “expected” max E[Cor(DBk)]

[Figure: the RDs of db1, …, db4, with ra, rb, rc as the possible outcomes of probing db3]

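A hedged sketch of this greedy choice: for each unprobed database, average over its RD the best achievable E[Cora(DBk)] that would result if its relevancy became known, and probe the database with the largest such expectation. The exhaustive enumeration and the toy RDs are illustrative, not the talk’s implementation.

```python
# Greedy probe choice via exhaustive enumeration over small discrete RDs.
from itertools import product
from collections import defaultdict

def best_expected_cora(rds, known, k):
    """Best achievable E[Cor_a(DB_k)]: enumerate joint outcomes of the unprobed
    RDs (probed scores in `known` are fixed) and take the most probable top-k set."""
    names = list(rds)
    prob_of_topk = defaultdict(float)
    for outcome in product(*(rds[n].items() for n in names)):
        scores, p = dict(known), 1.0
        for name, (value, prob) in zip(names, outcome):
            scores[name] = value
            p *= prob
        topk = frozenset(sorted(scores, key=scores.get, reverse=True)[:k])
        prob_of_topk[topk] += p
    return max(prob_of_topk.values())

def choose_probe(rds, known, k):
    """Probe the database whose observation maximizes the expected best E[Cor_a]."""
    def value_of_probing(name):
        rest = {n: d for n, d in rds.items() if n != name}
        return sum(p * best_expected_cora(rest, {**known, name: v}, k)
                   for v, p in rds[name].items())
    return max(rds, key=value_of_probing)

rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},
    "db2": {650: 0.1, 1300: 0.9},
    "db3": {700: 0.5, 1200: 0.5},
}
print(choose_probe(rds, known={}, k=1))
```
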
Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

[Figure: three plots of avg Cora (k=1), avg Cora (k=3), and avg Corp (k=3) vs. the # of databases probed (0–5), comparing adaptive probing APro against the term-independence estimator]

The “lazy TA problem”
- The same problem, generalized & “humanized”
- After the final exam, the TA wants to find out the top-scoring students
- The TA is “lazy” and doesn’t want to score all the exam sheets
- Input: every student’s score as a known distribution
  - Observed from previous quizzes and mid-term exams
- Output: a scoring strategy
  - Maximizes the correctness of the “guessed” top-k students

Further study of this problem [LSC05]
- Proves that greedy probing is optimal in special cases
- More interesting factors to be explored:
  - “Optimal” probing strategy in general cases
  - Non-uniform probing cost
  - Time-variant distributions

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Summary
- Metasearch – a challenging problem
- Database content modeling
  - Sampling enhanced by proper application of Zipf’s law and Heaps’ law
  - Content change modeled using Survival Analysis
- Database selection
  - Estimation of database relevancy based on simplifying assumptions
  - A probabilistic framework that models the estimation error as a distribution
  - An “optimal” probing strategy for a collection of distributions as input

References
- [CC01] J.P. Callan and M. Connell, “Query-Based Sampling of Text Databases,” ACM Trans. on Information Systems, 19(2), 2001
- [GCM97] L. Gravano, C.-C. K. Chang, H. Garcia-Molina, A. Paepcke, “STARTS: Stanford Proposal for Internet Meta-searching,” in Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, 1997
- [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, “GlOSS: Text-Source Discovery over the Internet,” ACM Trans. on Database Systems, 24(2), 1999
- [GIG01] N. Green, P. Ipeirotis, L. Gravano, “SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001
- [Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978
- [IG02] P. Ipeirotis, L. Gravano, “Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection,” in Proc. of the 28th VLDB Conf., 2002

References (cont’d)
- [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, “Modeling and Managing Content Changes in Text Databases,” in Proc. of the 21st IEEE Int’l Conf. on Data Eng. (ICDE), 2005
- [LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, “A Probabilistic Approach to Metasearching with Adaptive Probing,” in Proc. of the 20th IEEE Int’l Conf. on Data Eng. (ICDE), 2004
- [LSC05] Z. Liu, K.C. Sia, J. Cho, “Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty,” in Proc. of the ACM Annual Symposium on Applied Computing, 2005
- [NPC05] A. Ntoulas, P. Zerfos, J. Cho, “Downloading Hidden Web Content,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005