Comparing Offline and Online Statistics Estimation for Text Retrieval from Overlapped Collections

advertisement
Comparing Offline and Online Statistics
Estimation for Text Retrieval from
Overlapped Collections
MS Thesis Defense
Bhaumik Chokshi
Committee Members:
Prof. Subbarao Kambhampati (Chair)
Prof. Yi Chen
Prof. Hasan Davulcu
My MS Work

Collection Selection : ROSCO

Query Processing over Incomplete Autonomous
Databases: QPIAD

Handling Query Imprecision and Data Incompleteness:
QUIC
Multi Source Information Retrieval

In multi source information retrieval problem, searching every
information source is not efficient. The retrieval system must choose
one collection or subset of collections to call to answer a given query.
Overlapping Collections

Many real world collections have significant overlap.

For example, multiple bibliography collections (e.g., ACMDL, IEEE, DBLP
etc.) may store some of the same papers and multiple news archives (e.g.,
New York Times, Washington Post etc.) may store very similar news stories.
ACM
IEEE
CSB
DBLP
Science
• How likely it is that a given collection
has documents relevant to the query.
• Whether a collection will provide novel
results given the collections already
selected.
Related Work

Most collection selection approaches do not consider
overlap


Existing systems like CORI, ReDDE try to create a representative for each
collection based on term and document frequency information.
ReDDE uses collection samples to estimate relevance of each collection.
Same samples can be used to estimate overlap among collections.

16.6% of the documents in runs submitted to the TREC
2004 terabyte track were redundant. [Bernstein and
Zobel, 2005]

Using coverage and overlap statistics in context of
relational data sources. [Nie and Kambhampati, 2004]

Overlap among tuples can be identified in a much straightforward way
compared to text documents.
Challenges Involved

Need for query specific overlap


Overlap assessment offline



Two collections may have low overlap as a whole but can have high overlap
for a particular set of queries.
Offline approach can store statistics for general keywords and map incoming
query to these keywords to obtain relevance and overlap statistics.
Online approach can use the samples to estimate relevance and overlap
statistics.
Efficiently determine
collections

COSC
O online
vs.
true
overlap
between
True overlap between collections can be estimated using result to result
comparison for different collections.
Context of this work

COSCO takes overlap into account while determining collection
order.

But it does it offline.
Samples built for the collections can be used to estimate overlap statistics which
can be a better estimate as it is for a particular query.

COSCO estimates overlap using bag similarity over result-set
document.
True overlap between collections can be obtained using result to result
comparison.

COSCO does not do experiments on TREC data.
Contributions

ROSCO, an online approach which estimates overlap statistics from
the samples of the collections.

Comparison of offline (COSCO) and online (ROSCO) approaches
for statistics estimation for text retrieval from overlapping collections.
Outline





COSCO and ROSCO Architecture
ROSCO Approach
Empirical Evaluation
Other Contributions
Conclusion
COSCO Architecture
ROSCO Architecture
Outline





COSCO and ROSCO Architecture
ROSCO Approach
Empirical Evaluations
Other Contributions
Conclusion
ROSCO (Offline Component)
Collection representation through
query based sampling
C2
Training
Queries
S2
Training
Queries
Samples
Union of
Samples
S1
C1
ROSCO (Offline Component)
Collection Size Estimation
Size
Estimates
C2
C1
Random
Queries
Random
Queries
Samples
S2
 dCi
Ci .EstimatedSize  average
 dS
 i
S1

 * Si .size


Number of documents
returned from collection Ci
Number of documents
returned from sample Si
ROSCO (Offline Component)
Grainy Hash Vector
Sample
w bits
n bits
Hash
GHV
ROSCO (Online Component)
Assessing Relevance
S2
Query
Samples
S1
Query
Size
Estimates
Top-k
documents
for each
collection
Determine
top –k
relevant
documents
for each
collections
Union of
Samples
ROSCO (Online Component)
Assessing Overlap and Combining with Relevance
GHVs of
documents of
the collections
selected till now
Size
Estimates
Estimate no. of
relevant new
documents for each
collection
GHVs of
the top-k
documents of
each collection
Collection with maximum
no. of new relevant documents
Comparison of ROSCO and COSCO

COSCO:

Offline method for estimating
coverage and overlap statistics.

Gets estimate for a query by
using statistics for corresponding
frequent item sets. Statistics for
“data mining integration” can be
obtained by using statistics from
“data
mining”
and
“data
integration”.

This way of computing statistics
can lead to a much different
estimate from actual statistics.

ROSCO:

Online method for estimating
coverage and overlap statistics.

Gets estimate by sending query
to sample which can give better
estimate for a particular query at
hand.

Success of this approach
depends on the quality of
sample. Sometimes it can be
hard to obtain a good sample of
the collection.
Outline





ROSCO and COSCO Architecture
ROSCO Approach
Empirical Evaluation
Other Contributions
Conclusion
Empirical Evaluation

Whether ROSCO can perform better in an environment of
overlapping text collections compared to the approaches which do
not consider overlap.

Compare ROSCO and COSCO in presence of overlap among
collections.
Testbed Creation

Test Data



TREC Genomics data.
50 queries with their relevance judgment.
Testbed Creation

100 disjoint clusters from 200,000 documents to create topic specific
collections.

uniform-50cols:




50 collections.
Each of the 200,000 documents is randomly assigned to 10 different
collections.
Total of 2 million documents.
skewed-100cols:




100 collections.
Each of the 100 clusters is randomly assigned to 10 different collections.
Total of 2 million documents.
As each cluster is assigned to multiple collections, topic specific overlap
among collections is more prominent in this testbed compared to uniform50cols.
Collection Size and Relevance Statistics
uniform-50cols
skewed-50cols
Testbed 1
Testbed 2
Mean Relevant Documents
Mean Relevant Documents
30
25
20
15
10
5
0
1
11
21
31
Collection
41
Collection Overlap Statistics
uniform-50cols
skewed-100cols
Tested Methods

COSCO, ReDDE and ROSCO.

Greedy Ideal for establishing performance bound

Setting up COSCO


Setting up ROSCO and ReDDE






40 training queries to each of the collection
Training Queries: 25 queries for each collection.
Sample size: 10% of the actual collections.
10 size estimates
Duplicate detection: GHV containing 32 vectors of 2 bits each (total of 64 bits).
Mismatches allowed: 0 mismatch allowed for exact duplicates
Evaluation


Recall after each collection called. (Central evaluation and TREC evaluation)
Processing time.
Greedy Ideal

This method attempts to greedily maximize
percentage recall assuming oracular information.
the

It is used for establishing performance bound and as a
baseline ranking method in evaluation.
Experimental Results (Central Evaluation)



10 queries different from training queries for evaluation.
5-fold cross validation
Ranking by a particular method
Evaluation metric:
Ranking by the baseline method

For both the testbeds ROSCO performs better than ReDDE and
COSCO by 7-8% in terms of recall metric R.
Experimental Results (TREC Evaluation)


For both testbeds ROSCO is performing better than ReDDE and
ROSCO in terms of recall metric R.
As skewed-100col testbed is created by topic specific clusters,
ROSCO shows more improvement compared to uniform-50col
testbed over other approaches.
Experimental Results
(Processing Cost)

Processing time for ReDDE and ROSCO is more compared to
COSCO. But no. of collections called by ReDDE and ROSCO are
less for same amount of recall.
Summary of Experimental Results

Evaluated ROSCO, ReDDE and COSCO on two different testbeds
with overlapping collections.

ROSCO shows improvement over ReDDE and COSCO by



7-8% for central evaluations on both testbeds.
TREC evaluation: 3-5% on uniform-50cols and 8-10% on clustered-100cols.
Processing time for ReDDE and ROSCO is more compared to
COSCO. But no. of collections called by ReDDE and ROSCO are
less for same amount of recall.
Outline





ROSCO and COSCO Architecture
ROSCO Approach
Empirical Evaluation
Other Contributions
Conclusion
Other Contributions (QPIAD Project)
F Measure based query rewriting for incomplete autonomous web databases
Given a query Q:(Body Style=Convt) retrieve all relevant tuples
Id
Make
Model
Year
Body
Id
Make
Model
Year
Body
1
Audi
A4
2001
Convt
1
Audi
A4
2001
Convt
2
BMW
Z4
2002
Convt
2
BMW
Z4
2002
Convt
3
Porsche
Boxster
2005
Convt
4
BMW
Z4
2003
NULL
3
Porsche
Boxster
2005
Convt
5
Honda
Civic
2004
NULL
6
Toyota
Camry
2002
Sedan
7
Audi
A4
2006
Ranked Relevant
Uncertain Answers
AFD: Model~> Body style
Select Top K Rewritten Queries
NULL
Q1’: Model=A4
Q2’: Model=Z4
Re-order queries based
Q3’: Model=Boxster
on Estimated Precision
Id
Make
Model
Year
Body
Confidence
4
BMW
Z4
2003
NULL
0.7
7
Audi
A4
2006
NULL
0.3
Other Contributions (QPIAD Project)
F Measure based query rewriting for incomplete autonomous web databases


Sources may impose resource limitations
on the # of queries we can issue
Therefore, we should select only the top-K
queries while ensuring the proper balance
between precision and recall

SOLUTION: Use F-Measure based
selection with configurable alpha parameter

α=1
P=R

α<1
P>R

α>1
P<R

JOINS
Co-author on VLDB 2007 research paper
P – Estimated Precision
R – Estimated Recall
(based on P & Est. Sel.)

1     P  R 
F 
  P  R 
Other Contributions (QUIC Project)
Handling unconstrained attributes in presence of query imprecision
and data incompleteness
Tuples matching user query can be ranked based on unconstrained attributes.
[Surajit Chaudhuri, Gautam Das, Vagelis Hristidis and Gerhard Weikum, 2004]
Given a query Q: model = Civic,
an Accord with sedan body style may be
more relevant than Civic with coupe
body style.
In absence of query log, relevance for
unconstrained attributes can be
approximated from database.
10 queries, 13 users
1
R   r 9
2
R Metric
0.35
0.3
w/o unconstrained attributes
with unconstrained attributes
0.25
0.2
0.15
0.1
0.05
0
Make/Year/Price Make/Year/Mileage Model/Year/Price
Approach considering unconstrained attributes performs better than
the one ignoring unconstrained attributes.
Co-author on CIDR 2007 demo paper
Body/Year/Price
Model/Year
Outline





ROSCO and COSCO Architecture
ROSCO Approach
Empirical Evaluation
Other Contributions
Conclusion
Conclusion

An online method ROSCO for overlap estimation.

Comparison of offline and online approaches for text retrieval in an
environment composed of overlapping collections.

Results of empirical evaluation show that online method for overlap
estimation performs better than offline method for overlap estimation as
well as method which does not consider overlap among collections.

Co-author on two other works appearing in
CIDR – 2007 and VLDB - 2007
Download