Set-Based Model: A New Approach
for Information Retrieval
Bruno Pôssas
Wagner Meira Jr.
Nivio Ziviani
Berthier Ribeiro-Neto
Department of Computer Science
Federal University of Minas Gerais, Brazil
Introduction
Vector space model (VSM)
• Query terms and documents are represented
as weighted vectors in a vector space
• Query answers are documents whose
representative vectors have high similarity to
the query vector
• Term weighting scheme: TF x IDF
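As a minimal sketch of the vector space model described above, the following Python ranks a toy collection by cosine similarity over TF x IDF weights. The three-document collection, the function names, and the particular weighting variant (1 + log tf) x log(N / df) are assumptions made for illustration; the slide only states "TF x IDF".

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Weight each term as (1 + log tf) * log(N / df); unknown terms are skipped."""
    tf = Counter(tokens)
    return {t: (1 + math.log(f)) * math.log(n_docs / df[t])
            for t, f in tf.items() if t in df}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection: each document is a list of index terms.
docs = {"d1": ["A", "C", "T"], "d2": ["C", "D"], "d3": ["C", "D", "T"]}
df = Counter(t for tokens in docs.values() for t in set(tokens))
vectors = {d: tfidf_vector(tokens, df, len(docs)) for d, tokens in docs.items()}

query = tfidf_vector(["A", "T"], df, len(docs))
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
```

Each term contributes to the score independently of the others, which is the independence assumption the next slide questions.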
LATIN - Lab for Treating Information -- Federal University of Minas Gerais, Brazil
Motivation
In VSM, index terms are assumed to be
mutually independent
• Linear weighting function
• Not realistic but easy to compute
Our hypothesis:
Exploration of correlation among index
terms might improve retrieval effectiveness
Our Goal
Propose a new model for computing index
term weights, based on set theory
• Terms → Sets of terms (termsets)
• Correlation among index terms
• High retrieval effectiveness while keeping
computational costs small
Exploit the intuition that related terms
often occur close to each other
Related Work
Correlation among index terms
• Raghavan and Yu (1979)
• Rijsbergen (1977), Harper and Rijsbergen (1978)
• Wong et al. (1985 and 1987)
• Common limitations:
  • Expensive to compute dependency factors
  • Exhaustive application of term co-occurrences hurts
    overall effectiveness and performance
Association rule mining
• Zaki (2000)
Termsets
$T = \{t_1, t_2, \ldots, t_t\}$ is the set of $t$ unique terms
of a collection of documents $D$.
An n-termset $s$ is an ordered set of $n$ terms,
such that $s \subseteq T$.
$ds$ is the frequency of a termset $s$, i.e., the number of
documents of $D$ in which $s$ occurs.
$S$ is the set of $2^t$ unique termsets that may
appear in a document (power set of $T$).
Termsets: Example
D = {d1, d2, d3}
T = {A, C, D, T}
S = {sA, sC, …, sAC, sAD, …, sACDT}

Collection D
d1: A C T
d2: C D
d3: C D T

sA = {A} (1-termset), dsA = 1
sCD = {C,D} (2-termset), dsCD = 2
sCDT = {C,D,T} (3-termset), dsCDT = 1
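
The frequencies in this example can be checked with a short Python sketch. The helper name `termset_frequency` and the dictionary layout are assumptions made for illustration; frequency is taken to be the number of documents that contain every term of the termset, which matches the values on the slide.

```python
from itertools import combinations

# Collection D from the slide: each document is the set of terms it contains.
docs = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def termset_frequency(termset, docs):
    """ds: number of documents that contain every term of the termset."""
    return sum(1 for terms in docs.values() if set(termset) <= terms)

# All non-empty termsets over T = {A, C, D, T}; together with the empty
# set these are the 2^t elements of the power set S.
vocabulary = sorted(set().union(*docs.values()))
all_termsets = [combo
                for n in range(1, len(vocabulary) + 1)
                for combo in combinations(vocabulary, n)]

ds_A = termset_frequency({"A"}, docs)
ds_CD = termset_frequency({"C", "D"}, docs)
ds_CDT = termset_frequency({"C", "D", "T"}, docs)
```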
Termsets: Definitions
Frequent termset
• A termset whose frequency is greater than or equal to
a given minimal frequency.
Closed termset
• A frequent termset that is the largest among the
termsets that occur in the same set of documents:
every subset of it that occurs in exactly those
documents is represented by it.
The use of closed termsets significantly reduces the
number of termsets taken into consideration.
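
To make the two definitions concrete, here is a brute-force Python sketch over a three-document toy collection (d1 = {A, C, T}, d2 = {C, D}, d3 = {C, D, T}). It enumerates every termset, which is only feasible for tiny vocabularies and is not the enumeration strategy the authors use; the function names are hypothetical.

```python
from itertools import combinations

docs = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def doc_list(termset):
    """Set of documents that contain every term of the termset."""
    return frozenset(d for d, terms in docs.items() if termset <= terms)

def frequent_termsets(min_freq):
    """Map each termset with frequency >= min_freq to its document list."""
    vocab = sorted(set().union(*docs.values()))
    result = {}
    for n in range(1, len(vocab) + 1):
        for combo in combinations(vocab, n):
            s = frozenset(combo)
            dl = doc_list(s)
            if len(dl) >= min_freq:
                result[s] = dl
    return result

def closed_termsets(min_freq):
    """Keep frequent termsets that have no proper superset with the same document list."""
    freq = frequent_termsets(min_freq)
    return {s: dl for s, dl in freq.items()
            if not any(s < other and dl == other_dl
                       for other, other_dl in freq.items())}

frequent = frequent_termsets(min_freq=1)
closed = closed_termsets(min_freq=1)
```

With a minimal frequency of 1, this toy collection has 11 frequent termsets but only 5 closed ones ({C}, {C,D}, {C,T}, {A,C,T}, {C,D,T}), which illustrates the reduction the slide refers to.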
Termsets: Example
Collection D
d1: A C T
d2: C D
d3: C D T

[Figure: lattice of termsets rooted at the empty set {}, annotated with document frequencies; the legend marks the empty set, frequent termsets, and closed termsets.]
1-termsets: A: 1, C: 3, D: 2, T: 2
2-termsets: AC: 1, AT: 1, CD: 2, CT: 2, DT: 1
3-termsets: ACT: 1, CDT: 1
Set-Based Model
Documents and queries are described by
sets of closed termsets, instead of terms.
Closed termsets provide all elements of the
TF x IDF scheme.
Computational cost is linear in the number
of documents in the collection.
Set-Based Model: Termset Weights
Extension of a TF x IDF scheme
• $sf_{i,j}$: number of occurrences of $s_i$ in $d_j$
• $ds_i$: number of occurrences of $s_i$ in $D$
• $ids_i$: inverted frequency of occurrence of $s_i$ in $D$

$$w^{*}_{i,j} = sf_{i,j} \times ids_i = (1 + \log sf_{i,j}) \times \log \frac{N}{ds_i}$$

SBM reduces to VSM if only 1-termsets are considered
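
As a small worked example of the termset weight on this slide, the Python sketch below evaluates (1 + log sf) x log(N / ds). The function name is hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math

def termset_weight(sf, ds, n_docs):
    """Termset weight (1 + log sf) * log(N / ds); zero if the termset does not occur."""
    if sf <= 0 or ds <= 0:
        return 0.0
    return (1 + math.log(sf)) * math.log(n_docs / ds)

# A termset that occurs once in a document and in 2 of N = 3 documents.
w_cd = termset_weight(sf=1, ds=2, n_docs=3)

# A termset that occurs in every document carries no discriminating weight.
w_c = termset_weight(sf=1, ds=3, n_docs=3)
```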
Set-Based Model: Similarity Calculation
[Figure: documents d1 and d2 and the query Q drawn as vectors in the space spanned by the termsets sA, sT, and sAT.]
Normalization uses just
terms instead of termsets

$$sim(q, d_j) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{s \in C_q} w^{*}_{s,j} \times w^{*}_{s,q}}{\sqrt{\sum_{i=1}^{t} w^{2}_{i,j}} \times \sqrt{\sum_{i=1}^{t} w^{2}_{i,q}}}$$
Set-Based Model: Query Mechanism
SBM Algorithm:
1. Obtain the 1-termsets from query terms;
2. Enumerate all closed termsets from
1-termsets;
3. Calculate similarities between query and
documents using the closed termsets;
4. Normalize document similarities;
5. Select the k largest document similarities.
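
The five steps above can be sketched in Python on a toy collection. This is an illustration under stated assumptions, not the authors' implementation: closed termsets are found by brute force, the number of occurrences of a termset in a document is assumed to be the minimum count of its terms, each termset is assumed to occur once in the query, and the query norm is omitted because it is the same for every document. All names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy collection: each document is a list of index terms.
docs = {"d1": ["A", "C", "T"], "d2": ["C", "D"], "d3": ["C", "D", "T"]}
N = len(docs)

def doc_list(termset):
    """Documents that contain every term of the termset."""
    return frozenset(d for d, toks in docs.items() if termset <= set(toks))

def weight(sf, ds):
    """(1 + log sf) * log(N / ds); zero when the termset is absent."""
    if sf <= 0 or ds <= 0:
        return 0.0
    return (1 + math.log(sf)) * math.log(N / ds)

def closed_query_termsets(query_terms, min_freq=1):
    """Steps 1-2: build termsets from the query terms, keep the closed ones."""
    terms = sorted(set(query_terms))
    freq = {}
    for n in range(1, len(terms) + 1):
        for combo in combinations(terms, n):
            s = frozenset(combo)
            dl = doc_list(s)
            if len(dl) >= min_freq:
                freq[s] = dl
    return {s: dl for s, dl in freq.items()
            if not any(s < o and dl == odl for o, odl in freq.items())}

def doc_norm(d):
    """Step 4 helper: the norm uses single-term weights, not termsets."""
    toks = docs[d]
    return math.sqrt(sum(
        weight(toks.count(t), len(doc_list(frozenset([t])))) ** 2
        for t in set(toks)))

def search(query_terms, k=10):
    scores = {}
    # Step 3: accumulate termset weight products per document.
    for s, dl in closed_query_termsets(query_terms).items():
        w_q = weight(1, len(dl))  # assumption: termset occurs once in the query
        for d in dl:
            sf = min(docs[d].count(t) for t in s)  # assumption: min term count
            scores[d] = scores.get(d, 0.0) + weight(sf, len(dl)) * w_q
    # Step 4: normalize; step 5: keep the k largest similarities.
    ranked = []
    for d, score in scores.items():
        norm = doc_norm(d)
        ranked.append((score / norm if norm else 0.0, d))
    ranked.sort(reverse=True)
    return [(d, score) for score, d in ranked[:k]]

results = search(["A", "T"], k=2)
```

For the query {A, T}, the closed termsets are {T} and {A,T}; d1 contains both and ranks first, d3 contains only {T}, and d2 is never scored.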
Experimental Results
Reference Collection    CFC      WSJ        TReC-3
# Documents             1,240    173,252    1,078,166
# Distinct Terms        2,105    230,902    1,016,709
# Queries               100      300        300
Avg. Query Size         3.82     18.88      22.43
Size (MB)               1.9      509        3,225
TReC-3: Recall x Precision
Average Precision
              Average Precision (%)       SBM Gain (%)
Collection    VSM      GVSM     SBM       over VSM   over GVSM
CFC           22.42    24.47    26.56     18.47      8.54
WSJ           31.76    34.27    41.78     31.55      21.91
TReC-3        32.58    *        44.59     36.86      *

* GVSM could not be evaluated for the TReC-3 collection
due to the exponential cost of the min-term build phase
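
The gain columns are consistent with a relative improvement of SBM over each baseline, (SBM - baseline) / baseline. The slide does not state this formula, so the Python check below treats it as an assumption; the function name is hypothetical.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent, rounded to 2 decimals."""
    return round(100 * (new - baseline) / baseline, 2)

# Average precision values copied from the table above.
gain_cfc_vsm = relative_gain(26.56, 22.42)
gain_cfc_gvsm = relative_gain(26.56, 24.47)
gain_wsj_vsm = relative_gain(41.78, 31.76)
gain_trec_vsm = relative_gain(44.59, 32.58)
```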
Average Precision at 10
              Average Precision at 10 (%)   SBM Gain (%)
Collection    VSM      GVSM     SBM         over VSM   over GVSM
CFC           10.97    12.93    16.02       46.03      23.90
WSJ           12.71    16.58    19.17       50.82      15.62
TReC-3        13.66    *        21.42       56.80      *

* GVSM could not be evaluated for the TReC-3 collection
due to the exponential cost of the min-term build phase
Computational Efficiency
              Avg. Response Time (s)        Increase (%)
Collection    VSM      GVSM     SBM         GVSM     SBM
CFC           0.0023   0.0056   0.0025      243.5    8.7
WSJ           0.4286   2.0143   0.6296      469.9    46.9
TReC-3        1.2732   *        2.2930      *        80.1

* GVSM could not be evaluated for the TReC-3 collection due to
the exponential cost of the min-term build phase
Conclusions and Future Work
SBM exploits correlations among index terms,
improving retrieval effectiveness while keeping
computational costs small.
Future work:
Investigate behavior of SBM when applied
to larger collections.
Extend SBM to take into account the
proximity information of index terms.
Termsets: Complexity
Time Complexity:
• Worst case: $O(2^{|q|} \cdot N)$
• Avg. case: $O(c \cdot N)$
Space Complexity, worst case: $O(r \cdot 2^{l} \cdot N)$
• $|q|$ = query size,
• $c$ = number of closed termsets,
• $N$ = number of documents,
• $r$ = number of maximal termsets,
• $l$ = length of the largest termset.
TReC-3: Number of Closed Termsets
Collection    Worst Case      Average Case
CFC           14.12           3.14
WSJ           456,419.21      3,217.28
TReC-3        5,650,707.18    4,081.25
The average case scenario is significantly
smaller than the worst case scenario.
TReC-3: Minimal Frequency
[Figure: average precision (%) and average response time (s) plotted against the minimal frequency (# docs, 0 to 30).]
Trade-off between precision, the number of termsets taken
into consideration, and performance
Termsets: Enumeration
An incremental algorithm that employs a
very powerful pruning strategy.
1. Enumeration of (n+1)-termsets from n-termsets:
take the union of every pair (si, sj) of n-termsets that share the same prefix.
2. Evaluation of whether a frequent termset ‘s’ being
verified is closed:
check whether the current termsets have ‘s’ as their closure;
those that do are discarded.
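
Step 1 can be sketched in Python as a level-wise join of termsets that share a prefix, with document lists intersected as termsets grow. This is a simplified illustration: infrequent candidates are pruned as they are generated, but the closure check is a plain post-filter rather than the incremental pruning described above, and all names are hypothetical.

```python
def enumerate_frequent(term_doc_lists, min_freq=1):
    """Level-wise enumeration: join n-termsets sharing their first n-1 terms."""
    level = {(t,): frozenset(dl)
             for t, dl in sorted(term_doc_lists.items())
             if len(dl) >= min_freq}
    found = dict(level)
    while level:
        next_level = {}
        keys = sorted(level)
        for i, si in enumerate(keys):
            for sj in keys[i + 1:]:
                if si[:-1] != sj[:-1]:
                    break  # keys are sorted, so no later key shares the prefix
                dl = level[si] & level[sj]  # documents containing both termsets
                if len(dl) >= min_freq:
                    next_level[si + (sj[-1],)] = dl
        found.update(next_level)
        level = next_level
    return found

def closed_only(found):
    """Drop termsets that have a proper superset with the same document list."""
    return {s: dl for s, dl in found.items()
            if not any(set(s) < set(o) and dl == odl
                       for o, odl in found.items())}

# Document lists of the 1-termsets for d1 = {A,C,T}, d2 = {C,D}, d3 = {C,D,T}.
term_doc_lists = {"A": {"d1"}, "C": {"d1", "d2", "d3"},
                  "D": {"d2", "d3"}, "T": {"d1", "d3"}}
found = enumerate_frequent(term_doc_lists)
closed = closed_only(found)
```

Joining only termsets with a common prefix generates each candidate once, and a candidate whose document list is too small is never extended further.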
Termsets: Example
Collection D
d1: A C T
d2: C D
d3: C D T

1-termsets: lsA = {d1}, lsC = {d1,d2,d3}, lsD = {d2,d3}, lsT = {d1,d3}
2-termsets: lsAC = {d1}, lsAT = {d1}
3-termsets: lsACT = {d1}
Closed termset: lsACT = {d1}