Construction of Index:

Construction of Index: (Page 197)
• Objective: Given a document, find the
number of occurrences of each word in the
document.
• Example: Computer Science students
know computers and computer languages.
• Keywords: computer, computers, science,
students, know, and, languages.
Linear time algorithm:
• Let T be the text, |T| the length of T. We
can find the occurrences of each word in T
in O(|T|) time.
Constructing an automaton:
[Figure: automaton whose transitions spell the keywords letter by letter: computer(s), science, students, know, and, languages.]
Remarks:
• There is a final state for each word.
• Each final state has a counter storing the
number of times that final state has been
reached.
• While reading, the algorithm creates new states
for each new word.
• For words that have been seen before, we just go
through the corresponding states.
• When a final state is reached, add 1 to its counter.
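The remarks above can be sketched in Python as a trie-like automaton. This is only an illustrative sketch, not the textbook's implementation: the `State` class, the function names, and the choice to lowercase the text and split on non-letters are assumptions made for the example.

```python
import re


class State:
    """One automaton state; `count` matters only when the state is final."""

    def __init__(self):
        self.next = {}      # outgoing transitions: character -> State
        self.final = False  # True if some word ends in this state
        self.count = 0      # number of times this final state was reached


def build_index(text):
    """Read the text once, growing the automaton and counting each word."""
    root = State()
    # Assumption: words are runs of letters, compared case-insensitively.
    for word in re.findall(r"[a-z]+", text.lower()):
        state = root
        for ch in word:
            # Create a state only for a prefix that has not been seen before.
            if ch not in state.next:
                state.next[ch] = State()
            state = state.next[ch]
        state.final = True
        state.count += 1
    return root


def counts(state, prefix=""):
    """Collect {word: occurrences} by walking the automaton."""
    result = {prefix: state.count} if state.final else {}
    for ch, child in state.next.items():
        result.update(counts(child, prefix + ch))
    return result


index = build_index(
    "Computer Science students know computers and computer languages."
)
print(counts(index))
```

Each character of the text triggers one dictionary lookup and at most one state creation, which is where the O(|T|) bound comes from.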
Extended Boolean Model:
Disadvantages of the "Boolean Model":
• No term weight is used.
Counterexample: query q = kx AND ky.
A document containing just one term, e.g. kx, is
considered as irrelevant as a document
containing neither of these terms.
• The size of the output might be too large or
too small.
Extended Boolean Model:
• The Extended Boolean model was introduced
in 1983 by Salton, Fox, and Wu [703].
• The idea is to make use of term weights, as in
the vector space model.
• Strategy: combine Boolean queries with the
vector space model.
• Why not just use the vector space model?
• Advantage: it is easy for the user to provide a query.
Extended Boolean Model:
• Each document is represented by a vector
(similar to vector space model.)
$$w_{x,j} = f_{x,j} \cdot \frac{idf_x}{\max_i idf_i}$$
• Remember the formula.
• Query is in terms of Boolean formula.
• How to rank the documents?
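As a small illustration of the weight formula, the sketch below computes the weight vector of one document. It assumes that the frequencies f are already normalized to [0, 1], so that every weight also lies in [0, 1]; the function name and all numeric values are made up for the example.

```python
def term_weights(freqs, idfs):
    """Return w[x] = f[x] * idf[x] / max_i idf[i] for one document."""
    max_idf = max(idfs)
    return [f * idf / max_idf for f, idf in zip(freqs, idfs)]


# Hypothetical values: three terms, frequencies assumed normalized to [0, 1].
freqs = [1.0, 0.5, 0.0]
idfs = [2.0, 4.0, 1.0]
weights = term_weights(freqs, idfs)
print(weights)
```

Dividing by the largest idf keeps each weight between 0 and 1, so a document can be drawn as a point in the unit square (or unit hypercube).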
Fig. Extended Boolean logic considering the space
composed of two terms kx and ky only.
[Figure: two panels, "kx or ky" and "kx and ky". Each plots documents dj and dj+1 in the unit square with axes kx and ky and corners (0,0), (1,0), (0,1), (1,1).]
Extended Boolean Model:
• For query q = kx or ky, (0,0) is the point we try
to avoid. Thus, we can use
$$sim(q_{or}, d) = \sqrt{\frac{x^2 + y^2}{2}}$$
to rank the documents, where x and y are the
weights of kx and ky in document d.
• The bigger the better.
Extended Boolean Model:
• For query q = kx and ky, (1,1) is the most
desirable point.
• We use
$$sim(q_{and}, d) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$$
to rank the documents.
• The bigger the better.
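The two-term ranking functions for "kx or ky" and "kx and ky" can be tried out with a few lines of Python. This is a minimal sketch: the function names and the sample points are chosen for illustration, with x and y taken to be the weights of kx and ky in a document.

```python
import math


def sim_or(x, y):
    """Normalized distance from (0, 0), the point an OR query avoids."""
    return math.sqrt((x ** 2 + y ** 2) / 2)


def sim_and(x, y):
    """One minus the normalized distance from (1, 1), the ideal AND point."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)


# Hypothetical documents given as (weight of kx, weight of ky).
for x, y in [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (1.0, 1.0)]:
    print((x, y), round(sim_or(x, y), 3), round(sim_and(x, y), 3))
```

Note that the document (1, 0), which contains only kx, gets a non-zero score for the AND query, unlike in the plain Boolean model.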
Extend the idea to m terms
• qor = k1 ∨^p k2 ∨^p … ∨^p km
$$sim(q_{or}, d_j) = \left(\frac{x_1^p + x_2^p + \dots + x_m^p}{m}\right)^{1/p}$$
• qand = k1 ∧^p k2 ∧^p … ∧^p km
$$sim(q_{and}, d_j) = 1 - \left(\frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m}\right)^{1/p}$$
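The m-term formulas translate almost literally into code. The sketch below is illustrative only: the names `pnorm_or` and `pnorm_and` and the sample document are assumptions, and `xs` stands for the weights x1, ..., xm of the query terms in a document.

```python
def pnorm_or(xs, p):
    """sim(q_or, d) = ((x1^p + ... + xm^p) / m)^(1/p)."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)


def pnorm_and(xs, p):
    """sim(q_and, d) = 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)


# Hypothetical document with three query-term weights.
doc = [0.8, 0.1, 0.0]
for p in (1, 2, 5):
    print(p, round(pnorm_or(doc, p), 4), round(pnorm_and(doc, p), 4))
```

With m = 2 and p = 2 these reduce to the two-term distance formulas; larger values of p move the OR score up and the AND score down.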
Properties:
• The p-norm as defined above enjoys a couple of
interesting properties. First, when p = 1
it can be verified that
$$sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \dots + x_m}{m}$$
• Second, when p = ∞ it can be verified that
• sim(qor, dj) = max(xi)
• sim(qand, dj) = min(xi)
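Both properties can be checked from the m-term definitions. A short verification, with x_1, ..., x_m denoting the term weights of d_j:

$$
\begin{aligned}
p = 1:\quad & sim(q_{and}, d_j) = 1 - \frac{(1-x_1) + \dots + (1-x_m)}{m} = 1 - \frac{m - (x_1 + \dots + x_m)}{m} = \frac{x_1 + \dots + x_m}{m} = sim(q_{or}, d_j) \\
p \to \infty:\quad & \frac{\max_i x_i}{m^{1/p}} \le \left(\frac{x_1^p + \dots + x_m^p}{m}\right)^{1/p} \le \max_i x_i, \qquad m^{1/p} \to 1
\end{aligned}
$$

So sim(qor, dj) tends to max(xi). Applying the same bound to the values 1 - xi gives sim(qand, dj) → 1 - max(1 - xi) = min(xi).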
Example:
• For instance, consider the query q = (k1 ∧^p k2) ∨^p k3.
The similarity sim(q, dj) between a document dj
and this query is then computed as
$$sim(q, d_j) = \left(\frac{\left(1 - \left(\frac{(1-x_1)^p + (1-x_2)^p}{2}\right)^{1/p}\right)^p + x_3^p}{2}\right)^{1/p}$$
• Any Boolean query can be expressed as a numerical
formula.
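One way to see how a nested query turns into a numerical formula is to evaluate it recursively, innermost operator first. The sketch below is an illustration under stated assumptions: the tuple encoding of queries, the function name `sim`, and the sample weights are hypothetical, and only "and" and "or" are handled.

```python
def sim(query, weights, p=2.0):
    """Evaluate a nested and/or query against a dict of term weights."""
    if isinstance(query, str):  # leaf: a term name such as "k1"
        return weights[query]
    op, *args = query
    values = [sim(arg, weights, p) for arg in args]
    m = len(values)
    if op == "or":
        return (sum(v ** p for v in values) / m) ** (1 / p)
    if op == "and":
        return 1 - (sum((1 - v) ** p for v in values) / m) ** (1 / p)
    raise ValueError(f"unknown operator: {op}")


# q = (k1 and k2) or k3, with hypothetical weights for one document.
q = ("or", ("and", "k1", "k2"), "k3")
d = {"k1": 0.8, "k2": 0.1, "k3": 0.0}
print(sim(q, d, p=2))
```

The inner "and" is reduced to a single number first, and that number is then treated like an ordinary term weight by the outer "or".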
Exercise:
1. Give the numerical formula for the extended
Boolean model of the query
q = (k1 or k2 or k3) and (not k4 or k5). (Assume that
there are 5 terms in total.)
2. Assume that the document is represented by
the vector (0.8, 0.1, 0.0, 0.0, 1.0).
What is sim(q, d) for the extended Boolean model?
Also try more exercises with other Boolean
formulas.