Construction of Index: (Page 197)
• Objective: Given a document, find the
number of occurrences of each word in the
document.
• Example: Computer Science students know
computers and computer languages.
• Keywords: computer, computers, science,
students, know, and, languages.
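As a rough illustration of how the keywords above could be obtained from the example sentence, here is a minimal sketch. It assumes that a word is a maximal run of letters and that case is ignored; the function name `extract_keywords` is hypothetical and not part of the original slides.

```python
import re

def extract_keywords(text):
    """Return the distinct lowercase words of text, in order of first appearance."""
    seen = set()
    keywords = []
    # Assumption: a word is a maximal run of the letters a-z after lowercasing.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords

example = "Computer Science students know computers and computer languages."
keywords = extract_keywords(example)
print(keywords)
```

The result contains the same seven keywords as the slide, listed in order of first appearance rather than in the slide's order.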
Linear-time algorithm:
• Let T be the text and |T| the length of T. We can
count the occurrences of every word in T in
O(|T|) time.
Constructing an automaton:
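[Figure: an automaton whose edges are labeled with single letters. The paths from the start state spell the keywords computer, computers, science, students, know, and, languages.]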
Remarks:
• There is a final state for each word.
• Each final state has a counter storing the
number of times that final state has been
reached.
• While reading, the algorithm creates new states for
each new word.
• For words that have been seen before, we just go through
the corresponding states.
• When a final state is reached, add 1 to its counter.
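The remarks above can be sketched in code as follows. This is only one possible reading of the slides, under the assumptions that words are runs of letters, that case is ignored, and that any other character ends a word. The names `State`, `build_index`, and `word_counts` are hypothetical.

```python
class State:
    """One automaton state: letter transitions plus an occurrence counter."""
    def __init__(self):
        self.next = {}         # letter -> State
        self.is_final = False  # True if some word ends in this state
        self.count = 0         # number of times this final state was reached

def build_index(text):
    """Read the text once, creating states for new words and counting each word."""
    start = State()
    state = start
    # A trailing space makes sure the last word is counted as well.
    for ch in text.lower() + " ":
        if ch.isalpha():
            nxt = state.next.get(ch)
            if nxt is None:
                # New word (or new suffix): create a new state.
                nxt = state.next[ch] = State()
            state = nxt
        else:
            # A non-letter ends the current word, if there is one.
            if state is not start:
                state.is_final = True
                state.count += 1
            state = start
    return start

def word_counts(state, prefix=""):
    """Collect {word: count} from every final state reachable from state."""
    result = {}
    if state.is_final:
        result[prefix] = state.count
    for letter, child in state.next.items():
        result.update(word_counts(child, prefix + letter))
    return result

index = build_index("Computer Science students know computers and computer languages.")
counts = word_counts(index)
print(counts)
```

Each character of the text triggers one dictionary lookup and at most one state creation, which is consistent with the O(|T|) bound, assuming constant-time dictionary operations.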
Assignment one
(due in week 6 on Friday, 7:30 pm)
• Write a program to convert a text into a
vector such that each element of the vector
is the number of occurrences of the
corresponding keyword.
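To make the expected output concrete, here is a minimal sketch of the text-to-vector conversion. It uses Python's `collections.Counter` instead of the automaton and assumes that the keyword list is given in a fixed order; the function name `text_to_vector` is hypothetical.

```python
import re
from collections import Counter

def text_to_vector(text, keywords):
    """Return a list whose i-th element is the number of occurrences of keywords[i]."""
    # Assumption: words are runs of the letters a-z, compared case-insensitively.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for keywords that never occur in the text.
    return [counts[k] for k in keywords]

keywords = ["computer", "computers", "science", "students", "know", "and", "languages"]
vector = text_to_vector(
    "Computer Science students know computers and computer languages.", keywords
)
print(vector)  # [2, 1, 1, 1, 1, 1, 1]
```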