Lecture 4: Indexing Files Inverted File Lexical Analysis

CS5286 Search Engine Technology and Algorithms/Xiaotie Deng
Lecture 4:
Indexing Files



Inverted File
Lexical Analysis
Stop lists
Indexing


Arrangement of data (data structure) to
permit fast searching
Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
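The sorted list is easier to search because a binary search can discard half of the remaining candidates at each step, while the unsorted list needs a full scan. A minimal Python sketch of the contrast, using the standard-library bisect module; the function names are illustrative and not from the lecture:

```python
from bisect import bisect_left

unsorted_words = "sow fox pig eel yak hen ant cat dog hog".split()
sorted_words = "ant cat dog eel fox hen hog pig sow yak".split()

def contains_unsorted(words, target):
    """Linear scan: up to len(words) comparisons."""
    for word in words:
        if word == target:
            return True
    return False

def contains_sorted(words, target):
    """Binary search: about log2(len(words)) comparisons; input must be sorted."""
    i = bisect_left(words, target)
    return i < len(words) and words[i] == target
```

Both functions give the same answers on these lists; only the number of comparisons differs.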
Creating inverted files
[Diagram: Original Documents -> Word Extraction -> Word IDs and Document IDs -> Inverted Files]
Inverted Files:
W1: d1,d2,d3
W2: d2,d4,d7,d9
Wn: di,…dn
Creating Inverted file


Map the file names to file IDs
Consider the following Original Documents
D1
The Department of Computer Science was established in 1984.
D2
The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3
followed by the MSc in Computer Science which was started in 1991.
D4
The Department also produced its first PhD graduate in 1994.
D5
Our staff have contributed intellectually and professionally to the
advancements in these fields.
Creating Inverted file
Stop words are shown in [brackets] (red on the original slide)
D1
[The] Department [of] Computer Science [was] established [in] 1984.
D2
[The] Department launched [its] [first] BSc(Hons) [in] Computer Studies [in] 1987.
D3
followed [by] [the] MSc [in] Computer Science [which] [was] started [in] 1991.
D4
[The] Department [also] produced [its] [first] PhD graduate [in] 1994.
D5
[Our] staff [have] contributed intellectually [and] professionally [to] [the] advancements [in] [these] fields.
Creating Inverted file
After stemming, lowercasing (optional), and deleting numbers (optional)
D1
depart comput scienc establish
D2
depart launch bsc hons comput studi
D3
follow msc comput scienc start
D4
depart produc phd graduat
D5
staff contribut intellectu profession advanc field
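The steps that lead to the term lists above, except stemming, can be sketched as follows. This is a rough sketch under stated assumptions: the STOP_WORDS set is inferred from the words that disappear in this example (it is not the lecture's actual stoplist), the function name preprocess is hypothetical, and stemming is left out, so the output is "department" rather than "depart".

```python
import re

# Assumed stoplist, inferred from the five example documents only.
STOP_WORDS = {
    "the", "of", "was", "in", "its", "first", "by", "which",
    "also", "our", "have", "and", "to", "these",
}

def preprocess(text):
    """Lowercase, extract alphanumeric tokens, drop stop words and numbers."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

d1 = "The Department of Computer Science was established in 1984."
d2 = "The Department launched its first BSc(Hons) in Computer Studies in 1987."
```

For D1 this yields department, computer, science, established; a stemmer applied afterwards would reduce these to the forms listed on the slide.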
Creating Inverted file (unsorted)
Words        Documents    Words        Documents
depart       d1,d2,d4     produc       d4
comput       d1,d2,d3     phd          d4
scienc       d1,d3        graduat      d4
establish    d1           staff        d5
launch       d2           contribut    d5
bsc          d2           intellectu   d5
hons         d2           profession   d5
studi        d2           advanc       d5
follow       d3           field        d5
msc          d3
start        d3
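The unsorted inverted file can be built in one pass over the stemmed documents. A minimal sketch, assuming the stemmed term lists from the previous slide as input; the names STEMMED_DOCS and build_inverted_file are illustrative:

```python
# Stemmed documents from the example, keyed by document ID.
STEMMED_DOCS = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}

def build_inverted_file(docs):
    """Map each term to the list of document IDs that contain it."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in text.split():
            postings = inverted.setdefault(term, [])
            if doc_id not in postings:  # record each document once per term
                postings.append(doc_id)
    return inverted

inverted_file = build_inverted_file(STEMMED_DOCS)
```

Here inverted_file["depart"] is ["d1", "d2", "d4"], and the terms appear in first-seen order, which is the unsorted order of the table.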
Creating Inverted file (sorted)
Words        Documents    Words        Documents
advanc       d5           msc          d3
bsc          d2           phd          d4
comput       d1,d2,d3     produc       d4
contribut    d5           profession   d5
depart       d1,d2,d4     scienc       d1,d3
establish    d1           staff        d5
field        d5           start        d3
follow       d3           studi        d2
graduat      d4
hons         d2
intellectu   d5
launch       d2
Searching on Inverted File
Binary Search
  Used at small scale
Create a thesaurus, combining techniques such as:
  Hashing
  B+tree
  Pointer to the address in the indexed file
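Two of the lookup options above can be sketched on the example index: binary search over the sorted term list, and hashing via a Python dict. This is only an illustration under assumptions; the names below are hypothetical, postings are kept in memory rather than as pointers into an indexed file, and a B+tree is not shown.

```python
from bisect import bisect_left

# Sorted inverted file from the example: (term, postings) pairs ordered by term.
SORTED_INVERTED_FILE = [
    ("advanc", ["d5"]), ("bsc", ["d2"]), ("comput", ["d1", "d2", "d3"]),
    ("contribut", ["d5"]), ("depart", ["d1", "d2", "d4"]), ("establish", ["d1"]),
    ("field", ["d5"]), ("follow", ["d3"]), ("graduat", ["d4"]), ("hons", ["d2"]),
    ("intellectu", ["d5"]), ("launch", ["d2"]), ("msc", ["d3"]), ("phd", ["d4"]),
    ("produc", ["d4"]), ("profession", ["d5"]), ("scienc", ["d1", "d3"]),
    ("staff", ["d5"]), ("start", ["d3"]), ("studi", ["d2"]),
]
TERMS = [term for term, _ in SORTED_INVERTED_FILE]

def lookup_binary(term):
    """Binary search over the sorted terms; returns the postings or []."""
    i = bisect_left(TERMS, term)
    if i < len(TERMS) and TERMS[i] == term:
        return SORTED_INVERTED_FILE[i][1]
    return []

# Hashing: a dict gives expected constant-time lookup, no sort order needed.
HASHED_INVERTED_FILE = dict(SORTED_INVERTED_FILE)

def lookup_hashed(term):
    return HASHED_INVERTED_FILE.get(term, [])
```

Both lookups return the same postings; the sorted layout additionally supports range and prefix scans, which a plain hash table does not.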
Lexical Analysis for indexing
Word extraction
  Spaces as English word boundaries
  Chinese word segmentation
Stop words elimination
  “a”, “an”, “the”, “about”, “etc”, “every”, “you”, etc.
Word stemming
Lexical Analysis


Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens
Lexical analysis is the first stage of:
  Automatic indexing
  Query processing
Lexical Analysis for Automatic Indexing
What counts as a word or token in the indexing scheme? (an easy problem?)
Digits
  “Year 2000”, “Y2K”
Hyphens
  “F-16”, “MS-DOS”
Other Punctuation
  “COMMAND.COM”, “max_size” (often in C code)
Case
  IBM or ibm
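Each of these decisions can be exposed as a switch in the tokenizer. The sketch below is one possible design, not the lecture's analyzer: the function tokenize and its flags split_hyphens, fold_case and keep_digits are hypothetical names, and the rule that "-", "." and "_" may join alphanumeric runs is an assumption.

```python
import re

# A token is a run of letters/digits, optionally joined by '-', '.' or '_'.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-._][A-Za-z0-9]+)*")

def tokenize(text, split_hyphens=False, fold_case=True, keep_digits=True):
    if fold_case:
        text = text.lower()            # "IBM" and "ibm" become one term
    if split_hyphens:
        text = text.replace("-", " ")  # "MS-DOS" becomes "MS" and "DOS"
    tokens = TOKEN_RE.findall(text)
    if not keep_digits:
        tokens = [t for t in tokens if not t.isdigit()]  # drop pure numbers
    return tokens
```

With the defaults, "F-16 MS-DOS" gives f-16 and ms-dos; with split_hyphens=True it gives f, 16, ms, dos. Mixed tokens such as "Y2K" survive keep_digits=False because only all-digit tokens are removed.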
Lexical Analysis for Automatic Indexing (cont.)
No technical difficulty in solving any of these problems
Must think about them carefully
Tradeoff between recall and precision
  Breaking up hyphenated terms increases recall but decreases precision
  Preserving case distinctions enhances precision but decreases recall
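The hyphen tradeoff can be made concrete on a toy collection. Everything in this sketch is hypothetical: the three documents, the relevance judgments for the query "MS-DOS", and the helper names are invented for illustration. Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved.

```python
# Hypothetical toy collection and relevance judgments for the query "MS-DOS".
DOCS = {
    1: "MS-DOS command reference",
    2: "DOS attack on MS Exchange servers",
    3: "MS DOS boot disk",
}
RELEVANT = {1, 3}

def terms(text, split_hyphens):
    text = text.lower()
    if split_hyphens:
        text = text.replace("-", " ")
    return set(text.split())

def search(query, split_hyphens):
    """Return IDs of documents containing every query term (Boolean AND)."""
    q = terms(query, split_hyphens)
    return {d for d, text in DOCS.items() if q <= terms(text, split_hyphens)}

def precision_recall(retrieved):
    hits = len(retrieved & RELEVANT)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(RELEVANT)
    return precision, recall
```

Keeping the hyphen retrieves only document 1 (precision 1.0, recall 0.5). Splitting it retrieves all three documents (precision 2/3, recall 1.0), since document 2 also contains both "ms" and "dos".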
Lexical Analysis for Query Processing
Depends on the design of the lexical analyzer for automatic indexing
Distinguish operators (Boolean operators, weighting function operators, etc.)
Process certain characters:
  Control characters
    “” for phrase search, {} for priority
  Disallowed punctuation characters (error)
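A query lexer along these lines can be sketched with a single regular expression. The concrete syntax here is assumed rather than taken from the lecture: uppercase AND, OR and NOT as the Boolean operators, straight double quotes for phrases, braces for priority, and any other punctuation treated as an error. The name lex_query and the token kinds are hypothetical, and weighting operators are not covered.

```python
import re

TOKEN_SPEC = [
    ("PHRASE", r'"[^"]*"'),                      # "..." marks a phrase search
    ("LBRACE", r"\{"),                           # { ... } marks priority
    ("RBRACE", r"\}"),
    ("OPERATOR", r"\b(?:AND|OR|NOT)\b"),         # assumed Boolean operators
    ("TERM", r"[A-Za-z0-9]+(?:[-._][A-Za-z0-9]+)*"),
    ("SPACE", r"\s+"),
    ("ERROR", r"."),                             # anything else is disallowed
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex_query(query):
    """Turn a query string into a list of (kind, text) tokens."""
    tokens = []
    for m in TOKEN_RE.finditer(query):
        kind, text = m.lastgroup, m.group()
        if kind == "SPACE":
            continue
        if kind == "ERROR":
            raise ValueError(f"disallowed character {text!r} at position {m.start()}")
        if kind == "PHRASE":
            text = text[1:-1]      # strip the surrounding quotes
        elif kind == "TERM":
            text = text.lower()    # match the case folding used when indexing
        tokens.append((kind, text))
    return tokens
```

Terms are lowercased here on the assumption that the indexing analyzer folds case; if the index preserves case, the query lexer has to preserve it too.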
STOPLISTS
Many of the most frequently occurring words in English (“the”, “of”, etc.) are worthless as index terms
Eliminating such words
  Speeds processing
  Saves huge amounts of space in indexes
  Does not damage retrieval effectiveness
Stoplists are used to eliminate such words. E.g.,
  http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
  http://bll.epnet.com/help/ehost/Stop_Words.htm
  http://www.syger.com/jsc/docs/stopwords/english.htm
STOPLISTS
The choice of words in a stop list may vary from person to person.
The general idea is to find words that occur so often that they are not good terms for information retrieval.
How can the vector space model be used to find a list of stop words?
How to find stop words in Chinese?
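One possible answer to the vector space question, offered as a sketch rather than the lecture's intended solution: in the vector space model a term is commonly weighted by its inverse document frequency, idf = log(N / df), so a term that occurs in nearly every document has idf close to zero and is a stop word candidate. The threshold max_idf and the function names are assumptions; the documents are the five from the earlier example.

```python
import math
import re

DOCS = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally to the "
    "advancements in these fields.",
]

def idf_table(docs):
    """idf(t) = log(N / df(t)), where df(t) counts documents containing t."""
    n = len(docs)
    df = {}
    for text in docs:
        for term in set(re.findall(r"[a-z]+", text.lower())):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}

def stop_word_candidates(docs, max_idf=0.25):
    """Terms whose idf is at most max_idf (an assumed threshold)."""
    return sorted(t for t, idf in idf_table(docs).items() if idf <= max_idf)
```

On this tiny collection only "in" and "the" occur in all five documents, so they are the only candidates; a realistic stoplist would need a much larger corpus and a manually reviewed threshold.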