CS 430 / INFO 430 Information Retrieval Searching Full Text 5 Lecture 5

advertisement
CS 430 / INFO 430
Information Retrieval
Lecture 5
Searching Full Text 5
1
Course Administration
2
CS 430 / INFO 430
Information Retrieval
Completion of Lecture 4
3
Word List
On disk
If a word list is held on disk, search time is dominated by the
number of disk accesses.
In memory
Suppose that a word list has 1,000,000 distinct terms.
Each index entry consists of the term, some basic statistics and
a pointer to the inverted list, average 100 characters.
Size of index is 100 megabytes, which can easily be held in
memory of a dedicated computer.
4
File Structures for Inverted Files:
Linear Index
Advantages
Can be searched quickly, e.g., by binary search, O(log n)
Good for lexicographic processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages
Index must be rebuilt if an extra term is added
5
File Structures for Inverted Files:
Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
elk
bee
ant
hog
cat
fox
dog
6
gnu
File Structures for Inverted Files:
Binary Tree
Advantages
Can be searched quickly
Convenient for batch updating
Easy to add an extra term
Economical use of storage
Disadvantages
Less good for lexicographic processing, e.g., comp*
Tree tends to become unbalanced
If the index is held on disk, important to optimize
the number of disk accesses
7
File Structures for Inverted Files:
Binary Tree
Calculation of maximum depth of tree.
Worst case: depth = n
O(n)
Ideal case: depth = log(n + 1)/log 2
O(log n)
Illustrates importance of balanced trees.
8
File Structures for Inverted Files:
Right Threaded Binary Tree
Threaded tree:
A binary search tree in which each node uses an
otherwise-empty left child link to refer to the node's inorder predecessor and an empty right child link to refer
to its in-order successor.
Right-threaded tree:
A variant of a threaded tree in which only the right
thread, i.e. link to the successor, of each node is
maintained. Can be used for lexicographic processing.
A good data structure when held in memory
Knuth vol 1, 2.3.1, page 325.
9
File Structures for Inverted Files:
Right Threaded Binary Tree
dog
gnu
bee
ant
cat
hog
elk
fox
10
NULL
File Structures for Inverted Files:
B-trees
B-tree of order m:
A balanced, multiway search tree:
• Each node stores many keys
• Root has between 2 and 2m keys.
All other internal nodes have between m and 2m keys.
• If ki is the ith key in a given internal node
-> all keys in the (i-1)th child are smaller than ki
-> all keys in the ith child are bigger than ki
• All leaves are at the same depth
11
File Structures for Inverted Files:
B-trees
B-tree example (order 2)
50 65
55 59
10 19 35
36 47
1 5 8 9
12 14 18
21 24 28
70 90 98
66 68
91 95 97
72 73
Every arrow points to a node containing between 2 and 4 keys.
A node with k keys has k + 1 pointers.
12
File Structures for Inverted Files:
B+-tree
Example: B+-tree of order 2, bucket size 4
• A B-tree is used as an index
• Data is stored in the leaves of the tree, known as buckets
50 65
10 25
... D9
55 59
D51 ... D54
70 81 90
D66...
(Implementation of B+-trees is covered in CS 432.)
13
D81 ...
CS 430 / INFO 430
Information Retrieval
Lecture 5
Searching Full Text 5
14
SMART System
An experimental system for automatic information retrieval
•
automatic indexing to assign terms to documents and queries
•
collect related documents into common subject classes
•
identify documents to be retrieved by calculating
similarities between documents and queries
•
procedures for producing an improved search query
based on information obtained from earlier searches
Gerald Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988
15
Indexing Subsystem
Documents
text
assign document IDs
break into tokens
tokens
*Indicates
optional
operation.
documents
stop list*
non-stoplist
stemming*
tokens
stemmed
terms
term weighting*
terms with
weights
16
document
numbers
and *field
numbers
Index
database
Search Subsystem
query parse query
ranked
document set
query tokens
stop list*
non-stoplist
tokens
ranking*
stemming*
*Indicates
optional
operation.
17
Boolean
retrieved operations*
document set
relevant
document set
stemmed
terms
Index
database
Decisions in Building the
Word List: What is a Term?
18
•
Underlying character set, e.g., printable ASCII,
Unicode, UTF8.
•
Is there a controlled vocabulary? If so, what
words are included?
•
List of stopwords.
•
Rules to decide the beginning and end of words, e.g.,
spaces or punctuation.
•
Character sequences not to be indexed, e.g.,
sequences of numbers.
Lexical Analysis: Term
What is a term?
Free text indexing
A term is a group of characters, extracted from the
input string, that has some collective significance,
e.g., a complete word.
Usually, terms are strings of letters, digits or other
specified characters, separated by punctuation,
spaces, etc.
19
Oxford English Dictionary
20
Lexical Analysis: Choices
Punctuation: In technical contexts, punctuation may be used
as a character within a term, e.g., wordlist.txt.
Case: Case of letters is usually not significant.
Hyphens:
(a) Treat as separators: state-of-art is treated as state of art.
(b) Ignore: on-line is treated as online.
(c) Retain: Knuth-Morris-Pratt Algorithm is unchanged.
Digits: Most numbers do not make good terms, but some are
parts of proper nouns or technical terms: CS430, Opus 22.
21
Lexical Analysis: Choices
The modern tendency, for free text searching, is to map
upper and lower case letters together in index terms, but
otherwise to minimize the changes made at the lexical
analysis stage.
22
Lexical Analysis Example: Query
Analyzer
A term is a letter followed by a sequence of letters and digits.
Upper case letters are mapped into the lower case equivalents.
The following characters have significance as operators:
(
23
) &
|
Lexical Analysis: Transition Diagram
letter, digit
1
space
2
letter
(
3
)
&
0
4
|
5
other
6
end-of-string
24
7
Lexical Analysis: Transition Table
State space letter
0
1
0
1
1
1
(
)
&
| other end-of digit
string
2
1
3
1
4
1
5
1
States in red are final states.
25
6
1
7
1
6
1
Changing the Lexical Analyzer
This use of a transition table allows the system
administrator to establish differ lexical choices for
different collections of documents.
Example:
To change the lexical analyzer to accept tokens that
begin with a digit, change the top right element of
the table to 1.
26
Stop Lists
Very common words, such as of, and, the, are rarely
of use in information retrieval.
A stop list is a list of such words that are removed
during lexical analysis.
A long stop list saves space in indexes, speeds
processing, and eliminates many false hits.
However, common words are sometimes significant
in information retrieval, which is an argument for a
short stop list. (Consider the query, "To be or not to
be?")
27
Suggestions for Including Words in a
Stop List
• Include the most common words in the English language
(perhaps 50 to 250 words).
• Do not include words that might be important for retrieval
(Among the 200 most frequently occurring words in
general literature in English are time, war, home, life,
water, and world).
• In addition, include words that are very common in
context (e.g., computer, information, system in a set of
computing documents).
28
Example: Stop List for Assignment 1
a
are
but
has
in
more
one
that
this
which
29
about
as
by
have
is
new
or
the
to
will
an
at
for
he
it
of
said
their
was
with
and
be
from
his
its
on
say
they
who
you
Example: the WAIS stop list
(first 84 of 363 multi-letter words)
30
about
after
alone
among
anyone
at
becoming
behind
beyond
can't
did
down
either
etc
above
according
afterwards again
along
already
amongst an
anything anywhere
be
became
been
before
being
below
billion
both
cannot
caption
didn't
do
during
each
else
elsewhere
even
ever
across
actually
adj
against all
almost
also
although always
another any
anyhow
are
aren't
around
because become
becomes
beforehand begin
beginning
beside
besides
between
but
by
can
co
could
couldn't
does
doesn't
don't
eg
eight
eighty
end
ending
enough
every
everyone everything
Stop list policies
How many words should be in the stop list?
• Long list lowers recall
Which words should be in list?
• Some common words may have retrieval importance:
-- home, life, water, war, world
• In certain domains, some words are very common:
-- computer, program, source, machine, language
There is very little systematic evidence to use in selecting
a stop list.
31
Stop Lists in Practice
The modern tendency is:
(a) have very short stop lists for broad-ranging or multi-lingual
document collections, especially when the users are not
trained.
(b) have longer stop lists for document collections in well-defined
fields, especially when the users are trained professional.
32
Stemming
Morphological variants of a word (morphemes). Similar
terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in Information Retrieval. Grouping words with a
common stem together.
For example, a search on reads, also finds read, reading, and
readable
Stemming consists of removing suffixes and conflating the
resulting morphemes. Occasionally, prefixes are also removed.
33
Categories of Stemmer
The following diagram illustrate the various
categories of stemmer. Porter's algorithm is shown
by the red path.
Conflation methods
Manual
Automatic (stemmers)
Affix
removal
Longest
match
34
Successor
variety
Simple
removal
Table
lookup
n-gram
Porter Stemmer
A multi-step, longest-match stemmer.
M. F. Porter, An algorithm for suffix stripping. (Originally
published in Program, 14 no. 3, pp 130-137, July 1980.)
http://www.tartarus.org/~martin/PorterStemmer/def.txt
Notation
v
c
(vc)m
vowel
constant
(vowel followed by a constant) repeated m times
Any word can be written: [c](vc)m[v]
35
m is called the measure of the word
Porter's Stemmer
Porter Stemming Algorithm
Complex suffixes
Complex suffixes are removed bit by bit in the different
steps. Thus:
GENERALIZATIONS
becomes
becomes
becomes
becomes
36
GENERALIZATION (Step 1)
GENERALIZE (Step 2)
GENERAL (Step 3)
GENER (Step 4).
Porter Stemmer: Step 1a
Suffix
sses
ss
caresses -> caress
ies
i
ponies -> poni
ties
-> ti
ss
s
37
Replacement Examples
ss
caress -> caress
cats
-> cat
Porter Stemmer: Step 1b
Conditions
Suffix
Replacement Examples
(m > 0)
eed
ee
feed -> feed
agreed -> agree
(*v*)
ed
null
plastered -> plaster
bled -> bled
(*v*)
ing
null
motoring -> motor
sing -> sing
*v* - the stem contains a vowel
38
Porter Stemmer: Step 5a
Step 5a is defined as follows.
(m>1)
E ->
(m=1 and not *o)
E ->
probate -> probat
rate -> rate
cease -> ceas
*o - the stem ends cvc, where the second c is not W, X or Y
(e.g. -WIL, -HOP).
39
Stemming in Practice
Evaluation studies have found that stemming can affect retrieval
performance, usually for the better, but the results are mixed.
• Effectiveness is dependent on the vocabulary. Fine
distinctions may be lost through stemming.
• Automatic stemming is as effective as manual conflation.
• Performance of various algorithms is similar.
Porter's Algorithm is entirely empirical, but has proved to be an
effective algorithm for stemming English text with trained users.
40
Selection of tokens, weights, stop lists
and stemming
Special purpose collections (e.g., law, medicine, monographs)
Best results are obtained by tuning the search engine for the
characteristics of the collections and the expected queries.
It is valuable to use a training set of queries, with lists of
relevant documents, to tune the system for each application.
General purpose collections (e.g., news articles)
The modern practice is to use a basic weighting scheme (e.g.,
tf.idf), a simple definition of token, a short stop list and little
stemming except for plurals, with minimal conflation.
Web searching combine similarity ranking with ranking based on
document importance.
41
Download