Introduction to Information Retrieval
Lecture 2:
The term vocabulary and postings lists
Related to Chapter 2:
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf
Ch. 1
Recap of the previous lecture
 Basic inverted indexes:
 Structure: Dictionary and Postings
 Boolean query processing
 Intersection by linear time “merging”
 Simple optimizations
Plan for this lecture
 Preprocessing to form the term vocabulary
 Tokenization
 Normalization
 Postings
 Faster merges: skip lists
 Positional postings and phrase queries
Recall the basic indexing pipeline
[Figure: the basic indexing pipeline]
Documents to be indexed: “Friends, Romans, countrymen.”
→ Tokenizer → token stream: Friends  Romans  Countrymen
→ Linguistic modules → modified tokens: friend  roman  countryman
→ Indexer → inverted index: friend → 2, 4;  roman → 1, 2;  countryman → 13, 16
Sec. 2.1
The first step: Parsing a document
 What format is it in?
 pdf/word/excel/html
 What language is it in?
 What character set is in use?
Sec. 2.1
Complications: Format/language
 Documents being indexed can include docs from
many different languages
 A single index may have to contain terms of several
languages.
 Sometimes a document or its components can
contain multiple languages/formats
 French email with a German pdf attachment.
TOKENIZATION
Tokenization
 Given a character sequence, tokenization is the task of
chopping it up into pieces, called tokens.
 Perhaps at the same time throwing away certain
characters, such as punctuation.
Sec. 2.2.1
Tokenization
 Input: “university of Qom, computer department”
 Output: tokens
 university
 of
 Qom
 computer
 department
 Each such token is now a candidate for an index entry,
after further processing: Normalization.
 But what are valid tokens to emit?
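A minimal Python sketch of this step (the function name is illustrative, not from the book): split on non-word characters and throw away the punctuation.

```python
import re

def tokenize(text):
    """Chop a character sequence into tokens, throwing away punctuation.

    Deliberately naive: real tokenizers must also handle apostrophes,
    hyphens, numbers, and language-specific rules (see the issues that follow).
    """
    return [tok for tok in re.split(r"\W+", text) if tok]

print(tokenize("university of Qom, computer department"))
# ['university', 'of', 'Qom', 'computer', 'department']
```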
Sec. 2.2.1
Issues in tokenization
 Iran’s capital → Iran? Irans? Iran’s?
 Hyphen
 Hewlett-Packard → Hewlett and Packard as two tokens?
 the hold-him-back-and-drag-him-away maneuver
 co-author
 lowercase, lower-case, lower case?
 Space
 San Francisco: How do you decide it is one token?
Sec. 2.2.1
Issues in tokenization
 Numbers
 Older IR systems may not index numbers
 But often very useful: looking up error codes/stack traces on the
web
 3/12/91
 Mar. 12, 1991
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333
 12/3/91
Sec. 2.2.1
Language issues in tokenization
 German noun compounds are not segmented
 Lebensversicherungsgesellschaftsangestellter
 ‘life insurance company employee’
 German retrieval systems benefit greatly from a compound splitter
module
 Can give a 15% performance boost for German
Sec. 2.2.1
Language issues in tokenization
 Chinese and Japanese have no spaces between
words:
 莎拉波娃现在居住在美国东南部的佛罗里达。
(“Sharapova now lives in Florida, in the south-eastern United States.”)
 Further complicated in Japanese, with multiple
alphabets intermingled
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single string mixing Katakana, Hiragana, Kanji, and Romaji)
Sec. 2.2.2
Stop words
 With a stop list, you exclude from the dictionary
words like the, a, and, to, be
 Intuition: They have little semantic content.
 The commonest words.
 Using a stop list significantly reduces the number of
postings that a system has to store, because stop words
account for a large share of all postings.
 Two ways to construct one:
 By an expert (hand-curated)
 By machine, e.g., from the commonest terms in the collection
(a small sketch follows)
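A minimal Python sketch of both options; the function names and the tiny stop list here are illustrative, not from the book.

```python
from collections import Counter

# Hand-curated (expert) stop list: an illustrative, deliberately tiny set.
STOP_WORDS = {"the", "a", "and", "to", "be", "of"}

def remove_stop_words(tokens):
    """Drop stop-list words from a token stream before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stop_list_by_frequency(token_stream, k=10):
    """'By machine': treat the k most frequent terms in the collection as stop words."""
    counts = Counter(t.lower() for t in token_stream)
    return {term for term, _ in counts.most_common(k)}
```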
Stop words
 But you need them for:
 Phrase queries: “President of Iran”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 The general trend in IR systems: from large stop lists
(200–300 terms) to very small stop lists (7–12 terms) to
no stop list.
 Good compression techniques (lecture 5) mean the space
for including stop words in a system is very small
 Good query optimization techniques (lecture 7) mean
you pay little at query time for including stop words
NORMALIZATION
Sec. 2.2.3
Normalization to terms
 Example: We want to match I.R. and IR
 Token normalization is the process of canonicalizing
tokens so that matches occur despite superficial
differences in the character sequences of the tokens
 Result is terms:
 A term is a (normalized) word type, which is an entry
in our IR system dictionary
Sec. 2.2.3
Normalization Example: Case folding
 Reduce all letters to lower case
 Exception: upper case in mid-sentence?
 Often best to lower case everything, since users will use
lowercase regardless of ‘correct’ capitalization.
Normalization to terms
 One way is using equivalence classes:
 car = automobile   color = colour
 A search for one of these terms will retrieve documents that
contain any member of its class.
 We most commonly implicitly define equivalence classes
of terms, rather than calculating them fully in advance, e.g.,
 deleting periods to form a term
 I.R., IR
 deleting hyphens to form a term
 anti-discriminatory, antidiscriminatory
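A minimal Python sketch of this kind of implicit equivalence classing (the rules are just the two examples above plus case folding; the function name is illustrative):

```python
def normalize(token):
    """Map a token to its term by case folding and deleting periods and hyphens.

    This implicitly defines equivalence classes: I.R. / IR and
    anti-discriminatory / antidiscriminatory end up as the same term.
    """
    return token.lower().replace(".", "").replace("-", "")

assert normalize("I.R.") == normalize("IR") == "ir"
assert normalize("anti-discriminatory") == normalize("antidiscriminatory")
```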
Sec. 2.2.3
Normalization: other languages
 Accents: e.g., French résumé vs. resume.
 Umlauts: e.g., German: Tuebingen vs. Tübingen
 Normalization of things like date forms
 7月30日 vs. 7/30
 Tokenization and normalization may depend on the
language of the document, and so are intertwined with
language detection
 Crucial: Need to “normalize” indexed text as well as
query terms into the same form
Sec. 2.2.3
Normalization to terms
 What is the disadvantage of equivalence classing?
 An alternative to equivalence classing is to do asymmetric
query expansion
 It is hand constructed
 Potentially more powerful, but less efficient
 An example:
 Enter: window → Search: window, windows
 Enter: windows → Search: Windows, windows, window
 Enter: Windows → Search: Windows
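A minimal Python sketch of such a hand-constructed expansion table; the entries simply mirror the example above.

```python
# Hand-constructed asymmetric expansion: the query term on the left is
# expanded into the list of terms actually searched for.
EXPANSIONS = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query_term(term):
    """Asymmetric expansion: terms without an entry are searched for as-is."""
    return EXPANSIONS.get(term, [term])
```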
Stemming and lemmatization
 Documents are going to use different forms of a
word,
 organize, organizes, and organizing
 am, are, and is
 There are families of derivationally related words
with similar meanings,
 democracy, democratic, and democratization.
 Reduce tokens to their “roots” before indexing.
Sec. 2.2.4
Stemming
 “Stemming” suggests crude affix chopping
 language dependent
 Example:
 Porter’s algorithm
 http://www.tartarus.org/~martin/PorterStemmer
 Lovins stemmer
 http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
Sec. 2.2.4
Typical rules in Porter
 sses → ss   presses → press
 ies → i    bodies → bodi
 ss → ss    press → press
 s →        cats → cat
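These four rules are essentially step 1a of Porter's algorithm; a minimal Python sketch (suffix rules are tried longest-first, and only the first matching rule applies):

```python
def porter_step_1a(word):
    """The plural-handling rules shown above (a tiny slice of Porter's algorithm)."""
    if word.endswith("sses"):
        return word[:-2]      # sses -> ss   (presses -> press)
    if word.endswith("ies"):
        return word[:-2]      # ies  -> i    (bodies  -> bodi)
    if word.endswith("ss"):
        return word           # ss   -> ss   (press   -> press)
    if word.endswith("s"):
        return word[:-1]      # s    ->      (cats    -> cat)
    return word

for w in ["presses", "bodies", "press", "cats"]:
    print(w, "->", porter_step_1a(w))
```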
Sec. 2.2.4
Lemmatization
 Reduce inflectional/variant forms to base form
(lemma) properly with the use of a vocabulary and
morphological analysis of words
 Lemmatizer: a tool from natural language processing
which does full morphological analysis to accurately
identify the lemma for each word.
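For illustration, a sketch using NLTK's WordNet-based lemmatizer; it assumes the nltk package and its WordNet data are available and is only one possible choice of tool.

```python
import nltk
nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["organizes", "organizing", "am", "are", "is"]:
    # pos="v" tells the lemmatizer to treat the word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# e.g. "am", "are", and "is" should all map to the lemma "be"
```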
Sec. 2.2.4
Language-specificity
 Many of the above features embody transformations
that are language-specific and often application-specific
 There are “plug-in” addenda to the indexing process
 Both open source and commercial plug-ins are
available for handling these
Helpfulness of normalization
 Do stemming and other normalizations help?
 Definitely useful for Spanish, German, Finnish, …
 30% performance gains for Finnish!
 What about English?
 Not as much help!
 Helps a lot for some queries, but hurts performance a lot for others.
Helpfulness of normalization
 Example:
 operative (dentistry) ⇒ oper
 operational (research) ⇒ oper
 operating (systems) ⇒ oper
 For a case like this, moving to using a lemmatizer
would not completely fix the problem
FASTER POSTINGS MERGES:
SKIP LISTS
Sec. 2.3
Recall basic merge
 Walk through the two postings simultaneously, in
time linear in the total number of postings entries
[Figure: the two postings lists being merged]
Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31
Result of the merge: 2 → 8
If the list lengths are m and n, the merge takes O(m+n)
operations.
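A minimal Python sketch of this linear-time intersection (a two-pointer merge over the sorted docID lists):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 41, 48, 64, 128],
                [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```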
Sec. 2.3
Augment postings with skip pointers
[Figure: postings lists augmented with skip pointers]
Brutus: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers 2 → 41 and 41 → 128
Caesar: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers 1 → 11 and 11 → 31
 Skip pointers are added at indexing time.
 The resulting augmented list is a skip list.
Sec. 2.3
Query processing with skip pointers
[Figure: the same skip-augmented postings lists for Brutus and Caesar as above]
Suppose we’ve stepped through the lists until we process 8 on
each list.
We then have 41 and 11. 11 is smaller.
But the skip successor of 11 is 31, so
we can skip ahead past the intervening postings.
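A Python sketch of this intersection in the spirit of the textbook's skip-pointer algorithm; the √L skip placement used by add_skips anticipates the heuristic on the next slides, and the names are illustrative.

```python
import math

def add_skips(postings):
    """Attach roughly sqrt(L) evenly spaced skip pointers to a postings list.

    skips[i] is the index we may jump to from position i, or None.
    """
    L = len(postings)
    step = int(math.sqrt(L)) or 1
    return [i + step if (i % step == 0 and i + step < L) else None
            for i in range(L)]

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using skips to jump over runs."""
    s1, s2 = add_skips(p1), add_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                while s1[i] is not None and p1[s1[i]] <= p2[j]:
                    i = s1[i]          # follow skip pointers on p1
            else:
                i += 1
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                while s2[j] is not None and p2[s2[j]] <= p1[i]:
                    j = s2[j]          # follow skip pointers on p2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```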
Sec. 2.3
Where do we place skips?
 Tradeoff:
 More skips → more likely to skip ahead, but more
comparisons against skip pointers.
 Fewer skips → few successful skips, but fewer pointer
comparisons.
Sec. 2.3
Placing skips
 Simple heuristic: for a postings list of length L, use √L
evenly-spaced skip pointers.
 This ignores the distribution of query terms.
 Easy if the index is relatively static; harder if L keeps
changing because of updates.
 The I/O cost of loading a bigger postings list can
outweigh the gains from quicker in-memory merging!
D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index.
SIGIR 2002, pp. 215-221.
PHRASE QUERIES AND POSITIONAL
INDEXES
Sec. 2.4
Phrase queries
 Want to be able to answer queries such as “stanford
university” – as a phrase
 Thus the sentence “I went to university at Stanford”
is not a match.
 Most recent search engines support a double-quotes
syntax for phrases
Phrase queries
 Phrase queries have proven to be easily understood
and successfully used by users.
 As many as 10% of web queries are phrase queries.
 For this, it no longer suffices to store only
<term : docs> entries
 Solutions?
Sec. 2.4.1
A first attempt: biword indexes
 Index every consecutive pair of terms in the text as a
phrase
 For example the text “Qom computer department”
would generate the biwords
 Qom computer
 computer department
 Two-word phrase query-processing is now
immediate.
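A minimal Python sketch of biword generation (the function name is illustrative):

```python
def biwords(tokens):
    """Index every consecutive pair of terms as one phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["Qom", "computer", "department"]))
# ['Qom computer', 'computer department']
```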
Longer phrase queries
 The query “modern information retrieval course” can
be broken into the Boolean query on biwords:
 modern information AND information retrieval AND
retrieval course
 Works fairly well in practice.
 But there can and will be occasional errors.
Sec. 2.4.1
Issues for biword indexes
 Errors, as noted before.
 Index blowup due to bigger dictionary
 Infeasible for more than biwords, big even for them.
 Biword indexes are not the standard solution but can
be part of a compound strategy.
Sec. 2.4.2
Solution 2: positional indexes
 In the postings, store for each term the position(s) in
which tokens of it appear:
<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
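A minimal Python sketch of such a positional index as a nested dictionary (names and the toy document are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: list of terms}.  Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term][doc_id].append(position)
    return index

index = build_positional_index({1: "to be or not to be".split()})
print(dict(index["be"]))   # {1: [1, 5]}
```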
Sec. 2.4.2
Positional index example
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Which of docs 1,2,4,5
could contain “to be
or not to be”?
 For phrase queries, we need to deal with more
than just equality.
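A sketch of that positional check in Python: candidate documents must contain every query word, and some occurrence of the first word must be followed by the remaining words at consecutive positions. The tiny index literal is illustrative, not the example above.

```python
def phrase_query(index, phrase):
    """Return the doc_ids in which the words of `phrase` occur at consecutive positions."""
    words = phrase.split()
    # candidate documents must contain every word of the phrase
    docs = set(index.get(words[0], {}))
    for w in words[1:]:
        docs &= set(index.get(w, {}))
    hits = set()
    for d in docs:
        positions = [set(index[w][d]) for w in words]
        if any(all(start + k in positions[k] for k in range(len(words)))
               for start in positions[0]):
            hits.add(d)
    return hits

# A toy positional index: term -> {doc_id: [positions]}
index = {"to":  {1: [0, 4], 4: [3]},
         "be":  {1: [1, 5], 4: [0], 5: [2]},
         "or":  {1: [2]},
         "not": {1: [3], 4: [1]}}
print(phrase_query(index, "to be"))        # {1}
print(phrase_query(index, "be or not"))    # {1}
```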
Sec. 2.4.2
Proximity queries
 LIMIT /3 STATUTE /3 FEDERAL /2 TORT
 /k means “within k words of (on either side)”.
 Clearly, positional indexes can be used for such
queries; biword indexes cannot.
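A minimal sketch of the /k test for a single document, given the sorted position lists of the two terms; a linear merge of the two lists would be faster than this quadratic scan, but the idea is the same.

```python
def within_k(positions_a, positions_b, k):
    """True if some occurrence of term A lies within k words of an occurrence of B.

    positions_a and positions_b are the position lists of the two terms
    inside a single document.
    """
    return any(abs(a - b) <= k for a in positions_a for b in positions_b)

print(within_k([7, 18, 33], [20, 95], 3))   # True, since |18 - 20| <= 3
```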
Sec. 2.4.2
Positional index size
 Need an entry for each occurrence, not just once per
document
 Consider a term with frequency 0.1%
Document size    Postings    Positional postings
1,000            1           1
100,000          1           100
Sec. 2.4.2
Rules of thumb
 A positional index is 2–4 times as large as a
non-positional index
 Positional index size 35–50% of volume of original
text
 Caveat: all of this holds for “English-like” languages
Sec. 2.4.3
Combination schemes
 These two approaches (biword indexes and positional
indexes) can be profitably combined
 For particular phrases (“Hossein Rezazadeh”) it is
inefficient to keep on merging positional postings lists
Combination schemes
 Williams et al. (2004) evaluate a more sophisticated
mixed indexing scheme
 A typical web query mixture was executed in ¼ of the time
of using just a positional index
 It required 26% more space than having a positional index
alone
H.E. Williams, J. Zobel, and D. Bahle. 2004. Fast Phrase
Querying with Combined Indexes. ACM Transactions on
Information Systems.
Exercise
 Write pseudo-code for biword phrase queries using a
positional index.
 Do exercises 2.5 and 2.6 of your book.