Introduction to Information Retrieval

Tolerant IR
Adapted from lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)

This lecture: "Tolerant" retrieval
- Wild-card queries
  - For expressiveness, to accommodate variants (e.g., automat*)
  - To deal with incomplete information about spelling, or multiple spellings (e.g., S*dney)
- Spelling correction
- Soundex

Dictionary data structures for inverted indexes (Sec. 3.1)
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what data structure?

A naïve dictionary (Sec. 3.1)
- An array of structs:

    char[20]    int          Postings *
    20 bytes    4/8 bytes    4/8 bytes

- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?

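As a rough illustration (not from the original slides), one entry of such an array might look like this in Python instead of a C-style struct; the field names are illustrative:

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class DictEntry:
      term: str             # the char[20] field: the term itself
      doc_freq: int         # document frequency
      postings: List[int]   # stands in for the Postings* pointer

  # a term-sorted array of entries; sorting enables binary search at query time
  dictionary = sorted(
      [DictEntry("among", 3, [1, 4, 9]), DictEntry("mace", 2, [3, 7])],
      key=lambda e: e.term)
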
Dictionary data structures (Sec. 3.1)
- Two main choices:
  - Hashtables
  - Trees
- Some IR systems use hashtables, some use trees.

Hashtables (Sec. 3.1)
- Each vocabulary term is hashed to an integer. (We assume you've seen hashtables before.)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, we occasionally need to do expensive rehashing.

Trees (Sec. 3.1)
- Simplest: binary search tree
- More usual: AVL trees (main memory), B-trees (disk)
- Trees require a standard ordering of characters, and hence of strings.
- Pros:
  - Solves the prefix problem (terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive, but AVL trees and B-trees mitigate the rebalancing problem.

Tree: B-tree (Sec. 3.1)
- Example: a root whose children cover the key ranges a-hu, hy-m, and n-z.
- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].

Wild-card queries
Wild-card queries: *
- mon*: find all docs containing any word beginning with "mon".
  - Hashing is unsuitable because order is not preserved.
  - Easy with a binary-tree (or B-tree) lexicon: retrieve all words in the range mon ≤ w < moo.
- *mon: find words ending in "mon": harder.
  - Maintain an additional B-tree for terms written backwards. Then retrieve all words in the range nom ≤ w < non.
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?

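A minimal sketch of the range lookup mon ≤ w < moo, using a sorted Python list with bisect as a stand-in for the B-tree lexicon (the vocabulary below is made up):

  import bisect

  lexicon = sorted(["moan", "mon", "monday", "money", "month", "moo", "moon"])

  def prefix_range(prefix):
      # all terms w with prefix <= w < prefix-with-last-character-bumped,
      # e.g. mon <= w < moo for the query mon*
      lo = bisect.bisect_left(lexicon, prefix)
      hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
      return lexicon[lo:hi]

  print(prefix_range("mon"))   # ['mon', 'monday', 'money', 'month']

For the exercise, one option is to intersect the terms matching pro* in this forward lexicon with the terms matching *cent obtained from the backwards-term B-tree.
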
Query processing
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
- We still have to look up the postings for each enumerated term.
- E.g., consider the query: se*ate AND fil*er
- This may result in the execution of many Boolean AND queries.

Flexible Handling of *'s
- B-trees handle *'s at the beginning and at the end of a query term.
- How can we handle *'s in the middle of a query term?
  - Consider co*tion.
  - We could look up co* AND *tion in a B-tree and intersect the two term sets.
  - Expensive!

Flexible Handling of *'s
- The solution: use a permuterm index.
  - Transform every wild-card query so that the *'s occur at the end.
  - Store each rotation of a word in the dictionary, say, in a B-tree.

Permuterm index
- For the term hello, index it under: hello$, ello$h, llo$he, lo$hel, o$hell, where $ is a special symbol.

Permuterm index
- Queries:
  - For X, look up X$
  - For X*, look up $X*
  - For *X, look up X$*
  - For *X*, look up X*
  - For X*Y, look up Y$X*
- Example: query = hel*o. With X = hel and Y = o, look up o$hel*.
- Example: m*n becomes n$m*, which matches man, moron, moon, mean, ...

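A minimal sketch of building permuterm keys and rotating a query, following the convention above (all rotations of hello$ are generated, including $hello, which is the key region a query such as hel* lands in after rotation to $hel*):

  def permuterm_keys(term):
      # every rotation of term + '$'; each key points back to the original term
      t = term + "$"
      return [t[i:] + t[:i] for i in range(len(t))]

  def rotate_query(query):
      # move the single '*' to the end: X*Y -> Y$X*, X* -> $X*, *X -> X$*
      x, y = query.split("*")
      return y + "$" + x + "*"

  print(permuterm_keys("hello"))   # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
  print(rotate_query("hel*o"))     # 'o$hel*'
  print(rotate_query("m*n"))       # 'n$m*'
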
Permuterm query processing
- Rotate the query wild-card to the right.
- Now use B-tree lookup as before.
- Problem: the permuterm index more than quadruples the size of the dictionary compared to a regular B-tree (an empirical number for English).

Bigram indexes
- Enumerate all k-grams (sequences of k characters) occurring in any term.
- E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
- $ is a special word-boundary symbol.
- Maintain a second "inverted" index from bigrams to the dictionary terms that match each bigram.

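A minimal sketch of such a k-gram-to-terms index (k = 2; the vocabulary is made up):

  from collections import defaultdict

  def kgrams(term, k=2):
      # k-grams of '$term$', where '$' marks the word boundary
      t = "$" + term + "$"
      return [t[i:i + k] for i in range(len(t) - k + 1)]

  def build_kgram_index(vocabulary, k=2):
      index = defaultdict(set)   # k-gram -> set of dictionary terms containing it
      for term in vocabulary:
          for g in kgrams(term, k):
              index[g].add(term)
      return index

  index = build_kgram_index(["mace", "madden", "among", "amortize", "around", "moon", "month"])
  print(sorted(index["$m"]))   # ['mace', 'madden', 'month', 'moon']
  print(sorted(index["mo"]))   # ['among', 'amortize', 'month', 'moon']
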
Bigram index example
- The k-gram index finds terms based on a query consisting of k-grams (here k = 2).
  - $m → mace, madden
  - mo → among, amortize
  - on → among, around

k-gram (bigram, trigram, ...) indexes
- (Figure: postings list in a 3-gram inverted index.)
- Note that we now have two different types of inverted indexes:
  - The term-document inverted index, for finding documents based on a query consisting of terms.
  - The k-gram index, for finding terms based on a query consisting of k-grams.

Processing n-gram wild-cards
- The query mon* can now be run as: $m AND mo AND on
- This gets terms that match the AND version of our wildcard query.
- But we'd incorrectly enumerate moon as well.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the original term-document inverted index.
- Fast, space-efficient (compared to the permuterm index).

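A minimal sketch of this procedure, reusing kgrams and build_kgram_index from the sketch above; fnmatch performs the post-filtering step:

  import fnmatch

  def wildcard_candidates(query, index, k=2):
      # k-grams of the query with '$' at the fixed ends, skipping any gram
      # that spans the '*' itself; e.g. 'mon*' -> ['$m', 'mo', 'on']
      padded = "$" + query + "$"
      grams = [padded[i:i + k] for i in range(len(padded) - k + 1)
               if "*" not in padded[i:i + k]]
      sets = [index.get(g, set()) for g in grams]
      candidates = set.intersection(*sets) if sets else set()
      # post-filter: moon contains $m, mo, and on but does not match mon*
      return {t for t in candidates if fnmatch.fnmatchcase(t, query)}

  index = build_kgram_index(["moon", "month", "monday", "among"])
  print(sorted(wildcard_candidates("mon*", index)))   # ['monday', 'month']
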
Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term.
- Wild-cards can result in expensive query execution (very large disjunctions...).
- Avoid encouraging "laziness" in the UI:
  Search: Type your search terms, use '*' if you need to. E.g., Alex* will match Alexander.

Advanced features
- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button.
- It also deters most users from unnecessarily hitting the engine with fancy queries.

Spelling correction
Spell correction
- Two principal uses:
  - Correcting document(s) being indexed
  - Retrieving matching documents when the query contains a spelling error
- Two main flavors:
  - Isolated word
    - Check each word on its own for misspelling.
    - Will not catch typos resulting in correctly spelled words, e.g., from → form.
  - Context-sensitive
    - Look at surrounding words, e.g., I flew form Heathrow to Narita.

Document correction
- Especially needed for OCR'ed documents.
  - Correction algorithms are tuned for this: rn vs. m.
  - Can use domain-specific knowledge.
    - E.g., OCR can confuse O and D more often than it would confuse O and I. (However, O and I are adjacent on the QWERTY keyboard, so they are more likely to be interchanged in typed documents.)
- Web pages and even printed material have typos.
- Goal: the dictionary contains fewer misspellings.
- But often we don't change the documents and instead fix the query-document mapping.

Query mis-spellings
- Our principal focus here.
  - E.g., the query Alanis Morisett
- We can either:
  - Retrieve documents indexed by the correct spelling, OR
  - Return several suggested alternative queries with the correct spelling
    - Did you mean ... ?

Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come.
- Two basic choices for this:
  - A standard lexicon, such as
    - Webster's English Dictionary
    - An "industry-specific" lexicon, hand-maintained
  - The lexicon of the indexed corpus
    - E.g., all words on the web
    - All names, acronyms, etc.
    - (Including the mis-spellings)

Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
- What's "closest"?
- We'll study several alternatives:
  - Edit distance
  - Weighted edit distance
  - n-gram overlap
    - Jaccard coefficient
    - Dice coefficient
    - Jaro-Winkler similarity

Edit distance
- Given two strings S1 and S2, the minimum number of operations to convert one to the other.
- Basic operations are typically character-level: insert, delete, replace, transpose.
- E.g., the edit distance from dof to dog is 1; from cat to act it is 2 (just 1 with transpose); from cat to dog it is 3.
- Generally found by dynamic programming.

Transform "SPARTAN" into "PART"
Step
Comparison
Edit necessary
Total Editing
1.
"S" to ""
delete: +1 edit
"S" to "" in 1 edit
2.
"P" to "P"
no edits necessary!
"SP" to "P" in 1 edit
3.
"A" to "A"
no edits necessary!
"SPA" to "PA" in 1 edit
4.
"R" to "R"
no edits necessary!
"SPAR" to "PAR" in 1
edit
5.
"T" to "T"
no edits necessary!
"SPART" to "PART" in
1 edit
6.
"A" to "T"
delete: +1 edit
"SPARTA" to "PART"
in 2 edits
7.
"N" to "T"
delete: +1 edit
"SPARTAN" to
"PART" in 3 edits
Transform "SEA" into "ATE"
Comparison
Edit
necessary
Total
Editing
1.
"S" to ""
delete: +1
edit
"S" to "" in
1 edit
2.
"E" to ""
delete: +1
edit
"SE" to
"" in 2 edits
"A" to "A"
no edits
necessary!
"SEA" to
"A" in 2
edits
"A" to "T"
insert: +1
edit
"SEA" to
"AT" in 3
edits
"A" to "E"
insert: +1
edit
"SEA" to
"ATE" in 4
edits
Step
3.
4.
5.
34
Transform "SEA" into "ATE"
Comparison
Edit
necessary
Total Editing
1.
"" to ""
no edit
necessary!
"" to "" in 0
edits
2.
"S" to "A"
replace: +1
edit
"S" to "A" in
1 edit
3.
"S" to "T"
insert: +1
edit
"S" to "AT" in
2 edits
4.
"E" to "E"
no edits
necessary!
"SE" to "ATE"
in 2 edits
"A" to "E"
delete: +1
edit
"SEA" to
"ATE" in 3
edits
Step
5.
35
Using edit distances
- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2).
- Intersect this set with the list of "correct" words.
- Show the terms you found to the user as suggestions.
- Alternatively,
  - We can look up all possible corrections in our inverted index and return all docs ... slow.
  - We can run with the single most likely correction.
  - These alternatives disempower the user, but save a round of interaction with the user.

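A rough sketch of the enumerate-then-intersect idea for edit distance 1 (a Norvig-style enumeration over the lowercase alphabet; the lexicon is made up):

  import string

  def edits1(word):
      # strings within edit distance 1 of word: deletions, replacements, insertions
      splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
      deletes = {a + b[1:] for a, b in splits if b}
      replaces = {a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase}
      inserts = {a + c + b for a, b in splits for c in string.ascii_lowercase}
      return deletes | replaces | inserts

  lexicon = {"dog", "dig", "do", "fog", "cat"}
  print(sorted(edits1("dof") & lexicon))   # ['do', 'dog', 'fog']

Distance 2 can be reached, at a correspondingly higher cost, by applying edits1 again to each result of edits1.
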
Edit distance to all dictionary terms? (Sec. 3.3.4)
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
  - Expensive and slow
  - Alternative?
- How do we cut down the set of candidate dictionary terms?
  - One possibility is to use n-gram overlap for this.
  - This can also be used by itself for spelling correction.

Distance between misspelled word and "correct" word
- We will study several alternatives:
  - Edit distance and Levenshtein distance
  - Weighted edit distance
  - k-gram overlap

Edit distance
- The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
- Levenshtein distance: the admissible basic operations are insert, delete, and replace.
  - Levenshtein distance dog-do: 1
  - Levenshtein distance cat-cart: 1
  - Levenshtein distance cat-cut: 1
  - Levenshtein distance cat-act: 2
  - Damerau-Levenshtein distance cat-act: 1
- Damerau-Levenshtein includes transposition as a fourth possible operation.

Levenshtein distance: Computation, Algorithm, and Example
(The original slides step through the dynamic-programming matrix here; the figures are not reproduced.)

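A minimal dynamic-programming sketch of the Levenshtein computation (unit cost for insert, delete, and replace):

  def levenshtein(s1, s2):
      # m[i][j] is the edit distance between the prefixes s1[:i] and s2[:j]
      m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
      for i in range(len(s1) + 1):
          m[i][0] = i                               # i deletions
      for j in range(len(s2) + 1):
          m[0][j] = j                               # j insertions
      for i in range(1, len(s1) + 1):
          for j in range(1, len(s2) + 1):
              sub = 0 if s1[i - 1] == s2[j - 1] else 1
              m[i][j] = min(m[i - 1][j] + 1,        # delete
                            m[i][j - 1] + 1,        # insert
                            m[i - 1][j - 1] + sub)  # copy or replace
      return m[len(s1)][len(s2)]

  print(levenshtein("dog", "do"))    # 1
  print(levenshtein("cat", "act"))   # 2
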
Each cell of the Levenshtein matrix
- Each cell holds the minimum of three candidate costs:
  - the cost of getting here from the upper-left neighbor (copy or replace),
  - the cost of getting here from the upper neighbor (delete),
  - the cost of getting here from the left neighbor (insert);
  i.e., the cheapest way of getting here.

Levenshtein distance: Example (matrix figure not reproduced)

Dynamic programming (Cormen et al.)
- Optimal substructure: the optimal solution to the problem contains within it subsolutions, i.e., optimal solutions to subproblems.
- Overlapping subsolutions: the subsolutions overlap, and are computed over and over again when computing the global optimum with a brute-force algorithm.
- Subproblem in the case of edit distance: what is the edit distance of two prefixes?
- Overlapping subsolutions: we need most prefix distances three times, corresponding to moving right, diagonally, or down.

Weighted edit distance
- As above, but the weight of an operation depends on the characters involved.
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
- Therefore, replacing m by n is a smaller edit distance than replacing m by q.
- We now require a weight matrix as input.
- Modify the dynamic programming to handle weights.

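A sketch of the weighted variant: the same dynamic program, but the replace cost comes from a caller-supplied weight function (the cost function below is a toy stand-in for a real keyboard-confusion matrix):

  def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
      # same recurrence as Levenshtein, with per-character replace costs
      m = [[0.0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
      for i in range(1, len(s1) + 1):
          m[i][0] = m[i - 1][0] + del_cost
      for j in range(1, len(s2) + 1):
          m[0][j] = m[0][j - 1] + ins_cost
      for i in range(1, len(s1) + 1):
          for j in range(1, len(s2) + 1):
              rep = 0.0 if s1[i - 1] == s2[j - 1] else sub_cost(s1[i - 1], s2[j - 1])
              m[i][j] = min(m[i - 1][j] + del_cost,
                            m[i][j - 1] + ins_cost,
                            m[i - 1][j - 1] + rep)
      return m[len(s1)][len(s2)]

  # toy weights: confusing m and n is cheaper than any other replacement
  cost = lambda a, b: 0.5 if {a, b} == {"m", "n"} else 1.0
  print(weighted_edit_distance("name", "nane", cost))   # 0.5
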
Exercise
❶ Compute the Levenshtein distance matrix for OSLO – SNOW.
❷ What are the Levenshtein editing operations that transform cat into catcat?

(A sequence of slides works through the Levenshtein matrix for the exercise cell by cell; the matrix figures are not reproduced here.)

How do I read out the editing operations that transform OSLO into SNOW?
(The following slides trace the answer back through the matrix; those figures are not reproduced here.)

n-gram overlap: using n-gram indexes for spelling correction
- Enumerate all the n-grams in the query string as well as in the lexicon.
- Example: bigram index, misspelled word bordroom
  - Bigrams: bo, or, rd, dr, ro, oo, om
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams.
- Threshold by the number of matching n-grams.
  - Variants: weight by keyboard layout, etc.
- Other uses of string similarity measures:
  - Ontology alignment

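A minimal sketch of that retrieval-and-threshold step, reusing kgrams and build_kgram_index from the earlier sketch (the lexicon and threshold are illustrative; these bigrams include the '$' boundary marks, unlike the list above):

  from collections import Counter

  def correction_candidates(query, index, k=2, threshold=2):
      # count how many of the query's k-grams each lexicon term shares, then threshold
      counts = Counter(t for g in kgrams(query, k) for t in index.get(g, ()))
      return {t for t, n in counts.items() if n >= threshold}

  index = build_kgram_index(["boardroom", "border", "lord", "morbid"])
  print(sorted(correction_candidates("bordroom", index, threshold=3)))   # ['boardroom', 'border']
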
Example with trigrams
- Suppose the text is november.
  - Trigrams are nov, ove, vem, emb, mbe, ber.
- The query is december.
  - Trigrams are dec, ece, cem, emb, mbe, ber.
- So 3 trigrams overlap (of 6 in each term).
- How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient
- A commonly used measure of overlap.
- Let X and Y be two sets; then the Jaccard coefficient is |X ∩ Y| / |X ∪ Y|.
- It equals 1 when X and Y have the same elements, and zero when they are disjoint.
- X and Y don't have to be of the same size.
- Always assigns a number between 0 and 1.
- Now threshold to decide if you have a match.
  - E.g., if J.C. > 0.8, declare a match.

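A minimal sketch of the trigram-overlap example above, scored with the Jaccard coefficient:

  def trigrams(term):
      # character trigrams without boundary marks, as in the november/december example
      return {term[i:i + 3] for i in range(len(term) - 2)}

  def jaccard(x, y):
      return len(x & y) / len(x | y)

  nov, dec = trigrams("november"), trigrams("december")
  print(sorted(nov & dec))             # ['ber', 'emb', 'mbe']
  print(round(jaccard(nov, dec), 2))   # 3 shared of 9 distinct trigrams -> 0.33
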
2-gram indexes for spelling correction: bordroom (figure not reproduced here)

Context-sensitive spell correction
- Text: I flew from Heathrow to Narita.
- Consider the phrase query "flew form Heathrow".
- We'd like to respond: Did you mean "flew from Heathrow"? because no docs matched the query phrase.

Context-sensitive correction
- Need surrounding context to catch this. (NLP is too heavyweight for this.)
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term.
- Now try all possible resulting phrases with one word "fixed" at a time:
  - flew from heathrow
  - fled form heathrow
  - flea form heathrow
- Hit-based spelling correction: suggest the alternative that has lots of hits.

General issues in spell correction (Sec. 3.3.5)
- We enumerate multiple alternatives for "Did you mean ... ?" to present to the user. How do we pick among them?
  - The most frequent combination in the corpus
  - The alternative hitting the most docs
  - Query log analysis
- More generally, rank alternatives probabilistically:
  argmax_corr P(corr | query)
- From Bayes' rule, this is equivalent to:
  argmax_corr P(query | corr) * P(corr)
  where P(query | corr) is the noisy channel model and P(corr) is the language model.

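A sketch of the probabilistic ranking; channel_prob and prior_prob stand for the noisy channel and language models, which the slides leave abstract, and the numbers below are made up purely for illustration:

  def best_correction(query, candidates, channel_prob, prior_prob):
      # argmax over candidate corrections of P(query | corr) * P(corr)
      return max(candidates,
                 key=lambda corr: channel_prob(query, corr) * prior_prob(corr))

  # toy models: a flat channel that prefers "no change", and corpus-frequency priors
  channel = lambda q, c: 0.8 if q == c else 0.05
  priors = {"morissette": 1e-5, "moriset": 1e-9}
  print(best_correction("moriset", ["morissette", "moriset"],
                        channel, lambda c: priors.get(c, 1e-12)))   # morissette
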
Microsoft Web N-gram service
- Provides the probability of occurrence of 1-, 2-, 3-, 4-, and 5-grams in the Web corpus (crawled for Bing).
- Specifically, it provides three different probabilities, corresponding to three different sorts of occurrence:
  - Title
  - Anchor
  - Body

Using context via Web N-gram service: Tokenization

Using context via Web N-gram service: Phrase extraction
- Tag cloud in Twitris
- Example phrases: foreign relations, Tahrir Square, tax returns, college acceptance letter, olympic gold medalist

Probabilistic Basis for Phrase Extraction
- Two successive words A and B should be treated as a phrase AB rather than as a sequence of two words if:
  Probability(AB) > Probability(A) * Probability(B)
- In other words, if the probability of their joint occurrence is greater than the probability of their coincidental joint occurrence (i.e., the product of the individual probabilities, obtained under an independence assumption).
- This approach can be extended to delimit longer phrases in a sentence.

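A minimal sketch of the test, with the n-gram probabilities assumed to come from something like the Web N-gram service (the numbers here are made up):

  def is_phrase(p_ab, p_a, p_b):
      # treat "A B" as a phrase when its joint probability beats the independence estimate
      return p_ab > p_a * p_b

  # toy probabilities for a bigram that occurs far more often than independence predicts
  print(is_phrase(p_ab=2e-7, p_a=4e-6, p_b=9e-5))   # True: 2e-7 > 3.6e-10
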
Computational cost
- Spell-correction is computationally expensive.
- Avoid running it routinely on every query?
- Run it only on queries that matched few docs.

Soundex
Soundex
- A class of heuristics to expand a query into phonetic equivalents.
  - Language-specific; mainly for names.
  - E.g., chebyshev → tchebycheff

Soundex – typical algorithm
- Turn every token to be indexed into a 4-character reduced form.
- Do the same with query terms.
- Build and search an index on the reduced forms (when the query calls for a soundex match).
- See http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   - B, F, P, V → 1
   - C, G, J, K, Q, S, X, Z → 2
   - D, T → 3
   - L → 4
   - M, N → 5
   - R → 6

Soundex continued
4. Collapse each pair of consecutive identical digits into a single digit.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.
Will hermann generate the same code?

Example: Soundex of HERMAN
- Retain H
- ERMAN → 0RM0N
- 0RM0N → 06505
- 06505 → 06505
- 06505 → 655
- Return H655
- Note: HERMANN will generate the same code.

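A minimal sketch of the four-character reduction, following the steps on the preceding slides (the collapse of repeated digits and the zero removal are applied to the letters after the retained first letter):

  def soundex(term):
      term = term.upper()
      # steps 2-3: map letters to digits, with '0' for A, E, I, O, U, H, W, Y
      digit = {}
      for letters, d in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                         ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
          for ch in letters:
              digit[ch] = d
      coded = [digit.get(ch, "") for ch in term[1:]]
      # step 4: collapse consecutive identical digits
      collapsed = []
      for d in coded:
          if d and (not collapsed or collapsed[-1] != d):
              collapsed.append(d)
      # steps 5-6: drop zeros, pad with trailing zeros, keep first letter + 3 digits
      body = "".join(collapsed).replace("0", "")
      return (term[0] + body + "000")[:4]

  print(soundex("HERMAN"), soundex("HERMANN"))   # H655 H655
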
Exercise
- Using the algorithm described above, find the soundex code for your name.
- Do you know someone who spells their name differently from you, but whose name yields the same soundex code?

Soundex (Sec. 3.4)
- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...).
- How useful is soundex?
- Not very, for information retrieval.
- Okay for "high recall" tasks (e.g., Interpol), though biased toward names of certain nationalities.
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR.

What queries can we process?
- We have:
  - A basic inverted index with skip pointers
  - A wild-card index
  - Spell-correction
  - Soundex
- Queries such as: (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

Aside – results caching
- If 25% of your users are searching for britney AND spears, then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists.
- Web query distribution is extremely skewed, and you can usefully cache results for common queries.
  - Query log analysis