Zoekmachines (Search Engines)
Gertjan van Noord
2014
Lecture 3: tolerant retrieval
Tolerant retrieval: overview
Methods to handle imprecise queries
• wildcard queries
• typos
• alternative spellings
Building alternative indexes
Finding the most similar terms
Sec. 3.2
Wild-card queries: *
mon*: find docs containing any word beginning with
“mon”.
*mon: find words ending in “mon”: harder.
mo*n: find words that start with ‘mo’ and end with ‘n’
m*o*n: find words that start with ‘m’, end with ‘n’, and
have an ‘o’ somewhere in between.
Wildcard queries
Two steps in retrieval for wildcard queries:
• Find all terms that match the wildcard pattern
• Find all docs containing any of these terms
Three ways to do this: B-trees, permuterm index, k-gram index
Dictionary structures:
Hash: very efficient (lookup and construction), but
cannot be used to find terms that are “close” to the
key
Binary tree and B-tree (and tries): data structures which
keep data sorted (and balanced). Efficient search, but
construction is more costly. Words with the same prefix
are close together in the sorted order → can be used for
tolerant retrieval.
Sec. 3.2
Wild-card queries: *
mon*: Easy with binary tree (or B-tree) lexicon: retrieve
all terms in range: mon ≤ w < moo
*mon: Maintain an additional B-tree for terms written
backwards; retrieve all words in range: nom ≤ w <
non.
m*n: Look up m* in the B-tree and *n in the reverse B-tree,
then intersect the two term sets. Expensive!
m*o*n: ??
Solution: the permuterm index
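The range lookups above can be sketched with a sorted list and binary search standing in for the B-tree. This is a minimal sketch, not a B-tree implementation: the vocabulary and the function names (prefix_range, starts_with, ends_with, starts_and_ends) are made up for illustration.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < upper, e.g. mon <= w < moo."""
    if not prefix:
        return list(sorted_terms)
    # Upper bound: increment the last character, "mon" -> "moo".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

# Hypothetical toy vocabulary (assumption, not from the slides).
terms = sorted(["demon", "lemon", "man", "moan", "mon", "monday",
                "money", "monk", "moon", "salmon", "sermon"])
# Second sorted list with every term written backwards.
reversed_terms = sorted(t[::-1] for t in terms)

def starts_with(prefix):
    """X*: range lookup in the forward list."""
    return prefix_range(terms, prefix)

def ends_with(suffix):
    """*X: range lookup on the reversed suffix in the backward list."""
    return [t[::-1] for t in prefix_range(reversed_terms, suffix[::-1])]

def starts_and_ends(prefix, suffix):
    """X*Y: intersect both lookups; prefix and suffix must not overlap."""
    both = set(starts_with(prefix)) & set(ends_with(suffix))
    return sorted(t for t in both if len(t) >= len(prefix) + len(suffix))
```

The X*Y case shows why this is expensive: both range lookups can return long lists of which only a few terms survive the intersection.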
Permuterm index and queries
Permuterm index
add an end symbol: cat$
index all permuterms (in a structure like B-tree):
cat$ at$c t$ca $cat
Wildcard query processing:
add $, rotate (if needed) until * is at the end
examples: queries that can find (among others) cat:
c*t c*at ca* ca*t *t *at
What is the permuterm form of each?
Sec. 3.2.1
Permuterm index
For term hello, index under:
hello$, ello$h, llo$he, lo$hel, o$hell, $hello
where $ is a special symbol.
Queries:
X      lookup on X$
X*     lookup on $X*
*X     lookup on X$*
*X*    lookup on X*
X*Y    lookup on Y$X*
X*Y*Z  ???? Exercise!
Example: Query = hel*o
X=hel, Y=o
Lookup o$hel*
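A minimal sketch of the permuterm idea, assuming a sorted list of (rotation, term) pairs in place of a B-tree and handling only queries with exactly one *. The names rotations, build_permuterm, rotate_query and permuterm_lookup are illustrative, not taken from any library.

```python
import bisect

def rotations(term):
    """All rotations of term + '$', e.g. cat -> cat$, at$c, t$ca, $cat."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in terms for rot in rotations(term))

def rotate_query(query):
    """Add '$' and rotate until '*' is at the end; return the part before '*'."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star]   # hel*o -> o$hel

def permuterm_lookup(index, query):
    """Terms matching a wildcard query that contains exactly one '*'."""
    if query.count("*") != 1:
        raise ValueError("this sketch supports exactly one '*'")
    prefix = rotate_query(query)
    # (prefix,) sorts before every (prefix..., term) pair with that prefix.
    i = bisect.bisect_left(index, (prefix,))
    matches = set()
    while i < len(index) and index[i][0].startswith(prefix):
        matches.add(index[i][1])
        i += 1
    return matches

# Hypothetical toy vocabulary (assumption).
index = build_permuterm(["cat", "coat", "cot", "halo", "hello", "help", "hero"])
```

For example, permuterm_lookup(index, "hel*o") rotates the query to o$hel and scans all rotations with that prefix. The price is index size: every term of length n is stored n+1 times.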
K-gram index
k-gram index (example k=3)
to each dictionary term add a start and an end
symbol: $kitten$
from this string, list all trigrams
kitten: $ki kit itt tte ten en$
make an inverted index of trigrams
$ki → (kinkiten, kitchen, kitten, ...)
how can we find kitten?
An alternative: K-gram indexes
Index for dictionary lookup, not for document
retrieval!
Posting lists point from k-gram to vocabulary
terms
k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...): bigram
(digram), trigram, …
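As a small illustration of the definition, the helper below extracts k-grams from any sequence, so the same code covers character and word k-grams. The function name kgrams and the example inputs are assumptions for this sketch.

```python
def kgrams(items, k):
    """All groups of k consecutive items in a string or list."""
    return [items[i:i + k] for i in range(len(items) - k + 1)]

# Character trigrams of a term with start and end symbols added.
char_trigrams = kgrams("$kitten$", 3)

# Character bigrams of the same string.
char_bigrams = kgrams("$kitten$", 2)

# Word bigrams: the items are words instead of characters.
word_bigrams = kgrams("tolerant retrieval of terms".split(), 2)
```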
K-gram index and queries
Part of 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing
kit*en: $kit*en$ → $ki AND kit AND en$
kinkiten??? It contains all three trigrams but does not match kit*en: postprocessing needed!
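The lookup and the postprocessing step can be sketched as follows. This is an illustrative sketch with Python sets as posting lists; the function names are assumptions, and the post-filter uses a regular expression built from the query, which is one possible choice rather than a prescribed method.

```python
import re
from collections import defaultdict

def kgrams(s, k=3):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def build_kgram_index(terms, k=3):
    """Map each k-gram of $term$ to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index

def wildcard_candidates(index, query, k=3):
    """AND together the postings of all k-grams in the query pieces."""
    pieces = ("$" + query + "$").split("*")
    grams = [g for piece in pieces for g in kgrams(piece, k)]
    if not grams:
        # Pieces too short to yield a k-gram: every term is a candidate.
        return set().union(*index.values())
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_lookup(index, query, k=3):
    """Candidates from the k-gram index, post-filtered against the query."""
    pattern = ".*".join(re.escape(piece) for piece in query.split("*"))
    return {t for t in wildcard_candidates(index, query, k)
            if re.fullmatch(pattern, t)}

index = build_kgram_index(["kinkiten", "kitchen", "kitten", "mitten"])
```

With this vocabulary, wildcard_candidates(index, "kit*en") still contains kinkiten, while wildcard_lookup(index, "kit*en") drops it.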
Sec. 3.2
Query processing
At this point, we have an enumeration of all terms in the
dictionary that match the wild-card query.
We still have to look up the postings for each enumerated
term.
E.g., consider the query: se*ate AND fil*er
This may result in the execution of many Boolean AND
queries.
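A sketch of this last step for se*ate AND fil*er, assuming the matching terms have already been enumerated: the postings of all terms matching one wildcard are merged (OR), and the merged sets are then intersected (AND). The postings dictionary and document numbers are invented for illustration, and expand uses a plain scan over the vocabulary instead of one of the indexes above.

```python
import re

def expand(pattern, vocabulary):
    """Enumerate vocabulary terms that match a wildcard pattern."""
    regex = re.compile(".*".join(re.escape(p) for p in pattern.split("*")))
    return [term for term in vocabulary if regex.fullmatch(term)]

def wildcard_and(patterns, postings):
    """(t1 OR t2 OR ...) AND (u1 OR u2 OR ...) over document-id sets."""
    result = None
    for pattern in patterns:
        docs = set()
        for term in expand(pattern, postings):
            docs |= postings[term]          # OR within one wildcard
        result = docs if result is None else result & docs  # AND across
    return result or set()

# Hypothetical postings: term -> set of document ids (assumption).
postings = {
    "senate": {1, 4}, "sedate": {2}, "separate": {3, 4},
    "filter": {2, 4}, "filler": {1, 5}, "film": {3},
}
hits = wildcard_and(["se*ate", "fil*er"], postings)
```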
Spell correction
When? If a query word (or word combination) is rare in,
or missing from, the dictionary
Approach:
1. Find similar term(s)
2. Calculate their similarity to the query term
3. Choose the most frequent ones
Finding similar words and
calculating their similarity
use the k-gram index of words and calculate the Jaccard
coefficient to find the terms most similar to the query term
|A ∩ B| / |A ∪ B|
relative similarity:
size of the set of elements (k-grams) in common
divided by
size of the set of all elements
SET: no duplicates!
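The coefficient can be computed directly on Python sets, which drop duplicate k-grams automatically. A minimal sketch, assuming trigrams with $ as start and end symbol; kgram_set, jaccard and most_similar are illustrative names.

```python
def kgram_set(term, k=3):
    """Set of k-grams of $term$ (a set, so no duplicates)."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b, k=3):
    """|A ∩ B| / |A ∪ B| over the k-gram sets of two terms."""
    set_a, set_b = kgram_set(a, k), kgram_set(b, k)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def most_similar(query, vocabulary, k=3):
    """Vocabulary terms sorted by decreasing Jaccard coefficient."""
    scored = [(term, jaccard(query, term, k)) for term in vocabulary]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For kitten and kitchen, three trigrams are shared ($ki, kit, en$) out of ten distinct trigrams in total, so the coefficient is 3/10.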
Even more precise
then use Levenshtein distance to select, more precisely,
the terms with the smallest edit distance to
the query term
demo:
http://www.miislita.com/searchito/levenshtein-editdistance.html
26-01-12
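Levenshtein distance
m(i,j) = min( m(i-1,j) + 1, m(i,j-1) + 1, m(i-1,j-1) + c )
where c = 0 if the two characters match and 1 otherwise
m(i,j): minimal edit distance between the first i characters
of one word and the first j characters of the other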
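A minimal sketch of the standard dynamic-programming computation, in which each cell is derived from its three neighbours m(i-1,j), m(i,j-1) and m(i-1,j-1). It assumes unit costs for insertion, deletion and substitution; the function name levenshtein is illustrative.

```python
def levenshtein(s, t):
    """Minimal number of insertions, deletions and substitutions."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # delete all i characters of s
    for j in range(cols):
        m[0][j] = j                      # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # deletion
                          m[i][j - 1] + 1,         # insertion
                          m[i - 1][j - 1] + cost)  # match or substitution
    return m[len(s)][len(t)]
```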
Phonetic similarity
To calculate which (English) written words are most
similar in pronunciation, the SOUNDEX algorithm
gives a (rather rough) measure.
Demo:
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
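A sketch of the common American Soundex variant: keep the first letter, encode the remaining consonants as digits, collapse adjacent equal codes (also across h and w), and pad or cut to four characters. Other variants differ in details, so this is one possible reading rather than the exact algorithm behind the demo; the name soundex is illustrative.

```python
# Letter groups that share a Soundex digit; vowels, y, h and w get no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Four-character Soundex code, e.g. Robert -> R163."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":   # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]
```

Words that sound alike tend to share a code, for instance Robert and Rupert, which is why the measure is useful but rough.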