Zoekmachines (Search Engines)
Gertjan van Noord, 2014
Lecture 3: tolerant retrieval

Tolerant retrieval: overview
Methods to handle imprecise queries:
• wildcard queries
• typos
• alternative spellings
Building alternative indexes
Finding the most similar terms

Sec. 3.2
Wild-card queries: *
• mon*: find docs containing any word beginning with “mon”.
• *mon: find words ending in “mon”: harder.
• mo*n: find words that start with “mo” and end with “n”.
• m*o*n: find words that start with “m”, end with “n”, and have an “o” somewhere in between.

Wildcard queries
Two steps in retrieval for wildcard queries:
• Find all terms that fall within the wildcard definition
• Find all docs containing any of these terms
Three ways to do this: B-trees, permuterm index, k-gram index

Dictionary structures
• Hash: very efficient (lookup and construction), but cannot be used to find terms that are “close” to the key.
• Binary tree and B-tree (and tries): data structures which keep data sorted (and balanced). Efficient search, but construction is more costly. Words with the same prefix are close together in the sorted order → can be used for robust retrieval.

Sec. 3.2
Wild-card queries: *
• mon*: easy with a binary tree (or B-tree) lexicon: retrieve all terms in the range mon ≤ w < moo
• *mon: maintain an additional B-tree for terms written backwards, retrieve all words in the range nom ≤ w < non
• m*n: combine the B-tree and the reverse B-tree (intersect both result sets). Expensive!
• m*o*n: ??
Solution: the permuterm index

Permuterm index and queries
Permuterm index:
• add an end symbol: cat$
• index all permuterms (in a structure like a B-tree): cat$ at$c t$ca $cat
Wildcard query processing:
• add $, rotate (if needed) until * is at the end
• examples: queries that can find (among others) cat: c*t c*at ca* ca*t *t *at
• permuterm form?

Sec. 3.2.1
Permuterm index
For term hello, index under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello, where $ is a special symbol.
Queries:
X      lookup on X$
X*     lookup on $X*
*X     lookup on X$*
*X*    lookup on X*
X*Y    lookup on Y$X*   (query = hel*o: X = hel, Y = o, lookup o$hel*)
X*Y*Z  lookup on ????
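The rotate-and-prefix-search idea for queries with a single * can be sketched roughly as follows. This is only an illustration, not the lecture's implementation: the function names (rotations, build_permuterm, rotate_query, lookup) and the sample vocabulary are made up, and a sorted Python list stands in for the B-tree, on the assumption that all we need from it is a prefix range scan. Queries with more than one * are deliberately left out.

```python
from bisect import bisect_left

END = "$"  # end-of-term marker, as on the slides


def rotations(term):
    """All rotations of term + '$', i.e. the permuterm keys for one term."""
    s = term + END
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(vocabulary):
    """Sorted list of (rotation, term) pairs; stands in for a B-tree (assumption)."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))


def rotate_query(query):
    """Add '$' and rotate so the single '*' is last; return the part before it."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    head, tail = query.split("*")
    return tail + END + head  # X*Y becomes the prefix Y$X


def lookup(index, query):
    """Set of vocabulary terms matching a single-wildcard query."""
    prefix = rotate_query(query)
    start = bisect_left(index, (prefix, ""))  # first key >= prefix
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(prefix):
            break  # keys are sorted, so the prefix range has ended
        matches.add(term)
    return matches


# Hypothetical vocabulary, for illustration only.
index = build_permuterm(["cat", "hello", "help", "halo", "mon", "monday", "moon", "salmon"])
print(sorted(lookup(index, "hel*o")))
print(sorted(lookup(index, "mo*n")))
```

Each term of length n contributes n + 1 keys, so the permuterm dictionary is several times larger than the plain one; that is the price paid for turning every single-* query into one prefix scan.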
Exercise: work out the permuterm lookup for X*Y*Z.

K-gram index
k-gram index (example: k = 3)
• to each dictionary term add a start and an end symbol: $kitten$
• from this string, list all trigrams. kitten: $ki kit itt tte ten en$
• make an inverted index of trigrams: $ki (kinkiten, kitchen, kitten, ...)
• how can we find kitten?

An alternative: k-gram indexes
• Index for dictionary lookup, not for document retrieval!
• Posting lists point from k-gram to vocabulary terms
• k-gram: group of k consecutive items (context-dependent: characters, syllables, words, ...)
• bigram (digram), trigram, …

K-gram index and queries
Part of a 3-gram inverted index:
$ki -> kinkiten kitchen kitten
en$ -> kinkiten kitchen kitten
che -> kitchen
ink -> kinkiten
itt -> kitten
kit -> kinkiten kitchen kitten
Wildcard query processing:
$kit*en$ → $ki AND kit AND en$
The result also contains kinkiten??? → postprocessing needed!

Sec. 3.2
Query processing
• At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
• We still have to look up the postings for each enumerated term.
• E.g., consider the query: se*ate AND fil*er
• This may result in the execution of many Boolean AND queries.

Spell correction
When? If a query word (or word combination) is quite rare or not available at all in the dictionary.
Approach:
1. Find similar term(s)
2. Calculate their similarity to the query term
3. Choose the most frequent ones

Finding similar words and calculating their similarity
• use the k-gram index of words and calculate the Jaccard coefficient to find the terms most similar to the query term
• Jaccard coefficient: |A ∩ B| / |A ∪ B|
• relative similarity: size of the set of elements (k-grams) in common, divided by the size of the set of all elements
• SET: no duplicates!
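The k-gram ideas above (trigram index, AND-ing postings for a wildcard query, postprocessing, and the Jaccard coefficient) can be sketched together as below. This is a rough illustration under stated assumptions: the function names are invented here, Python sets stand in for posting lists, and the postprocessing step is done with a regular expression, which is one possible choice and not something the slides prescribe.

```python
import re
from collections import defaultdict


def kgrams(term, k=3):
    """Set of k-grams of '$term$' (a set, so no duplicates)."""
    s = "$" + term + "$"
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def wildcard_lookup(index, query, k=3):
    """AND the postings of the query's k-grams, then filter out false positives."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    if not grams:
        raise ValueError("query too short for this sketch")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Postprocessing: check each candidate against the original wildcard pattern.
    pattern = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return {t for t in candidates if pattern.fullmatch(t)}


def jaccard(a, b, k=3):
    """|A ∩ B| / |A ∪ B| over the k-gram sets of two terms."""
    ga, gb = kgrams(a, k), kgrams(b, k)
    return len(ga & gb) / len(ga | gb)


index = build_kgram_index(["kinkiten", "kitchen", "kitten"])
print(sorted(index["kit"]))                      # all three terms contain 'kit'
print(sorted(wildcard_lookup(index, "kit*en")))  # kinkiten is filtered out
print(jaccard("kitten", "kitchen"))
```

Without the final filter, kit*en would return kinkiten as well, since that term contains $ki, kit and en$; the k-gram AND is a necessary condition for a match, not a sufficient one.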
Even more precise
Then use Levenshtein distance to select, more precisely, the terms with the least edit distance to the query term.
demo: http://www.miislita.com/searchito/levenshtein-editdistance.html 26-01-12

Levenshtein distance
[Figure: edit-distance table; cell m(i, j) is computed from its neighbours m(i, j-1), m(i-1, j-1) and m(i-1, j).]
Minimal edit distance

Phonetic similarity
To calculate which (English) written words are most similar in pronunciation, the SOUNDEX algorithm gives a (rather rough) measure.
Demo: http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#SoundExConverter
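Returning to the Levenshtein slide: the table m can be filled in row by row, each cell taking the cheapest of its three neighbours m(i, j-1), m(i-1, j) and m(i-1, j-1). The sketch below assumes unit cost for insertion, deletion and substitution, which is the usual textbook setting but is not stated on the slides; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(s, t):
    """Minimal edit distance between s and t, assuming unit costs (assumption)."""
    rows, cols = len(s) + 1, len(t) + 1
    # m[i][j] = distance between the first i characters of s and the first j of t
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i  # delete all i characters of s
    for j in range(cols):
        m[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i][j - 1] + 1,                # insertion
                m[i - 1][j] + 1,                # deletion
                m[i - 1][j - 1] + substitution, # match or substitution
            )
    return m[-1][-1]


print(levenshtein("kitten", "kitchen"))
```

In a spelling corrector this would typically be run only on the candidates that the k-gram and Jaccard step has already shortlisted, since filling the table costs time proportional to len(s) * len(t) per pair.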