Spelling Correction for Search Engine Queries
Bruno Martins and Mario J. Silva
Proceedings of EsTAL-04, España for Natural Language Processing (2004)

Swapnil Chhajer
schhajer@usc.edu
http://schhajer.co.nr

Topics Covered in Class
• Peter Norvig’s Spelling Corrector: Query Processing [33-35]
• Levenshtein Algorithm: Query Processing [36-41]
• Evaluation Metrics: Precision & Recall: Introduction to Information Retrieval [16]
• Soundex Algorithm: Query Processing [18]

April 16, 2013 Spelling Correction for Search Engine Queries 3

Motivation & Abstract
• Misspelled queries retrieve pages that contain the same misspellings, so the most appropriate pages are left out.
• 10-12% of queries are misspelled.
• Goal: provide the user with the best possible match, instead of making the user choose one correction from a list of possible corrections.

Google: Spelling Correction

Spelling Correction
• Uses
  • Correcting documents being indexed
  • Retrieving matching documents when the query contains a spelling error
• Flavors
  • Isolated words
    • Checks each word on its own
    • Unable to catch typos that are themselves correctly spelled words, e.g., “from” vs. “form”
  • Context-sensitive
    • Looks at surrounding words, e.g., “I flew form Heathrow to Narita.”
• “a paragraph cud half mini flaws but wood bee past by the isolated spill checker”

General issues in Spelling Correction
• UI
  • “Did you mean” works for one suggestion.
  • What about multiple possible corrections?
• Computational Cost
  • Spelling correction is potentially expensive
  • Avoid running it on every query
  • Maybe run it only on queries that match few documents
  • Guess: spelling correction in major search engines is efficient enough to be run on every query

Kinds of Spelling Mistakes: Typos
• Wrong characters by mistake
• Categorized mainly into 4 categories:
  • Insertions (extra letter)
    • “prejudice” as “prejudsice”
  • Deletions (missing letter)
    • “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
  • Substitutions (wrong letter)
    • “habeas” as “haceas”, “appellate” as “appellare”
  • Transpositions
    • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as “subpeona”, “plaintiff” as “plaitniff”
• 80-95% of misspellings differ from the correct spelling in just one of these four ways.
• Keyboard layout is important in such cases.

Kinds of Spelling Mistakes: Brainos
• Wrong characters on purpose
• Most common type of mistake in general web queries
• Mistakes derive from pronunciation, spelling, or semantic confusion
• Brainos: Soundalike (Phonetic Errors)
  • “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”, “withholding” as “witholding”, “foreclosure” as “forclosure”
• Brainos: Confusions
  • “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys” or “jureys”, “dramshop” as “dram shop”

Dictionary Storage: Ternary Search Trees (TST)
• Data structure: Ternary Search Tree (TST)
• A type of trie, limited to 3 children per node.
• A trie is a tree for storing strings, in which there is one node for every common prefix and the strings are stored in extra leaf nodes.
• Searching: O(log(n)+k)
  • n: number of strings in the tree
  • k: length of the string being searched for

TST Continued…
Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an associated frequency of 1

Spelling Correction Algorithm
• Can be implemented using edit distance, rule-based techniques, n-grams, probabilistic techniques, neural nets, similarity key techniques, or combinations of these.
• Goal: compute a distance between the misspelled word and dictionary words, using different strategies.
• A shorter distance implies a better correction.
• Soundex System:
  • Indexing based on sound.
  • Devised to help with the problem of phonetic errors.
• Metaphone Systems:
  • Specific to the English language
  • Transform words into codes based on phonetic properties
  • Based on consonants & diphthongs
• Spelling correction for the web
  • Context-dependent correction is largely wasted effort, since users rarely type more than three terms in a query

Spelling Correction Algorithm Continued…
• The user-entered query is tokenized, ignoring non-word characters.
• All words are converted to lower case and each word is checked for correct spelling.
• Frequencies are updated for correctly spelled words. This acts as feedback to the system.
• The feedback can help the spell checker predict patterns in users’ searches.
• Misspelled words are replaced by correctly spelled words.
• Finally, a new query is presented to the user as a suggestion, together with the results page for the original query.
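The query-processing steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `DICTIONARY` contents, the function names, and the ASCII-only `ALPHABET` are invented for the example, a plain `Counter` stands in for the ternary search tree, and candidate generation is reduced to Norvig-style single edits rather than the multi-step procedure the paper describes.

```python
import re
from collections import Counter

# Hypothetical toy dictionary mapping known words to usage frequencies.
# The paper keeps its dictionary in a ternary search tree; a Counter is
# used here only to keep the sketch short.
DICTIONARY = Counter({"spelling": 5, "correction": 4, "search": 9, "engine": 7})

# Assumption: ASCII letters only. A Portuguese deployment would also need
# accented characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def one_edit_candidates(word):
    """Strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def best_correction(word):
    """Return the most frequent known candidate, or the word itself if none."""
    known = [c for c in one_edit_candidates(word) if c in DICTIONARY]
    if not known:
        return word
    # Ties on frequency are broken alphabetically to keep results stable.
    return max(known, key=lambda c: (DICTIONARY[c], c))


def suggest_query(query):
    """Return a corrected query string, or None if nothing was changed."""
    # Tokenize, ignoring non-word characters, then lower-case each token.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", query)]
    corrected = []
    for token in tokens:
        if token in DICTIONARY:
            DICTIONARY[token] += 1  # feedback: reinforce correctly spelled words
            corrected.append(token)
        else:
            corrected.append(best_correction(token))
    return " ".join(corrected) if corrected != tokens else None


print(suggest_query("Speling corection!"))  # prints "spelling correction"
```

In a search front end, a non-`None` return value would be shown as the suggestion alongside the results page for the original query; `None` means every token was already in the dictionary or had no known candidate.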
Spelling Correction Algorithm Continued…
• The algorithm is divided into 2 phases:
  • Phase 1: Generate a set of candidate suggestions
  • Phase 2: Select the best choice among those candidates
• Phase 1
  • 9 steps; at each step, look up the dictionary for words that relate to the original misspelling, i.e. words that:
    • Differ in one character from the original word.
    • Differ in two characters from the original word.
    • Differ in one letter removed or added.
    • Differ in one letter removed or added, plus one letter different.
    • Differ in repeated characters removed.
    • Correspond to 2 concatenated words (space between words eliminated).
    • Differ in having two consecutive letters exchanged and 1 character different.
    • Have the original word as a prefix.
    • Differ in repeated characters removed and 1 character different.

Spelling Correction Algorithm Continued…
• Phase 2: Heuristics used
  • Return the candidate if it differs only in accented characters.
  • Return the candidate if it differs in only one character, with the error corresponding to an adjacent letter in the same row of the keyboard.
  • If there are candidates with the same Metaphone key as the original string, return the smallest one.
  • Return the candidate if it differs in only one character, with the error corresponding to an adjacent letter in an adjacent row of the keyboard.
  • As a last resort, return the last word.
• Heuristics are applied in order; the next one is tried only if the current one finds no matching words.
• If more than one word matches, return the one whose first character matches.
• If there is still more than one, choose the word with the highest frequency.

Results Comparison
• Aspell Spell Checker
  • http://aspell.sourceforge.net/
  • Aspell uses the Metaphone algorithm with a near-miss strategy
  • 48.33% of the correct forms were guessed correctly.
• Outperformed Aspell by 1.66%
• Legend: * = does not detect the misspelling; - = failed to return a suggestion.

Results Comparison Continued…
• Tumba!: Search engine for the Portuguese web
Table: Results from spelling checker with Tumba!

Conclusion & Future Work
• The spelling checker uses a ternary search tree data structure for storing the dictionary.
• Two popular Portuguese newspapers served as the data source.
• Queries in a search engine may contain company or person names. In such cases, keeping two dictionaries, one in the TST used for correction and another in a hash table used only for checking valid words, could yield good results.

Pros & Cons
• Pros
  • Considered various factors affecting edit distance, including probabilistic estimations.
  • Used a feedback system to improve the quality of results for user queries.
• Cons
  • Did not consider context-sensitive spell checking.
  • The system is not language independent; it focuses mainly on Portuguese words.
  • No discussion of spell-corrected completion suggestions as a query is incrementally entered.

References
• Contemporary Spelling Correction - Decoding the Noisy Channel, Bob Carpenter
• Using the Web for Language Independent Spellchecking and Autocorrection, Whitelaw, Hutchinson, Chung and Ellis
• How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach, Choudhury, Thomas, Mukherjee, Basu and Ganguly