Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users
Silviu Cucerzan & Eric Brill, Microsoft Research
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004)
Presented by Mohammad Shaarif Zia (mohammsz@usc.edu), University of Southern California

Topics related to our course
• Lexicon
• Edit/Levenshtein distance
• Tokenization into unigrams and bigrams
• Spell checking

Spelling Correction
• As per the class notes, spelling correction is the process of flagging a word that may not be spelled correctly and, in some cases, offering alternative spellings of the word.
• How does it work? Typical word-processing spell checkers map each unknown word to a small set of in-lexicon alternatives as possible corrections, relying on information such as keyboard mistakes and phonetic/cognitive mistakes. Others detect "word substitution errors", i.e. the use of in-lexicon words in an inappropriate context (principal instead of principle).
• The task of web-query spelling correction has many similarities to traditional spelling correction, but it also poses additional challenges.

Web Query Spelling Correction
• How are web search queries different?
o They are very short: one concept or an enumeration of concepts.
o A static trusted lexicon cannot be used, as many new names and concepts become popular every day (such as blog, shrek).
o Employing very large lexicons can result in word substitution errors, which are very difficult to detect.
• This is where search query logs come into the picture:
o They keep a record of the queries entered by the millions of people who use web search engines.
o The validity of a word can be inferred from its frequency in what people are querying for.

Search Query Logs
• According to Douglas Merrill, former CTO of Google:
o You type a misspelled word into Google.
o You don't find what you wanted (you don't click on any results).
o You realize you misspelled the word, so you rewrite it in the search box.
o You find what you want.
• This pattern, multiplied millions of times, is what lets them offer spelling correction with the help of statistical machine learning.
• But it would be erroneous to simply extract from the logs the queries whose frequency is above a certain threshold and consider them valid.
• Say all users started spelling 'Sychology' instead of 'Psychology'; the search engine would be trained to accept the former.

Problem Statement & Prior Work
• The main aim of this paper is to "try to utilize query logs to learn what queries are valid, and to build a model for valid query probabilities, despite the fact that a large percentage of the logged queries are misspelled and there is no trivial way to determine the valid from invalid queries".
• Prior work: for any out-of-lexicon word in a text, find the closest word form(s) in the available lexicon and hypothesize them as the correct spelling alternatives.
o How to find the closest word? Edit distance (a toy weighted version is sketched after this slide).
o Flaw: does not take into account the frequency of words.

Prior Work
• Compute the probability of words in the target language. All in-lexicon words within some "reasonable" distance of the unknown word are considered good candidates, and the correction is chosen based on its probability.
o Flaw: uses the probability of words and not the actual distance.
• Use a probabilistic edit distance (Bayesian inversion); a noisy-channel sketch also follows this slide.
o Flaw: unknown words are corrected in isolation, and context is important: power crd -> power cord, but video crd -> video card; 'crd' should be corrected according to its context.
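To make the edit-distance machinery concrete, here is a minimal Python sketch of a weighted Levenshtein distance extended with immediate transposition, one of the point changes the paper's edit function allows. The uniform costs are illustrative placeholders for the paper's context-dependent, log-trained weights, and long-distance letter movement is omitted.

```python
def weighted_edit_distance(s, t, ins=1.0, dele=1.0, sub=1.0, swap=1.0):
    """Levenshtein distance extended with immediate transposition.

    Uniform costs are illustrative; the paper uses context-dependent
    weights refined from query-log statistics and also allows
    long-distance movement of letters, which this sketch omits.
    """
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete s[i-1]
                          d[i][j - 1] + ins,        # insert t[j-1]
                          d[i - 1][j - 1] + cost)   # substitute or match
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)  # transposition
    return d[m][n]

print(weighted_edit_distance("scwartegger", "schwartnegger"))  # 2.0 (two insertions)
```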
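The "Bayesian inversion" line of prior work can likewise be sketched as a noisy-channel ranker that picks the candidate maximizing P(candidate) * P(typed | candidate). The exp(-distance) channel model and the lexicon_counts frequency table are assumptions made for illustration, not the trained model; dist reuses the edit-distance sketch above.

```python
import math

def noisy_channel_pick(typed, candidates, lexicon_counts, dist):
    """Choose argmax_c P(c) * P(typed | c).

    lexicon_counts is an assumed {word: count} frequency table giving
    the prior P(c); exp(-edit_distance) is a crude stand-in for a
    trained probabilistic edit distance (the Bayesian-inversion step).
    """
    total = float(sum(lexicon_counts.values()))
    def score(c):
        prior = lexicon_counts.get(c, 0) / total   # language model P(c)
        channel = math.exp(-dist(typed, c))        # toy error model P(typed | c)
        return prior * channel
    return max(candidates, key=score)

counts = {"card": 900, "cord": 500, "curd": 40}    # made-up frequencies
print(noisy_channel_pick("crd", ["card", "cord", "curd"],
                         counts, weighted_edit_distance))  # card
```

Note that without context 'crd' always maps to the globally most likely candidate here (card), even inside "power crd", which is precisely the isolation flaw the slide points out.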
• Ignore spaces and delimiters.
o Flaw: fails to provide suggestions for valid words when an alternative is more meaningful as a search query than the original (sap opera -> soap opera).

Prior Work
• Include concatenation and splitting:
o power point slides -> power-point slides
o chat inspanich -> chat in spanish
• Flaw: does not take care of out-of-lexicon words that are valid in certain contexts: amd processors -> amd processors (no change should be made).
• Flaw: in-lexicon words sometimes need to be changed into out-of-lexicon words: limp biz kit -> limp bizkit.
• Thus, the actual language in which the web queries are expressed becomes less important than the query-log data.

Exploiting a Large Web Query Model
• What is the iterative correction approach? (A toy iteration loop is sketched after these slides.)
o Misspelled query: anol scwartegger
o First iteration: arnold schwartnegger
o Second iteration: arnold schwarznegger
o Third iteration: arnold schwarzenegger
o Fourth iteration: no further correction
• It makes use of a modified context-dependent weighted edit function which allows insertion, deletion, substitution, immediate transposition, and long-distance movement of letters as point changes, with weights interactively refined using statistics from query logs.
• It uses a threshold factor on the edit distance (4 in the above case).

Exploiting a Large Web Query Model
• Considering whole queries as strings does not work: britnet spear inconcert could not be corrected if the correction does not appear in the employed query log.
• Solution: decompose the query into words and word bigrams.
• Tokenization uses space and punctuation delimiters, plus the information about multi-word compounds provided by a trusted English lexicon.

Query Correction
• Input query -> tokenization -> tokens.
• Weighted edit distance produces a set of alternatives for each token; matches are extracted from the query log and the lexicon, with two different thresholds for in-lexicon and out-of-lexicon tokens.
• A Viterbi search over the set of all possible alternatives yields the best possible alternative string to the input query.

Modified Viterbi Search
• Proposed by Andrew Viterbi in 1967 as a decoding algorithm for convolutional codes over digital communication links.
• It is now also applied in computational linguistics, NLP, and speech recognition.
• It uses dynamic programming.
• It finds the most likely sequence of hidden states that results in a sequence of observed events.
• Transition probabilities are computed using bigram and unigram query-log statistics.
• Emission probabilities are replaced with the inverse distance between the two words. (A toy trellis pass is sketched after these slides.)

Pros
• No two adjacent in-vocabulary words are allowed to change simultaneously.
o There is no need to search all possible paths in the trellis, which makes the search faster.
o This avoids changes such as log wood -> dog food.
• Unigrams and bigrams are stored in the same data structure, on which the search for correction alternatives is done.
• The search first ignores stop words and their misspelled alternatives; once the best path is chosen, the best alternatives for the skipped stop words are computed in a second Viterbi search.
• This makes it efficient and effective, as each iteration reduces the search space.

Cons
• Short queries can be iteratively transformed into other, unrelated queries. To avoid this, the authors imposed additional restrictions on changing such queries.
• Since the system is based on query logs, it can give false results if a misspelling becomes more frequent than the original spelling: say all users started spelling 'Sychology' instead of 'Psychology'; the search engine would be trained to accept the former.
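The iterative strategy is essentially a fixed-point loop around a one-pass corrector. A minimal sketch, assuming correct_once is some single-pass corrector (such as the Viterbi sketch below); the iteration cap and cycle guard are illustrative safeguards against the drift problem noted under Cons.

```python
def iterative_correct(query, correct_once, max_iters=10):
    """Re-run a one-pass corrector on its own output until it stops
    changing, so 'anol scwartegger' can reach 'arnold schwarzenegger'
    through intermediate spellings that do occur in the query log.

    correct_once is an assumed single-pass corrector; the iteration
    cap and the cycle guard are illustrative safeguards, not the
    paper's exact restrictions.
    """
    seen = {query}
    for _ in range(max_iters):
        nxt = correct_once(query)
        if nxt == query or nxt in seen:   # fixed point or cycle: stop
            break
        seen.add(nxt)
        query = nxt
    return query
```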
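And a compact sketch of the modified Viterbi pass over the lattice of per-token alternatives: transition scores come from query-log bigram statistics and the emission probability is replaced by an inverse-distance score, as the slides describe. Here alternatives, unigram_p, bigram_p, and dist are hypothetical helpers, and the adjacency restriction, stop-word skipping, and the two thresholds are omitted for brevity.

```python
def viterbi_correct(tokens, alternatives, unigram_p, bigram_p, dist):
    """One Viterbi pass over the per-token correction alternatives.

    Assumed helpers: alternatives(tok) -> candidate spellings drawn
    from the query log and lexicon; unigram_p/bigram_p -> query-log
    statistics; dist -> the weighted edit distance. The emission
    probability is replaced by an inverse-distance score, per the
    slides.
    """
    first = tokens[0]
    best = {c: unigram_p(c) / (1.0 + dist(first, c)) for c in alternatives(first)}
    back = [{c: None for c in best}]                  # back-pointers per position
    for tok in tokens[1:]:
        scores, ptrs = {}, {}
        for c in alternatives(tok):
            emit = 1.0 / (1.0 + dist(tok, c))         # inverse-distance "emission"
            prev = max(best, key=lambda p: best[p] * bigram_p(p, c))
            scores[c] = best[prev] * bigram_p(prev, c) * emit
            ptrs[c] = prev
        best = scores
        back.append(ptrs)
    state = max(best, key=best.get)                   # best final state
    path = [state]
    for ptrs in reversed(back[1:]):                   # follow back-pointers
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))
```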
Evaluation
• Providing good suggestions for misspelled queries is more important than providing alternative queries for valid queries.
• The more iterations, the higher the accuracy.
• Many suggestions can be considered valid even though they disagree with the default spelling: gogle -> google instead of goggle.
• Accuracy with query-log data was higher than with only a trusted lexicon and no query-log data, and it was highest when both were used.