Spelling Correction as an Iterative Process That Exploits the Collective Knowledge of Web Users
Silviu Cucerzan and Eric Brill, July 2004
Speaker: Mengzhe Li

Spell Checking of Search Engine Queries

Traditional word-processing spell checkers:
• Resolve typographical errors
• Compute a small set of in-lexicon alternatives relying on:
  - in-lexicon word frequencies
  - the most common keyboard mistakes
  - phonetic/cognitive mistakes
• Handle very few word substitution errors, i.e., valid words used in inappropriate contexts due to typographical or cognitive mistakes

Web queries:
• Very short: fewer than three words on average
• Error frequency and severity significantly greater than in word processing
• Validity cannot be decided by a lexicon or by grammaticality
• Consist of one or more concepts
• Contain legitimate words not found in traditional lexicons

Spell Checking of Search Engine Queries (contd.)

Difficulties of applying a traditional spell checker to web queries:
• Defining a valid web query is difficult
• Maintaining a high-coverage lexicon is impossible
• Detecting word substitutions in a very large lexicon is difficult

Alternative method:
• Exploit the evolving expertise of web search engine users, as collected in search query logs
• Equate the validity of a word with its frequency in what people query for: "the meaning of a word is its use in the language" (Wittgenstein)
• Use query logs to learn validity:
  - build a model of valid-query probabilities
  - despite the fact that a large percentage of queries are misspelled
  - with no trivial way to tell valid queries from invalid ones

Traditional Lexicon-Based Spelling Correction Approaches

Iteratively redefine the problem to diminish the role of the trusted lexicon:
• For any out-of-lexicon word, find the closest word form in the available lexicon under an edit distance function and hypothesize it as the correct spelling alternative.
• Consider the frequency of words in the language: use the product of the likelihood of misspelling a word and the word's prior probability, yielding a probabilistic edit distance.
• Include a threshold so that all words within the distance threshold are good candidates, then choose among them by prior probability rather than by raw distance.
• Condition the probability of the correction on context.

Contd.

• Misspelled words should be corrected depending on context; tokenize the text so that context can be taken into account. Ex: power crd → power cord; video crd → video card
• Consider word substitution errors and generalize the problem. Ex: golf war → gulf war; sap opera → soap opera
• Consider concatenation and splitting. Ex: power point slides → powerpoint slides; chat inspanish → chat in spanish
• In web query correction, out-of-lexicon words can be valid, and in-lexicon words may need to be changed to out-of-lexicon words. Ex: gun dam planet → gundam planet; limp biz kit → limp bizkit
• There is no longer any explicit use of a lexicon: query data determines string probabilities, which substitute for a measure of how meaningful a string is as a web query.

Distance Function and Threshold

Distance function: a modified, context-dependent, weighted Damerau-Levenshtein edit distance (sketched in code below):
• Defined as the minimum number of point changes required to transform one string into another, where the allowed point changes are insertion, deletion, substitution, immediate transposition, and long-distance movement of letters
• Weights are refined using statistics from query logs

Importance of the distance function d and the threshold δ:
• Too restrictive: the right correction might not be reachable
• Too permissive: unlikely corrections might be suggested
• Desired: allow large-distance corrections for a diversity of situations
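To make this concrete, here is a minimal Python sketch of a weighted edit distance with immediate transpositions, plus a helper that applies the threshold δ to collect candidates. The uniform operation weights and the `candidates` helper are illustrative assumptions; the paper's actual function is context-dependent, trains its weights on query-log statistics, and additionally allows long-distance movement of letters.

```python
def weighted_edit_distance(s, t, w_ins=1.0, w_del=1.0, w_sub=1.0, w_swap=1.0):
    """Damerau-Levenshtein-style distance with per-operation weights.

    Simplified sketch: uniform constant weights, and only immediate
    transpositions are modeled (no long-distance letter movement).
    """
    m, n = len(s), len(t)
    # d[i][j] = cost of transforming s[:i] into t[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,      # delete s[i-1]
                          d[i][j - 1] + w_ins,      # insert t[j-1]
                          d[i - 1][j - 1] + cost)   # substitute / match
            # immediate transposition, e.g. "teh" -> "the"
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + w_swap)
    return d[m][n]


def candidates(word, vocabulary, threshold=2.0):
    """All vocabulary entries within distance `threshold` of `word`."""
    return [v for v in vocabulary
            if weighted_edit_distance(word, v) <= threshold]
```

For example, `weighted_edit_distance("power crd", "power cord")` evaluates to 1.0 (one insertion), so "power cord" survives a threshold of δ = 2.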
Exploiting Large Web Query Logs

Any string that appears in the query log used for training can be considered a valid correction and can be suggested as an alternative to the current web query, based on the relative frequencies of the query and the alternative spelling.

Three essential properties of query logs:
• Words in the query logs are misspelled in various ways, from relatively easy-to-correct misspellings to very-difficult-to-correct ones that make the user's intent almost impossible to recognize
• The less malign (difficult to correct) a misspelling is, the more frequent it is
• Correct spellings tend to be more frequent than misspellings

Example of the iterative process: anol scwartegger → arnold schwarzenegger
• Misspelled query: anol scwartegger
• First iteration: arnold schwartnegger
• Second iteration: arnold schwarznegger
• Third iteration: arnold schwarzenegger
• Fourth iteration: no further correction

Shortcomings of treating the query as one full string to be corrected:
• Depends on agreement between the relative frequencies and the character error model
• Requires identifying all queries in the query log that are misspellings of other queries
• Requires finding a correction sequence of logged queries for any new query
• Covers only exact matches of the queries that appear in the logs
• Provides low coverage of infrequent queries

Example of Query Correction Using Substrings

Example: britenetspear inconcert → britney spears in concert
• S0: britenetspear inconcert (l0 = 2 tokens)
• S1: britneyspears in concert (l1 = 3 tokens)
• S2: britney spears in concert (l2 = 4 tokens)
• S3: britney spears in concert (no further change)

The tokenization process uses space and punctuation delimiters, in addition to the information about multi-word compounds provided by a trusted lexicon, to tokenize the input query; word unigram and bigram statistics extracted from query logs serve as the system's language model.

Query Correction Procedure

1. The input query is tokenized using space and word-delimiter information in addition to the available lexical information.
2. A set of alternatives is computed for each token using the weighted Levenshtein distance function described above, with two separate thresholds for in-lexicon and out-of-lexicon tokens.
3. Matches are searched for in the space of word unigrams and bigrams extracted from the query logs and the trusted lexicon.
4. A modified Viterbi search is employed to find the best possible alternative string for the input query (a code sketch follows below).
• Constraint: no two adjacent words may change simultaneously.
• Restriction: in-lexicon words are not allowed to change in the first iteration.
• Fringe: the set of search paths kept under consideration.
• Assumption based on the constraint: the list of alternatives for each word may be randomly ordered, but an input word that is in the trusted lexicon occupies the first position of its list.

Modified Viterbi Search Method

When word-bigram statistics are used, stop words may interfere negatively with the best-path search. To avoid this, a special strategy is applied:
1. Ignore the stop words in the main search, as in Figure 1.
2. Compute the best alternatives for the skipped stop words in a second Viterbi search, as in Figure 2.

Figure 1. Example trellis of the modified Viterbi search.
Figure 2. Stop-word treatment.
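As referenced in step 4 of the procedure, below is a minimal Python sketch of one pass of the modified Viterbi search. The `bigram_prob` callable is a hypothetical stand-in for the query-log language model (assumed smoothed, so it never returns zero); the stop-word strategy and the two-threshold candidate generation are omitted for brevity.

```python
import math

def viterbi_correct(tokens, alternatives, bigram_prob):
    """One pass of the modified Viterbi search (simplified sketch).

    tokens       -- the tokenized input query
    alternatives -- dict mapping each token to its candidate list; every
                    token is assumed to appear in its own list (in first
                    position when it is in the trusted lexicon), so an
                    unchanged path always exists
    bigram_prob  -- hypothetical callable returning a smoothed, nonzero
                    P(w2 | w1) from the query-log language model

    Enforces the constraint that no two adjacent words change
    simultaneously; the stop-word second pass is not modeled here.
    """
    # State = (candidate word, whether it differs from the input token);
    # each state stores the best log-probability and its path.
    # Initial scores are left at 0.0 for simplicity; a fuller version
    # would add log unigram probabilities for the first word.
    states = {}
    for cand in alternatives[tokens[0]]:
        states[(cand, cand != tokens[0])] = (0.0, [cand])

    for tok in tokens[1:]:
        next_states = {}
        for cand in alternatives[tok]:
            changed = cand != tok
            for (prev, prev_changed), (score, path) in states.items():
                if changed and prev_changed:
                    continue  # adjacent words must not both change
                s = score + math.log(bigram_prob(prev, cand))
                key = (cand, changed)
                if key not in next_states or s > next_states[key][0]:
                    next_states[key] = (s, path + [cand])
        states = next_states

    _, best_path = max(states.values(), key=lambda kv: kv[0])
    return " ".join(best_path)
```

For instance, with `tokens = ["power", "crd"]` and `alternatives = {"power": ["power"], "crd": ["crd", "cord", "card"]}`, a bigram model that assigns high probability to P(cord | power) yields "power cord". Keeping only the best path per (word, changed) state is sufficient because the adjacency constraint depends only on whether the previous word changed.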
Conclusion

• Success in using the collective knowledge stored in search query logs for the spelling correction task.
• An effective and efficient search method with good space complexity.
• Appropriate suggestions proposed by iterative spell checking with restrictions and a modified edit distance function (the overall loop is sketched below).
• A technique that exploits an extremely informative but noisy resource, the errors people make, as a way to perform effective query spelling correction.
• Larger and more realistic evaluation data are still needed to make the results more convincing.
• Future work: adapting the technique to general-purpose spelling correction by using statistics from both query logs and large collections of office documents.
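As a closing illustration, the iterative process amounts to a small fixed-point loop. This is a sketch under the assumption that `correct_once` is any single-pass corrector (for example, tokenization followed by the Viterbi sketch above), not the authors' actual code:

```python
def iterative_correct(query, correct_once, max_iterations=10):
    """Apply a one-pass corrector repeatedly until a fixed point.

    Mirrors the paper's iterative example: "anol scwartegger" moves
    through several small, frequent corrections until
    "arnold schwarzenegger" no longer changes.
    """
    for _ in range(max_iterations):
        corrected = correct_once(query)
        if corrected == query:
            break  # no further correction
        query = corrected
    return query
```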