
Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users
Silviu Cucerzan & Eric Brill
Microsoft Research
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Mohammad Shaarif Zia
mohammsz@usc.edu
University of Southern California
Topics related to our course
 Lexicon
 Edit/Levenshtein Distance
 Tokenization into unigrams and bigrams
 Spell Checking
Spelling Correction
• As per the class notes, spelling correction is the process of flagging a word that may not be spelled correctly and, in some cases, offering alternative spellings of the word.
• How does it work?
 Typical word-processing spell checkers compute, for each unknown word, a small set of in-lexicon alternatives as possible corrections, relying on models of keyboard mistakes and phonetic/cognitive mistakes.
 Others detect "word substitution errors", i.e., the use of in-lexicon words in an inappropriate context (principal instead of principle).
The task of web query spelling correction has many similarities to traditional spelling correction, but it also poses additional challenges.
Web Query Spelling Correction
• How are web search queries different?
 Very short: they consist of one concept or an enumeration of concepts.
 A static trusted lexicon cannot be used, as many new names and concepts become popular every day (such as blog, shrek).
 Employing very large lexicons can result in word substitution errors, which are very difficult to detect.
 This is where search query logs come into the picture.
o They keep a record of the queries entered by the millions of people who use web search engines.
o The validity of a word can be inferred from how frequently people query for it, as in the sketch below.
Search Query Logs
• According to Douglas Merrill, former CTO of Google:
 You type a misspelled word into Google.
 You don't find what you wanted (you don't click on any results).
 You realize you misspelled the word, so you retype it in the search box.
 You find what you want.
This pattern, multiplied millions of times, is what lets them offer spelling correction with the help of statistical machine learning.
• But it would be erroneous to simply extract from the logs all queries whose frequency is above a certain threshold and consider them valid.
• If all users started spelling 'Sychology' instead of 'Psychology', the search engine would be trained to accept the former.
Problem Statement & Prior Work
• The stated aim of this paper is to "try to utilize query logs to learn what queries are valid, and to build a model for valid query probabilities, despite the fact that a large percentage of the logged queries are misspelled and there is no trivial way to determine the valid from invalid queries".
• Prior Work
 For any out-of-lexicon word in a text, find the closest word form(s) in the available lexicon and hypothesize them as the correct spelling alternatives.
 How to find the closest word? Edit distance (a reference implementation is sketched below).
 Flaw: does not take into account the frequency of words.
Prior Work
 Compute the probability of words in the target language. All in-lexicon words within some "reasonable" distance of the unknown word are considered good candidates, and the correction is chosen based on its probability.
 Flaw: uses the probability of words but not the actual distance.
 Use a probabilistic edit distance (Bayesian inversion); a toy sketch follows this list.
 Flaw: unknown words are corrected in isolation, but context is important:
power crd -> power cord ; video crd -> video card
'crd' should be corrected according to its context.
 Ignore spaces and delimiters.
 Flaw: fails to suggest corrections for valid words when an alternative is more meaningful as a search query than the original query:
sap opera -> soap opera
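The Bayesian-inversion idea ranks candidate corrections w for an observed string x by P(w) * P(x | w). A toy sketch, reusing the levenshtein function above and crudely approximating the error model as exp(-alpha * edit_distance); the priors and the alpha weight are made up for illustration:

    import math

    def noisy_channel_score(observed: str, candidate: str, prior: float,
                            alpha: float = 2.0) -> float:
        """log P(candidate) + log P(observed | candidate), with the
        error model approximated as exp(-alpha * edit_distance)."""
        return math.log(prior) - alpha * levenshtein(observed, candidate)

    # Toy priors, e.g. relative word frequencies from a corpus or query log.
    priors = {"cord": 0.3, "card": 0.5, "crd": 0.0001}
    best = max(priors, key=lambda w: noisy_channel_score("crd", w, priors[w]))
    # best == "card": corrected in isolation, "crd" always maps to the same
    # word, whether the query was "power crd" or "video crd".

The final comment is exactly the context flaw noted in the list above.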
Prior Work
 Include concatenation and splitting:
power point slides -> power-point slides
chat inspanich -> chat in spanish
 Flaw: does not handle out-of-lexicon words that are valid in certain contexts:
amd processors -> amd processors (no change needed)
 Sometimes in-lexicon words must even be changed to out-of-lexicon words:
limp biz kit -> limp bizkit
Thus, the actual language in which web queries are expressed becomes less important than the query log data.
Exploiting Large Web Query Model
• What is the iterative correction approach?
Misspelled query: anol scwartegger
First iteration: arnold schwartnegger
Second iteration: arnold schwarznegger
Third iteration: arnold schwarzenegger
Fourth iteration: no further correction
 Makes use of a modified context-dependent weighted edit function that allows insertion, deletion, substitution, immediate transposition, and long-distance movement of letters as point changes; the weights were interactively refined using statistics from query logs.
 Uses a threshold factor on the edit distance (4 in the case above); a sketch of the iteration loop follows.
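A sketch of the iteration itself, assuming some single-pass corrector correct_once (for instance, the tokenize / match / Viterbi pipeline of the next slides); the cap on iterations is a safeguard added here, not a value from the paper:

    def correct_iteratively(query: str, correct_once, max_iters: int = 10) -> str:
        """Re-run the one-pass corrector on its own output until the
        result stops changing (a fixed point) or a safety cap is hit."""
        for _ in range(max_iters):
            corrected = correct_once(query)
            if corrected == query:  # no further correction
                break
            query = corrected
        return query

    # anol scwartegger -> arnold schwartnegger -> arnold schwarznegger
    #                  -> arnold schwarzenegger -> (no further correction)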
Exploiting Large Web Query Model
 Considering whole queries as single strings does not work: britnet spear inconcert could not be corrected if the corrected query does not appear in the employed query log.
 Solution: decompose the query into words and word bigrams.
 Tokenization (sketched after this list) uses:
 space and punctuation delimiters
 information about multi-word compounds provided by a trusted English lexicon
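A simplified tokenizer along these lines, splitting on space and punctuation and emitting unigrams plus word bigrams; the lookup of multi-word compounds in a trusted lexicon is omitted here:

    import re

    def tokenize(query: str) -> list[str]:
        """Split a query into word tokens on space and punctuation."""
        return re.findall(r"[a-z0-9]+", query.lower())

    def unigrams_and_bigrams(tokens: list[str]) -> tuple[list[str], list[str]]:
        """Unigrams plus adjacent word pairs, both used as match units."""
        bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
        return tokens, bigrams

    # tokenize("britnet spear inconcert") -> ["britnet", "spear", "inconcert"]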
Query Correction
The correction pipeline (flow diagram, reconstructed):
Input query -> tokenization -> tokens -> weighted edit distance -> set of alternatives for each token -> Viterbi search on the set of all possible alternatives -> best possible alternative string to the input query
 Matches are extracted from the query log and the lexicon.
 Two different thresholds are set for in-lexicon and out-of-lexicon tokens.
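A sketch of the candidate-generation step, reusing the levenshtein function sketched earlier; the two threshold values and the linear scan over the vocabulary are illustrative (a real system would use specialized string indexes):

    def alternatives(token: str, vocabulary: set[str], lexicon: set[str],
                     t_in: int = 1, t_out: int = 2) -> set[str]:
        """Collect query-log / lexicon entries within an edit-distance
        threshold; in-lexicon tokens get the stricter threshold."""
        threshold = t_in if token in lexicon else t_out
        cands = {w for w in vocabulary if levenshtein(token, w) <= threshold}
        cands.add(token)  # keeping the token unchanged is always an option
        return cands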
Modified Viterbi Search
• Proposed by Andrew Viterbi in 1967 as a decoding algorithm for convolutional codes over digital communication links.
• It is now widely applied in computational linguistics, NLP, and speech recognition.
• It uses dynamic programming.
• It finds the most likely sequence of hidden states that results in a
sequence of observed events.
• Transition probabilities are computed using bigram and unigram query log statistics.
• Emission probabilities are replaced with the inverse of the edit distance between the two words, as in the sketch below.
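A compact sketch of such a modified Viterbi search, reusing the levenshtein function sketched earlier. Here alternatives is a one-argument callable (e.g. a functools.partial of the earlier sketch), bigram_logp and unigram_logp stand for query-log statistics, and the inverse-distance emission plus the omission of the stop-word second pass and the adjacency restriction are simplifications:

    import math

    def viterbi_correct(tokens, alternatives, bigram_logp, unigram_logp):
        """Choose one alternative per token, maximizing transition scores
        (query-log n-gram statistics) plus an 'emission' score based on
        the inverse edit distance to the observed token."""
        def emission(observed, alt):
            return math.log(1.0 / (1.0 + levenshtein(observed, alt)))

        # best[alt] = (score of the best path ending in alt, that path)
        best = {alt: (unigram_logp(alt) + emission(tokens[0], alt), [alt])
                for alt in alternatives(tokens[0])}
        for tok in tokens[1:]:
            new_best = {}
            for alt in alternatives(tok):
                prev, (score, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + bigram_logp(kv[0], alt))
                new_best[alt] = (score + bigram_logp(prev, alt)
                                 + emission(tok, alt), path + [alt])
            best = new_best
        return " ".join(max(best.values(), key=lambda sp: sp[0])[1])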
Pros
• No two adjacent in-vocabulary words are allowed to change simultaneously. There is no need to search all possible paths in the trellis, which makes the search faster and avoids cascades such as log wood -> dog food.
• Unigrams and bigrams are stored in the same data structure, on which the search for correction alternatives is performed.
• The search first ignores stop words and their misspelled alternatives; once the best path is chosen, the best alternatives for the skipped stop words are computed in a second Viterbi search.
• This makes the approach efficient and effective, as the search space is reduced in each iteration.
Cons
• Short queries can be iteratively transformed into other, unrelated queries. To avoid this, the authors imposed additional restrictions on changing such queries.
• Since the approach is based on query logs, it may give wrong results if a misspelling becomes more frequent than the original spelling. If all users started spelling 'Sychology' instead of 'Psychology', the search engine would be trained to accept the former.
Evaluation
• Providing good suggestions for misspelled queries is more important than providing alternative queries for valid queries.
• The more iterations, the higher the accuracy.
• Many suggestions can be considered valid even though they disagree with the default spelling:
gogle -> google instead of goggle
• Accuracy with query log data was higher than when using only a trusted lexicon and no query log data, and it was highest when both were used.