Spelling correction as an iterative process that
exploits the collective knowledge of web users
Silviu Cucerzan and Eric Brill
July, 2004
Speaker: Mengzhe Li
Spell Checking of Search Engine Queries
Traditional Word Processing Spell Checker:
• Resolve typographical errors
• Compute a small set of in-lexicon alternatives relying on:
- In-lexicon word frequencies
- The most common keyboard mistakes
- Phonetic/cognitive mistakes
• Word substitution errors (handled by very few): the use of legitimate words in inappropriate contexts, due to typographical/cognitive mistakes
Web Query
• Very short: fewer than three words on average
• Misspelling frequency and severity are significantly greater than in regular text
• Validity cannot be decided by lexicon membership or grammaticality
• Queries consist of one or more concepts
• Queries contain legitimate words not found in a traditional lexicon
Spell Checking of Search Engine Queries
Difficulties of applying a traditional spell checker to web queries:
• Defining a valid web query is difficult
• It is impossible to maintain a high-coverage lexicon
• Word substitutions are difficult to detect in a very large lexicon
Alternative Method:
• Evolving expertise of using web search engines – collected search query logs
• Validity of words – frequency in what people are querying for
• "The meaning of a word is its use in the language"
• Utilize query logs to learn validity:
- Build a model of valid-query probabilities (see the sketch below)
- Despite the fact that a large percentage of queries are misspelled
- There is no trivial way to separate valid queries from invalid ones
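As a rough illustration of this idea, here is a minimal sketch, assuming the query log is just a list of strings (all names and counts below are hypothetical), of how relative frequency in a log can stand in for validity:

```python
from collections import Counter

def build_query_model(query_log):
    # Relative frequency of each query string in the log serves as
    # a crude measure of its validity as a query.
    counts = Counter(q.strip().lower() for q in query_log)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}

# Toy log: the correct spelling dominates its misspellings.
log = ["britney spears"] * 50 + ["brittany spears"] * 5 + ["britny spears"] * 2
model = build_query_model(log)
print(model["britney spears"] > model["brittany spears"])  # True
```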
Traditional Lexicon-based Spelling Correction Approaches
Iteratively redefine the problem to diminish the role of the trusted lexicon:
1. For any out-of-lexicon word, find the closest word form in the available lexicon and hypothesize it as the correct spelling alternative, based on an edit distance function.
2. Consider the frequency of words in the language.
3. Use the product of the likelihood of misspelling a word and the prior probability of the word, yielding a probabilistic edit distance.
4. Include a threshold δ so that all words within that distance are good candidates; use the prior probability rather than the raw distance to choose among them (see the sketch below).
5. Condition the probability of the correction on context.
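A minimal sketch of the noisy-channel choice in steps 3–4, using difflib's string similarity as a crude stand-in for the character error model P(s|w); the priors, threshold, and candidate words are toy assumptions, not the paper's values:

```python
import difflib

def error_likelihood(s, w):
    # Crude stand-in for the character error model P(s | w):
    # more similar strings are treated as more likely misspellings.
    return difflib.SequenceMatcher(None, s, w).ratio()

def correct(s, priors, delta=0.7):
    # Noisy-channel choice: keep every word whose similarity clears
    # the threshold, then pick the argmax of P(w) * P(s | w).
    candidates = [(w, p * error_likelihood(s, w))
                  for w, p in priors.items()
                  if error_likelihood(s, w) >= delta]
    return max(candidates, key=lambda c: c[1], default=(s, 0.0))[0]

priors = {"power": 0.03, "cord": 0.01, "card": 0.02}
print(correct("crd", priors))  # "card": equal distance, higher prior
```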
Contd.
• A misspelled word should be corrected depending on its context. Tokenize the text so that context can be taken into account (see the sketch after this list), e.g.:
power crd -> power cord
video crd -> video card
• Consider word substitution errors and generalize the problem, e.g.:
golf war -> gulf war
sap opera -> soap opera
• Consider concatenation and splitting, e.g.:
power point slides -> powerpoint slides
chat inspanish -> chat in spanish
• Out-of-lexicon words can be valid in web-query correction, and in-lexicon words may need to be changed to out-of-lexicon words. Consider terms in web queries, e.g.:
gun dam planet -> gundam planet
limp biz kit -> limp bizkit
Consequences:
- No longer any explicit use of a lexicon
- Query data drives the string probability
- String probability substitutes for a measure of the meaningfulness of strings as web queries
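To make the context dependence concrete, a toy sketch that chooses between "cord" and "card" by the frequency of the bigram each forms with the preceding word (the counts are hypothetical):

```python
from collections import Counter

# Hypothetical bigram counts harvested from a query log.
bigrams = Counter({("power", "cord"): 120, ("power", "card"): 3,
                   ("video", "card"): 150, ("video", "cord"): 2})

def correct_in_context(prev_word, alternatives):
    # Pick the alternative forming the most frequent bigram
    # with the word to its left.
    return max(alternatives, key=lambda alt: bigrams[(prev_word, alt)])

print(correct_in_context("power", ["cord", "card"]))  # -> cord
print(correct_in_context("video", ["cord", "card"]))  # -> card
```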
Distance Function and Threshold
Distance Function:
Modified context-dependent weighted Damerau-Levenshtein edit function:
• Defined as the minimum number of point changes required to transform one string into another, where point changes include insertion, deletion, substitution, immediate transposition, and long-distance movement of letters (see the sketch below).
- Statistics from query logs are used to refine the weights.
Importance of the distance function d and threshold δ:
• Too restrictive – the right correction might not be reachable
• Too permissive – unlikely corrections might be suggested
• Desired – allow large-distance corrections in a diversity of situations
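For reference, a plain unweighted restricted Damerau-Levenshtein implementation; the paper's function additionally applies context-dependent weights learned from query logs and allows long-distance letter movement, both omitted in this sketch:

```python
def damerau_levenshtein(a, b):
    # Restricted (optimal string alignment) variant: insertions,
    # deletions, substitutions, and adjacent transpositions,
    # all with unit cost.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("golf", "gulf"))  # 1
print(damerau_levenshtein("sap", "soap"))   # 1
```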
Exploiting Large Web Query Logs
Any string that appears in the query log used for training can be considered a valid correction and can be suggested as an alternative to the current web query, based on the relative frequencies of the query and the alternative spelling.
Three essential properties of the query logs:
• Words in the query logs are misspelled in various ways, from relatively easy-to-correct misspellings to very-difficult-to-correct ones that make the user's intent almost impossible to recognize;
• The less malign (difficult to correct) a misspelling is, the more frequent it is;
• The correct spellings tend to be more frequent than misspellings.
Exploiting Large Web Query Logs
Example of the iterative correction:
anol scwartegger -> arnold schwarzenegger
Misspelled query: anol scwartegger
First iteration: arnold schwartnegger
Second iteration: arnold schwarznegger
Third iteration: arnold schwarzenegger
Fourth iteration: no further correction
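A minimal sketch of the iteration itself; the one-step corrector here simply replays the chain above as a lookup table, whereas the real system recomputes alternatives from the query log on every pass:

```python
def iterative_correct(query, correct_once, max_iters=10):
    # Re-run the one-step corrector until it proposes no change.
    for _ in range(max_iters):
        suggestion = correct_once(query)
        if suggestion == query:
            break
        query = suggestion
    return query

# Toy one-step corrector replaying the chain above as a lookup table.
chain = {"anol scwartegger": "arnold schwartnegger",
         "arnold schwartnegger": "arnold schwarznegger",
         "arnold schwarznegger": "arnold schwarzenegger"}
print(iterative_correct("anol scwartegger", lambda q: chain.get(q, q)))
# -> arnold schwarzenegger
```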
Shortcomings of correcting the query as one full string:
• Depends on the agreement between the relative frequencies and the character error model.
• Requires identifying all queries in the query log that are misspellings of other queries.
• Requires finding a correction sequence of logged queries for any new query.
• Covers only exact matches of the queries that appear in these logs.
• Provides low coverage of infrequent queries.
Example of Query Correction Using Substrings
Example: britenetspear inconcert -> britney spears in concert
S0: britenetspear inconcert (l0 = 2)
S1: britneyspears in concert (l1 = 3)
S2: britney spears in concert (l2 = 4)
S3: britney spears in concert (no further change; li is the number of tokens at iteration i)
The tokenization process uses space and punctuation delimiters, in addition to the information about multi-word compounds provided by a trusted lexicon. Word unigram and bigram statistics extracted from query logs serve as the system's language model.
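As a toy illustration of the splitting side of this process, a one-split heuristic over hypothetical unigram counts (the actual system scores alternatives with unigram and bigram statistics inside its search):

```python
from collections import Counter

# Hypothetical unigram counts from a query log.
unigrams = Counter({"britney": 900, "spears": 800, "in": 5000,
                    "concert": 400, "britneyspears": 2, "inconcert": 1})

def maybe_split(token):
    # Try every split point; accept a two-word split whose rarer half
    # is still more frequent in the log than the unsplit token.
    best, best_score = token, unigrams[token]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        score = min(unigrams[left], unigrams[right])
        if score > best_score:
            best, best_score = left + " " + right, score
    return best

print(maybe_split("inconcert"))      # -> "in concert"
print(maybe_split("britneyspears"))  # -> "britney spears"
```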
Query Correction
Procedure:
1. The input query is tokenized using space and word-delimiter information in addition to the available lexical information.
2. A set of alternatives is computed for each token using the weighted Levenshtein distance function described before, with two thresholds for in-lexicon and out-of-lexicon tokens.
3. Matches are searched for in the space of word unigrams and bigrams extracted from query logs, as well as in the trusted lexicon.
4. A modified Viterbi search is employed to find the best alternative string to the input query (see the sketch below).
• Constraint: no two adjacent words may change simultaneously.
• Restriction: in-lexicon words are not allowed to change in the first iteration.
• Fringe: the searched paths form a fringe.
• Assumption based on the constraint: the list of alternatives for each word is randomly ordered, except that an input word present in the trusted lexicon occupies the first position of its list.
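A simplified sketch of step 4: a Viterbi-style dynamic program over per-token alternatives that enforces the adjacent-change constraint. The bigram_prob function and alternative lists are toy assumptions; the real search also factors in edit-distance scores and maintains a fringe of paths:

```python
from math import log

def viterbi_correct(tokens, alternatives, bigram_prob):
    # One state per (candidate word, whether it differs from the input).
    states = [[(alt, alt != tok) for alt in alternatives[tok]]
              for tok in tokens]
    # best maps a state at the current position to (score, path).
    best = {s: (log(bigram_prob("<s>", s[0])), [s[0]]) for s in states[0]}
    for pos in range(1, len(tokens)):
        nxt = {}
        for word, changed in states[pos]:
            for (prev, prev_changed), (score, path) in best.items():
                if changed and prev_changed:
                    continue  # no two adjacent words change simultaneously
                cand = score + log(bigram_prob(prev, word))
                key = (word, changed)
                if key not in nxt or cand > nxt[key][0]:
                    nxt[key] = (cand, path + [word])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy counts; add-one smoothing keeps log() away from zero.
counts = {("<s>", "power"): 60, ("power", "cord"): 120,
          ("power", "card"): 3, ("video", "card"): 150}
def bigram_prob(a, b):
    return (counts.get((a, b), 0) + 1) / 1000.0

alts = {"power": ["power"], "crd": ["crd", "cord", "card"]}
print(viterbi_correct(["power", "crd"], alts, bigram_prob))
# -> ['power', 'cord']
```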
Modified Viterbi Search Method
When using word-bigram statistics, stop words may interfere negatively with the best-path search.
To avoid this, a special strategy is used:
1. Ignore the stop words in the first pass, as in Figure 1.
2. Compute the best alternatives for the skipped stop words in a second Viterbi search, as in Figure 2.
Figure 1. Example trellis of the modified Viterbi search
Figure 2. Stop-word treatment
Conclusion
• Success in using the collective knowledge stored in search query logs for the spelling-correction task.
• An effective and efficient search method with good space complexity.
• Appropriate suggestions produced by iterative spelling correction with restrictions and a modified edit distance function.
• A technique that exploits an extremely informative but noisy resource – the errors made by people – to perform effective query spelling correction.
• Larger and more realistic evaluation data are still needed to make the results more convincing.
• The technique could be adapted to general-purpose spelling correction by using statistics from both query logs and large office-document collections.