Spelling Correction for Search Engine Queries
Bruno Martins and Mario J. Silva
Proceedings of EsTAL-04,
España for Natural Language Processing (2004)
Swapnil Chhajer
schhajer@usc.edu
http://schhajer.co.nr
Topics Covered in Class
• Peter Norvig’s Spelling Corrector: Query Processing [33-35]
• Levenshtein Algorithm: Query Processing [36-41]
• Evaluation Metrics: Precision & Recall: Introduction to Information
Retrieval [16]
• Soundex Algorithm: Query Processing [18]
April 16, 2013
Spelling Correction for Search Engine Queries
3
Motivation & Abstract
• Misspelled queries retrieve pages that contain the same misspellings,
so the most appropriate pages are left out.
• 10-12% of queries are misspelled.
• Goal: give the user the best possible match instead of making the user
choose one of the possible corrections from a correction list.
Google: Spelling Correction
Spelling Correction
• Uses:
• Correcting documents being indexed
• Retrieving matching documents when the query contains a spelling error
• Flavors:
• Isolated words
• Check each word on its own
• Cannot catch typos that are themselves correctly spelled words, e.g., “from” vs. “form”
• Context-sensitive
• Look at surrounding words, e.g., “I flew form Heathrow to Narita.”
“a paragraph cud half mini flaws but wood
bee past by the isolated spill checker”
General issues in Spelling Correction
• UI
• “Did you mean…?” works for one suggestion.
• What about multiple possible corrections?
• Computational Cost
• Spelling correction is potentially expensive
• Avoid running it on every query
• Maybe run it only on queries that match few documents
• Guess: spelling correction in major search engines is efficient
enough to be run on every query
Kinds of Spelling Mistakes: Typos
• Wrong characters typed by mistake
• Fall mainly into 4 categories:
• Insertions (extra letter)
• “prejudice” as “prejudsice”
• Deletions (missing letter)
• “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”,
“discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
• Substitutions (wrong letter)
• “habeas” as “haceas”, “appellate” as “appellare”
• Transpositions (adjacent letters swapped)
• “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as
“subpeona”, “plaintiff” as “plaitniff”
• 80-95% of typos differ from the correct spelling in just one of the four ways.
• Keyboard layout is important in such cases.
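The four categories can be made concrete with a small classifier. This is only an illustrative sketch, not part of the paper: the function name `classify_typo` and the label strings are assumptions, and it recognizes a typo only when exactly one edit separates it from the correct spelling.

```python
def classify_typo(correct: str, typo: str) -> str:
    """Name the single edit that turns `correct` into `typo` (hypothetical helper)."""
    if correct == typo:
        return "none"
    n, m = len(correct), len(typo)
    if m == n + 1:
        # Typo has one extra character: removing it must give the correct word.
        if any(typo[:i] + typo[i + 1:] == correct for i in range(m)):
            return "insertion"
    elif m == n - 1:
        # Typo lacks one character: removing it from the correct word gives the typo.
        if any(correct[:i] + correct[i + 1:] == typo for i in range(n)):
            return "deletion"
    elif m == n:
        diffs = [i for i in range(n) if correct[i] != typo[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typo[diffs[1]]
                and correct[diffs[1]] == typo[diffs[0]]):
            return "transposition"
    return "other"  # more than one edit away


for pair in [("prejudice", "prejudsice"), ("plaintiff", "paintiff"),
             ("habeas", "haceas"), ("fraud", "fruad")]:
    print(pair, classify_typo(*pair))
```

Anything reported as "other" is more than one edit away, which is the minority case according to the 80-95% figure.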
Kinds of Spelling Mistakes: Brainos
• Wrong characters typed on purpose
• Most common type of mistake in general web queries
• Mistakes derive from pronunciation, spelling, or semantic
confusion
• Brainos: Soundalike (Phonetic Errors)
• “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”,
“withholding” as “witholding”, “foreclosure” as “forclosure”
• Brainos: Confusions
• “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys”
or “jureys”, “dramshop” as “dram shop”
Dictionary Storage:
Ternary Search Trees (TST)
• Data structure: Ternary Search Tree (TST)
• A type of trie, limited to 3 children per node.
• A trie is a tree for storing strings in which there is one node for
every common prefix and the strings are stored in extra leaf nodes.
• Searching: O(log(n)+k) when the tree is balanced
• n: number of strings in the tree
• k: length of the string being searched for
TST Continued…
Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”,
all with an associated frequency of 1
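The tree in the figure can be rebuilt with a minimal sketch like the one below. The class and method names are assumptions for illustration, and for brevity the frequency is kept on the node of a word's last character instead of in a separate leaf node.

```python
class Node:
    def __init__(self, ch):
        self.ch = ch
        self.lo = None   # subtree for characters smaller than ch
        self.eq = None   # subtree for the next character of the word
        self.hi = None   # subtree for characters greater than ch
        self.freq = 0    # > 0 marks the end of a stored word


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq=1):
        if not word:
            raise ValueError("cannot store an empty word")
        self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq += freq  # re-inserting a word raises its frequency
        return node

    def frequency(self, word):
        """Return the stored frequency, or 0 if the word is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.freq
        return 0


tst = TernarySearchTree()
for w in ["to", "too", "toot", "tab", "so"]:
    tst.insert(w)
print(tst.frequency("too"), tst.frequency("tool"))
```

This version does no rebalancing, so the O(log(n)+k) search bound only holds if the insertion order happens to keep the tree balanced.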
Spelling Correction Algorithm
• Can be implemented using edit distance, rule-based techniques, n-grams, probabilistic
techniques, neural nets, similarity key techniques, or combinations of these.
• Goal: To find edit distance based on different strategies.
• A shorter distance implies a better correction.
• Soundex System:
• Indexing based on sound.
• Devised to help with the problem of phonetic errors.
• Metaphone Systems:
• Specific to the English language
• Transform words into codes based on phonetic properties
• Based on consonants & diphthongs
• Spelling correction for the web
• Context-dependent correction is of little use, as users hardly ever type more
than three terms in a query
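To illustrate indexing by sound, here is a sketch of Soundex. The slides do not say which variant is meant, so this assumes the common American Soundex rules (first letter kept, consonants mapped to digits, "h" and "w" ignored, code padded to four characters); the function name is also an assumption.

```python
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word: str) -> str:
    """American Soundex key for `word` (empty string if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return "".join(out)[:4].ljust(4, "0")


print(soundex("subpoena"), soundex("supena"))
print(soundex("withholding"), soundex("witholding"))
```

Soundalike misspellings such as “supena” share a key with the intended word, which is what makes a sound-based index useful for phonetic errors.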
Spelling Correction Algorithm
Continued…
• The user-entered query is tokenized, ignoring non-word characters.
• All words are converted to lower case and checked for correct
spelling.
• The frequencies of correctly spelled words are updated. This
acts as feedback to the system.
• The feedback can help the spell checker predict
patterns in users’ searches.
• Misspelled words are replaced by correctly spelled words.
• Finally, the new query is presented to the user as a suggestion,
together with the results page for the original query.
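The steps above can be sketched as a single function. All names here are hypothetical: `frequencies` stands in for the dictionary (a plain dict from word to count instead of a TST), and `correct_word` is a placeholder for the correction algorithm itself.

```python
import re


def suggest_query(query, frequencies, correct_word):
    """Return a corrected query to suggest, or None if nothing changed.

    frequencies:  dict mapping known words to counts (updated in place)
    correct_word: callable mapping a misspelled word to its replacement
    """
    # Tokenize on word characters and lower-case everything.
    tokens = re.findall(r"\w+", query.lower())
    suggestion, changed = [], False
    for tok in tokens:
        if tok in frequencies:
            frequencies[tok] += 1      # feedback: count correctly spelled words
            suggestion.append(tok)
        else:
            fixed = correct_word(tok)  # replace the misspelled word
            changed = changed or fixed != tok
            suggestion.append(fixed)
    return " ".join(suggestion) if changed else None


freqs = {"cheap": 3, "flights": 5}
fixer = lambda w: {"flihgts": "flights"}.get(w, w)  # toy corrector for the demo
print(suggest_query("Cheap, flihgts!", freqs, fixer))
```

The caller would show the returned string as a suggestion next to the results for the original query.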
Spelling Correction Algorithm
Continued…
• The algorithm is divided into 2 phases:
• Phase 1: Generate a set of candidate suggestions
• Phase 2: Select the best choice among those candidates
• Phase 1
• 9 steps; each step looks up the dictionary for words related to the original
misspelling, i.e., words that:
• Differ in one character from the original word.
• Differ in two characters from the original word.
• Differ in one letter removed or added.
• Differ in one letter removed or added, plus one letter different.
• Differ in repeated characters removed.
• Correspond to 2 concatenated words (space between words eliminated).
• Differ in having two consecutive letters exchanged & 1 character different.
• Have the original word as a prefix.
• Differ in repeated characters removed & 1 character different.
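A few of the Phase 1 steps can be sketched as follows. This is an assumption-laden illustration, not the paper's implementation: it generates variants by brute force and filters them against a Python set, where the paper looks words up in the TST, and it uses an ASCII alphabet that would need accented letters for Portuguese. All function names are hypothetical.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: no accented letters


def one_char_different(word):
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]}


def one_letter_removed_or_added(word):
    removed = {word[:i] + word[i + 1:] for i in range(len(word))}
    added = {word[:i] + c + word[i:]
             for i in range(len(word) + 1) for c in ALPHABET}
    return removed | added


def repeated_chars_removed(word):
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return {"".join(out)}


def candidates(word, dictionary):
    """Collect dictionary words related to `word`, in step order."""
    found = []

    def keep(words):
        for w in sorted(words):
            if w in dictionary and w != word and w not in found:
                found.append(w)

    keep(one_char_different(word))
    keep(one_letter_removed_or_added(word))
    keep(repeated_chars_removed(word))
    for i in range(1, len(word)):  # two concatenated words
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            pair = left + " " + right
            if pair not in found:
                found.append(pair)
    keep(w for w in dictionary if w.startswith(word))  # word as a prefix
    return found


words = {"plaintiff", "fourth", "amendment", "judgment", "book"}
print(candidates("paintiff", words))
print(candidates("fourthamendment", words))
```

The remaining steps (two characters different, and the combined steps) could be built by composing these generators.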
Spelling Correction Algorithm
Continued…
• Phase 2: Heuristics used
• Return the candidate that differs from the original only in accented characters.
• Return the candidate that differs in only one character, with the error corresponding to an
adjacent letter in the same row of the keyboard.
• Return the smallest candidate, if there are candidates with the same metaphone key as the
original string.
• Return the candidate that differs in only one character, with the error corresponding to an
adjacent letter in an adjacent row of the keyboard.
• As a last resort, return the last word.
• Heuristics are applied in sequence; the next one is tried only if no matching
words are found.
• If more than one word matches, return the one whose first character
matches the original.
• If there is still more than one, choose the word with the highest frequency.
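The sequential structure of Phase 2 can be sketched as below. Only the accent and same-row heuristics are shown; the metaphone and adjacent-row heuristics are left out. The QWERTY rows, the function names and the dict-based frequencies are all assumptions made for this example.

```python
import unicodedata

ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumption: QWERTY layout


def strip_accents(s):
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def single_substitution(a, b):
    """Return the differing (a_char, b_char) pair if exactly one position differs."""
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def same_row_adjacent(x, y):
    for row in ROWS:
        if x in row and y in row:
            return abs(row.index(x) - row.index(y)) == 1
    return False


def accents_only(word, cand):
    return cand != word and strip_accents(cand) == strip_accents(word)


def same_row_typo(word, cand):
    pair = single_substitution(word, cand)
    return pair is not None and same_row_adjacent(*pair)


def break_ties(word, matches, frequencies):
    if len(matches) > 1:
        first = [c for c in matches if c[:1] == word[:1]]
        if first:
            matches = first  # prefer candidates sharing the first character
    return max(matches, key=lambda c: frequencies.get(c, 0))


def pick(word, cands, frequencies):
    """Apply the heuristics in order; fall back to the last candidate."""
    for heuristic in (accents_only, same_row_typo):
        matches = [c for c in cands if heuristic(word, c)]
        if matches:
            return break_ties(word, matches, frequencies)
    return cands[-1] if cands else None


print(pick("jydge", ["badge", "judge"], {}))
```

A heuristic is consulted only when every earlier one found nothing, mirroring the sequential rule on the slide.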
Results Comparison
• Aspell Spell Checker
• http://aspell.sourceforge.net/
• Aspell uses the Metaphone algorithm with a near-miss strategy
• 48.33% of correct forms were correctly guessed.
• Outperformed Aspell by 1.66%
* Doesn’t detect the misspelling
- Failed to return a suggestion.
Results Comparison Continued…
• Tumba!: search engine for the Portuguese web
Table: Results from spelling checker with Tumba!
Conclusion & Future Work
• The spelling checker uses a ternary search tree data structure for storing the
dictionary.
• The data source was two popular Portuguese newspapers.
• Search engine queries may contain company or person names. In such
cases, keeping two dictionaries, one in the TST used for correction and
another in a hash table used only for checking valid words, could yield good
results.
Pros & Cons
• Pros
• Considers various factors affecting edit distance, including probabilistic
estimations.
• Uses a feedback system to improve the quality of the results for user queries.
• Cons
• Does not consider context-sensitive spell checking.
• Not a language-independent system; it focuses mainly on Portuguese words.
• No discussion of spell-corrected completion suggestions as a query is
incrementally entered.