
Using the Web for Language Independent
Spellchecking and Autocorrection
Authors:
C. Whitelaw, B. Hutchinson, G. Chung, and G. Ellis
Google Inc.
Published in:
the 2009 Conference on Empirical Methods in Natural Language Processing
Presented by:
Abdulmajeed Alameer
What has been done in the paper
• Most spelling systems require some manually compiled
resources, such as a lexicon and a list of misspellings
• The system proposed in the paper requires no annotated
data. It relies on the Web as a large noisy corpus in the
following way:
• Infer information about misspellings from term usage observed
on the Web
• The most frequently observed terms are taken as a list of
potential candidate corrections
• N-grams are used to build a Language Model (LM), which is
used to make context-appropriate corrections
The Web-based Approach
• For an observed word w and a candidate correction s,
score each candidate by P(s|w) ∝ P(w|s) × P(s)
• For each token in the input text, candidate suggestions are drawn from
the term list
• Candidates are scored using an error model
• They are then evaluated in context using a Language Model
• Finally, classifiers are used to determine our confidence in whether a
word has been misspelled and whether it should be autocorrected to
the best-scoring suggestion available
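The scoring step above can be sketched as a noisy-channel ranking. This is a minimal illustration, not the paper's implementation: the probability tables and the function name `rank_candidates` are hypothetical, and the numbers are invented.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
ERROR_MODEL = {("teh", "the"): 0.05, ("teh", "ten"): 0.01, ("teh", "tea"): 0.01}
PRIOR = {"the": 0.05, "ten": 0.001, "tea": 0.0005}

def rank_candidates(w, candidates):
    """Rank candidate corrections s for observed word w by log P(w|s) + log P(s)."""
    scored = []
    for s in candidates:
        p_w_given_s = ERROR_MODEL.get((w, s), 0.0)
        p_s = PRIOR.get(s, 0.0)
        # Skip candidates that either model considers impossible.
        if p_w_given_s > 0 and p_s > 0:
            scored.append((math.log(p_w_given_s) + math.log(p_s), s))
    scored.sort(reverse=True)  # highest score first
    return [s for _, s in scored]

ranked = rank_candidates("teh", ["the", "ten", "tea"])
print(ranked)
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.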
Building the Term List
• Rather than attempting to build a lexicon of well-spelled
words, we take the most frequent tokens observed on the
web
• Using a sample of more than 1 billion web pages
• Use filters to remove non-words (too much punctuation, too
short or long)
• This term list is so large that it should contain most well-spelled words, but also a large number of misspellings
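A rough sketch of the term-list construction, assuming whitespace tokenization. The filter thresholds (`min_len`, `max_len`, `max_punct_ratio`) and both function names are hypothetical; the slides do not state the exact filters used.

```python
from collections import Counter
import string

def is_plausible_word(token, min_len=2, max_len=25, max_punct_ratio=0.3):
    """Reject tokens that are too short, too long, or mostly punctuation.
    All thresholds are assumptions made for this sketch."""
    if not (min_len <= len(token) <= max_len):
        return False
    punct = sum(ch in string.punctuation for ch in token)
    return punct / len(token) <= max_punct_ratio

def build_term_list(documents, top_k):
    """Count surviving tokens and keep the top_k most frequent ones."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            if is_plausible_word(token):
                counts[token] += 1
    return counts.most_common(top_k)

docs = ["the cat sat on the mat", "teh cat !!! sat", "the the a"]
terms = build_term_list(docs, top_k=10)
print(terms)
```

Note that the misspelling "teh" survives the filters, which matches the slide's point that the term list contains misspellings as well as well-spelled words.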
Building the Error Model
• A substring error model is used to estimate the value of P(w|s)
• To train the error model we need triples of (intended_word,
observed_word, count)
• The triples are not used directly for proposing corrections
• Since we use a substring error model, the triples need not be an
exhaustive list of spelling mistakes
• We would expect:
• P(the | the) to be very high
• P(teh | the) to be relatively high
• P(hippopotamus | the) to be extremely low
Building the Error Model (Cont.)
• Substring error model:
• P(w|s) is estimated as a product of substring-level probabilities
over aligned partitions of w and s
• Example:
Say we have w = “fisikle” and s = “physical”. For both w and
s, pick a partition from the set of all possible partitions:
w:  f  | i | s | i | k | le
s:  ph | y | s | i | c | al
P(w|s) = P(‘f’|‘ph’) × P(‘i’|‘y’) × P(‘s’|‘s’) × P(‘i’|‘i’) × P(‘k’|‘c’) ×
P(‘le’|‘al’)
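The product in the example can be computed directly once a partition is fixed. The sketch below assumes a lookup table of substring probabilities; the table values and the name `partition_probability` are hypothetical and chosen only to make the arithmetic visible.

```python
import math

# Hypothetical values of P(observed_part | intended_part); not from the paper.
SUBSTRING_PROB = {
    ("f", "ph"): 0.1, ("i", "y"): 0.2, ("s", "s"): 0.9,
    ("i", "i"): 0.9, ("k", "c"): 0.3, ("le", "al"): 0.05,
}

def partition_probability(w_parts, s_parts, table):
    """Multiply P(w_part | s_part) over one aligned partition of w and s."""
    if len(w_parts) != len(s_parts):
        raise ValueError("partitions must have the same number of parts")
    # Unseen substring pairs get probability 0 in this sketch (no smoothing).
    return math.prod(table.get((wp, sp), 0.0) for wp, sp in zip(w_parts, s_parts))

w_parts = ["f", "i", "s", "i", "k", "le"]    # partition of "fisikle"
s_parts = ["ph", "y", "s", "i", "c", "al"]   # partition of "physical"
p = partition_probability(w_parts, s_parts, SUBSTRING_PROB)
print(p)
```

This scores a single partition only; choosing among all possible partitions is a separate search step that is not shown here.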
Building the Error Model (Cont.)
• Finding Close Words: For each term in the term list, find all
other terms that are close to it.
• Use Levenshtein edit distance
• This stage takes a very long time (tens to hundreds of CPU-hours)
• Filtering Triples: On the assumption that words are spelled
correctly more often than they are misspelled, we next filter the
set such that the first term’s frequency is at least 10 times that
of the second term
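The two stages above can be sketched together as follows. The distance threshold `max_distance`, the use of the observed term's frequency as the triple's count, and the function names are assumptions made for this sketch. The all-pairs loop is deliberately naive, which also illustrates why this stage is expensive on a web-scale term list.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def candidate_triples(term_counts, max_distance=2, min_ratio=10):
    """Return (intended, observed, count) for close pairs where the intended
    term is at least min_ratio times as frequent as the observed term."""
    triples = []
    for intended, n_int in term_counts.items():
        for observed, n_obs in term_counts.items():
            if intended == observed:
                continue
            if n_int >= min_ratio * n_obs and levenshtein(intended, observed) <= max_distance:
                triples.append((intended, observed, n_obs))
    return triples

term_counts = {"the": 1000, "teh": 20, "then": 500, "hippopotamus": 30}
triples = candidate_triples(term_counts)
print(triples)
```

With these toy counts, "the" and "then" are never paired with each other because neither is 10 times more frequent than the other.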
Building the Language Model
• An n-gram Language Model is used to estimate P(s)
• Use both forward and backward context when available
• Most user edits have both right and left context
• A variable λ is used to tune the confidence of the LM
depending on the availability of context
• P(s|w) = P(w|s) × P(s)^λ
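The effect of the exponent λ is easiest to see in log space, where it becomes a weight on the language-model term. In the sketch below the candidate probabilities and the two λ values are hypothetical.

```python
import math

def combined_score(p_w_given_s, p_s, lam):
    """Log of P(w|s) × P(s)^λ, computed in log space."""
    return math.log(p_w_given_s) + lam * math.log(p_s)

# Hypothetical (error-model, language-model) probabilities for two candidates.
candidates = {"from": (0.01, 0.1), "form": (0.05, 0.001)}

def best(lam):
    """Return the candidate with the highest combined score for a given λ."""
    return max(candidates, key=lambda s: combined_score(*candidates[s], lam))

print(best(1.0))  # language model fully trusted
print(best(0.2))  # language model down-weighted, e.g. when little context is available
```

With a large λ the candidate favoured by the language model wins; with a small λ the error model dominates and the ranking flips.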
Confidence Classifiers
• First, all suggestions s for a word w are ranked according to their
P(s|w) scores
• Second, a spellchecking classifier is used to predict whether w is
misspelled
• Third, if w is predicted to be misspelled and the suggestion list is non-empty,
an autocorrection classifier is used to predict whether the top-ranked suggestion is correct
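The three-step cascade can be sketched as below. The two `predict_*` functions are simple stand-in rules, not the trained classifiers described in the paper; their logic, the `min_margin` threshold, and all names are assumptions for illustration.

```python
def predict_misspelled(w, ranked):
    """Stand-in rule: flag w unless it is itself the top-ranked suggestion."""
    return not ranked or ranked[0][0] != w

def predict_autocorrect(w, ranked, min_margin=2.0):
    """Stand-in rule: autocorrect only if the top score clearly beats the runner-up."""
    if len(ranked) == 1:
        return True
    return ranked[0][1] - ranked[1][1] >= min_margin

def check_word(w, ranked):
    """ranked: (suggestion, log-score) pairs, already sorted best-first."""
    if not predict_misspelled(w, ranked):
        return {"misspelled": False, "suggestions": [], "autocorrect": None}
    suggestions = [s for s, _ in ranked]
    autocorrect = None
    # The autocorrection step only runs when there is at least one suggestion.
    if ranked and predict_autocorrect(w, ranked):
        autocorrect = suggestions[0]
    return {"misspelled": True, "suggestions": suggestions, "autocorrect": autocorrect}

print(check_word("teh", [("the", -3.0), ("teh", -9.0)]))
print(check_word("form", [("from", -4.0), ("form", -4.5)]))
```

Separating flagging from autocorrection lets a system underline a word it is unsure about without silently changing it.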
Confidence Classifiers (cont.)
• Training and tuning the confidence classifiers require clean
supervised data
• Clean data are not generally available, so newspaper articles
can be used as a clean corpus.
• It can be assumed that news articles are almost entirely well-spelled
• Artificial errors are generated systematically at a uniform rate
of 2 errors per hundred characters
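A minimal sketch of injecting artificial errors at a fixed rate. The choice of edit operations (delete, insert, substitute), the restriction to letters, and the name `inject_errors` are assumptions; the slides state only the rate.

```python
import random
import string

def inject_errors(text, rate=0.02, seed=0):
    """Corrupt each letter independently with probability `rate`.
    Returns the corrupted text and the number of edits applied."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    out = []
    n_errors = 0
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            n_errors += 1
            op = rng.choice(["delete", "insert", "substitute"])
            if op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "substitute":
                # Pick a different letter so the edit is a real change.
                others = [c for c in string.ascii_lowercase if c != ch.lower()]
                out.append(rng.choice(others))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out), n_errors

clean = "the quick brown fox " * 500
noisy, n = inject_errors(clean, rate=0.02, seed=0)
print(n, len(clean))
```

Because the clean text is known, every injected error comes with its correct answer, which is what makes the corpus usable as supervised training data.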
Results
• The system achieved a lower error rate than GNU Aspell
• GNU Aspell is a free and open-source spell checker that is
available from aspell.net
System                 TER    CER    FER    NGS
GNU Aspell             4.83   2.87   2.84   18.3
Web-based Suggestion   2.55   2.21   1.29   10.1
TER = Total Error Rate
CER = Correction Error Rate
FER = Flagging Error Rate
NGS = No Good Suggestion Rate
Results (other languages)
• The system also performed well in German, Arabic, and Russian
• Relative improvements in total error rate (TER) over Aspell are 47% in German,
60% in Arabic, and 79% in Russian
System           TER     CER    FER     NGS
German Aspell    8.64    4.28   5.25    29.4
German WS        4.62    3.35   2.27    16.5
Arabic Aspell    11.67   4.66   8.51    25.3
Arabic WS        4.64    3.97   2.30    15.9
Russian Aspell   16.75   4.40   13.11   40.5
Russian WS       3.53    2.45   1.93    15.2
Effect of Web Corpus Size
• It can be seen from the graph that the gains are small after
using about 10^6 documents