
Project 1: Spelling Challenge
Phase 1: Analysis
Tim Armstrong, Charlie Greenbacker, and Zhuo Li
CISC 889: Applications of NLP
11 April 2011
Task Description
Our assignment is essentially to implement a “Did you mean?” feature for a web search
engine. The task can be divided into three steps: 1. identification of errors, 2. generation of
suggested corrections, and 3. assignment of probabilities to the list of suggestions. Input is
given as a user-generated query, and output will be provided as a list of possible
replacement queries with assigned probabilities. A sample dataset based on a portion of the
TREC corpus has been provided by Microsoft, with example user queries provided alongside
expert-made suggested replacement queries. The Microsoft Research Speller Challenge,
upon which this assignment is based, is concerned with more than just simple spelling
errors, as it also incorporates some limited query reformulation (e.g., multiple valid
spellings, alternative punctuation). Thus, our list of suggestions should include
recommendations of this type as well.
We recognize three broad, overlapping categories of errors that our system will need to
detect and suggest corrections for. Typographical errors include mistakes in typing, often
introduced via substitution, insertion, deletion, or transposition. Errors of this type are
usually of edit distance 1 or 2, and some mistakes are more common than others. Phonetic
errors occur when a user attempts to “sound out” the spelling of a word whose correct
spelling they don’t know. These errors can have a much higher edit distance, and it is
often difficult to distinguish a phonetic error from a typo. Finally, cognitive errors are
real-word errors in which the user succeeds in spelling a word correctly, but it is
unfortunately not the word they intended. Examples include motor-memory errors and
homonyms.
We are required to use a noisy channel model for spelling correction. The source model will
be a language model (such as an n-gram model) that provides an a priori probability for
known words. The channel model will be a model of noise that indicates how likely certain
errors are, based on factors like minimum edit distance or keyboard distance. Together, the
noisy channel model tells us how likely an error is for a given word, or more importantly,
how likely a known word is for a given error.
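To make this concrete, here is a minimal sketch of noisy channel scoring in Python; the word probabilities and error probabilities below are invented purely for illustration, not taken from the dataset:

```python
# Minimal sketch of the noisy channel model for spelling correction.
# All probabilities below are invented for illustration; a real system
# would estimate them from corpus data and an error model.

# Source model: P(w), an a priori probability for each known word.
P_word = {"government": 0.002, "garment": 0.0005, "comment": 0.001}

# Channel model: P(e | w), how likely the observed error string is
# for each candidate word (e.g., based on edit or keyboard distance).
P_error_given_word = {"goverment": {"government": 0.1, "garment": 0.001,
                                    "comment": 0.0001}}

def correct(error):
    """Rank candidate corrections by P(w) * P(e | w)."""
    candidates = P_error_given_word.get(error, {})
    scored = {w: P_word[w] * p for w, p in candidates.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(correct("goverment"))
# [('government', 0.0002), ...] -- 'government' wins by a wide margin
```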
The system will be evaluated on Microsoft’s training set, using expected F1 (as defined in
the challenge rules) as the evaluation metric. Additionally, systems submitted to the
Microsoft challenge will be evaluated on a hidden Bing test dataset.
Data Analysis
We divided the errors in the dataset into three groups for identification and analysis:
typographical errors, phonetic spelling errors, and cognitive errors.
Typographical Errors
Typographical errors include any spelling error caused by a typing mistake, and are
generally divided into four categories: substitutions, insertions, deletions, and
transpositions. To avoid confusion with cognitive errors (including so-called “real word”
errors), we limited our search for typographical errors to instances where the error word
was not found in WordNet. To avoid uncommon proper nouns (which we would have very
little chance of generating proper corrections for), this analysis was also limited to only
those instances where the suggested correction was found in WordNet.
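A sketch of this filtering step, assuming NLTK’s WordNet interface is available (the helper names are our own):

```python
# Sketch of the WordNet filter described above. A (typo, suggestion)
# pair is kept only if the error word is NOT in WordNet (to exclude
# real-word cognitive errors) and the suggested correction IS in
# WordNet (to exclude uncommon proper nouns).
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def in_wordnet(word):
    return len(wordnet.synsets(word)) > 0

def is_candidate_typo(error_word, suggestion):
    return not in_wordnet(error_word) and in_wordnet(suggestion)

print(is_candidate_typo("practioner", "practitioner"))  # True
```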
Out of the 311 queries in which an error was detected, 65 queries included a typo. Two of
these queries received two different suggested corrections for the same error word,
bringing the total number of typo/correction pairs to 67. Of these 67 typos, 63 (94%) were
of edit distance 1, 4 typos were of edit distance 2, and no typos were of edit distance 3+.
The four typos of edit distance 2 were as follows:
(542,1,4) practioner => (542,2,4) practitioner [two deletions]
(2078,1,1) higgins => (2078,2,1) higginson [two deletions]
(2953,1,3) retiremt => (2953,2,3) retirement [two deletions]
(5063,1,1) licens => (5063,3,1) licence [substitution + deletion]
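For reference, the following is a sketch of how edit distance can be computed when classifying typos. It is the Damerau-Levenshtein variant, which counts substitutions, insertions, deletions, and adjacent transpositions each as a single edit, matching the four typo categories above:

```python
# Sketch: Damerau-Levenshtein edit distance over two strings.
def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + cost)  # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("practioner", "practitioner"))  # 2
print(edit_distance("retiremt", "retirement"))      # 2
```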
Percentage of typos by category:

  transpositions:  6  (8%)
  substitutions:   5  (7%)
  deletions:      49 (69%)
  insertions:     11 (16%)
Unfortunately, there were far too few instances of substitution, insertion, and transposition
to do a proper statistical analysis of letter distribution among the errors. There were,
however, sufficient examples of deletion to offer some interesting insights into this
category.
Here is the breakdown of the most common contexts for deletion typos (single instances
omitted for space). In the “both sides” trigrams the deleted letter is the middle letter; in the
left-context pairs it is the second letter; in the right-context pairs it is the first. A dollar
sign (‘$’) denotes end-of-line.
  both sides:    RNM: 4, NIA: 3, LLI: 2, SSI: 2, TIE: 2, TTE: 2
  left context:  RN: 4, LL: 3, NI: 3, TT: 3, RE: 2, SS: 2, TI: 2
  right context: NM: 4, IA: 3, E$: 2, IE: 2, LI: 2, SI: 2, TE: 2
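A sketch of how these context counts can be gathered; the two pairs below are toy data standing in for the real distance-1 deletion pairs from the dataset:

```python
# Sketch: recover the deleted letter and its left/right context from
# (typo, correction) pairs that differ by a single deletion, then
# tally the most common contexts.
from collections import Counter

pairs = [("goverment", "government"), ("comittee", "committee")]  # toy data

both, left, right = Counter(), Counter(), Counter()
for typo, corr in pairs:
    # First position where typo and correction diverge = deletion site.
    i = 0
    while i < len(typo) and typo[i] == corr[i]:
        i += 1
    deleted = corr[i].upper()
    lc = corr[i - 1].upper() if i > 0 else "^"
    rc = corr[i + 1].upper() if i + 1 < len(corr) else "$"
    both[lc + deleted + rc] += 1
    left[lc + deleted] += 1
    right[deleted + rc] += 1

print(both.most_common(5))  # e.g. [('RNM', 1), ('MMI', 1)]
```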
Looking at the individual letters that were deleted offers additional insight. Here we see a
frequency distribution of deleted letters, paired with their usage frequency in English
(based on figures from the Cornell University Math Explorer's Project, via
http://en.wikipedia.org/wiki/Letter_frequency) to get a sense of just how common each
deletion is:
[Figure: frequency distribution of deleted letters (N, I, T, E, A, L, S, C, H, R, U, Y), each
paired with the letter’s overall usage frequency in English; vertical axis 0-20%.]
Breakdown of Data
We divided 300 of the queries that had a suggestion different from the original into the
following categories. See below for analysis of the tokenization and form of word
categories.
1. Words with legitimate alternative spellings (11). They want us to suggest both.
a. British/American (8)
b. Other (3)
2. Non-English (4). Specifically Spanish.
3. Acronyms with and without periods (63). They want us to give some
acronyms both with and without periods. The list for which they do want periods
includes “U.S.” and U.S. state names. One example of something for which they don’t
ask for periods is “fema” (for Federal Emergency Management Agency).
4. Abbreviation variations (3). E.g., “gov” and “govt” are suggested for “gov” (as in
“government”).
5. Repeated words (2). One was a misspelled word, the other just an accidental
repetition.
6. White space issues (38). E.g. healthcare vs. health care. In either of those forms,
Microsoft will suggest both.
a. Space present (12).
b. Space absent (26).
Both of these can be either cognitive errors or typos. For most of the
examples, the correct usage would not be 100% clear to a native speaker.
Microsoft wants all the alternatives.
7. Punctuation errors/variants (54). E.g., variations of “online” would be “online” and
“on-line”. We group whitespace and punctuation errors together into “tokenization”
errors. Because Microsoft replaces punctuation with whitespace, it’s not always
possible to tell if the original had a dash or a space, but often it is clear.
a. Punctuation present (26).
b. Punctuation absent (11).
Additionally, worth its own category:
c. Missing apostrophe in intended “ ‘s ” (17). This error likely occurs often
because people expect the query engine to correct their query!
8. Form of word variant errors (FOWVE) (50), as when the suggestions are other
forms of the query word, such as “statistic” and “statistics”. Some of them look to be
cognitive errors, such as “william van regenmorter crime victim rights act” when
the act is really called “william van regenmorter crime victim s rights act”. Some of
them look to be typos, such as “fact on the sun”; there it was only accidental that the
typo resulted in a real word. Others really look correct, and probably don’t need
suggestions, but Microsoft gives some anyway: “florida bankruptcy court” for
“florida bankruptcy courts”.
a. Cognitive (16).
b. Typos (18).
c. Not really an error? (16).
The remaining errors didn’t fall into the above categories. For many of the examples, it’s not
possible to tell definitively whether the error was a typo or a cognitive error.
9. Other cognitive errors (34). This category includes phonetic errors (“health
insurance for children with low income familys”) and homonyms (“can you take
amitriptyline while your pregnant”).
10. Other typos (49). These queries were clearly mistyped, such as “Louisiana diaster
unemployment” for “disaster”.
[Figure: distribution of the queries across the categories above, from British/American
spellings through other typos.]
Cognitive Errors
Cognitive errors include query errors that are “conceptually wrong”. Conceptually wrong
means that the query word may be a valid word but not the semantically correct one, or
that experts suggest a replacement due to habitual variant formats.
Pure Tokenization Errors
The first component of cognitive errors is tokenization errors: errors caused by the
insertion of punctuation or spaces. Punctuation marks are replaced with spaces during the
TREC preprocessing phase and therefore yield different tokenizations. These punctuation
marks include “’”, “-”, “.”, etc. The following are examples of punctuation errors and space
tokenization errors:

Punctuation errors:
(107, 1, 3): childrens magazines => children s magazines
(309, 1, 3): my skin wont tan => my skin won t tan

Space tokenization errors:
(45, 1, 2): calif high way patrol => calif highway patrol
(301, 1, 6): zip codes at the uspostoffice => zip codes at the us post office
There are in total 141 queries with different tokenization; however, only 111 of them
(35.69% of the 311 errors) are classified as pure tokenization errors. An error is classified
as a tokenization error only if the form and semantics of the word are not changed by the
different tokenization.
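The character-level half of this test can be sketched as follows: strip all whitespace and punctuation and compare. This is a necessary condition only; the semantic condition (e.g., “survivor s” vs. “survivors”, discussed below) still requires judgment:

```python
import string

# Sketch: a suggestion is a pure tokenization variant of the query if
# the two strings become identical once all whitespace and punctuation
# are removed (i.e., only the token boundaries differ).
def strip_tokens(s):
    return "".join(c for c in s.lower()
                   if c not in string.whitespace
                   and c not in string.punctuation)

def is_pure_tokenization_variant(query, suggestion):
    return strip_tokens(query) == strip_tokens(suggestion)

print(is_pure_tokenization_variant("power point presentations",
                                   "powerpoint presentations"))   # True
print(is_pure_tokenization_variant("childrens magazines",
                                   "children s magazines"))       # True
print(is_pure_tokenization_variant("down syndrome teen agers",
                                   "down s syndrome teenagers"))  # False
```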
Typical examples of these 111 tokenization errors are:
(62, 1, 1): power point presentations about the construction industry
=> powerpoint presentations about the construction industry
(63, 1, 1): childrens hospital oklahoma city 39th streeet
=> children s hospital oklahoma city 39th streeet
=> children s hospital oklahoma city 39th street
The following example varies in tokenization but should be labeled as a word variant error:
(19, 1, 5): how to establish a survivor s group => how to establish a survivors group
Examples of tokenization changes that cause different semantics, and hence are not
considered tokenization errors:
(35, 1, 8): behavior centers for trisomy 21 down syndrome teen agers
=> behavior centers for trisomy 21 down syndrome teenagers
=> behavior centers for trisomy 21 down s syndrome teenagers
(231, 1, 4): importing bone products in to the united states
=> importing bone products into the united states
The average edit distance of tokenization errors is 1.1. Out of the 111 total tokenization
errors, 100 have an edit distance of 1 and 11 have an edit distance of 2.
Form of word variant errors
Errors involving the wrong form of the same word are classified as form of word variant
errors (FOWVE). Again, these types of errors are unlikely to be detected with a dictionary;
our initial idea is to use an n-gram language model in such cases. Among the 311 errors in
the TREC data, there are in total 49 such errors, constituting 15.76% of the total. Various
forms of the same word include noun/adjective/verb transformations, plural/singular
transformations, special cases, and also disagreement between different people on
abbreviations.
Typical FOWVEs are as below:
(4, 1, 2): family education rights and privacy act
=> family educational rights and privacy act
(102, 1, 2): environmental affect of dredging the c d canal
=> environmental effect of dredging the c d canal
An example FOWVE caused by disagreement on abbreviations:
(203, 1, 6): how to apply or a gov grant => how to apply for a govt grant
FOWVE may cause a different number of tokens in the suggestion, but they should not be
considered pure tokenization errors. A large number of FOWVE involve -‘s at the end of a
word, causing different tokenization; in the cases where -s and -‘s have different semantic
meanings, the error is labeled as a wrong form of the word, for example:
(19, 1, 5): how to establish a survivor s group => how to establish a survivors group
Form of word variant errors have an average edit distance of 1: out of the 49 such errors,
30 are -s vs. -‘s problems, which have an edit distance of only 1. Disagreements over
abbreviations also have an average edit distance of 1, and plural and singular forms
generally vary by an edit distance of 1 as well. These statistics show that edit distance can
be helpful in ranking the likelihood of n-gram-suggested corrections.
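As a sketch of how this might work, the snippet below scores word-form variants with bigram counts, discounted by edit distance; both the counts and the penalty function are invented for illustration:

```python
# Sketch: rank word-form variants in context using bigram counts
# (hypothetical numbers), discounted by edit distance from the query.
bigram_counts = {("family", "educational"): 900,
                 ("family", "education"): 400,
                 ("educational", "rights"): 700,
                 ("education", "rights"): 100}

def score(prev_word, candidate, next_word, distance):
    # Language-model evidence from the surrounding words...
    lm = (bigram_counts.get((prev_word, candidate), 0)
          + bigram_counts.get((candidate, next_word), 0))
    # ...discounted by a simple edit-distance penalty.
    return lm / (1 + distance)

# Rank variants of "education" in "family education rights ...":
for cand, dist in [("education", 0), ("educational", 2)]:
    print(cand, score("family", cand, "rights", dist))
# "educational" outscores "education" despite the edit-distance penalty
```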
Additional Cognitive Errors
So far, one of the key characteristics of cognitive errors has been “not easily detected by a
dictionary, but highly suggested by an n-gram model”. This situation arises especially with
people’s names, location names, foreign-language terms, and motor-memory errors. It
should be noted that there is a large degree of overlap between cognitive errors, phonetic
errors, and typing errors.
The average edit distance of such errors is 1.4, while the maximum distance is only 2. This
suggests that these errors mostly involve phonetically similar characters in different
words. The following is a simple graph comparing the average edit distances of the three
categories discussed above:
[Figure: average edit distance of cognitive errors, by category; vertical axis 0-1.5.]
Initial Solution Ideas
We are planning to submit an entry for the Microsoft Research Speller Challenge, which
means we should follow the challenge guidelines and implement a REST-based web service
interface to our system.
Our initial idea for a system to complete this task is as follows. The system will accept lines
of text as input and distribute this input to three separate modules. Each module is
responsible for identifying one type of error (typo, phonetic, or cognitive) and generating a
list of possible corrections. These suggestions will be pooled together in a common list,
ranked and assigned probabilities accordingly, and returned in the Speller Challenge
output format. We expect to use a wide assortment of techniques to identify errors and
generate/weight corrections, including dictionary-based approaches, minimum edit
distance, semantic similarity metrics, n-gram models, and Soundex/SPEEDCOP.
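As one example of the phonetic techniques we have in mind, here is a sketch of the standard Soundex algorithm, which maps similar-sounding words to the same four-character code:

```python
# Sketch: standard Soundex. Keep the first letter, encode the rest by
# sound class, skip vowels (which reset the previous code) and h/w
# (which do not), collapse repeats, and pad/truncate to 4 characters.
def soundex(word):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            result += code
        if c not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("pregnant"), soundex("pregnent"))  # P625 P625 -- same code
```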
For the design phase, each of us will individually come up with a plan to identify instances
of, and generate suggested corrections for, our particular class of errors (typo, phonetic, or
cognitive). Then we’ll meet to exchange ideas, as well as discuss common concerns, such as
how to assign probabilities to the combined list of suggestions.