Project 1: Spelling Challenge
Phase 1: Analysis
Tim Armstrong, Charlie Greenbacker, and Ivanka Zhuo Li
CISC 889: Applications of NLP
11 April 2011

Task Description

Our assignment is essentially to implement a "Did you mean?" feature for a web search engine. The task can be divided into three steps: (1) identification of errors, (2) generation of suggested corrections, and (3) assignment of probabilities to the list of suggestions. Input is given as a user-generated query, and output will be provided as a list of possible replacement queries with assigned probabilities. A sample dataset based on a portion of the TREC corpus has been provided by Microsoft, with example user queries given alongside expert-made suggested replacement queries.

The Microsoft Research Speller Challenge, upon which this assignment is based, is concerned with more than just simple spelling errors, as it also incorporates some limited query reformulation (e.g., multiple valid spellings, alternative punctuation). Thus, our list of suggestions should include recommendations of this type as well.

We recognize three broad, overlapping categories of errors that our system will need to detect and suggest corrections for. Typographical errors include mistakes in typing, often introduced via substitution, insertion, deletion, or transposition. Errors of this type are usually of edit distance 1 or 2, and some mistakes are more common than others. Phonetic errors occur when a user attempts to "sound out" the spelling of a word whose correct spelling they do not know. These errors can have a much higher edit distance, but it is often difficult to distinguish a phonetic error from a typo. Finally, cognitive errors are real-word errors in which the user succeeds in spelling a word correctly... but it is unfortunately not the word they intended. Examples include motor-memory errors and homonyms.

We are required to use a noisy channel model for spelling correction. The source model will be a language model (such as an n-gram model) that provides an a priori probability for known words. The channel model will be a model of noise that indicates how likely certain errors are, based on factors such as minimum edit distance or keyboard distance. Together, the two components of the noisy channel model tell us how likely an error is for a given word, or more importantly, how likely a known word is for a given error. The system will be evaluated on Microsoft's training set, using expected F1 as defined in the challenge rules as the evaluation metric. Additionally, systems submitted to the Microsoft challenge will be evaluated on a hidden Bing test dataset.

Data Analysis

We divided the errors in the dataset into three groups for identification and analysis: typographical errors, phonetic spelling errors, and cognitive errors.

Typographical Errors

Typographical errors include any spelling error caused by a typing mistake, and are generally divided into four categories: substitutions, insertions, deletions, and transpositions. To avoid confusion with cognitive errors (including so-called "real word" errors), we limited our search for typographical errors to instances where the error word was not found in WordNet. To avoid uncommon proper nouns (which we would have very little chance of generating proper corrections for), this analysis was also limited to instances where the suggested correction was found in WordNet.
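This filtering step could be implemented along the following lines; the helper names below are our own, and the sketch assumes NLTK with the WordNet corpus installed rather than any tool prescribed by the challenge:

    # Sketch: pick out typo candidates from (error, suggestion) pairs.
    from nltk.corpus import wordnet
    from nltk.metrics.distance import edit_distance

    def in_wordnet(word):
        """True if WordNet lists at least one synset for the word."""
        return len(wordnet.synsets(word)) > 0

    def is_typo_candidate(error_word, suggested_word):
        """Error word absent from WordNet but suggestion present: likely a typo."""
        return (not in_wordnet(error_word)) and in_wordnet(suggested_word)

    pairs = [("practioner", "practitioner"), ("retiremt", "retirement")]
    for err, sugg in pairs:
        if is_typo_candidate(err, sugg):
            print(err, "->", sugg, "edit distance:", edit_distance(err, sugg))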
Out of the 311 queries in which an error was detected, 65 queries included a typo. Two of these queries received two different suggested corrections for the same error word, bringing the total number of typo/correction pairs to 67. Of these 67 typos, 63 (94%) were of edit distance 1, 4 were of edit distance 2, and none were of edit distance 3 or greater. The four typos of edit distance 2 were as follows:

    (542,1,4) practioner => (542,2,4) practitioner [two deletions]
    (2078,1,1) higgins => (2078,2,1) higginson [two deletions]
    (2953,1,3) retiremt => (2953,2,3) retirement [two deletions]
    (5063,1,1) licens => (5063,3,1) licence [substitution + deletion]

Percentage of typos by category:

    transpositions    6    8%
    substitutions     5    7%
    deletions        49   69%
    insertions       11   16%

Unfortunately, there were far too few instances of substitution, insertion, and transposition to do a proper statistical analysis of letter distribution among the errors. There were, however, sufficient examples of deletion to offer some interesting insights into this category. Here is the breakdown of the most common contexts for deletion typos (single instances omitted for space). A dollar sign ('$') denotes the end of a word; in the "both sides" row the deleted letter is the middle letter, in the "left context" row it is the second letter, and in the "right context" row it is the first:

    both sides:     RNM: 4   NIA: 3   LLI: 2   SSI: 2   TIE: 2   TTE: 2
    left context:   RN: 4    LL: 3    NI: 3    TT: 3    RE: 2    SS: 2    TI: 2
    right context:  NM: 4    IA: 3    E$: 2    IE: 2    LI: 2    SI: 2    TE: 2

Looking at the individual letters that were deleted offers additional insight. Here we see a frequency distribution of deleted letters, paired with their usage frequency in English (based on figures from the Cornell University Math Explorer's Project, via http://en.wikipedia.org/wiki/Letter_frequency), to get a sense of just how common each deletion is:

    [Figure: bar chart comparing the frequency of each deleted letter (N, I, T, E, A, L, S, C, H, R, U, Y) with that letter's overall frequency in English; y-axis 0-20%.]
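Counts like the deletion contexts above can be collected directly from the typo/correction pairs. The routine below is a minimal sketch of our own (the example words are generic illustrations, not items from the dataset) that aligns an error word against its correction to find a single deleted letter and its neighbors:

    from collections import Counter

    def deletion_context(error, correction):
        """If error is correction with exactly one letter deleted,
        return (left, deleted, right); otherwise return None."""
        if len(error) != len(correction) - 1:
            return None
        for i in range(len(correction)):
            if error[:i] + correction[i] + error[i:] == correction:
                left = correction[i - 1] if i > 0 else "^"
                right = correction[i + 1] if i + 1 < len(correction) else "$"
                return (left, correction[i], right)
        return None

    contexts = Counter()
    for err, corr in [("goverment", "government"), ("comission", "commission")]:
        ctx = deletion_context(err, corr)
        if ctx:
            contexts[ctx] += 1
    print(contexts.most_common())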
Breakdown of Data

We divided 300 of the queries that had a suggestion different from the original into the following categories. See below for further analysis of the tokenization and form-of-word categories.

1. Words with legitimate alternative spellings (11). Microsoft wants us to suggest both.
   a. British/American (8)
   b. Other (3)
2. Non-English (4). Specifically Spanish.
3. Acronyms with and without periods (63). Microsoft wants us to give some acronyms both with and without periods. The list of acronyms they do want periods for includes "U.S." and U.S. state names. One example of something for which they don't ask for periods is "fema" (for Federal Emergency Management Agency).
4. Abbreviation variations (3). E.g., "gov" and "govt" are suggested for "gov" (as in "government").
5. Repeated words (2). One was a misspelled word, the other just an accidental repetition.
6. White space issues (38). E.g., "healthcare" vs. "health care". For either of those forms, Microsoft will suggest both.
   a. Space present (12).
   b. Space absent (26).
   Both of these can be either cognitive errors or typos. For most of the examples, the correct usage would not be 100% clear to a native speaker. Microsoft wants all the alternatives.
7. Punctuation errors/variants (54). E.g., variations of "online" would be "online" and "on-line". We group whitespace and punctuation errors together as "tokenization" errors. Because Microsoft replaces punctuation with whitespace, it is not always possible to tell whether the original had a dash or a space, but often it is clear.
   a. Punctuation present (26).
   b. Punctuation absent (11).
   c. Missing apostrophe in an intended " 's " (17), which is worth its own category. This error likely occurs so often because people intend the query engine to correct their query!
8. Form-of-word variant errors (FOWVE) (50), as when the suggestions are other forms of the query word, such as "statistic" and "statistics". Some of them look to be cognitive errors, such as "william van regenmorter crime victim rights act" when the act is really called "william van regenmorter crime victim s rights act". Some of them look to be typos, such as "fact on the sun", where it was only accidental that the typo resulted in a real word. Others really look correct, and probably don't need suggestions, but Microsoft gives some anyway: "florida bankruptcy court" for "florida bankruptcy courts".
   a. Cognitive (16).
   b. Typos (18).
   c. Not really an error? (16).
   For many of the examples, it is not possible to tell definitively whether the error was a typo or a cognitive error.

The remaining errors did not fall into the above categories:

9. Other cognitive errors (34). This category includes phonetic errors ("health insurance for children with low income familys") and homonyms ("can you take amitriptyline while your pregnant").
10. Other typos (49). These queries were clearly mistyped, such as "Louisiana diaster unemployment" for "disaster".

    [Figure: distribution of errors over the categories above: British/American, other alternative spellings, non-English, acronym periods, abbreviation variations, repeated words, white space present, white space absent, punctuation present, punctuation absent, missing 's, FOWVE - cognitive, FOWVE - typos, FOWVE - not an error?, other cognitive, other typos.]

Cognitive Errors

Cognitive errors include query errors that are "conceptually wrong": the query word may be a valid word but is not semantically correct, or the experts suggest a replacement because of a habitual variant form.

Pure Tokenization Errors

The first component of cognitive errors is tokenization errors, i.e., errors caused by the insertion of punctuation or spaces. Punctuation is replaced with whitespace during TREC preprocessing and therefore yields a different tokenization. The punctuation marks involved include "'", "-", ".", etc. The following are examples of a punctuation error and a space tokenization error:

Punctuation errors:

    (107, 1, 3): childrens magazines => children s magazines
    (309, 1, 3): my skin wont tan => my skin won t tan

Space tokenization errors:

    (45, 1, 2): calif high way patrol => calif highway patrol
    (301, 1, 6): zip codes at the uspostoffice => zip codes at the us post office

There are in total 141 queries with a different tokenization; however, only 111 of them (35.69% of the 311 errors) are classified as pure tokenization errors. An error is classified as a tokenization error only if neither the form nor the semantics of the word is changed by the different tokenization. Typical tokenization examples are:

    (62, 1, 1): power point presentations about the construction industry => powerpoint presentations about the construction industry
    (63, 1, 1): childrens hospital oklahoma city 39th streeet => children s hospital oklahoma city 39th streeet / children s hospital oklahoma city 39th street

The following is an example of a tokenization variant that should instead be labeled a word-variant error:

    (19, 1, 5): how to establish a survivor s group => how to establish a survivors group

Examples of tokenization changes that alter the semantics and hence are not considered tokenization errors:

    (35, 1, 8): behavior centers for trisomy 21 down syndrome teen agers => behavior centers for trisomy 21 down syndrome teenagers / behavior centers for trisomy 21 down s syndrome teenagers
    (231, 1, 4): importing bone products in to the united states => importing bone products into the united states

The average edit distance of tokenization errors is 1.1: out of the 111 tokenization errors, 100 have edit distance 1 and 11 have edit distance 2.
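The surface part of this classification, checking whether a query and a suggestion differ only in where the spaces fall, is straightforward to automate; deciding whether the semantics change (as in the "survivor s group" example) still requires separate judgment. A minimal sketch of our own for the surface check:

    import re

    def same_ignoring_whitespace(query, suggestion):
        """True if the query and the suggestion differ only in spacing
        (in the TREC data, punctuation has already been mapped to spaces)."""
        collapse = lambda s: re.sub(r"\s+", "", s.lower())
        return collapse(query) == collapse(suggestion)

    print(same_ignoring_whitespace("calif high way patrol",
                                   "calif highway patrol"))            # True
    print(same_ignoring_whitespace("louisiana diaster unemployment",
                                   "louisiana disaster unemployment")) # False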
Form of Word Variant Errors

Errors involving the wrong form of a word are classified as form-of-word variant errors (FOWVE). Again, these types of errors are unlikely to be detected with a dictionary; our initial idea is to use an n-gram language model in such cases. Among the 311 errors in the TREC data, there are in total 49 such errors, making up 15.76% of the total. Variant forms of the same word include noun/adjective/verb transformations, plural/singular transformations, special cases, and also disagreement between different people on abbreviations. Typical FOWVEs are shown below:

    (4, 1, 2): family education rights and privacy act => family educational rights and privacy act
    (102, 1, 2): environmental affect of dredging the c d canal => environmental effect of dredging the c d canal

An example of a FOWVE caused by disagreement over abbreviations:

    (203, 1, 6): how to apply or a gov grant => how to apply for a govt grant

A FOWVE may produce a different number of tokens in the suggestion, but such cases should not be considered pure tokenization errors. A large number of FOWVEs involve -'s at the end of a word, which causes a different tokenization; in the cases where -s and -'s have different semantic meanings, the query is labeled as a wrong-form-of-word error, for example:

    (19, 1, 5): how to establish a survivor s group => how to establish a survivors group

Form-of-word variant errors have an average edit distance of 1: of the 49 such errors, 30 are -s vs. -'s problems, which have an edit distance of only 1; abbreviation disagreements also average an edit distance of 1; and plural and singular forms generally differ by an edit distance of 1. These statistics show that edit distance can be helpful in ranking the likelihood of corrections suggested by the n-gram model.

Additional Cognitive Errors

So far, one of the key characteristics of cognitive errors has been that they cannot easily be detected with a dictionary but are strongly suggested by an n-gram model. This situation arises especially with people's names, place names, multiple languages, and motor-memory errors. It should be noted that there is a large overlap between cognitive errors, phonetic errors, and typing errors. The average edit distance of such errors is 1.4, and the maximum distance is only 2, which suggests that these errors mostly involve phonetically similar characters in different words. The following is a simple graph comparing the average edit distances of the three categories discussed above:

    [Figure: average edit distance of cognitive errors by category; y-axis 0-1.5.]

Initial Solution Ideas

We are planning to submit an entry to the Microsoft Research Speller Challenge, which means we should follow the challenge guidelines and implement a REST-based web service interface to our system. Our initial idea for a system to complete this task is as follows. The system will accept lines of text as input and distribute this input to three separate modules. Each module is responsible for identifying one type of error (typographical, phonetic, or cognitive) and generating a list of possible corrections. These suggestions will be pooled together in a common list, ranked and assigned probabilities accordingly, and returned in the Speller Challenge output format.
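One way this pipeline could be organized is sketched below; the module names and interfaces are placeholders of our own, not a committed design:

    # Sketch of the proposed three-module pipeline; interfaces are placeholders.

    def typo_module(query):        # typographical errors (edit/keyboard distance)
        return []                  # -> list of (suggestion, raw_score) pairs

    def phonetic_module(query):    # phonetic errors (e.g., Soundex-style matching)
        return []

    def cognitive_module(query):   # cognitive/real-word errors (n-gram context)
        return []

    def correct(query):
        """Pool suggestions from all modules, normalize scores to probabilities,
        and return (suggestion, probability) pairs, best first."""
        pooled = {}
        for module in (typo_module, phonetic_module, cognitive_module):
            for suggestion, score in module(query):
                pooled[suggestion] = pooled.get(suggestion, 0.0) + score
        pooled.setdefault(query, 1.0)  # always keep the original query as a candidate
        total = sum(pooled.values())
        return sorted(((s, p / total) for s, p in pooled.items()),
                      key=lambda x: -x[1])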
We expect to use a wide assortment of techniques to identify errors and to generate and weight corrections, including dictionary-based approaches, minimum edit distance, semantic similarity metrics, n-gram models, and Soundex/SPEEDCOP. For the design phase, each of us will individually come up with a plan to identify instances of, and generate suggested corrections for, our particular class of errors (typographical, phonetic, or cognitive). Then we will meet to exchange ideas and to discuss common concerns, such as how to assign probabilities to the combined list of suggestions.
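To make the probability-assignment step concrete, the following is a minimal sketch of noisy channel scoring under simplifying assumptions of our own: a toy unigram source model and a channel model that simply decays with edit distance. The real system would substitute properly trained models for both:

    from collections import Counter
    from nltk.metrics.distance import edit_distance

    # Toy source model: unigram counts stand in for a real language model.
    unigram_counts = Counter({"retirement": 120, "requirement": 95, "retiree": 30})
    total = sum(unigram_counts.values())

    def source_prob(word):
        """P(word) with add-one smoothing over the toy vocabulary."""
        return (unigram_counts[word] + 1) / (total + len(unigram_counts) + 1)

    def channel_prob(observed, candidate, alpha=0.05):
        """P(observed | candidate), decaying exponentially with edit distance."""
        return alpha ** edit_distance(observed, candidate)

    def rank_candidates(observed, candidates):
        """Score candidates by P(candidate) * P(observed | candidate), then normalize."""
        scores = {c: source_prob(c) * channel_prob(observed, c) for c in candidates}
        z = sum(scores.values())
        return sorted(((c, s / z) for c, s in scores.items()), key=lambda x: -x[1])

    print(rank_candidates("retiremt", ["retirement", "requirement", "retiree"]))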