Detecting collocation errors in English Language Learners’ writing
Yoko Futagi
Educational Testing Service
ECOLT, October 29, 2010

Outline
• Motivation and goal
• System description
• Summary

Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.

What are collocations?
• “A sequence of words or terms which co-occur more often than would be expected by chance” (Wikipedia)
• Examples:
  – hold a meeting, not clench a meeting (note: clench teeth)
  – powerful computer rather than strong computer (note: strong tea rather than powerful tea)
• Our working definition:
  – Combinations of words that are frequent “enough”

Why are collocations important?
• Collocation affects fluency; poor use of collocation can disrupt communication (Howarth 1998, Martynska 2004, Wray and Perkins 2000, etc.).
• Goal: Development of a tool which detects collocation errors and suggests alternatives (“collocation tool”)

Approaches (1)
• Manual construction: dictionaries
• Corpus-based
  – Manual tagging of errors
  – Aligned bilingual corpora (machine learning)
  – Large-scale corpora

Approaches (2)
• Wible et al. (2002)
  – Constructed a learner corpus that was partially manually annotated
• Chan & Liou (2005)
  – System to teach English collocation using aligned bilingual corpora (TotalRecall™)
• Microsoft U.S. patent 7031911
  – Uses aligned corpora to construct miscollocation database
• Seretan et al. (2005)
  – Uses the web to look up collocates of a given word
• Lin (1998)
  – Uses a broad-coverage parser and mutual information statistic; online collocation checker
• Shei & Pain (2000)
  – Focuses on V-N pattern
  – Similar strategy to our system, but involves manual checking of errors.

Tool Design
Student essay
→ Candidate string extraction (POS tagger, pattern match)
→ Spellchecking: correctly spelled?
  – No: omit string
  – Yes: continue
→ Inflection and article variation
→ Find synonyms
→ Reference database lookup
→ Decision-making algorithm
→ OK/ERROR
(Spellchecking and inflection/article variation are there to compensate for common learner errors.)

Common learner errors: Misspellings and inflection/article errors
• They can degrade the performance of the part-of-speech (“POS”) tagger (software that goes through a text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”)
• N-grams (strings of words) containing them, especially misspellings, can’t be found in a database built from well-formed texts (this can cause false error flags)
Misspelled strings are currently omitted.
Find collocation candidates: Target patterns
• Adjective-Noun
  – strong tea, powerful computer
• Noun-Noun
  – bee hive, house arrest
• Noun-of-Noun
  – a swarm of bees
• Verb-Object Noun
  – throw a party, reject an appeal
• Verb-Adverb/Adverb-Verb
  – argue strenuously
• Phrasal verb
  – turn off the light, grow up

Find collocation candidates: The process
• Part-of-speech (“POS”) tagger (software that goes through a text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”)

Example:
I used to have a very strict sckedule. I used to separate the leiser time and work time. I would like to live with a person who also likes the same things.

After POS tagging:
I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._.
I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.

Find collocation candidates: The process, cont’d
• The tool then scans the text and looks for sequences of part-of-speech tags that match the target syntactic patterns

I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._.
I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.
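
The scan over tagged text can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the pattern table, the stop list, and the function names are all assumptions, and only contiguous two-word patterns are covered.

```python
# Illustrative sketch only: the pattern table, stop list, and function names
# are assumptions, not the collocation tool's actual implementation.

# Each target pattern is a sequence of Penn Treebank tag prefixes,
# so "NN" also matches NNS and "VB" also matches VBD, VBZ, etc.
PATTERNS = {
    "Adjective-Noun": ("JJ", "NN"),
    "Noun-Noun": ("NN", "NN"),
    "Adverb-Verb": ("RB", "VB"),
}

# Hypothetical stop list of words that rarely take part in collocations.
STOP_WORDS = {"thing", "things", "also", "always"}


def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


def find_candidates(tokens):
    """Return (pattern name, string) for every window matching a pattern."""
    found = []
    for start in range(len(tokens)):
        for name, prefixes in PATTERNS.items():
            window = tokens[start:start + len(prefixes)]
            if len(window) != len(prefixes):
                continue  # pattern would run past the end of the text
            if any(word.lower() in STOP_WORDS for word, _ in window):
                continue  # skip windows containing a stop word
            if all(tag.startswith(p) for (_, tag), p in zip(window, prefixes)):
                found.append((name, " ".join(word for word, _ in window)))
    return found


tagged = "I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._."
print(find_candidates(parse_tagged(tagged)))
# prints [('Adjective-Noun', 'strict sckedule')]
```

Patterns that allow intervening words, such as Verb-Object Noun with an article and modifiers between verb and noun, would need optional slots in the pattern table; they are left out here to keep the sketch short.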
Find collocation candidates: The process, cont’d
• While scanning the text, words which typically do not participate in collocations, such as numbers, pronouns, thing, also, always, etc., are ignored to prevent picking up false candidates
  – have_VB a_DT very_RB strict_JJ sckedule_NN
  – strict_JJ sckedule_NN
  – separate_VB the_DT leiser_NN time_NN
  – leiser_NN time_NN
  – work_NN time_NN
  – likes_VBZ the_DT same_JJ things_NNS
  – also_RB likes_VBZ

Spell Checking
• Misspelled candidate strings are thrown out, since they would not be found in the database
  – have_VB a_DT very_RB strict_JJ sckedule_NN
  – strict_JJ sckedule_NN
  – separate_VB the_DT leiser_NN time_NN
  – leiser_NN time_NN
  – work_NN time_NN

Generate morphology and article variations
• If there is an article or inflection error, the string would not be found. This is especially problematic for non-native speakers’ writing.
• Solution: create variants of the original string by:
  – varying the article (a/an/the/Ø for singular, the/Ø for plural):
      original: leaving city
      leaving a city
      leaving the city
  – varying verb or noun inflection:
      original: pick apples
      picked apples
      picks apples
      picking apples
      etc.

Find synonyms
Powerful and strong are synonyms,
and strong tea occurs much more frequently than powerful tea.
↓
We can say with confidence that strong tea is a collocation.

Look up in the reference database
• The larger the corpora the database is created from, the better the coverage.
• Generally speaking, raw counts are not used to measure the collocational “strength” of a string.
  → This is because some words are simply more common than others.
• Instead, word-association statistics are used.
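
The slides do not name the word-association statistic the tool uses. As one illustration of why such a statistic behaves differently from a raw count, the sketch below uses pointwise mutual information (PMI), a common choice in collocation work. The function name and every count are made up for the example.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a word pair.

    All arguments are corpus counts; `total` is the corpus size.
    The name and signature are assumptions for this sketch.
    """
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


# Made-up counts in a toy corpus of 1024 tokens.
# Pair A: seen 8 times; its words are individually rare.
# Pair B: seen 16 times; its words are individually very common.
pair_a = pmi(pair_count=8, w1_count=16, w2_count=32, total=1024)
pair_b = pmi(pair_count=16, w1_count=256, w2_count=256, total=1024)

print(pair_a)  # 4.0
print(pair_b)  # -2.0
```

Pair B has the higher raw count, but its words are so common that they co-occur less often than chance would predict, so its PMI is negative. Pair A is rarer overall but far more strongly associated.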
Evaluation
• Procedure
  – 2000+ candidate strings extracted from 300 randomly selected TOEFL essays, with OK or ERROR decision by the collocation tool
  – 2 native speakers (“annotators”) were asked to mark each string with one of the following judgments:
      3 = “This sounds natural”
      2 = “Not so good, but not impossible”
      1 = “This is really unnatural; I would never say it like this”
  – Annotators agreed on 1020 strings.

Agreement between raters and collocation tool

                  Precision   Recall
ERROR judgment    0.24        0.96
OK judgment       0.99        0.61

Precision and recall, shown for the OK judgment:

\[ \text{Precision} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK detected by the tool}\,|} \]

\[ \text{Recall} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK recognized by the annotator}\,|} \]

Sources of tool errors
• Tool errors:
  – POS tagger errors
  – Pattern-matching errors
  – Reference database coverage
• Writer errors:
  – Misspellings resulting in real words (e.g. by spelled as buy)
  – Grammar errors
  – Incomprehensible “sentences”

Future plans
• Decrease tool errors
  – Try a more robust POS tagger
  – Try using a syntactic parser
  – Increase database size
• Improve tool speed
  – Eliminate the need for generating inflection and article variations → build a special database
• Improve correction suggestions

Acknowledgments
• Martin Chodorow
• Paul Deane
• Sarah Ohls
• Joel Tetreault
• Cathy Trapani
• Vincent Weng
• Waverely vanWinkle

Thank you.