Detecting Collocation Errors in English Language Learners

Detecting collocation errors in English Language Learners’ writing
Yoko Futagi
Educational Testing Service
ECOLT
October 29, 2010
Outline
• Motivation and goal
• System description
• Summary
Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.
What are collocations?
• “A sequence of words or terms which co-occur more
often than would be expected by chance” (Wikipedia)
• Examples:
– hold a meeting, not clench a meeting
(note: clench teeth)
– powerful computer rather than strong computer
(note: strong tea rather than powerful tea)
• Our working definition:
– Combinations of words that are frequent “enough”
Why are collocations important?
• Collocation affects fluency; poor use of collocation can
disrupt communication (Howarth 1998; Martynska 2004;
Wray and Perkins 2000, etc.).
• Goal: Development of a tool which detects collocation
errors and suggests alternatives (“collocation tool”)
Approaches (1)
• Manual construction - dictionaries
• Corpus-based
– Manual tagging of errors
– Aligned bilingual corpora (machine learning)
– Large-scale corpora
Approaches (2)
• Wible et al. (2002)
– Constructed a learner corpus that was partially manually
annotated
• Chan & Liou (2005)
– System to teach English collocation using aligned bilingual corpora
(TotalRecall™)
• Microsoft U.S. patent 7031911
– Uses aligned corpora to construct miscollocation database
• Seretan et al. (2005)
– Uses the web to look up collocates of a given word
• Lin (1998)
– Uses a broad coverage parser and mutual information statistic;
online collocation checker
• Shei & Pain (2000)
– Focuses on V-N pattern
– Similar strategy to our system, but involves manual checking of
errors.
Tool Design
Student essay
→ Candidate string extraction (POS tagger, pattern match)
→ Spell checking: correctly spelled?
   – No: omit string
   – Yes: continue
→ Inflection and article variation
→ Find synonyms
→ Reference database lookup
→ Decision-making algorithm
→ OK/ERROR
(Spell checking and inflection/article variation are there to compensate for common learner errors.)
Common learner errors:
Misspellings and inflection/article errors
• They can degrade the performance of the part-of-speech
(“POS”) tagger (software that goes through a text
and tags each word with an appropriate part of speech,
such as “singular noun” or “past-tense verb”)
• N-grams (strings of words) containing them, especially
misspellings, can’t be found in a database built from
well-formed texts (this can cause false error flags)
→ Misspelled strings are currently omitted.
Find collocation candidates:
Target patterns
• Adjective-Noun
– strong tea, powerful computer
• Noun-Noun
– bee hive, house arrest
• Noun-of-Noun
– a swarm of bees
• Verb-Object Noun
– throw a party, reject an appeal
• Verb-Adverb/Adverb-Verb
– argue strenuously
• Phrasal verb
– turn off the light, grow up
Find collocation candidates:
The process
• Part-of-speech (“POS”) tagger: software that goes through a text and tags
each word with an appropriate part of speech, such as “singular noun” or
“past-tense verb”
Example:
I used to have a very strict sckedule. I used to separate the
leiser time and work time. I would like to live with a person
who also likes the same things.
After POS tagging:
I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ
sckedule_NN ._. I_PRP used_VBD to_TO separate_VB
the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
I_PRP would_MD like_VB to_TO live_VB with_IN a_DT
person_NN who_WP also_RB likes_VBZ the_DT same_JJ
things_NNS ._.
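The `word_TAG` notation in the tagged example can be turned back into (word, tag) pairs for later processing. The following is a minimal sketch, not the tool's actual code; the function name `parse_tagged` is hypothetical.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST underscore, so the tag is always
        # the final field, even for the punctuation token "._."
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


tagged = "a_DT very_RB strict_JJ sckedule_NN ._."
print(parse_tagged(tagged))
```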
Find collocation candidates:
The process, cont’d
• The tool then scans the text and looks for patterns of
parts-of-speech that match the target syntactic patterns
I_PRP used_VBD to_TO have_VB a_DT very_RB
strict_JJ sckedule_NN ._. I_PRP used_VBD to_TO
separate_VB the_DT leiser_NN time_NN and_CC
work_NN time_NN ._. I_PRP would_MD like_VB to_TO
live_VB with_IN a_DT person_NN who_WP also_RB
likes_VBZ the_DT same_JJ things_NNS ._.
Find collocation candidates:
The process, cont’d
• While scanning the text, words which typically do not
participate in collocations, such as numbers, pronouns,
thing, also, always, etc., are ignored to prevent picking
up false candidates
• Pattern matches in the example:
– have_VB a_DT very_RB strict_JJ sckedule_NN
– strict_JJ sckedule_NN
– separate_VB the_DT leiser_NN time_NN
– leiser_NN time_NN
– work_NN time_NN
– likes_VBZ the_DT same_JJ things_NNS (ignored: contains things)
– also_RB likes_VBZ (ignored: contains also)
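The scanning step can be sketched as below. This is an illustrative approximation, not the tool's implementation: it covers only the Adjective-Noun, Noun-Noun, and Verb-Object patterns, and the names `IGNORED`, `MODIFIER_TAGS`, and `extract_candidates` are assumptions introduced for the example.

```python
# Illustrative subset of the ignore list mentioned on the slide.
IGNORED = {"thing", "things", "also", "always"}
# Tags allowed between a verb and its object noun (an assumption).
MODIFIER_TAGS = {"DT", "RB", "JJ"}

TAGGED = (
    "I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._. "
    "I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC "
    "work_NN time_NN ._. I_PRP would_MD like_VB to_TO live_VB with_IN a_DT "
    "person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._."
)


def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(tok.rpartition("_")[::2]) for tok in text.split()]


def extract_candidates(pairs):
    """Return candidate strings matching three simplified target patterns."""
    candidates = []
    n = len(pairs)
    for i, (word, tag) in enumerate(pairs):
        if word.lower() in IGNORED:
            continue
        # Adjective-Noun and Noun-Noun: an adjacent pair ending in a noun.
        if (tag == "JJ" or tag.startswith("NN")) and i + 1 < n:
            next_word, next_tag = pairs[i + 1]
            if next_tag.startswith("NN") and next_word.lower() not in IGNORED:
                candidates.append(f"{word} {next_word}")
        # Verb-Object: verb, optional determiner/modifiers, then noun(s).
        if tag.startswith("VB"):
            j = i + 1
            while j < n and pairs[j][1] in MODIFIER_TAGS:
                j += 1
            if j < n and pairs[j][1].startswith("NN"):
                # Extend through a noun compound such as "leiser time".
                while j + 1 < n and pairs[j + 1][1].startswith("NN"):
                    j += 1
                span = [w for w, _ in pairs[i:j + 1]]
                if not any(w.lower() in IGNORED for w in span):
                    candidates.append(" ".join(span))
    return candidates


for candidate in extract_candidates(parse_tagged(TAGGED)):
    print(candidate)
```

On the example essay this yields the five candidates that survive the ignore list; the two strings containing "things" and "also" are skipped.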
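Spell Checking
• Misspelled candidate strings are thrown out, since they
would not be found in the database
– have_VB a_DT very_RB strict_JJ sckedule_NN (omitted: sckedule is misspelled)
– strict_JJ sckedule_NN (omitted: sckedule is misspelled)
– separate_VB the_DT leiser_NN time_NN (omitted: leiser is misspelled)
– leiser_NN time_NN (omitted: leiser is misspelled)
– work_NN time_NN (kept)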
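A minimal sketch of this filter follows. The word list `LEXICON` is a tiny hypothetical stand-in for a real spelling dictionary, and `correctly_spelled` is an assumed helper name; the actual spell checker used by the tool is not described in the slides.

```python
# Hypothetical mini-lexicon; a real system would use a full dictionary.
LEXICON = {"have", "a", "very", "strict", "separate", "the", "time", "work"}


def correctly_spelled(candidate, lexicon=LEXICON):
    """True only if every word of the candidate string is in the lexicon."""
    return all(word.lower() in lexicon for word in candidate.split())


candidates = [
    "have a very strict sckedule",
    "strict sckedule",
    "separate the leiser time",
    "leiser time",
    "work time",
]

# Candidates containing a misspelled word are omitted entirely.
kept = [c for c in candidates if correctly_spelled(c)]
print(kept)
```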
Generate morphology and article variations
• If there is an article or inflection error, the string would not be found;
this is especially problematic for non-native speakers’ writing.
• Solution: create variants of the original string by:
– varying the article (a/an/the/Ø for singular, the/Ø for plural):
original: leaving city
leaving a city
leaving the city
– varying verb or noun inflection:
original: pick apples
picked apples
picks apples
picking apples
etc.
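Variant generation can be sketched as follows. This is a rough illustration under stated assumptions: the helper names are hypothetical, the choice between "a" and "an" uses a first-letter heuristic, and the inflection rules handle only regular verbs (a real system would need a morphological lexicon for irregular forms such as "leave/left").

```python
def article_variants(head, noun, plural=False):
    """Vary the article before the noun: Ø/a(n)/the (singular), Ø/the (plural)."""
    if plural:
        articles = ["", "the"]
    else:
        # Rough heuristic: "an" before a vowel letter, otherwise "a".
        indefinite = "an" if noun[0].lower() in "aeiou" else "a"
        articles = ["", indefinite, "the"]
    return [" ".join(p for p in (head, art, noun) if p) for art in articles]


def verb_variants(base):
    """Naive regular inflections of a base-form verb (no irregular verbs)."""
    stem = base[:-1] if base.endswith("e") else base
    return [base, base + "s", stem + "ed", stem + "ing"]


print(article_variants("leaving", "city"))
print([f"{v} apples" for v in verb_variants("pick")])
```

Each variant would then be looked up alongside the original string, so that an article or inflection slip alone does not cause a false error flag.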
Find synonyms
Powerful and strong are synonyms.
and...
Strong tea occurs much more frequently than powerful tea
↓
We can say with confidence that strong tea is a collocation.
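The synonym comparison can be illustrated with a small sketch. The synonym table and the counts below are made-up toy values for illustration only, not data from the reference database, and `best_alternative` is a hypothetical helper. Raw counts are used here only to keep the sketch short.

```python
# Toy synonym table and toy frequencies (illustrative values only).
SYNONYMS = {"powerful": ["strong"], "strong": ["powerful"]}
COUNTS = {
    "strong tea": 950,
    "powerful tea": 3,
    "powerful computer": 800,
    "strong computer": 5,
}


def best_alternative(modifier, noun, synonyms=SYNONYMS, counts=COUNTS):
    """Return the most frequent string among the original and its synonym swaps."""
    options = [modifier] + synonyms.get(modifier, [])
    strings = [f"{option} {noun}" for option in options]
    # Unseen strings count as zero occurrences.
    return max(strings, key=lambda s: counts.get(s, 0))


print(best_alternative("powerful", "tea"))
print(best_alternative("strong", "computer"))
```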
Look up in the reference database
• The larger the corpus the database is built from, the
better the coverage.
• Generally speaking, raw counts are not used to measure
the collocational “strength” of a string.
→ This is because some words are simply more
common than others.
• Instead, word-association statistics are used.
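The slides do not say which word-association statistic the tool uses. As one common example, pointwise mutual information (PMI) compares how often two words co-occur with how often they would co-occur by chance. The sketch below uses made-up toy counts purely to show why a raw count can be misleading.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2((pair_count * total) / (x_count * y_count))


# Toy corpus statistics (illustrative values only, not real data).
TOTAL = 1_000_000
the_tea = pmi(pair_count=40, x_count=60_000, y_count=200, total=TOTAL)
strong_tea = pmi(pair_count=20, x_count=500, y_count=200, total=TOTAL)

# "the tea" has the higher raw count, but only because "the" is so common;
# "strong tea" has the higher association score.
print(f"the tea:    count=40, PMI={the_tea:.2f}")
print(f"strong tea: count=20, PMI={strong_tea:.2f}")
```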
Evaluation
• Procedure
– 2000+ candidate strings extracted from 300 randomly
selected TOEFL essays, with OK or ERROR decision
by the collocation tool
– 2 native speakers (“annotators”) were asked to mark
each string with one of the following judgments:
• 3 = “This sounds natural”
• 2 = “Not so good, but not impossible”
• 1 = “This is really unnatural; I would never say it like this”.
– Annotators agreed on 1020 strings.
Agreement between raters and collocation tool

                  Precision   Recall
ERROR judgment    0.24        0.96
OK judgment       0.99        0.61

Precision and recall, shown for the OK judgment:

\[
\text{Precision} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK detected by the tool}\,|}
\]

\[
\text{Recall} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK recognized by the annotator}\,|}
\]
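The set-based definitions of precision and recall can be computed directly from two parallel lists of judgments. The following is a minimal sketch; the function name and the five toy labels are assumptions for illustration and are unrelated to the evaluation data.

```python
def precision_recall(tool_labels, annotator_labels, target):
    """Precision and recall of the tool for one judgment (e.g. "OK")."""
    tool = {i for i, label in enumerate(tool_labels) if label == target}
    gold = {i for i, label in enumerate(annotator_labels) if label == target}
    overlap = len(tool & gold)  # strings both the tool and annotator marked as target
    precision = overlap / len(tool) if tool else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Toy judgments for five candidate strings (illustrative only).
tool_labels = ["OK", "OK", "ERROR", "ERROR", "OK"]
annotator_labels = ["OK", "OK", "OK", "ERROR", "OK"]

print(precision_recall(tool_labels, annotator_labels, "OK"))
print(precision_recall(tool_labels, annotator_labels, "ERROR"))
```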
Sources of tool errors
• Tool errors:
– POS tagger errors
– Pattern-matching errors
– Reference database coverage
• Writer errors:
– Misspellings resulting in real words (e.g. by spelled as
buy)
– Grammar errors
– Incomprehensible “sentences”
Future plans
• Decrease tool errors
– Try a more robust POS-tagger
– Try using a syntactic parser
– Increase database size
• Improve tool speed
– Eliminate the need for generating inflection and article
variations → build a special database
• Improve correction suggestions
Acknowledgments
• Martin Chodorow
• Paul Deane
• Sarah Ohls
• Joel Tetreault
• Cathy Trapani
• Vincent Weng
• Waverely vanWinkle
Thank you.