Annotating ESL Errors: Challenges and Rewards
Alla Rozovskaya and Dan Roth
University of Illinois at Urbana-Champaign
NAACL-HLT BEA-5 2010
Los Angeles, CA
Page 1
Annotating a corpus of English as a Second Language (ESL) writing: Motivation




Many non-native English speakers
ESL learners make a variety of mistakes in grammar and usage
Conventional proofing tools do not detect many ESL mistakes
– target native English speakers and do not address many
mistakes of ESL writers
We are not restricting ourselves to ESL mistakes
Page 2
Goals

Developing automated techniques for detecting and
correcting context-sensitive mistakes

Paving the way for better proofing tools for ESL writers


E.g., providing instructional feedback
Developing automated scoring techniques

E.g., automated evaluation of student essays
Annotation is an important part of that process
Page 3
Annotating ESL errors: a hard problem

A sentence usually contains multiple errors


In Western countries prisson conditions are
more better than in Russia , and this fact
helps to change criminals in better way of life .
Not always clear how to mark the type of a mistake
Original: “…which reflect a traditional female role and a traditional attitude to a woman…”
Corrected: “…which reflect a traditional female role and a traditional attitude towards women…”

Possible markings:
a woman*/women (one correction)
a*/<NONE> woman*/women (two corrections)
Page 4
Annotating ESL errors: a hard problem

Distinction between acceptable/unacceptable usage is fuzzy

Women were indignant at inequality from men.
Women were indignant at the inequality from men.
Page 5
Common ESL mistakes

English as a Second Language (ESL) mistakes

Mistakes involving prepositions


We even do good to*/for other people <NONE>*/by
spending money on this and asking <NONE>*/for
nothing in return.
Mistakes involving articles

The main idea of their speeches is that a*/the
romantic period of music was too short.

Laziness is the engine of the*/<NONE> progress.

Do you think anyone will help you? There are not
many people who are willing to give their*/a
hands*/hand.
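
The examples on this slide appear to use a `learner*/correction` notation, with `<NONE>` standing for an absent word. A minimal sketch of reading that notation, assuming each correction is a single token pair; the function name `split_versions` and the regular expression are illustrative and are not part of the authors' annotation tool:

```python
import re

# One pair "learner*/correction"; "<NONE>" marks an absent word.
# Assumption: corrections are single word tokens, so trailing
# punctuation such as "." stays outside the pair.
PAIR = re.compile(r"(\S+?)\*/(<NONE>|[\w'-]+)")

def split_versions(annotated):
    """Return (learner_text, corrected_text) for one annotated sentence."""
    def pick(group):
        def repl(match):
            word = match.group(group)
            return "" if word == "<NONE>" else word
        # Collapse the double spaces left behind by removed <NONE> tokens.
        return " ".join(PAIR.sub(repl, annotated).split())
    return pick(1), pick(2)

learner, corrected = split_versions(
    "Laziness is the engine of the*/<NONE> progress."
)
print(learner)    # Laziness is the engine of the progress.
print(corrected)  # Laziness is the engine of progress.
```

Keeping both versions side by side makes it easy to check an annotated example against the learner's original wording.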
Page 6
Purpose of the annotation

To have a gold standard set for the development and evaluation of an
automated system that corrects ESL mistakes

There is currently no gold standard data set available for researchers

Systems are evaluated on different data sets – performance comparison
across different systems is hard



Results depend on the source language of the speakers and proficiency level
The annotation of this corpus is available and can be used by researchers who
gain access to the ICLE and the CLEC corpora.
This corpus is used in the experiments described in [Rozovskaya and Roth,
NAACL, ’10]
Page 7
Outline


Annotating ESL mistakes: Motivation
Annotation








Data selection
Annotation procedure
Error classification
Annotation tool
Annotation statistics
Statistics on article corrections
Statistics on preposition corrections
Inter-annotator agreement
Page 8
Annotation: Overview


Annotated a corpus of ESL sentences (63K words)
Extracted from two corpora of ESL essays:





International Corpus of Learner English (ICLE) [Granger et al., ’02]
Chinese Learner English Corpus (CLEC) [Gui and Yang, ’03]
Sentences written by ESL students of 9 first language
backgrounds
Each sentence is fully corrected and error tagged
Annotated by native English speakers
Page 9
Annotation: focus of the annotation

Focus of the annotation: Mistakes in article and preposition
usage

These mistakes have been shown to be very common among
learners of different first language backgrounds [Dagneaux et al., ’98;
Gamon et al., ’08; Tetreault et al., ’08; others]
Page 10
Annotation: data selection

Sentences for annotation extracted from two corpora of ESL essays

International Corpus of Learner English (ICLE)
Essays by advanced learners of English
First language backgrounds: Bulgarian, Czech, French, German, Italian, Polish, Russian, Spanish

Chinese Learner English Corpus (CLEC)
Essays by Chinese learners of different proficiency levels

Garbled sentences and sentences with near-native fluency excluded with a 4-gram language model
50% of sentences for annotation randomly sampled from the two corpora
50% of sentences selected manually to collect more preposition errors
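
The slide does not say which toolkit, smoothing method, or cutoffs were used for the language-model filter. A minimal sketch of the idea, assuming an add-one smoothed 4-gram model trained on some reference English text and two hypothetical thresholds (`low`, `high`) on the average log-probability per n-gram; every name below is illustrative:

```python
import math
from collections import Counter

N = 4  # model order, as stated on the slide

def ngrams(tokens, n):
    """Pad a token list and return its n-grams as tuples."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

class AddOneNgramLM:
    """Toy n-gram model with add-one smoothing (an assumption, not the authors' model)."""

    def __init__(self, sentences, n=N):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"</s>"}
        for sent in sentences:
            tokens = sent.lower().split()
            self.vocab.update(tokens)
            for gram in ngrams(tokens, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_logprob(self, sentence):
        """Average natural-log probability per n-gram; higher means more fluent."""
        grams = ngrams(sentence.lower().split(), self.n)
        v = len(self.vocab) + 1  # one extra slot for unseen words
        total = sum(
            math.log((self.ngram_counts[g] + 1) / (self.context_counts[g[:-1]] + v))
            for g in grams
        )
        return total / len(grams)

def select_for_annotation(sentences, lm, low, high):
    """Drop sentences scoring below `low` (garbled) or above `high` (near-native)."""
    return [s for s in sentences if low <= lm.avg_logprob(s) <= high]

reference = ["the cat sat on the mat", "the dog sat on the rug"]
lm = AddOneNgramLM(reference)
candidates = ["the cat sat on the mat", "mat zzz the qqq"]
kept = select_for_annotation(candidates, lm, low=-2.0, high=0.0)
print(kept)
```

In practice the two cutoffs would have to be tuned on held-out sentences; the values here only serve the toy data.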
Page 11
Annotation: procedure

Annotation performed by three native English speakers




Graduate and undergraduate students in Linguistics/foreign languages
With previous experience in natural language annotation
Annotation performed at the sentence level – all errors in the
sentence are corrected and tagged
The annotators were encouraged to propose multiple
alternative corrections

Useful for the evaluation of an automated error correction system

“They contribute money to/towards the building of hospitals”
Page 12
Annotation: error classification


Focus of the annotation: mistakes in article and preposition
usage
Error classification (inspired by [Tetreault and Chodorow, ’08])

developed with the focus on article and preposition errors
was intended to give a general idea about the types of mistakes ESL students make

Original: “…which reflect a traditional female role and a traditional attitude to a woman…”
Annotated: “…which reflect a traditional female role and a traditional attitude towards a*/<NONE> woman*/women…”
Page 13
Annotation: error classification
Error type                        Example
Article error                     Women were indignant at <NONE>*/the inequality from men.
Preposition error                 …to change their views to*/for the better.
Noun number                       Science is surviving by overcoming the mistakes not by uttering the truths*/truth.
Verb form                         He write*/writes poetry.
Word form                         It is not simply*/simple to make professional army.
Spelling                          …if a person commited*/committed a crime…
Word replacement (lexical error)  There is a probability*/possibility that today’s fantasies will not be fantasies tomorrow.
Page 14
The annotated ESL corpus

Annotating ESL sentences with an annotation tool
[Figure: the annotation tool, with a callout labeled “Sentence for annotation”]
Flexible infrastructure allows for easy adaptation to a different domain
Page 16
Example of an annotated sentence

Before annotation
“This time asks for looking at things with our eyes opened.”

With annotation comments
“This time @period, age, time@ asks $us$ for <to> looking *look* at things with our eyes opened.”

After annotation
“This period asks us to look at things with our eyes opened.”

Annotation rate: 30-40 sentences per hour
Page 17
Annotation statistics
Error type         Percentage
Articles           12.5%
Punctuation        22.5%
Prepositions       17.1%
Verb form          5.2%
Word replacement   28.2%
Word form          2.9%
Noun number        3.0%
Spelling           6.5%
Word order         2.2%
Page 19
Common article and preposition mistakes

Article mistakes

Missing articles
But this , as such , is already <NONE>*/a new subject for discussion .

Extraneous articles
Laziness is the engine of the*/<NONE> progress.

Preposition mistakes

Confusing different prepositions
Education gives a person a better appreciation of*/for such fields as art , literature , history , human relations , and science
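
The three kinds of mistake on this slide (missing, extraneous, confused) can be read off a `learner*/correction` pair by checking which side is `<NONE>`. A small sketch under that assumption; the function name and category labels are illustrative, not the authors' tag set:

```python
from collections import Counter

def correction_type(learner, correction):
    """Classify one learner/correction pair by which side is absent."""
    if learner == "<NONE>":
        return "missing"      # the learner left the word out
    if correction == "<NONE>":
        return "extraneous"   # the learner added a word that should go
    return "confusion"        # one word was used in place of another

# The three pairs from the examples on this slide.
pairs = [("<NONE>", "a"), ("the", "<NONE>"), ("of", "for")]
counts = Counter(correction_type(l, c) for l, c in pairs)
print(dict(counts))
```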
Page 20
Statistics on article corrections
Source language   Errors total   Errors per hundred words
Bulgarian         76             1.2
Chinese           179            1.9
Czech             138            2.1
French            22             0.4
German            23             0.5
Italian           43             0.6
Polish            71             1.5
Russian           271            2.5
Spanish           134            1.7
All               957            1.5
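
The slides do not list word counts per language, but the two columns imply them: words ≈ errors / rate × 100. A quick consistency check under that reading; the rates are rounded to one decimal, so the implied counts are approximate, and the helper name `implied_words` is illustrative:

```python
# (errors total, errors per hundred words) for article corrections, from the table
article_errors = {
    "Bulgarian": (76, 1.2), "Chinese": (179, 1.9), "Czech": (138, 2.1),
    "French": (22, 0.4), "German": (23, 0.5), "Italian": (43, 0.6),
    "Polish": (71, 1.5), "Russian": (271, 2.5), "Spanish": (134, 1.7),
}
ALL_ROW = (957, 1.5)

def implied_words(errors, rate_per_100):
    """Approximate word count implied by an error total and a per-100-words rate."""
    return errors / rate_per_100 * 100

total_errors = sum(errors for errors, _ in article_errors.values())
corpus_words = implied_words(*ALL_ROW)

print("sum of per-language errors:", total_errors)
print("implied corpus size:", round(corpus_words), "words")
for lang, (errors, rate) in article_errors.items():
    print(f"{lang:10s} ~{implied_words(errors, rate):6.0f} words")
```

The per-language totals add up to the "All" row, and the implied corpus size is in line with the 63K words stated on the overview slide.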
Page 21
Distribution of article errors by error type
Not all confusions are equally likely
Errors are dependent on the first language of the writer
[Figure: bar chart, “Distribution of errors by type”, for Chinese, Czech, and Russian writers; categories: Missing the, Missing a, Extr. the, Extr. a, Conf. (a, the); vertical axis from 0 to 60]
Page 22
Statistics on preposition corrections
                                                        Mistakes by error type
Source language   Errors total   Errors per 100 words   Repl.   Ins.   Del.   With orig.
Bulgarian         89             1.4                    58%     22%    11%    8%
Chinese           384            4.1                    52%     24%    22%    2%
Czech             91             1.4                    51%     21%    24%    4%
French            57             1.0                    61%     9%     12%    18%
German            75             1.5                    61%     8%     16%    15%
Italian           120            1.8                    57%     22%    12%    8%
Polish            77             1.7                    49%     18%    16%    17%
Russian           251            2.3                    53%     21%    17%    9%
Spanish           165            2.1                    55%     20%    19%    6%
All               1309           2.1                    54%     21%    18%    7%

Many contexts license multiple prepositions [Tetreault and Chodorow, ’08]
Unlike with articles, preposition confusions account for over 50% of all preposition errors
Page 23
Inter-annotator agreement
Agreement set     Rater      Judged correct   Judged incorrect
Agreement set 1   Rater #2   37               63
Agreement set 1   Rater #3   59               41
Agreement set 2   Rater #1   79               21
Agreement set 2   Rater #3   73               27
Agreement set 3   Rater #1   83               17
Agreement set 3   Rater #2   47               53
Page 24
Inter-annotator agreement
Agreement set     Agreement   Kappa
Agreement set 1   56%         0.16
Agreement set 2   78%         0.40
Agreement set 3   60%         0.23
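
The slides do not state which kappa statistic was used. As a check, assuming Cohen's kappa for two raters and treating each rater's "judged correct" figure from the previous slide as a share of 100, the reported values can be reproduced from the observed agreement; the function and variable names are illustrative:

```python
def cohens_kappa(observed, yes_a, yes_b):
    """Cohen's kappa for two raters making a binary judgment.

    observed: share of items on which the two raters agree
    yes_a, yes_b: share of items each rater judged correct
    """
    # Agreement expected by chance if the raters judged independently.
    expected = yes_a * yes_b + (1 - yes_a) * (1 - yes_b)
    return (observed - expected) / (1 - expected)

# (observed agreement, first rater judged correct, second rater judged correct)
agreement_sets = {
    "Agreement set 1": (0.56, 0.37, 0.59),
    "Agreement set 2": (0.78, 0.79, 0.73),
    "Agreement set 3": (0.60, 0.83, 0.47),
}

kappas = {
    name: round(cohens_kappa(*values), 2)
    for name, values in agreement_sets.items()
}
for name, kappa in kappas.items():
    print(f"{name}: kappa = {kappa:.2f}")
```

Under these assumptions the results match the slide (0.16, 0.40, 0.23). Chance agreement is high in set 2 because both raters judged most sentences correct, which is why 78% raw agreement yields a kappa of only 0.40.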
Page 25
Conclusions


We presented the annotation of a corpus of ESL sentences
Annotating ESL mistakes is an important but challenging task





Interacting mistakes in a sentence
Fuzzy distinction between acceptable/unacceptable usage
We have described an annotation tool that facilitates the
error-tagging of a corpus of text
The inter-annotator agreement on the task is low and shows
that this is a difficult problem
The annotated data can be used by other researchers for the
evaluation of their systems
Page 26


Annotation tool
ESL annotation
rozovska@illinois.edu
http://L2R.cs.uiuc.edu/~cogcomp/software.php
Thank you!
Questions?
Page 27