Technology Transfer of
Automated Essay Evaluation:
From NLP research through
deployment as a business
Jill Burstein
Educational Testing Service
Presented to the Computer Science Department
University of Minnesota, Duluth
Fall Colloquium Series
October 4, 2004
Criterion℠, E-rater®, Critique,
C-rater, & more …
Jill Burstein
Claudia Leacock
Thomas Morton
Educational Testing Service
Martin Chodorow
Hunter College, CUNY
Susanne Wolff
Princeton University
Let’s Talk About
Writing & Assessment
Educators’ Vision:
Writing Skill Development
• Master basic skills in K-12
– Grammar, spelling, punctuation, etc.
• Perfect the 5-paragraph essay
– U.S. Concept
– Thesis, 3 Main Points, Conclusion
• Writing within and beyond discipline
– Address different audiences
– Generate various genres: persuasive, compare-and-contrast, research writing within discipline
Evaluation of Vision Through
Writing Assessments
• High stakes: Undergraduate & Graduate
– Admissions
– Placement
• No/Low stakes: K-12
– Statewide and national assessments
– Classroom instruction
What do essays look like?
The Reality of Writing Quality
• Timed Assessments
– Up to 500 words (grade-level dependent)
– Not Literary Essays
!creative
!irony
!metaphor
• Instructional Uses
– Maybe Longer Essays
– Better Quality with Revision Facility
Most essays look like this
“My position on school uniforms is as follows. School uniforms are
a violation on several students rights. School uniforms makes us,
the students, think that we do not have the write to express our
feelings through clothing. Many students show pride through the
clothing that they choose to wear.
For instance some students may be of a different religion. Their
relgion may require them to wear a clothing of such. When the
school forces us to wear a uniform it forces these people to go back
on their religion. wearing differnent types of clothing expresses a
student's inner child, if not more. Through clothing we can see a
students hobbies, joys, and loves of life. Putting uniforms on us
would violate the fact to actually to show an opinion. Does the
school want us to look exactly alike? Some students may not have
an open mind about the fact that they cannot show their youth and
personalty through clothing they will show it in another unhealthy
way.
In closing uniforms are and unjustice act against all students alike."
And others like this…
“You are stupid. You are stupid because you can't read.
You are also stupid becuase you don't speak English
and because you can't add.
Your so stupid, you can't even add! Once, a teacher
give you a very simple math problem; it was 1+1=?.
Now keep in mind that this was in fourth grade, when
you should have known the answer. You said it was 23!
I laughed so hard I almost wet my pants! How much
more stupid can you be?!
So have I proved it? Don't you agree that your the
stupidest person on earth? I mean, you can't read, speak
English, or add. Let's face it, your a moron, no, an idiot,
no, even worse, you're an imbosol.”
And this ….
“I THINK THAT EVERYONE
SHOULD BE ABLE TO WEAR
WHATEVER THE HELL THEY
WANT TO WEAR. “
And this…
“I don't know how to explain this
question because I took a nap
while listening.
Sorry. “
And this …
This is my topic on presidents. The one i will be talking about is Bill
Clinton. He like most of us have fingers, with fingernails. He also has
two arms were his fingers are on, were teh fingernails connect.What
can a arm be without a hand? nothing thats the answer, so he obviously
has two hands in witch the fingers are on, with the fingernails connect
too.
He also has eyes.....EYES o yeah even him the big cheese has eyes,
its weird cuz not many people have eyes but this good president does,
and he can SEE you SEE you, you might not be able to see him , but
he can SEE you.
One time we were on AOL chat and i was talking to him he said
he like to climb trees in his underwear well his arms were covered with
sauce....pizza sauce. He said it makes him feel free and good about
himself. Also in the chat he said he has a pigeon for a pet and the
things name is Frances, he said they like to make bacon together in the
mourning and at night.And they eat at his friends Y place mostly
everyday.
And this …
“Is true that in so many jobs people have to wear
dress codes for so many reazons.Like in the restaurant
the workers are obligaded to use a drees code because
they have to look differently and have a good looking to
impresionate the cutmers.Not necesesary at all the
schools have drees codes but the ones that havet is
because
Arriba todos mis compas ya llego el rey del
cristal, y yo mismo lo cosino para mejor calidad por esos
mismos motivos me busca la federal la dea de estados
unidos tambien me quiere agarar. si ellos me buscan por
tierra yo me les pelo por mar si piensan que ando colima
yo me paseo en michoacan por la ruana y poir tepeque
aguililla y cuancoman ,cuantas libras va a llevarse “
[Descriptive Translation: “It’s a rhyme about a drugrunner … the guy is basically saying that he's the king
of Cristal meth, wanted by the DEA and the Feds. ”]
Human Scoring Algorithm
• essay → human reader score (S1) and human reader score (S2)
• Is |S1 - S2| > 1 ?
– NO: Final Score = mean (of S1 and S2)
– YES: expert human reader score (S3); Final Score = mode, or mean of closest
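The adjudication flow on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `final_score` is hypothetical, and the reading of "mode, or mean of closest" (use the expert score when it matches a reader, otherwise average it with the nearer reader score, and average all three on a tie) is an assumption, not something the slide spells out.

```python
from statistics import mean


def final_score(s1, s2, s3=None):
    """Resolve two reader scores (and, if needed, an expert score) into one."""
    # Readers agree exactly or are adjacent: average the two scores.
    if abs(s1 - s2) <= 1:
        return mean([s1, s2])

    # Readers differ by more than one point: an expert score is required.
    if s3 is None:
        raise ValueError("scores differ by more than 1; expert score S3 needed")

    # "Mode": the expert agrees with one of the two readers.
    if s3 in (s1, s2):
        return s3

    # "Mean of closest" (assumed reading): average S3 with the nearer score.
    d1, d2 = abs(s3 - s1), abs(s3 - s2)
    if d1 == d2:
        # Assumption: on a tie, average all three scores.
        return mean([s1, s2, s3])
    closest = s1 if d1 < d2 else s2
    return mean([s3, closest])
```

For example, `final_score(4, 5)` averages two adjacent scores, while `final_score(2, 5, 5)` returns the score that the expert and one reader share.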
Building Automated Essay
Scoring Capabilities
Some History
– PEG – 1960’s essay scoring (Page, 1966)
• Transformation of essay length
• Some syntactic analysis
• Convincing results
– Writer’s Workbench (Cherry et al, 1982)
• Editing tool for students
• Diction, style, spelling
• Discourse structure (‘topic sentence’ identification)
– Intelligent Essay Assessor (Landauer et al,
1998)
• Essay scoring with latent semantic analysis (LSA)
• Style and mechanics measures
Assessment Market
Technologically Ready
• Increase in Internet & Computer Access
– Students per instructional computer with Internet
access in public schools: 12.1 to 1 in 1998 & 4.8 to 1 in
2001 (NCES, 2002)
– Web resources used in over 40% of college
courses (Campus Computing Project, 2000)
– 99% of public schools have internet access (NCES
report, 2002)
• State Assessments: Increase in computer-based delivery
• Largest Test Publishers offer 850+ digital
textbook titles
Motivation in Assessment
• Cost Savings for Large-Scale
Assessments
• Classroom Integration for
Instruction
– More practice writing possible
– Electronic writing portfolios
– Individual performance assessment
– Classroom assessments
Educators’ Questions
About Innovation
1. Reliability: Can automated essay
assessments increase scoring
consistency for authentic assessments?
2. Assessment Type: Can automated
scoring introduce more varied high-stakes assessments?
3. Costs/Performance: Can scoring costs
be reduced, but scoring performance
maintained?
Starting Development
What should a good essay look like?
• Clearly states the author's position, and effectively
persuades the reader of the validity of the author's
argument.
• Well organized, with strong transitions helping to
link words and ideas.
• Develops its arguments with specific, well-elaborated support.
• Varies sentence structures and makes good word
choices; very few errors in spelling, grammar, or
punctuation
Mapping Writing Features
to NLP Tools
Writing Features → NLP Tools
• Grammar Errors & Sentence Structures → POS Taggers; Syntactic Parsers
• Vocabulary Usage → Content Analysis; POS Taggers
• Sentence & Word-Level Mechanics → Spelling Tools; POS Taggers
• Organization & Development of Ideas → Discourse Analyzers
E-rater (2/99 – 9/04)
• 50+ Writing-Relevant Features
– Syntactic Structure Features: clause types
– Discourse Structure Features: cue words & terms
– Content: Content vector analysis
– Lexical Complexity: e.g., word length, unique words
• NLP Tools
– Syntactic Parses
– High-level discourse analyzer
– tf*idf (essay level & argument level)
• Topic-Specific Models
– Training with Human-Scored Essays
– Stepwise Linear Regression (Variable Feature Set & Weights)
• System Performance
– Agreement with Humans
– Comparable to Two Humans
– E-rater/Human agreement : 59% exact; 98% exact + adjacent
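As a rough illustration of the "content vector analysis" and tf*idf items above, the sketch below pools human-scored training essays by score point, weights words by tf*idf, and gives a new essay the score whose pooled vocabulary it most resembles (cosine similarity). The function names, the pooling-by-score step, and the plain log idf are assumptions made for the example; the slide does not give E-rater's actual formulation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_by_content(essay_tokens, training):
    """training: list of (tokens, human_score) pairs for one topic."""
    # Pool all training essays that received the same score (assumption).
    by_score = {}
    for tokens, score in training:
        by_score.setdefault(score, []).extend(tokens)
    scores = sorted(by_score)

    # idf is computed over the pooled score-level documents.
    n_docs = len(scores)
    df = Counter(w for s in scores for w in set(by_score[s]))
    idf = {w: math.log(n_docs / df[w]) for w in df}

    def vectorize(tokens):
        # Words never seen in training carry no weight.
        return {w: tf * idf[w] for w, tf in Counter(tokens).items() if w in idf}

    essay_vec = vectorize(essay_tokens)
    similarity = {s: cosine(essay_vec, vectorize(by_score[s])) for s in scores}
    # Ties resolve to the lowest score because scores are sorted ascending.
    return max(scores, key=lambda s: similarity[s])
```

A real system would tokenize and normalize the text first; here the essays are assumed to arrive as lists of lowercase tokens.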
• essay → human reader score (S1) and E-rater score (S2)
• Is |S1 - S2| > 1 ?
– NO: Final Score = mean (of S1 and S2)
– YES: expert human reader score (S3); Final Score = mode, or mean of closest
Outcomes of Early Success
NY Times Headline Phobia
Can you spell imbecile?: E-rater® Gives Good Score
to Bad Essay
By A. Reporter
ETS’s automated scoring system thinks that this essay should
get something like a “B.” Would you want your child to do
well on this kind of writing? You be the judge.
“You are stupid. You are stupid because you can't read.
You are also stupid becuase you don't speak English and
because you can't add.
Your so stupid, you can't even add! Once, a teacher give
you a very simple math problem; it was 1+1=?. Now keep
in mind that this was in fourth grade, when you should have
known the answer. You said it was 23! I laughed so hard I
almost wet my pants! How much more stupid can you be?!
So have I proved it? Don't you agree that your the
stupidest person on earth? I mean, you can't read, speak
English, or add. Let's face it, your a moron, no, an idiot, no,
even worse, you're an imbosol.”
Anomalous Essay Detection
Statistical evaluation of word usage to flag
anomalous essays
– “Your essay does not resemble others being
written on this topic.”
– “Your essay might not be relevant to assigned
topic.”
– “Your essay appears to be restatement of the
topic with a few additional concepts.”
– “Compared to other essays written on this topic,
your essay contains more repetition of words.”
– “Your essay shows less development of a theme
than other essays written on this topic.”
Positive Outcomes:
Changing Business Model
• Cost Savings
– 1995 – 2000: ETS Research, some marketing,
little sales
• Revenue Generation
– 2001-2003: Spin-off (ETS Tech), all
marketing, all sales, all product development,
all the time
– 2003 – present: Spin-back (to ETS) with
vastly increased marketing & sales
What teachers really wanted:
Qualitative Feedback
Learning from Assessment Experts
• Holistic scores not meaningful to students
Score 3:
– While a position may be stated, either it is unclear OR
undeveloped.
– May have organization in parts, but lacks organization in
other parts.
– The support of the position may be brief, repetitive, or
irrelevant.
– Inconsistent control of sentence structure, and incorrect word
choices; errors in spelling, grammar, or punctuation
occasionally interfere with reader understanding.
• Demos for focus groups with teachers,
policy makers, assessment experts
– Errors in grammar, usage, mechanics, and style
– Organization & Development
More Innovation – More Questions
– Meaningfulness: Is the feedback
consistently related to a clearly-defined
standard?
– Self-Evaluation: Can instructional
software help students understand
evaluation of their writing?
– Improvement: Can writing practice
with immediate feedback help
students?
Criterion℠ Online Essay
Evaluation Service
• Critique writing analysis tools
– Grammar
– Usage
– Mechanics
– Style
– Organization & Development
• E-rater
Motivation For
New Capability Development
• What’s free for commercial use
- Spelling
• What’s not …free …
- Grammar
- Usage
- Mechanics
- Style
• What doesn’t exist
- Organization & Development
Methods
Grammar, Usage and
Mechanics Errors
• Corpus of well-formed text: 30 million words
from newspapers
• Features: function words and part-of-speech tags
a_AT good_JJ job_NN during_IN
• Collect frequencies for:
– Unigrams of tags and function words
– Bigrams of tags and function words
a_JJ AT_JJ JJ_NN NN_during NN_IN
• Method: pointwise mutual information and log
likelihood ratio used to detect unexpected
sequences – likely violations of English grammar
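The counting and scoring steps on this slide might look roughly like the sketch below. It is a simplification under stated assumptions: sequences contain part-of-speech tags only (the slide also mixes in function words), only pointwise mutual information is shown (the log likelihood ratio is omitted), and the names `train_counts` and `pmi` and the crude additive smoothing constant are hypothetical.

```python
import math
from collections import Counter


def train_counts(tag_sequences):
    """Collect unigram and bigram frequencies from well-formed, tagged text."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams, smoothing=0.01):
    """Pointwise mutual information: log2( P(a,b) / (P(a) * P(b)) ).

    A small constant is added to the bigram count so that unseen bigrams
    get a very low score instead of log(0). This smoothing is an assumption.
    """
    if unigrams[a] == 0 or unigrams[b] == 0:
        raise ValueError("both items must occur in the training corpus")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = (bigrams[(a, b)] + smoothing) / (n_bi + smoothing)
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log2(p_ab / (p_a * p_b))
```

Bigrams that score far below what the well-formed corpus predicts (strongly negative PMI) are the candidates for "unexpected sequences"; the cutoff would have to be tuned on real data.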
Grammar, Usage and
Mechanics Errors
• Harvest low probability bigrams from a
set of essays.
• Low probability bigrams:
– DTS_NN, AT_NNS
– *these pencil, *every teenagers
• Write Filters:
– *These pencil is yellow.
– but not These pencil erasers are dirty.
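A hand-written filter of the kind shown in the pencil example could be sketched as follows. The function name is hypothetical, the input is assumed to be already tagged as (word, tag) pairs, and the single rule (skip the flag when a plural noun follows, as in a noun-noun compound) is only an illustration of the idea; the deployed filters are not described on the slide.

```python
def flag_plural_determiner_errors(tagged):
    """Flag DTS_NN bigrams (e.g. "these pencil") in a tagged sentence.

    tagged: list of (word, tag) pairs using tags such as DTS, NN, NNS.
    """
    flags = []
    for i in range(len(tagged) - 1):
        (word1, tag1), (word2, tag2) = tagged[i], tagged[i + 1]
        if (tag1, tag2) != ("DTS", "NN"):
            continue
        # Filter: a following plural noun suggests a compound such as
        # "these pencil erasers", which is well formed, so do not flag it.
        next_tag = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if next_tag == "NNS":
            continue
        flags.append((word1, word2))
    return flags
```

Skipping doubtful cases costs some recall but avoids showing students errors that are not really there.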
Grammar
• Fragments
• Garbled Sentences
• Subject-Verb Agreement: the motel are …
• Verb form: They are need to distinguish …
• Pronoun Errors: Them are my reasons …
• Possessive Errors: the students grades
• Wrong or Missing Word: The should take the student
• Proofread This!: I think my through problems
Usage
• Article Errors: I like these song
• Confused Words: Because of there
different genres …
• Wrong Form of Word: the right choose
• Faulty Comparison: It is more easier
• Nonstandard Verb or Word Form
Mechanics
• Spelling
• Missing Capitalization
• Missing Initial Capitalization
• Missing Question Mark
• Missing Final Punctuation
• Missing Apostrophe: Thats about the only thing
• Missing Comma
• Missing Hyphen: a well deserved vacation
• Fused Words
• Compound Words
• Duplicate Words: escape to the another town
Style
• Short sentences, unusually long
sentences, and passives?
• Automatic detection of repeated words
– 300 essays manually annotated for
repetition
– Word-based text features with C5.0
• proportion of word use in essays
• distance between repeated word
occurrences
• pronoun?
• word length
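The word-based features listed above could be computed per token along these lines. This sketch covers only feature extraction; the C5.0 classifier trained on the annotated essays is not shown. The function name, the small pronoun list, and the exact feature definitions are assumptions for illustration.

```python
from collections import Counter

# Illustrative pronoun list; a real system would use a tagger or fuller list.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def repetition_features(tokens):
    """Return one feature dict per token of an essay."""
    words = [t.lower() for t in tokens]
    counts = Counter(words)
    last_seen = {}
    rows = []
    for position, word in enumerate(words):
        previous = last_seen.get(word)
        rows.append({
            "word": word,
            # Share of the essay's tokens taken up by this word.
            "proportion": counts[word] / len(words),
            # Tokens since the previous occurrence (None on first use).
            "distance_to_previous": None if previous is None else position - previous,
            "is_pronoun": word in PRONOUNS,
            "length": len(word),
        })
        last_seen[word] = position
    return rows
```

Each row could then be paired with the human "repeated / not repeated" annotation and passed to a decision-tree learner.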
How Do We Identify Organization &
Development in Essay Writing?
• Discourse Theories
• Lacks Essay-Based Discourse Function
–Cue-word & term detection (Cohen, Hirschberg &
Litman, Hovy et al, Knott, Mann & Thompson, Vander
Linden & Martin, Sidner, & Quirk, et al)
• Topical Coherence, Not Discourse Function
–TextTiling – (Hearst & Plaunt)
–LSA (Landauer et al)
–Select-a-Kibitzer (Wiemer-Hastings & Graesser)
• Not Student Friendly
–RST Trees (Mann & Thompson)
• Essay-Based Discourse Analyzer
(Burstein, Marcu, & Knight)
• Background, Thesis, Main Points, Supporting Ideas, and
Conclusion
Organization and Development:
Essay-Based Discourse Analyzer
• 1400+ essays manually annotated with pre-defined
labels
• Voting Between 3 Systems: 2 Probabilistic & 1
Decision-Based
– Probable discourse label sequences
– Essay sublanguage: agree, should, would, opinion, for
example, because, however...
– RST relations: contrast, elaboration, antithesis...
– Syntactic structures: infinitive, complement,
subordinating clauses...
• Identifies background text, thesis statement, main
ideas, supporting ideas, & conclusion statement
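The voting step can be pictured as a per-sentence majority vote over the label sequences produced by the three systems. The sketch below assumes each system outputs one label per sentence; the function name and the tie-breaking rule (fall back to the first system when all three disagree) are assumptions, since the slide does not say how disagreements are settled.

```python
from collections import Counter


def vote(labels_a, labels_b, labels_c):
    """Combine three per-sentence label sequences by majority vote."""
    combined = []
    for a, b, c in zip(labels_a, labels_b, labels_c):
        label, count = Counter([a, b, c]).most_common(1)[0]
        # Assumption: with no majority, trust the first system.
        combined.append(label if count >= 2 else a)
    return combined
```

For instance, if two systems label a sentence "thesis" and the third labels it "background", the combined label is "thesis".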
Evaluating Capabilities
• Precision, Recall, & F-measure
– Trade off recall for precision
– Better not to show falsely identified errors
• Grammar, Usage, & Mechanics (Bigrams)
– Minimum Precision for Deployment
• Style & Organization & Development
– Human-annotated data
– Develop baseline comparisons
– Precision, Recall, F-measure outperform
baselines & approach human agreement
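For reference, the three measures named on this slide can be computed from a set of system-flagged items and a set of human-annotated (gold) items, as in the sketch below. The function name and the set-of-identifiers representation are assumptions for the example.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and balanced F-measure for flagged items.

    predicted: items the system flagged (e.g. error positions)
    gold:      items the human annotators marked
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

A system tuned for feedback would accept a lower recall in exchange for high precision, so that few of the errors shown to a student are false alarms.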
Some Numbers
• Grammar, Usage, & Mechanics
– (Minimum) Overall System Precision = 0.90
• Discourse Capability (Org & Dev)
– Baseline Precision = 0.71
– Overall System Precision = 0.85
– Human agreement = 0.95
• Repetitive Word Use (Style)
– Baseline Precision = 0.27
– Overall System Precision = 0.95
– Human agreement = 0.55
E-rater v.2.0 – Release 8/04
• 12 Features: Relevant to Writing Standards
– Grammar, Usage, and Mechanics : Error Types
– Style: Sentence Type, Sentence Length, Repeated Words
– Organization: Thesis, Main Points, Support, and
Conclusion
– Content: Vocabulary Usage
• Topic-Specific & Grade-Specific Models
– Training with Human-Scored Essays
– Multiple Regression (Standardized Feature Set &
Variable Weights)
• System Performance
– Agreement with Humans
– Comparable to Two Humans
– E-rater/Human agreement : up to 62% exact (from
59%) ; 98% exact + adjacent
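The "exact" and "exact + adjacent" agreement figures quoted above are simple proportions over paired scores. A minimal sketch follows, assuming integer scores on a common scale; the function name is hypothetical.

```python
def agreement(system_scores, human_scores):
    """Return (exact, exact_plus_adjacent) agreement rates.

    exact:               share of essays with identical scores
    exact_plus_adjacent: share of essays whose scores differ by at most 1
    """
    if len(system_scores) != len(human_scores) or not system_scores:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(system_scores, human_scores))
    exact = sum(s == h for s, h in pairs) / len(pairs)
    adjacent = sum(abs(s - h) <= 1 for s, h in pairs) / len(pairs)
    return exact, adjacent
```

The same function applied to two human readers gives the human-human baseline against which the system is compared.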
Criterion℠ Online
Essay Evaluation
Student Independence
• essay → E-rater score
• essay → Critique Writing Analysis Feedback
– G, U, M, S
– Org & Dev
Effectiveness Studies
Criterion Field Data (Attali, 2004)
• Research Questions
– Can we evaluate the basic effectiveness
of Critique writing analysis tools?
– Can students understand and respond to
system feedback?
• Criterion Field Data
– Multiple submissions for about 9,000 essays
from grades 6 to 12
– Available for analysis:
• First and last essay submission
• Total number of submissions
Summary Results
• 25% error reduction across 30+ error types
• Increased number of essay-based discourse
elements
– background
– main point
– supporting idea
– conclusion
Criterion℠ & Standardized Testing
(Shermis et al, 2004)
• Research Question
– Can Criterion use have a positive impact on
FCAT writing scores?
• Data/Design
– 36 10th Grade English classes in Miami-Dade
• 18 used Criterion
• 18 used “traditional” instruction
Summary Results
• Bad News
– No significant differences in FCAT scores
• Good News
– Significant growth in writing performance
across different topics; Reduced numbers
of errors
– Significantly more writing productivity
Users & Volumes
E-rater
– GMAT (1999 – present)
– 350K essays scored each year
– Moving into Statewide Assessment
Criterion: E-rater + Critique
– K-12, college, and graduate level practice
applications
– End 2002: 200 clients & 50K subscriptions
– End 2003: 445 clients & 127K subscriptions
– March 2004: 544 clients & 437K subscriptions
– September 2004: 580 clients & 530K subscriptions
International Exposure
– Users in Canada, Mexico, Pakistan, India, Estonia,
Puerto Rico, Egypt, Nepal, Taiwan, Hong Kong &
Japan
Beyond Essay Evaluation
C-rater (Leacock)
– short-answer, concept-based evaluation
• morphological analyzer
• pronoun resolution
• syntactic chunker
• predicate argument structure generator
• automated spelling correction
• word similarity matrices
• part-of-speech tagger
Test Item Creation Assistants
– key-distractor selection for word-based test items
• Statistical word similarity tools (Deane & Higgins)
– article/paragraph selection for reading
comprehension items
• part-of-speech tagger
• rhetorical parser
Next Generation
Capabilities
Additional Grammar Error Detection
(Chodorow, Leacock, Han & Wolff)
• Missing (or extra) Determiner Errors
I can do so for __ following reasons
• Preposition Errors
a knowledge of/*at math
• Long Distance Subject-Verb Agreement
The use of dress codes are becoming a popular
subject.
Word Salad Detector: Prevent System Gaming
(T. Morton)
• Rare p-o-s tag sequences (mixing up content!)
quick The the over brown dogs fox. jumped lazy
• Abnormal p-o-s tag distributions
kfdl afjidaoi djfd &&&&**
Current Research Problems
• Evaluate Coherence in Organization &
Development (Higgins, Burstein & Marcu)
– Does the thesis statement respond to the
question?
– Do the main points relate to the thesis
statement?
– Are all sentences in a supporting idea
related?
• E-rater Trait-Based Scoring
CL Research Contributions to
Automated Text Evaluation
(1 of 2)
SHORT-ANSWER SCORING PROTOTYPES (1993 – 1995)
Paul Jacobs, Jacqueline Kud, & Lisa Rao (General Electric IR Tools)
ESSAY SCORING PROTOTYPE (1996 – 1999): Lisa Braden-Harder, Simon Corston-Oliver, George Heidorn, Karen Jensen, &
Steve Richardson (Microsoft syntactic parser+); Robin Cohen,
Sidney Greenbaum, Julia Hirschberg, Ed Hovy, Alistair Knott, Julia
Lavid, Geoffrey Leech, Keith Vander Linden, Diane Litman, William
Mann, James Martin, Elisabeth Maier, Candace Sidner, Sandra
Thompson, & Randolph Quirk (discourse theory); *Gerard Salton
(Vector Space Analysis)
E-RATER® (1999 - present): *Steve Abney (CASS parser), *Eric
Brill (p-o-s tagger), Hoa Trang Dang, Mary Dee Harris, Karen
Kukich, Thomas Landauer (scholarly debate), Leah Larkey, Ralph
Grishman (COMLEX), *George Miller (WordNet), *Thomas
Morton,*Adwait Ratnaparkhi (p-o-s tagger), *Susanne Wolff
CL Research Contributions to
Automated Text Evaluation
(2 of 2)
CRITIQUE WRITING ANALYSIS TOOLS (2000 – present):
Giovanni Flammia (Kappa Tool), Peter Foltz, Walter Kintsch, and
Thomas Landauer (LSA & text coherence), Marti Hearst & Christian
Plaunt (TextTiling), Derrick Higgins, P. Kanerva, J. Kristoferson, and
A. Holst (Random Indexing), Kevin Knight & *Daniel Marcu
(Rhetorical/Discourse Structure Parsers), *Dekang Lin (Word
Similarity Indices) & Andrew McCallum & Kamal Nigam
(multivariate Bernoulli), Eleni Miltsakaki, Ross Quinlan (C5.0), Rob
Schapire (BoosTexter), Peter Wiemer-Hastings & Arthur Graesser
(essay coherence theory), Magdalena Wolska, & Vladimir Vapnik
(Support Vector Machines), Linguistic Data Consortium.
ANOMALOUS ESSAY DETECTION (2000): Martin Chodorow
C-RATER (2000 – present): Claudia Leacock, Rebecca Passonneau
TEST ITEM CREATION ASSISTANTS (2002 – present): Paul
Deane & Derrick Higgins, R. Soricut (Rhetorical parser)
More Publications:
http://www.ets.org/research/erater.html
Tom Morton’s Freeware Parser:
https://sourceforge.net/projects/opennlp