Automated Essay Scoring in the Classroom
Paper Presented at the American Educational Research Association
April 7-11, 2006
San Francisco, California
Douglas Grimes
Dept. of Informatics
University of California, Irvine
grimesd@ics.uci.edu
Mark Warschauer
Dept. of Education
University of California, Irvine
markw@uci.edu
(please check with authors before citing or quoting)
Abstract
Automated essay scoring (AES) software uses natural language processing
techniques to evaluate essays and generate feedback to help students revise their work.
We used interviews, surveys, and classroom observations to study teachers and students
using AES software in five elementary and secondary schools. In spite of generally
positive attitudes toward the software, levels of use were lower than expected. Although
teachers and students valued the automated feedback as an aid for revision, teachers
scheduled little time for revising, and students made little use of the feedback except to
correct spelling errors. There was no evidence of AES favoring or disfavoring particular
groups of students.
Introduction
The global information economy places increasing demands on literacy skills
(Marshall et al., 1992; Zuboff, 1989), yet literacy has been lagging in the U.S. for
decades (National Commission on Excellence in Education, 1983). Increased writing
practice is a cornerstone for improving literacy levels (National Commission on Writing
in America's Schools and Colleges, 2003). However, writing teachers are challenged to
find the time to evaluate their students’ essays. If a teacher wants to spend five minutes
per student per week correcting papers for 180 students, she must somehow find 15 hours
per week, generally after hours.
Automated Essay Scoring (AES) software has been applied for two primary
purposes: fast, cost-effective grading of high-stakes writing tests, and as a classroom tool
to enable students to write more while easing the grading burden on teachers. When essays are rated on a 6-point scale, the programs' reliability is on a par with that of human graders: for at least three major AES programs, consistency between a human grader and the machine has been as high as that between two human graders.
Although much has been published on the reliability of automated scoring and its
application to high-stakes tests such as the GMAT and TOEFL exams, less has been
published on AES as a tool for teaching writing (exceptions: Attali, 2004; Elliot et al.,
2004; Shermis et al., 2004). Our research on AES in grades 5 through 12 explores how
students and teachers use it and how their attitudes and usage patterns vary by social
context.
AES uses artificial intelligence (AI) to evaluate essays. It compares lexical and syntactic features of an essay with those of several hundred baseline essays scored by human
graders. Thus, scores reflect the judgments of the humans who graded the baseline
essays. In addition to scoring essays, the programs flag potential errors in spelling,
grammar and word usage, and suggest ways to improve organization and development.
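Neither vendor publishes the internals of its feedback engine. As a rough illustration only, the following sketch flags words that do not appear in a reference word list, loosely analogous to the spelling checks in a tool like My Editor; the word list, sample essay, and function name are our own hypothetical examples.

```python
import re

# Hypothetical reference vocabulary; real AES programs rely on much larger
# dictionaries plus grammar and usage models.
WORD_LIST = {"the", "students", "revised", "their", "essays", "quickly"}

def flag_possible_misspellings(essay: str) -> list:
    """Return tokens that do not appear in the reference word list."""
    tokens = re.findall(r"[a-z']+", essay.lower())
    return [t for t in tokens if t not in WORD_LIST]

if __name__ == "__main__":
    sample = "The studnets revised thier essays quickly."
    print(flag_possible_misspellings(sample))  # ['studnets', 'thier']
```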
We studied two popular AES programs, MY Access, from Vantage Learning, and
Criterion, from ETS Technologies, a for-profit subsidiary of the Educational Testing
Service. Both have two engines, one that scores essays, and another that provides writing
guidance and feedback for revising.
Prior Research: Context and Controversy
Most prior AES research falls into two categories -- the technical features of
natural language processing (NLP), and reliability studies based on comparing human
graders with a machine grader (AES program).
Computers are notoriously fallible in emulating the critical thinking required to
assess writing (e.g., Dreyfus, 1991). People become wary when computers make
judgments and acquire an aura of expertise (Bandura, 1985). Hence it is no surprise that
automated writing assessment has been sharply criticized. Among the concerns are:
deskilling of teachers and decontextualization of writing (Hamp-Lyons, 2005);
manipulation of essays to trick the software, biases against innovation, and separation of
teachers from their students (Cohen et al., 2003); and violation of the essentially social
nature of writing (CCCC Executive Committee, 1996).
Most of the criticisms focus on its use in large-scale, high-stakes tests. Classroom
use of AES presents a very different scenario. The major vendors market their software as a supplement to, not a replacement for, human teachers, and recommend a low-stakes approach to
classroom use of AES, encouraging teachers to assign the lion’s share of grades.
Criterion and MY Access are marketed mostly to upper elementary and secondary
education schools, especially junior high and middle schools. We selected study sites
accordingly – one K-8 school, three junior high schools, and one high school in Orange
County, California. The student bodies varied widely in academic achievement,
socioeconomic status, ethnic makeup, and access to computers.
Automated Writing Evaluation and Feedback
The dual-engine architecture of both Criterion and MY Access represents the merger of two streams of development in natural language processing (NLP) technology – statistical tools that emulate or surpass human graders in evaluating essays, and tools that provide automated feedback on organization, style, grammar, and spelling.
Ellis Page pioneered automated writing evaluation in the 1960s with Project Essay Grade (PEG), which used multiple regression analysis of measurable features of text,
such as essay length and average sentence length, to build a scoring model based on a
corpus of human-graded essays (Page, 1967; Shermis et al., 2001). In the 1990s ETS and
Vantage Learning developed competing AES programs called e-rater and Intellimetric,
respectively (Burstein, 2003; Cohen et al., 2003; Elliot & Mikulas, 2004). Like PEG,
both employed regression models based on a corpus of human-graded essays.
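The published descriptions give only the general approach, and each vendor's features and models are proprietary. Under those caveats, the sketch below is a minimal PEG-style illustration: it fits an ordinary least-squares model that predicts human scores from two surface features (essay length and average sentence length) on a toy corpus; the data and feature choices are hypothetical.

```python
import numpy as np

def surface_features(essay: str) -> list:
    """Two PEG-style surface features: word count and mean sentence length."""
    sentences = [s for s in essay.split(".") if s.strip()]
    words = essay.split()
    return [float(len(words)), len(words) / max(len(sentences), 1)]

# Toy corpus of human-graded essays: (text, human score on a 6-point scale).
corpus = [
    ("Short essay. Very short.", 2),
    ("A somewhat longer essay with more words in it. It has two sentences.", 3),
    ("This essay develops an argument across several sentences. Each sentence "
     "adds detail. The writer elaborates further. A conclusion follows.", 5),
]

# Design matrix: intercept plus surface features; response: human scores.
X = np.array([[1.0] + surface_features(text) for text, _ in corpus])
y = np.array([score for _, score in corpus], dtype=float)

# Ordinary least squares, as in a multiple-regression scoring model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

new_essay = "A new essay to score. It has a few short sentences. They are brief."
predicted = float(np.array([1.0] + surface_features(new_essay)) @ coef)
print(round(predicted, 2))
```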
Since 1999 e-rater has been used in conjunction with a human grader to score the
analytical writing assessment in GMAT, the standard entrance examination used by most
business schools. Intellimetric has also been tested on college entrance examinations,
although published details of the results are much sketchier (e.g., Elliot, 2001).
Automated feedback tools on writing mechanics (spelling, grammar, and word
usage), style, and organization comprise a parallel stream of evolution. Feedback tools
for mechanics and style were developed in the 1980s (Kohut et al., 1995; MacDonald et al.,
1982). Organization, syntax and style checking continue to present formidable
technological challenges today (Burstein et al., 2003; Stanford NLP Group, 2005).
ETS and Vantage Learning both addressed the classroom market by coupling their scoring engines with complementary feedback engines, and Criterion and MY Access emerged in the late 1990s.
Most of the research on AES adheres to an established model for evaluating large-scale tests that require human judgment. The goal is to maximize inter-rater reliability,
i.e., minimize variability in scores among raters. At least four automated evaluation
programs – PEG, e-rater, Intellimetric, and Intelligent Essay Assessor (IEA, from
Pearson Knowledge Technologies) – have scored essays roughly on a par with expert
human raters (Burstein, 2003; Cohen et al., 2003; Elliot & Mikulas, 2004; Landauer et
al., 2003).
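Reliability in such studies is usually reported as agreement between two raters on the 6-point scale. As a worked illustration only, with invented scores rather than data from any cited study, the sketch below computes exact agreement, adjacent agreement (within one point), and the Pearson correlation between a human rater and a machine rater.

```python
from statistics import correlation  # available in Python 3.10+

# Invented scores on a 6-point scale for ten essays.
human   = [4, 5, 3, 6, 2, 4, 5, 3, 4, 6]
machine = [4, 5, 4, 5, 2, 4, 4, 3, 5, 6]

exact = sum(h == m for h, m in zip(human, machine)) / len(human)
adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)
r = correlation(human, machine)

print(f"exact agreement:    {exact:.0%}")     # 60%
print(f"adjacent agreement: {adjacent:.0%}")  # 100%
print(f"Pearson r:          {r:.2f}")
```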
Methodology
This is a mixed-methods exploratory case study of how AES is used in
classrooms.
Research Sites and Participants: We studied AES use in four junior high schools
and one high school in Southern California. The two junior high schools we studied most
intensively were part of a larger 1:1 laptop study, where all students in certain grades had
personal laptops that they used in most of their classes. One of the 1:1 schools was a
new, high-SES K-8 school with mostly Caucasian and Asian students. The other was an
older, low-SES junior high school with two-thirds Latino students. The other three
schools (two junior high schools and one high school) used either laptop carts or
computer labs. Three schools used My Access; two used Criterion.
Teachers using My Access were selected by availability, and included most of the language arts teachers in the 1:1 schools, plus a few teachers of other subjects in those schools, and three language arts or English teachers in non-1:1 schools. Teachers using Criterion were recommended by their principals.
Data collection: We recorded and transcribed semi-structured audio interviews
with three principals, three technical administrators, and nine language arts teachers. We
also observed approximately twenty language arts classes using MY Access and
Criterion, conducted two focus groups with students, and examined student essays and
reports of over 2,400 essays written with MY Access. Finally, we conducted surveys of
teachers and students in the 1:1 laptop schools. Nine teachers and 564 students
responded to the section on MY Access.
Findings
Our findings can be summarized in terms of usage patterns, attitudes, and social
context:
1. How do teachers make use of automated essay scoring?
2. How do students make use of automated essay scoring?
3. How does usage vary by school and social context?
1. Teachers
Paradox #1: High opinions and low utilization
All of the teachers and administrators we interviewed expressed favorable views
of their AES programs overall. Several talked glowingly about students’ increased
motivation to write. The teacher survey gave quantitative support for the interview data (Table 1): six of the seven teachers said that My Editor, the MY Access function that highlights likely errors in mechanics (spelling, punctuation, and grammar), was helpful.
Five of seven agreed that My Tutor, the function that gives generic advice on writing,
was helpful. More important, all seven said they would recommend the program to other
teachers, and six of seven said they thought the program helps students develop
insightful, creative writing.
Table 1: Teacher Views of My Access
Please indicate how much you agree or disagree with each of the following statements.

Statement                                      Strongly   Disagree   Neutral   Agree     Strongly   Ave.
                                               disagree                                  agree
The errors pointed out by My Editor are
  helpful to my students.                      0% (0)     14% (1)    0% (0)    71% (5)   14% (1)    3.86
The advice given in My Tutor is helpful
  to my students.                              0% (0)     29% (2)    0% (0)    57% (4)   14% (1)    3.57
I would recommend the program to other
  teachers.                                    0% (0)     0% (0)     0% (0)    86% (6)   14% (1)    4.14
The program helps students develop
  insightful, creative writing.                0% (0)     0% (0)     14% (1)   86% (6)   0% (0)     3.86
Average                                                                                             3.86

Average scores are the weighted averages for the row:
1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree.
Nine teachers answered the section of the survey on MY Access. Two of them said they
used MY Access less than 15 minutes a week. They answered “N/A” to the above
questions and were eliminated from the responses reported here. Two of the remaining
seven taught general education in fifth or sixth grade. The other five all taught 7th or 8th
grade: four taught English Language Arts or English Development, and one taught
special education.
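As a check on the reported figures, the weighted average for any row can be recomputed directly from the response counts; the snippet below reproduces the 3.86 average for the My Editor item in Table 1.

```python
# Likert weights: 1 = Strongly disagree ... 5 = Strongly agree.
# Counts for "The errors pointed out by My Editor are helpful to my students."
counts = {1: 0, 2: 1, 3: 0, 4: 5, 5: 1}

total_responses = sum(counts.values())
weighted_average = sum(weight * n for weight, n in counts.items()) / total_responses
print(round(weighted_average, 2))  # 3.86
```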
AES makes it easier to manage writing portfolios. Warschauer notes in the
context of 1:1 laptop programs:
A major advantage of automated writing evaluation was that it engaged students
in autonomous activity while freeing up teacher time. Teachers still graded
essays, but they could be more selective about which essays, and which versions,
they chose to grade. In many cases, teachers allowed students to submit early
drafts for automated computer scoring and a final draft for teacher evaluation and
feedback. (Warschauer, in press, p. 108)
One teacher compared My Access to “a second pair of eyes” to watch over her
students, making it easier to manage a classroom full of squirrelly students.
Given their positive attitudes toward My Access, we were surprised how infrequently teachers used it. Seventh-grade students in the two 1:1 laptop schools averaged only 2.38 essays each in the 2004-2005 school year (Table 2). Students in lower grades in the 1:1 laptop program and 7th-grade students in non-1:1 schools used My Access even less.
Table 2: My Access Use by 7th-Graders in Two 1:1 Laptop Schools, 2004-2005 School Year

Revisions     Low-SES School        High-SES School       Total
              Essays      %         Essays      %         Essays      %
0             695         73%       153         69%       848         72%
1             136         14%       52          23%       188         16%
2             51          5%        16          7%        67          6%
>2            74          8%        2           1%        76          6%
Essays        956         100%      223         100%      1,179       100%
Students      403                   93                    496
Essays/Stu.   2.37                  2.40                  2.38

Explanation: In the low-SES school, 403 students produced 695 essays with no revisions, 136 with one revision, 51 with two, and 74 with more than two, a total of 956 essays, or 2.37 essays per student. Similarly, in the high-SES school, 93 students produced 153 essays with no revisions, 52 with one revision, 16 with two, and 2 with more than two, a total of 223 essays, or 2.40 essays per student.
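The per-student figures in Table 2 follow directly from the essay counts; the short snippet below reproduces them.

```python
# Essays by number of revisions (0, 1, 2, >2) and student counts, from Table 2.
schools = {
    "Low-SES":  {"essays": [695, 136, 51, 74], "students": 403},
    "High-SES": {"essays": [153, 52, 16, 2],   "students": 93},
}

for name, data in schools.items():
    total = sum(data["essays"])
    per_student = total / data["students"]
    print(f"{name}: {total} essays / {data['students']} students "
          f"= {per_student:.2f} essays per student")
```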
In spite of limited use, teachers said that MY Access allowed them to assign more
writing. The most common explanation for low use was lack of time. Several teachers
said they would like to have their students use the program more, but they needed
precious class time for grammar drills and other preparation for state tests.
Paradox #2: Flawed scores are useful.
Although teachers often disagreed with the automated scores, they found them
helpful for teaching. The only question in the My Access teacher survey that received a
negative overall response related to the fairness of scoring. On a scale of 1 to 5, where 3
was neutral, the average response to “The numerical scores are usually fair and accurate”
was 2.71. Nevertheless, six of the seven teachers said that the numerical scores helped
their students improve their writing.
Several observations may help resolve this seeming paradox. First, students tended to be less skeptical of the scores than teachers. The average teacher response to "The numerical scores are usually fair and accurate" was 2.71, whereas the average student response to "The essay scores My Access gives are fair" was 3.44 (Table 3).
Second, speed of response is a strong motivator, no matter what the score.
Interviews with teachers and students and student responses to open-ended survey
questions all confirmed our observation that students were excited by scores that came
back within a few seconds, whether or not they felt the grades were fair. Teachers
discovered that the scores turned the frequently tedious process of essay assessment into
a speedy cousin of computer games and TV game shows. Students sometimes shouted
with joy on receiving a high score, or moaned in mock agony over a low one.
Teachers may also have felt relieved to offload the blame for grades onto a
machine. Thus teachers could play the role of students’ friend, coaching them to improve
their scores, rather than the traditional adversarial role of teacher-as-examiner, ever
vigilant of errors. Whatever the reasons, teachers actively encouraged students to score
higher, often emphasizing scores per se instead of mastery of writing skills that may have
been only loosely connected to scores.
Most teachers continued to grade essays manually, although perhaps more quickly
than before. Several said they wanted to make sure the grades were fair. One said that
My Editor overwhelmed students by flagging too many errors; she just highlighted a few
of the more significant ones.
Discussion: The teachers using MY Access in our study were all in their first
year of using any AES program. It remains to be seen whether they will gain confidence in the automated scoring with more experience and rely more heavily on the automated grades. A
teacher who had three years’ experience with Criterion only spot-checked student essays,
and emphasized that ETS encouraged low-stakes use of the program.
Paradox #3: Teachers valued AES for revision, but scheduled little time for it.
As Table 2 indicates, 72% of the seventh-grade essays in My Access were not
revised at all – only 28% were revised after receiving a preliminary score and feedback.
However, these figures are not reliable indicators of how much students would revise if
given ample time to polish an essay and sufficient experience with the program. On one
hand, the 28% figure may actually overstate the number of actual revisions for several
reasons. For example, teachers sometimes assigned an essay over portions of two or
three class periods, so each sitting after the first will show up as a revision unless students
save without submitting. Students can submit at any time, and can submit repeatedly in a
single sitting, just to see if their score has gone up when they fix one spelling error.
Sometimes they resubmit without making any changes at all, in which case the number of
revisions overstates the actual editing. On the other hand, the 28% figure might
understate the amount of revising of routine essay assignments because teachers sometimes disallowed revisions, either for internal testing purposes (e.g., to compare
writing skills at the beginning and end of the school year), or to simulate the timed
writing conditions of standardized tests.
One teacher wrote, "I feel that My Access puts the emphasis on revision. It is so wonderful to be able to have students revise and immediately find out if they improved."
2. Students
We used several data collection methods for student attitudes. Our student survey
contained both multiple-choice and open-ended questions. We asked students’ opinions
in focus groups and informal conversations. We asked teachers and administrators about
student attitudes, and we observed students using the programs.
No surprise: Students liked fast scores and feedback, even when they didn’t trust the
scores and ignored the feedback.
Table 3 summarizes the responses of over 400 seventh-grade students to part of
our survey on My Access. On the same scale, where 1 = “strongly disagree” and 5 =
“strongly agree,” answers to seven of the eight questions averaged slightly above 3, or
“neutral.” The average response to all eight questions indicated positive attitudes toward
My Access. However, individual opinions varied widely.
A narrow majority of students regarded the automated scores as fair, but
confidence in My Access scores varied widely. 51% of students agreed or strongly
agreed with "The essay scores My Access gives are fair" (Table 3). A fifth-grade teacher
wrote, “I have had students who have turned in papers for their grade level, and they were
incomplete, and they still received a 6! I think it lost some credibility with my students.
If I use the middle school prompts it is better, but then I feel I am stepping on the toes of
the sixth, seventh, and eighth grade teachers.”
Table 3: Student Views of My Access
Please indicate how much you agree or disagree with each of the following statements:

Statement                               Strongly   Disagree   Neutral    Agree      Strongly  I don't   Ave.
                                        disagree                                    agree     know
I find My Access easy to use.           3% (11)    7% (28)    25% (108)  47% (199)  13% (57)  5% (23)   3.65
I sometimes have trouble knowing how
  to use My Access.                     12% (51)   42% (177)  23% (98)   18% (77)   2% (8)    3% (11)   2.55
I like writing with My Access.          9% (38)    15% (62)   33% (140)  29% (124)  10% (43)  4% (17)   3.18
I revise my writing more when I use
  My Access.                            5% (22)    16% (67)   26% (109)  33% (139)  14% (61)  6% (27)   3.38
Writing with My Access has increased
  my confidence in my writing.          6% (25)    18% (77)   30% (127)  28% (120)  12% (50)  7% (28)   3.23
My Tutor has good suggestions for
  improving my writing.                 8% (34)    11% (47)   25% (107)  26% (109)  10% (41)  20% (87)  3.22
The essay scores My Access gives are
  fair.                                 5% (22)    12% (52)   26% (109)  39% (167)  12% (53)  6% (24)   3.44
My Access helps improve my writing.     6% (24)    10% (41)   22% (92)   38% (160)  18% (75)  8% (32)   3.56

Responses of 428 seventh-grade students to the My Access portion of a survey on a 1:1 laptop program. The two schools are the same as in Table 2, and most of the students are the same.
Our survey asked students to complete two sentences: “The best thing about My
Access is ....” and “The worst thing about My Access is ....” Approximately equal
numbers of students mentioned the scores in a positive light and in a negative one. The
most common unprompted answer to “the worst thing ...” was unfair scores (15.3% of
responses); the second most common answer to “the best thing ...” was the scores (16.2%
of responses).
Discussions with students corroborated this antinomy. In general, the speed of scoring and feedback appeared to be more important than the accuracy. Many
students were not bothered by anomalies in the scoring. Some compared AES with video
games; speed of response seemed to matter more than fairness, at least when the stakes
were low. Unlike with a high-stakes test, students could ask the teacher to override a low score they considered unfair.
The two AES programs we studied were least capable of scoring short or highly
innovative essays. Novice writers who struggled to produce a few lines sometimes
received zeroes. Their essays were ungradable because they were too short, failed to use
periods between sentences, or did not conform to the expected formulaic structure.
Student surveys, interviews, and observations confirmed the motivating power of
quick numeric scores. Given the low-stakes nature of the automated scores, and the
teachers’ willingness to override them, students had little to lose from an unfairly low
grade. We asked a focus group of 7th graders what they would do if they wrote an essay
they thought was worth a 6 and it received a 2, and they unanimously chimed back that
they would tell the teacher, and she would fix it. Unsurprisingly, their strategy was not
symmetrical for scores that were too high; when we asked them what they would do if
they wrote an essay they thought only deserved a 2 and it got a 6, they all said they would
keep the score, but wouldn’t reveal the grading aberration to the teacher.
Discussion: Diverse schools of psychology are in substantial agreement with the common-sense notion that clear, proximal goals are, ceteris paribus, stronger motivators than ambiguous, distal ones (summarized in Bandura, 1985). Several teachers compared
AES to video games in terms of students’ motivation to improve their scores.
No surprise: Students’ revisions were superficial.
As Table 2 indicates, 72% of the My Access essays were not revised at all after
the first submission. As indicated above, the seemingly low level of revision may be due
to lack of time. It is consistent with findings, such as Janet Emig’s classic study of the
writing habits of twelfth-graders, that students are generally disinclined to revise school-sponsored writing (Emig, 1971).
The average student response to “I revise my writing more when I use My
Access” was 3.38 (Table 3). This is in line with our finding that students generally revise
more when using laptops (Warschauer, in press; Warschauer et al., 2005). However, it
was clear that most students in our study conceived of revising as merely correcting
errors in mechanics, especially spelling. Examination of ten essays that were revised
revealed none that had been revised for content or organization. Except for one essay in
which a sentence was added (repeating what had already been said), all of the revisions
maintained the previous content and sentence structure. Changes were limited to single
words and simple phrases, and the original meaning remained intact. Most changes
appeared to be in response to the automated error feedback.
Discussion: The above observations about superficial editing apply to seventh
grade and below in our study. It is well recognized in the composition community that
such young writers are less able than older ones to view their own work critically, a
prerequisite for in-depth revision (Bereiter et al., 1987). In general, students revise more
when writing on computers than on paper (Warschauer, in press, p. 98). Hence AES
revision frequency should be compared with writing on a word processor, rather than on
paper.
Frequency of revision also varies with audience (Butler-Nalin, 1984). Writing for
the teacher-as-examiner (as in the stereotypical classroom essay) elicits the fewest
revisions; writing for authentic audiences, especially those outside of school, elicits the
most. Hence the artificial audience of AES, even when supplemented by a teacher in the
role of examiner, would not be expected to inspire students to revise deeply.
Paradox: Students treat machine assessment as almost human.
Two teachers noted that students treated AES programs like a human reviewer in that they did not like to turn in less than their best effort. We observed a high school literature class that lent confirmation to this observation. An unexpected interruption
forced the teacher to stop the students in mid-essay. She asked them to submit their
incomplete essays so that a group of visiting teachers could see the Criterion feedback.
The students objected that they would get low scores because they were not finished; they
wanted to save their essays and exit instead of submitting them for a score, even though
the scores wouldn’t count toward their class grade.
Seventh-graders in two focus groups also said they did not want to submit an
incomplete essay, even though it did not count toward a class grade. Thus students often
appear to treat numeric scores as personal evaluations, even when their essays are
incomplete. They cheered high scores and bemoaned low ones, but treated them with
more levity than grades from their teacher.
A high school principal who had approximately six years' experience supervising teachers using Criterion said that students viewed their work more critically as they wrote, editing before they submitted. She suggested that mere anticipation of automated feedback or scores may affect the writing process. If this hypothesis is valid, the number of
revisions reported in Table 2 will understate the amount of revision. Tracking the
evolution of a text reveals only the visible product of revision, not mental revisions that
take place before the writer puts them on paper or computer (Witte, 1987). "It makes
little psychological sense to treat changing a sentence after it is written down as a
different process from changing it before it is written" (Scardamalia et al., 1986, p. 783).
No surprise: AES did not noticeably affect scores on standardized writing tests.
Scores on the Language Arts portion of the California State Tests for 2005 for
seventh-graders at our two 1:1 schools using AES were not significantly different from those of
their peers at other schools in the same district (Warschauer & Grimes, 2005). Given the
limited use of AES in the two focus schools, it would be unrealistic to expect
improvement in standardized tests so soon.
3. Social context
SES did not appear to affect revision patterns or attitudes toward AES.
The two schools we studied most intensively had recently introduced 1:1 laptop
programs, where all 7th-graders had personal laptops they used in most of their classes. One
school was high-SES, dominated by students who spoke English or Korean at home. The
other school was low-SES; two-thirds of the students were Latinos, and native Spanish
speakers slightly outnumbered native English speakers.
Language arts teachers in the high-SES school assigned more writing in My Access than did their counterparts in the low-SES school, but revision patterns were substantially the same
(Table 2). Table 4 breaks down the attitude averages in Table 3 by school. The two
schools are essentially indistinguishable in terms of average responses. The differences
are a tenth of a point or less in all cases except one (“My Tutor has good suggestions for
improving my writing”), which differed by about two-tenths. The similarity in survey
responses was surprising, given the substantial difference in demographics, disciplinary
problems, standardized test scores, and general enthusiasm for learning in the two student
groups.
Table 4: Student Attitudes toward My Access in a High-SES and a Low-SES School
Please indicate how much you agree or disagree with each of the following statements:

Statement                                                    Low-SES     High-SES
                                                             (N = 347)   (N = 163)
I find My Access easy to use.                                3.65        3.69
I sometimes have trouble knowing how to use My Access.       2.54        2.59
I like writing with My Access.                               3.17        3.21
I revise my writing more when I use My Access.               3.37        3.42
Writing with My Access has increased my confidence in my
  writing.                                                   3.26        3.22
My Tutor has good suggestions for improving my writing.      3.20        3.42
The essay scores My Access gives are fair.                   3.45        3.35
My Access helps improve my writing.                          3.57        3.50

The students are mostly the same seventh-graders as in Table 3, plus some upper elementary students.
Teachers in both schools said that MY Access allowed them to assign more
writing. Teachers in the high-SES school were more comfortable assigning homework in
My Access because they were confident students had Internet access at home. The
number of teachers is too small to support tests of statistical significance in any of the
quantitative comparisons between schools. Two teachers using Criterion in other schools
said students occasionally requested permission to write extra essays, in addition to
regular homework – a rare and pleasant surprise for teachers accustomed to students
complaining about homework.
AES does not appear to favor particular categories of middle-school learners.
Teachers also said that it assisted writing development for diverse categories of
students -- English Language Learners, gifted, special education, at-risk, and students
with no special needs (Table 5).
Table 5: Teachers' Views of My Access for Diverse Groups of Learners
For each category of students, please indicate how much My Access assists their writing development:

Category                    Very positive --   Slightly    Neutral    N/A        Response
                            it really helps    positive                          average
                            writing
English Language Learners   17% (1)            67% (4)     0% (0)     17% (1)    4.20
Special education           50% (3)            33% (2)     0% (0)     17% (1)    4.60
Gifted                      29% (2)            43% (3)     0% (0)     29% (2)    4.40
At-risk                     33% (2)            33% (2)     17% (1)    17% (1)    4.20
General students --
  no special needs          50% (3)            33% (2)     0% (0)     17% (1)    4.60

The same seven teachers responded as in Tables 1 and 3 above. Two more columns, "Very negative -- it really hinders learning" and "Slightly negative," were full of zeroes and have been suppressed to save space.
A high school teacher and principal said that Criterion was useful for average and
below-average high school students, especially in ninth and tenth grades. However, older
students, especially the skilled writers in Advanced Placement classes, quickly learned to
spoof the scoring system. They needed feedback on style, and Criterion’s error feedback,
oriented toward correcting the mechanics of Standard American English, was of less use to
them than to students still struggling with conventions of grammar, punctuation, and
spelling.
Discussion: The larger social context of high-stakes testing and the prior
practices of individual teachers appeared to shape classroom use of AES more than racial,
ethnic, or economic status of the students. It appears that AES is most useful for students
who are neither too young and unskilled as writers, nor too skilled. Fourth-graders in
California are expected to understand four simple genres and basic punctuation, two
skills that happen to be needed for AES programs to score an essay (California
Department of Education Standards and Assessment Division, 2002). Students with
these skills and the ability to type roughly as fast as they write by hand could be expected
to use AES.
Older writers may become sufficiently skilled to outgrow AES, perhaps in high
school, when they master the basic genres and mechanics of writing and develop their
own writing styles, or if they write best when not conforming to the formulaic
organization expected by the programs. Skilled writers, it seems, are more likely to be
critical of the scores and the feedback.
A study of contrasts: two teachers, two schools, two AES programs, and two styles of
teaching
In order to give a sense of the diverse ways in which we saw AES being used, we
offer case studies of two seventh-grade teachers in two schools with highly contrasting demographics. The first school is two-thirds Latino, with slightly below-average API (Academic Performance Index) scores compared with similar schools and almost 60% of its students on the free or reduced-price lunch program; the second is a California blue-ribbon school with mostly White and Asian students in a high-income neighborhood. The first
is the low-SES school in Table 2; however, the second is not the high-SES school in that
table.
All of the seventh-graders in the first school participated in a 1:1 laptop program,
funded by an Enhancing Education Through Technology (EETT) grant. The teacher in
our case study taught classes composed principally of Latino English language learners.
She said that some of her students had never written a whole paragraph before
they entered seventh grade. Their first attempts to write essays on My Access resulted in
low scores – mostly 1’s and 2’s on a 6-point scale – but they quickly improved. Having
laptops with wireless Internet connections allowed them to do research related to the AES
prompts. For example, before writing to a prompt about a poem by Emily Dickinson,
they researched and answered questions on the author’s life and times.
Her students struggled with keyboarding skills when they first received their
laptops. Most wrote faster by hand than on computer, at least in their first few months
with laptops. Deficient keyboarding skills only compounded these novice writers' reluctance to write. To facilitate the flow of words, the teacher asked them to write first by hand and then copy their work to the computer.
In contrast, a teacher in the second school was in the enviable position of having more attentive students with better training in typing and computer use. For many years she had given her students one thirty-minute timed essay per week. Before introducing Criterion three years ago, she spent all day each Saturday or Sunday grading the essays of her more than 150 students.
She still schedules one thirty-minute timed essay per week, with ten minutes of
pre-writing beforehand, alternating between Criterion and a word processor from week to week. Criterion saves her a few hours compared with the word-processed essays. She
said:
I have 157 students and I need a break.... I cannot get through these essays and
give the kids feedback. One of the things I’ve learned over the years is it doesn’t
matter if they get a lot of feedback from me, they just need to write. The more
they write, the better they do.
She added that students are excited about Criterion. “They groan when they have
to write an essay by hand. They’d much rather write for Criterion.”
These two cases present many contrasts aside from demographics:
• Many students in the first school have little typing experience prior to entering
seventh grade. The second school is in a school district that attempts to train all
students to type at least thirty words a minute by the time they finish sixth grade.
Consequently, most seventh graders have the keyboarding skills needed for fluid
computer use.
• The narrative feedback from My Access, used in the first school, was extremely
verbose and totally generic (not adapted to each essay). In contrast, the narrative
feedback from Criterion, used in the second school, was concise and specific to
each essay. (My Access did allow teachers to select a more basic level of
narrative feedback, but we did not see any teachers use it. My Tutor, the narrative
feedback function in My Access, is now much simpler than at the time of our
study.)
• The teacher in the second school writes many of her own prompts for Criterion,
whereas the teacher in the first school uses only prompts provided by My Access.
(Both programs allow teachers to create their own prompts, but the programs can
only score essays written to prompts that have a corpus of several hundred
human-graded essays in the database.)
• Teachers in the first school count themselves lucky if half their students do any
homework; in the second school the teacher can count on most students doing
about three hours of homework a week. In both cases, most of the writing is done at
school, but for very different reasons. In the low-SES school, parents are less
involved, and many students would not write at all if their homework were their
only writing practice. In the high-SES school, the teacher complained that she
used to assign writing homework, and the parents became too involved: “I got
tired of grading the parents’ homework!” She stopped assigning written
homework.
In other respects the two teachers and their schools are similar. In both cases, the
teachers teach the five-paragraph formula. Both teachers take a disciplined and structured
approach to instruction, but the seventh graders in the high-SES school write far more
copiously and coherently than do their ELL peers in the low-SES school.
Both teachers are accustomed to technical problems, such as cranky hardware,
students who have trouble logging in or saving their work, or students who don’t
understand how word-wrapping works. (Because both My Access and Criterion interpret manual line feeds as paragraph delimiters, teachers have to instruct students not to use them; otherwise an essay may be unscorable.) Both teachers responded to the
technical problems gracefully, calling on spare hardware or technical support as needed.
Both felt strongly that computer experience is an essential part of twenty-first century
education, and that AES helps students gain both writing experience and computer
experience. In both cases, AES relieved the teacher of some of the burden of corrections,
comments, and grades.
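To illustrate the word-wrapping issue mentioned above: because both programs treat a hard line break as a paragraph boundary, text typed or pasted with manual line feeds can become unscorable. The sketch below is our own illustration (not a vendor tool) of joining single line breaks within paragraphs while preserving blank lines as paragraph breaks.

```python
import re

def normalize_line_breaks(text: str) -> str:
    """Join single newlines inside paragraphs; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

pasted = "The first paragraph was typed\nwith manual line feeds.\n\nA second paragraph follows."
print(normalize_line_breaks(pasted))
```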
Conclusions
In summary, the chief benefits of AES in the classroom include increased
motivation for students and easier classroom management for teachers. There is also
some evidence of increased writing practice and increased revision. The chief drawbacks
are that students who are already challenged by writing, typing, and computers may be
further stymied; and unless computer hardware and technical support are already available, AES programs require substantial investments. It is also possible that AES may be misused to
reinforce artificial, mechanistic, and formulaic writing disconnected from genuine
communication.
More research is merited on the amount of writing and revision, especially as
teachers grow in familiarity with the programs, and students become more accustomed to
writing on computers. However, we do not expect teachers to substantively change their
instructional practices, such as by relating writing to students' daily lives, encouraging deeper
revision, or addressing authentic audiences, without professional training and major
changes in the educational environment.
We expect that AES will continue to grow in popularity. We hope there is
parallel growth in the quality of feedback, especially feedback on syntax, and in the
ability of these programs to assess diverse writing styles and structures. Perhaps more
important, we hope to see professional development of teachers and new forms of
assessment, so that writing instruction can be freed from reductionist formulae and
become a tool for authentic communication, not just a tool for acing tests. The effects of
any tool depend on its use:
The main point to bear in mind is that such automated systems do not replace
good teaching but should instead be used to support it. This is particularly so with
the instruction of at-risk learners, who may lack the requisite language and
literacy skills to make effective use of automated feedback. Students must learn
to write for a variety of audiences, including not only computer scoring engines,
but peers, teachers, and interlocutors outside the classroom, and must also learn to
accept and respond to feedback from diverse readers. (Warschauer, in press, n.p.)
References
Attali, Y. (2004, April). Exploring the feedback and revision features of Criterion. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), San Diego, CA.
Bandura, A. (1985). Social foundations of thought and action: A social cognitive theory. Prentice Hall.
Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. Hillsdale, NJ: Lawrence Erlbaum Associates.
Burstein, J. (2003). The e-rater scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 113-121). Mahwah, NJ: Lawrence Erlbaum Associates.
Burstein, J., & Marcu, D. (2003). Automated evaluation of discourse structure in student essays. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum Associates.
Butler-Nalin, K. (1984). Revising patterns in students' writing. In A. N. Applebee (Ed.), Contexts for learning to write (pp. 121-133). Ablex Publishing Co.
California Department of Education Standards and Assessment Division. (2002). Teacher guide for the California Writing Standards Tests at grades 4 and 7. California Department of Education.
CCCC Executive Committee. (1996). Writing assessment: A position statement. NCTE.
Cohen, Y., Ben-Simon, A., & Hovav, M. (2003, October). The effect of specific language features on the complexity of systems for automated essay scoring. Paper presented at the 29th Annual Conference of the International Association for Educational Assessment, Manchester, UK.
Dreyfus, H. L. (1991). What computers still can't do: A critique of artificial reason. Cambridge, MA: MIT Press.
Elliot, S. M., & Mikulas, C. (2004, April). The impact of MY Access!™ use on student writing performance: A technology overview and four studies. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego, CA.
Emig, J. (1971). The composing processes of twelfth graders. Urbana, IL: National Council of Teachers of English.
Hamp-Lyons, L. (2005). Book review: M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring. Assessing Writing, 9, 262-265.
Kohut, G. F., & Gorman, K. J. (1995). The effectiveness of leading grammar/style
software packages in analyzing business students' writing. Journal of Business and
Technical Communication, 9(3), 341-361.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Lawrence Erlbaum Associates.
MacDonald, N. H., Frase, L. T., Gingrich, P. S., & Keenan, S. A. (1982). The Writer's Workbench: Computer aids for text analysis. IEEE Transactions on Communications, 30, 105-110.
Marshall, R., & Tucker, M. (1992). Thinking for a living: Work, skills, and the future of the American economy. New York: Basic Books.
National Commission on Excellence in Education. (1983). A nation at risk. Retrieved from www.ed.gov/pubs/NatAtRisk/risk.html
National Commission on Writing in America's Schools and Colleges. (2003). The neglected "R": The need for a writing revolution.
Page, E. (1967). Statistical and linguistic strategies in the computer grading of essays. Paper presented at the 1967 International Conference on Computational Linguistics.
Scardamalia, M., & Bereiter, C. (1986). Research on written composition. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 778-803). New York, NY: Macmillan.
Shermis, M. D., Burstein, J. C., & Bliss, L. (2004, April). The impact of automated essay scoring on high stakes writing assessments. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Diego, CA.
Shermis, M. D., Mzumara, H. R., Olson, J., & Harrington, S. (2001). On-line grading of student essays: PEG goes on the World Wide Web. Assessment & Evaluation in Higher Education, 26(3), 247-259.
Stanford NLP Group. (2005). Stanford parser. Retrieved December 1, 2005, from http://nlp.stanford.edu/software/lex-parser.shtml
Warschauer, M. (in press). Laptops & literacy. Teachers College Press.
Warschauer, M., & Grimes, D. (2005). First year evaluation report: Fullerton School District Laptop Program.
Zuboff, S. (1989). In the age of the smart machine: The future of work and power. Basic Books.