Essay-Grading Software Seen as Time-Saving Tool
Teachers are turning to essay-grading software to
critique student writing, but critics point to serious
flaws in the technology
By Caralee J. Adams
Jeff Pence knows the best way for his 7th grade English students to improve their writing is to do more of
it. But with 140 students, it would take him at least two weeks to grade a batch of their essays.
So the Canton, Ga., middle school teacher uses an online, automated essay-scoring program that allows
students to get feedback on their writing before handing in their work.
"It doesn't tell them what to do, but it points out where issues may exist," said Mr. Pence, who says the a
Pearson WriteToLearn program engages the students almost like a game.
With the technology, he has been able to assign an essay a week and individualize instruction efficiently. "I
feel it's pretty accurate," Mr. Pence said. "Is it perfect? No. But when I reach that 67th essay, I'm not real
accurate, either. As a team, we are pretty good."
With the push for students to become better writers and meet the new Common Core State Standards,
teachers are eager for new tools to help out. Pearson, which is based in London and New York City, is one
of several companies upgrading their technology in this space, also known as artificial intelligence, AI, or
machine reading. New assessments designed to test deeper learning and move beyond multiple-choice answers are
also fueling the demand for software to help automate the scoring of open-ended questions.
Critics contend the software doesn't do much more than count words and therefore can't replace human
readers, so researchers are working hard to improve the software algorithms and counter the naysayers.
While the technology has been developed primarily by companies in proprietary settings, there has been a
new focus on improving it through open-source platforms. New players in the market, such as the startup
venture LightSide and edX, the nonprofit enterprise started by Harvard University and the Massachusetts
Institute of Technology, are openly sharing their research. Last year, the William and Flora Hewlett
Foundation sponsored an open-source competition to spur innovation in automated writing assessments that
attracted commercial vendors and teams of scientists from around the world. (The Hewlett Foundation
supports coverage of "deeper learning" issues in Education Week.)
"We are seeing a lot of collaboration among competitors and individuals," said Michelle Barrett, the
director of research systems and analysis for CTB/McGraw-Hill, which produces the Writing Roadmap for
use in grades 3-12. "This unprecedented collaboration is encouraging a lot of discussion and transparency."
Mark D. Shermis, an education professor at the University of Akron, in Ohio, who supervised the Hewlett
contest, said the meeting of top public and commercial researchers, along with input from a variety of
fields, could help boost performance of the technology. The recommendation from the Hewlett trials is that
the automated software be used as a "second reader" to monitor the human readers' performance or provide
additional information about writing, Mr. Shermis said.
"The technology can't do everything, and nobody is claiming it can," he said. "But it is a technology that
has a promising future."
'Hot Topic'
The first automated essay-scoring systems go back to the early 1970s, but there wasn't much progress made
until the 1990s with the advent of the Internet and the ability to store data on hard-disk drives, Mr. Shermis
said. More recently, improvements have been made in the technology's ability to evaluate language,
grammar, mechanics, and style; detect plagiarism; and provide quantitative and qualitative feedback.
The computer programs assign grades to writing samples, sometimes on a scale of 1 to 6, in a variety of
areas, from word choice to organization. The products give feedback to help students improve their writing.
Others can grade short answers for content. To save time and money, the technology can be used in various
ways on formative exercises or summative tests.
The Educational Testing Service first used its e-rater automated-scoring engine for a high-stakes exam in
1999 for the Graduate Management Admission Test, or GMAT, according to David Williamson, a senior
research director for assessment innovation for the Princeton, N.J.-based company. It also uses the
technology in its Criterion Online Writing Evaluation Service for grades 4-12.
Over the years, the capabilities changed substantially, evolving from simple rule-based coding to more
sophisticated software systems. And statistical techniques from computational linguistics, natural language
processing, and machine learning have helped develop better ways of identifying certain patterns in
writing.
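To make the general approach concrete, here is a minimal sketch, in Python, of the kind of statistical method described above: extract a few surface features from essays that already carry human scores, fit a regression model, and predict a score on a 1-to-6 scale for a new essay. The features, training essays, and scores are invented for illustration and assume the scikit-learn library is available; none of this reflects the actual models used by the vendors named in this article.

    # A toy essay scorer: surface features plus linear regression.
    # Assumes scikit-learn is installed; the features, essays, and scores
    # below are invented for illustration only.
    import re
    from sklearn.linear_model import LinearRegression

    def features(essay):
        words = re.findall(r"[A-Za-z']+", essay)
        sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        avg_sent_len = len(words) / max(len(sentences), 1)
        vocab_ratio = len(set(w.lower() for w in words)) / max(len(words), 1)
        return [len(words), avg_word_len, avg_sent_len, vocab_ratio]

    # Hypothetical training set: essays already scored 1-6 by human readers.
    train_essays = [
        "Short, choppy text. It has little detail.",
        "A longer response that develops an argument with varied vocabulary "
        "and clear organization across several well-connected sentences.",
    ]
    human_scores = [2, 5]

    model = LinearRegression()
    model.fit([features(e) for e in train_essays], human_scores)

    new_essay = "The student essay to be scored would go here."
    raw = model.predict([features(new_essay)])[0]
    print("Predicted score (1-6 scale):", min(6, max(1, round(raw))))

Real engines rely on far richer features covering grammar, mechanics, organization, and content, and on much larger sets of human-scored essays, but the pattern of training on human scores is the same.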
But challenges remain in coming up with a universal definition of good writing, and in training a computer
to understand nuances such as "voice."
In time, with larger sets of data, experts can identify more nuanced aspects of writing and improve the
technology, said Mr. Williamson, who is encouraged by the new era of openness about the research.
"It's a hot topic," he said. "There are a lot of researchers and academia and industry looking into this, and
that's a good thing."
High-Stakes Testing
In addition to using the technology to improve writing in the classroom, West Virginia employs automated
software for its annual statewide reading/language arts assessments for grades 3-11. The state has worked
with CTB/McGraw-Hill to customize its product and train the engine, using thousands of papers it has
collected, to score the students' writing based on a specific prompt.
"We are confident the scoring is very accurate," said Sandra Foster, the lead coordinator of assessment and
accountability in the West Virginia education office, who acknowledged facing skepticism initially from
teachers. But many were won over, she said, after a comparability study showed that the accuracy of a
trained teacher and the scoring engine performed better than two trained teachers. Training involved a few
hours in how to assess the writing rubric. Plus, writing scores have gone up since implementing the
technology.
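Comparability studies like West Virginia's typically ask how closely the engine agrees with trained human readers. A standard agreement statistic in this field, and the evaluation metric used in the Hewlett-sponsored contest mentioned earlier, is quadratic weighted kappa; the short Python sketch below computes it for two invented sets of scores on a 1-6 scale.

    # Quadratic weighted kappa: a standard measure of agreement between two
    # sets of essay scores, e.g. a human reader and a scoring engine.
    # The score lists below are invented for illustration.

    def quadratic_weighted_kappa(a, b, min_score=1, max_score=6):
        n = max_score - min_score + 1
        observed = [[0] * n for _ in range(n)]   # counts of observed score pairs
        hist_a = [0] * n                         # rater A score histogram
        hist_b = [0] * n                         # rater B score histogram
        for x, y in zip(a, b):
            i, j = x - min_score, y - min_score
            observed[i][j] += 1
            hist_a[i] += 1
            hist_b[j] += 1
        total = len(a)
        weighted_observed = 0.0
        weighted_expected = 0.0
        for i in range(n):
            for j in range(n):
                weight = ((i - j) ** 2) / ((n - 1) ** 2)
                expected = hist_a[i] * hist_b[j] / total  # chance agreement
                weighted_observed += weight * observed[i][j]
                weighted_expected += weight * expected
        return 1.0 - weighted_observed / weighted_expected

    human = [3, 4, 2, 5, 4, 3, 6, 2]
    machine = [3, 4, 3, 5, 4, 4, 5, 2]
    print(round(quadratic_weighted_kappa(human, machine), 3))

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so a machine-human kappa that matches or exceeds the kappa between two human readers is the kind of evidence such studies look for.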
Automated essay scoring is also used on the ACT Compass exams for community college placement, the
new Pearson General Educational Development tests for a high school equivalency diploma, and other
summative tests. But it has not yet been embraced by the College Board for the SAT or the rival ACT
college-entrance exams.
The two consortia delivering the new assessments under the Common Core State Standards are reviewing
machine-grading but have not committed to it.
Jeffrey Nellhaus, the director of policy, research, and design for the Partnership for Assessment of
Readiness for College and Careers, or PARCC, wants to know if the technology will be a good fit with its
assessment, and the consortium will be conducting a study based on writing from its first field test to see
how the scoring engine performs.
Likewise, Tony Alpert, the chief operating officer for the Smarter Balanced Assessment Consortium, said
his consortium will evaluate the technology carefully.
Open-Source Options
With his new company LightSide, in Pittsburgh, owner Elijah Mayfield said his data-driven approach to
automated writing assessment sets it apart from other products on the market.
"What we are trying to do is build a system that instead of correcting errors, finds the strongest and weakest
sections of the writing and where to improve," he said. "It is acting more as a revisionist than a textbook."
The new software, which is available on an open-source platform, is being piloted this spring in districts in
Pennsylvania and New York.
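As a rough illustration of the "find the weakest sections" idea, and not of LightSide's actual algorithm, the Python sketch below scores each sentence of an essay with a crude, hand-tuned heuristic and flags the lowest-scoring ones as candidates for revision.

    # A minimal sketch of "flag the weakest sections" feedback: score each
    # sentence with a crude heuristic and surface the lowest-scoring ones
    # for revision. Invented for illustration; not LightSide's method.
    import re

    def sentence_strength(sentence):
        words = sentence.split()
        unique_ratio = len(set(w.lower() for w in words)) / max(len(words), 1)
        avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
        # Reward varied vocabulary, longer words, and moderate sentence length.
        length_score = max(1.0 - abs(len(words) - 18) / 18.0, 0.0)
        return unique_ratio + 0.1 * avg_word_len + length_score

    def weakest_sentences(essay, k=2):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", essay) if s.strip()]
        return sorted(sentences, key=sentence_strength)[:k]

    essay_text = ("The experiment was good. It was really really good. "
                  "We measured the plant growth each week and recorded the results "
                  "in a shared notebook so the whole group could compare trends.")
    for sentence in weakest_sentences(essay_text):
        print("Consider revising:", sentence)

A real system would learn what "weak" looks like from large sets of annotated writing rather than from a hand-tuned heuristic like this one.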
In higher education, edX has just introduced automated software to grade open-response questions for use
by teachers and professors through its free online courses. "One of the challenges in the past was that the
code and algorithms were not public. They were seen as black magic," said company President Anant
Agarwal, noting the technology is in an experimental stage. "With edX, we put the code into open source
where you can see how it is done to help us improve it."
Still, critics of essay-grading software, such as Les Perelman, want academic researchers to have broader
access to vendors' products to evaluate their merit. Now retired, the former director of the MIT Writing
Across the Curriculum program has studied some of the devices and was able to get a high score from one
with an essay of gibberish.
"My main concern is that it doesn't work," he said. While the technology has some limited use with grading
short answers for content, it relies too much on counting words and reading an essay requires a deeper level
of analysis best done by a human, contended Mr. Perelman.
"The real danger of this is that it can really dumb down education," he said. "It will make teachers teach
students to write long, meaningless sentences and not care that much about actual content."