Holistic Scores of Automated Writing Evaluation
Consistency, Perceptions, and Use
Zhi Li, Stephanie Link, Hong Ma, Hyejin Yang, Volker Hegelheimer
Language Testing Research Colloquium, April 4, 2012
Iowa State University
Applied Linguistics and Technology, Department of English
Overview
• Background to the Study
• Literature Review
• Methodology
• Results
  – Consistency with Instructors’ scores
  – Instructor/Student Perceptions about scores
  – Instructor/Student Use of scores
• Discussion
• Implications/Conclusions
Background to the Study
Large research group on AWE
• Students’ linguistic development
• Formative assessment
• Pedagogical practice
Focus on Holistic Scores
AWE Tool in this Study: Criterion®
Literature Review
Previous Studies on Holistic Scores from AWE
In Testing Context
• Target: High correlation
• AWE research for testing investigated:
  – Validity and reliability of the AWE scoring systems
  – AWE as a substitute for human raters
In the Classroom
• Target: Feedback
[Diagram: citations arranged around each context. Legible entries include Attali (2004), Ben-Simon & Bennett (2007), Chen & Cheng (2008), Ebyary & Windeatt (2010), and Grimes & Warschauer (2010).]
Literature Review
Perceptions within classroom settings
Teachers vs. Students: Grimes & Warschauer (2010, 2006)
• Teachers often disagreed with scores, but found them helpful for teaching
• “Students tended to be less skeptical of the scores than teachers” (p. 10).
• Teachers showed slightly less than neutral opinions about fairness and accuracy
Problems with scoring system: Chen & Cheng (2008)
1. Favors lengthiness
2. Overemphasizes the use of transition words
3. Ignores coherence and content development
4. Discourages unconventional ways of essay writing
5. Partially reflects actual English writing ability
Literature Review
Teachers’ Use of AWE scores
Part of students’ grades or complete disregard (varied reliance)
• Chen & Cheng (2008)
• Grimes & Warschauer (2010)
Required minimum AWE score
• Chen & Cheng (2008)
Students’ Use of AWE scores
Scores motivated students
• Ebyary & Windeatt (2010)
Low scores led to higher improvement
• Attali (2004)
Gaps in the area of AWE
• Few studies on actual use of AWE scores in the classroom context.
• More in-depth studies on AWE scoring quality, student/instructor perception and use are needed.
Purpose:
• to investigate how AWE scores are used in the classroom for assessment purposes.
Research Questions
Scoring Consistency between Criterion scores and Instructor scores
(RQ1) How well are Criterion holistic scores correlated with the instructors’ rating?
Perceptions and Use of Criterion Scores
(RQ2) What are the instructors’ perceptions and use of Criterion holistic scores in the classroom?
(RQ3) What are the learners’ perceptions and use of Criterion holistic scores to improve their writing?
Methodology
Setting
ESL Academic Writing Course (Engl101C)
Participants
3 ESL Academic Writing Instructors
-Trained raters for English Placement Test (EPT) writing
-Experienced ESL writing instructors
-Proficient in technology use
67 ESL students
-Intermediate proficiency level
-Majority from Asian backgrounds
Methodology
Materials
• 2 major paper assignments
  – Paper 1: Narrative (500+ words)
  – Paper 4: Argumentative (900+ words)
• Scores from Criterion® and instructors
• Rating training
  – Reliability (α): Paper 1: .506; Paper 4: .203
  – Averaged scores of two closest ratings used
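The rule of averaging the two closest ratings can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name and the example ratings are hypothetical, and the slides do not say how ties between equally close pairs were resolved (here the first such pair is kept).

```python
from itertools import combinations


def average_two_closest(ratings):
    """Return the mean of the two ratings that differ the least.

    Assumption: when several pairs are equally close, the first pair
    in input order is used. The slides do not specify a tie rule.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    # Compare every pair of ratings and keep the pair with the smallest gap.
    a, b = min(combinations(ratings, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2


# Hypothetical ratings from three instructors on the 0-80 scale.
print(average_two_closest([70, 66, 72]))  # 70 and 72 are closest
```

With three raters, this discards the most divergent rating before averaging, which reduces the influence of a single outlying rater on the instructor score.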
Methodology
Data Collection (Fall 2011)
Data Analysis (Spring 2012)
Paper 1
Paper 2 (No holistic score)
Paper 3 (No holistic score)
Paper 4
1st individual student
questionnaires
Teacher focus group
interviews
4th individual student
questionnaires
individual interviews
questionnaire
2nd individual
interviews
Methodology
Instructor: Rubric (Score: 0-100)
Material (20 Points)
Fully explains an event, including historical or other contextual details necessary to understand the effects of that event, and details the effects that event had on the student’s life or the life of someone known to the student.
Organization (20 Points)
Material is organized appropriately to allow readers to clearly understand the author’s stance and how information in each paragraph supports that position.
Expression (20 Points)
Uses appropriate vocabulary (including transitional devices) and sentence structure to convey meaning clearly and maintain a reader’s interest.
Correctness (20 Points)
Uses appropriate word choice, sentence structure, punctuation, and spelling with few grammatical errors.
Paper Process (20 Points)
Completes each step of the revision process. If any of the below steps are not completed due to an absence or failure to complete an updated draft for the day it is due, points will be deducted from the process grade.
Score 0-80 used for correlations
Methodology
Criterion: Scoring Guide (Score: 1-6)
Score of 6
You have put together a convincing argument.
Here are some of the strengths evident in your writing:
Your essay:
• Looks at the topic from a number of angles and responds to all aspects [Material]
• Responds thoughtfully and insightfully to the issues in the topic [Material]
• Develops with a superior structure and apt reasons or examples [Organization]
• Uses sentence styles and language that have impact and energy [Expression]
• Demonstrates that you know the mechanics of correct sentence structure [Correctness]
Methodology
Data Analysis
● RQ1: Score Correlation
  – Correlation analysis: Spearman rho
● RQ2: Perception
  – Descriptive analysis: student questionnaires
  – A priori → inductive coding: interviews
● RQ3: Use
  – A priori coding: instructor questionnaires
  – A priori → inductive coding: interviews
RQ1: Consistency
Correlation of Criterion scores and Instructor Scores (Spearman rho)
                                     Criterion Score Paper 1 (N)   Criterion Score Paper 4 (N)
Averaged Instructor Score Paper 1    .426** (47)
Averaged Instructor Score Paper 4                                  .129 (42)
Note: **. Correlation is significant at the 0.01 level (2-tailed). *. Correlation is significant at the 0.05 level (2-tailed).
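As a rough check on the significance flags, the two coefficients can be converted to a t statistic with n - 2 degrees of freedom. This is a common large-sample approximation and an assumption here; the slides do not state how the p-values were obtained. The function name is hypothetical.

```python
import math


def t_statistic(rho, n):
    """Approximate t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))


# Values reported on the slide: Paper 1 (rho = .426, N = 47), Paper 4 (rho = .129, N = 42).
print(round(t_statistic(0.426, 47), 2))  # about 3.16
print(round(t_statistic(0.129, 42), 2))  # about 0.82
```

Under this approximation, the Paper 1 value exceeds the two-tailed .01 critical value for 45 degrees of freedom (about 2.69), while the Paper 4 value falls well below the two-tailed .05 critical value for 40 degrees of freedom (about 2.02), which agrees with the flags in the table.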
RQ1: Consistency
Correlation of Criterion score and Instructor analytic scores - Paper 1 & 4 (Spearman rho)
Averaged Instructor Analytic Scores    Material   Organization   Expression   Correctness
Criterion Score Paper 1 (N=47)         .279       .471**         .302*        .346*
Criterion Score Paper 4 (N=42)         .116       .178           .186         .265
Note: **. Correlation is significant at the 0.01 level (2-tailed). *. Correlation is significant at the 0.05 level (2-tailed).
RQ1: Consistency
Distribution of Instructors’ scores over Criterion holistic scores on Paper 1
Averaged Instructor Score bands: >74 [A], 72-73 [A-], 70-71 [B+], 66-69 [B], 64-65 [B-], 62-63 [C+], 58-61 [C]
Criterion score (Paper 1)    Mean instructor score (N)
6                            70.1 (16)
5                            68.8 (25)
4                            64.5 (6)
3                            0 (0)
Cell counts in extraction order (assignment to bands not recoverable): 2 2 3 8 1 1 2 5 16 1 0 0 1 1 1 0 3 0 0 0 0 0 0 0
RQ1: Consistency
Distribution of Instructors’ scores over Criterion holistic scores on Paper 4
Averaged Instructor Score bands: >74 [A], 72-73 [A-], 70-71 [B+], 66-69 [B], 64-65 [B-], 62-63 [C+], 58-61 [C]
Criterion score (Paper 4)    Mean instructor score (N)
6                            71.5 (33)
5                            71.1 (8)
4                            67.5 (1)
3                            0 (0)
Cell counts in extraction order (assignment to bands not recoverable): 11 6 5 9 4 1 3 2 1 0 0 0 0 0 0 0
RQ2: Instructor’s Perception
1) Trustworthiness of Criterion scores
Q: How much do you trust the scores from Criterion?
[Bar chart: Instructors’ Rating of Trustworthiness. x-axis: rating from 6 (High trust) to 1 (Low trust); y-axis: number of respondents, 0 to 3]
RQ2: Instructor’s Perception
1) Trustworthiness of Criterion scores
[Diagram labels: High Trust, High Scores, Low Scores, Low Trust]
“So I feel like that some of [my student’s] sentences are unreadable…It’s amazing that...he got scores like 5 sometimes from Criterion, or even 6.” (Abner)
RQ2: Instructor’s Perception
2) Interpretation of Criterion scores
[Diagram labels: More Problems, High Scores, Low Scores, Not free of problems]
“Getting a high score from Criterion does not mean you are good, but if you get a low score in criterion, it means you are problematic.” (Abner)
“...and I will say 6 doesn’t mean anything” (Abbott)
RQ2: Instructors’ Use
1) Approaches to the use of Criterion scores
• Forewarning: appraisal of errors
• Benchmark: requirement of minimum scores
• Assessment: part of grade
RQ2: Instructors’ Use
1a) Criterion scores as a Forewarning
[Diagram labels: Low Scores, Pay more attention]
“…if I got 2 or 3 from Criterion, I would say that I need to pay more attention to that paper.” (Abbott)
“Use of AWE scores can help [students] realize that they still need to work on grammar and that their grammar is not good as they expected” (Teacher 3)
RQ2: Instructors’ Use
1b) Criterion scores as a Benchmark
“I ask my students to reach a certain score before their peer review section (4) and before their submission to me (5-6)…” (Teacher 1)
RQ2: Instructors’ Use
1c) Criterion scores as Assessment
“In Fall 2011 I used Criterion assignment as a midterm test and used the holistic scores as they were.” (Teacher 2)
“According to my syllabus, the students…can get…5 points for getting a score of 6 (out of 6) from Criterion.” (Teacher 3)
RQ3: Students’ Perception
1) Trustworthiness of Criterion scores
Q: How much do you trust the scores from Criterion?
Result: 4.12 (relatively high)
[Bar chart: Students’ Rating of Trustworthiness. x-axis: rating from 6 (High trust) to 1 (Low trust); y-axis: number of respondents, ticks at 0, 10, 20]
RQ3: Students’ Use
1) Approaches to the use of Criterion scores
• Motivator: push for more editing
RQ3: Students’ Use
1b) Criterion scores as Motivator
“…it’s a really powerful power to push me. Yeah, fix it over and over again” (101c315).
“Oh the holistic score, because usually I get a 5 so I want to get it 6, or a 6. So that motivates me…….” “…I once submitted it about 10 times. I did it until I get a score of 6…” (101c304).
RQ3: Students’ Perception
2) Usefulness of Criterion scores
“Criterion’s feedback always the same. Like, if you use, like my score…sometimes it’s 5, sometimes a 4. But the…explanation about your score always the same.…” (101c310)
“I...correct grammar error or any errors but, like, it didn’t give me higher score. It stay the same as what I first got.” (101c319)
Discussion / Limitation
(RQ1) Consistency
● The relatively low to moderate correlations between Criterion holistic scores and instructors’ ratings (.129 to .426) may be a result of:
  – differences in scoring rubrics,
  – rater training, and
  – the restricted range of students’ proficiency.
● The discrepancy in coefficients between Paper 1 and Paper 4 is likely due to:
  – the nature of the assignments, and
  – students’ strategies.
Discussion
(RQ2 & 3) Perception and Use
● Variability in perceptions and uses may be caused by:
  – Discrepancies between the holistic scores and the actual quality of students’ writing.
Implications & Conclusions
● Criterion scores seemed to be less beneficial for:
  • Use as a summative assessment tool.
● Criterion scores were beneficial for:
  • Use as a formative assessment tool.
    – A guide to inform students of editing issues
    – A motivator to encourage revision
Questions/Comments?
Thank you for listening
Zhi Li: zhili@iastate.edu
Hyejin Yang: hjyang@iastate.edu
Hong Ma: hma2@iastate.edu
Stephanie Link: smcross@iastate.edu
Volker Hegelheimer: volkerh@iastate.edu
http://volkerh.public.iastate.edu/awe
References
Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a Generic Approach in Automated Essay Scoring. Journal of Technology, Learning, and Assessment, 10(3). Retrieved from http://www.jtla.org.
Ben-Simon, A., & Bennett, R. E. (2007). Toward More Substantively Meaningful Automated Essay Scoring. Journal of Technology, Learning, and Assessment, 6(1). Retrieved [date] from http://www.jtla.org.
Bridgeman, B., Trapani, C., & Attali, Y. (2009). Considering fairness and validity in evaluating automated scoring, Listening, Learning, Leading. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), April 13-17, 2009, San Diego, CA.
Chen, C., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94-112. (p. 106)
Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121-142.
Grimes, D., & Warschauer, M. (2006). Automated essay scoring in the classroom. Paper presented at the American Educational Research Association.
Grimes, D., & Warschauer, M. (2010). Utility in a Fallible Tool: A Multi-Site Case Study of Automated Writing Evaluation. Journal of Technology, Learning, and Assessment, 8(6). Retrieved [date] from http://www.jtla.org.
James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11(3), 167-178.