ESL essay raters’ cognitive processes
Paula Winke and Hyojung Lim
Michigan State University
winke@msu.edu
hyojung@msu.edu
This is a study of rater behavior
O How does a rater make scoring decisions? What does a rater
pay attention to when rating?
This is a study of rater behavior
O Language testers need to know if construct-irrelevant
variation in scores stems from how raters approach and think
about a rubric.
This is a study of rater behavior
O Empirical studies on raters’ cognitive processes are scarce
(especially with analytic scoring), and findings are not
consistent.
Previous findings
O Raters focus on different features in essays when scoring
and weight the scoring categories differently (Cumming et al.,
2002; Eckes, 2008; Orr, 2002).
Previous findings
O Sometimes they consider external features that are not even
described in a rubric (Barkaoui, 2010; Lumley, 2005; Vaughan,
1991).
Previous findings
O Raters may have different attentional foci when scoring,
and their foci may depend on
O the scale type (holistic vs. analytic),
O the rater’s experience (expert vs. novice rater),
O the rater’s L1 and even L2 background.
The current study
We’d like to know…
O How raters cognitively process (i.e., use) an
analytic rubric while rating ESL essays
O Whether variability in processing (difference
in rubric usage) is associated with lower
inter-rater reliability
Research Questions
O To which parts of an analytic rubric do raters
pay the most attention (measured as total
fixation duration and visit count)?
O Are inter-rater reliability statistics on the
subcomponents of an analytic rubric related
to the amount of attention paid to those
subcomponents?
Method
O 9 raters, all ESL instructors in the same English-
language program at a large, Midwestern
university and native speakers of English.
O Each rated 40 essays (4 prompts × 10 essays).
O Analytic rating scale: Currently used at the
language program; a modified version of the
Jacobs et al. (1981) scale with five categories:
content, organization, vocabulary, language use,
and mechanics.
O Tobii TX300 eye-tracker: The rubric was installed in
the Tobii Studio program.
The data collection set-up
[Figure: data collection set-up, with labels Rubric, Essay, Score, and 64cm]
Procedure
O Session 1 in conference room
• Two-hour rater training session
• The raters worked through 7 benchmark essays with Paula.
• Hyojung explained the procedure.
O Session 2 in Lab
• Background questionnaire
• Eye calibration
• Practice rating (norming session)
• Block 1: 10 essays
• Block 2: 10 essays
O Session 3 in Lab
• Eye calibration
• Practice rating (norming session)
• Block 3: 10 essays
• Block 4: 10 essays
The data
Data Analysis
O To quantify attention: total fixation duration (divided
by the number of words in each category) and visit
count
O To observe a rating process: time to first fixation,
gaze plots, and heat maps (Bax & Weir, 2012)
O Inter-rater reliability: the intraclass correlation coefficient
(ICC) and reliability adjusted by the Spearman-Brown
prophecy formula
O Statistics: the Kruskal-Wallis test and Mann-Whitney
(post hoc) tests
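The Spearman-Brown adjustment named above can be sketched as follows. This is a minimal illustration of the standard prophecy formula, r_k = k·r / (1 + (k − 1)·r); the slides do not state which value of k was used, so the function name and the input values here are hypothetical.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predict reliability when the number of raters is multiplied by k.

    reliability: reliability of the current measurement (for example, a
                 single-rater reliability estimate)
    k:           factor by which the number of raters changes (k=2 doubles it)
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return k * reliability / (1 + (k - 1) * reliability)


# Hypothetical input (not from the study): a reliability of .50
# stepped up to twice as many raters.
print(round(spearman_brown(0.50, 2), 3))  # 0.667
```

With k = 1 the function returns the input unchanged, which is a quick sanity check on the formula.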
Results
O In general, raters read the rubric from left to right:
content, organization, vocabulary, language use, and then
mechanics. Mechanics was often overlooked (71 times, to be
specific).
Results
O Organization received the most attention (in terms of
fixation duration and visit count) and showed the highest
inter-rater reliability; raters attended least to and agreed
least on mechanics.
r = .90
r = .75
Category       Fixation duration (mean)    Visit count              Intraclass    Spearman-Brown
               in seconds with # of                                 coefficient   prophecy formula
               words controlled
Content        .071                        4.03                     .89           .82
Organization   .081                        4.14                     .92           .90
Vocabulary     .056                        4.40                     .88           .78
Language Use   .053                        4.15                     .90           .82
Mechanics      .041                        2.57                     .85           .75
Statistical    Organization, Content >>    Vocab., Organization,
results        Vocab., Lang. >>            Lang., Content >>
               Mechanics                   Mechanics
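One way to explore the second research question with the category-level values in the table above is a simple Pearson correlation between an attention measure and a reliability estimate. This is only a descriptive sketch over five data points and is not necessarily the analysis the authors ran; the helper `pearson_r` and the variable names are introduced here for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the table: content, organization, vocabulary,
# language use, mechanics (in that order).
fixation_per_word = [0.071, 0.081, 0.056, 0.053, 0.041]
visit_count = [4.03, 4.14, 4.40, 4.15, 2.57]
spearman_brown_rel = [0.82, 0.90, 0.78, 0.82, 0.75]

r_fixation = pearson_r(fixation_per_word, spearman_brown_rel)
r_visits = pearson_r(visit_count, spearman_brown_rel)
print(f"fixation vs. reliability: r = {r_fixation:.2f}")
print(f"visit count vs. reliability: r = {r_visits:.2f}")
```

Because each list has only five entries, the coefficients are best read as a summary of the table rather than as an inferential result.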
Results
O From a qualitative review of the videos and heat maps in
comparison with each rater’s inter-rater reliability estimate,
we believe that raters who agreed the most had common
attentional foci, whereas those who agreed the least did not.
Incongruous Raters
O Raters 1 and 7 were the most incongruous pair: they had
the lowest inter-rater reliability for the total score (.45)
and the second-lowest reliability for content (.36) and for
mechanics (.28).
O Because the scores for Essay 2 had the largest standard
deviation, we looked at the heat maps for Essay 2 for
Raters 1 and 7.
[Heat maps: Essay 2, Rater 1; Essay 2, Rater 7]
Agreeing Raters
O Raters 6 and 8 had the highest correlation coefficient in
total scores (r=.79) as well as on the sub-scores for content
(r=.75) and mechanics (r=.67).
O Given that the scores for Essay 8 showed the smallest
standard deviation, the heat maps for Essay 8 were compared
between Raters 6 and 8.
[Heat maps: Essay 8, Rater 6; Essay 8, Rater 8]
Discussion
O Raters’ attention and inter-rater reliability
O More attention leads to higher inter-rater
reliability with analytic scoring. (In contrast, greater
care and attention decrease reliability with
holistic scoring; Wolfe, 1997.)
O Those who showed higher inter-rater
reliability showed similar reading patterns:
reading a relatively large area of the
rubric and having common patterns of
attentional foci.
Discussion
O The effect of the layout
O With an analytic scale, raters’ decision-making
behaviors tend to operate within the scope of
the given guidelines (Smith, 2000).
O Part of the guidelines is the order of the
categories. We think that raters gave their
most attention to content and organization
and their least attention to mechanics
because of a primacy effect.
O It has to do with rubric real estate.
Discussion
O In Lumley’s (2005) study, the conventions of
presentation (spelling, punctuation, script layout)
received the second most attention after content,
more attention than organization and grammar.
O In his study, the conventions of presentation came
second after content in the rubric.
O This may also be evidence of a primacy effect.
Discussion
O Raters may use the rubric mainly to justify or adjust
the scores for an essay on which they have already
made decisions. Upon finishing an essay, raters seemed
to know where the quality of the essay would fall in
the grid of the analytic rubric.
O Those who showed higher inter-rater agreement
appeared to look through more descriptors for various
levels; those who didn’t seemed to stick to their initial
judgment.
Limitations & Future Directions
O The eye-movement data don’t fully explain why raters
paid more attention to certain categories or whether
raters considered non-criterion features. Analysis
of our stimulated-recall interview data is needed.
O We don’t know if there was any halo effect across
essays in the rating process.
O Information is lacking on how raters read the essays
and how they went back and forth between the
essays and the rating scale.
O We have collected data for a second study in which
both the rubric and essay are on screen, and data for a
third study to investigate potential halo effects.
Questions or comments?
Paula Winke
winke@msu.edu
Hyojung Lim
hyojung@msu.edu
Notes on Essays
O We assembled a stratified sample of 40 essays from prior ESL placement tests at a
large Midwestern university. We culled four sets of 10 essays, each set from one of four
scoring bands (64 and below, 65-69, 70-74, and 75 and above: see supplemental
material that accompanies the online version of this manuscript). We balanced the
selection of the 40 essays equally across four prompts as follows, with two to three
essays at each score band being a response to one of these prompts:
O Do you think it is better for people to make their purchases online or to go shopping in
stores and malls? Use specific details and examples to explain your answer.
O Some people say that all international students who are studying English should have an
American roommate for at least one year. What is your opinion on this topic?
O Some employees have bosses that they really like working for, while others have bosses
that they absolutely hate. What are the most important qualities of a good boss at work,
and why?
O If you had the choice, would you rather take a college course online or have the same
class face to face with an instructor and classmates in a classroom? Use specific details
and examples to explain your answer.
O The length of student essays was limited to one page so that raters did not need to flip
over pages while rating. The order of the 10 essays within each prompt set was randomized,
and the order of the four prompt sets was counterbalanced across raters. A packet of 40
copied essays was prepared for each rater, and raters were allowed to write on the essays
while rating. Additionally, we selected two more essays for norming, both from the
middle two score bands (65-74).
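The ordering scheme described in these notes (essays randomized within each prompt set, prompt sets counterbalanced across raters) could be generated along the following lines. This is a sketch only: the notes do not say how the counterbalancing was done, so the cyclic rotation, the function name `build_rating_order`, the essay IDs, and the seeds are all assumptions made for illustration.

```python
import random


def build_rating_order(rater_index, prompt_sets, seed=None):
    """Return one rater's essay order.

    Prompt sets are rotated cyclically by rater (an assumed counterbalancing
    scheme), and essays are shuffled within each prompt set.
    """
    rng = random.Random(seed)
    shift = rater_index % len(prompt_sets)
    rotated = prompt_sets[shift:] + prompt_sets[:shift]
    order = []
    for essays in rotated:
        block = list(essays)  # copy so the caller's list is not shuffled
        rng.shuffle(block)
        order.extend(block)
    return order


# Hypothetical essay IDs: 4 prompts x 10 essays, e.g. "P1-E1".
prompt_sets = [[f"P{p}-E{e}" for e in range(1, 11)] for p in range(1, 5)]

for rater in range(9):
    order = build_rating_order(rater, prompt_sets, seed=rater)
    # Show the prompt of the first essay in each block of 10.
    print(rater + 1, [essay.split("-")[0] for essay in order[::10]])
```

With nine raters and four prompt sets, a rotation like this cannot balance every position perfectly; one of the four starting orders is used by three raters and the others by two.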
Notes on Time to 1st Fixation
The mean rank is the result of the Kruskal-Wallis test.
Categories     N     Mean Time   Std. Deviation   Mean Rank
Content        351   101.66      33.31            567.65
Organization   351   108.16      33.18            649.64
Vocabulary     351   123.39      38.44            838.28
Language Use   350   142.41      44.98            1030.29
Mechanics      280   163.64      55.87            1196.35
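The mean ranks in this table come from pooling all observations, ranking them, and averaging the ranks within each category. A minimal pure-Python sketch of that step and of the Kruskal-Wallis H statistic is given below; it omits the tie correction and the p-value, and the input times are hypothetical because the raw data are not part of these slides.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, mean_ranks) for a list of samples (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum = 0.0
    mean_ranks = []
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        start += len(group)
        mean_ranks.append(rank_sum / len(group))
        weighted_sum += rank_sum ** 2 / len(group)
    h = 12 / (n * (n + 1)) * weighted_sum - 3 * (n + 1)
    return h, mean_ranks


# Hypothetical time-to-first-fixation values for three rubric categories.
times = [
    [98.2, 101.5, 110.0],   # content
    [105.1, 108.4, 112.3],  # organization
    [150.7, 163.9, 171.2],  # mechanics
]
h_stat, mean_ranks = kruskal_wallis(times)
print(f"H = {h_stat:.2f}, mean ranks = {mean_ranks}")
```

A lower mean rank here means a category tended to be fixated earlier, which is how the Mean Rank column in the table can be read.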
Eye fixation duration with number of words controlled
Note. Measurement units are seconds (e.g. 10.720 seconds). Mean ranks are the result of the Kruskal-Wallis test.
Category       N     Total fixation    Number     Fixation duration (mean)   SD     Mean rank
                     duration (mean)   of words   with number of words
                                                  controlled
Content        351   10.720            151        13.766/151 = .071          .047   1050.45
Organization   351   7.576             94         9.597/94 = .081            .062   1089.23
Vocabulary     351   8.216             146        10.397/146 = .056          .037   888.95
Language Use   350   9.689             184        12.576/184 = .053          .034   843.29
Mechanics      280   3.690             89         4.133/89 = .041            .050   518.07
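The word-count control in this table amounts to dividing a category's fixation duration by the number of words in that part of the rubric. The short sketch below reproduces the reported per-word values from the mean total fixation duration column and the word counts (for example, 10.720 / 151 ≈ .071); the numerators printed in the table's formula column differ from that column and are left as they appear in the source. Variable names are illustrative.

```python
# Mean total fixation duration (seconds) and rubric word counts, from the table.
mean_total_fixation = {
    "Content": 10.720,
    "Organization": 7.576,
    "Vocabulary": 8.216,
    "Language Use": 9.689,
    "Mechanics": 3.690,
}
word_count = {
    "Content": 151,
    "Organization": 94,
    "Vocabulary": 146,
    "Language Use": 184,
    "Mechanics": 89,
}

# Seconds of fixation per rubric word, so longer descriptors are not favored.
per_word = {
    category: mean_total_fixation[category] / word_count[category]
    for category in mean_total_fixation
}

for category, value in per_word.items():
    print(f"{category:<13} {value:.3f}")
```

Normalizing this way matters because the language use descriptors are about twice as long as the mechanics descriptors, so raw durations alone would overstate attention to the longer categories.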