ESL essay raters’ cognitive processes
Paula Winke and Hyojung Lim
Michigan State University
winke@msu.edu hyojung@msu.edu

This is a study of rater behavior
O How does a rater make scoring decisions? What does a rater pay attention to when rating?
O Language testers need to know if construct-irrelevant variation in scores stems from how raters approach and think about a rubric.
O Empirical studies on raters’ cognitive processes are scarce (especially with analytic scoring), and findings are not consistent.

Previous findings
O Raters focus on different features in essays when scoring and weight the different scoring categories differently (Cumming et al., 2002; Eckes, 2008; Orr, 2002).
O Sometimes they consider external features that are not even described in a rubric (Barkaoui, 2010; Lumley, 2005; Vaughan, 1991).
O Raters may have different attentional foci when scoring, and their foci may depend on
  O the scale type (holistic vs. analytic),
  O the rater’s experience (expert vs. novice rater),
  O the rater’s L1 and even L2 background.

The current study
We’d like to know…
O How raters cognitively process (i.e., use) an analytic rubric while rating ESL essays
O Whether variability in processing (difference in rubric usage) is associated with lower inter-rater reliability

Research Questions
O To which parts of an analytic rubric do raters pay the most attention (measured as total fixation duration and visit count)?
O Are inter-rater reliability statistics on the subcomponents of an analytic rubric related to the amount of attention paid to those subcomponents?

Method
O 9 raters, all ESL instructors in the same English-language program at a large Midwestern university and native speakers of English.
O Each rated 40 essays (4 prompts × 10 essays).
O Analytic rating scale: currently used in the language program; it is a modified version of the scale in Jacobs et al. (1981), with five categories: content, organization, vocabulary, language use, and mechanics.
O Tobii TX300 eye-tracker: the rubric was installed in the Tobii Studio program.

The data collection set-up
[Figure: data collection set-up; labels: Rubric, 64 cm, Essay, Score]

Procedure
Session 1 in conference room
  Two-hour rater training session. The raters worked through 7 benchmark essays with Paula. Hyojung explained the procedure.
Session 2 in Lab
  • Background questionnaire
  • Eye calibration
  • Practice rating (norming session)
  Block 1: 10 essays
  Block 2: 10 essays
Session 3 in Lab
  • Eye calibration
  • Practice rating (norming session)
  Block 3: 10 essays
  Block 4: 10 essays

The data

Data Analysis
O To quantify attention: total fixation duration (divided by the number of words in each category) and visit count
O To observe the rating process: time to first fixation, gaze plots, and heat maps (Bax & Weir, 2012)
O Inter-rater reliability: the intraclass correlation coefficient (ICC) and reliability adjusted by the Spearman-Brown prophecy formula
O Statistics: the Kruskal-Wallis test and the Mann-Whitney test (post hoc)

Results
O In general, raters read the rubric from left to right, starting with content and moving through organization, vocabulary, and language use to mechanics. Mechanics was often overlooked (71 times, to be specific).
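The Spearman-Brown prophecy formula named under Data Analysis has the standard form r_kk = k·r / (1 + (k − 1)·r), where r is a starting reliability and k is the factor by which the number of raters (or items) is multiplied. The slides do not say which r and k were used in this study, so the sketch below only illustrates the formula; the function name and the example values are hypothetical.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predict reliability when the number of raters is multiplied by k.

    Standard Spearman-Brown prophecy formula; how it was applied to the
    study's data is not stated in the slides (illustrative sketch only).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return k * reliability / (1 + (k - 1) * reliability)


if __name__ == "__main__":
    # Illustrative values, not taken from the study.
    single_rater_r = 0.50
    for k in (1, 2, 4):
        print(f"k={k}: predicted reliability = {spearman_brown(single_rater_r, k):.3f}")
```

With k = 1 the formula returns the starting reliability unchanged, which is a quick sanity check on any implementation.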
Results
O Organization received the most attention (in terms of fixation duration and visit count) and showed the highest inter-rater reliability; raters attended least to and agreed least on mechanics.

r = .90    r = .75

Columns: fixation duration (mean, in seconds, with # of words controlled); visit count; intraclass correlation coefficient (ICC); Spearman-Brown prophecy formula (SB).

Category        Fixation duration   Visit count   ICC   SB
Content         .071                4.03          .89   .82
Organization    .081                4.14          .92   .90
Vocabulary      .056                4.40          .88   .78
Language Use    .053                4.15          .90   .82
Mechanics       .041                2.57          .85   .75

Statistical results
  Fixation duration: Organization, Content >> Vocabulary, Language Use >> Mechanics
  Visit count: Vocabulary, Organization, Language Use, Content >> Mechanics

Results
O From a qualitative review of the videos and heat maps in comparison with each rater’s inter-rater reliability estimate, we believe that raters who agreed the most had common attentional foci, whereas those who agreed the least did not.

Incongruous Raters
O Raters 1 and 7 were the most incongruous pair: they had the lowest inter-rater reliability for the total score (.45) and the second-lowest reliability for content (.36) and for mechanics (.28).
O Because the scores for Essay 2 had the largest standard deviation, we looked at the heat maps for Essay 2 for raters 1 and 7.
[Heat maps: Essay 2, Rater 1; Essay 2, Rater 7]

Agreeing Raters
O Raters 6 and 8 had the highest correlation coefficient in total scores (r = .79) as well as on the sub-scores for content (r = .75) and mechanics (r = .67).
O Because the scores for Essay 8 had the smallest standard deviation, we compared the heat maps for Essay 8 for raters 6 and 8.
[Heat maps: Essay 8, Rater 6; Essay 8, Rater 8]

Discussion
O Raters’ attention and inter-rater reliability
  O More attention leads to higher inter-rater reliability with analytic scoring. (In contrast, greater care and attention decrease reliability with holistic scoring; Wolfe, 1997.)
  O Those who showed higher inter-rater reliability showed similar reading patterns: they read a relatively large area of the rubric and had common patterns of attentional foci.
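One descriptive way to look at the second research question is to correlate the per-category attention values with the per-category reliability values from the Results table. The sketch below computes a Pearson correlation between fixation duration and the Spearman-Brown column. This is an assumption about how the two columns might be related, not the authors' reported analysis, and with only five categories the result is illustrative at best. The variable and function names are hypothetical.

```python
import math

# Values copied from the Results table, in the order:
# Content, Organization, Vocabulary, Language Use, Mechanics.
fixation_per_word = [0.071, 0.081, 0.056, 0.053, 0.041]
spearman_brown_reliability = [0.82, 0.90, 0.78, 0.82, 0.75]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    r = pearson_r(fixation_per_word, spearman_brown_reliability)
    print(f"fixation duration vs. reliability: r = {r:.2f}")
```

A rank-based coefficient would be an equally reasonable choice here, given that the study's other statistics are non-parametric.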
Discussion
O The effect of the layout
  O With an analytic scale, raters’ decision-making behaviors tend to operate within the scope of the given guidelines (Smith, 2000).
  O Part of the guidelines is the order of the categories. We think that raters gave the most attention to content and organization and the least attention to mechanics because of a primacy effect.
  O It has to do with rubric real estate.

Discussion
O In Lumley’s (2005) study, the conventions of presentation (spelling, punctuation, script layout) received the second most attention after content, more attention than organization and grammar.
O In his study, the conventions of presentation came second after content in the rubric.
O This may also be evidence of the primacy effect.

Discussion
O Raters may use the rubric mainly to justify or adjust the scores for an essay on which they have already made decisions. On finishing an essay, raters seemed to know where the quality of the essay would fall in the grid of the analytic rubric.
O Those who showed higher inter-rater agreement appeared to look through more descriptors for various levels; those who didn’t seemed to stick to their initial judgment.

Limitations & Future Directions
O The eye-movement data don’t fully explain why raters paid more attention to certain categories or whether raters considered non-criterion features. Analysis of our stimulated-recall interview data is needed.
O We don’t know if there was any halo effect across essays in the rating process.
O Information is lacking on how raters read the essays and how they went back and forth between the essays and the rating scale.
O We have collected data for a second study in which both the rubric and essay are on screen, and data for a third study to investigate potential halo effects.

Questions or comments?
Paula Winke winke@msu.edu
Hyojung Lim hyojung@msu.edu

Notes on Essays
O We assembled a stratified sample of 40 essays from prior ESL placement tests at a large Midwestern university. We culled four sets of 10 essays, each set from one of four scoring bands (64 and below, 65-69, 70-74, and 75 and above; see supplemental material that accompanies the online version of this manuscript). We balanced the selection of the 40 essays equally across four prompts as follows, with two to three essays at each score band being a response to each of these prompts:
  O Do you think it is better for people to make their purchases online or to go shopping in stores and malls? Use specific details and examples to explain your answer.
  O Some people say that all international students who are studying English should have an American roommate for at least one year. What is your opinion on this topic?
  O Some employees have bosses that they really like working for, while others have bosses that they absolutely hate. What are the most important qualities of a good boss at work, and why?
  O If you had the choice, would you rather take a college course online or have the same class face to face with an instructor and classmates in a classroom? Use specific details and examples to explain your answer.
O The length of student essays was limited to one page so that raters did not need to flip over pages while rating. The order of the 10 essays within each prompt set was randomized, and the order of the four prompt sets was counterbalanced across raters. A packet of 40 copied essays was ready for each rater, and raters were allowed to write on the essays while rating. Additionally, we selected two more essays for norming, both from the middle two score bands (65-74).

Notes on Time to 1st Fixation
The mean rank is the result of the Kruskal-Wallis test.

Categories      N     Mean Time   Std. Deviation   Mean Rank
Content         351   101.66      33.31            567.65
Organization    351   108.16      33.18            649.64
Vocabulary      351   123.39      38.44            838.28
Language Use    350   142.41      44.98            1030.29
Mechanics       280   163.64      55.87            1196.35

Eye fixation duration with number of words controlled
Note. Measurement units are seconds (e.g., 10.720 seconds). Mean ranks are the result of the Kruskal-Wallis test.

Categories      N     Total fixation    Number     Fixation duration (Mean)    SD     Mean Rank
                      duration (Mean)   of Words   with number of words
                                                   controlled
Content         351   10.720            151        13.766/151 = .071           .047   1050.45
Organization    351   7.576             94         9.597/94 = .081             .062   1089.23
Vocabulary      351   8.216             146        10.397/146 = .056           .037   888.95
Language Use    350   9.689             184        12.576/184 = .053           .034   843.29
Mechanics       280   3.690             89         4.133/89 = .041             .050   518.07
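
The word-count control in the fixation-duration table can be sketched as a simple division of fixation time by the number of words in each rubric category. The sketch below assumes the numerator is the "Total fixation duration (Mean)" column; dividing that column by the word counts gives the reported per-word values to three decimals. The dictionary and function names are hypothetical.

```python
# Mean total fixation duration (seconds) and rubric word counts per category,
# copied from the fixation-duration table. Using this column as the numerator
# is an assumption made for this sketch.
FIXATION_MEAN_S = {
    "Content": 10.720,
    "Organization": 7.576,
    "Vocabulary": 8.216,
    "Language Use": 9.689,
    "Mechanics": 3.690,
}
WORD_COUNT = {
    "Content": 151,
    "Organization": 94,
    "Vocabulary": 146,
    "Language Use": 184,
    "Mechanics": 89,
}


def per_word_fixation(total_seconds: float, n_words: int) -> float:
    """Return fixation time per rubric word, rounded to three decimals."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return round(total_seconds / n_words, 3)


per_word = {
    category: per_word_fixation(FIXATION_MEAN_S[category], WORD_COUNT[category])
    for category in FIXATION_MEAN_S
}

if __name__ == "__main__":
    for category, value in per_word.items():
        print(f"{category:<13} {value:.3f}")
```

Normalizing by word count keeps longer rubric cells, such as Language Use with 184 words, from appearing to attract more attention simply because they take longer to read.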