Holistic Scores of Automated Writing Evaluation: Consistency, Perceptions, and Use
Zhi Li, Stephanie Link, Hong Ma, Hyejin Yang, Volker Hegelheimer
Language Testing Research Colloquium, April 4, 2012
Iowa State University, Applied Linguistics and Technology, Department of English

Overview
• Background to the Study
• Literature Review
• Methodology
• Results
  - Consistency with Instructors’ scores
  - Instructor/Student Perceptions about scores
  - Instructor/Student Use of scores
• Discussion
• Implications/Conclusions

Background to the Study
• Large research group on AWE
  - Students’ linguistic development
  - Formative assessment
  - Pedagogical practice
• Focus on Holistic Scores

AWE Tool in this Study: Criterion®

Literature Review
• Previous Studies on Holistic Scores from AWE
In Testing Context
• Target: High Correlation
• AWE research for testing investigated:
  - Validity and reliability of the AWE scoring systems
  - AWE as a substitute for human writing raters
[Rotated citation labels on the slide; only partly legible in the extraction, e.g., Ben-Simon & Bennett, 2007]
In the Classroom
• Target: Feedback
• Grimes & Warschauer, 2010; Ebyary & Windeatt, 2010; Chen & Cheng, 2008; Attali, 2004

Literature Review
• Perceptions within classroom settings
Teachers vs. Students: Grimes & Warschauer (2010, 2006)
Problems with scoring system: Chen & Cheng (2008)
  - Teachers often disagreed with scores, but found them helpful for teaching
  - “Students tended to be less skeptical of the scores than teachers” (p. 10).
  - Teachers showed slightly less than neutral opinions about fairness and accuracy
  1. Favors lengthiness
  2. Overemphasizes the use of transition words
  3. Ignores coherence and content development
  4. Discourages unconventional ways of essay writing
  5. Partially reflects actual English writing ability

Literature Review
• Teachers’ Use of AWE scores
  - Part of students’ grades or complete disregard (varied reliance): Chen & Cheng (2008); Grimes & Warschauer (2010)
  - Required minimum AWE score: Chen & Cheng (2008)
• Students’ Use of AWE scores
  - Scores motivated students: Ebyary & Windeatt (2010)
  - Low scores led to higher improvement: Attali (2004)

Gaps in the area of AWE
• Few studies on actual use of AWE scores in the classroom context.
• More in-depth studies on AWE scoring quality, student/instructor perception, and use are needed.
Purpose:
• To investigate how AWE scores are used in the classroom for assessment purposes.

Research Questions
Scoring Consistency between Criterion scores and Instructor scores
• (RQ1) How well are Criterion holistic scores correlated with the instructors’ rating?
Perceptions and Use of Criterion Scores
• (RQ2) What are the instructors’ perceptions and use of Criterion holistic scores in the classroom?
• (RQ3) What are the learners’ perceptions and use of Criterion holistic scores to improve their writing?

Methodology
Setting
• ESL Academic Writing Course (Engl101C)
Participants
• 3 ESL Academic Writing Instructors
  - Trained raters for English Placement Test (EPT) writing
  - Experienced ESL writing instructors
  - Proficient in technology use
• 67 ESL students
  - Intermediate proficiency level
  - Majority from Asian backgrounds

Methodology
Materials
• 2 major paper assignments
  - Paper 1: Narrative (500+ words)
  - Paper 4: Argumentative (900+ words)
• Scores from Criterion® and instructors
  - Rating training
  - Reliability (α): Paper 1: .506; Paper 4: .203
  - Averaged scores of two closest ratings used

Methodology
Data Collection (Fall 2011), Data Analysis (Spring 2012)
• Papers: Paper 1, Paper 2 (no holistic score), Paper 3 (no holistic score), Paper 4
• Student questionnaires (1st and 4th)
• Individual interviews (1st and 2nd)
• Teacher focus group interviews
• Questionnaire

Methodology
• Instructor Rubric (Score: 0-100)
Material (20 Points): Fully explains an event, including historical or other contextual details necessary to understand the effects of that event, and details the effects that event had on the student’s life or the life of someone known to the student.
Organization (20 Points): Material is organized appropriately to allow readers to clearly understand the author’s stance and how information in each paragraph supports that position.
Expression (20 Points): Uses appropriate vocabulary, including transitional devices, and sentence structure to convey meaning clearly and maintain a reader’s interest.
Correctness (20 Points): Uses appropriate word choice, sentence structure, punctuation, and spelling with few grammatical errors.
Paper Process (20 Points): Completes each step of the revision process. If any of the below steps are not completed due to an absence or failure to complete an updated draft for the day it is due, points will be deducted from the process grade.
Score 0-80 used for correlations

Methodology
• Criterion Scoring Guide (Score: 1-6)
Score of 6
You have put together a convincing argument. Here are some of the strengths evident in your writing. Your essay:
• Looks at the topic from a number of angles and responds to all aspects [Material]
• Responds thoughtfully and insightfully to the issues in the topic [Material]
• Develops with a superior structure and apt reasons or examples [Organization]
• Uses sentence styles and language that have impact and energy [Expression]
• Demonstrates that you know the mechanics of correct sentence structure [Correctness]

Methodology
Data Analysis
• RQ1: Score Correlation
  - Correlation analysis: Spearman rho
• RQ2: Perception
  - Descriptive analysis: student questionnaires
  - A priori → inductive coding: interviews
• RQ3: Use
  - A priori coding: instructor questionnaires
  - A priori → inductive coding: interviews

RQ1: Consistency
Correlation of Criterion scores and Instructor Scores (Spearman rho)

                             Averaged Instructor Score (same paper)   N
Criterion Score, Paper 1     .426**                                   47
Criterion Score, Paper 4     .129                                     42

Note: ** Correlation is significant at the 0.01 level (2-tailed). * Correlation is significant at the 0.05 level (2-tailed).
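
As an illustration of the kind of rank correlation reported for RQ1, the sketch below computes Spearman's rho in plain Python by ranking both score lists (tied values share the mean of their ranks) and taking the Pearson correlation of the ranks. The function names and the six score pairs are hypothetical and are not data or code from the study; the sketch also does not compute significance levels.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y):
        raise ValueError("both score lists must have the same length")
    if len(x) < 2:
        raise ValueError("at least two score pairs are required")
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical scores for six essays (NOT data from the study):
# Criterion holistic scores (1-6) and averaged instructor scores (0-80).
criterion_scores = [6, 5, 5, 4, 6, 5]
instructor_scores = [72.0, 68.5, 70.0, 64.0, 66.0, 69.0]
rho = spearman_rho(criterion_scores, instructor_scores)
print(f"Spearman rho = {rho:.3f}")
```

Ranking with tie-averaging matters here because a 1-6 holistic scale produces many tied Criterion scores; a statistics package would normally also report the p-value behind the significance flags in the table.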

RQ1: Consistency
Correlation of Criterion score and Instructor analytic scores, Paper 1 & 4 (Spearman rho)

Averaged Instructor Analytic Scores   Material   Organization   Expression   Correctness
Criterion Score, Paper 1 (N=47)       .279       .471**         .302*        .346*
Criterion Score, Paper 4 (N=42)       .116       .178           .186         .265

Note: ** Correlation is significant at the 0.01 level (2-tailed). * Correlation is significant at the 0.05 level (2-tailed).

RQ1: Consistency
Distribution of Instructors’ scores over Criterion holistic scores on Paper 1

Averaged instructor score bands: >74 [A], 72-73 [A-], 70-71 [B+], 66-69 [B], 64-65 [B-], 62-63 [C+], 58-61 [C]

Criterion score   Mean instructor score (N)
6                 70.1 (16)
5                 68.8 (25)
4                 64.5 (6)
3                 0 (0)

Cell counts in extraction order (column alignment not recoverable): 2 2 3 8 1 1 2 5 16 1 0 0 1 1 1 0 3 0 0 0 0 0 0 0

RQ1: Consistency
Distribution of Instructors’ scores over Criterion holistic scores on Paper 4

Averaged instructor score bands: >74 [A], 72-73 [A-], 70-71 [B+], 66-69 [B], 64-65 [B-], 62-63 [C+], 58-61 [C]

Criterion score   Mean instructor score (N)
6                 71.5 (33)
5                 71.1 (8)
4                 67.5 (1)
3                 0 (0)

Cell counts in extraction order (column alignment not recoverable): 11 6 5 9 4 1 3 2 1 0 0 0 0 0 0 0

RQ2: Instructor’s Perception
1) Trustworthiness of Criterion scores
Q: How much do you trust the scores from Criterion?
[Bar chart: Instructors’ Rating of Trustworthiness. x-axis: rating from 6 (high trust) to 1 (low trust); y-axis: number of respondents (0-3)]

RQ2: Instructor’s Perception
1) Trustworthiness of Criterion scores
[Diagram labels: High Trust, Low Trust, High Scores, Low Scores]
“So I feel like that some of [my student’s] sentences are unreadable…It’s amazing that...he got scores like 5 sometimes from Criterion, or even 6.” (Abner)

RQ2: Instructor’s Perception
2) Interpretation of Criterion scores
• Low scores: more problems
• High scores: not free of problems
“Getting a high score from Criterion does not mean you are good, but if you get a low score in criterion, it means you are problematic.” (Abner)
“...and I will say 6 doesn’t mean anything” (Abbott)

RQ2: Instructors’ Use
1) Approaches to the use of Criterion scores
• Forewarning: appraisal of errors
• Benchmark: requirement of minimum scores
• Assessment: part of grade

RQ2: Instructors’ Use
1a) Criterion scores as a Forewarning
• Low scores: pay more attention
“…if I got 2 or 3 from Criterion, I would say that I need to pay more attention to that paper.” (Abbott)
“Use of AWE scores can help [students] realize that they still need to work on grammar and that their grammar is not good as they expected” (Teacher 3)

RQ2: Instructors’ Use
1b) Criterion scores as a Benchmark
“I ask my students to reach a certain score before their peer review section (4) and before their submission to me (5-6) …” (Teacher 1)

RQ2: Instructors’ Use
1c) Criterion scores as Assessment
“In
Fall 2011 I used Criterion assignment as a midterm test and used the holistic scores as they were.” (Teacher 2)
“According to my syllabus, the students… can get…5 points for getting a score of 6 (out of 6) from Criterion.” (Teacher 3)

RQ3: Students’ Perception
1) Trustworthiness of Criterion scores
Q: How much do you trust the scores from Criterion?
Result: 4.12 (relatively high)
[Bar chart: Students’ Rating of Trustworthiness. x-axis: rating from 6 (high trust) to 1 (low trust); y-axis: number of respondents (0-20)]

RQ3: Students’ Use
1) Approaches to the use of Criterion scores
• Motivator: push for more editing

RQ3: Students’ Use
1b) Criterion scores as Motivator
“…it’s a really powerful power to push me. Yeah, fix it over and over again” (101c315).
“Oh the holistic score, because usually I get a 5 so I want to get it 6, or a 6. So that motivates me…”
“…I once submitted it about 10 times. I did it until I get a score of 6…” (101c304).

RQ3: Students’ Perception
2) Usefulness of Criterion scores
“Criterion’s feedback always the same. Like, if you use, like my score…sometimes it’s 5, sometimes a 4. But the…explanation about your score always the same.…” (101c310)
“I...correct grammar error or any errors but, like, it didn’t give me higher score.
It stay the same as what I first got.” (101c319)

Discussion / Limitation
(RQ1) Consistency
• The relatively low to moderate correlations between Criterion holistic scores and instructors’ ratings (.129 to .426) may be a result of:
  - differences in scoring rubrics,
  - rater training, and
  - the restricted range of students’ proficiency.
• The discrepancy in coefficients between Paper 1 and Paper 4 is likely due to:
  - the nature of the assignments, and
  - students’ strategies.

Discussion
(RQ2 & 3) Perception and Use
• Variability in perceptions and uses may be caused by:
  - Discrepancies between the holistic scores and the actual quality of students’ writing.

Implications & Conclusions
• Criterion scores seemed to be less beneficial for:
  - Use as a summative assessment tool.
• Criterion scores were beneficial for:
  - Use as a formative assessment tool:
    a guide to inform students of editing issues, and
    a motivator to encourage revision.

Questions/Comments?
Thank you for listening
Zhi Li: zhili@iastate.edu
Hyejin Yang: hjyang@iastate.edu
Hong Ma: hma2@iastate.edu
Stephanie Link: smcross@iastate.edu
Volker Hegelheimer: volkerh@iastate.edu
http://volkerh.public.iastate.edu/awe

References
Attali, Y., Bridgeman, B., & Trapani, C. (2010). Performance of a generic approach in automated essay scoring. Journal of Technology, Learning, and Assessment, 10(3). Retrieved from http://www.jtla.org.
Ben-Simon, A., & Bennett, R. E. (2007). Toward more substantively meaningful automated essay scoring.
Journal of Technology, Learning, and Assessment, 6(1). Retrieved [date] from http://www.jtla.org.
Bridgeman, B., Trapani, C., & Attali, Y. (2009). Considering fairness and validity in evaluating automated scoring. Listening, Learning, Leading. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), April 13-17, 2009, San Diego, CA.
Chen, C., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94-112. [p. 106]
Ebyary, K., & Windeatt, S. (2010). The impact of computer-based feedback on students’ written work. International Journal of English Studies, 10(2), 121-142.
Grimes, D., & Warschauer, M. (2006). Automated essay scoring in the classroom. Paper presented at the American Educational Research Association.
Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6). Retrieved [date] from http://www.jtla.org.
James, C. L. (2006). Validating a computerized scoring system for assessing writing and placing students in composition courses. Assessing Writing, 11(3), 167-178.