Item Response Theory in the Secondary Classroom: What Rasch Modeling Can Reveal About Teachers, Students, and Tests
T. Jared Robinson
tjaredrobinson.com
David O. McKay School of Education
Brigham Young University
NRMERA, 2012, Park City, UT

Purpose
• My purpose is to show how Rasch modeling can be applied in certain secondary education situations, and how teachers, students, and tests might benefit.
• This case study examines a high school biology exam using the Rasch model to demonstrate some possible implications of item response theory (IRT) in a secondary setting.
• I also provide a very brief and basic introduction to IRT/Rasch modeling.

Context
• Brookhart (2003): Measurement theory developed for large-scale assessments is not appropriate for classroom assessment.
• McMillan (2003): Measurement specialists need to adapt to be more relevant to classroom assessment.
• Smith (2003): Traditional notions of reliability are not appropriate for classrooms.
• Plake (1993), Stiggins (1991, 1995): Empirically, teachers are under-trained in assessment and testing.
• Newfields (2006): It is still important for teachers to develop assessment literacy.
• Rudner & Schafer (2002): Teachers need to understand reliability and validity now more than ever.

Some of my assumptions
• While many types of classroom assessment defy application of measurement theory, teachers still use summative assessment in classroom settings.
• To the extent that thinking about such assessments in terms of measurement theory provides utility for teachers and students, it should be explored.

How big of an N is big enough?
• Source: John Michael Linacre, http://www.rasch.org/rmt/rmt74m.htm

Case Study Design
• This study used data from a biology test given to sophomores at a suburban high school in the mountain west.
• The test consisted of 35 multiple choice and true/false questions.
• The study analyzed data for 115 students from four different sections of a biology class, all taught by the same teacher.
• A Rasch analysis of the data was conducted using the WINSTEPS software. Results were used to identify strengths and weaknesses of the test, as well as to gauge the general knowledge of students.

Classical Test Theory Reliability
Cronbach's alpha = [k / (k - 1)] × (sum of covariance terms / sum of all terms), with k = 35 items, which gives (35/34) × (3.78 / 6.84) ≈ 0.57.
[Slide table: item variance-covariance matrix for Q1 through Q35, shown in lower-triangular form with the item variances on the diagonal. Only the summary values are reproduced here.]

Sum of variance terms           3.06
Sum of half covariance terms    1.89
Sum of covariance terms         3.78
Sum of all terms                6.84
Correction factor               1.03
Coefficient alpha               0.57

Basics of IRT/Rasch Modeling
• IRT/Rasch modeling has several advantages over Classical Test Theory. One is that we get much more information about how each individual item interacts with students as a function of their ability.
• Instead of reporting student ability scores on a percent scale of 0-100, IRT models report scores on a logit scale that has a center point of 0, with most scores ranging from -3 to +3 (although on this test, some students are above 5). Students with positive logit scores are more able than average, and students with negative logit scores are less able than average.

Scalogram
[Slide table: scalogram of correct (1) and incorrect (0) responses, with items as rows and 47 students as columns. The individual response patterns are not reproduced here; the item and student summaries follow.]

Item summaries:
Item   Prop. correct   Prop. incorrect   Odds    Log odds   Negated log odds
29     0.98            0.02              56.50   4.03       -4.03
21     0.97            0.03              37.33   3.62       -3.62
14     0.96            0.04              22.00   3.09       -3.09
18     0.96            0.04              22.00   3.09       -3.09
26     0.96            0.04              22.00   3.09       -3.09
9      0.94            0.06              15.43   2.74       -2.74
17     0.94            0.06              15.43   2.74       -2.74
11     0.92            0.08              11.78   2.47       -2.47
28     0.92            0.08              11.78   2.47       -2.47
15     0.91            0.09              10.50   2.35       -2.35
23     0.91            0.09              10.50   2.35       -2.35
8      0.90            0.10              9.45    2.25       -2.25
32     0.90            0.10              9.45    2.25       -2.25
7      0.90            0.10              8.58    2.15       -2.15
34     0.90            0.10              8.58    2.15       -2.15
3      0.89            0.11              7.85    2.06       -2.06
22     0.87            0.13              6.67    1.90       -1.90
10     0.86            0.14              6.19    1.82       -1.82
16     0.84            0.16              5.39    1.68       -1.68
25     0.84            0.16              5.39    1.68       -1.68
20     0.83            0.17              5.05    1.62       -1.62
35     0.83            0.17              5.05    1.62       -1.62
30     0.80            0.20              4.00    1.39       -1.39
12     0.78            0.22              3.60    1.28       -1.28
19     0.78            0.22              3.60    1.28       -1.28
1      0.70            0.30              2.29    0.83       -0.83
31     0.67            0.33              2.03    0.71       -0.71
2      0.63            0.37              1.74    0.55       -0.55

Student summaries (the 47 students shown, grouped by total score out of 35):
Total   Students   Prop. correct   Prop. incorrect   Odds        Log odds
35      5          1.00            0.00              undefined   undefined
34      6          0.97            0.03              34.00       3.53
33      4          0.94            0.06              16.50       2.80
32      5          0.91            0.09              10.67       2.37
31      7          0.89            0.11              7.75        2.05
30      5          0.86            0.14              6.00        1.79
29      2          0.83            0.17              4.83        1.58
28      3          0.80            0.20              4.00        1.39
27      3          0.77            0.23              3.38        1.22
26      3          0.74            0.26              2.89        1.06
25      2          0.71            0.29              2.50        0.92
24      1          0.69            0.31              2.18        0.78
23      1          0.66            0.34              1.92        0.65
Odds and log odds are undefined for perfect scores because the proportion incorrect is zero (the slide shows #DIV/0! for these students).

What Rasch modeling can teach teachers about their tests
• One useful thing about IRT is that the item difficulty estimates are also computed on the logit scale. Thus, we can easily compare item difficulty with student ability, as in the chart on the next slide.
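The log-odds columns of the scalogram, and the idea of comparing ability and difficulty on one logit scale, can be sketched in a few lines of Python. This is only an illustration, not the estimation procedure WINSTEPS uses. The function names are hypothetical, and the probability formula is the standard dichotomous Rasch form, which the slides do not state explicitly.

```python
import math


def logit(proportion):
    """Convert a proportion correct (strictly between 0 and 1) to log odds."""
    if not 0.0 < proportion < 1.0:
        # Perfect or zero scores have no finite log odds
        # (the spreadsheet on the slide shows #DIV/0! for them).
        raise ValueError("log odds are undefined for proportions of 0 or 1")
    return math.log(proportion / (1.0 - proportion))


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student with 34 of 35 items correct has odds of 34 to 1.
print(round(logit(34 / 35), 2))

# Illustrative (hypothetical) values: a student 4 logits above an item's
# difficulty is very likely to answer it correctly, so that item says
# little about how able the student really is.
print(round(rasch_probability(2.0, -2.0), 3))
```

When ability and difficulty are equal, the probability of a correct answer is 0.5, which is where an item is most informative about a student.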
TABLE 12.2 Biology Exam Rasch ZOU591WS.TXT Oct 3 21:44 2012
INPUT: 115 Persons 35 Items MEASURED: 115 Persons 35 Items 2 CATS 1.0.0
[Figure: WINSTEPS map of persons and items. Persons are plotted to the left of a vertical logit axis that runs from 5 at the top (<more>, <rare>) to -3 at the bottom (<less>, <frequ>), and items I0001 through I0035 are plotted to the right. Each '#' is 2 persons. M, S, and T markers appear along the axis for both distributions. I0002 is the highest-placed item and I0005 the lowest.]

What does this mean?
• By default, WINSTEPS centers the logit scale so that 0 falls at the mean item difficulty. This map shows visually that most of the questions are much easier than these students' ability levels.
• Students like this because it means that they get a good grade on the test. But this is not a good situation from a measurement perspective. A test with a pattern like the one above cannot reliably distinguish differences in ability among most of the students.

Test Information Function

What does this mean?
• This graph illustrates that this test gives a lot of information about students with an ability score between about -2 and +2, with the amount of information dropping off sharply outside that range.
• In areas of the graph where information is high, student scores are measured with little error. In areas where information is low, there is a lot of error in estimating student scores.

What does this tell us about this test?
• This mismatch between student ability and item difficulty leads to low score reliability. In this case, the reliability of the estimates of student ability is just .34. You want it to be much closer to .90 or even higher.
• For example, 18 of the 115 students got a score of 32/35, or 91%. In reality, these estimates are pretty rough, because we don’t have any questions at that difficulty level. Those students are probably not identical in ability or knowledge, but the test is designed in a way that keeps us from knowing their ability with any kind of precision.

Limitations
• Evidence of multi-dimensionality, violating some key assumptions of Cronbach’s alpha and Rasch measurement
• Only looking at one limited case
▫ Difficulty level gap might be non-representative
▫ Rasch modeling might be less appropriate in other schools with different testing procedures
▫ Only useful to the extent that it is plausible for teachers to get access to and understand the analysis

Conclusions
• This case is one example of where Rasch modeling has utility in understanding a test and the students who took it.
• Rasch software presents visual tools that may be easier for teachers to interpret than traditional reliability concepts.
• In instances where teachers teach multiple sections of one subject, or where assessments are common across teachers, Rasch modeling can be used to produce stable estimates in secondary settings.