Item Response Theory in the
Secondary Classroom: What Rasch
Modeling Can Reveal About
Teachers, Students, and Tests.
T. Jared Robinson
tjaredrobinson.com
David O. McKay School of Education
Brigham Young University
NRMERA, 2012, Park City, UT
Purpose
• My purpose is to show how Rasch modeling can
be applied in certain secondary education
situations, and how teachers, students, and tests
might benefit.
• This case study examines a high school biology
exam using the Rasch model in order to
demonstrate some possible implications of item
response theory (IRT) in a secondary setting.
• I also provide a very brief and basic introduction to
IRT/Rasch modeling.
Context
• Brookhart (2003)—Measurement theory developed for
large-scale assessments not appropriate for classroom
assessment.
• McMillan (2003)—Measurement specialists need to
adapt to be more relevant to classroom assessment.
• Smith (2003)—traditional notions of reliability not
appropriate for classrooms.
• Plake (1993), Stiggins (1991, 1995) —Teachers are
empirically under-trained in assessment and testing.
• Newfields (2006)—Still important for teachers to
develop assessment literacy.
• Rudner & Schafer (2002)—Teachers need to understand
reliability and validity now more than ever.
Some of my assumptions
• While many types of classroom assessment defy
application of measurement theory, teachers still
use summative assessment in classroom
settings.
• To the extent that thinking about such
assessments in terms of measurement theory
provides utility for teachers and students, it
should be explored.
How big of an N is big enough?
• Source: John Michael Linacre, http://www.rasch.org/rmt/rmt74m.htm
Case Study Design
• This study used data from a biology test given to
sophomores at a suburban high school in the
mountain west.
• The test consisted of 35 multiple choice and
true/false questions.
• The study analyzed data for 115 students from four
different sections of a biology class all taught by the
same teacher.
• A Rasch analysis of the data was conducted using the WINSTEPS
software. Results were used to identify strengths and
weaknesses of the test, as well as the general
knowledge of students.
Classical Test Theory Reliability
Cronbach's alpha = 0.57

Item variance-covariance matrix (lower triangle). Row Qn lists the
covariances of item n with items Q1 through Q(n-1), followed by the
variance of item n.

Q1: 0.21
Q2: 0.08 0.23
Q3: 0.05 0.04 0.10
Q4: 0.00 0.00 0.02 0.02
Q5: 0.00 0.00 0.00 0.00 0.00
Q6: 0.00 0.01 0.00 0.00 0.00 0.01
Q7: 0.02 0.01 -0.01 0.00 0.00 0.01 0.09
Q8: 0.01 0.01 0.02 0.01 0.00 0.00 -0.01 0.09
Q9: 0.01 0.00 0.02 0.00 0.00 0.00 -0.01 0.00 0.06
Q10: 0.01 0.02 0.00 0.00 0.00 0.01 0.00 0.02 0.00 0.12
Q11: -0.01 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 -0.01 0.07
Q12: 0.04 0.01 0.02 0.00 0.00 0.00 0.01 0.00 0.02 -0.02 0.02 0.17
Q13: -0.01 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.02
Q14: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04
Q15: -0.01 0.00 0.01 0.01 0.00 0.01 0.01 0.00 -0.01 0.00 0.01 -0.01 0.00 0.00 0.08
Q16: 0.00 0.02 0.01 0.01 0.00 0.00 0.02 -0.01 0.02 -0.02 0.01 0.01 0.01 0.01 -0.01 0.13
Q17: 0.02 0.00 0.01 0.01 0.00 0.00 0.00 0.01 0.00 -0.01 0.00 0.01 0.00 0.00 -0.01 0.01 0.06
Q18: 0.02 0.02 0.01 0.00 0.00 0.00 0.00 0.00 0.00 -0.01 0.00 -0.01 0.00 0.01 0.00 -0.01 0.00 0.04
Q19: 0.00 -0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 -0.02 0.00 0.00 0.00 0.00 0.02 -0.02 0.00 0.02 0.17
Q20: 0.00 0.02 -0.02 0.00 0.00 0.01 0.02 0.00 0.00 0.01 0.00 0.00 0.01 0.03 0.01 0.01 -0.01 0.01 -0.03 0.14
Q21: -0.01 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.01 0.00 0.00 0.00 -0.01 0.00 0.03
Q22: 0.01 0.00 -0.01 0.01 0.00 0.01 0.01 0.00 -0.01 0.01 0.00 0.02 0.00 0.01 0.00 0.01 0.01 0.01 0.02 0.02 0.00 0.11
Q23: 0.01 0.00 0.02 0.00 0.00 0.00 0.00 -0.01 0.01 0.00 0.00 0.02 0.00 0.00 0.00 0.00 -0.01 0.00 0.00 0.01 0.00 0.01 0.08
Q24: 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.01
Q25: -0.02 0.00 0.01 0.00 0.00 0.01 -0.01 0.00 0.01 0.00 0.00 -0.01 0.00 0.00 0.01 0.01 0.00 -0.01 -0.03 0.03 0.00 0.02 0.01 0.00 0.13
Q26: 0.01 0.01 0.02 0.01 0.00 0.00 0.00 0.00 0.00 -0.01 0.00 0.02 0.01 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.01 0.04
Q27: -0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.02
Q28: 0.01 0.00 -0.01 0.00 0.00 0.00 -0.01 0.01 0.00 0.01 -0.01 -0.01 0.00 0.01 -0.01 0.00 0.00 0.00 -0.01 0.02 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.07
Q29: -0.01 -0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02
Q30: 0.00 0.01 0.00 0.00 0.00 0.01 0.01 -0.01 0.01 -0.01 0.00 0.01 0.01 -0.01 0.01 0.01 -0.01 -0.01 -0.01 0.01 -0.01 -0.01 0.01 0.00 0.00 0.01 0.00 -0.01 0.00 0.16
Q31: 0.01 0.04 0.02 0.01 0.00 0.01 0.02 0.02 0.00 0.00 0.02 0.04 0.01 0.00 0.02 0.03 -0.01 0.00 0.03 0.01 0.00 -0.01 0.00 0.00 0.00 0.02 0.00 -0.01 0.01 0.06 0.22
Q32: 0.01 0.01 0.02 0.01 0.00 0.00 -0.01 0.01 -0.01 -0.01 0.02 0.01 0.01 0.00 0.00 0.00 -0.01 0.00 0.02 0.00 0.00 0.00 0.01 0.00 -0.01 0.01 0.00 0.02 0.01 0.02 0.04 0.09
Q33: 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.00 0.01
Q34: 0.01 0.00 0.00 0.01 0.00 0.00 0.02 0.00 -0.01 -0.01 0.01 0.00 0.01 0.00 -0.01 0.01 0.01 0.00 -0.01 0.00 0.01 0.00 -0.01 0.00 0.00 0.01 0.01 0.00 0.00 0.01 0.02 -0.01 0.01 0.09
Q35: 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 -0.01 0.00 0.00 -0.01 0.02 0.01 0.00 0.00 0.00 0.01 0.01 0.02 0.01 0.03 0.00 0.03 0.01 0.04 0.01 0.02 0.14

Summary
Sum of variance terms (diagonal): 3.06
Sum of covariance terms below the diagonal (half of the covariances): 1.89
Sum of covariance terms: 3.78
Sum of all terms: 6.84
Correction factor: 1.03
Coefficient alpha: 0.57
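The summary figures on this slide can be checked with a few lines of Python. This is a minimal sketch, not the spreadsheet used for the slide: it assumes the usual coefficient alpha formula, with the correction factor taken as k/(k-1) for k = 35 items, and the function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of per-student item-score lists."""
    k = len(responses[0])  # number of items
    # Sum of the diagonal of the covariance matrix (the item variances).
    sum_item_var = sum(variance([row[j] for row in responses]) for j in range(k))
    # The variance of total scores equals the sum of all covariance-matrix terms.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - sum_item_var / total_var)


def alpha_from_sums(k, sum_item_var, sum_all_terms):
    """Same formula, starting from the summary sums reported on the slide."""
    return (k / (k - 1)) * (1.0 - sum_item_var / sum_all_terms)


# Values from the slide: 35 items, sum of variances 3.06, sum of all terms 6.84.
slide_alpha = alpha_from_sums(35, 3.06, 6.84)
print(round(slide_alpha, 2))  # 0.57
```

The first function works from raw scored responses (one list of 0/1 scores per student); the second only needs the two sums, which is enough to confirm the reported value.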
Basics of IRT/Rasch Modeling
• IRT/Rasch modeling has several advantages over
Classical Test Theory. One is that we get much more
information about how each individual item
interacts with students as a function of their ability.
• Instead of reporting student ability scores on a
percent scale of 0-100, IRT models report scores on a
logit scale that has a center point of 0, with most scores
ranging from -3 to +3 (although on this test, some
students are above 5). Students with positive logit
scores are more able than average, and students
with negative logit scores are less able than average.
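To make the logit scale concrete, here is a minimal sketch of the dichotomous Rasch model, in which the probability of a correct answer depends only on the difference between student ability and item difficulty, both in logits. The function name and the example values are illustrative assumptions, not output from WINSTEPS.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only (not estimates from this study).
for ability in (-3.0, 0.0, 3.0):
    p = rasch_probability(ability, difficulty=0.0)
    print(f"ability {ability:+.1f}, difficulty 0.0 -> P(correct) = {p:.2f}")
```

When ability equals difficulty the probability is exactly 0.5, and a student 3 logits above an item answers it correctly about 95% of the time.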
Scalogram
[Scalogram: matrix of scored responses (1 = correct, 0 = incorrect), with students in rows sorted by total score and items in columns sorted from easiest to hardest. The summary rows and columns of the matrix are listed below.]

Item summary (columns of the scalogram)
Item   Prop. correct   Prop. incorrect   Odds    Log odds   Negative log odds
29     0.98            0.02              56.50   4.03       -4.03
21     0.97            0.03              37.33   3.62       -3.62
14     0.96            0.04              22.00   3.09       -3.09
18     0.96            0.04              22.00   3.09       -3.09
26     0.96            0.04              22.00   3.09       -3.09
9      0.94            0.06              15.43   2.74       -2.74
17     0.94            0.06              15.43   2.74       -2.74
11     0.92            0.08              11.78   2.47       -2.47
28     0.92            0.08              11.78   2.47       -2.47
15     0.91            0.09              10.50   2.35       -2.35
23     0.91            0.09              10.50   2.35       -2.35
8      0.90            0.10              9.45    2.25       -2.25
32     0.90            0.10              9.45    2.25       -2.25
7      0.90            0.10              8.58    2.15       -2.15
34     0.90            0.10              8.58    2.15       -2.15
3      0.89            0.11              7.85    2.06       -2.06
22     0.87            0.13              6.67    1.90       -1.90
10     0.86            0.14              6.19    1.82       -1.82
16     0.84            0.16              5.39    1.68       -1.68
25     0.84            0.16              5.39    1.68       -1.68
20     0.83            0.17              5.05    1.62       -1.62
35     0.83            0.17              5.05    1.62       -1.62
30     0.80            0.20              4.00    1.39       -1.39
12     0.78            0.22              3.60    1.28       -1.28
19     0.78            0.22              3.60    1.28       -1.28
1      0.70            0.30              2.29    0.83       -0.83
31     0.67            0.33              2.03    0.71       -0.71
2      0.63            0.37              1.74    0.55       -0.55

Student summary (rows of the scalogram; 47 students listed)
Total correct   Students   Prop. correct   Prop. incorrect   Odds        Log odds
35              5          1.00            0.00              undefined   undefined
34              6          0.97            0.03              34.00       3.53
33              4          0.94            0.06              16.50       2.80
32              5          0.91            0.09              10.67       2.37
31              7          0.89            0.11              7.75        2.05
30              5          0.86            0.14              6.00        1.79
29              2          0.83            0.17              4.83        1.58
28              3          0.80            0.20              4.00        1.39
27              3          0.77            0.23              3.38        1.22
26              3          0.74            0.26              2.89        1.06
25              2          0.71            0.29              2.50        0.92
24              1          0.69            0.31              2.18        0.78
23              1          0.66            0.34              1.92        0.65
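The odds and log odds figures in the scalogram summary can be reproduced from raw counts. The sketch below is an assumption about how the spreadsheet columns were computed (odds as correct divided by incorrect, then the natural log); the function name is hypothetical. It also shows why a perfect score has no finite logit, which is where the spreadsheet's division-by-zero entries come from.

```python
import math


def odds_and_logit(n_correct, n_total):
    """Return (proportion correct, odds, log odds) for a raw score.

    Odds and log odds are None where they are undefined.
    """
    proportion = n_correct / n_total
    n_incorrect = n_total - n_correct
    if n_incorrect == 0:
        return proportion, None, None  # odds would divide by zero
    odds = n_correct / n_incorrect
    logit = math.log(odds) if odds > 0 else None  # log(0) is undefined
    return proportion, odds, logit


# Student scores out of 35, as in the Total column of the scalogram.
for score in (35, 34, 23):
    p, odds, logit = odds_and_logit(score, 35)
    print(score, round(p, 2), odds, logit)
```

A score of 34 out of 35 gives odds of 34.00 and a log odds of about 3.53, and 23 out of 35 gives about 1.92 and 0.65, matching the summary values.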
What Rasch modeling can teach
teachers about their tests
• One useful thing about IRT is that the item
difficulty estimates are also computed on the
logit scale. Thus, we can easily compare
item difficulty with student ability, as in the
chart on the next slide.
TABLE 12.2 Biology Exam Rasch
ZOU591WS.TXT Oct 3 21:44 2012
INPUT: 115 Persons 35 Items MEASURED: 115 Persons 35 Items 2 CATS
1.0.0
--------------------------------------------------------------------------------
          Persons MAP OF Items
             <more>|<rare>
   5         .###  +
                   |
                   |
                   |
                  T|
                   |
      .##########  |
   4               +
                   |
                   |
                  S|
                   |
           ######  |
                   |
   3               +
        #########  |
                  M|
                   |
       ##########  |T
                   |
            #####  |  I0002
   2        .####  +  I0031
                  S|  I0001
             ####  |
               .#  |
                   |
               .#  |S I0012
                #  |  I0030
   1            . T+  I0020
                .  |  I0016
                   |  I0010
                   |  I0022
                   |  I0003
                   |  I0008
                   |  I0015
   0               +M I0011
                   |
                   |  I0009
                   |
                   |  I0014
                   |
                   |
  -1               +
                   |  I0021
                   |S
                   |
                   |  I0004
                   |
                   |
  -2               +
                   |
                   |  I0006
                   |T
                   |
                   |
                   |
  -3               +  I0005
             <less>|<frequ>
EACH '#' IS 2.
Items plotted in further columns of the map, listed top to bottom (their exact rows are not recoverable here): I0019, I0035, I0025, I0007, I0032, I0023, I0028, I0034, I0017, I0018, I0026, I0013, I0027, I0024, I0033, I0029
What does this mean?
• WINSTEPS centers the logit scale so that the mean
item difficulty is 0. This map shows visually that
most of the questions are far easier than these
students are able.
• Students like this because it means that they get a
good grade on the test. But this is not a good
situation from a measurement perspective. A test
with a pattern like the one above cannot reliably
distinguish between the ability levels of most of
the students.
Test Information Function
What does this mean?
• This graph illustrates that this test will give you a
lot of information about students with an ability
score between about -2 and +2, with the amount
of information dropping off sharply outside
that range.
• In areas of the graph where information is high,
student scores are measured with little error.
In areas where information is low, student scores
are estimated with a lot of error.
What does this tell us about this test?
• This lack of matching between student ability and
item difficulty leads to low score reliability. In this
case, the reliability for the estimates of student
ability is just .34. You want it to be much closer to
.90 or even higher.
• For example, 18 students out of the 115 got a score of
32/35, or 91%. In reality, these estimates are pretty
rough, because we don’t have any questions that are
at that difficulty level. Those students are probably
not identical in ability or knowledge, but the test is
designed in a way that keeps us from knowing
their ability with any kind of precision.
Limitations
• Evidence of multi-dimensionality, violating
some key assumptions of Cronbach’s alpha and
Rasch Measurement
• Only looking at one limited case
▫ Difficulty level gap might be non-representative
▫ Rasch modeling might be less appropriate in other
schools with different testing procedures
▫ Only useful to the extent that it is plausible for
teachers to access and understand the analysis
Conclusions
• This case is one example of where Rasch modeling
has utility in understanding a test, and the students
who took the test.
• Rasch software presents visual interpretation tools
that may be easier for teachers to interpret than
traditional reliability concepts.
• In instances where teachers teach multiple sections
of one subject, or where assessments are common
across teachers, Rasch modeling can be used to
produce stable estimates in secondary settings.