Effective Use of Benchmark Test and Item
Statistics and Considerations When Setting
Performance Levels
California Educational Research Association
Anaheim, California
December 1, 2011
Review of Benchmark Test and Item Statistics

Objective
Extend the knowledge of the assessment team to:
1. Better understand test reliability and the influences of test composition and test length.
2. Better understand item statistics and use them to identify items in need of revision.
Reliability is a measure of the consistency of the assessment.

Types of reliability coefficients (typically ranging from 0 to 1):
• Test-retest
• Alternate forms
• Split-half
• Internal consistency (Cronbach’s Alpha / KR-20)
Reliability Is Influenced by Test Length
• The Spearman-Brown formula estimates the reliabilities of shortened tests.
  – Remember: The reliability of a score indicates how closely an observed score can be expected to be reproduced if the measurement is repeated.
NOTE: See the handout from the STAR Technical Manual for exact cluster reliabilities.
Reliability Is Influenced by Test Length
• Example: given a 75-item test with r = .95
  – 40-item test has r = .91
  – 35-item test has r = .90
  – 30-item test has r = .88
  – 25-item test has r = .86
  – 20-item test has r = .84
  – 10-item test has r = .72
  – 5-item test has r = .56
NOTE: See the handout from the STAR Technical Manual for exact cluster reliabilities.
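These estimates come from the Spearman-Brown prophecy formula, r_k = k·r / (1 + (k − 1)·r), where k is the ratio of the new test length to the old. A minimal Python sketch (not part of the original presentation) that reproduces the table above:

```python
def spearman_brown(r: float, old_len: int, new_len: int) -> float:
    """Predict the reliability of a test shortened (or lengthened)
    from old_len items to new_len items, given current reliability r."""
    k = new_len / old_len  # ratio of new length to old length
    return k * r / (1 + (k - 1) * r)

# Reproduce the slide's example: a 75-item test with r = .95
for n in (40, 35, 30, 25, 20, 10, 5):
    print(f"{n:2d}-item test: r = {spearman_brown(0.95, 75, n):.2f}")
```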
Reliability Statistics for CSTs (see handout)
• Note that CST reliabilities range from .90 to .95
• Note that cluster reliabilities are consistent with those predicted by the Spearman-Brown formula
Validity is the degree to which the test measures what was intended.

Types of test validity:
A. Predictive or criterion (How well does it correlate with other measures?)
B. Content
  1. How well does the test sample from the content domain?
  2. How well aligned are the items with regard to format and rigor?
Validity Is Influenced by Reliability
• Impact of lower reliability on validity
• Remember: Validity is the agreement between a test score and the quality it is believed to measure
• The upper limit on the validity coefficient is the square root of the reliability coefficient
  – 75-item test: square root of .95 = .97
Validity Is Influenced by Reliability
• The upper limit on the validity coefficient is the square root of the reliability coefficient
  – 75-item test: square root of .95 = .97
  – 30-item test: square root of .88 = .94
  – 25-item test: square root of .86 = .93
  – 10-item test: square root of .72 = .85
  – 5-item test: square root of .56 = .75
Coefficient of Determination (R squared)
• Squaring the validity coefficient gives the “proportion of variance in the achievement construct accounted for by the test”
  – 75-item test: .97 squared = .94
  – 30-item test: .94 squared = .88
  – 25-item test: .93 squared = .86
  – 10-item test: .85 squared = .72
  – 5-item test: .75 squared = .56
Using Item Statistics (p-values and point-biserials)
• Apply item analysis statistics from your assessment reporting system (e.g., DataDirector, Edusoft, OARS, EADMS)
• P-values (percent of the group getting the item correct)
  – Most should be between 30 and 80
  – A very high value indicates the item may be too easy; a very low value may indicate a problem item
• Point-biserials (correlation of the item with the total score)
  – Most should be .30 or higher
  – A very low or negative value generally indicates a problem with the item
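Both statistics are straightforward to compute from a scored response matrix. A minimal sketch (function names and toy data are mine; the flagging thresholds follow the rules of thumb above):

```python
import numpy as np

def item_stats(responses: np.ndarray):
    """responses: a students x items matrix of 0/1 item scores.
    Returns each item's p-value and its point-biserial correlation
    with the total test score (uncorrected: the item itself is
    included in the total, which inflates the correlation slightly)."""
    totals = responses.sum(axis=1)
    p_values = responses.mean(axis=0)  # proportion correct per item
    point_biserials = np.array([
        np.corrcoef(responses[:, j], totals)[0, 1]
        for j in range(responses.shape[1])
    ])
    return p_values, point_biserials

# Toy data: 500 students, 20 items
rng = np.random.default_rng(0)
scores = (rng.random((500, 20)) < 0.65).astype(int)

# Flag items outside the rules of thumb from the slide
p, pbis = item_stats(scores)
flagged = np.where((p < 0.30) | (p > 0.80) | (pbis < 0.30))[0]
```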
Item Statistics for CSTs (see handout)
• Note that the range of p-values is consistent with most being between .30 and .80
• Note that median point-biserials are generally in the .40s
Algebra 1 – Question 7 (PL: Basic)

                 District                Pilot Group
Choice     # of Students   Percent   # of Students   Percent
A                   1691     36.81             220     37.48
B                   1563     34.02             187     31.86
C                    669     14.56              85     14.48
D                    629     13.69              89     15.16
E                      4      0.09               2      0.34
BLANK                 38      0.83               4      0.68
Total               4594    100                587    100
Point-biserial: 0.31 (District), 0.38 (Pilot Group)
Algebra 1 – Question 19 (PL: Advanced Proficient)

                 District                Pilot Group
Choice     # of Students   Percent   # of Students   Percent
A                    971     21.18             108     18.40
B                   1028     22.42             125     21.29
C                   1193     26.02             145     24.70
D                   1148     25.04             155     26.41
E                      7      0.15               0      0.00
BLANK                238      5.19              54      9.20
Total               4585    100                587    100
Point-biserial: 0.23 (District), 0.19 (Pilot Group)
Algebra 2 – Question 21 (PL: Beyond Advanced Proficient)

                 District                Pilot Group
Choice     # of Students   Percent   # of Students   Percent
A                    286     23.50              45     24.32
B                    248     20.38              37     20.00
C                    354     29.09              63     34.05
D                    260     21.36              35     18.92
E                      0      0.00               0      0.00
BLANK                 69      5.67               5      2.70
Total               1217    100                185    100
Point-biserial: 0.19 (District), 0.24 (Pilot Group)
Geometry – Question 12 (PL: Proficient)

                 District                Pilot Group
Choice     # of Students   Percent   # of Students   Percent
A                    247     13.46              42     15.91
B                    603     32.86              90     34.09
C                    703     38.31              99     37.50
D                    273     14.88              31     11.74
E                      0      0.00               0      0.00
BLANK                  9      0.49               2      0.76
Total               1835    100                264    100
Point-biserial: 0.10 (District), 0.10 (Pilot Group)
Maximizing Predictive Accuracy of District Benchmarks

Objective
Extend the knowledge of the assessment team to:
1. Better understand how performance level setting is key to predictive validity.
2. Better understand how to create performance level bands based on equipercentile equating.
Comparing District Benchmarks to CST Results
Common methods for setting cutoffs on district benchmarks:
• Use the default settings on the assessment platform (e.g., 20%, 40%, 60%, 80%)
• Ask curriculum experts for their opinion of where the cutoffs should be set
• Determine the percent correct corresponding to the performance levels on the CSTs and apply it to the benchmarks
Comparing District Benchmarks to CST Results
There is a better way!

Comparing District Benchmarks to CST Results
“Two scores, one on form X and the other on form Y, may be considered equivalent if their corresponding percentile ranks in any given group are equal.” (Educational Measurement, Second Edition, p. 563)
Comparing District Benchmarks to CST Results
• Equipercentile method of equating at the performance level cut-points
• Establishes cutoffs for the benchmarks at the same local percentile ranks as the cutoffs for the CSTs
• By applying the same local percentile cutoffs to each trimester benchmark, comparisons across trimesters within a grade level become more defensible
Equipercentile Equating Method
Step 1 – Identify the CST scaled score cut-points

Equipercentile Equating Method
Step 2 – Establish local percentiles at the CST performance level cutoffs (from the scaled score frequency distribution)

Equipercentile Equating Method
Step 3 – Locate the benchmark raw scores corresponding to the CST cutoff percentiles (from the benchmark raw score frequency distribution)
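Steps 2 and 3 reduce to two lookups: find the local percentile rank of each CST cut score, then invert the benchmark raw-score distribution at that same rank. A minimal sketch of that mapping (function and variable names are illustrative, not from the presentation):

```python
import numpy as np

def equipercentile_cutoffs(cst_scores, benchmark_scores, cst_cut_points):
    """Map CST scaled-score cut-points to benchmark raw-score cutoffs
    that sit at the same local percentile ranks (Steps 2 and 3)."""
    cst = np.sort(np.asarray(cst_scores))
    cutoffs = []
    for cut in cst_cut_points:
        # Step 2: local percentile rank of this CST cut score
        pct = np.searchsorted(cst, cut) / len(cst) * 100
        # Step 3: benchmark raw score at that same local percentile
        cutoffs.append(np.percentile(benchmark_scores, pct))
    return cutoffs
```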
Equipercentile Equating Method
Step 4 – Validate Classification Accuracy – Old Cutoffs

2nd Semester Biology                          2006 CST
Old Cutoff                 FBB    BB  Basic  Prof.  Adv.  Total
0-17    FBB                 57    72     25      1     0    155
18-34   BB                 118   297    511     60     4    990
35-48   Basic               19    51    427    401    45    943
49-62   Proficient           1     5     27    141   207    381
63-70   Advanced             0     0      0      0    20     20
Total                      195   425    990    603   276   2489

Correct Classification: Proficient & Advanced on CST = 42%
Correct Classification: Each Level on CST = 38%
Equipercentile Equating Method
Step 4 – Validate Classification Accuracy – New Cutoffs

2nd Semester Biology                          2006 CST
New Cutoff                 FBB    BB  Basic  Prof.  Adv.  Total
0-19    FBB                 89   107     53      4     0    253
20-26   BB                  59   142    148     12     0    361
27-40   Basic               39   161    596    176     9    981
41-51   Proficient           8    12    181    354    82    637
52-70   Advanced             0     3     12     57   185    257
Total                      195   425    990    603   276   2489

Correct Classification: Proficient & Advanced on CST = 77%
Correct Classification: Each Level on CST = 55%
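Both accuracy figures fall directly out of the crosstab: exact agreement is the diagonal sum over the total, and the Proficient & Advanced rate is the share of students at Proficient or above on the CST whom the benchmark also places at Proficient or above. A sketch using the new-cutoff table above (variable names are mine):

```python
import numpy as np

# Crosstab from the new-cutoff table above:
# rows = benchmark level (FBB..Advanced), columns = 2006 CST level
xtab = np.array([
    [89, 107,  53,   4,   0],
    [59, 142, 148,  12,   0],
    [39, 161, 596, 176,   9],
    [ 8,  12, 181, 354,  82],
    [ 0,   3,  12,  57, 185],
])

# "Each Level": students placed at the same level by both tests
each_level = np.trace(xtab) / xtab.sum()            # ~0.55

# "Proficient & Advanced on CST": of students at Proficient or above
# on the CST (last two columns), the share the benchmark also places
# at Proficient or above (last two rows)
prof_adv = xtab[3:, 3:].sum() / xtab[:, 3:].sum()   # ~0.77

print(f"Each level: {each_level:.0%}; Prof. & Adv.: {prof_adv:.0%}")
```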
Example: Classification Accuracy

Biology                          Old    New
2nd Semester: Prof. & Adv.       42%    77%
2nd Semester: Each Level         38%    55%
1st Semester: Prof. & Adv.       30%    77%
1st Semester: Each Level         31%    50%
Example: Classification Accuracy

Biology                          Old    New
1st Quarter: Prof. & Adv.        53%    71%
1st Quarter: Each Level          41%    46%
Example: Classification Accuracy

Chemistry                        Old    New
2nd Semester: Prof. & Adv.       63%    79%
2nd Semester: Each Level         47%    52%
1st Semester: Prof. & Adv.       74%    74%
1st Semester: Each Level         49%    50%
1st Quarter: Prof. & Adv.        83%    76%
1st Quarter: Each Level          48%    47%
Example: Classification Accuracy

Earth Science                    Old    New
2nd Semester: Prof. & Adv.       48%    68%
2nd Semester: Each Level         43%    52%
1st Semester: Prof. & Adv.       33%    66%
1st Semester: Each Level         38%    47%
1st Quarter: Prof. & Adv.        42%    56%
1st Quarter: Each Level          34%    41%
Example: Classification Accuracy

Physics                          Old    New
2nd Semester: Prof. & Adv.       57%    87%
2nd Semester: Each Level         37%    57%
1st Semester: Prof. & Adv.       60%    88%
1st Semester: Each Level         42%    50%
1st Quarter: Prof. & Adv.        65%    87%
1st Quarter: Each Level          47%    45%
Things to Consider Prior to Establishing the Benchmark Cutoffs
• Will there be changes to the benchmarks after the CST percentile cutoffs are established?
  – If NO, raw score benchmark cutoffs can be established by linking the CST to the same year’s benchmark administration (e.g., spring 2011 CST matched to 2010-11 benchmark raw scores)
  – If YES, wait until the new benchmark is administered and then establish the raw score cutoffs on the benchmark
• How many cases are available for establishing the CST percentiles? (Too few cases could lead to unstable percentile distributions.)
Things to Consider Prior to Establishing the Benchmark Cutoffs (Continued)
• How many items comprise the benchmarks to be equated? (As a test gets shorter, it becomes more difficult to match the percentile cut-points established on the CSTs.)
Summary: Equipercentile Equating Method
• The method generally establishes a closer correspondence between the CSTs and the benchmarks
• When benchmarks are already tightly aligned with the CSTs (e.g., elementary math), the approach may be less advantageous
• Comparisons between benchmark and CST performance can be made more confidently
• Comparisons between benchmarks within the school year can be made more confidently
Coming Soon from Illuminate Education, Inc.!
Reports using the equipercentile methodology are being programmed to:
(1) establish benchmark cutoffs for performance bands
(2) create validation tables showing improved classification accuracy based on the method
Contact:
Tom Barrett, Ph.D.
President, Barrett Enterprises, LLC
Director, Owl Corps, School Wise Press
2173 Hackamore Place
Riverside, CA 92506
951-905-5367 (office)
951-237-9452 (cell)