Computer Adaptive Test

The Impact of Selection of Student
Achievement Measurement Instrument
on Teacher Value-added Measures
James L. Woodworth, CREDO Hoover Institute, Stanford
Wen-Juo Lo, University of Arkansas
Joshua B. McGee, Laura and John Arnold Foundation
Nathan C. Jensen, Northwest Evaluation Association
Presentation Outline
1. Purpose
2. Statistical Noise
a. Why it matters
b. Sources
3. Data
4. Methods
5. Results
Purpose
The purpose of this paper is to show a lay
(non-statistician) audience the extent to which
the psychometric properties of student test
instruments affect teacher value-added
measures.
Question
What is the impact of statistical noise introduced
by different test characteristics on the stability
and accuracy of value-added models?
Why it matters?
[Figure: proficiency bands (Below Basic, Basic, Proficient, Advanced) for 5th and 6th grade]
Primary Sources
of Statistical Noise
1. Test Design
2. Vertical Alignment
3. Student Sample Size
Test Design
Proficiency Tests
• Focused around proficiency
point
• Designed to differentiate
between proficient and
not proficient
• Larger variance in
Conditional Standard Errors
(CSE)
Growth Tests
• Questions measure across
entire ability spectrum
• Designed to differentiate
between all points on the
distribution
• Smaller variance in CSE
Test Design
Paper and Pencil Tests
• Limit item pool to control
length
• Focused around proficiency
point
• Large variance in CSE
Computer Adaptive Test
• Larger item pool for
question selection
• Focused around student
ability point
• Smaller variance in CSE
Test Design
[Figure: CSE Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009. Horizontal axis: Scale Score, 188 to 922.]
CSE Range: 24 - 74
Weighted average CSE = 38.96
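A minimal sketch of how a weighted average CSE can be computed, assuming the weights are the number of students at each scale score (the slide does not state the weighting). The function name and the three-point table are hypothetical and are not the TAKS data.

```python
def weighted_average_cse(cse_by_score, count_by_score):
    """Frequency-weighted mean of conditional standard errors.

    Assumption: each scale score's CSE is weighted by the number of
    students who earned that score.
    """
    total = sum(count_by_score.values())
    weighted_sum = sum(cse_by_score[s] * count_by_score[s] for s in count_by_score)
    return weighted_sum / total


# Hypothetical illustration only: large CSEs in the tails, small near the middle.
cse = {400: 60.0, 600: 30.0, 800: 45.0}
counts = {400: 10, 600: 70, 800: 20}
avg = weighted_average_cse(cse, counts)
print(avg)
```

Because most students sit where the CSE is smallest, the weighted average lands much closer to the bottom of the CSE range than to the top.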
Vertical Alignment
• Year-to-year alignment can affect VAM results
– Units must be equal across test sessions
• Spring-to-spring VAMs are most affected
• Fall-to-spring VAMs using the same test avoid much of
the problem
• Item alignment on computer adaptive tests can
also affect VAM results
Student Sample Size
• Central Limit Theorem
– Larger student n provides a more stable estimate of
teacher VAM.
– Typical single year student n’s are 25, 50, and 100
for elementary and middle school teachers.
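The sample-size point can be illustrated with a short sketch: if each student's measured growth carries independent noise with the same standard deviation, the spread of the class average shrinks with the square root of n. The per-student value of 10 is hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def sd_of_class_mean(per_student_sd, n):
    """Standard deviation of the mean of n independent, equally noisy values."""
    return per_student_sd / math.sqrt(n)


per_student_sd = 10.0  # hypothetical noise in one student's measured growth
for n in (25, 50, 100):
    print(n, round(sd_of_class_mean(per_student_sd, n), 2))
# Quadrupling n from 25 to 100 halves the spread (2.0 -> 1.0).
```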
Data Sets
TAKS – Texas Assessment of Knowledge and Skills:
Grade 5 Reading, 2009 Population Statistics
– Proficiency test
– Vertically aligned scale scores
– Average yearly gain
• 24 vertical scale points at “Met Expectations”
• 34 vertical scale points at “Commended”
– Standard Errors – Conditional Standard Errors reported by
TEA for each vertical scale score
• CSE Range: 24 - 74
• Weighted average CSE = 38.96
– Highly skewed distribution
– High variance
Data Sets
TAKS – Texas Assessment of Knowledge and
Skills: Grade 5 Reading
[Figure: Frequency Distribution, TAKS Reading Grade 5. Horizontal axis: scores from 188 to 922; vertical axis: frequency, 0 to 30,000.]
N: 323,507
μ: 701.49
σ²: 10,048.30
σ: 100.24
Data Sets
MAP – Measures of Academic Progress
– Growth measure
– Computer Adaptive Test
– Single scale
– Average yearly gain
• 5.06 RIT points
– Standard Errors – average standard errors range
2.5 - 3.5 RIT
– Slightly skewed distribution
– Small variance
Data Sets
MAP – Measures of Academic Progress
[Figure: Frequency Distribution: MAP Reading Grade 5. Horizontal axis: scores from 165 to 238; vertical axis: frequency, 0 to 100,000.]
N: 2,663,382
μ: 208.35
σ²: 161.82
σ: 12.72
Simulated Data
As it is impossible to isolate true scores and error
with real data, we created simulated data
points.
– True scores are known for all data points
– Every data point was given the same growth
• All iterations have the same value-added
• Any deviation from expected is a function of
measurement error only
Simulated Data
We simulated 10,000 z-scores ~ N(0,1).
From this we selected nested, random samples of n=100, n=50,
n=25.
Statistical Summary, z-Score Samples by n
Statistic         n=100    n=50     n=25
Mean              -.13     -.09      .01
Std. Deviation     .97      .97     1.00
Skewness          -.12      .18      .10
Minimum          -2.34    -1.85    -1.77
Maximum           2.09     2.09     2.09
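One way to draw nested samples like these is sketched below, assuming NumPy. The seed, the variable names, and the shuffle-then-slice approach are illustrative choices, not necessarily the procedure used for the presentation, so the summary statistics will not reproduce the table above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

# Pool of 10,000 standard-normal z-scores.
z_pool = rng.standard_normal(10_000)

# Shuffle once, then take prefixes so that n=25 is contained in n=50,
# which is contained in n=100 (nested samples).
order = rng.permutation(z_pool.size)
samples = {n: z_pool[order[:n]] for n in (100, 50, 25)}


def skewness(x):
    """Population skewness: third central moment over sd cubed."""
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5


for n, z in samples.items():
    print(n, round(z.mean(), 2), round(z.std(ddof=1), 2),
          round(skewness(z), 2), round(z.min(), 2), round(z.max(), 2))
```

Nesting explains a pattern visible in the summary table: the maximum is identical across all three samples, and the minimum can only move toward zero as n shrinks.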
Data Generation
Pre-scores = P1 = z-score • σ + x̄
Post-scores = P2 = P1 + controlled growth
Controlled Growth Values:
TAKS = 24 vertical scale points (34 at “Commended”)
MAP = 5.06 RIT points
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
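The data-generation steps above can be sketched as follows, assuming NumPy. Two simplifications are assumptions of this sketch, not of the presentation: the additive term in the pre-score formula is read as the test's mean scale score, and a single constant CSE stands in for the score-specific CSEs reported by TEA and NWEA. The function name is hypothetical; the TAKS values are the ones quoted in the slides.

```python
import numpy as np


def simulate_growth(z, sigma, mean, growth, cse, rng):
    """Simulated growth for one administration of pre- and post-tests.

    z      : array of true-score z-scores
    sigma  : standard deviation of the test's scale scores
    mean   : mean scale score (assumed meaning of the additive term)
    growth : controlled growth added to every student
    cse    : one constant standard error (simplifying assumption)
    """
    p1 = z * sigma + mean               # true pre-scores
    p2 = p1 + growth                    # true post-scores
    random1 = rng.standard_normal(z.shape)
    random2 = rng.standard_normal(z.shape)
    return (p2 + random2 * cse) - (p1 + random1 * cse)


rng = np.random.default_rng(seed=0)     # arbitrary seed
z = rng.standard_normal(100)            # stand-in for the n=100 sample
growth_sim = simulate_growth(z, sigma=100.24, mean=701.49,
                             growth=24.0, cse=38.96, rng=rng)
print(round(growth_sim.mean(), 2), round(growth_sim.std(ddof=1), 2))
```

With the error switched off (cse=0) every student shows exactly the controlled growth, which is the property that lets any deviation be attributed to measurement error.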
Monte Carlo Simulation
We ran 1,000 iterations of each simulation, which is equivalent
to the same students taking the test 1,000 times with the same
true scores but a different draw of measurement error each time.
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
We aggregated values by subgroup to determine average
performance for each iteration.
False Negative: Simulated Growth < 0.5 • Controlled Growth
False Positive: Simulated Growth > 1.5 • Controlled Growth
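A rough sketch of the Monte Carlo procedure, assuming NumPy and a single constant CSE in place of the score-specific CSE tables (which are not reproduced in the slides). Under that simplification the true scores cancel out of the growth calculation, so only the two error terms need to be drawn. The function name, seed, and use of the TAKS weighted average CSE are illustrative assumptions, and the rates it prints are not expected to match the reported results exactly.

```python
import numpy as np


def monte_carlo(n_students, growth, cse, iterations=1_000, seed=0):
    """Percent of iterations flagged false negative, false positive, correct."""
    rng = np.random.default_rng(seed)
    # One row per iteration, one column per student.
    random1 = rng.standard_normal((iterations, n_students))
    random2 = rng.standard_normal((iterations, n_students))
    # (P2 + Random2*CSE) - (P1 + Random1*CSE) with P2 - P1 = growth.
    simulated = growth + (random2 - random1) * cse
    # Aggregate to the class (subgroup) average for each iteration.
    class_mean = simulated.mean(axis=1)
    false_neg = float(np.mean(class_mean < 0.5 * growth) * 100)
    false_pos = float(np.mean(class_mean > 1.5 * growth) * 100)
    return false_neg, false_pos, 100.0 - false_neg - false_pos


for n in (100, 50, 25):
    fn, fp, ok = monte_carlo(n, growth=24.0, cse=38.96)
    print(f"n={n}: false neg {fn:.1f}%, false pos {fp:.1f}%, correct {ok:.1f}%")
```

Replacing the constant `cse` with a lookup keyed on each student's pre- and post-score would bring the sketch closer to the score-specific errors described above.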
Results
Monte Carlo Results n=100
                                                % False    % False    % Total
VAM                                             Negative   Positive   Correct ID
TAKS Actual Distribution                          1.7        2.5        95.8
TAKS Normal Distribution at “Meets” Level         0.9        1.8        97.3
TAKS Normal Distribution Avg SE                   1.2        1.8        97.0
TAKS Normal Distribution at “Commended” Level     0.8        0.2        99.0
TAKS Normal Grade Transition                      1.4        2.1        96.5
MAP Normal                                        0.0        0.0       100.0
MAP Max CSE                                       0.0        0.0       100.0
Results
Monte Carlo Results n=50
                                                % False    % False    % Total
VAM                                             Negative   Positive   Correct ID
TAKS Actual Distribution                          7.4        9.6        83.0
TAKS Normal Distribution at “Meets” Level         6.6        8.4        85.0
TAKS Normal Distribution Avg SE                   5.7        7.4        86.9
TAKS Normal Distribution at “Commended” Level     4.4        1.7        93.9
TAKS Normal Grade Transition                      6.5        8.1        85.4
MAP Normal                                        0.0        0.0       100.0
MAP Max CSE                                       0.7        0.6        98.7
Results
Monte Carlo Results n=25
                                                % False    % False    % Total
VAM                                             Negative   Positive   Correct ID
TAKS Actual Distribution                         16.1       18.4        65.5
TAKS Normal Distribution at “Meets” Level        16.8       18.0        65.2
TAKS Normal Distribution Avg SE                  14.5       16.0        69.5
TAKS Normal Distribution at “Commended” Level    10.2        7.7        82.1
TAKS Normal Grade Transition                     18.6       18.2        63.2
MAP Normal                                        0.5        0.5        99.0
MAP Max CSE                                       3.0        4.2        92.8
Results
Student Sample Size
Descriptive Statistics
(Avg = Average Simulated Growth)
                                          Controlled   n=100            n=50             n=25
VAM                                       Growth       Avg      SD      Avg      SD      Avg      SD
TAKS Actual Distribution                  24           24.29    6.02    24.26    8.78    24.18    12.28
TAKS Normal Distribution at “Meets”       24           24.08    5.45    24.45    8.37    24.14    12.39
TAKS Normal Distribution Avg SE           24           24.19    5.45    24.61    8.03    24.59    11.47
TAKS Normal Distribution at “Commended”   34           33.85    5.60    34.15    8.12    34.92    11.87
TAKS Normal Grade Transition              24           24.08    5.59    24.24    8.59    24.15    12.85
MAP Normal                                5.06          5.07    0.49     5.12    0.72     5.12     1.03
MAP Max CSE                               5.06          5.05    0.71     5.05    0.99     5.08     1.37

                                          Percent misidentified at
Test                                      n=100    n=50     n=25
TAKS Normal Distribution at “Meets”        2.7     15.0     34.8
MAP Normal                                 0.0      0.0      1.0
Conclusions
The Growth/Error ratio is the critical variable in
VAM stability.
The student n needed to achieve a stable VAM is
sensitive to the Growth/Error ratio.
Stable VAMs are possible even with typical
classroom n’s; however, careful attention must be
paid to the suitability of the student assessment
instrument.
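A back-of-the-envelope sketch of why the Growth/Error ratio matters, using only the standard library. It assumes normal errors, a constant CSE, and the ±50% misidentification thresholds from the Monte Carlo slide; the function names are hypothetical, and the 3.5 RIT value is taken from the upper end of the MAP standard error range quoted earlier. This is an approximation for intuition, not the method used to produce the reported results.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_pct_misidentified(growth, cse, n):
    """Approximate percent of class averages outside 0.5x to 1.5x growth.

    Growth is a difference of two noisy scores, so per-student noise is
    sqrt(2) * CSE; averaging over n students divides that by sqrt(n).
    """
    sd_class_mean = math.sqrt(2.0) * cse / math.sqrt(n)
    z = 0.5 * growth / sd_class_mean
    return 2.0 * normal_cdf(-z) * 100.0


cases = (("TAKS, weighted average CSE", 24.0, 38.96),
         ("MAP, CSE of 3.5 RIT", 5.06, 3.5))
for label, growth, cse in cases:
    print(f"{label}: growth/error ratio = {growth / cse:.2f}")
    for n in (100, 50, 25):
        print(f"  n={n}: about {approx_pct_misidentified(growth, cse, n):.1f}% misidentified")
```

Under these assumptions the TAKS case at n=100 comes out near 3 percent, in line with the Avg SE row of the n=100 results, while the MAP case, with a ratio more than twice as large, stays close to zero.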
Limitations
No Differentiation between Student Effects,
Teacher Effects, or School Effects
No Environmental Effects
No Interaction Terms
These are all areas for additional research.