The Impact of Selection of Student Achievement Measurement Instrument on Teacher Value-added Measures

James L. Woodworth, CREDO Hoover Institute, Stanford
Wen-Juo Lo, University of Arkansas
Joshua B. McGee, Laura and John Arnold Foundation
Nathan C. Jensen, Northwest Evaluation Association

Presentation Outline
1. Purpose
2. Statistical Noise
   a. Why it matters
   b. Sources
3. Data
4. Methods
5. Results

Purpose
The purpose of this paper is to show an audience without statistical training the extent to which the psychometric properties of student test instruments affect teacher value-added measures.

Question
What is the impact of statistical noise introduced by different test characteristics on the stability and accuracy of value-added models?

Why it matters?
[Figure: proficiency levels (Below Basic, Basic, Proficient, Advanced) shown for 5th and 6th grade]

Primary Sources of Statistical Noise
1. Test Design
2. Vertical Alignment
3. Student Sample Size

Test Design
Proficiency Tests
• Focused around proficiency point
• Designed to differentiate between proficient and not proficient
• Larger variance in Conditional Standard Errors (CSE)
Growth Tests
• Questions measure across entire ability spectrum
• Designed to differentiate between all points on the distribution
• Smaller variance in CSE

Test Design
Paper and Pencil Tests
• Limit item pool to control length
• Focused around proficiency point
• Large variance in CSE
Computer Adaptive Test
• Larger item pool for question selection
• Focused around student ability point
• Smaller variance in CSE

Test Design
[Figure: Heteroskedasticity Due to Item Focusing: TAKS Reading Grade 5, 2009. Scale Score and CSE plotted across scale scores 188 to 922; both vertical axes run 0 to 1000.]
CSE Range: 24 - 74
Weighted average CSE = 38.96

Vertical Alignment
• Year-to-year alignment can impact the results of VAM
  – Units must be equal across test sessions
• Spring-Spring VAM are most affected
• Fall-Spring VAM using the same test avoid much of the problem
• Item alignment on computer adaptive tests can impact the results of VAM

Student Sample Size
• Central Limit Theorem
  – Larger student n provides a more stable estimate of teacher VAM.
  – Typical single-year student n's are 25, 50, and 100 for elementary and middle school teachers.
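The link between student n and stability can be made concrete with a small calculation. The sketch below is an illustration, not the authors' method: it assumes every student has the same CSE on both tests and that the two measurement errors are independent, so the error in one gain score has standard deviation sqrt(2) • CSE and the error in a class average shrinks by sqrt(n). The function name `se_of_mean_growth` is hypothetical; the CSE value is the weighted average reported for TAKS above.

```python
import math


def se_of_mean_growth(cse: float, n: int) -> float:
    """Standard error of a class-average gain score.

    Assumption (not from the slides): each of the two test scores carries
    independent error with standard deviation `cse`, identical for all
    n students.
    """
    return math.sqrt(2.0) * cse / math.sqrt(n)


# Weighted average CSE for TAKS Reading Grade 5, 2009 (from the slide).
WEIGHTED_AVG_CSE = 38.96

for n in (25, 50, 100):
    print(f"n={n:>3}: SE of mean growth = {se_of_mean_growth(WEIGHTED_AVG_CSE, n):.2f}")
```

Under these assumptions, quadrupling the class size from 25 to 100 halves the noise in the class-average gain, which is the sense in which a larger n gives a more stable estimate.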
Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading, 2009 Population Statistics
– Proficiency test
– Vertically aligned scale scores
– Average yearly gain
  • 24 vertical scale points at “Met Expectations”
  • 34 vertical scale points at “Commended”
– Standard errors: Conditional Standard Errors (CSE) reported by TEA for each vertical scale score
  • CSE Range: 24 - 74
  • Weighted average CSE = 38.96
– Highly skewed distribution
– High variance

Data Sets
TAKS – Texas Assessment of Knowledge and Skills: Grade 5 Reading
[Figure: Frequency Distribution, TAKS Reading Grade 5. Frequency (0 to 30,000) by scale score (188 to 922).]
N: 323,507
μ: 701.49
σ²: 10048.30
σ: 100.24

Data Sets
MAP – Measures of Academic Progress
– Growth measure
– Computer Adaptive Test
– Single scale
– Average yearly gain
  • 5.06 RIT points
– Standard errors: average standard errors range from 2.5 to 3.5 RIT
– Slightly skewed distribution
– Small variance

Data Sets
MAP – Measures of Academic Progress
[Figure: Frequency Distribution: MAP Reading Grade 5. Frequency (0 to 100,000) by RIT score (165 to 238).]
N: 2,663,382
μ: 208.35
σ²: 161.82
σ: 12.72

Simulated Data
Because it is impossible to isolate true scores and error with real data, we created simulated data points.
– True scores are known for all data points
– Every data point was given the same growth
  • All iterations have the same value-added
  • Any deviation from expected is a function of measurement error only

Simulated Data
We simulated 10,000 z-scores ~ N(0,1). From these we selected nested random samples of n=100, n=50, and n=25.

Statistical Summary, z-Score Samples by n
Statistic          n=100    n=50    n=25
Mean               -0.13   -0.09    0.01
Std. Deviation      0.97    0.97    1.00
Skewness           -0.12    0.18    0.10
Minimum            -2.34   -1.85   -1.77
Maximum             2.09    2.09    2.09

Data Generation
Pre-scores = P1 = z-score • σ + x
Post-scores = P2 = P1 + controlled growth
Controlled Growth Values:
  TAKS = 24 vertical scale points (TAKS at “Commended” = 34)
  MAP = 5.06 RIT points
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA

Monte Carlo Simulation
We ran 1,000 iterations of each simulation, which is equivalent to the same students taking the test 1,000 times with the same true scores but different draws of measurement error.
Simulated Growth = (P2 + (Random2 • CSE)) - (P1 + (Random1 • CSE))
Random1 and Random2 ~ N(0,1)
CSE = Conditional Standard Errors as reported by TEA and NWEA
We aggregated values by subgroup to determine average performance for each iteration.
False Negative: Simulated Growth < 0.5 • Controlled Growth
False Positive: Simulated Growth > 1.5 • Controlled Growth

Results
Monte Carlo Results, n=100
Test                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                                1.7                2.5                95.8
TAKS Normal Distribution at “Meets” Level               0.9                1.8                97.3
TAKS Normal Distribution Avg SE                         1.2                1.8                97.0
TAKS Normal Distribution at “Commended” Level           0.8                0.2                99.0
TAKS Normal Grade Transition                            1.4                2.1                96.5
MAP Normal                                              0.0                0.0               100.0
MAP Max CSE                                             0.0                0.0               100.0

Results
Monte Carlo Results, n=50
Test                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                                7.4                9.6                83.0
TAKS Normal Distribution at “Meets” Level               6.6                8.4                85.0
TAKS Normal Distribution Avg SE                         5.7                7.4                86.9
TAKS Normal Distribution at “Commended” Level           4.4                1.7                93.9
TAKS Normal Grade Transition                            6.5                8.1                85.4
MAP Normal                                              0.0                0.0               100.0
MAP Max CSE                                             0.7                0.6                98.7

Results
Monte Carlo Results, n=25
Test                                             % False Negative   % False Positive   % Total Correct ID
TAKS Actual Distribution                               16.1               18.4                65.5
TAKS Normal Distribution at “Meets” Level              16.8               18.0                65.2
TAKS Normal Distribution Avg SE                        14.5               16.0                69.5
TAKS Normal Distribution at “Commended” Level          10.2                7.7                82.1
TAKS Normal Grade Transition                           18.6               18.2                63.2
MAP Normal                                              0.5                0.5                99.0
MAP Max CSE                                             3.0                4.2                92.8

Results
Student Sample Size Descriptive Statistics
(Avg = average simulated growth; SD = standard deviation of simulated growth)
VAM                                          Controlled Growth   n=100 Avg   n=100 SD   n=50 Avg   n=50 SD   n=25 Avg   n=25 SD
TAKS Actual Distribution                            24             24.29       6.02      24.26      8.78      24.18     12.28
TAKS Normal Distribution at “Meets”                 24             24.08       5.45      24.45      8.37      24.14     12.39
TAKS Normal Distribution Avg SE                     24             24.19       5.45      24.61      8.03      24.59     11.47
TAKS Normal Distribution at “Commended”             34             33.85       5.60      34.15      8.12      34.92     11.87
TAKS Normal Grade Transition                        24             24.08       5.59      24.24      8.59      24.15     12.85
MAP Normal                                           5.06           5.07       0.49       5.12      0.72       5.12      1.03
MAP Max CSE                                          5.06           5.05       0.71       5.05      0.99       5.08      1.37

Percent Misidentified by Student Sample Size
Test                                         n=100    n=50    n=25
TAKS Normal Distribution at “Meets”            2.7    15.0    34.8
MAP Normal                                     0.0     0.0     1.0

Conclusions
• The Growth/Error ratio is the critical variable in VAM stability.
• Necessary student n to achieve a stable VAM is sensitive to the Growth/Error ratio.
• Stable VAMs are possible even with typical classroom n's; however, careful attention must be paid to the suitability of the student assessment instrument.

Limitations
• No Differentiation between Student Effects, Teacher Effects, or School Effects
• No Environmental Effects
• No Interaction Terms
These are all areas for additional research.
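The Monte Carlo procedure from the Methods slides can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the study's code: it uses one constant CSE for every student (the study used the score-specific CSEs reported by TEA and NWEA), it reads x in the pre-score formula as the population mean, and it applies the false negative and false positive thresholds to the class-average growth of each iteration. All function names are hypothetical.

```python
import random
import statistics


def make_pre_scores(n, mean=701.49, sd=100.24, seed=0):
    """True pre-scores P1 = z * sd + mean, with z ~ N(0,1).

    Defaults are the TAKS Grade 5 Reading population values; treating
    x in the slide formula as the mean is an assumption.
    """
    rng = random.Random(seed)
    return [rng.gauss(0, 1) * sd + mean for _ in range(n)]


def simulate_class_growth(pre_scores, growth, cse, iterations=1000, seed=0):
    """Return the class-average simulated growth for each iteration.

    Simplification: one constant `cse` for every student and both tests.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(iterations):
        gains = []
        for p1 in pre_scores:
            p2 = p1 + growth  # true post-score: identical growth for everyone
            observed_pre = p1 + rng.gauss(0, 1) * cse
            observed_post = p2 + rng.gauss(0, 1) * cse
            gains.append(observed_post - observed_pre)
        averages.append(statistics.fmean(gains))
    return averages


def classify(averages, growth):
    """Share of iterations below 0.5x or above 1.5x the controlled growth."""
    total = len(averages)
    false_neg = sum(a < 0.5 * growth for a in averages) / total
    false_pos = sum(a > 1.5 * growth for a in averages) / total
    return {
        "false_negative": false_neg,
        "false_positive": false_pos,
        "correct": 1.0 - false_neg - false_pos,
    }


if __name__ == "__main__":
    for n in (25, 100):
        pre = make_pre_scores(n, seed=1)
        # TAKS-like setting: 24 points of growth, weighted average CSE 38.96.
        averages = simulate_class_growth(pre, growth=24.0, cse=38.96, seed=2)
        print(n, classify(averages, growth=24.0))
```

Because the CSE is held constant here, the output will not reproduce the reported percentages exactly; the sketch is meant only to show how a small Growth/Error ratio combined with a small n pushes class averages past the 0.5 and 1.5 thresholds.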