Effective Use of Benchmark Test and Item Statistics and Considerations When Setting Performance Levels
California Educational Research Association
Anaheim, California
December 1, 2011

Review of Benchmark Test and Item Statistics

Objective
Extend the knowledge of the assessment team to:
1. Better understand test reliability and the influences of test composition and test length.
2. Better understand item statistics and use them to identify items in need of revision.

Reliability
Reliability is a measure of the consistency of the assessment.
Types of reliability coefficients (always range from 0 to 1):
• Test-retest
• Alternate forms
• Split-half
• Internal consistency (Cronbach's alpha/KR-20)

Reliability Is Influenced by Test Length
• The Spearman-Brown formula estimates the reliability of a shortened test: r_new = (k × r) / (1 + (k − 1) × r), where r is the reliability of the full-length test and k is the ratio of the new test length to the original length.
• Remember: the reliability of a score is an indication of how much an observed score can be expected to be the same if observed again.
NOTE: See handout from the STAR Technical Manual for exact cluster reliabilities.

Reliability Is Influenced by Test Length
• Example: given a 75-item test with r = .95:
– 40-item test has r = .91
– 35-item test has r = .90
– 30-item test has r = .88
– 25-item test has r = .86
– 20-item test has r = .84
– 10-item test has r = .72
– 5-item test has r = .56

Reliability Statistics for CSTs (see handout)
• Note that CST reliabilities range from .90 to .95.
• Note that cluster reliabilities are consistent with those predicted by the Spearman-Brown formula.

Validity
Validity is the degree to which the test is measuring what was intended.
Types of test validity:
A. Predictive or criterion (How does it correlate with other measures?)
B. Content
   1. How well does the test sample from the content domain?
   2.
How aligned are the items with regard to format and rigor?

Validity Is Influenced by Reliability
• Remember: validity is the agreement between a test score and the quality it is believed to measure.
• The upper limit on the validity coefficient is the square root of the reliability coefficient:
– 75-item test: √.95 = .97
– 30-item test: √.88 = .94
– 20-item test: √.84 = .92
– 10-item test: √.72 = .85
– 5-item test: √.56 = .75

Coefficient of Determination (R squared)
• The square of the validity coefficient gives the proportion of variance in the achievement construct accounted for by the test:
– 75-item test: .97 squared = .94
– 30-item test: .94 squared = .88
– 20-item test: .92 squared = .84
– 10-item test: .85 squared = .72
– 5-item test: .75 squared = .56
• Note that this upper limit works out to the reliability of the test itself.

Using Item Statistics (p-values & point-biserials)
• Apply item analysis statistics from your assessment reporting system (e.g., DataDirector, Edusoft, OARS, EADMS).
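Before interpreting these statistics, it can help to see how they are computed. A minimal sketch in Python, using a hypothetical scored response matrix (the data and the uncorrected point-biserial are illustrative assumptions, not the method of any particular reporting system):

```python
import numpy as np

# Hypothetical scored responses: rows = students, columns = items,
# 1 = correct, 0 = incorrect. Item difficulties are spread from .3 to .8.
rng = np.random.default_rng(0)
responses = (rng.random((200, 20)) < np.linspace(0.3, 0.8, 20)).astype(int)

def p_values(scores: np.ndarray) -> np.ndarray:
    """P-value of each item: proportion of the group answering it correctly."""
    return scores.mean(axis=0)

def point_biserials(scores: np.ndarray) -> np.ndarray:
    """Correlation of each dichotomous item with the total test score.
    The item is left in the total here; some systems use a corrected
    total that excludes the item being analyzed."""
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, j], total)[0, 1]
                     for j in range(scores.shape[1])])

p = p_values(responses)
rpb = point_biserials(responses)
# Flag items outside the usual screening ranges for review.
flagged = [j for j in range(len(p))
           if not (0.30 <= p[j] <= 0.80) or rpb[j] < 0.30]
```

Flagging is only a screen: a flagged item is a candidate for review, not automatically a bad item.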
P-values (percent of the group getting the item correct)
• Most should be between .30 and .80.
• A very high value indicates the item may be too easy; a very low value may indicate a problem item.

Point-biserials (correlation of the item with the total score)
• Most should be .30 or higher.
• A very low or negative value generally indicates a problem with the item.

Item Statistics for CSTs (see handout)
• Note that the range of p-values is consistent with most being between .30 and .80.
• Note that median point-biserials are generally in the .40s.

Example Item Analyses

Algebra 1, Question 7 (PL: Basic)
                    District                Pilot Group
Choice          # Students  Percent     # Students  Percent
A                     1691    36.81            220    37.48
B                     1563    34.02            187    31.86
C                      669    14.56             85    14.48
D                      629    13.69             89    15.16
E                        4     0.09              2     0.34
BLANK                   38     0.83              4     0.68
Total                 4594   100               587   100
Point-biserial        0.31                    0.38

Algebra 1, Question 19 (PL: Advanced Proficient)
                    District                Pilot Group
Choice          # Students  Percent     # Students  Percent
A                      971    21.18            108    18.40
B                     1028    22.42            125    21.29
C                     1193    26.02            145    24.70
D                     1148    25.04            155    26.41
E                        7     0.15              0     0.00
BLANK                  238     5.19             54     9.20
Total                 4585   100               587   100
Point-biserial        0.23                    0.19

Algebra 2, Question 21 (PL: Beyond Advanced Proficient)
                    District                Pilot Group
Choice          # Students  Percent     # Students  Percent
A                      286    23.50             45    24.32
B                      248    20.38             37    20.00
C                      354    29.09             63    34.05
D                      260    21.36             35    18.92
E                        0     0.00              0     0.00
BLANK                   69     5.67              5     2.70
Total                 1217   100               185   100
Point-biserial        0.19                    0.24

Geometry, Question 12 (PL: Proficient)
                    District                Pilot Group
Choice          # Students  Percent     # Students  Percent
A                      247    13.46             42    15.91
B                      603    32.86             90    34.09
C                      703    38.31             99    37.50
D                      273    14.88             31    11.74
E                        0     0.00              0     0.00
BLANK                    9     0.49              2     0.76
Total                 1835   100               264   100
Point-biserial        0.10                    0.10

Maximizing Predictive Accuracy of District Benchmarks

Objective
Extend the knowledge of the assessment team to:
1. Better understand how performance level setting is key to predictive validity.
2.
Better understand how to create performance level bands based on equipercentile equating.

Comparing District Benchmarks to CST Results
Common methods for setting cutoffs on district benchmarks:
• Use the default settings on the assessment platform (e.g., 20%, 40%, 60%, 80%).
• Ask curriculum experts for their opinion of where the cutoffs should be set.
• Determine the percent correct corresponding to performance levels on the CSTs and apply it to the benchmarks.

There is a better way!

"Two scores, one on form X and the other on form Y, may be considered equivalent if their corresponding percentile ranks in any given group are equal." (Educational Measurement, Second Edition, p. 563)

Equipercentile Method of Equating at the Performance Level Cut-points
• Establishes cutoffs for benchmarks at the same local percentile ranks as the cutoffs for the CSTs.
• By applying the same local percentile cutoffs to each trimester benchmark, comparisons across trimesters within a grade level are more defensible.

Equipercentile Equating Method
Step 1 – Identify the CST scaled-score cut-points.
Step 2 – Establish local percentiles at the CST performance level cutoffs (from the scaled-score frequency distribution).
Step 3 – Locate the benchmark raw scores corresponding to the CST cutoff percentiles (from the benchmark raw-score frequency distribution).
Step 4 – Validate classification accuracy.

Step 4 – Validate Classification Accuracy – Old Cutoffs (2nd Semester Biology vs. 2006 CST)

Old Cutoff               CST: FBB    BB  Basic  Proficient  Advanced  Total
 0-17  FBB                     57    72     25           1         0    155
18-34  BB                     118   297    511          60         4    990
35-48  Basic                   19    51    427         401        45    943
49-62  Proficient               1     5     27         141       207    381
63-70  Advanced                 0     0      0           0        20     20
Total                         195   425    990         603       276   2489

Correct classification, Proficient & Advanced on CST = 42%
Correct classification, each level on CST = 38%
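Steps 1 through 3 can be sketched in a few lines of Python. The score distributions and the CST scaled-score cuts below are hypothetical stand-ins, not official values:

```python
import numpy as np

# Hypothetical data: each student's CST scaled score and the same
# group's benchmark raw scores (0-70 item test).
rng = np.random.default_rng(1)
cst_scaled = rng.normal(330, 55, 2500).round()
bench_raw = np.clip(rng.normal(38, 12, 2500), 0, 70).round()

# Step 1: CST scaled-score cut-points (illustrative values only).
cst_cuts = {"Basic": 300, "Proficient": 350, "Advanced": 392}

# Step 2: local percentile rank of each CST cut-point, i.e. the
# percent of this group scoring below the cut.
pct_ranks = {lvl: (cst_scaled < cut).mean() * 100
             for lvl, cut in cst_cuts.items()}

# Step 3: benchmark raw score sitting at the same local percentile
# rank in the benchmark raw-score distribution.
bench_cuts = {lvl: int(np.percentile(bench_raw, pr))
              for lvl, pr in pct_ranks.items()}
```

The resulting `bench_cuts` are the raw-score performance-band boundaries for the benchmark; Step 4 then checks how well they classify students relative to the CST.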
Classification Accuracy – New Cutoffs (2nd Semester Biology vs. 2006 CST)

New Cutoff               CST: FBB    BB  Basic  Proficient  Advanced  Total
 0-19  FBB                     89   107     53           4         0    253
20-26  BB                      59   142    148          12         0    361
27-40  Basic                   39   161    596         176         9    981
41-51  Proficient               8    12    181         354        82    637
52-70  Advanced                 0     3     12          57       185    257
Total                         195   425    990         603       276   2489

Correct classification, Proficient & Advanced on CST = 77%
Correct classification, each level on CST = 55%
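The Step 4 accuracy figures fall straight out of the joint counts. A minimal sketch using the new-cutoff table above (rows = benchmark level, columns = CST level: FBB, BB, Basic, Proficient, Advanced):

```python
import numpy as np

# Joint classification counts from the new-cutoff table above.
new = np.array([
    [89, 107,  53,   4,   0],   # FBB (0-19)
    [59, 142, 148,  12,   0],   # BB (20-26)
    [39, 161, 596, 176,   9],   # Basic (27-40)
    [ 8,  12, 181, 354,  82],   # Proficient (41-51)
    [ 0,   3,  12,  57, 185],   # Advanced (52-70)
])

# Exact agreement: same level on the benchmark and the CST
# (the diagonal of the table).
each_level = np.trace(new) / new.sum()

# Proficient-or-Advanced agreement: of the students who scored
# Proficient or Advanced on the CST (last two columns), the share
# the benchmark also placed at Proficient or Advanced.
prof_adv = new[3:, 3:].sum() / new[:, 3:].sum()

print(f"Each level: {each_level:.0%}")           # → 55%
print(f"Proficient & Advanced: {prof_adv:.0%}")  # → 77%
```

Running the same computation on the old-cutoff table reproduces the 38% and 42% figures, which is the comparison the validation step rests on.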
Example: Classification Accuracy – Biology
                               Old    New
2nd Semester: Prof. & Adv.     42%    77%
2nd Semester: Each Level       38%    55%
1st Semester: Prof. & Adv.     30%    77%
1st Semester: Each Level       31%    50%
1st Quarter:  Prof. & Adv.     53%    71%
1st Quarter:  Each Level       41%    46%

Example: Classification Accuracy – Chemistry
                               Old    New
2nd Semester: Prof. & Adv.     63%    79%
2nd Semester: Each Level       47%    52%
1st Semester: Prof. & Adv.     74%    74%
1st Semester: Each Level       49%    50%
1st Quarter:  Prof. & Adv.     83%    76%
1st Quarter:  Each Level       48%    47%

Example: Classification Accuracy – Earth Science
                               Old    New
2nd Semester: Prof. & Adv.     48%    68%
2nd Semester: Each Level       43%    52%
1st Semester: Prof. & Adv.     33%    66%
1st Semester: Each Level       38%    47%
1st Quarter:  Prof. & Adv.     42%    56%
1st Quarter:  Each Level       34%    41%

Example: Classification Accuracy – Physics
                               Old    New
2nd Semester: Prof. & Adv.     57%    87%
2nd Semester: Each Level       37%    57%
1st Semester: Prof. & Adv.     60%    88%
1st Semester: Each Level       42%    50%
1st Quarter:  Prof. & Adv.     65%    87%
1st Quarter:  Each Level       47%    45%

Things to Consider Prior to Establishing the Benchmark Cutoffs
• Will there be changes to the benchmarks after the CST percentile cutoffs are established?
– If NO, then raw-score benchmark cutoffs can be established by linking the CST to the same year's benchmark administration (i.e., spring 2011 CST matched to 2010-11 benchmark raw scores).
– If YES, then wait until the new benchmark is administered and then establish the raw-score cutoffs on the benchmark.
• How many cases are available for establishing the CST percentiles?
(Too few cases could lead to unstable percentile distributions.)

Things to Consider Prior to Establishing the Benchmark Cutoffs (continued)
• How many items comprise the benchmarks to be equated? (As the test gets shorter, it becomes more difficult to match the percentile cut-points established on the CSTs.)

Summary: Equipercentile Equating Method
• The method generally establishes a closer correspondence between the CST and the benchmarks.
• When benchmarks are already tightly aligned with the CSTs (e.g., elementary math), the approach may be less advantageous.
• Comparisons between benchmark and CST performance can be made more confidently.
• Comparisons between benchmarks within the school year can be made more confidently.

Coming Soon from Illuminate Education, Inc.!
Reports using the equipercentile methodology are being programmed to:
(1) establish benchmark cutoffs for performance bands, and
(2) create validation tables showing improved classification accuracy based on the method.

Contact:
Tom Barrett, Ph.D.
President, Barrett Enterprises, LLC
Director, Owl Corps, School Wise Press
2173 Hackamore Place
Riverside, CA 92506
951-905-5367 (office)
951-237-9452 (cell)