CREATE - National Evaluation Institute Annual Conference, October 5-7, 2012
Educational Accountability and Teacher Evaluation: Real Problems, Practical Solutions

Oversight of test administration: Respect for educators, respect for standardization, working together to focus on measurement

Eliot Long
A*Star Audits, LLC
Brooklyn, NY
eliotlong@astaraudits.com
www.astaraudits.com

Talking about oversight

1. Why? Why is oversight necessary? What are the problems and how significant are they? Is this all worth the trouble?
2. How? How will we measure compliance? How will we collect and analyze data? What is the range of error, and what is the potential for misjudgment?
3. What? What will we do with the results of the oversight? Costly, time-consuming investigations? Educator sanctions? Lower test scores?
4. Focus. What is our focus in choosing our methods and practices? MEASUREMENT requires RELIABILITY requires STANDARDIZATION

Need for Oversight: What are the problems we are seeking to fix?

Problems in test administration:
- Confusion, misdirection
  - Misunderstanding, lack of preparation
  - Special directions for some, but not all, students
- Inappropriate strategies for test-taking
  - Guessing strategies (Choose a middle size answer …)
  - Hurry, skim, look for easy questions first
- Improper assistance
  - Hints (Remember when to invert and multiply …)
  - Rereading, clarifying test questions
  - Brief instruction on test content
- Cheating … You wouldn’t want to be caught doing it.

See, for example, 19 ways to raise student scores: Amrein-Beardsley, A., Berliner, D.C. & Rideau, S. (2010) “Cheating in the first, second, and third degree: Educators’ responses to high-stakes testing”. Education Policy Analysis Archives, 18(14). Retrieved from http://epaa.asu.edu/ojs/article/view/714

Outcomes of Improper Influence: Undermines the usefulness of test scores for program evaluation and accountability

Encouraged guessing is found to:
1. Reduce the range of measure with respect to the true range of test-taker achievement.
   - Reduce general pop. variance by 30%
   - Reduce general pop. reliability by ?
2. Create widely varying classroom to classroom test score reliability.
3. Create a test score modulator, with changes in achievement and proctor influence moving in opposite, offsetting directions.
   - Mask true achievement gains by 35% for students at Basic Performance

Figure: Distribution of Number Correct Scores. Estimated Skills Based and Observed Number Correct Scores. Series: Est. True Score Distribution, Observed Score Distribution. Axes: Frequency - Percent of All Students vs. Grade 5 Reading Test - Number Correct Score.

Figure: True vs. Observed Gains at Min. Score for Passing. Series: Observed, True. Axis: Percent Gain or Decline.

Evaluating student response patterns for evidence of improper influence

Student response patterns: Taken as a group, students in a classroom succeed or fail with the individual test questions in a pattern that:
1. Reflects the difficulty of the test questions, and
2. Follows the same fluctuations as the patterns for other classrooms at the same achievement level.

A band of normative deviations may be established around the norm for any one achievement level.

Evaluating a classroom against its achievement level (peer group) norm

Skill level norm: All classrooms at the same achievement level set a peer group or ‘skill level’ norm.
P-value correlation: One method of comparison is a correlation of the class and skill level p-values. Here, for a 50 item test, n = 50; r = .95
Percent attempted: The line with stars indicates the percent of students who answer each item.
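The p-value correlation described above can be sketched in a few lines. This is an illustration only, not the A*Star implementation: it assumes each item p-value is the percent correct among students who attempted the item, and the function names `item_p_values` and `pearson_r` are hypothetical.

```python
from math import sqrt

def item_p_values(responses):
    """Per-item p-values for one class.

    responses: one list per student; each entry is 1 (correct),
    0 (wrong) or None (item not attempted). Assumption: the p-value
    is the share of correct answers among students who attempted
    the item.
    """
    n_items = len(responses[0])
    p_values = []
    for i in range(n_items):
        attempts = [row[i] for row in responses if row[i] is not None]
        p_values.append(sum(attempts) / len(attempts) if attempts else 0.0)
    return p_values

def pearson_r(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical 4-item example: one class against its skill level norm.
class_responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, None],   # last item not attempted
    [1, 1, 1, 1],
]
skill_level_norm = [0.9, 0.7, 0.5, 0.3]
r = pearson_r(item_p_values(class_responses), skill_level_norm)
```

For a 50 item test, both profiles would have 50 entries, which is the n = 50 in the correlation reported on the slide.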
Consistency within and across Assessment Settings

Consistency of Group Test Administrations
When all group profiles are correlated with their appropriate skill level norms, the distribution of correlation coefficients indicates the level of test administration consistency within the assessment setting.

Consistency of Educational vs. Industrial Assessment
A comparison of classroom groups with job applicant groups (tested by employers) indicates lower consistency in classroom test administrations.

Median correlation with the skill level norm:
- Schools: r = .900
- Classrooms: r = .907
- Employers: r = .958

Proctor Influence to Guess

Two Low Performing Classrooms, Two Different Forms of Encouragement to Guess
Figure panels: Encouragement to Guess at Random; Encouragement to Guess by choosing ‘C’.

Guessing Effects within the Classroom

Teacher encouragement to guess is challenged by the variation in student achievement and need to guess, potentially resulting in under-assessment. The class shown has a poor correlation with the norm (.74). Guessing by some students, and teacher actions to encourage it, contradict norm patterns.

- Full class: n = 18; RS = 22.3; r = .74
- Higher scoring subgroup: n = 8; RS = 29.4; r = .80
- Lower scoring subgroup (low performance due to guessing?): n = 10; RS = 16.6; r = .44
- Chart reference line: 25% Correct

Proctor Effects on Summer School Gain

Indications from Summer School
Two major patterns of test work behavior are inferred from the results:
1. Proctor directions to answer the questions in order, take time, work carefully, and reserve guessing until the end of the session.
2. Proctor directions to either:
   a) First skim the test to look for the easy questions, then guess, or
   b) Hurry from the beginning, don’t waste time on difficult questions, guess and move on.

Improper Proctor Influence

Proctor influence ranges from positive to moderately negative to a serious undermining of the assessment. Significant improper influence leads to measurable deviations in classroom response patterns.
Figure: A*Star® P-value Profile. Moderate scoring class (n=17) with improbable elevated p-values. Response pattern probability: P < 0.01. Correlation with the norm: r = .713. Series: Class (n=17) Profile, Skill Level Norm RS 27.5, Percent Attempted. Axes: Percent Correct of Attempts (P-value) vs. Grade 6 Reading Test item number.

Figure: A*Star® P-value Profile. Average scoring class (n=19) with improbable elevated p-values. Response pattern probability: P < 0.001. Correlation with the norm: r = .579. Series: Class (n=19) Profile, Skill Level Norm RS 30.5, Percent Attempted. Axes: Percent Correct of Attempts (P-value) vs. Grade 6 Reading Test item number.

Subject Group Analysis

Identify those test-takers most likely to have been the subject of improper proctor influence:
- Determine the expected frequency at each answer alternative for each total test score, based on observed frequencies in the state population
- Identify item responses consistent with the group pattern irregularity
- Identify the joint likelihood of these item responses at each score level
- Apply the likelihood estimate to each score level in the group
- Sum the observed and expected frequencies for the group
- Compare the observed to the expected frequency via the binomial probability distribution.

Determine the number of students and the number of test answers involved, and the likelihood of this combination occurring in classrooms of the same size and at the same achievement level.

Subject Group Analysis
Identify those most likely subject to improper influence

Most often, improper teacher influence is unplanned and disorganized. Yet, where the influence is persistent, subsets of students will be identified with matching, unlikely response patterns.
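The last two steps of the Subject Group Analysis (summing observed and expected frequencies, then comparing them with the binomial distribution) can be illustrated with a simplified sketch. It is not the A*Star procedure: it pools all responses under review into one average expected probability, which is an assumption, and the function names are hypothetical.

```python
from math import comb

def binomial_upper_tail(k_observed, n_responses, p_expected):
    """P(X >= k_observed) for X ~ Binomial(n_responses, p_expected)."""
    return sum(
        comb(n_responses, k)
        * p_expected ** k
        * (1 - p_expected) ** (n_responses - k)
        for k in range(k_observed, n_responses + 1)
    )

def subject_group_probability(observed, expected):
    """Probability of seeing at least the observed number of matches.

    observed: 0/1 flags, one per (student, item) response under
              review; 1 means the response matches the irregular
              group pattern.
    expected: for each of those responses, the population likelihood
              of that answer at the student's total score level.
    Simplifying assumption: responses are treated as independent
    trials sharing the average expected probability.
    """
    n = len(observed)
    k = sum(observed)             # summed observed frequency
    p = sum(expected) / n         # summed expected frequency, per response
    return binomial_upper_tail(k, n, p)

# Hypothetical example: 4 of 4 responses match a pattern that the
# population shows half the time at these score levels.
probability = subject_group_probability([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

With many students and items involved, the same calculation produces very small tail probabilities of the kind reported for subject groups in the deck.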
Figure: A*Star® P-value Profile. Comparison of Subject Group (n=8) to Class Skill Level Norm. Subset of students likely to have received assistance. Subject Group: n = 8 of 17. Response pattern probability: P = 3.43e-9. Correlation with the norm: r = .498.

Figure: A*Star® P-value Profile. Comparison of Subject Group (n=12) to Class Skill Level Norm. Subset of students likely to have received assistance. Subject Group: n = 12 of 19. Response pattern probability: P = 6.68e-15. Correlation with the norm: r = .342.

Both figures plot the Subject Group Profile against the Skill Level Norm (legends: RS 29.3 and RS 33.2). Axes: Percent Correct of Attempts (P-value) vs. Grade 6 Reading Test item number.

School Administrator Involvement

School administrator involvement is indicated when a highly irregular (highly improbable) response pattern is found that crosses over classrooms.
- Full school (n=69) response pattern reveals a substantial irregularity over the early test items.
- Subset group (n=30) includes students from several classrooms, indicating an influence from outside the classroom.

Confirmation of Improper Influence

How do we know that irregular response patterns indicate improper test administration?

Confirmation for statistical analysis
Testing program: “Ability-To-Benefit” testing: Basic reading and math skills
Program of the Office of Federal Student Aid
Analyses by: Most major test publishers
Reviewed by: U. S. Dept. of Education, Office of Inspector General
OIG Report: Final Management Information Report, Jan. 25, 2010
Available at: www2.ed.gov/about/offices/list/oig/alternativeproducts/x11j0002.pdf
Summary: OIG data analytics project investigated 106 test administrators indicated by the A*Star method; 83 were identified by the OIG, while an unspecified number of others were not investigated due to their small number of test administrations after applying the statute of limitations.

Defining “Significant” Misadministration

What constitutes “significant” cases of misadministration (cheating)?

Number of test items affected
Improper influence on any test item is wrong, but influence on only a few items is more likely an effort to facilitate the test administration than an effort to materially raise test scores.

Number of students involved
My sense of it is that a large number of items for a few students is a greater problem than a few items for a large number of students: the latter may reflect a perceived problem with the items, while the former suggests an effort to raise the scores of lower performing students.

Improbability of response pattern
Any probability less than 1 in 10,000 is significant, but common wrong answers create unusually low probabilities that may overshadow more important problems. A “six sigma” approach is conservative.

Definition used here:
- Minimum 10% of test items
- Minimum #SGA students times #SGA items = 5% of all responses
- Probability less than 1 in 100,000 (less than 10 in one million)

Audit Summary of Large Urban District - 2001

2001 Urban District Elementary Schools, Grade 5 Math
Chart of schools by the A*Star analysis of each school’s Grade 5 Math test response pattern, plotted by the volume of test responses potentially subject to improper influence (Pct. of Group Responses, 0.0% to 25.0%) and by the improbability of the pattern occurring in a normative test administration (SGA Probability Exponent). The chart groups the exponents into Probable, Improbable, and Highly Improbable ranges.

Response Pattern           Pct. of Schools
Consistent with norms      85%
Modest irregularity        12%
Severe irregularity        3%

SGA Prob.    Freq.    Percent
n/s          181      27.1%
E-01         7        1.0%
E-02         104      15.6%
E-03         152      22.8%
E-04         96       14.4%
E-05         53       7.9%
E-06         19       2.8%
E-07         15       2.2%
E-08         12       1.8%
E-09         7        1.0%
E-10         7        1.0%
E-11         3        0.4%
E-12         2        0.3%
E-13-45      4        0.6%
Total        667      100.0%

Ability-To-Benefit Testing - 2002-2005
Reviewed by Office of Inspector General

2002-2005 Nationally distributed Occupational Training Schools, Basic Math Skills
Chart of schools by the A*Star analysis of each school’s student applicants’ test response pattern, plotted by the volume of test responses potentially subject to improper influence (Pct. of Group Responses, 0.0% to 25.0%) and by the improbability of the pattern occurring in a normative test administration (SGA Probability Exponent).

Response Pattern           Pct. of Schools
Consistent with norms      67%
Modest irregularity        19%
Severe irregularity        14%

SGA Prob.    Freq.    Percent
n/s          199      26.5%
E-01         22       2.9%
E-02         100      13.3%
E-03         85       11.3%
E-04         75       10.0%
E-05         54       7.2%
E-06         26       3.5%
E-07         20       2.7%
E-08         11       1.5%
E-09         15       2.0%
E-10         8        1.1%
E-11         13       1.7%
E-12         10       1.3%
E-13-45      112      14.9%
Total        750      100.0%

East Coast State - 2008

East Coast Statewide Review, Elementary Schools, Grade 4 Math
Chart of schools by the A*Star analysis of each school’s Grade 4 Math test response pattern, plotted by the volume of test responses potentially subject to improper influence (Pct. of All Responses, 0.0% to 25.0%) and by the improbability of the pattern occurring in a normative test administration (SGA Probability Exponent).

Response Pattern           Pct. of Schools
Consistent with norms      37%
Modest irregularity        41%
Severe irregularity        22%

SGA Prob.    Freq.    Percent
n/s          148      11.3%
E-01         15       1.1%
E-02         165      12.6%
E-03         279      21.3%
E-04         231      17.6%
E-05         147      11.2%
E-06         120      9.2%
E-07         67       5.1%
E-08         38       2.9%
E-09         41       3.1%
E-10         20       1.5%
E-11         8        0.6%
E-12         5        0.4%
E-13-45      27       2.1%
Total        1,311    100.0%

Addressing the problem

There are significant deviations from standardized test administration that materially undermine the usefulness of test scores. What do we do about it?
- Comprehensive instructions for school administrators and teachers
- Regular review of test results for misadministration
- Communication with educators when irregularities arise
- Reserve sanctions for persistent cases of misadministration.

Case in point: Whole test manipulation
Same Teacher, Two Successive Years

The first year begins normally and becomes increasingly irregular. The second year begins irregular and continues so over the entire test, suggesting a preplanned intent to control the testing outcome.

- First Year, Grade 5 Reading: r = .750
- First Year, Grade 5 Math: r = .527
- Second Year, Grade 5 Reading: correlation with the norm r = .487

Not caught by erasure analysis because of the low percentage of wrong-to-right erasures (40%).

Self Correction - following notice
Oversight improves proctoring

Following the second year reading test, the teacher was notified that her testing practices were under investigation. Three weeks later, her administration of the math test was remarkably improved.

Note: The teacher was not given any instruction on how to change her test administration practices. She was only told that irregularities had been found in her students’ test answers. The next test administration resulted in an essentially perfect response pattern.
Figure: Grade 5 Math, Second Year. Pattern correlation with the norm: r = .947

Oversight Program Steps
Step 1: Directions for Test Administration

Provide comprehensive directions for managing the classroom and conducting the test administration. Address all issues that arise in test administration. For example:
1. How students should respond when they do not know how to respond, i.e., guess? Use ‘test wise’ guessing strategies? Leave blank?
2. How to deal with apparent misalignment of the test with the curriculum, i.e., when there is test material not included in classroom instruction.
3. What the teacher should do when she/he notices a student mark a wrong answer when she/he knows the student knows the correct answer.

Test directions should be rewritten following meetings with school administrators and teachers to review response pattern research and address all issues challenging standardized test administration.

Oversight Program Steps
Step 2: Conduct Regular Review of Test Response Patterns

Conduct an annual review of test response patterns, for each test, by classroom and by school. The review should:

Identify:
1. Problems in the test construction or in the directions for test administration.
2. Locations with irregular results and likely misadministration.

Provide:
3. Trend analysis for each classroom and school. Have past irregularities been cured? Is there a sudden change in the quality of test administration?
4. Resource to evaluate complaints or allegations of misconduct.
5. Resource to evaluate test score input to teacher evaluation.

Oversight Program Steps
Step 3: Report review results to administrators & educators

Report results and recommendations. Prepare a written report for each administrator and teacher/test administrator.
1. Measure of consistency with norms for the same achievement level.
2. To the extent possible, indicate areas of test administration procedures for future improvement.
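The trend questions in Step 2 (have past irregularities been cured, is there a sudden change in quality) could be answered from a classroom's year-by-year correlation with its skill level norm. The sketch below is one possible reading, not a documented procedure: the function name `review_trend`, the 0.85 floor, and the 0.10 drop threshold are all hypothetical values chosen for illustration.

```python
def review_trend(history, floor=0.85, max_drop=0.10):
    """Label year-over-year changes in a classroom's norm correlation.

    history:  (year, r) pairs, in any order.
    floor:    hypothetical r below which a year counts as irregular.
    max_drop: hypothetical fall in r that counts as a sudden decline.
    """
    findings = []
    ordered = sorted(history)  # sort by year
    for (_, previous_r), (year, r) in zip(ordered, ordered[1:]):
        if r >= floor:
            if previous_r < floor:
                findings.append((year, "past irregularity cured"))
        elif previous_r - r >= max_drop:
            findings.append((year, "sudden decline"))
        elif previous_r < floor:
            findings.append((year, "irregularity persists"))
        else:
            findings.append((year, "new irregularity"))
    return findings

# Hypothetical four-year history for one classroom.
findings = review_trend([(2010, 0.95), (2011, 0.75), (2012, 0.74), (2013, 0.93)])
```

The resulting labels could feed the written report in Step 3 alongside the consistency measure itself.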
Oversight Program Steps
Step 4: Remedial steps for instances of particularly severe or repeated irregularities

Remedial Steps:
1. Meet with assessment liaison or trainer to review areas of needed improvement.
2. Assign to a professional development program on test administration.
3. Assign a monitor for the next assessment session.
4. Provide a substitute test administrator for the next test session.
5. Conduct an investigation leading to potential sanctions.

Summary of Oversight of Test Administration
A focus on measurement through standardized assessment

Oversight of Test Administration
1. Provide a comprehensive set of written directions for school administrators and teachers.
2. Conduct annual reviews of test response patterns.
3. Provide timely test administration reports to all involved.
4. Provide a series of steps to inform, train, and motivate test administrators to improve practices where necessary.
5. Provide sanctions for test administrators who fail to improve practices over multiple test administrations.
6. Use experience to improve directions, methods of analysis and interpretation, and methods of communication and training.
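One way to decide which cases count as "particularly severe" for Step 4 is the deck's own definition of significant misadministration: at least 10% of test items, SGA students times SGA items equal to at least 5% of all responses, and a probability below 1 in 100,000. The sketch below encodes those three thresholds. The function name and argument layout are hypothetical, and it assumes "all responses" means the group's students times the test's items.

```python
def is_significant(n_items, n_students, sga_items, sga_students, probability):
    """Apply the three-part definition of 'significant' misadministration.

    n_items, n_students:     size of the test and of the group reviewed.
    sga_items, sga_students: items and students identified by the
                             Subject Group Analysis (SGA).
    probability:             response pattern probability.
    Assumption: 'all responses' = n_students * n_items.
    """
    item_share = sga_items / n_items
    response_share = (sga_students * sga_items) / (n_students * n_items)
    return (
        item_share >= 0.10          # minimum 10% of test items
        and response_share >= 0.05  # minimum 5% of all responses
        and probability < 1e-5      # less than 1 in 100,000
    )

# Hypothetical case: 12 of 19 students, 10 of 50 items, P = 6.68e-15.
flagged = is_significant(50, 19, sga_items=10, sga_students=12, probability=6.68e-15)
```

Cases that pass all three tests in successive administrations would be natural candidates for the later remedial steps; the count of affected items used in the example is invented for illustration.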