Large-scale testing: Uses and abuses

Richard P. Phelps
Universidad Finis Terrae, Santiago, Chile
January 7, 2014

1. 3 types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected

1. Three types of large-scale tests

Achievement
Aptitude
Non-cognitive

Achievement tests

Historically, were larger versions of classroom tests
~ 1900 - “scientific” achievement tests developed (Germany & USA)
J.M. Rice - systematically analyzed test structures & effects
E.L. Thorndike - developed scoring scales
SOURCE: Phelps, Standardized Testing Primer, 2007

Achievement tests

Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)
Requires a mastery of content prior to test.
Fairness assumes that all have same opportunity to learn content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests

1890s – A. Binet & T. Simon (France)
  - Pre-school children with mental disabilities
  - achievement test not possible
  - developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)
1917 – Adapted by U.S. Army to select, assign soldiers in World War I
1930s – Harvard University president J. Conant
  - wanted new admission test to identify students from lower social classes with the potential to succeed at Harvard
  - developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests

Purpose: predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)
Content independent.
Measures:
… what student does with content provided
… how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either…
… not known in advance,
… basic, broad, commonly known by all, curriculum-free;
… less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007

Aptitude tests

Aptitude tests can identify:
  - Students bored in school who study what interests them on their own
  - Students not well adapted to high school, but well adapted to university
  - Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007

Comparing Achievement & Aptitude tests

              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much

Non-cognitive tests

More recently developed – measure values, attitudes, preferences
Types:
  - integrity tests
  - career exploration
  - matchmaking
  - employment “fit”

Non-cognitive tests

Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated? success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)

Comparing Achievement, Aptitude, & Non-Cognitive Tests

              Achievement        Aptitude              Non-Cognitive
Measure       past learning      potential             attitudes, values, preferences
Development   content analysis   job/skills analysis   surveys
Validation    retrospective      predictive            predictive
Content       dependent          independent           independent
Coachable?    very much          very little           can be faked

2. Measuring test quality

Test reports can be “data dumps”
3 measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences

Predictive validity (values from -1.0 to +1.0)

…measures how well higher scores on admission test match better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.
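
Predictive validity, as described above, is a correlation between test scores and a later outcome. A minimal sketch in Python of how such a coefficient could be computed, assuming the Pearson correlation is the statistic used. The function name pearson_r and the score and grade values are invented for illustration and do not come from the SAT, the PSU, or any real test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only:
# admission test scores and the same students' later first-year grades.
admission_scores = [450, 520, 580, 610, 670, 720]
first_year_grades = [4.1, 4.8, 4.5, 5.2, 5.0, 5.9]

validity = pearson_r(admission_scores, first_year_grades)
print(f"predictive validity (r) = {validity:.2f}")
```

A value near +1 would mean that higher admission scores go with better later outcomes; a value near 0 would mean the test says little about them.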
[Figure: A positive correlation between two measures. Source: NIST, Engineering Statistics Handbook]

[Figure: A negative correlation between two measures. Source: NIST, Engineering Statistics Handbook]

[Figure: No correlation between two measures. Source: NIST, Engineering Statistics Handbook]

How does one measure predictive capacity?

Correlation Coefficient: a scale running from -1 through 0 to +1

Predictive validities: SAT and PSU

[Bar chart: predictive validities (vertical axis 0 to 0.6) for SAT and PSU 2010, by subject: Language, Mathematics, SAT Writing, PSU Social Science]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013

Predictive validities: SAT and PSU (faculty: Administracion)

[Bar chart: predictive validities (vertical axis 0 to 0.6) for SAT and PSU Administracion, by subject: Language, Mathematics, SAT Writing, PSU Social Science]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013