Large-scale testing:
Uses and abuses
Richard P. Phelps
Universidad Finis Terrae, Santiago, Chile
January 7, 2014
Large-scale testing: Uses and abuses
1. Three types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected
1. Three types of large-scale tests
Achievement
Aptitude
Non-cognitive
Achievement tests
Historically, were larger versions of classroom tests
~ 1900 - “scientific” achievement tests developed
(Germany & USA)
J.M. Rice - systematically analyzed test structures & effects
E.L. Thorndike - developed scoring scales
SOURCE: Phelps, Standardized Testing Primer, 2007
Achievement tests
Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity
(correlation with past measures, such as high school grades)
Requires a mastery of content prior to test.
Fairness assumes that all test-takers have had the same opportunity to learn the content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
1890s – A. Binet & T. Simon (France)
- Pre-school children with mental disabilities
- achievement test not possible
- developed content-free test of mental abilities
(association, attention, memory, motor skills, reasoning)
1917 – Adapted by the U.S. Army to select and assign soldiers in World War I
1930s – Harvard University president J. Conant
- wanted a new admission test to identify students from lower social classes with the potential to succeed at Harvard
- developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Purpose: predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)
Content independent. Measures:
… what student does with content provided
… how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either…
… not known in advance,
… basic, broad, commonly known by all, curriculum-free;
… less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Aptitude tests can identify:
- Students bored in school who study what interests them on their own
- Students not well adapted to high school, but well adapted to university
- Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Comparing Achievement & Aptitude tests
              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much
Non-cognitive tests
More recently developed
– measure values, attitudes, preferences
Types:
integrity tests
career exploration
matchmaking
employment “fit”
Non-cognitive tests
Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated? success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)
Comparing Achievement, Aptitude, & Non-Cognitive Tests
              Achievement        Aptitude              Non-Cognitive
Measure       past learning      potential             attitudes, values, preferences
Development   content analysis   job/skills analysis   surveys
Validation    retrospective      predictive            predictive
Content       dependent          independent           independent
Coachable?    very much          very little           can be faked
2. Measuring test quality
Test reports can be “data dumps”
3 measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences
Predictive validity
(values from -1.0 to +1.0)
…measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.
A positive correlation between two measures
Source: NIST, Engineering Statistics Handbook
A negative correlation between two measures
Source: NIST, Engineering Statistics Handbook
No correlation between two measures
Source: NIST, Engineering Statistics Handbook
How does one measure predictive capacity?
Correlation Coefficient:
-1 -------------------- 0 -------------------- +1
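As a rough illustration of the correlation coefficient named above, the following is a minimal sketch in Python of how a predictive validity could be computed as a Pearson correlation between admission test scores and a later university outcome. The function name `pearson_r`, the variable names, and all numbers are hypothetical and chosen only for illustration; they are not data from the SAT, the PSU, or any report cited in these slides.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each measure
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one measure is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical, made-up data: admission scores and later first-year grades
admission_scores = [520, 580, 610, 650, 700, 740]
first_year_grades = [4.1, 4.8, 4.5, 5.2, 5.0, 5.9]

r = pearson_r(admission_scores, first_year_grades)
print(f"predictive validity (r) = {r:.2f}")
```

A result near +1 would mean that higher admission scores closely track better later outcomes, a result near 0 would mean the test says little about them, and a negative result would mean higher scores go with worse outcomes.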
Predictive validities: SAT and PSU
[Bar chart: predictive validity (axis from 0 to 0.6) for SAT and PSU 2010, by test: Language, Mathematics, SAT Writing, PSU Social Science]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities: SAT and PSU (faculty: Administracion)
[Bar chart: predictive validity (axis from 0 to 0.6) for SAT and PSU Administracion, by test: Language, Mathematics, SAT Writing, PSU Social Science]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013