TEST USES IN THE U.S.

THE DEVELOPMENT OF
COMPUTER BASED TESTING AND
COMPUTER ADAPTIVE TESTING IN
THE US: HISTORY, CHALLENGES,
AND SOLUTIONS
HARIHARAN SWAMINATHAN
UNIVERSITY OF CONNECTICUT
TESTING: A Brief History
1. Testing and humans have had a love-hate
relationship since the dawn of history. The first
“mastery test” is mentioned in the Old
Testament, where it was used to “classify” people into two
categories: the Ephraimites and the
Gileadites.
2. Civil service exams were used in China more
than 3000 years ago.
3. In the West, testing has had a mixed history –
its use waxed and waned.
4. In the US, Horace Mann argued for written
exams (objective type), and the first test was
introduced in Boston in 1845.
5. It was used for grade-to-grade promotion.
TESTING: A Brief History
6. This testing practice fell into disrepute
because of teaching to the test.
7. Grade promotion based on testing was
banned in Chicago in 1881.
8. Binet introduced mental testing in 1901
(became the Stanford Binet Test).
9. The issue of fairness that “everyone should
get the same test” was not relevant to him.
10. Binet rank ordered the items in order of
difficulty and targeted items to the child’s
ability.
INDIVIDUALIZED TESTING
• Adaptive testing was born.
• It was the primary mode of testing before the notion of
group testing was introduced.
• With the advent of group testing, individualized
(adaptive) testing went to the back burner.
• Individualized tests were impossible to administer to large groups.
• Group-based adaptive testing was not feasible,
until ...
TAILORED TESTING
• Fred Lord introduced Item Response
Theory in the early 1950s and, with it, the
notion of Tailored Testing.
• Without computers, tailored testing
was not feasible.
• To overcome this problem, Lord
developed “FLEXILEVEL TESTING”.
• A flexilevel test follows Binet’s idea:
only the difficulty level of the item is
used in routing.
FLEXILEVEL TESTING
• Flexilevel testing may even be
administered as a paper-and-pencil
test, as it was initially intended.
• The scoring algorithm is simple
enough that a flexilevel test is
self-scoring.
• Its simplicity was equated with a lack of
glamor, and it was dismissed as a mere
approximation to CAT.
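The routing rule can be sketched in a few lines. This is a minimal illustration, assuming the commonly described flexilevel procedure: an odd number of items sorted by difficulty, a start at the median item, a move to the next unused harder item after a correct answer and to the next unused easier item after an incorrect one, and a stop after (N + 1) / 2 items. The names `flexilevel` and `answer` are hypothetical, and the score here is simply the number correct.

```python
def flexilevel(difficulties, answer):
    """Administer a flexilevel test.

    difficulties: list of item difficulties (odd length), indexed by item id.
    answer: callable taking an item id and returning True if answered correctly.
    Returns (administered item ids in order, number correct).
    """
    n = len(difficulties)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")

    # Item ids ranked from easiest to hardest.
    order = sorted(range(n), key=lambda i: difficulties[i])

    mid = n // 2
    pos = mid            # start at the median-difficulty item
    harder = mid + 1     # next unused position above the median
    easier = mid - 1     # next unused position below the median

    administered = []
    n_correct = 0
    for _ in range((n + 1) // 2):
        item = order[pos]
        correct = bool(answer(item))
        administered.append(item)
        n_correct += int(correct)
        if correct:
            pos, harder = harder, harder + 1   # route upward
        else:
            pos, easier = easier, easier - 1   # route downward
    return administered, n_correct
```

Because only the difficulty ranking drives the routing, the same rule can be printed on an answer sheet, which is what makes a paper-and-pencil, self-scoring version possible.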
FLEXILEVEL TESTING
• Flexilevel testing languished,
unwanted and ignored by the
methodology-addicted psychometric
researchers.
• It is making a comeback in non-high-stakes
evaluation, medicine, and allied
health, where a full-blown CAT is not
required or not feasible.
• It has the potential for being used
innovatively for classroom
assessment and diagnostic purposes.
CAT
• Meanwhile, important technical
advances were being made in CAT
research.
• The Office of Naval Research, the
Army, and the Air Force funded
research for advancing CAT during the
70s.
• The name “Computerized Adaptive
Test” was coined by David Weiss
CAT
• David Weiss and his team at University
of Minnesota were funded for
developing operational procedures for
implementing CAT
• I was funded for the development of
Bayesian estimation procedures so
that item parameters could be estimated
more accurately.
• All these activities were motivated
by the large volume of test
takers in the armed forces.
CAT on a Hot Tin Roof:
Operationalizing CAT
• I was on the GRE Board of Directors
in the mid-1980s.
• The Board authorized and funded
research for implementing GRE-CAT.
• GRE-CAT became operational in the early
1990s. GMAT followed suit soon after.
Theory vs. Practice
• The first clash between theory and practice
occurred in the implementation of CAT.
• As PCs were not common, GRE had to
contract with a delivery system provider.
• “Seat-time” was the major obstacle.
• In theory, testing should continue until
the stopping criterion, a prescribed
standard error, is reached. Instead, the
time and test length were fixed.
• Examinees had to complete 80% of the items.
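The contrast between the theoretical stopping rule and the fixed-length compromise can be sketched as follows. This is an illustration only, assuming a Rasch (one-parameter) model so that the standard error is the inverse square root of the summed item information; the target of 0.3 and the function names are hypothetical, not values from the GRE program.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def standard_error(theta, difficulties):
    """SE of the ability estimate from the items administered so far."""
    info = 0.0
    for b in difficulties:
        p = rasch_prob(theta, b)
        info += p * (1.0 - p)          # Rasch item information
    return math.inf if info == 0 else 1.0 / math.sqrt(info)


def should_stop(theta, difficulties, target_se=0.3, max_items=None):
    """Variable-length rule: stop once the SE target is met.

    Passing max_items mimics the practical constraint of a fixed test
    length, which ends the test regardless of the precision reached.
    """
    if max_items is not None and len(difficulties) >= max_items:
        return True
    return standard_error(theta, difficulties) <= target_se
```

With `max_items` set, two examinees can finish with quite different standard errors, which is the price paid for predictable seat-time.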
Flexilevel Test and CAT
• A flexilevel test is a CAT, albeit with one
foot (the one-parameter model).
• It DOES need a good item bank.
• It is an approximation to a full-blown CAT.
• Many more items than in a full-fledged CAT
are required to obtain the same level of
precision.
• Nevertheless, with care a flexilevel test
can be made to function effectively.
Issues
• Item Bank: A large, well-maintained item
bank is needed. In developing item
banks, items from paper-and-pencil
administration should not be used
without careful investigation.
• Exposure Control: In high-stakes testing,
exposure control is critical.
• Content Specification and Balancing:
This is a critical issue and must be
addressed early on in the development of
the item bank and the item selection criteria.
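As one illustration of exposure control, the sketch below uses a "randomesque" rule: instead of always giving the single most informative item, it picks at random among the k most informative ones. This is one common technique, not necessarily the one used in any program mentioned here. The 2PL information function, the bank layout (item id mapped to an (a, b) pair), and the function names are assumptions made for the example.

```python
import math
import random


def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def randomesque_pick(theta, bank, administered, k=5, rng=random):
    """Choose the next item at random from the k most informative unused items.

    bank: dict mapping item id -> (a, b) parameters (hypothetical layout).
    administered: collection of item ids already given to this examinee.
    """
    available = [i for i in bank if i not in administered]
    if not available:
        raise ValueError("item bank exhausted")
    ranked = sorted(available,
                    key=lambda i: info_2pl(theta, *bank[i]),
                    reverse=True)
    return rng.choice(ranked[:k])
```

Setting `k=1` recovers pure maximum-information selection; larger values trade a little precision for lower exposure of the best items.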
Issues (cont’d)
• CAT algorithm: Unchecked, a CAT algorithm
greedily chooses the items that provide the
most information. Algorithms for
selecting items with content balancing must be
in place.
• Item parameter shift: Over time, item parameter
values will change because of instruction,
targeted instruction, and exposure of items.
Item parameters must be re-estimated and
items that show large drifts must be eliminated.
Procedures for detecting cheating in CAT have
been developed and are useful here.
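A minimal sketch of content-balanced selection follows. It first picks the content area that is furthest below its target share of the test so far, and only then takes the most informative unused item in that area. This is one simple balancing heuristic chosen for illustration; the bank layout (item id mapped to (a, b, area)), the `targets` dictionary, and the function names are assumptions.

```python
import math


def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def next_item(theta, bank, administered, targets):
    """Pick the most informative item from the most under-represented area.

    bank: dict item id -> (a, b, content_area) (hypothetical layout).
    administered: list of item ids already given.
    targets: dict content_area -> target proportion of the test.
    """
    n = len(administered)
    counts = {}
    for i in administered:
        area = bank[i][2]
        counts[area] = counts.get(area, 0) + 1

    available = [i for i in bank if i not in administered]
    open_areas = sorted({bank[i][2] for i in available})
    if not open_areas:
        raise ValueError("item bank exhausted")

    def deficit(area):
        # Target share minus the share administered so far.
        share = counts.get(area, 0) / n if n else 0.0
        return targets.get(area, 0.0) - share

    area = max(open_areas, key=deficit)
    candidates = [i for i in available if bank[i][2] == area]
    return max(candidates,
               key=lambda i: info_2pl(theta, bank[i][0], bank[i][1]))
```

The greedy step is still there, but it now operates inside a content constraint rather than over the whole bank.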
Issues (Cont’d)
• Item Bias (Differential Item Functioning):
Performance of subgroups on items must
be examined to determine whether subgroups
are performing differentially on items.
This is part of validity analysis and must
be carried out not only in the
development of the item bank but also
during operational administrations.
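One widely used screening statistic for differential item functioning is the Mantel-Haenszel common odds ratio, computed across strata of examinees matched on total score. The sketch below is an illustration of that statistic, not a description of the procedure used in any program discussed here; the record layout and function name are assumptions.

```python
import math
from collections import defaultdict


def mantel_haenszel(records):
    """Mantel-Haenszel DIF statistics for a single item.

    records: iterable of (group, total_score, correct), where group is
    "ref" or "focal" and correct is a bool (hypothetical layout).
    Returns (alpha, delta): the common odds ratio and the value
    -2.35 * ln(alpha) on the ETS delta scale.
    """
    # Per score stratum: [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        idx = (0 if correct else 1) + (0 if group == "ref" else 2)
        tables[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("odds ratio undefined for these data")

    alpha = num / den
    return alpha, -2.35 * math.log(alpha)
```

An alpha near 1 (delta near 0) suggests that matched examinees in the two groups perform similarly on the item; values far from that flag the item for review.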
MULTISTAGE TESTING
• Although CAT is efficient, constraints on
content balancing in item selection may
pose insurmountable problems.
• In these cases, MULTISTAGE testing is a
viable option, and is in use in some
large-scale testing programs.
• Instead of administering one item at a time,
a mini test (testlet) at varying levels of
difficulty is administered in stages.
[Figure: multistage design. A Medium testlet in the first stage, followed by Easy, Medium, and Hard testlets in each later stage.]
MULTISTAGE TESTING
• Content balancing is achieved elegantly.
• Each testlet has a sufficient number of
items for estimation of proficiency.
• Performs almost as well as CAT.
• We evaluated several designs for the
administration of a Russian language test
in the US and recommended a three-stage
testing scheme.
• Multistage testing has the potential for
national assessments.
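The stage-by-stage routing can be sketched as below. This is a simplified illustration: it routes on the proportion correct in the testlet just completed, with hypothetical cut points of 0.4 and 0.7, whereas an operational program might route on a cumulative score or a proficiency estimate. The panel layout and function names are assumptions.

```python
def route(num_correct, n_items, low=0.4, high=0.7):
    """Choose the difficulty level of the next testlet (hypothetical cuts)."""
    frac = num_correct / n_items
    if frac < low:
        return "easy"
    if frac >= high:
        return "hard"
    return "medium"


def run_multistage(panel, answer):
    """Administer a multistage test.

    panel: list of stages; each stage maps a level name to a list of
    item ids. The first stage needs only a "medium" testlet.
    answer: callable taking an item id and returning True if correct.
    Returns (path of levels taken, dict of item id -> response).
    """
    level = "medium"
    path, responses = [], {}
    for stage in panel:
        testlet = stage[level]
        for item in testlet:
            responses[item] = bool(answer(item))
        path.append(level)
        num_correct = sum(responses[i] for i in testlet)
        level = route(num_correct, len(testlet))
    return path, responses
```

Because each testlet is assembled in advance, its content coverage can be checked before administration, which is where the content-balancing advantage comes from.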
Growth Assessment: Vertical
Scale
• Growth assessment of individuals has
been mandated by states as well as the
federal government.
• To develop a vertical scale, items have to
be administered according to the
following scheme (as implemented in
Connecticut).
• Through this design, all items across
grades are linked.
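To show how items shared by adjacent grades tie the scales together, here is a minimal sketch of chained mean-mean linking under a Rasch model. It assumes each grade was calibrated separately, so that an item taken by two adjacent grades has two difficulty estimates; the slides do not say which linking or calibration method was used in Connecticut, so the method, the data layout, and the function names are all assumptions.

```python
from statistics import mean


def linking_constant(b_lower, b_upper):
    """Shift that moves the upper grade's scale onto the lower grade's.

    b_lower, b_upper: dicts of item id -> difficulty from the two
    separate calibrations. Only items present in both are used.
    """
    common = sorted(set(b_lower) & set(b_upper))
    if not common:
        raise ValueError("no common items between adjacent grades")
    return mean(b_lower[i] - b_upper[i] for i in common)


def chain_to_base(calibrations):
    """Cumulative shifts placing every grade on the lowest grade's scale.

    calibrations: list of per-grade dicts, lowest grade first.
    Adding shifts[k] to a value on grade k's scale expresses it on
    the base scale.
    """
    shifts = [0.0]
    for lower, upper in zip(calibrations, calibrations[1:]):
        shifts.append(shifts[-1] + linking_constant(lower, upper))
    return shifts
```

The chain only works if every pair of adjacent grades shares items, which is what the overlapping administration design provides.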
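Test Administration Design
(rows: STUDENTS in grades G3 to G8; columns: ITEMS for grades G3 to G8)
• Diagonal cells (OP blocks): OP33, OP44, OP55, OP66, OP77, OP88
• Adjacent off-diagonal cells (SU blocks pairing neighboring grades):
SU34, SU43, SU45, SU54, SU56, SU65, SU67, SU76, SU78, SU87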
[Figure: CUT SCORES FOR PROFICIENCY LEVELS - MATH. Scaled score (350 to 650) by grade (3 to 8), with cut scores for Basic, Proficient, Goal, and Advanced.]
[Figure: THETA DISTRIBUTION FOR MATHEMATICS]
Growth Assessment
• In growth assessment, we need the
growth rates of individuals as well as
subgroups.
• Scores over time are nested within
individuals who are in turn nested within
classrooms, schools, and districts.
• The statistical models must take this
nesting into account. The process is
complex but can be done.
• Use of growth for teacher evaluation
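One way to write the nesting down, offered as a sketch rather than as the model used in any particular state, is a three-level linear growth model. The notation is an assumption: Y_{tij} is the score at occasion t for student i in school j, and a_t is the time (or grade) of occasion t.

```latex
\begin{align*}
\text{Level 1 (occasions):}\quad
  & Y_{tij} = \pi_{0ij} + \pi_{1ij}\,a_{t} + e_{tij} \\
\text{Level 2 (students):}\quad
  & \pi_{0ij} = \beta_{00j} + r_{0ij}, \qquad
    \pi_{1ij} = \beta_{10j} + r_{1ij} \\
\text{Level 3 (schools):}\quad
  & \beta_{00j} = \gamma_{000} + u_{00j}, \qquad
    \beta_{10j} = \gamma_{100} + u_{10j}
\end{align*}
```

Here \pi_{1ij} is the growth rate of an individual student, \beta_{10j} is the average growth rate in school j, and \gamma_{100} is the overall average. Subgroup growth rates can be obtained by adding group indicators as predictors at the student level.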
National Assessments
• Growth assessment of individuals is not
important; we need the characteristics of
subpopulations.
• Proper coverage of the content domain is
critical. Matrix sampling of items is
necessary.
• CAT is being considered by NAEP; a
multistage approach may be better for
ensuring content coverage.
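Matrix sampling can be illustrated with a simple spiral design: the item pool is split into blocks, each booklet carries a few consecutive blocks, and booklets are handed out in rotation. This is a simplified sketch, not the design used by NAEP; the block labels and function names are hypothetical.

```python
def spiral_booklets(blocks, blocks_per_booklet=2):
    """Build one booklet per block, each holding consecutive blocks.

    With k blocks, booklet i holds blocks i, i+1, ... (wrapping around),
    so every block appears in the same number of booklets.
    """
    k = len(blocks)
    if not 0 < blocks_per_booklet <= k:
        raise ValueError("blocks_per_booklet must be between 1 and len(blocks)")
    return [[blocks[(i + j) % k] for j in range(blocks_per_booklet)]
            for i in range(k)]


def assign(students, booklets):
    """Hand out booklets to students in rotation."""
    return {s: booklets[n % len(booklets)] for n, s in enumerate(students)}
```

No student sees the whole content domain, yet across the sample every block is answered by roughly the same number of students, which is what subpopulation estimates require.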
Computer Based Testing
• CBT was developed as part of the Computer
Assisted Instruction movement in the
mid-1960s by Patrick Suppes.
• It is a linear test, like the paper-and-pencil (P&P) test.
• P&P test items and CBT items are not
equivalent. An easy P&P item may become
difficult in CBT, and vice versa.
• Our study in Connecticut showed that the
items behaved differently in the two
modes.
Computer Based Testing
• CBT has the advantage of allowing
innovative item types.
• The Science Test in Connecticut is being
developed as a CBT.
• NBME has used CBT innovatively in
testing.
• PIRLSe is using a CBT approach; PISA may
become a CBT soon.
• Standard procedures for scoring
(automated) and item analysis are usable.
New Research on CAT and CBT
• Use of polytomous items
• Use of free response items – automated
scoring of items
• Multidimensional item response models
for vertical scaling
• Item generation: Item cloning
• Classification rather than estimation. Item
selection is based on measures of
information (Shannon, Kullback).
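For classification testing, item selection can target the cut score rather than the examinee's current ability estimate. The sketch below uses Kullback-Leibler information between two ability points placed just above and just below the cut, under a Rasch model. The half-width `delta`, the bank layout (item id mapped to difficulty), and the function names are assumptions for illustration.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def kl_information(b, theta_hi, theta_lo):
    """Kullback-Leibler divergence between the item's response
    distributions at theta_hi and theta_lo."""
    p1 = rasch_prob(theta_hi, b)
    p0 = rasch_prob(theta_lo, b)
    return (p1 * math.log(p1 / p0)
            + (1.0 - p1) * math.log((1.0 - p1) / (1.0 - p0)))


def pick_for_classification(bank, administered, cut, delta=0.5):
    """Pick the unused item that best separates cut+delta from cut-delta.

    bank: dict item id -> Rasch difficulty (hypothetical layout).
    """
    available = [i for i in bank if i not in administered]
    if not available:
        raise ValueError("item bank exhausted")
    return max(available,
               key=lambda i: kl_information(bank[i], cut + delta, cut - delta))
```

Items whose difficulty sits near the cut score separate the two ability points best, so the rule concentrates measurement precision where the pass/fail decision is made.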
The Politics of testing
• Education and testing are closely related, and
both have occupied center stage in politics.
• Politicians in the US have smelled CAT in
the water and are circling to take a bite
• There have been debates about CAT item
administration and special interest
groups have weighed in for and against
CAT item selection algorithms.
• Issue of item release is a problem for CAT
banks
Politics of testing
• Transparency and honesty are critical to
convince the public who may not
understand the mathematics involved
• Testing must be above reproach.
• As Caesar said in divorcing Pompeia: it
is not enough to be beyond reproach. You
must also GIVE THE APPEARANCE OF
BEING BEYOND REPROACH.
Conclusion
• CAT, Multistage Testing, and Computer
Based Testing are playing major roles in
statewide and national assessments
• These assessments are designed for
assessing student growth at the
individual as well as the group level.
• We have solutions or near solutions for
most of the issues that face us in the
implementation of these testing designs,
and the research continues.