What can we learn from the application of computer-based assessment to the military?
Daniel O. Segall
Kathleen E. Moreno
Defense Manpower Data Center
Invited presentation at the conference on Computers and Their Impact on State Assessment:
Recent History and Predictions for the Future, University of Maryland, October 18–19, 2010
Views expressed here are those of the authors and not necessarily those of the DoD or U.S. Government.
Presentation Outline
 Provide some history of CBT research and operational use in the military.
 Talk about some lessons learned over the past three decades.
– Many of these lessons deal not only with computer-based testing but with computerized adaptive testing (CAT).
 End with some expectations about what lessons
are yet to be learned.
ASVAB History
 Armed Services Vocational Aptitude Battery
(ASVAB)
– Before 1976, each military Service administered its own battery.
– Starting in 1976, a single ASVAB was administered
to all Military applicants.
– Used to qualify applicants for entry into the military
and for select jobs within each Service.
– The ASVAB – Good predictor of training success.
ASVAB Compromise
 Early ASVAB (late 1970s) – Prone to compromise
and coaching
 On-demand scheduling at over 500 testing locations
 Cheating was suspected from both applicants and
recruiters
 Congressional hearings were held on the topic of
ASVAB compromise
 One proposed solution: Computerized adaptive testing
version of the ASVAB
– Physical loss of CAT items less likely than P&P test booklets
– Sharing information about the test items less profitable for
CAT than P&P
Initiation of CAT-ASVAB Research
 Marine Corps Exploratory Development Project
– 1977
 Research Questions
– First, could a suitable adaptive-testing delivery
system be developed?
– Second, would empirical data confirm the anticipated
benefits?
 Findings
– Data from recruits confirmed CAT’s increased
measurement efficiency
– Hardware suitability?
 Minicomputers slow and expensive
Joint Service CAT-ASVAB Project
 Initiated in 1979
 Provide additional information about CAT
 Anticipated benefits:
– Test Compromise
– Shorter tests
– Greater precision
– Flexible start/stop times
– Online calibration
– Standardized test administration (instructions/time-limits)
– Reduced scoring errors (from hand or scanner scoring)
– Possibility of administering new types of tests
Early CAT-ASVAB Development
 Early development (1979) divided into two
projects:
– Contractor delivery system development (hardware
and software to administer CAT-ASVAB)
 Commercially available hardware was inadequate for CAT-ASVAB
 There was competition among vendors to develop suitable
hardware
 Competition abandoned by mid-1980s because by then commercially available computers were suitable for CAT-ASVAB
– Psychometric development and evaluation of CAT-ASVAB
ASVAB Miscalibration
 A faulty equating of the first ASVAB in 1976 led to the enlistment of over 350,000 unqualified recruits over a five-year period.
 As a result, a congressionally mandated oversight committee was
commissioned: The Defense Advisory Committee on Military Personnel
Testing.
 A central focus of the committee and military test development was to
implement sound equating and score scaling methodologies.
 Random equivalent groups equating methodology was implemented for the
development of ASVAB forms and was used as the “gold standard” for all
future ASVAB equatings.
 This heightened sensitivity to score-scale and context effects formed the backdrop for the next three decades of computer-based test development.
– Mode of administration
– CAT to paper-equating
– Effects of different computer hardware on test scores
Experimental CAT-ASVAB System
 Developed 1979 – 1985
 The experimental CAT-ASVAB – Used to study adaptive testing algorithms and test development procedures
 Full battery CAT version of the P&P-ASVAB for experimental
use
 Development Efforts
– Psychometric development
– Item pool development
– Delivery system development
 Experimental system used Bayesian ability estimation, maximum-likelihood item selection, and a rudimentary exposure control algorithm
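Since Bayesian scoring is central to everything that follows, here is a minimal sketch of expected a posteriori (EAP) ability estimation under a 3PL model. The function names, quadrature grid, and standard-normal prior are illustrative assumptions, not the operational CAT-ASVAB code.

import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    # 3PL probability of a correct response (D = 1.7 scaling convention)
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def eap_estimate(responses, a, b, c, n_points=61):
    # Expected a posteriori ability over a standard-normal prior grid
    grid = np.linspace(-4.0, 4.0, n_points)
    posterior = np.exp(-0.5 * grid ** 2)          # unnormalized N(0, 1) prior
    for u, ai, bi, ci in zip(responses, a, b, c):
        p = p_3pl(grid, ai, bi, ci)
        posterior *= p if u == 1 else (1.0 - p)   # multiply in each item's likelihood
    posterior /= posterior.sum()
    return float(np.dot(grid, posterior))

# Example: three answered items (correct, incorrect, correct); parameters illustrative
theta_hat = eap_estimate([1, 0, 1], a=[1.2, 0.8, 1.5], b=[-0.5, 0.0, 0.7], c=[0.2, 0.2, 0.2])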
Joint-Service Validity Study
 Large Scale Validity Study: 1982-1984
 Sample
– Predictor and training success data
– N = 7,500 recruits training in one of 23 military jobs.
 Results showed that
– CAT-ASVAB and P&P-ASVAB predict training success
equally well.
– Equivalent validity could be obtained by CAT, which administered about 40 percent fewer items than its P&P counterpart.
 Strong support for the operational implementation of
CAT-ASVAB.
Operational CAT System Development
 1985 – Present
 Addressed a number of issues:
– Reliability and Construct Validity
– Item Pools
– Exposure Control
– Calibration Medium
– Item Selection
– Time Limits
– Penalty for Incomplete Tests
– Equating
– Hardware Effects
– Test Compromise
– Score Scale
– New Form Development
– Internet Testing
– Seeding Tryout Items
– Usability Considerations
– Software/Hardware Maintenance Issues
– Hardware Requirements
– Multi-Mode Testing Programs
Item Pool Development (1980s)
 CAT-ASVAB Forms 1 and 2: First two
operational forms
– The P&P reference form (8A) was used to form the
basis of the test specifications, but alterations were
made
 The adaptive pools – Wider range of item difficulties
– Pretest Items: About 3,600 items
– Items screened on the basis of small-sample IRT item
parameter estimates
– The surviving 2,118 items were administered to a
large applicant sample: N = 137,000
– Items were divided into two pools with about 100
items per subtest
Item Pool Features
 CAT item pools do not need to be extraordinarily large
to obtain adequate precision and security
 Exposure can be controlled by a combination of exposure control imposed during item selection and the use of multiple test forms built from multiple (distinct) item pools
 The use of multiple item pools (with examinees
assigned at random to the pools) is an effective way to
reduce item exposure rates and overlap among
examinees.
Exposure Control
 Experimental CAT-ASVAB system – Some items had
very high exposure rates
 5-4-3-2-1 strategy (Wetzel & McBride, 1985)
– Guards against remembering response sequences
– Does not guard against sharing strategy
 Sympson and Hetter
– Place an upper limit on the exposure rate of the most
informative items, and reduce the predictability of item
presentation
– Usage of items of moderate difficulty levels reduced; little or no usage restrictions for items of extreme difficulty or lesser discrimination
– Only small loss of precision when compared to optimal
unrestricted item selection
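As a sketch of the Sympson-Hetter runtime screen (the simulator-based derivation of the exposure parameters K is assumed done offline; the names here are illustrative, not the operational algorithm's):

import numpy as np

def sympson_hetter_administer(candidates_by_info, K, rng):
    # Walk candidates in descending-information order; administer item i
    # with probability K[i], otherwise set it aside for this examinee.
    for i in candidates_by_info:
        if rng.random() <= K[i]:
            return i
    return candidates_by_info[-1]  # fallback: administer the last candidate

# The K values come from an iterative simulation: after each round, items
# whose administration rate exceeds the target r get K[i] scaled down
# (roughly K[i] = r / P(selected_i)), and the simulation repeats until
# the maximum exposure rate is at most r.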
Calibration Medium
 Calibration Medium Concern
– Could data collected from paper-and-pencil booklets be used
to calibrate items that would be eventually administered in a
computerized adaptive testing format?
 Because CAT was not yet implemented, calibration of CAT
items on computers was not feasible.
 Some favorable results existed from other adaptive tests which
had relied on P&P calibrations
 A systematic treatment of this issue was conducted for the
development of the operational CAT-ASVAB forms using data
collected from 3,000 recruits
– Calibration medium has no practical impact on the
distributions or precision of adaptive test scores
Calibration Medium
 Reading Speed is a primary cause of medium effects
 Viewing/reading questions on a computer is generally slower than viewing/reading the same questions in a printed, paper-based format
 To the degree that tests are speeded (time-pressured), medium is likely to have a larger impact
 To the degree that tests are speeded, greater within-medium effects can also occur
 ASVAB approach: For power tests, reduce the time pressure by
extending the time limits
– Reducing time pressure for ASVAB power tests did not alter the construct
measured – Cross-correlation-check study
Item Selection Rules
 Based on maximum item information (contingent upon
passing an exposure control screen)
 Some consideration given to content balancing, but a
primary emphasis was given to measurement precision
 More recently, provisions have been made for item
enemies
 Maximizing precision was – and remains – a primary
emphasis of the CAT-ASVAB item selection algorithm
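A minimal sketch of maximum-information selection with an exposure screen follows; the 3PL information formula is standard, while the screen callback is a placeholder for whatever exposure control is in force.

import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    # Fisher information of a 3PL item at ability theta
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta_hat, a, b, c, used, passes_exposure):
    # Most informative unused item that survives the exposure screen
    info = info_3pl(theta_hat, np.asarray(a), np.asarray(b), np.asarray(c))
    for i in np.argsort(-info):          # descending information
        if i not in used and passes_exposure(i):
            return int(i)
    raise RuntimeError("no eligible items remain")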
Time Limits
 CAT-ASVAB Time Limits
– Administrative requirements
– Separate time limit for each adaptive power test
 IRT Model
– Standard IRT model does not model the effects of time
pressure on item responding
 Alternate Approaches for Specifying Time Limits
– Use the per-item time allowed on the P&P-ASVAB
– Use the distribution of completion times from an experimental version (which was untimed) and set the limits so that 95% of the group would finish
Specifying Time Limits
 Untimed Pilot Study
– Supported the use of the longer limits
 For reasoning tests, high-ability examinees took more time than low-ability examinees
 High-ability examinees would be most affected by shortened time limits, since they received more difficult questions, which required more time to answer
 Opposite of the relation between ability and test time observed in most traditional P&P tests
– In linear testing, low ability examinees generally take longer
than high ability examinees
Penalty for Incomplete Tests
 Penalty Procedure Required for Incomplete Tests
– Due to the implementation of time-limits
 Bayesian Estimates
– Biased in the direction of the population mean
– Bias stronger for shorter tests
 Compromise Strategy
– Below-average applicants could raise their scores by answering only the minimum number of items, letting the Bayesian estimate shrink toward the population mean
 Penalty Procedure
– Used to score incomplete adaptive tests
– Discourage potential compromise strategy
– Provides a final ability estimate equivalent to the expected score obtained by guessing at random on the unanswered items
Penalty for Incomplete Tests
 Simulations
– Used to determine penalty functions (for each subtest and
possible test length)
 Penalty Procedure Features
– Size of the penalty increases with the number of unfinished items
– Applicants who have answered the same number of items and
have the same provisional ability estimate will receive the
same penalty
– With this approach, test-takers should be indifferent about
whether to guess or to leave answers blank given that time has
nearly expired
 Generous Time Limits Implemented
– Permit over 98 percent of test-takers to complete
– Avoid disproportionately punishing high-ability test-takers
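A sketch of the penalty idea, scoring the unanswered items as if they had been guessed at random. This reuses eap_estimate from the earlier sketch, and Monte Carlo here stands in for whatever analytic or tabled penalty functions were actually derived by simulation.

import numpy as np

def penalized_score(responses, params_admin, params_remaining,
                    n_options=5, n_sim=2000, seed=0):
    # Expected final EAP if the remaining items had been answered by
    # guessing at random (chance correct rate = 1 / n_options)
    rng = np.random.default_rng(seed)
    a, b, c = (np.array(x) for x in zip(*(params_admin + params_remaining)))
    estimates = []
    for _ in range(n_sim):
        guesses = (rng.random(len(params_remaining)) < 1.0 / n_options).astype(int)
        u = list(responses) + list(guesses)
        estimates.append(eap_estimate(u, a, b, c))   # from the earlier EAP sketch
    return float(np.mean(estimates))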
Seeding Tryout Items
 CAT-ASVAB administers unscored tryout items
 Tryout items are administered as the 2nd, 3rd, or 4th
item in the adaptive sequence
 Item Position
– Randomly determined
 Advantages over Historical ASVAB Tryout Methods
– Test-takers are operationally motivated
– No booklet printing required
– No special data collection study required
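The seeding rule itself is simple; a sketch with illustrative names:

import random

def seed_tryout(item_sequence, tryout_item):
    # Insert one unscored tryout item as the 2nd, 3rd, or 4th item
    pos = random.randint(1, 3)  # 0-based slot 1..3 = positions 2..4
    return item_sequence[:pos] + [tryout_item] + item_sequence[pos:]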
Hardware Requirements
 Customized Hardware Platform – 1984
– Abandoned in favor of an off-the-shelf system
 The Hewlett Packard (HP) Integral Computer
– Selected as first operational system
– Superior portability (17 pounds)
– Large random access memory (1.5 megabytes)
– Fast CPU (8 MHz Motorola 68000)
– Advanced graphics display capability (9-inch monitor with electroluminescent display and resolution of 512 by 255 pixels)
– UNIX-based operating system
– Supported the C programming language
– Floppy diskette drive (no internal hard drive)
– Cost about $5,000 (in 1984 dollars)
 Lesson: Today’s computers can easily handle the item selection and scoring calculations required by CAT
 Challenge: Even though today’s computers are thousands of times more powerful, they are not proportionately cheaper than the computers of yesteryear
User Acceptance Testing
 Importance of User Acceptance Testing
– Software development is obviously important
– User Acceptance Testing is equally important
 Acceptance Testing versus Software Testing
– Software Testing – Typically performed by software
developers
– Acceptance Testing – Typically performed by those who are
most familiar with the system requirements
 CAT-ASVAB Development
– Time spent on acceptance testing exceeded that spent by
programmers developing and debugging code
Usability
 Computer Usage: 1975 – 1985
– Limited primarily to those with specialized interests
 Concerns
– Deficient computer experience would lower CAT-ASVAB
reliability and validity
– Although instructions had been tested on recruits, they had not been tested with applicants, many of whom scored in the lower ability ranges
– In addition, instructions had been revised extensively from the
experimental system
 Approach
– Test instructions on a broad representative group of test-takers
who had no prior exposure to the ASVAB
Usability
 Usability Study (1986)
– 231 military applicants and 73 high school students
 Issues Addressed
– Computer familiarity, instruction clarity, and attitudes towards CAT-ASVAB
 Method of Data Collection
– Questionnaire and Structured interviews
 Findings
– Test-takers felt very comfortable using the computer, exhibited positive attitudes towards CAT-ASVAB, and preferred a computerized test over P&P – regardless of their level of computer experience
– Test-takers strongly agreed that the instructions were easy to understand
– Negative outcome: Most test-takers wanted the ability to review and modify
previously answered questions
– Because of the requirements of the adaptive testing algorithm, this feature was not
implemented
 Lesson: Today, with a well-designed interface, variation in computer familiarity among (young adult) test-takers should not be an impediment to computer-based testing
Usability Lessons
 Stay in tune with the computer proficiency of the
test-takers
– Tailor instructions accordingly
 Do not give verbal instructions
– Keep all instructions on the computer
 Keep user interface simple and intuitive
Reliability and Construct Validity
 CAT Reliability and Validity
– Contents and quality of the item pool
– Item selection, scoring, and exposure algorithms
– Clarity of test instructions
 Item Response Theory
– Provides a basis for making theoretical predictions about these
psychometric properties
– However, most assumptions are violated, at least to some degree
 Empirical Test of Assumptions
– To test the validity of key model-based assumptions, an empirical
verification of CAT-ASVAB’s precision and construct equivalence with
the P&P-ASVAB was conducted
– If assumptions held true, then large amounts of predictive validity evidence accumulated on the P&P version would apply directly to CAT-ASVAB
– Construct equivalence would also support the exchangeability of CAT-ASVAB and P&P-ASVAB versions
Reliability and Construct Validity
 Study Design
– Two Random Equivalent Groups
– Group 1: (N = 1,033) received two P&P-ASVAB forms
– Group 2: (N = 1,057) received two CAT-ASVAB forms
– All participants received an operational P&P-ASVAB
 Analyses
– Alternative forms correlations used to estimate reliabilities
– Construct equivalence was evaluated from disattenuated correlations between CAT-ASVAB and operational P&P-ASVAB versions
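The disattenuation step is the classic correction for attenuation, dividing the observed cross-mode correlation by the square root of the product of the two alternate-forms reliabilities; a one-line sketch:

def disattenuated_r(r_xy, rel_x, rel_y):
    # Correction for attenuation: estimated correlation between true scores
    return r_xy / (rel_x * rel_y) ** 0.5

# e.g., disattenuated_r(0.80, 0.90, 0.88) -> about 0.90  (illustrative numbers)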
Reliability and Construct Validity
 Results – Reliability
– Seven of the ten CAT-ASVAB tests displayed significantly higher
reliability coefficients than their P&P-ASVAB counterparts
– Three other subtests displayed non-significant differences
 Results – Construct Validity
– All but one disattenuated correlation between CAT-ASVAB and P&PASVAB was equal to 1.0
– Coding Speed displayed a disattenuated correlation substantially less than
one (.86)
– However composites that contained this subtest had high disattenuated
correlations approaching 1.0
 Discussion
– Results confirmed the expectations based on theoretical IRT predictions
– CAT-ASVAB measured the same constructs as P&P-ASVAB with
equivalent or greater precision
Equating CAT and P&P Versions
 1980s – Equating viewed as a major psychometric hurdle to CAT-ASVAB implementation
 Scale Differences between CAT-ASVAB and P&P-ASVAB
– P&P-ASVAB used a number-correct score-scale
– CAT-ASVAB produces scores on the natural (IRT) ability
metric
– Equating must be done to place CAT-ASVAB scores on the
P&P-ASVAB scale
 Equating Objective
– Transform CAT-ASVAB scores so their distribution would match the P&P-ASVAB score distributions
– Transformation would allow scores on the two versions to be used interchangeably, without affecting applicant qualification rates
Equating Concerns
 Overall Qualification Rates
– Equipercentile equating procedure used to obtain the required transformations
– Distribution smoothing procedures
– Equivalence of composite distributions verified
– Distributions of composites were sufficiently similar across P&P-ASVAB and CAT-ASVAB
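A minimal sketch of the equipercentile idea (smoothing omitted): each CAT score maps to the P&P score with the same percentile rank.

import numpy as np

def equipercentile_transform(cat_scores, pp_scores):
    # Build a lookup: each sorted CAT score -> P&P score at the same
    # percentile rank; interpolate for scores in between.
    cat_sorted = np.sort(np.asarray(cat_scores, dtype=float))
    ranks = (np.arange(len(cat_sorted)) + 0.5) / len(cat_sorted)
    pp_at_ranks = np.quantile(np.asarray(pp_scores, dtype=float), ranks)
    return lambda x: float(np.interp(x, cat_sorted, pp_at_ranks))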
Equating Concerns
 Subgroup Differences
– Concern that subgroup members not be placed at a
disadvantage by CAT-ASVAB relative to P&P-ASVAB
– Existing subgroup differences might be magnified by
precision and dimensionality differences between CAT and
P&P versions
 Approach
– Apply the equating transformation (based on the entire group)
to subgroup members taking CAT-ASVAB
– Compare subgroup means across CAT and P&P versions
 Results
– No practically significant differences in qualification rates were found
Online Calibration and Equating
 All data can be collected seamlessly in an operational environment to both calibrate and equate new CAT-ASVAB forms
 This is in contrast to earlier form development
which required special data collections using
special populations
Hardware Effects Study
 Hardware Effects Concern (1990s)
– Differences among computer hardware could influence item functioning
– Speeded tests especially sensitive to small changes in test presentation format
 Dependent Measures
– Score-scale
– Precision
– Construct validity
 Sample
– Data were gathered from 3,062 subjects
– Each subject was randomly assigned to one of 13 conditions
 Hardware Dimensions
– Input device
– Color scheme
– Monitor type
– CPU speed
– Portability
 Results
– Adaptive power tests were robust to differences among computer hardware
– Speeded tests are likely to be affected by several hardware characteristics
Stakes by Medium Interaction
 Equating Study
– Equating of Desktop and Notebook computers
 Two Phases
– Recruits – Develop a provisional transformation;
random groups design with about 2,500 respondents
per form
– Applicants – Develop a final transformation from
applicants to provide operational scores. Sample size
for this second phase was about 10,000 per form
Stakes by Medium Interaction
 Comparison of Equatings Based on Recruits (non-operational motivation) and
Applicants (operational motivation)
 Differences in the CAT-P&P equating transformations were observed
 The difference was in a direction that suggested that in the first
(nonoperational) equating, CAT examinees were more motivated than P&P
examinees (possibly due to shorter test lengths or novel/interactive medium)
 It was hypothesized that the difference in motivation/fatigue between CAT and P&P groups was larger in the nonoperational recruit sample than in the operational applicant sample
 Findings suggested that the results of a cross-medium equating may differ
depending upon whether the respondents are motivated or unmotivated
 For future equatings, this problem was avoided by:
– Performing equatings in operationally motivated samples, or by
– Performing within-medium equatings when test-takers were nonoperationally motivated (and by using a chained transformation if necessary to link back to the desired cross-medium scale)
Test Compromise Concerns
 Sympson-Hetter algorithm assumes a particular known
ability distribution
 Usage rates might be higher for some items if the actual
ability distribution departs from the assumed
distribution
 Since CAT tests tend to be shorter than P&P tests, each
adaptively administered item might have a greater
impact on final score
 So preview of a fixed number of CAT items may result
in a larger score gain than preview of the same number
of P&P items
Test Compromise Simulations
 Simulation Study – Conditions
– Transmittal mechanism (sharing among friends or item banking)
– Correlation between the cheater and informant ability levels
– Method used by the informant to select items for disclosure
 Dependent Measure
– Score gain (mean gain for the group of cheaters over a group of non-cheaters for the same
fixed ability level)
 Results
– Score gains for CAT were larger than those for the corresponding P&P conditions
 Implications
– More stringent item exposure controls should be imposed on CAT-ASVAB items
– The introduction of a third item pool (where examinees are randomly assigned to one of three pools) provided score gains for CAT that were equivalent to or less than those observed for six forms of the P&P-ASVAB under all compromise strategies
– These results led to the decision to implement an additional CAT-ASVAB form
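The dependent measure reduces to a difference of means at fixed ability; a sketch, where simulate_cat is a placeholder for a full CAT simulator that answers previewed items correctly:

import numpy as np

def mean_score_gain(theta, previewed_items, simulate_cat, n_sim=1000, seed=0):
    # Mean score gain of cheaters (who know previewed_items) over
    # honest examinees of the same fixed ability
    rng = np.random.default_rng(seed)
    honest = [simulate_cat(theta, frozenset(), rng) for _ in range(n_sim)]
    cheat = [simulate_cat(theta, frozenset(previewed_items), rng) for _ in range(n_sim)]
    return float(np.mean(cheat) - np.mean(honest))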
Score Scale
 For testing programs that run two parallel modes of
administration (i.e., paper-based and CAT), equating and
measurement precision can be enhanced by scoring all tests
(including the paper-based test) by IRT methods
 IRT scoring of the paper-based tests provides distributions of test scores that more closely match their CAT counterparts (i.e., helps make them more normal)
 IRT scoring also reduces ceiling and floor effects of paper-based
number-right distributions which can attenuate the precision of
equated CAT-ASVAB scores
 An underlying theta (or natural ability scale) can facilitate
equating and new form development
New Form Development
 The implementation of CAT-ASVAB on a large-scale has enabled
considerable streamlining of new form development
 DoD has eliminated all special form-development data-collection studies by
replacing them with online calibration and equating
 According to this approach, new item data is collected by seeding tryout items
among operational items
 These data are used to estimate IRT item parameters
 These parameters are in turn used to construct future forms, and to estimate
provisional equating transformations
 These provisional (theoretical) equatings are then updated after they are used
operationally to test random equivalent groups
 Thus, the entire cycle of form development is seamlessly integrated into
operational test administrations
Internet Testing
 DoD Internet Testing
– Defense Language Proficiency Tests
– Defense Language Aptitude Battery
– Armed Forces Qualification Test
– CAT-ASVAB
 Implications for Cost-Benefits
 Implications for Software Development
– Desktop Lockdown
– Client Side: Unanticipated effects on test delivery from browser, operating
system, and security updates
– Server Side:
 Unanticipated effects of operating system updates on test delivery
 Interactions with other applications running on the same server
Internet Testing
 Internet testing can defray much of the cost of computer
based testing since the cost of computers and their
maintenance is shared or eliminated.
 Strict standardization of administration format (line breaks, resolution, etc.) in Internet testing is difficult (and sometimes impossible) to enforce.
 Lesson: With Internet testing you do not have to pay for
the computers’ purchase and maintenance, but you do
pay a price for the lack of control over the system.
Software/Hardware Maintenance Issues
 Generations of CAT-ASVAB Hardware/Software
– Apple III
– HP
– DOS
– Windows I
– Windows II
– Internet
 Early generations of hardware/software could be treated as static entities,
much like test-booklets
 Later Windows and Internet generations require treatment more like living
entities – They require continuous care and attention (security, operating
system, and software updates)
Multi-Mode Testing Programs
 When transitioning from paper-based testing to
computer-based testing, decide ahead of time if the two
mediums of administration will run in parallel for an
extended period, or if paper-based testing will be
phased out after a fixed period of time
 If the latter, make sure this is communicated and that there is strong policy support for the elimination of all paper-based testing
 There are different resourcing requirements and cost
drivers for dual and single mode testing programs
Future Lessons ?
 Intranet- versus Internet-based Testing
 Computer hardware effects on test scores
 How to test speeded abilities on unstandardized
hardware?
 Can emerging technologies (such as mobile
computing devices) provide additional or
different benefits (above and beyond computers)
for large-scale assessments?