THE DEVELOPMENT OF COMPUTER BASED TESTING AND COMPUTER ADAPTIVE TESTING IN THE US: HISTORY, CHALLENGES, AND SOLUTIONS
HARIHARAN SWAMINATHAN
UNIVERSITY OF CONNECTICUT

TESTING: A Brief History
1. Testing and humans have had a love-hate relationship since the dawn of history. The first “mastery test” is mentioned in the Old Testament, where it was used to “classify” people into two categories: the Ephraimites and the Gileadites.
2. Civil service exams were used in China more than 3000 years ago.
3. In the West, testing has had a mixed history; its use waxed and waned.
4. In the US, Horace Mann argued for written exams (objective type), and the first such test was introduced in Boston in 1845.
5. It was used for grade-to-grade promotion.
6. This testing practice fell into disrepute because of teaching to the test.
7. Grade promotion based on testing was banned in Chicago in 1881.
8. Binet introduced mental testing in 1905 (his test later became the Stanford-Binet test).
9. The issue of fairness, that “everyone should get the same test,” was not relevant to him.
10. Binet rank ordered the items by difficulty and targeted items to the child’s ability.

INDIVIDUALIZED TESTING
• Adaptive testing was born.
• It was the primary mode of testing before the notion of group testing was introduced.
• With the advent of group testing, individualized (adaptive) testing was put on the back burner.
• It was impossible to administer to large groups.
• Group-based adaptive testing was not feasible, until…

TAILORED TESTING
• Fred Lord introduced Item Response Theory in the early 1950s and, with it, the notion of Tailored Testing.
• Without computers, tailored testing was not feasible.
• To overcome this problem, Lord developed “FLEXILEVEL TESTING.”
• A flexilevel test follows Binet’s idea; only the difficulty level of the item is used in routing.

FLEXILEVEL TESTING
• Flexilevel testing may even be administered as a paper-and-pencil test, as was initially intended.
• The scoring algorithm is simple enough that a flexilevel test is self-scoring.
• Its simplicity was equated with a lack of glamor, and it was dismissed as a mere approximation to CAT.
• Flexilevel testing languished, unwanted and ignored by the methodology-addicted psychometric researchers.
• It is making a comeback in non-high-stakes evaluation, medicine, and allied health, where a full-blown CAT is not required or not feasible.
• It has the potential to be used innovatively for classroom assessment and diagnostic purposes.

CAT
• Meanwhile, important technical advances were being made in CAT research.
• The Office of Naval Research, the Army, and the Air Force funded research to advance CAT during the 1970s.
• The name “Computerized Adaptive Test” was coined by David Weiss.
• David Weiss and his team at the University of Minnesota were funded to develop operational procedures for implementing CAT.
• I was funded to develop Bayesian estimation procedures so that item parameters could be estimated more accurately.
• All these activities were motivated by the large volume of test takers in the armed forces.

CAT on a Hot Tin Roof: Operationalizing CAT
• I was on the Board of Directors of GRE in the mid-1980s.
• The Board authorized and funded research for implementing GRE-CAT.
• GRE-CAT was operational in the early 1990s. GMAT followed suit soon after.

Theory vs. Practice
• The first clash between theory and practice occurred in the implementation of CAT.
• As PCs were not common, GRE had to contract with a delivery system provider.
• “Seat time” was the major obstacle.
• In theory, testing should continue until the stopping criterion, a prescribed standard error, is reached.
Instead, the time and the test length were fixed.
• Examinees had to complete 80% of the items.

Flexilevel Test and CAT
• A flexilevel test is a CAT, albeit with one foot (the one-parameter model).
• It DOES need a good item bank.
• It is an approximation to a full-blown CAT.
• Many more items than in a full-fledged CAT are required to obtain the same level of precision.
• Nevertheless, with care a flexilevel test can be made to function effectively.

Issues
• Item Bank: A large, well-maintained item bank is needed. In developing item banks, items from paper-and-pencil administrations should not be used without careful investigation.
• Exposure Control: In high-stakes testing, exposure control is critical.
• Content Specification and Balancing: This is a critical issue and must be addressed early in the development of the item bank and the item selection criteria.
• CAT Algorithm: Unchecked, a CAT algorithm greedily chooses the items that provide the most information. Algorithms for selecting items with content balancing must be in place.
• Item Parameter Shift: Over time, item parameter values will change because of instruction, targeted instruction, and exposure of items. Item parameters must be re-estimated, and items that show large drifts must be eliminated. Procedures for detecting cheating in CAT have been developed and are useful here.
• Item Bias (Differential Item Functioning): Performance of subgroups on items must be examined to determine whether subgroups are performing differentially on items. This is part of validity analysis and must be carried out not only during the development of the item bank but also during operational administrations.

MULTISTAGE TESTING
• Although CAT is efficient, content-balancing constraints on item selection may pose insurmountable problems.
• In these cases, MULTISTAGE testing is a viable option, and it is in use in some large-scale testing programs.
• Instead of administering one item at a time, a mini test (testlet) at one of several difficulty levels is administered at each stage.
[Figure: multistage testing diagram; testlets labeled Easy, Medium, and Hard arranged in stages]
• Content balancing is achieved elegantly.
• Each testlet has a sufficient number of items for estimation of proficiency.
• It performs almost as well as CAT.
• We evaluated several designs for the administration of a Russian language test in the US and recommended a three-stage testing scheme.
• Multistage testing has potential for national assessments.

Growth Assessment: Vertical Scale
• Growth assessment of individuals has been mandated by states as well as the federal government.
• To develop a vertical scale, items have to be administered according to the following scheme (as implemented in Connecticut).
• Through this design, all items across grades are linked.

Test Administration Design (rows: students by grade; columns: items by grade)
STUDENTS   G3     G4     G5     G6     G7     G8
G3         OP33   SU34   -      -      -      -
G4         SU43   OP44   SU45   -      -      -
G5         -      SU54   OP55   SU56   -      -
G6         -      -      SU65   OP66   SU67   -
G7         -      -      -      SU76   OP77   SU78
G8         -      -      -      -      SU87   OP88

[Figure: Cut scores for proficiency levels, Math. Scaled score (350 to 650) by grade (3 to 8); levels: Basic, Proficient, Goal, Advanced]
[Figure: Theta distribution for mathematics]

Growth Assessment
• In growth assessment we need the growth rates of individuals as well as of subgroups.
• Scores over time are nested within individuals, who are in turn nested within classrooms, schools, and districts.
• The statistical models must take this nesting into account. The process is complex but can be done.
• Use of growth for teacher evaluation

National Assessments
• Growth assessment of individuals is not important; we need the characteristics of subpopulations.
• Proper coverage of the content domain is critical. Matrix sampling of items is necessary.
• CAT is being considered by NAEP; a multistage approach may be better for ensuring content coverage.

Computer Based Testing
• It was developed as part of the Computer Assisted Instruction movement in the mid-1960s by Patrick Suppes.
• It is a linear test, like the paper-and-pencil (P&P) test.
• P&P and CBT items are not equivalent. An easy P&P item may become difficult in CBT, and vice versa.
• Our study in Connecticut showed that the items behaved differently in the two modes.
• CBT has the advantage of allowing innovative item types.
• The Science Test in Connecticut is being developed as a CBT.
• CBT has been used innovatively in testing by NBME.
• PIRLSe is using the CBT approach; PISA may become a CBT soon.
• Standard procedures for (automated) scoring and item analysis are usable.

New Research on CAT and CBT
• Use of polytomous items
• Use of free-response items, with automated scoring of items
• Multidimensional item response models for vertical scaling
• Item generation: item cloning
• Classification rather than estimation. Item selection is based on measures of information (Shannon, Kullback).

The Politics of Testing
• Education and testing are closely related and have occupied center stage in politics.
• Politicians in the US have smelled CAT in the water and are circling to take a bite.
• There have been debates about CAT item administration, and special interest groups have weighed in for and against CAT item selection algorithms.
• Item release is a problem for CAT banks.
• Transparency and honesty are critical for convincing the public, who may not understand the mathematics involved.
• Testing must be above reproach.
• As Caesar said in divorcing Pompeia, it is not enough to be beyond reproach; you must also GIVE THE APPEARANCE OF BEING BEYOND REPROACH.

Conclusion
• CAT, Multistage Testing, and Computer Based Testing are playing major roles in statewide and national assessments.
• These assessments are designed to assess student growth at the individual as well as the group level.
• We have solutions or near solutions for most of the issues that face us in the implementation of these testing designs, and the research continues.
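
Appendix sketch: the CAT loop
The slides on Theory vs. Practice and on Issues describe a CAT that greedily picks the most informative item, stops at a prescribed standard error in theory, and was capped at a fixed test length in practice. The Python sketch below illustrates that loop under stated assumptions. It assumes the one-parameter (Rasch) model, a standard normal prior, and a grid-based posterior mean as the ability estimate. The names `run_cat`, `select_item`, `se_target`, and `max_items` are hypothetical and are not taken from GRE-CAT or any operational system. Content balancing and exposure control are deliberately left out.

```python
import math

def p_correct(theta, b):
    """Rasch (one-parameter) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: P * (1 - P)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def select_item(theta, bank, used):
    """Greedy rule: the unused item with the most information at theta."""
    candidates = [i for i in range(len(bank)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, bank[i]))

def estimate_theta(administered, bank):
    """Posterior mean and SD of theta on a grid from -4 to 4.

    administered is a list of (item_index, score) pairs, score 0 or 1.
    Assumption for this sketch: a standard normal prior on theta.
    """
    grid = [g / 10.0 for g in range(-40, 41)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior
        for i, u in administered:
            p = p_correct(t, bank[i])
            w *= p if u == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=20):
    """Administer items until the SE target or the length cap is reached.

    bank is a list of item difficulties; answer(i) returns 0 or 1 for item i.
    """
    administered = []
    used = set()
    theta, se = estimate_theta(administered, bank)
    while (se > se_target and len(administered) < max_items
           and len(used) < len(bank)):
        i = select_item(theta, bank, used)
        used.add(i)
        administered.append((i, answer(i)))
        theta, se = estimate_theta(administered, bank)
    return theta, se, administered
```

Under the Rasch model the information P(1 - P) peaks when the item difficulty equals the current ability estimate, so this rule targets items to the examinee much as Binet did. With a small bank, the `max_items` cap rather than `se_target` may be what ends the test, which is the theory-practice tension noted above.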