FUTURE OF ASSESSMENT IN INDIA: CHALLENGES AND SOLUTIONS
HARIHARAN SWAMINATHAN, UNIVERSITY OF CONNECTICUT

Testing: A Brief History
1. Testing is part of human nature and has a long history. In the Old Testament, "mastery testing" was used to "classify" people into two groups:
• The Gileadites defeated the Ephraimites, and when the Ephraimites tried to escape, they were given a high-stakes test;
• The test was to pronounce the word "shibboleth". Failure had a drastic consequence.
2. Civil service exams were used in China more than 3000 years ago.

Testing: A Brief History
3. Testing in India may go even further back in time.
4. In the West, testing has a mixed history – its use has waxed and waned.
5. In the US, Horace Mann argued for written (objective-type) examinations, and the first was introduced in Boston in 1845.
6. Tests were used for grade-to-grade promotion.
7. This testing practice fell into disrepute because of teaching to the test.

Testing: A Brief History
8. Grade promotion based on testing was abolished in Chicago in 1881.
9. Binet introduced mental testing in 1901 (it later became the Stanford-Binet Test).
10. The issue of fairness, that "everyone should get the same test", was not relevant to Binet.
11. Binet rank ordered the items by difficulty and targeted items to the child's ability.
12. The first "adaptive test" was born.

Individualized Testing
• Adaptive testing was the primary mode of testing before the notion of group testing was introduced.
• With the advent of group testing, individualized (adaptive) testing went to the back burner, primarily because it was impossible to administer to large groups.
• Group-based adaptive testing was not feasible, until……
• We will return to adaptive testing later.

Testing in India
• Testing in India has a long history.
• Knowledge/skill testing was common in ancient India:
• Rama was subjected to competency testing by Sugriva;
• Yudhisthira was tested with a 133-item, high-stakes test (the Yaksha Prashna);
• According to the Bhagavatam, the name Parikshit (Examiner) was given to the successor of Yudhisthira.
• Testing in the form of puzzles and games was often used for entertainment.

Testing in India
India is a country of superlatives:
• India can boast of probably the longest tradition of education, stretching over at least two and a half millennia.
• Taxila is considered the oldest seat of learning in the world.
• Nalanda is perhaps the oldest university in the world.
• This tradition of learning and the value placed on education continue in India.

Testing in India
Population distribution in India (with the USA for comparison):
Country   Age Range           Number
India     0-4                 128 million
India     5-9                 128 million
India     10-15               128 million
India     0-15                384 million
USA       0-15                 32 million
USA       Total population    352 million

Testing in India
• These numbers are expected to decline slightly over the next twenty years but will nevertheless far exceed those of any other country in the world.
• The tradition of learning, the value placed on education, and the population explosion have created considerable stress on the Indian education system.
• It is not surprising that assessment and testing procedures in India have focused on selection and certification.

Testing in India
• Some of the selection examinations in India are perhaps the most grueling and most selective in the world.
-- IAS Examination: of the 450,000 candidates, about 1,200 are selected (0.3%).
-- IIT Joint Entrance Examination: of the 500,000 applicants, only about 10,000 are selected for admission (2%).
• According to C.N.R. Rao, the scientific advisor to the previous prime minister, "India has an examination system but not an education system."
• Another criticism levelled against the examination system is that it promotes intensive coaching with little regard for a properly grounded knowledge base.

Testing in India
• Needless to say, these criticisms are not unlike the criticisms levelled against testing in the US.
• However, intensive testing in schools in the US is now being directed more towards assessment of learning and growth for accountability purposes than towards assessment of student achievement for purposes of certification and promotion.
• Although assessment and testing play an important role in Indian education, assessment practices in India do not seem to have kept pace with modern approaches and trends in testing and assessment.

Test Uses in the U.S.
1. Management of Instruction
2. Placement and Counseling
3. Selection
4. Licensure and Certification
5. Accountability

Test Uses in the U.S.
1. Management of Instruction
• Classroom and standardized tests for daily management of instruction (formative and diagnostic evaluation)
• Classroom and standardized tests for grading (summative evaluation)
2. Placement and Counseling
• Standardized tests for transition from one level of school to another or from school to work

Test Uses in the U.S.
3. Selection (Entry Decisions)
• Standardized achievement and aptitude tests for admission to college, graduate school, and special programs
4. Licensure and Certification
• Standardized tests for determining qualification for entry into a profession
• School graduation requirements

Test Uses in the U.S.
5. Accountability
• Standardized tests to show satisfactory achievement or growth of students in schools receiving public money, often required by state and federal legislation. If students in schools do not show adequate progress, sanctions are imposed on the schools and on the state.

Theoretical Framework for Tests
• For all these purposes we need to design tests that measure the student's "ability" or "proficiency".
• The use of ability/proficiency test scores must be appropriate for the intended use.
• Tests are measurement instruments, and when we use them to measure, as with all measurement devices, we make measurement errors.

Theoretical Framework for Tests
• The objective of measurement is to measure what we want to measure appropriately and to do so with minimum error.
• The construction of tests and the determination of an examinee's proficiency/ability scores are carried out within one of two theoretical frameworks:
• Classical Test Theory
• Modern Test Theory, or Item Response Theory (IRT)

Classical Test Theory Model
X = T + E
(Observed Score = True Score + Error)
σX² = σT² + σE²
where σX² is the observed score variance, σT² the true score variance, and σE² the error variance.

Indices Used In Traditional Test Construction
• Item and Test Indices: item difficulty, item discrimination, test score reliability, Standard Error of Measurement
• Examinee Indices: test score

Classical Item Indices: Item Difficulty
• Item difficulty: the proportion of examinees answering the item correctly.
• It is an index of how difficult an item is.
• If the value is low, the item is very difficult; only a few examinees will respond correctly to it.
• If the value is high, the item is easy; many examinees will respond correctly to it.
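As a concrete illustration of the difficulty index just described, the short Python sketch below computes p-values from a matrix of scored item responses. The data and variable names are illustrative assumptions, not material from the talk.

import numpy as np

# Rows are examinees, columns are items; 1 = correct, 0 = incorrect (illustrative data).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# Classical item difficulty: proportion of examinees answering each item correctly.
p_values = responses.mean(axis=0)
print("Item difficulty (p-values):", p_values)   # low = hard item, high = easy item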
Classical Item Indices: Item Discrimination
• Item discrimination: the correlation between item score and total score.
• A value close to 1 indicates that examinees with high scores (ability) answer this item correctly, while examinees with low ability respond incorrectly.
• A low value implies that there is hardly any relationship between ability and how examinees respond to this item.
• Such items are not very useful for separating high-ability examinees from low-ability examinees.
• Items with high values of discrimination are very useful for ADAPTIVE TESTING.

Standard Error Of Measurement 𝝈𝑬
• Indicates the amount of error to be expected in the test scores.
• Arguably, the most important quantity.
• Depends on the scale, and its magnitude is therefore difficult to assess.
• Can be re-expressed in terms of reliability, which varies between 0 and 1.

Test Score Reliability
• The reliability index, 𝝆, is defined as the correlation between observed scores on "parallel" tests.
• It takes on a value between 0 and 1, with 0 denoting totally unreliable test scores and 1 perfectly reliable test scores.
• It is related to the Standard Error of Measurement according to the expression 𝝈𝑬 = 𝝈𝑿 √(𝟏 − 𝝆).
• If the test scores are perfectly reliable, 𝝆 = 1 and 𝝈𝑬 = 0.

Reliability
• Error in scores is due to factors such as testing conditions, fatigue, guessing, the emotional or physical condition of the student, etc.
• Different types of reliability coefficients reflect different interpretations of error.

Reliability
1. Reliability refers to the consistency of test scores
• over time
• over different sets of test items
2. Reliability refers to test results, not the test itself.
3. A test can have more than one type of reliability coefficient.
4. Reliability is necessary but not sufficient for validity.

Shortcomings of the Indices Based on Classical Test Theory
• They are group DEPENDENT, i.e., they change as the groups change.
• Reliability, and hence the Standard Error of Measurement, is defined in terms of parallel tests, which are almost impossible to realize in practice.

And what's wrong with that?
• We cannot compare item characteristics for items whose indices were computed on different groups of examinees.
• We cannot compare the test scores of individuals who have taken different sets of test items.

Wouldn't it be nice if ….?
• Our item indices did not depend on the characteristics of the individuals from whom the item data were obtained.
• Our examinee measures did not depend on the characteristics of the items that were administered.

ITEM RESPONSE THEORY solves the problem!*
* Certain conditions apply. Individual results may vary. IRT is not for everyone, including those with small samples. Side effects include nausea, drowsiness, and difficulty swallowing. If symptoms persist, consult a psychometrician. For more information, see Hambleton and Swaminathan (1985) and Hambleton, Swaminathan, and Rogers (1991).
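Before turning to IRT, here is a minimal Python sketch of the classical indices above: discrimination as an item-total correlation, reliability estimated here by coefficient alpha (one common internal-consistency coefficient, used in place of the parallel-forms correlation), and the Standard Error of Measurement via 𝝈𝑬 = 𝝈𝑿 √(𝟏 − 𝝆). The data are fabricated for illustration; operational item analyses use far larger samples.

import numpy as np

responses = np.array([      # rows = examinees, columns = items (0/1 scores); illustrative only
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
])
total = responses.sum(axis=1)                      # examinee test scores

# Item discrimination: correlation between each item score and the total score.
discrimination = np.array(
    [np.corrcoef(responses[:, j], total)[0, 1] for j in range(responses.shape[1])]
)

# Coefficient alpha, one internal-consistency estimate of reliability.
k = responses.shape[1]
alpha = (k / (k - 1)) * (1 - responses.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# Standard Error of Measurement: sigma_E = sigma_X * sqrt(1 - rho).
sem = total.std(ddof=1) * np.sqrt(1 - alpha)

print("discrimination:", discrimination.round(2))
print("alpha:", round(alpha, 2), " SEM:", round(sem, 2))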
Item Response Theory
• Based on the postulate that the probability of a correct response to an item depends on the ability of the examinee and the characteristics of the item.

The Item Response Model
• The mathematical relationship between the probability of a response, the ability of the examinee, and the characteristics of the item is specified by the ITEM RESPONSE MODEL.

Item Characteristics
• An item may be characterized by its DIFFICULTY level (usually denoted by b), its DISCRIMINATION level (usually denoted by a), and its "PSEUDO-CHANCE" level (usually denoted by c).

Item Response Model
[Figure: item characteristic curves, plotting the probability of a correct response against theta (proficiency) from -5 to 5. The curves are built up over three slides for items with a = 0.5, b = -0.5, c = 0.0; a = 2.0, b = 0.0, c = 0.25; and a = 0.8, b = 1.5, c = 0.1.]

IRT Item Difficulty b
• Differs from classical item difficulty.
• b is the θ value at which the probability of a correct response is .5.
• The harder the item, the higher the b.
• b is on the same scale as θ and does not depend on the characteristics of the group of test takers.

IRT Item Discrimination a
• Differs from classical item discrimination.
• a is proportional to the slope of the item characteristic curve (ICC) at θ = b.
• The slope indicates how much the probability of a correct response changes for individuals with slightly different θ values, i.e., how well the item discriminates between them.

IRT Item Guessing Parameter
• No analog in classical test theory.
• c is the probability that an examinee with very low θ will answer the item correctly.
• b is now the θ value at which the probability of a correct response is (1 + c)/2.

Item Response Models
• The One-Parameter Model (Rasch Model):
P(correct response) = e^(θ − b) / (1 + e^(θ − b))
• The Two-Parameter Model:
P(correct response) = e^(a(θ − b)) / (1 + e^(a(θ − b)))
• The Three-Parameter Model:
P(correct response) = c + (1 − c) e^(a(θ − b)) / (1 + e^(a(θ − b)))

How Is IRT Used In Practice?
• Test construction
• Equating of test forms
• Vertical scaling (for growth assessment)
• Detection of differential item functioning
• Adaptive testing

Test Construction
• Traditional approach: select items with p-values in the .2–.8 range and as highly discriminating as possible.
• We cannot, however, design a test that has the required reliability, SEM, and score distribution.
• We cannot design a test with pre-specified characteristics.

Test Construction (cont.)
• IRT approach: INFORMATION FUNCTIONS.
• The information provided by the test about an examinee with given ability 𝜽 is directly related to the Standard Error of Measurement.
• We CAN assemble a test that has the characteristics we want, something that is impossible to accomplish in a classical framework.

Test Construction (cont.)
• The TEST INFORMATION FUNCTION specifies the information provided by the test across the θ range.
• Test information is the sum of the information provided by each item.
• Because of this property, we can combine items to obtain a pre-specified test information function.
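To make the model and the information functions concrete, the Python sketch below evaluates the three-parameter model and the standard 3PL item information formula for the three illustrative items shown in the figures, and sums them into a test information function. The use of 1/√(test information) as the standard error of the ability estimate is a standard IRT result; the grid and parameter values are illustrative assumptions.

import numpy as np

def p_correct(theta, a, b, c):
    # Three-parameter logistic model: c + (1 - c) e^(a(theta - b)) / (1 + e^(a(theta - b))).
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    # Standard 3PL item information: a^2 * ((P - c)/(1 - c))^2 * (1 - P)/P.
    p = p_correct(theta, a, b, c)
    return a**2 * ((p - c) / (1 - c))**2 * (1 - p) / p

theta = np.linspace(-5, 5, 201)
items = [(0.5, -0.5, 0.0), (2.0, 0.0, 0.25), (0.8, 1.5, 0.1)]   # (a, b, c) from the figures

# Test information is the sum of the item information functions.
test_info = sum(item_information(theta, a, b, c) for a, b, c in items)

# The standard error of the ability estimate at theta is 1 / sqrt(test information).
se_theta = 1.0 / np.sqrt(test_info)
print("Maximum test information:", round(float(test_info.max()), 2))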
Test Construction (cont.)
• Items can be selected to maximize information in desired θ regions, depending on the test purpose.
• Tests can be constructed of minimal length to keep standard errors below a specified maximum.
• By selecting items that have optimal properties, we can create a shorter test that has the same degree of precision as a longer test.

Item Information Functions
• Bell-shaped.
• The peak is at or near the difficulty value b: the item provides the greatest information at θ values near b.
• The height depends on discrimination; more discriminating items provide greater information over a narrow range around b.
• Items with low c provide the most information.

Test And Item Information Functions
[Figure: item information functions and the resulting test information function plotted against theta from -2.5 to 2.5.]

And what do we get for all this?
• The proficiency level (ability) of an examinee is not tied to the specific items we administer.
• We CAN compare the ability scores of examinees who have taken different sets of test items.
• We can therefore match items to an examinee's ability level and measure ability more precisely with shorter tests.

And what do we get for all this?
• We can create a bank of items by administering different items to different groups of examinees at different times.
• This allows us to administer comparable tests or individually tailored tests to examinees.
• By administering different items to different individuals or groups, we can improve test security and minimize cheating.

And what do we get for all this?
• We can ensure the fairness of tests by making sure the test and the test items function in the same way across different groups.
• For assessment of learning, we can give different SHORT tests made up of items that measure the entire domain of skills; otherwise such coverage would require a very long and unmanageable test.

Equating
• The purpose of equating is to place scores from one form of a test on the scale of another.
• The goal of equating is for scores to be exchangeable; it should not matter to examinees which form of the test they take.
• True equating is not strictly possible using traditional procedures.

IRT "Equating"
• Under IRT, equating is not necessary.
• But we do have to undo the scaling: when we calibrate a test, i.e., estimate the ability scores and item parameters, we commonly "standardize" the ability scores, and this scaling must be undone to place the parameters on a common scale.
• IRT "equating" is simply rescaling.
• We will describe the scaling procedure in detail later.

Differential Item Functioning (DIF)
• When we design a test to measure a construct of interest, that test should not measure something else that is irrelevant.
• For example, if we are measuring reading comprehension, we should not include items that have mathematical content (this would be silly unless that skill is relevant and required).

Differential Item Functioning (DIF)
• Similarly, we should not include items that have a heavy reading component if our purpose is to measure the mathematical ability of the student.
• Such items will favor one group over another.
• Construct-irrelevant items adversely affect validity.

Differential Item Functioning (DIF)
• IRT provides a natural framework for defining and assessing DIF.
Definition: An item shows DIF if two examinees at the same ability level but from different groups do not have the same probability of answering the item correctly.

Differential Item Functioning (DIF)
• Detecting and eliminating such items is carried out routinely in all testing programs.
• It is a critical part of the validation process, in determining whether construct-irrelevant variables pose threats to the intended uses of the test.
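One widely used screening procedure consistent with the definition above is the Mantel-Haenszel method, which matches examinees on total score and compares the odds of a correct response across groups. The Python sketch below is illustrative only: the method is not named in the talk, the data are fabricated, and operational DIF analyses use large samples and typically supplement this with IRT-based comparisons.

import numpy as np

def mantel_haenszel_odds_ratio(item, total, group):
    # item:  0/1 scores on the studied item
    # total: total test scores, used to match examinees of comparable ability
    # group: 0 = reference group, 1 = focal group
    # A common odds ratio near 1 suggests no DIF; values far from 1 suggest the item
    # favors one group over the other at comparable ability levels.
    num, den = 0.0, 0.0
    for score in np.unique(total):
        s = total == score
        a = np.sum((group[s] == 0) & (item[s] == 1))   # reference, correct
        b = np.sum((group[s] == 0) & (item[s] == 0))   # reference, incorrect
        c = np.sum((group[s] == 1) & (item[s] == 1))   # focal, correct
        d = np.sum((group[s] == 1) & (item[s] == 0))   # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")

# Fabricated example: 8 examinees, one studied item, two score strata.
item  = np.array([1, 0, 1, 0, 1, 0, 1, 0])
total = np.array([5, 5, 4, 4, 5, 5, 4, 4])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print("MH odds ratio:", mantel_haenszel_odds_ratio(item, total, group))   # 1.0 here: no evidence of DIF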
Tailored Testing
• Fred Lord introduced Item Response Theory in the early 1950s and, with it, the notion of Tailored Testing.
• Without computers, tailored testing was not feasible.
• To overcome this problem, Lord developed "FLEXILEVEL TESTING".
• The flexilevel test follows Binet's idea: only the difficulty level of the item is used in routing.

Computerized Adaptive Testing (CAT)
• Adaptive testing is the process of tailoring the test items administered to the best current estimate of an examinee's trait level.
• Items are most informative when their difficulty is close to the examinee's θ value.
• Different examinees take different tests.
• Only through IRT can items be appropriately selected, trait values estimated after each item is administered, and the resulting test scores compared.

Advantages Of CAT
• Testing time can be shortened.
• Examinees' trait values can be estimated with a desired degree of precision.
• Scoring and reporting can be immediate.
• Scoring errors and loss of data are reduced.
• Test security is preserved (in theory).
• Paper use is eliminated.
• The need for supervision is reduced.

Adaptive Testing
• Items are pre-calibrated and stored in a bank.
• The item bank should be large and varied in difficulty.
• The examinee is administered one or more items of moderate difficulty to obtain an initial trait estimate.

Computerized Adaptive Testing (CAT)
• Once an initial θ estimate is obtained, items are selected for administration based on their information functions – the most informative item at the current estimate of θ is selected, taking into account content considerations.
• The trait estimate is updated after each item response.
• Testing is terminated after a fixed number of items or when the standard error of the estimate reaches a desired level.

Computerized Adaptive Testing (CAT)
• The Office of Naval Research, the Army, and the Air Force funded research for advancing CAT during the 1970s.
• David Weiss and his team at the University of Minnesota were funded to develop operational procedures for implementing CAT.
• I was funded to develop Bayesian estimation procedures so that item parameters could be estimated more accurately.
• All these activities were motivated by the large volume of test takers in the armed forces.

Operationalizing CAT
• I was on the Board of Directors of the GRE in the mid-1980s.
• The Board authorized and funded research for implementing GRE-CAT.
• GRE-CAT was operational in the early 1990s. GMAT followed suit soon after.

Theory v. Practice
• The first clash between theory and practice occurred in the implementation of CAT.
• As personal computers were not common, the GRE had to contract with a delivery system provider.
• "Seat time" was the major obstacle.
• In theory, testing should continue until the stopping criterion, a prescribed standard error, is reached. Instead, the testing time, and hence the test length, had to be fixed.
• Examinees had to complete 80% of the items.
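The adaptive loop described in the CAT slides (select the most informative item at the current θ estimate, update the estimate after each response, and stop when the standard error is small enough or a maximum test length is reached) might be sketched as follows. This is a simplified Python simulation under assumed conditions: a fabricated two-parameter item bank, EAP scoring with a standard normal prior, and no content balancing or exposure control. It is not a description of any operational system.

import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, a, b):
    # Two-parameter logistic model.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(a, b, responses, grid=np.linspace(-4, 4, 81)):
    # Expected a posteriori estimate of theta and its posterior SD (standard normal prior).
    posterior = np.exp(-0.5 * grid**2)
    for a_i, b_i, u in zip(a, b, responses):
        p = p_correct(grid, a_i, b_i)
        posterior = posterior * p**u * (1 - p)**(1 - u)
    posterior = posterior / posterior.sum()
    mean = np.sum(grid * posterior)
    sd = np.sqrt(np.sum((grid - mean)**2 * posterior))
    return mean, sd

# Fabricated pre-calibrated item bank of 200 items with (a, b) parameters.
bank_a = rng.uniform(0.8, 2.0, 200)
bank_b = rng.uniform(-3.0, 3.0, 200)
true_theta = 1.0                         # simulated examinee

administered, responses = [], []
theta_hat, se = 0.0, 1.0                 # provisional estimate before any items are given
while se > 0.30 and len(administered) < 30:
    p_at_est = p_correct(theta_hat, bank_a, bank_b)
    info = bank_a**2 * p_at_est * (1 - p_at_est)        # 2PL item information at theta_hat
    info[administered] = -np.inf                        # do not reuse administered items
    item = int(np.argmax(info))                         # most informative remaining item
    u = int(rng.random() < p_correct(true_theta, bank_a[item], bank_b[item]))  # simulated response
    administered.append(item)
    responses.append(u)
    theta_hat, se = eap_estimate(bank_a[administered], bank_b[administered], responses)

print(f"items used: {len(administered)}, theta estimate: {theta_hat:.2f} (SE {se:.2f})")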
Issues
• Item Bank: A large item bank is needed and must be maintained well. In developing item banks, items from paper-and-pencil administrations should not be used without careful investigation.
• Exposure Control: In high-stakes testing, exposure control is critical.
• Content Specification and Balancing: This is a critical issue and must be addressed early on in the development of the item bank and the item selection criteria.

Issues (cont'd)
• CAT Algorithm: Unchecked, a CAT algorithm will be greedy and choose the items with the highest information. Algorithms for selecting items with content balancing must be in place.
• Item Parameter Shift: Over time, item parameter values will change because of instruction, targeted instruction, and exposure of items. Item parameters must be re-estimated, and items that show large drift must be eliminated.

Issues (cont'd)
• Item Bias (Differential Item Functioning): Performance on the items must be examined to determine whether subgroups are performing differentially on them. This analysis must be carried out not only in the development of the item bank but also during operational administrations.

VALIDITY
"The concept of validity refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences."
Standards for Educational and Psychological Testing, AERA/APA/NCME

VALIDITY
-- refers to the appropriateness of interpretations of test scores
-- is not a property of the test itself
-- is a matter of degree
-- is specific to a particular use of the test

VALIDITY is assessed using evidence from three categories:
-- content-related
-- criterion-related
-- construct-related

Validity of Ancient Tests
• Was Sugriva's test of Rama's skills valid for the intended use?
• Sugriva could have given Rama a general knowledge test to see if Rama knew of Vali's strength and skill in battle.

Validity of Ancient Tests
• Sugriva could have given Rama a math test to determine if Rama could compute the angle at which to shoot the arrow to reach the target.
• Sugriva could have given Rama a physics test to determine if he could calculate the force necessary to penetrate the giant trees.

Validity of Ancient Tests
• But then anyone with a degree of general knowledge and a knowledge of mathematics and physics could have answered the question without even knowing how to use a bow and arrow.
• Sugriva chose to give Rama an "authentic" test to determine his ability.
• The test was clearly appropriate for the use and therefore valid.

Validity of Ancient Tests
• The God of Death, Yama, could have given Yudhisthira a general knowledge test.
• But that would not have served his purpose. The God Yama designed a test that was valid for its intended purpose.

Innovative Item Formats
• Sugriva and the God Yama have shown how to choose the right item format to conduct their examinations.
• With the aid of computers, we can design tests with innovative item formats.
• These item formats permit authentic assessments of the skills we intend to measure.
• These formats are currently being used in credentialing examinations.

In closing….
I have provided a very general overview of:
• How tests are used in the US
• How testing is being viewed in the US
• The classical and the modern IRT frameworks that underpin the development of measurements
In closing….
I have provided a very general overview of:
• The advantages IRT offers over the classical framework, i.e., how IRT
-- provides item characteristics that are invariant over subgroups
-- provides proficiency scores that are not dependent on the sets of items administered
-- enables the construction of tests that have a pre-specified standard error of measurement and reliability

In closing….
… and how IRT
-- enables the determination of proficiency with the desired accuracy at critical points on the proficiency continuum
-- enables the delivery of individualized and targeted tests for the efficient determination of proficiency scores

In closing….
• Are any of these innovations and approaches to testing relevant or applicable in India?
• Assessment is a tradition in India.
• By necessity, assessment has been employed primarily for selection.
• These assessments, while necessary, can be streamlined, shortened, and targeted through CAT.
• Advances in technology have made innovative item formats possible. Higher-order skills and creativity can be assessed through these innovative item formats.

In closing….
• India can most certainly benefit from the advances made in assessment.
• The weak link in the Indian assessment system is that not much attention seems to have been paid to the issue of validity.
• The current system of examinations has had some negative side effects. It has promoted extensive coaching and cheating, two factors that can lead to inequity in education and stifle creativity.

In closing….
• Assessment can play a critical role in the education system and lead to improvements in learning.
• Cognitive Diagnostic Assessment is receiving considerable attention in the US.
• Cognitive diagnostic assessment can be applied to identify misconceptions, especially in mathematics and science, and, through feedback to students, to enhance student learning and instructional techniques.

India Has Got Talent!
• India has the talent and manpower to lead the way in the use of technology in instruction and assessment.
• We have world-class experts, such as Professor Dhande, on the panel here to provide leadership in this area.
• Indian students, if given a fair chance, can become leading world-class scholars, a fact evidenced by the achievements of Indian students who have gone abroad to seek education.
• India can most certainly benefit from the advances made in assessment. By judiciously applying assessments, it can transform itself from a country of examination systems to a country of education systems.