Using Rasch Analysis to Develop an Extended Matching Question (EMQ) Item Bank for Undergraduate Medical Education
Mike Horton, Bipin Bhakta, Alan Tennant

Medical student training

Miller (1990) identified that no single assessment method can provide all the data required for judging anything as complex as the delivery of professional services by a competent physician:
• Knowledge
• Skills
• Attitudes

[Figure: Miller's pyramid of competence, from top to bottom]
• Does: case presentations, log books, direct observation of clinical activity
• Shows how: OSCE
• Knows how: EMQ, FRQ, Essays
• Knows: MCQ

What are assessments used for?

Primary aim
• To identify the student who is deemed to be safe and has achieved the minimal acceptable standard of competence
Secondary aim
• To identify students who excel

Written Assessment

• "True or false" questions
• "Single best option" multiple choice questions
• Multiple true/false questions
• "Short answer" open-ended questions
• Essays
• "Key feature" questions
• Extended matching questions
• Free response questions

Free response questions

Broadly, free-response questions are commonly believed to test important higher-order reasoning skills. Validity is high, as examinees have to generate their own responses rather than selecting from a list of options.
However…
• Only a narrow range of subject matter can be assessed in a given amount of time
• They are administratively resource-intensive
• Due to their nature, their reliability is limited

Multiple choice questions

Multiple choice questions have been widely used and are popular because:
• They generally have high reliability
• They can test a wide range of themes in a relatively short period of time
However…
• They only assess knowledge of isolated facts
• By giving an option list, examinees are cued to respond and the active generation of knowledge is avoided

What are Extended Matching Questions (EMQs)?

• EMQs are used as part of the undergraduate medical course assessment programme
• EMQs are used to assess factual knowledge, clinical decision making and problem solving
• They are a variant of multiple choice questions (MCQs)
• EMQs are made up of 4 components: a theme, an option list, a lead-in instruction, and a set of case vignettes

Example of Extended Matching Question (EMQ) format (taken from Schuwirth & van der Vleuten, 2003)

Theme: Micro-organisms

Answer options:
A. Campylobacter jejuni          I. Proteus mirabilis
B. Candida albicans              J. Pseudomonas aeruginosa
C. Clostridium difficile         K. Rotavirus
D. Clostridium perfringens       L. Salmonella typhi
E. Escherichia coli              M. Shigella flexneri
F. Giardia lamblia               N. Tropheryma whippelii
G. Helicobacter pylori           O. Vibrio cholerae
H. Mycobacterium tuberculosis    P. Yersinia enterocolitica

Instructions: For each of the following cases, select (from the list above) the micro-organism most likely to be responsible. Each option may be used once, more than once, or not at all.

1. A 48-year-old man with a chronic complaint of dyspepsia suddenly develops severe abdominal pain. On physical examination there is general tenderness to palpation, with rigidity and rebound tenderness. Abdominal radiography shows free air under the diaphragm.

2. A 45-year-old woman is treated with antibiotics for recurring respiratory tract infections. She develops severe abdominal pain with haemorrhagic diarrhoea. Endoscopically, a pseudomembranous colitis is seen.
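To make the scoring concrete: for the Rasch analysis described later, each case within an EMQ theme is marked dichotomously (correct/incorrect). The Python sketch below shows one way such an item might be represented and scored. The class and field names are our own illustration, not part of any actual exam system, and the keying of the example case is our reading of the vignette (pseudomembranous colitis after antibiotics pointing to option C, Clostridium difficile).

```python
from dataclasses import dataclass

@dataclass
class EMQCase:
    vignette: str   # clinical scenario presented to the student
    key: str        # letter of the correct answer option, e.g. "C"

@dataclass
class EMQTheme:
    theme: str                # e.g. "Micro-organisms"
    options: dict[str, str]   # option letter -> option text (A..P above)
    cases: list[EMQCase]

def score_case(case: EMQCase, response: str) -> int:
    """Dichotomous scoring: 1 if the selected option matches the key, else 0.
    Each option may be used once, more than once or not at all, so every
    case is scored independently of the other cases in the theme."""
    return int(response.strip().upper() == case.key)

colitis = EMQCase(
    vignette="45-year-old woman ... pseudomembranous colitis is seen.",
    key="C",  # Clostridium difficile (our keying of the example case)
)
print(score_case(colitis, "c"))  # -> 1
print(score_case(colitis, "G"))  # -> 0
```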
Item Pools

Currently, many institutions formulate their tests from year to year by selecting items from a pre-existing pool of questions.
• Questions are pre-existing
• Time and resources are saved by employing this method

However… it has been widely recognised that if tests are made up of items from a pre-existing item pool, the relative difficulty of the exam paper will vary from year to year [McHarg et al (2005), Muijtjens et al (1998), McManus et al (2005)].

If the questions have been set, used and assessed using traditional approaches, this provides a certain amount of information about each individual question; however, there are also drawbacks to the traditional approach. It has been recognised [Downing (2003)] that Classical Test Theory (CTT) is limited in that its estimates are sample dependent. Thus, the comparability of examination results from year to year will be confounded by the overall difficulty of the exam and the relative ability of the examinees, rendering a comparison invalid. This is particularly troublesome when we wish to:
• compare different cohorts of students
• maintain a consistent level of assessment difficulty over subsequent administrations

The problem

What is the best way to ensure that all passing students are reaching the required level of expertise?

Two forms of pass mark selection:
• Criterion referenced: a specific pass mark is designated prior to the exam as the pass/fail point
• Norm referenced: a specified proportion of the sample is designated to pass the exam (e.g. the highest-scoring 75% of students will pass)

Norm referencing or criterion testing?

Norm-referenced testing
• Whatever the ability of the students taking the test, a fixed proportion of them will pass/fail
• The standard needed to pass the test is not known in advance
• The validity of norm referencing relies on a homogeneous sample, which may not necessarily be the case
• There is also the risk that, with a less able group of students, a student could pass the exam without reaching the acceptable level of clinical knowledge
• Norm referencing is therefore not appropriate

Criterion testing
• The relative difficulty of the exam could change depending on the items it contains, so a predetermined pass mark could be easier or harder to attain depending upon the items in the test
• Nevertheless, it has been recognised [Wass et al (2001)] that criterion-referenced tests are the only acceptable means of assessing whether a pre-defined clinical competency has been reached

Solution?

It has been suggested [Muijtjens et al (1998)] that a criterion-referenced test could be utilised if the test were constructed by selecting items from a bank of items of known difficulty, which would enable measurement and control of the test difficulty. But difficulty estimates, as defined by classical test theory, are sample dependent!
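That sample dependence is easy to demonstrate. The following Python simulation (our illustration, not part of the study) has two cohorts of differing mean ability answer the same item, with responses generated under the dichotomous Rasch model: the classical difficulty index (the proportion answering correctly) moves with the cohort even though the item itself is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def classical_difficulty(item_delta: float, cohort_mean: float,
                         n_students: int = 500) -> float:
    """Classical item 'difficulty' (proportion correct) for one cohort.

    Responses are generated under the dichotomous Rasch model:
        P(correct) = exp(theta - delta) / (1 + exp(theta - delta))
    """
    theta = rng.normal(cohort_mean, 1.0, n_students)     # student abilities
    p = 1.0 / (1.0 + np.exp(-(theta - item_delta)))      # Rasch P(correct)
    return (rng.random(n_students) < p).mean()

delta = 0.5  # the item's fixed Rasch difficulty, in logits

# The same item looks harder to a weaker cohort and easier to a stronger
# one: the classical index reflects the sample, not just the item.
print(classical_difficulty(delta, cohort_mean=0.0))
print(classical_difficulty(delta, cohort_mean=1.0))
```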
Item Banking

Item banking is a process whereby all EMQ items that have been used over the past few years are "banked" and calibrated onto the same metric scale:
• Previously used EMQ items
• Psychometrically calibrated using Rasch analysis
• Data are linked by common items that have been used between the exams: "common item equating"

Rasch Analysis

When data fit the Rasch model, the EMQ difficulty estimates generalise beyond the specific conditions under which they were observed ("specific objectivity"). In other words, item difficulties are not sample dependent, as they are in Classical Test Theory.

[Figure: items administered across Terms 1 to 4. In the first diagram, Items 1 to 4 each appear in a single term only; these items cannot be directly compared as there are no common links. In the second diagram, a fifth item appears in every term; the items can now be directly compared via this common link item across all terms.]

Following calibration, the set of items within the bank will define the common domain that they are measuring (in this case, medical knowledge). These items therefore provide an operational definition of this unobservable (latent) trait. Once all EMQ items are calibrated onto the same scale, it becomes possible to compare the performance of students across terms, despite the EMQ exam content being made up of different items in each term.

Sufficient Linkage?

What counts as sufficient linkage between item sets? There has been some variation in the literature. Three differing viewpoints suggest that:
• linking items should number the larger of 20 items or 20% of the total number of items [Angoff (1971)]
• 5 to 10 items are sufficient to form the common link [Wright & Bell (1984)]
• a single common item can provide a valid link when co-calibrating datasets [Smith (1992)]
However, it has also been suggested [Smith (1992)] that the larger the number of common items across datasets, the greater the precision and stability of the item bank. The Rasch model and a simplified sketch of the linking principle follow.
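Both the calibration and the linking rest on the dichotomous Rasch model, which expresses the probability that person $n$ answers item $i$ correctly purely through the difference between person ability $\theta_n$ and item difficulty $\delta_i$ (both in logits):

$$P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}$$

Because persons and items enter only through this difference, comparisons between items do not depend on which sample of persons responded: this is the "specific objectivity" noted above. The linking principle itself can be sketched in a few lines of Python. Note that the study calibrated all forms concurrently in RUMM2020; the mean-shift linking below, with invented item names and difficulty values, is only a simplified illustration of how common items tie two separately calibrated forms onto one metric.

```python
# Simplified common-item ("mean shift") linking: place Form B's item
# difficulties on Form A's metric using the shared (anchor) items.
# Item names and logit values are invented for illustration.

form_a = {"Q1": -1.20, "Q2": 0.30, "Q3": 0.85, "Q4": 1.10}   # logits
form_b = {"Q3": 1.05, "Q4": 1.40, "Q5": -0.60, "Q6": 0.20}   # logits

common = sorted(set(form_a) & set(form_b))
if not common:
    raise ValueError("No common items: the forms cannot be directly compared")

# Average offset of Form B's scale relative to Form A's over the anchors.
shift = sum(form_a[q] - form_b[q] for q in common) / len(common)

form_b_on_a = {q: d + shift for q, d in form_b.items()}
print(f"anchor items: {common}, shift = {shift:+.3f} logits")
print(form_b_on_a)
```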
Potential Problems?

• Limited linkage: data overlap is reduced
• Potential misfit or DIF on link items

[Figure: sparse data matrix, items Q1 to Q25 (columns) by Terms 1 to 8 (rows), showing which items were administered in each term.]

Sample

• Data were collected from 550 fourth-year medical students over 8 terms
• All EMQ data were read into a single analysis to establish the item bank
• RUMM2020 Rasch analysis software was used
• Over the 8 terms, 6 different test forms were used (the test remained the same over the first 3 terms)

EMQ Item Bank

[Table: the full EMQ item bank, listing each item's theme code and question code and the exam forms (1 to 6) in which it appeared.]

• Approximately 25% of the items were changed from exam form to exam form
• This provides good linkage

[Table: summary fit statistics for the calibrated item bank; overall fit "pretty good", Person Separation Index low.]

The Person Separation Index is fairly low, but we would expect this to a certain degree, given the highly focused sample.

Problematic Items

Approximately 12.5% (26/205) of individual items were found to display some form of misfit or DIF.

Misfit & DIF

What does misfit tell us? Misfitting items are flagged for non-inclusion and will either be amended or removed.
DIF (Differential Item Functioning) could be due to:
• Curriculum changes
• Teaching rotations
• Testwiseness

Does exam difficulty remain equal over different exam forms?

Example across the item set, using Exam Form 1 as the baseline.

Measurement and the pass mark

[Figure: the latent trait of increasing student ability, with a "minimum standard" cut point, mapped against the "real" assessment: the 0 to 100 EMQ exam score and its 60% pass mark. Is the 60% mark an equivalent standard on every form?]

Equivalent standard?

• Exam Form 1 was based on 98 items
• The pass mark was set at 60%
• 60% of 98 = a raw-score pass mark of 58.8
• A raw score of 58.8 on Exam Form 1 = 0.561 logits

Equating problems

We had to remove 2 extreme items (Exam Form 1 was out of 98 anyway).

Exam Form | Maximum obtainable score
1 | 98
2 | 99
3 | 100
4 | 100
5 | 100
6 | 99

A sketch of how a raw-score pass mark is carried through the logit scale onto other forms is shown below.
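The conversion from a raw score to a logit and back works through the test characteristic curve (TCC): at a given ability, the expected raw score on a form is the sum of the Rasch success probabilities over its items. The Python sketch below illustrates the idea; the item difficulties are randomly generated stand-ins, not the actual calibrated values, and RUMM2020 performs the real computation.

```python
import numpy as np

def expected_score(theta: float, deltas: np.ndarray) -> float:
    """Test characteristic curve: expected raw score at ability theta,
    i.e. the sum of Rasch success probabilities over the form's items."""
    return float(np.sum(1.0 / (1.0 + np.exp(-(theta - deltas)))))

def theta_for_score(target: float, deltas: np.ndarray,
                    lo: float = -6.0, hi: float = 6.0) -> float:
    """Invert the TCC by bisection: the ability at which the expected
    raw score equals the target (the TCC is monotonic in theta)."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, deltas) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rng = np.random.default_rng(1)
form1 = rng.normal(0.0, 1.0, 98)    # stand-in difficulties, Form 1 (98 items)
form5 = rng.normal(0.2, 1.0, 100)   # stand-in difficulties, Form 5 (100 items)

# 60% pass mark on Form 1 -> common logit standard -> raw score on Form 5.
standard = theta_for_score(0.60 * 98, form1)
print(f"pass standard: {standard:.3f} logits")
print(f"equivalent Form 5 raw mark: {expected_score(standard, form5):.1f}")
```

Because each form's TCC differs, the same 0.561-logit standard maps to a different raw score on each form, which is exactly what the table below shows.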
Equivalent standard?

The 0.561-logit standard, converted back to a raw score on each exam form:

Exam Form | Equated score at 0.561 logits | Out of | Percentage
1 | 58.7 | 98 | 59.9%
2 | 57.7 | 99 | 58.3%
3 | 58.2 | 100 | 58.2%
4 | 59.6 | 100 | 59.6%
5 | 62.4 | 100 | 62.4%
6 | 59.9 | 99 | 60.5%

The equivalent pass mark therefore varies across forms by 62.4% - 58.2% = 4.2 percentage points.

Conclusion

• Item banking is a way of assessing the psychometric properties of EMQs that have been administered over different test forms
• It can identify poor questions, which can then be adapted
• It can provide a comparative analysis of relative test-form difficulty
• Should the pass mark be amended every term?

References

1. Miller GE. The assessment of clinical skills/competence/performance. Academic Medicine 1990; 65: S63-7.
2. McHarg J, Bradley P, Chamberlain S, Ricketts C, Searle J, McLachlan JC. Assessment of progress tests. Medical Education 2005; 39: 221-227.
3. Muijtjens AMM, Hoogenboom RJI, Verwijnen GM, Van der Vleuten CPM. Relative or absolute standards in assessing medical knowledge using progress tests. Advances in Health Sciences Education 1998; 3: 81-87.
4. McManus IC, Mollon J, Duke OL, Vale JA. Changes in standard of candidates taking the MRCP(UK) Part 1 examination, 1985 to 2002: analysis of marker questions. BMC Medicine 2005; 3: 13.
5. Downing SM. Item response theory: applications of modern test theory in medical education. Medical Education 2003; 37: 739-745.
6. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. Lancet 2001; 357: 945-949.
7. Angoff WH. Scales, norming, and equivalent scores. In: Thorndike RL, editor. Educational Measurement. 2nd ed. Washington (DC): American Council on Education; 1971. p. 508-600.
8. Wright BD, Bell SR. Item banks: what, why, how. Journal of Educational Measurement 1984; 21(4): 331-345.
9. Smith RM. Applications of Rasch Measurement. Chicago: MESA Press; 1992.

New Book

Smith EV Jr. & Stone GE (Eds.). Criterion Referenced Testing: Practice Analysis to Score Reporting Using Rasch Measurement Models. Maple Grove, Minnesota: JAM Press; 2009.

Contact Details

Mike Horton: M.C.Horton@leeds.ac.uk
Matt Homer: m.s.homer@leeds.ac.uk
Alan Tennant: A.Tennant@leeds.ac.uk
Website: http://www.leeds.ac.uk/medicine/rehabmed/psychometric/

Courses

Introductory: March 10-12 2010, May 12-14 2010, Sept 15-17 2010, Dec 1-3 2010, March 23-25 2011, May 18-20 2011, Sept 14-16 2011, Nov 30-Dec 2 2011
Intermediate: May 17-19 2010, Sept 20-22 2010, Dec 6-8 2010, May 23-25 2011, Sept 19-21 2011, Dec 5-7 2011
Advanced: Sept 23-24 2010, Sept 22-23 2011