Writing Effective Test Questions

Terry Stratton, Ph.D.
Assistant Dean, Student Assessment & Program Evaluation

Amy Murphy-Spencer, Ed.S.
Coordinator, Testing Services

Office of Medical Education
University of Kentucky College of Medicine
January 20, 2010

Acknowledgements
  Center for Excellence in Medical Education (CEME)
    Dorcas Beatty
  Others
    Don Witzke, Ph.D.
    Paul Murphy, M.D.
    Donna Weber, Ph.D.
    Darrell Jennings, M.D.

Outline
  Basic Definitions/Vernacular
  Group Exercise #1: Test-Taking “Savvy”
  Reliability and Validity
  Measurement Error
  Item Writing Tips: Dos and Don’ts
  Group Exercise #2: Critiquing Test Items
  Test Item Statistics/Test Item Banking
  Wrap-Up

Basic Vernacular
  Assessment
    Q: How does something perform?
  Evaluation
    Q: Does something work? Is something effective?

What to Assess?
  Attitudes
    Altruism
    Professionalism
  Knowledge
    Physiology
    Biochemistry
  Skills
    Problem-Solving
    Provider-Patient Communication
    Clinical Procedures (e.g., NG tube placement)

How to Assess?
  Self-Assessment
    Ask the learner
  Observation
    Ward Evaluations
      Often non-standardized, subjective
    Objective Structured Clinical Exams (OSCEs)
      Usually standardized, objective
  Tests/Exams
    Written
    Oral

What is a Test?
  A test is a sample of items or tasks which represents a specified body of knowledge or performance.
  Coupled with the test is a scoring procedure that enables one to estimate examinees’ knowledge or performance.
  Emphasis is on differentiation (who has the knowledge/skill from who doesn’t)

Types of Test Items
  Essays
  Short-Answer
  True/False
  Multiple Choice (single best answer)
  Extended Matching
  Any of these formats may involve factual recall, problem-solving, etc.

Our Focus Today
  Knowledge Domain
  Written Examinations
  Multiple Choice Items (single best answer)
  NBME-Type Vignettes

Anatomy of a Test Item
  Stem:
    Who is the primary author of the Declaration of Independence?
  Options:
    A. Alexander Hamilton
    B. Benjamin Franklin
    C. George Washington
    D. James Madison
    E. Thomas Jefferson (*)
  Distractors or Foils: the incorrect options (A through D here); (*) marks the correct answer

Do Tests Measure ONLY Knowledge?

Exercise #1: Test of General Rock & Roll Knowledge
  Are you ready and willing to participate?
  Are you a savvy test taker?

Which pair of artists met their deaths from drug overdoses?
  A. Joplin and Slick
  B. Hendrix and Joplin
  C. Hendrix and Redding
  D. Joplin and Chapin
  E. Lennon and Morrison
Demonstrates the “convergence” strategy: find the answer that includes responses most frequently used among all the options.

The Payola Scandal of the late 1950s and early 1960s involved:
  A. tax evasion by recording artists
  B. accusations of Rock & Roll as communist propaganda
  C. paying radio DJs to play records produced by record companies
  D. television game shows
  E. violation of copyright laws
The word cue “pay” tips off the answer and makes it more likely that examinees will guess C even if they don’t really know the answer.

The Moog synthesizer is an:
  A. instrument
  B. recording medium
  C. musical genre of the 1960’s
  D. Top 40 rock group
  E. song
The grammatical cue “an” tips off this answer; A is the only grammatically correct response.

The late 1960s Top 40 rock group “The Monkees” are credited for being the first to produce a recording with this instrument.
  A. banjo
  B. Moog synthesizer
  C. slack guitar
  D. slide guitar
  E. sitar
Conceptual Proximity: The answer is tipped off by the prior question.

What was the primary reason Cleveland was selected as the site for the Rock & Roll Hall of Fame?
  A. Most rock and roll records were produced there
  B. Cash
  C. Pity
  D. Economic stability
  E. Alan Freed, a local 1950s Cleveland DJ, is largely credited for coining the term Rock & Roll to identify this new music genre
The longest response is usually the correct answer.

Which of the following statements characterizes the sound of surf music from the 1960s?
  A. All surf songs were instrumental
  B. There were never drums in surf songs
  C. All surf songs were produced by the Beach Boys or by Jan & Dean
  D. Many surf songs consisted of a twangy guitar sound and higher range vocals
  E. All surf music had female backup vocals and a reggae beat
“All or nothing concept”: exceptions to all, always, never, and none are abundant, so test takers tend to skip options containing those words and choose the non-absolute responses.

A technique for creating the staccato “thumping” bass sound characteristic of funk was:
  A. using a capo
  B. using a slide
  C. slapping the bass strings with the thumb
  D. plucking the strings with a pick
  E. plucking the strings with a fingernail
“Homonym and Synonym Cues”: Guessers might choose C just because ‘slapping’ and ‘thumping’ are more closely associated than the alternatives.

Which record company contributed most to the “Memphis Sound”?
  A. Stax
  B. Capitol and Warner Bros.
  C. Columbia and Chess
  D. Motown and Atco
  E. Chess and American
This is another grammatical cue. The question asks for one company, and only one option lists a single company.

The British Invasion included which of the following Rock & Roll groups?
  A. Gerry and the Pacemakers
  B. The Swinging Blue Jeans
  C. The Animals
  D. The Beatles
  E. All of the Above
Most know the Beatles were British. To choose E as correct, one only needs to know that one other group was British. Gerry, spelled with a G, might be a clue that this is a British group.

Purposes of Testing
  Communicate to students what is important
  Motivate students to study
  Identify areas where remediation is needed
  Provide feedback on learning to students
  Determine final grades/make promotion decisions
  Identify areas where course/curriculum is weak

“The amount of attention given to evaluating something should reflect its relative importance”
  Susan Case, NBME, 1995

What Should be Tested?
  Exam content should match course objectives
  Important topics should be weighted accordingly
  # items/testing time should reflect topic importance
  The sample of items should represent the domain of instructional content

Questions thus far?

Reliability
  AKA reproducibility, dependability, internal consistency
  The degree to which we would obtain the same result if an examination were repeated
  Reliable methods of assessment will tend to yield similar results when repeated under similar conditions
  Methods of estimating reliability
    Cronbach’s alpha (α)
    Test-retest
    Parallel forms
    Split-half

Lowest Acceptable Reliability
  Scores from a single test with a reliability coefficient of less than .70 should not be used to characterize or evaluate individuals or groups in higher-stakes situations.

Reliability & Decision-Making
  High reliability is demanded when the decision:
    is important
    is final
    is irreversible
    is not confirmable with other data
    concerns individuals
    has lasting consequences
  Low reliability is tolerable when the decision:
    is of minor importance
    is in the early stages of decision-making
    is reversible
    is confirmable by other data
    concerns groups
    has temporary effects
  Source: Gronlund NE & Linn RL. Measurement & Evaluation in Teaching (6th Edition). New York: Macmillan, 1990.

Standard Error of Measurement
  If a single student were to take the same test repeatedly - with no new learning taking place between assessments and no memory of question effects - the standard deviation of his/her repeated test scores is denoted as the standard error of measurement.

Validity
  The extent to which a test measures what it purports to measure (e.g., IQ).
  An indication of how well a measure corresponds with reality
  Is the exam representative of the universe of possible exam items from that content area?
  Does the exam provide data that increase the accuracy of decisions made about the examinee?
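
Worked sketch: estimating reliability and the standard error of measurement
  The reliability and standard error of measurement slides can be made concrete with a small calculation. This is a minimal sketch, not the method used by any particular scoring package. It assumes items are scored 0/1, uses the standard formula for Cronbach’s alpha, and uses the usual classical test theory estimate SEM = SD × sqrt(1 − reliability), which the slides do not state. The score matrix and function names are hypothetical.

```python
from math import sqrt
from statistics import pstdev, pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of scores (one row per examinee, one column per item)."""
    k = len(scores[0])                                   # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])  # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def standard_error_of_measurement(scores, reliability):
    """Classical estimate: SD of total scores times sqrt(1 - reliability)."""
    sd_total = pstdev([sum(row) for row in scores])
    return sd_total * sqrt(1 - reliability)

# Hypothetical data: 5 examinees, 4 items, 1 = correct, 0 = incorrect
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
sem = standard_error_of_measurement(scores, alpha)
print(f"alpha = {alpha:.2f}, SEM = {sem:.2f} points")
```

  For this toy matrix alpha is about 0.80 and the SEM is about 0.63 points. A real exam would have far more examinees and items, so these numbers only illustrate the arithmetic.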

Reliability and Validity in a Nutshell

True Score Theory

Measurement Error

Random Measurement Error
  All chance (random) errors that confound the measurement of any phenomenon
  Inversely related to the reliability of the measuring instrument
  Possible examples:
    Testing environment
    Testing administration
    Examinee preparedness

Effects of Random Error

Non-Random Measurement Error
  All systematic errors that confound the measurement of any phenomenon (bias)
  Inversely related to the validity of the measuring instrument
  Possible examples:
    Insufficient sampling of items
    Assessing what was not taught
    Poorly written test items

Effects of Non-Random Error

Sources of Error in Test Scores
  Test Quality
    Unclear, lengthy directions
    Emphasis on non-focal skills (e.g., major reading in a math test)
  Item Sampling
    Insufficient length to reflect course content
    Not enough time for examinees to finish
  Item Quality
    Items are ambiguous and/or fail to discriminate among examinees
  Environmental Characteristics
    Noise, distractions, temperature
  Scoring
    Important/core topics are not weighted accordingly

Questions?

Okay, How Do You Get There?
  Avoid T/F questions
  Write test questions with fewer flaws (less error)
  Develop a secure test item bank
  Select items with known measurement characteristics
  Select items representing course content

Multiple-Choice Questions: Pros
  Versatility in measuring all levels of cognitive ability
  Highly reliable test scores
  Scoring efficiency and accuracy
  Objective measurement of student achievement or ability
  Wide sampling of content or objectives
  Reduced guessing factor when compared to T/F items
  Different response alternatives which can provide diagnostic feedback

Multiple-Choice Questions: Cons
  Are difficult and time consuming to construct
  Lead to favoring simple recall of facts
  Place a high degree of dependence on the student's reading ability and instructor's writing ability

One Best Answer Items: “Do”
  Focus items on important concepts or problems
  Gear items toward assessing application of knowledge, not recall of an isolated fact
  Write the item stem to pose a clear question; one should be able to answer an item with the options (foils) covered
  Make all options (foils) homogeneous
  Construct items that require examinees to compare the relative correctness of options
    e.g., “Which of the following Xs is most likely to result in Y?”

One Best Answer Items: “Don’ts”
  The following items are unfocused and have heterogeneous options:
    “Which of the following statements is correct?”
    “Each of the following statements is correct EXCEPT:”
  Do not include additional irrelevant data (e.g., unnecessary background, etc.)
  Avoid items that pose irrelevant difficulty
  Do not write items that allow options to be eliminated in a T/F fashion
  Avoid items containing vague references to time (e.g., frequently, usually, rarely, etc.)
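
Worked sketch: item difficulty and discrimination
  The “known measurement characteristics” mentioned under “Okay, How Do You Get There?” are often summarized with two classical item statistics: difficulty (the proportion of examinees answering correctly) and discrimination (how strongly the item score tracks performance on the rest of the test). The slides do not say which statistics their item analysis reports use, so this is an illustrative sketch under that assumption, with hypothetical data and function names.

```python
from statistics import mean, pstdev

def item_difficulty(scores, item):
    """Proportion of examinees who answered the item correctly (0/1 scoring)."""
    return mean(row[item] for row in scores)

def item_discrimination(scores, item):
    """Correlation between the item score and the total on the remaining items."""
    x = [row[item] for row in scores]
    y = [sum(row) - row[item] for row in scores]   # rest score excludes this item
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0                                 # no variation, nothing to correlate
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)

# Hypothetical data: 5 examinees, 4 items, 1 = correct, 0 = incorrect
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for i in range(len(scores[0])):
    p = item_difficulty(scores, i)
    d = item_discrimination(scores, i)
    print(f"item {i + 1}: difficulty = {p:.2f}, discrimination = {d:.2f}")
```

  An item with discrimination near zero or below zero does not separate stronger from weaker examinees and is a candidate for revision before it goes back into the item bank.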

Single Best Answer Items: Common Flaws
  In poorly constructed single best answer items, the correct response will probably be:
    The longest response
    The most qualified or detailed response
    The response without spelling or grammatical errors
    Neither of two responses that mean the same thing
    A middle numerical value (not an extreme value)
    The response which grammatically fits with the stem
    One of two responses which are the opposite of each other
  Source: Camp, MG. Maximizing your Score on Multiple-Choice Exams.

Layout: Stem and Options
  Less Desirable Item Model
  More Desirable Item Model

Group Exercise #2: Critiquing Test Items

Item #1
A 58-year old man with a history of heavy alcohol use and previous psychiatric hospitalization is confused and agitated. He speaks of experiencing the world as unreal. This symptom is called:
  A. de-personalization
  B. signal anxiety
  C. de-realization
  D. focal memory deficit
  E. derailment
Option C repeats clue in stem; responses not alphabetized.

Item #2
Following a second episode of infection, what is the likelihood that a woman is infertile?
  A. Less than 20%
  B. 20% to 30%
  C. Greater than 50%
  D. 90%
  E. 75%
Overlapping numerical ranges (C, D, and E)

Item #3
Severe obesity in early adolescence:
  A. usually responds dramatically to dietary regimens
  B. often is related to endocrine disorders
  C. always has a 75% chance of clearing spontaneously
  D. never shows a poor prognosis
  E. usually responds to pharmacotherapy and intensive psychotherapy
Subjective qualifiers in all options; responses not alphabetized

Item #4
Arrange the parents of the following children with Down’s syndrome in order of highest to lowest risk of recurrence. Assume that the maternal age in all cases is 22 years and that a subsequent pregnancy occurs within 5 years. The karyotypes of the daughters are:
  I. 46, XX, -14, +T (14q21q) pat
  II. 46, XX, -14, +T (14q21q) de novo
  III. 46, XX, -14, +T (14q21q) mat
  IV. 46, XX, -21, +T (14q21q) pat
  V. 47, XX, -21, +T (21q21q) (parents not karyotyped)

  A. III, IV, I, V, II
  B. IV, III, V, I, II
  C. III, I, IV, V, II
  D. IV, III, I, V, II
  E. III, IV, I, II, V
Complex K-type, confusing, unacceptable item type

Item #5
Peer review committees in HMOs may move to take action against a physician’s credentials to care for participants of the HMO. There is an associated requirement to assure that the physician receives due process in the course of these activities. Due process must include which of the following:
  A. Notice, an impartial forum, a chance to hear and confront evidence against him/her
  B. Proper notice, a tribunal empowered to make the decision, a chance to confront witnesses against him/her, and a chance to present evidence in defense
  C. Reasonable and timely notice, impartial panel empowered to make a decision, a chance to hear evidence against himself/herself and to confront witnesses, and the ability to present evidence in defense.
C is the most detailed and lengthy option

Item #6
Local anesthetics are most effective in the:
  A. anionic form, acting from inside the nerve membrane
  B. cationic form, acting from inside the nerve membrane
  C. cationic form, acting from outside the nerve membrane
  D. uncharged form, acting from inside the nerve membrane
  E. uncharged form, acting from outside the nerve membrane
Repetitious form types and use of opposites

Item #7
In patients with advanced dementia, Alzheimer’s type, the memory defect:
  A. can be treated adequately with phosphatidylcholine (lecithin)
  B. possibly involves the cholinergic system
  C. is never seen in patients with neurofibrillary tangles at autopsy
  D. could be a sequela of early Parkinsonianism
  E. is never severe
Option B not grammatically correct; use of absolutes in C and E; responses not alphabetized

Item #8
Secondary gain is:
  A. synonymous with malingering
  B. a frequent problem in obsessive-compulsive disorder
  C. a complication of a variety of illnesses and tends to prolong many of them
  D. never seen in organic brain damage
Unfocused stem; responses not alphabetized

Item Templates: Vignettes
  The patient vignettes may include some or all of the following components:
    Age, Gender (e.g., 45-year-old man)
    Site of Care (e.g., comes to the emergency room)
    Presenting Complaint (e.g., because of a headache)
    Duration (e.g., that has continued for 2 days)
    Patient History (with Family History)
    Physical Findings
    +/- Results of Diagnostic Studies
    +/- Initial Treatment, Subsequent Findings, etc.
    Current/prior medications

After the Exam
  When appropriate, classify items into multiple “subtests”
  Review item analysis with students & faculty
  Provide feedback to each student
  Revise (if necessary) and retain quality test questions in a secure item bank

Importance of Feedback
  Feedback: Information on examinees’ past performance intended to guide their future performance
  Reports should be simple, meaningful, and easy-to-interpret
  Reports should help examinees identify strengths & weaknesses
  Security issue: return exams or not?
    Pro: old exams guide learning
    Con: threatens exam security

Sample Student Test Report

Item Analysis Report

Item Quality Graph

Measurement Error Chart

Sample of LXR Item Bank

Sample of LXR Exam Summary

Our Website
  Student Assessment & Program Evaluation
  Office of Medical Education
  http://www.mc.uky.edu/meded/tande/index.asp

Thank you!!