Assessment of Education Outcomes
Bill Slayton, M.D.
Resource: ACGME Toolbox of Assessment Methods

ACGME and Curriculum Design
• Focus on competency-based education
• Competency-based education focuses on learner performance (learning outcomes) in reaching specific objectives (the goals and objectives of the curriculum).

ACGME Requires
– Learning opportunities in each competency domain
– Evidence of multiple assessment methods
– Use of aggregate data to improve the educational program

What Are the Competencies?
• Medical Knowledge
• Patient Care
• Practice-Based Learning and Improvement
• Systems-Based Practice
• Professionalism
• Interpersonal and Communication Skills

Glossary of Terms: Reliability/Reproducibility
• …when scores on a given test are consistent with prior scores for the same or similar individuals.
• Measured as a correlation, with 1.0 being perfect reliability and 0.5 being unreliable.

Glossary of Terms: Validity
• How well the assessment measures represent or predict a resident's ability or behavior.
• It is the scores, not the kind of test, that are valid; i.e., it is possible to determine whether the written exam scores for a group of residents are valid, but incorrect to say that "all written exams" are valid.

Glossary of Terms: Generalizable
• Measurements (scores) derived from an assessment tool are considered generalizable if they can apply to more than the sample of cases or test questions used in a specific assessment.

Glossary of Terms: Types of Evaluation
• Formative: intended to provide constructive feedback, not to make a go/no-go decision.
• Summative: designed to accumulate all evaluations into a go/no-go decision.

360-Degree Evaluation Instrument
• Measurement tools completed by multiple people in a person's sphere of influence.
• Most use a rating scale of 1-5, with 5 meaning "all the time" and 1 meaning "never."
• Evaluators provide more accurate and less lenient ratings when the evaluation is used for formative rather than summative purposes.

360-Degree Evaluation Instrument
• Published reports of use are very limited.
• Reports describe various categories of people evaluating residents at the same time with different instruments.
• Reproducible results were most easily obtained when 5-10 nurses rated residents, whereas greater numbers of faculty and patients were necessary for the same degree of reliability.
• Higher reliability has been seen in military and education settings.

360-Degree Evaluation Instrument
• Two practical challenges:
  – Constructing surveys that are appropriate for use by a variety of evaluators
  – Orchestrating data collection from a large number of individuals
• Use of an electronic database is helpful in collecting these data.

Chart Stimulated Recall (CSR)
• An examination in which the resident's own patient cases are assessed in a standardized oral examination.
• A trained physician examiner questions the examinee about the care provided, probing for the reasons behind the workup, diagnoses, interpretation, and treatment plans.
• CSR takes 5-10 minutes per patient case.

Chart Stimulated Recall
• Cases are chosen as samples of patients the examinee should be able to manage.
• Scores are derived based on predefined scoring rules.
• The examinee's performance is determined by combining scores from all cases for a pass/fail decision overall or for each session.

Chart Stimulated Recall
• Exam score reliability has been reported between 0.65 and 0.88.
• Physician examiners need to be trained in how to question the examinee and score the responses.
• "Mock orals" can use residents' cases with less standardization to help familiarize residents with the upcoming orals.
• CSR oral exams require resources and expertise to fairly test competency and accurately standardize the exam.
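The glossary above defines reliability as a correlation between scores, and the figures quoted for CSR (0.65-0.88) and the tools that follow are coefficients of this kind. Below is a minimal sketch of how such a coefficient can be computed, assuming two administrations of the same exam to the same residents; the scores and function name are hypothetical illustrations, not part of the ACGME toolbox.

```python
# Minimal sketch: reliability estimated as the correlation between scores
# from two administrations of the same exam. All scores are hypothetical.

from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Scores for the same five residents on two administrations of an exam.
first_administration  = [72, 85, 90, 64, 78]
second_administration = [70, 88, 87, 66, 80]

r = pearson_r(first_administration, second_administration)
print(f"Estimated reliability (test-retest correlation): {r:.2f}")
# Values near 1.0 indicate highly consistent (reliable) scores;
# values around 0.5 would be considered unreliable.
```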
Checklist Evaluation
• Consists of essential or desired specific behaviors.
• Typical response options are check boxes or a "yes" to indicate that the behavior occurred.
• Forms provide information for the purpose of making a judgment regarding the adequacy of overall performance.

Checklist Evaluation
• Useful for evaluating a competency that can be broken down into specific individual behaviors.
• Checklists have been shown to be useful for demonstrating specific clinical skills, procedural skills, history taking, and physical examination.

Checklist Evaluation
• When users are trained, reliability is in the 0.7 to 0.8 range.
• To ensure validity, checklists require consensus by several experts.
• Require trained evaluators.

Global Rating of Live or Recorded Performance
• The rater judges general rather than specific skills (clinical judgment, medical knowledge).
• Judgments are made retrospectively, based on general impressions formed over time.
• All rating forms have some scale on which the resident is rated.
• Written comments are important to allow the evaluator to explain the rating.

Global Rating of Live or Recorded Performance
• Most often used to rate the resident at the end of a rotation, summarizing performance over days or weeks.
• Scores can be highly subjective.
• Sometimes all competencies are rated the same in spite of variable performance.
• Some scores are biased when reviewers refuse to use the extreme ends of the scale to avoid being harsh or extreme.

Global Rating of Live or Recorded Performance
• More skilled physicians give more reproducible ratings than physicians with less experience.
• Faculty give more lenient ratings than residents.
• Training of raters is important for reproducibility of the results.

Objective Structured Clinical Exam (OSCE)
• One or more assessment tools are administered over 12-20 separate patient encounter stations.
• All candidates move from station to station in a set sequence and with similar time constraints.
• Standardized patients are the primary evaluation tool in OSCE exams.

Objective Structured Clinical Exam (OSCE)
• Useful for measuring patient/doctor encounters in a standardized manner.
• Not useful for measuring continuity-of-care or procedural outcomes.
• A separate performance score is tallied for each station and combined into a global score.
• An OSCE with 14 to 18 stations has been recommended to obtain reliable measures of performance.

Objective Structured Clinical Exam (OSCE)
• Very useful for measuring specific skills.
• Very difficult to administer.
• Most cost-effective with large programs.
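As noted above, a separate performance score is tallied for each OSCE station and then combined into a global score. The following is a minimal sketch of that combination step, assuming equal weighting of stations; the station names, scores, and passing standard are hypothetical illustrations, not from the ACGME toolbox.

```python
# Minimal sketch: combining per-station OSCE scores into a global score.
# Station names, scores, and the passing standard are hypothetical.

from statistics import mean

station_scores = {
    "history taking":      0.82,  # fraction of checklist items achieved
    "physical exam":       0.74,
    "counseling":          0.91,
    "data interpretation": 0.68,
}

PASS_STANDARD = 0.70  # hypothetical pre-set global standard

global_score = mean(station_scores.values())
print(f"Global OSCE score: {global_score:.2f}")
print("Result:", "pass" if global_score >= PASS_STANDARD else "fail")
```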
Procedural, Operative, or Case Logs
• Document each patient encounter.
• Logs may or may not include numbers of cases; details may vary from log to log.
• There is no known study looking at procedure logs and outcomes.
• Electronic databases make storing these data feasible.

Patient Surveys
• Surveys about the patient experience often include questions about physician care, such as the amount of time spent, overall quality of care, competency, courtesy, empathy, and interest.
• Rated according to a scale, or as yes/no responses to statements such as "the doctor kept me waiting."

Patient Surveys
• Reliability estimates of 0.9 or greater have been achieved for patient satisfaction survey forms used in hospitals and clinics.
• Much lower reliability for ratings of residents, in the range of 0.7-0.82, using an American Board of Internal Medicine patient satisfaction questionnaire.
• Use of rating scales such as "yes, definitely," "yes, somewhat," or "no" may produce more reproducible results.

Patient Surveys
• Available from commercial developers and medical organizations.
• Focus on desirable and undesirable physician behaviors.
• Can be filled out quickly.
• Difficulty with language barriers.
• Difficulty obtaining enough surveys per resident to provide reproducible results.

Portfolios
• A collection of products prepared by the resident that provides evidence of learning and achievement related to a learning plan.
• Can include written documents, video and audio recordings, photographs, and other forms of information.
• Reflection on what has been learned is an important part of constructing a portfolio.

Portfolios
• Can be used for both summative and formative evaluation.
• Most useful for evaluating mastery of competencies that are difficult to evaluate in other ways, such as practice-based improvement and the use of scientific evidence in patient care.

Portfolios
• Reproducible assessments are feasible when there is agreement on the criteria and standards for a portfolio.
• Can be more useful for assessing an educational program than an individual.
• May be counterproductive when standard criteria are used to demonstrate individual learning gains relative to individual goals.
• Validity is determined by the extent to which the products or documentation included demonstrate mastery of the expected learning.

Record Review
• Trained staff at the institution review medical records and abstract information such as medications, tests ordered, procedures performed, and patient outcomes.
• Records are summarized and compared to accepted patient care standards.
• Standards of care exist for more than 1600 diseases on the website of the Agency for Healthcare Research and Quality.

Record Review
• A sample of 8-10 patient records is sufficient for a reliable assessment of care for a diagnosis or procedure.
• Fewer are necessary if records are chosen at random.
• Missing or incomplete documentation is interpreted as not meeting the accepted standard.

Record Review
• Reviews take 20-30 minutes per record on average.
• A certain number of patients with a given diagnosis need to be seen, which can delay reports.
• Criteria of care must be agreed upon.
• Staff training in identifying and coding information is critical.

Simulation and Models
• Used to assess performance through experiences that closely resemble reality and imitate, but do not duplicate, the real clinical problem.
• Allow examinees to reason through a clinical problem with little or no cueing.
• Permit examinees to make life-threatening errors without hurting a real patient.
• Provide instant feedback.

Simulation and Models: Types
• Paper-and-pencil "patient branching problems"
• Computerized "clinical case simulations"
• Role-playing situations ("standardized patients")
• Anatomical models and mannequins
• Virtual reality, which combines computers and sometimes mannequins; good for assessing procedural competence

Simulation and Models: Uses
• Used to train and assess surgeons performing arthroscopy
• Major wound debridement
• Anesthesia training for life-threatening critical incidents during surgery
• Cardiopulmonary incidents
• Written and computerized simulations test reasoning and the development of diagnostic plans

Simulation and Models
• Studies have demonstrated content validity for high-quality simulations designed to resemble real patients.
• One or more scores are derived from each simulation based on preset scoring rules from experts in the discipline.
• The examinee's performance is determined by combining scores to derive an overall performance score.
• Can be part of an OSCE.
• Expensive to create; many grants and contracts are available to develop these.

Standardized Oral Exams
• Use realistic patient cases, with a trained physician examiner questioning the examinee.
• The clinical problem is presented as a scenario.
• Questions probe the reasoning for requesting clinical tests, the interpretation of findings, and treatment plans.
• Exams last 90 minutes to 2 ½ hours.
• 1-2 physicians serve as examiners.

Standardized Oral Exams
• Test clinical decision making with real-life scenarios.
• 15 of 24 ABMS Member Boards use standardized oral exams as the final examination for initial certification.
• A committee of experts in the specialty carefully crafts the scenarios.
• Focus on assessment of the "key features" of the case.
• Exam score reliability is between 0.65 and 0.88.

Standardized Oral Exams
• Examiners need to be well trained for exams to be reliable.
• Mock orals can be used to prepare but are much less standardized.
• Extensive resources and expertise are required to develop and administer a standardized oral exam.

Standardized Patient Exam
• Standardized patients (SPs) are well persons trained to simulate a medical condition in a standardized way.
• The exam consists of multiple SPs, each presenting a different condition in a 10-12 minute patient encounter.
• Performance criteria are set in advance.
• SP encounters are included as stations in the OSCE.

Standardized Patient Exam
• Used to assess history-taking skills, physical exam skills, communication skills, differential diagnosis, laboratory utilization, and treatment.
• Reproducible scores are more readily obtained for history taking, physical exam, and communication skills.
• Most often used as a summative performance exam for clinical skills.
• A single SP can assess targeted skills and knowledge.

Standardized Patient Exam
• Standardized patient exams can generate reliable scores for individual stations.
• Training of raters is critical.
• It takes at least a half-day of testing to obtain reliable scores for hands-on skills.
• Research on validity has found better performance by senior than by junior residents (construct validity) and modest correlations between SP exams and clinical ratings or written exams (concurrent validity).

Standardized Patient Exam
• Development and implementation require substantial resources.
• Can be more efficient when SPs are shared among multiple residency programs.
• A large facility with multiple exam rooms, one for each station, is needed.

Written Exams
• Usually made up of multiple choice questions.
• Each question contains an introductory statement followed by four or five options.
• The examinee selects one of the options as the presumed correct answer by marking the option on a coded answer sheet.
• The in-training exam is an example of this format.
• A typical half-day exam has 175-250 test questions.

Written Exams
• Medical knowledge and understanding can be measured.
• Comparing test scores with national statistics can serve to identify strengths and limitations of individual residents to help improvement (a simple sketch of such a comparison appears at the end of this document).
• Comparing test results aggregated for residents each year can help identify residency training experiences that might be improved.

Written Exams
• A committee of experts designs the test and agrees on the knowledge to be assessed.
• The committee creates a test blueprint specifying the number of test questions for each topic.
• When tests are used to make pass/fail decisions, the test should be piloted and statistically analyzed.
• Standards for passing should be set by a committee of experts prior to administering the exam.

Written Exams
• If performance is compared from year to year, at least 20-30 percent of the same test questions should be repeated each year.
• For in-training exams, each residency administers an exam purchased from a vendor.
• Tests are scored by the vendor, and scores are returned to the residency director.
• Comparable national scores are provided.
• All 24 ABMS Member Boards use MCQ exams for initial certification.

Use of These Tools in Medical Education
• The field is changing.
• Technology will provide new opportunities, particularly in simulating and assessing medical problems.
• The ACGME is requiring programs to use multiple valid tools to assess resident performance.
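As referenced in the Written Exams section, comparing resident scores with national statistics is one way to identify strengths and limitations. Below is a minimal sketch of such a comparison, assuming the exam vendor reports a national mean and standard deviation; all names and numbers are hypothetical illustrations, not actual in-training exam norms.

```python
# Minimal sketch: comparing resident written-exam scores with national
# statistics using z-scores. All values are hypothetical illustrations.

from statistics import NormalDist

NATIONAL_MEAN = 500   # hypothetical national mean scaled score
NATIONAL_SD = 100     # hypothetical national standard deviation

resident_scores = {"Resident A": 430, "Resident B": 515, "Resident C": 640}

for name, score in resident_scores.items():
    z = (score - NATIONAL_MEAN) / NATIONAL_SD
    percentile = NormalDist().cdf(z) * 100  # approximate national percentile
    print(f"{name}: score {score}, z = {z:+.2f}, ~{percentile:.0f}th percentile")

# Aggregating such comparisons by training year can point to topics where
# the residency's teaching might be strengthened.
```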