Assessment of Education Outcomes

Bill Slayton M.D.
Resource: ACGME Toolbox of
Assessment Methods
ACGME and Curriculum Design
• Focus on competency based education
• Competency-based education focuses
on learner performance (learning
outcomes) in reaching specific
objectives (goals and objectives of the curriculum)
ACGME Requires That
– Learning opportunities in each competency
– Evidence of multiple assessment methods
– Use of aggregate data to improve the
educational program
What are the competencies?
• Medical Knowledge
• Patient Care
• Practice Based Learning and Improvement
• Systems Based Practice
• Professionalism
• Interpersonal and Communication Skills
Glossary of Terms-Reliability/Reproducibility
• …when scores on a given test are
consistent with prior scores for same or
similar individuals.
• Measured as a correlation with 1.0 being
perfect reliability and 0.5 being unreliable.
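As a minimal sketch of what "measured as a correlation" can look like in practice, the Python below computes a Pearson correlation between two sets of scores for the same residents. The scores and the names `pearson_r`, `first_sitting`, and `second_sitting` are hypothetical and only illustrate the idea; the slides do not specify which correlation statistic is used.

```python
import math

def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_a = sum(first) / len(first)
    mean_b = sum(second) / len(second)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical scores for five residents on two administrations of one test
first_sitting = [62, 70, 75, 81, 90]
second_sitting = [64, 69, 77, 80, 92]
reliability = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {reliability:.2f}")
```

A value near 1.0 means residents keep roughly the same rank order across the two administrations.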
Glossary of Terms: Validity
• How well the assessment measures
represent or predict a resident’s ability or
• It is the scores and not the kind of test that
is valid…i.e. it is possible to determine
whether the written exam score for a
group of residents is valid, but incorrect to
say that “all written exams” are valid
Glossary of Terms: Generalizable
• Measurements (scores) derived from an
assessment tool are considered
generalizable if they can apply to more
than the sample of cases or test questions
used in a specific assessment
Glossary of Terms: Types of Evaluation
• Formative: intended to provide
constructive feedback, not intended to
make a go/no-go decision
• Summative: designed to accumulate all
evaluations into a go/no-go decision
360 Degree Evaluation Instrument
• Measurement tools completed by multiple
people in a person’s sphere of influence.
• Most use a rating scale of 1-5, with 5
meaning all the time and 1 meaning never
• Evaluators provide more accurate and less
lenient ratings when evaluation is used for
formative rather than summative
360 Degree Evaluation Instrument
• Published reports of use are very limited.
• Reports of various categories of people
evaluating residents at same time with different
• Reproducible results were most easily
obtainable when 5-10 nurses rated residents,
whereas greater number of faculty and patients
were necessary for same degree of reliability.
• Higher reliability seen in military and education
360 Degree Evaluation Instrument
• Two practical challenges
– Constructing surveys that are appropriate for
use by variety of evaluators
– Orchestrating data collection from large
number of individuals
• Use of electronic database is helpful in
collecting these data
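One way to picture the data-collection step is a small aggregation over ratings gathered from different rater groups. The sketch below assumes each rating is stored as a (rater group, score) pair on the 1-5 scale described earlier; the function name `summarize_360` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize_360(ratings):
    """Average 1-5 ratings per rater group, plus an overall average.

    ratings: list of (rater_group, score) pairs.
    """
    by_group = defaultdict(list)
    for group, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"rating out of range: {score}")
        by_group[group].append(score)
    summary = {group: mean(scores) for group, scores in by_group.items()}
    summary["overall"] = mean([score for _, score in ratings])
    return summary

# Hypothetical ratings for one resident from three rater groups
sample_ratings = [
    ("nurse", 4), ("nurse", 5),
    ("faculty", 3),
    ("patient", 5), ("patient", 4),
]
summary = summarize_360(sample_ratings)
for group, average in summary.items():
    print(f"{group}: {average:.2f}")
```

Reporting each group separately keeps differences between, say, nurse and faculty ratings visible instead of hiding them in a single average.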
Chart Stimulated Recall
• Examination where patient cases of the
resident are assessed in a standardized
oral examination
• Trained physician examiner questions the
examinee about the care provided probing
for reasons behind the workup, diagnoses,
interpretation, and treatment plans
• CSR takes 5-10 minutes per patient case
Chart Stimulated Recall
• Cases are chosen to be samples of
patients examinee should be able to
• Scores are derived based on predefined
scoring rules.
• Examinee's performance is determined by
combining scores from all cases for a
pass/fail decision overall or by each
Chart Stimulated Recall
• Exam score reliability reported between 0.65
and 0.88.
• Physician examiners need to be trained in how
to question examinee and score the responses.
• “Mock Orals” can use residents' cases with less
standardization to help familiarize residents with
the upcoming orals
• CSR oral exams require resources and
expertise to fairly test competency and
accurately standardize the exam.
Checklist Evaluation
• Consist of essential or desired specific
• Typical response options are check boxes
or yes to indicate that the behavior
• Forms provide information for purpose of
making a judgment regarding adequacy of
overall performance
Checklist Evaluation
• Useful for evaluating a competency that
can be broken down into specific individual
• Checklists have been shown to be useful
to demonstrate specific clinical skills,
procedural skills, history taking and
physical examination
Checklist Evaluation
• When users are trained, reliability is in the
0.7 to 0.8 range.
• To ensure validity, checklists require
consensus by several experts
• Require trained evaluators.
Global Rating of Live or Recorded
• Rater judges general rather than specific
skills (clinical judgment, medical knowledge)
• Judgments made retrospectively based on
general impressions made over time
• All rating forms have some scale on which
the resident is rated
• Written comments are important to allow
evaluator to explain rating
Global Rating of Live or Recorded
• Most often used to rate resident at end of
rotation and summary statements over
days or weeks
• Scores can be highly subjective
• Sometimes all competencies are rated the
same in spite of variable performance
• Some scores biased when reviewers
refuse to use extreme ends of the scale to
avoid being harsh or extreme
Global Rating of Live or Recorded
• More skilled physicians give more
reproducible ratings than physicians with
less experience.
• Faculty give more lenient ratings than
• Training of raters important for
reproducibility of the results.
Objective structured clinical exam
• One or more assessment tools are
administered over 12-20 separate patient
encounter stations.
• All candidates move from station to station
in a set sequence, and with similar time
• Standardized patients are the primary
evaluation tool in OSCE exams
Objective structured clinical exam
• Useful to measure in a standardized
manner patient/doctor encounters
• Not useful to measure outcomes of
continuity care or procedural outcomes
• Separate performance score tallied for
each station, combined for a global score
• OSCE with 14 to 18 stations has been
recommended to obtain reliable measures
of performance
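The link between the number of stations and score reliability can be illustrated with the Spearman-Brown prediction formula. This is a standard psychometric formula brought in here for illustration, not something the slides cite, and the single-station reliability of 0.2 is a made-up value, so the output says nothing about where the 14 to 18 station recommendation comes from.

```python
import math

def spearman_brown(single_reliability, n_units):
    """Predicted reliability of a score pooled over n_units parallel units."""
    r = single_reliability
    return n_units * r / (1 + (n_units - 1) * r)

def units_needed(single_reliability, target):
    """Smallest number of parallel units predicted to reach the target."""
    r = single_reliability
    n = target * (1 - r) / (r * (1 - target))
    # Small tolerance so floating-point noise does not round up an exact value
    return math.ceil(n - 1e-9)

# Hypothetical: one station alone has reliability 0.2
predicted = spearman_brown(0.2, 16)
stations = units_needed(0.2, 0.8)
print(f"predicted reliability with 16 stations: {predicted:.2f}")
print(f"stations needed for 0.80: {stations}")
```

The same reasoning applies to any tool in this deck that pools raters, cases, or records: the less reliable a single observation, the more observations are needed.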
Objective structured clinical exam
• Very useful to measure specific skills
• Very difficult to administer
• Most cost-effective with large programs
Procedural, operative or case logs
• Document each patient encounter
• Logs may or may not include numbers of
cases; details vary from log to log
• There is no known study looking at
procedure logs and outcomes
• Electronic databases make storing these
data feasible
Patient Surveys
• Surveys about patient experience often
include questions about physician care
such as amount of time spent, overall
quality of care, competency, courtesy,
empathy and interest
• Rated according to a scale or yes or no to
statements such as “the doctor kept me
Patient Surveys
• Reliability estimates of 0.9 or greater have been
achieved for patient satisfaction survey forms
used in hospitals and clinics
• Much lower reliability for rating of residents in
range of 0.7-0.82 using an American Board of
Internal Medicine Patient Satisfaction Questionnaire
• Use of rating scales such as yes, definitely, yes
somewhat or no may produce more reproducible
Patient Surveys
• Available from commercial developers and
medical organizations
• Focus on desirable and undesirable
physician behaviors
• Can be filled out quickly
• Difficulty with language barriers
• Difficulty obtaining enough surveys per
resident to provide reproducible results
Portfolios
• Collection of products prepared by the
resident that provides evidence of learning
and achievement related to a learning
• Can include written documents, video and
audio recordings, photographs and other
forms of information
• Reflection on what has been learned
important part of constructing a portfolio
• Can be used for both summative and
formative evaluation
• Most useful to evaluate mastery of
competencies that are difficult to evaluate in
other ways such as practice-based
improvement and use of scientific
evidence in patient care
• Reproducible assessments are feasible when
agreement on criteria and standards for a
• Can be more useful to assess an educational
program than an individual
• May be counterproductive when standard
criteria are used to demonstrate individual
learning gains relative to individual goals
• Validity is determined by extent to which
products or documentation included
demonstrates mastery of expected learning
Record Review
• Trained staff at institution review medical
records and abstract information such as
medications, tests ordered, procedures
performed, and patient outcomes.
• Records are summarized and compared to
accepted patient care standards.
• Standards of care exist for more than 1600
diseases on the website of the Agency for
HealthCare Research and Quality
Record Review
• Sample of 8-10 patient records is sufficient
for a reliable assessment of care for a
diagnosis or procedure
• Fewer necessary if chosen at random
• Missing or incomplete documentation is
interpreted as not meeting the accepted
Record Review
• Takes 20-30 minutes per record on average
• Need to see certain number of patients
with a given diagnosis which can delay
• Criteria of care must be agreed upon
• Staff training regarding identifying and
coding information is critical
Simulation and Models
• Used to assess performance through
experiences that closely resemble reality
and imitate but do not duplicate the real
clinical problem
• Allow examinees to reason through a
clinical problem with little or no cueing
• Permit examinees to make life-threatening
errors without hurting a real patient
• Provide instant feedback
Simulation and Models--Types
• Paper and pencil “patient branching problems”
• Computerized “clinical case simulations”
• Role playing situations (“standardized patients”)
• Anatomical models and mannequins
• Virtual reality combines computers and
sometimes mannequins—good to assess
procedural competence
Simulation and Models--Use
• Used to train and assess surgeons doing
• Major wound debridement
• Anesthesia training for life threatening
critical incidents during surgery
• Cardiopulmonary incidents
• Written and computerized simulation test
reasoning and development of diagnostic
Simulation and Models
• Studies have demonstrated content validity for high
quality simulation designed to resemble real patients.
• One or more scores are derived from each simulation
based on preset scoring rules from experts in the
• Examinee's performance determined by combining
scores to derive overall performance score
• Can be part of an OSCE
• Expensive to create…many grants and contracts
available to develop these
Standardized Oral Exams
• Uses realistic patient cases with a trained
physician examiner questioning the examinee
• Clinical problem presented as a scenario
• Questions probe the reasoning for
requesting clinical tests, interpretation of
findings and treatment plans
• Exams last 90 minutes to 2 ½ hours
• 1-2 physicians serve as examiners
Standardized Oral Exams
• Test clinical decision making with real-life
• 15 of 24 ABMS Member Boards use
standardized oral exams as final examination for
initial certification
• Committee of experts in specialty carefully craft
the scenarios
• Focus on assessment of “key features” of the
• Exam score reliability is between 0.65 and 0.88
Standardized Oral Exams
• Examiners need to be well trained for
exams to be reliable
• Mock orals can be used to prepare but are
much less standardized
• Extensive resources and expertise to
develop and administer a standardized
oral exam
Standardized Patient Exam
• Standardized patients are well persons
trained to simulate a medical condition in a
standardized way
• Exam consists of multiple SPs each
presenting a different condition in a 10-12
minute patient encounter
• Performance criteria are set in advance
• Included as stations in the OSCE
Standardized Patient Exam
• Used to assess history-taking skills, physical
exam skills, communication skills, differential
diagnosis, laboratory utilization, and treatment
• Reproducible scores are more readily obtained
for history taking, physical exam and
communication skills
• Most often used as a summative performance
exam for clinical skills
• A single SP can assess targeted skills and
Standardized Patient Exam
• Standardized patient exams can generate
reliable scores for individual stations
• Training of raters is critical
• Takes at least a half-day of testing to obtain reliable
scores for hands-on skills
• Research on validity has found better
performance by senior than junior residents
(construct validity) and modest correlations
between SP exams and clinical ratings or written
exams (concurrent validity)
Standardized Patient Exam
• Development and implementation take a
lot of resources
• Can be more efficient when sharing SPs in
multiple residency programs
• Need large facility with multiple exam
rooms for each station
Written Exams
• Usually made up of multiple choice questions
• Each contains an introductory statement
followed by four or five options
• The examinee selects one of the options as the
presumed correct answer by marking the option
on a coded answer sheet
• The in-training exam is an example of this format
• Typical half-day exam has 175-250 test questions
Written Exams
• Medical knowledge and understanding can be
• Comparing test scores with national statistics
can serve to identify strengths and limitations of
individual residents to guide improvement
• Comparing test results aggregated for residents
each year can help identify residency training
experiences that might be improved
Written Exams
• Committee of experts designs the test and
agrees on the knowledge to be assessed
• Creates a test blueprint for the number of test
questions for each topic
• When tests are used to make pass/fail
decisions, test should be piloted and statistically
• Standards for passing should be set by a
committee of experts prior to administering the exam
Written Exams
• If performance is compared from year to year, at
least 20-30 percent of the same test questions
should be repeated each year
• For in-training exams, each residency
administers exam purchased from a vendor
• Tests are scored by the vendor and scores
returned to the residency director
• Comparable national scores provided
• All 24 ABMS Member Boards use MCQ exams
for initial certification
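The guideline above on repeating 20-30 percent of questions from year to year can be checked with a simple overlap count. The sketch below assumes every question has a stable identifier; the identifiers, the function names, and the choice of 0.20 as the lower bound are illustrative assumptions.

```python
def repeated_fraction(previous_ids, current_ids):
    """Fraction of this year's questions that also appeared last year."""
    current = set(current_ids)
    if not current:
        raise ValueError("current exam has no questions")
    return len(current & set(previous_ids)) / len(current)

def meets_repeat_guideline(fraction, minimum=0.20):
    """True if at least `minimum` of the questions are repeated."""
    return fraction >= minimum

# Hypothetical exams of 200 questions each, sharing questions 151-200
last_year = [f"Q{n}" for n in range(1, 201)]
this_year = [f"Q{n}" for n in range(151, 351)]
overlap = repeated_fraction(last_year, this_year)
print(f"repeated: {overlap:.0%}, meets guideline: {meets_repeat_guideline(overlap)}")
```

The repeated questions act as a common yardstick, so a change in scores between years can be attributed to the residents and not to a change in exam difficulty.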
Use of These Tools in Medical Education
• Field is changing
• Technology will provide new opportunities,
particularly in simulating and assessing
medical problems
• ACGME is requiring programs to use
multiple valid tools to assess resident
