Designing GME Evaluations

Designing Trustworthy &
Reliable GME Evaluations
Conference Session: SES85
2011 ACGME Annual Education
Conference
Nancy Piro, PhD, Program Manager/Ed Specialist
Alice Edler, MD, MPH, MA (Educ)
Ann Dohn, MA, DIO
Bardia Behravesh, EdD, Manager/Ed Specialist
Stanford Hospital & Clinics
Department of Graduate Medical
Education
Overall Questions
What is an assessment? An
evaluation?
How are they different?
What are they used for?
Why do we evaluate?
How do we construct a
useful evaluation?
What is cognitive bias?
How do we eliminate bias
from our evaluations?
What is validity?
What is reliability?
Defining the Rules of
the “Game”
Assessment - Evaluation:
What’s the difference and
what are they used for?
Assessment …is the analysis and
use of data by residents or subspecialty residents (trainees),
faculty, program directors and/or
departments to make decisions
about improvements in teaching
and learning.
Assessment - Evaluation:
What’s the difference and
what are they used for?
Evaluation is the analysis and use of
data by faculty to make judgments
about trainee performance. Evaluation
includes obtaining accurate,
performance-based, empirical
information that is used to make
competency decisions on trainees
across the six domains.
Evaluation Examples
Example 1: A trainee delivers an oral
presentation at a Journal Club. The
faculty member provides a critique of the
delivery and content accompanied by a
rating for the assignment.
Example 2: A program director provides
a final evaluation to a resident
accompanied by an attestation that the
resident has demonstrated sufficient
ability and acquired the appropriate
clinical and procedural skills to practice
competently and independently.
Why do we assess and evaluate?
(Besides the fact it is required…)
Demonstrate and improve trainee
competence in core and related
competency areas - Knowledge and
application
Ensure our programs produce graduates,
each of whom: “has demonstrated
sufficient ability and acquired the
appropriate clinical and procedural skills to
practice competently and independently.”
Track the impact of
curriculum/organizational change
Gain feedback on program, curriculum
and faculty effectiveness
Provide residents/fellows a means to
communicate confidentially
Provide an early warning system
Identify gaps between competency based
goals and individual performance
So What’s the Game Plan for
Constructing Effective Evaluations ?
Without a plan… evaluations can take
on a life of their own!!
How do we construct a
useful evaluation?
STEP 1. Create the Evaluation (Plan)
Curriculum (Competency) Goals,
Objectives and Outcomes
Question and Scale Development
STEP 2. Deploy (Do)
Online / In-Person (Paper)
STEP 3. Analyze (Study/Check)
Reporting, Benchmarking and
Statistical Analysis
Rank Order / Norms (Within the
Institution or National)
STEP 4. Take Action (Act)
Develop & Implement
Learning/Action Plans
Measure Progress Against Learning
Goals
Adjust Learning/Action Plans
Question and Response
Scale Construction
Two Basic Goals:
1. Construct unbiased,
unconfounded, and non-leading
questions that produce valid data
2. Design and use unbiased and
valid response scales
What is cognitive bias…
Cognitive bias is distortion
in the way we perceive reality
or information.
Response bias is a particular
type of cognitive bias which
can affect the results of an
evaluation if respondents
answer questions in the way
they think they are designed
to be answered, or with a
positive or negative bias
toward the examinee.
Where does response
bias occur?
Response bias most often
occurs in the wording of the
question.
– Response bias is present
when a question contains a
leading phrase or words.
– Response bias can also occur
in rating scales.
Response bias can also be
in the raters themselves
– Halo Effect
– Devil Effect
– Similarity Effect
– First Impressions
Step 1: Create the
Evaluation
Question Construction
Example (1):
– "I can always talk to my Program
Director about residency related
problems.”
Example (2):
– “Sufficient career planning
resources are available to me
and my program director
supports my professional
aspirations.”
Question Construction
Example (3):
– “Incomplete, inaccurate medical
interviews, physical
examinations; incomplete review
and summary of other data
sources. Fails to analyze data to
make decisions; poor clinical
judgment.”
Example (4):
– "Communication in my subspecialty program is good."
Create the Evaluation
Question Construction
Example (5):
– "The pace on our service is
chaotic."
Exercise One
Review each question and
share your thinking about what
makes it a good or bad
question.
Question Construction: Test Your Knowledge
Example 1: "I can always talk
to my Program Director about
residency related problems."
Problem: Terms such as
"always" and "never" will bias
the response in the opposite
direction.
Result: Data will be skewed.
Question Construction: Test Your Knowledge
Example 2: “Career planning
resources are available to me
and my program director
supports my professional
aspirations."
Problem: Double-barreled:
resources and aspirations.
Respondents may agree with
one and not the other.
Researcher cannot make valid
assumptions about which part of
the question respondents were
rating.
Result: Data is useless.
Question Construction: Test Your Knowledge
Example 3: "Communication in my
sub-specialty program is good."
Problem: Question is too broad. If
score is less than 100% positive,
researcher/evaluator still does not
know what aspect of
communication needs
improvement.
Result: Data is of little or no
usefulness.
Question Construction: Test Your Knowledge
Example 4: “Evidences incomplete,
inaccurate medical interviews,
physical examinations; incomplete
review and summary of other data
sources. Fails to analyze data to
make decisions; poor clinical
judgment.”
Problem: Septuple-barreled.
Respondents may agree with
some parts and not others.
Evaluator cannot make
assumptions about which part of the
question respondents were rating.
Result: Data is useless.
Question Construction: Test Your Knowledge
Example (5):
– "The pace on our service is
chaotic."
Problem: The question is
negative, and broadcasts a
bad message about the
rotation/program.
Result: Data will be skewed,
and the climate may be
negatively impacted.
Evaluation Question
Design Principles
Avoid ‘double-barreled’
questions
A double-barreled question
combines two or more issues
or “attitudinal objects” in a
single question.
Avoiding Double-Barreled
Questions
Example: Patient Care Core
Competency
“Resident provides sensitive
support to patients with serious
illness and to their families,
and arranges for on-going
support or preventive services
if needed.”
Minimal Progress
Progressing
Competent
Evaluation Question
Design Principles
Combining two or more questions
into one makes it unclear which
attitudinal object is being
measured, as each part may elicit
a different perception of the
resident’s performance.
RESULT:
Respondents are confused and
results are confounded leading to
unreliable or misleading results.
Tip: If the word “and” or the word
“or” appears in a question, check
whether it is a double-barreled
question.
Evaluation Question
Design Principles
Avoid questions with double
negatives…
When respondents are asked
for their agreement with a
negatively phrased statement,
double negatives can occur.
– Example:
Do you agree or disagree with
the following statement?
Evaluation Question
Design Principles
“Attendings should not be required
to supervise their residents during
night call.”
If you respond that you disagree,
you are saying you do not think
attendings should not supervise
residents. In other words, you
believe that attendings should
supervise residents.
If you do use a negative word like
“not”, consider highlighting the
word by underlining or bolding it to
catch the respondent’s attention.
Evaluation Question
Design Principles
Because every question is
measuring something, it’s
important for each to be clear
and precise.
Remember…Your goal is for
each respondent to interpret
the meaning of each question
in exactly the same way.
Evaluation Question
Design Principles
If your respondents are not clear
on what is being asked in a
question, their responses may
result in data that cannot or
should not be applied to your
evaluation results…
"For me, further development of
my medical competence, it is
important enough to take risks"
– Does this mean to take risks
with patient safety, risks to one's
pride, or something else?
Evaluation Question
Design Principles
Keep questions short. Long
questions can be confusing.
Bottom line: Focus on short,
concise, clearly written
statements that get right to the
point, producing actionable
data that can inform individual
learning plans (ILPs).
– Take only seconds to respond
to/rate
– Easily interpreted.
Evaluation Question
Design Principles
Do not use “loaded” or
“leading” questions
A loaded or leading question
biases the response given by
the respondent. A loaded
question is one that contains
loaded words.
– For example: “I’m concerned
about doing a procedure if my
performance would reveal that I
had low ability”
Disagree
Agree
Evaluation Question
Design Principles
"I’m concerned about doing a
procedure on my unit if my
performance would reveal that
I had low ability"
How can this be answered with
“agree or disagree” if you think
you have good abilities in
appropriate tasks for your
area?
Evaluation Question
Design Principles
A leading question is phrased
in such a way that suggests to
the respondent that a certain
answer is expected:
– Example: Don’t you agree that
nurses should show more
respect to residents and
attendings?
Yes, they should show more
respect
No, they should not show more
respect
Evaluation Question
Design Principles
Use of Open-Ended Questions
Comment boxes after negative
ratings
– To explain the reasoning and
target areas for focus and
improvement
General, open-ended questions
at the end of the evaluation.
– Can prove beneficial
– Often reveal entire topics that
should have been included in the
evaluation but were omitted.
Evaluation Question
Design Principles –
Exercise 2 “Post Test”
1. Please rate the general surgery
resident’s communication and
technical skills
2. Rate the resident’s ability to
communicate with patients and
their families
3. Rate the resident’s abilities with
respect to case familiarization;
effort in reading about patient’s
disease process and familiarizing
with operative care and post op
care
4. Residents deserve higher pay for
all the hours they put in, don’t
they?
Evaluation Question Design
Principles – Exercise 2 “Post
Test”
5. Explains and performs steps in
resuscitation and stabilization
6. Do you agree or disagree that
residents shouldn’t have to pay
for their meals when on-call?
7. Demonstrates an awareness of
and responsiveness to the larger
context of health care
8. Demonstrates ability to
communicate with faculty and
staff
Bias in the Rating
Scales for Questions
The scale you construct can also
skew your data, much like we
discussed about question
construction.
Evaluation Design
Principles: Rating Scales
By far the most popular scale
asks respondents to rate their
agreement with the evaluation
questions or statements –
“stems”.
After you decide what you want
respondents to rate
(competence, agreement, etc.),
you need to decide how many
levels of rating you want them
to be able to make.
Evaluation Design
Principles: Rating Scales
Using too few can give less
precise information, while using
too many could make the question
hard to read and answer (do you
really need a 9- or 10-point scale?)
Determine how fine a
distinction you want to be able
to make between agreement
and disagreement.
Evaluation Design
Principles:
Rating Scales
Psychological research has
shown that a 6-point scale with
three levels of agreement and
three levels of disagreement
works best. An example would
be:
Disagree Strongly
Disagree Moderately
Disagree Slightly
Agree Slightly
Agree Moderately
Agree Strongly
Evaluation Design
Principles: Rating Scales
This scale affords you ample
flexibility for data analysis.
Depending on the questions,
other scales may be
appropriate, but the important
thing to remember is that it
must be balanced, or you will
build in a biasing factor.
Avoid “neutral” and “neither
agree nor disagree” options…
you’re just giving up 20% of your
evaluation ‘real estate’
Evaluation Design
Principles: Rating Scales
1. Please rate the volume and
variety of patients available to
the program for educational
purposes.
Poor Fair Good Very Good Excellent
2. Please rate the performance of
your faculty members.
Poor Fair Good Very Good Excellent
3. Please rate the competence
and knowledge in general
medicine.
Poor Fair Good Very Good Excellent
Evaluation Design
Principles: Rating Scales
The data will be artificially
skewed in the positive
direction using this scale
because there are far more
(4:1) positive than negative
rating options….Yet we see
this scale being used all the
time!
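One quick way to catch this before deployment (an illustrative helper, not material from the session) is simply to count the positive and negative anchors in a proposed scale:

    # Sketch: flag response scales whose positive and negative anchors are unbalanced.
    # Anchor classifications below are assumptions; "Fair" is counted as weakly positive,
    # matching the 4:1 ratio described on the previous slide.
    NEGATIVE = {"poor", "disagree strongly", "disagree moderately", "disagree slightly"}
    POSITIVE = {"fair", "good", "very good", "excellent",
                "agree slightly", "agree moderately", "agree strongly"}

    def is_balanced(scale):
        labels = [label.lower() for label in scale]
        return sum(l in NEGATIVE for l in labels) == sum(l in POSITIVE for l in labels)

    print(is_balanced(["Poor", "Fair", "Good", "Very Good", "Excellent"]))   # False: 1 negative vs 4 positive
    print(is_balanced(["Disagree Strongly", "Disagree Moderately", "Disagree Slightly",
                       "Agree Slightly", "Agree Moderately", "Agree Strongly"]))  # True: 3 vs 3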
Gentle Words of
Wisdom….
Avoid large numbers of questions….
Respondent fatigue – the respondent
tends to give similar ratings to all items
without giving much thought to individual
items, just wanting to finish
In situations where many items are
considered important, a large number can
receive very similar ratings at the top end
of the scale
Items are not traded-off against each
other and therefore many items that are
not at the extreme ends of the scale or
that are considered similarly important are
given a similar rating
Gentle Words of
Wisdom….
Avoid large numbers of
questions….but ensure your
evaluation is both valid and has
enough questions to be
reliable….
How many questions
(raters) are enough?
Not intuitive
Little bit of math is necessary
(sorry)
Observed Score = True Score
± Error Score
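In classical test theory terms (a standard formulation added here for context, not taken from the slides), reliability is the share of observed-score variance attributable to true scores:

    % Observed score decomposes into true score plus error
    X = T + E
    % Reliability: proportion of observed-score variance that is true-score variance
    \rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
    % Averaging k independent items or raters shrinks error variance toward \sigma_E^2 / k,
    % which is why adding questions or raters can increase reliability.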
Why are we talking about
reliability in a question
writing session ?
To create your own evaluation
questions and ensure their
reliability
To share or use other
evaluations whose reliability
has been established
To read the evaluation
literature
Reliability
Reliability is the "consistency"
or "repeatability" of your
measures.
If you could create 1 perfect
test question (unbiased and
perfectly representative of the
task) you would need only that
one question
OR if you could find 1 perfect
rater (unbiased and fully
understanding the task) you
would need only one rater
Reliability Estimates
Test Designers use four
correlational methods to check
the reliability of an evaluation:
1. the test-retest method (pre-test /
post-test),
2. alternate forms,
3. internal consistency,
4. and inter-rater reliability.
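As an illustration of the internal-consistency approach (a hedged sketch with invented ratings, not material from the session), Cronbach's alpha can be computed directly from an item-by-respondent score matrix:

    # Sketch: Cronbach's alpha for internal consistency (illustrative data).
    import numpy as np

    def cronbach_alpha(scores):
        """scores: 2-D array, rows = respondents, columns = items."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                          # number of items
        item_vars = scores.var(axis=0, ddof=1)       # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical ratings from 5 respondents on 4 items of a 6-point scale
    ratings = [[5, 4, 5, 4],
               [3, 3, 2, 3],
               [6, 5, 6, 5],
               [4, 4, 3, 4],
               [2, 3, 2, 2]]
    print(round(cronbach_alpha(ratings), 2))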
Generalizability
One measure based on Score
Variances
– Generalizability Theory
Problems with
Correlation Methods
Based on comparing portions of a
test to one another (split-half,
coefficient α, ICC).
– Assumes that all portions are strictly
parallel (measuring the same skill,
knowledge, attitude)
Test-Retest assumes no learning
has occurred in the interim.
Inter-rater reliability only provides
consistency of raters across an
instrument of evaluation
UNLIKE A MATH TEST, ALL
CLINICAL SITUATIONS ARE NOT
PARALLEL…
Methods based on
Score Variance
Generalizability Theory
– Based on Analysis of Variance
(ANOVA)
– Can parse out the differences in
the sources of error
For example, capture the essence
of differing clinical situations
Generalizability
Studies
Two types:
– G study
ANOVA is derived from the actual #
of facets (factors) that you put into
the equation
Produces a G coefficient (similar to
r or α)
– D study
Allows you to extrapolate to other
testing formats
Produces a D coefficient
G Study Example

FACET (FACTOR)                LABEL    #
Professors scoring activity     P      3
Students in class               S     10
# items tested                  I      2

SOURCE OF ERROR    VARIANCE COMPONENT    % ERROR
Professors               11.5              52%
Test items                0.09             0.4%

COEF G = 0.46
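As a rough sketch of how such a coefficient is assembled (the variance components below are hypothetical stand-ins, not the actual estimates behind this example), for a fully crossed students x professors x items design the relative G coefficient is the student variance divided by student variance plus error:

    # Sketch: relative G coefficient for a crossed students (s) x professors (p) x items (i) design.
    # All variance components are hypothetical; a real G study estimates them via ANOVA.
    var_s   = 3.6    # students (the object of measurement)
    var_sp  = 11.5   # student x professor interaction (rater disagreement)
    var_si  = 0.09   # student x item interaction
    var_res = 2.0    # residual error
    n_p, n_i = 3, 2  # professors and items actually used

    rel_error = var_sp / n_p + var_si / n_i + var_res / (n_p * n_i)
    coef_g = var_s / (var_s + rel_error)
    print(round(coef_g, 2))  # ~0.46 with these made-up numbers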
What can we do about
this problem?
Train the raters
Increase the # of raters
Would increasing the # of test
items help?
Changing the Number of Raters

P        3     6     12    18    24    30
S       10    10     10    10    10    10
I        2     2      2     2     2     2
Coef G  0.45  0.61   0.75  0.82  0.85  0.89
D Study Example
Changing the Number of Items

P        3     3     3     3     3     3
S       10    10    10    10    10    10
I        2     4     8    16    32    40
Coef G  0.45  0.46  0.46  0.47  0.47  0.47
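A D study simply re-evaluates the same formula with different numbers of raters or items. Reusing the hypothetical variance components from the sketch above reproduces the pattern in these tables: adding professors raises the coefficient substantially, while adding items barely moves it because the item variance component is tiny.

    # Sketch: D-study extrapolation, reusing the hypothetical variance components from above.
    def coef_g(var_s, var_sp, var_si, var_res, n_p, n_i):
        rel_error = var_sp / n_p + var_si / n_i + var_res / (n_p * n_i)
        return var_s / (var_s + rel_error)

    for n_p in (3, 6, 12, 18, 24, 30):   # vary number of professors, items fixed at 2
        print("raters:", n_p, round(coef_g(3.6, 11.5, 0.09, 2.0, n_p, 2), 2))

    for n_i in (2, 4, 8, 16, 32, 40):    # vary number of items, professors fixed at 3
        print("items:", n_i, round(coef_g(3.6, 11.5, 0.09, 2.0, 3, n_i), 2))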
Reliability Goals
Reliability coefficients are
commonly interpreted as follows:
– < 0.50 poor
– 0.50-0.70 moderate
– 0.70-0.90 good
– > 0.90 excellent
Interrater Reliability
(Kappa)
IRR is not really a measure of
the test reliability, rather a
property of the raters
– It does not tell us anything about
the inherent variability within the
questions themselves
– Rather, it reflects the
quality of the raters,
or misalignment of one
rater/examinee dyad
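For two raters scoring the same trainees on a categorical item, Cohen's kappa can be computed as below (a sketch with invented ratings; scikit-learn provides one common implementation):

    # Sketch: inter-rater agreement between two faculty raters (hypothetical ratings).
    from sklearn.metrics import cohen_kappa_score

    rater_a = ["competent", "progressing", "competent", "minimal", "competent", "progressing"]
    rater_b = ["competent", "competent", "competent", "minimal", "progressing", "progressing"]
    print(round(cohen_kappa_score(rater_a, rater_b), 2))  # chance-corrected agreement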
Reliability
Evaluation Reliability
(consistency) is an essential
but not sufficient requirement
for validity
Validity
Validity is a property of
evaluation scores. Valid
evaluation scores are ones with
which accurate inferences can
be made about the examinee’s
performance.
The inferences can be in the
areas of :
– Content knowledge
– Performance ability
– Attitudes, behaviors and attributes
Three types of test
score validity
1. Content
– Inferences from the scores can be
generalized to a larger domain of
items similar to those on the test
itself
Example (content validity): board
scores
2. Criterion
– Score inferences can be
generalized to performance on
some real behavior (present or
anticipated) of practical importance
Example
– Present behavioral generalization
(concurrent validity ): OSCE
– Future behavioral generalization
(predictive validity): MCAT
Validity
3. Construct
– Score inferences have “no
criterion or universe of content
accepted as entirely adequate to
define the quality to be measured”
(Cronbach and Meehl, 1955), but
the inferences can be drawn
under the label of a particular
psychological construct
Example : professionalism
Example Question:
Does not demonstrate
extremes of behavior
Communicates well
Uses lay terms when
discussing issues
Is seen as a role model
Introduces oneself and role in
the care team
Skillfully manages difficult
patient situations
Sits down to talk with patients
Process of Validation
Define the intended purposes/
use of inferences to be made
from the evaluation
Five Arguments for Validity
(Messick, 1995)
– Content
– Substance
– Structure
– Generalizability
– Consequence
Generalizability
Inferences from this
performance task can be
extended to like tasks
– Task must be representative
(not just simple to measure)
– Should represent the domain as
fully as practically possible
Example: Multiple Mini Interview
(MMI)
Why are validity
statements critical now?
Performance evaluation is now at
the crux of credentialing and
certification decisions.
We are asked to measure
constructs…not just content
and performance abilities.
Gentle Words of
Wisdom: Begin with the
End in Mind
What do you want as
your outcomes? What
is the purpose of your
evaluation?
Be prepared to put in
the time with pretesting
for reliability and
understandability
The faculty member,
nurse, patient, resident
has to be able to
understand the intent of
the question - and each
must find it credible and
interpret it in the same
way
Adding more items to
the test may not always
be the answer to
increased reliability
Gentle Words of
Wisdom Continued…
Relevancy and Accuracy –
If the questions aren’t framed
properly, if they are too vague or
too specific, it’s impossible to get
any meaningful data.
– Question miswording can lead to
skewed data with little or no
usefulness.
– Ensure your response scales are
balanced and appropriate.
– If you don't plan or know how you
are going to use the data, don't ask
the question!
Gentle Words of
Wisdom Continued…
Use an appropriate number of
questions based on your
evaluation's purpose.
If you are using aggregated
data, the statistical analyses
must be appropriate for your
evaluation; otherwise, however
sophisticated and impressive they
look, the numbers generated will
actually be false and misleading.
Are differences really significant
given your sample size?
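For example (an illustrative check with made-up ratings, not part of the original slides), an apparently large difference between two rotations' mean ratings may not reach significance with only a handful of respondents:

    # Sketch: comparing mean ratings from two rotations with small samples.
    from scipy import stats

    rotation_a = [4, 5, 4, 3, 5]   # ratings from 5 residents
    rotation_b = [3, 4, 3, 4, 3]   # ratings from 5 residents
    t_stat, p_value = stats.ttest_ind(rotation_a, rotation_b)
    print(round(p_value, 3))       # with n = 5 per group, p stays above 0.05 despite a 0.8-point gap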
Summary: Evaluation
Do’s and Don’ts
DOs
Keep Questions
Clear, Precise
and Relatively
Short.
Use a balanced
response scale
– (4-6 scale points
recommended)
Use open ended
questions
Use an
appropriate
number of
questions
DON’Ts
Do not use
Double-Barreled
Questions
Do not use
Double Negative
Questions
Do not use
Loaded or
Leading
Questions.
Don’t assume
there is no need
for rater training
Ready to Play the
Game?
Questions?