Designing Trustworthy & Reliable GME Evaluations
Conference Session: SES85
2011 ACGME Annual Education Conference
Nancy Piro, PhD, Program Manager/Ed Specialist
Alice Edler, MD, MPH, MA (Educ)
Ann Dohn, MA, DIO
Bardia Behravesh, EdD, Manager/Ed Specialist
Stanford Hospital & Clinics, Department of Graduate Medical Education

Overall Questions
– What is assessment… an evaluation? How are they different? What are they used for?
– Why do we evaluate?
– How do we construct a useful evaluation?
– What is cognitive bias? How do we eliminate bias from our evaluations?
– What is validity? What is reliability?

Defining the Rules of the "Game"

Assessment - Evaluation: What's the difference and what are they used for?
Assessment is the analysis and use of data by residents or subspecialty residents (trainees), faculty, program directors and/or departments to make decisions about improvements in teaching and learning.
Evaluation is the analysis and use of data by faculty to make judgments about trainee performance. Evaluation includes obtaining accurate, performance-based, empirical information which is used to make competency decisions on trainees across the six domains.

Evaluation Examples
Example 1: A trainee delivers an oral presentation at a Journal Club. The faculty member provides a critique of the delivery and content accompanied by a rating for the assignment.
Example 2: A program director provides a final evaluation to a resident accompanied by an attestation that the resident has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently.

Why do we assess and evaluate? (Besides the fact it is required…)
– Demonstrate and improve trainee competence in core and related competency areas - knowledge and application
– Ensure our programs produce graduates, each of whom "has demonstrated sufficient ability and acquired the appropriate clinical and procedural skills to practice competently and independently."
– Track the impact of curriculum/organizational change
– Gain feedback on program, curriculum and faculty effectiveness
– Provide residents/fellows a means to communicate confidentially
– Provide an early warning system
– Identify gaps between competency-based goals and individual performance

So What's the Game Plan for Constructing Effective Evaluations?
Without a plan… evaluations can take on a life of their own!!

How do we construct a useful evaluation?
STEP 1. Create the Evaluation (Plan) – Curriculum (competency) goals, objectives and outcomes; question and scale development
STEP 2. Deploy (Do) – Online / in-person (paper)
STEP 3. Analyze (Study/Check) – Reporting, benchmarking and statistical analysis; rank order / norms (within the institution or national)
STEP 4. Take Action (Act) – Develop & implement learning/action plans; measure progress against learning goals; adjust learning/action plans

Question and Response Scale Construction
Two Basic Goals:
1. Construct unbiased, unconfounded, and non-leading questions that produce valid data
2. Design and use unbiased and valid response scales

What is cognitive bias?
Cognitive bias is distortion in the way we perceive reality or information. Response bias is a particular type of cognitive bias which can affect the results of an evaluation if respondents answer questions in the way they think they are designed to be answered, or with a positive or negative bias toward the examinee.
Where does response bias occur?
Response bias most often occurs in the wording of the question.
– Response bias is present when a question contains a leading phrase or words.
– Response bias can also occur in rating scales.
Response bias can also lie in the raters themselves:
– Halo Effect
– Devil Effect
– Similarity Effect
– First Impressions

Step 1: Create the Evaluation

Question Construction
Example (1): "I can always talk to my Program Director about residency-related problems."
Example (2): "Sufficient career planning resources are available to me and my program director supports my professional aspirations."
Example (3): "Incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment."
Example (4): "Communication in my subspecialty program is good."
Example (5): "The pace on our service is chaotic."

Exercise One
Review each question and share your thinking about what makes it a good or bad question.

Question Construction: Test Your Knowledge
Example 1: "I can always talk to my Program Director about residency-related problems."
Problem: Terms such as "always" and "never" will bias the response in the opposite direction.
Result: Data will be skewed.

Example 2: "Sufficient career planning resources are available to me and my program director supports my professional aspirations."
Problem: Double-barreled - resources and aspirations. Respondents may agree with one and not the other. The researcher cannot make valid assumptions about which part of the question respondents were rating.
Result: Data is useless.

Example 3: "Evidences incomplete, inaccurate medical interviews, physical examinations; incomplete review and summary of other data sources. Fails to analyze data to make decisions; poor clinical judgment."
Problem: Septuple-barreled. Respondents may need to agree with some parts and not others. The evaluator cannot make assumptions about which part of the question respondents were rating.
Result: Data is useless.

Example 4: "Communication in my subspecialty program is good."
Problem: The question is too broad. If the score is less than 100% positive, the researcher/evaluator still does not know what aspect of communication needs improvement.
Result: Data is of little or no usefulness.

Example 5: "The pace on our service is chaotic."
Problem: The question is negative, and broadcasts a bad message about the rotation/program.
Result: Data will be skewed, and the climate may be negatively impacted.

Evaluation Question Design Principles
Avoid 'double-barreled' questions. A double-barreled question combines two or more issues or "attitudinal objects" in a single question.

Avoiding Double-Barreled Questions
Example: Patient Care Core Competency
"Resident provides sensitive support to patients with serious illness and to their families, and arranges for on-going support or preventive services if needed."
Scale: Minimal Progress / Progressing / Competent

Combining two or more questions into one makes it unclear which attribute is being measured, as each part may elicit a different perception of the resident's performance.
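As an aside, the "and"/"or" check described in the tip that follows is easy to automate when screening a large question bank. Below is a minimal, hypothetical sketch - the question list and helper name are invented for illustration and are not from the session materials:

```python
# Hypothetical sketch: screen draft evaluation questions for possible
# double-barreled wording by flagging "and" / "or" as whole words.
# The question list below is invented for illustration only.
import re

draft_questions = [
    "Career planning resources are available to me and my program director "
    "supports my professional aspirations.",
    "Rate the resident's ability to communicate with patients.",
]

def possibly_double_barreled(question: str) -> bool:
    """Return True if the question contains 'and' or 'or' as a whole word."""
    return bool(re.search(r"\b(and|or)\b", question, flags=re.IGNORECASE))

for q in draft_questions:
    label = "CHECK" if possibly_double_barreled(q) else "ok"
    print(f"[{label}] {q}")
```

This is only a screening heuristic: "and" or "or" can of course appear in perfectly sound single-barreled questions, so each flagged item still needs a human read.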
RESULT: Respondents are confused and results are confounded, leading to unreliable or misleading results.
Tip: If the word "and" or the word "or" appears in a question, check to verify whether it is a double-barreled question.

Evaluation Question Design Principles
Avoid questions with double negatives. When respondents are asked for their agreement with a negatively phrased statement, double negatives can occur.
– Example: Do you agree or disagree with the following statement?
  "Attendings should not be required to supervise their residents during night call."
If you respond that you disagree, you are saying you do not think attendings should not be required to supervise residents - in other words, you believe that attendings should be required to supervise residents.
If you do use a negative word like "not", consider highlighting the word by underlining or bolding it to catch the respondent's attention.

Evaluation Question Design Principles
Because every question is measuring something, it's important for each to be clear and precise. Remember: your goal is for each respondent to interpret the meaning of each question in exactly the same way.
If your respondents are not clear on what is being asked in a question, their responses may result in data that cannot or should not be applied to your evaluation results.
– "For me, further development of my medical competence is important enough to take risks."
– Does this mean to take risks with patient safety, risks to one's pride, or something else?

Keep questions short. Long questions can be confusing.
Bottom line: Focus on short, concise, clearly written statements that get right to the point, producing actionable data that can inform individual learning plans (ILPs).
– Take only seconds to respond to/rate
– Easily interpreted

Do not use "loaded" or "leading" questions. A loaded or leading question biases the response given by the respondent. A loaded question is one that contains loaded words.
– For example: "I'm concerned about doing a procedure if my performance would reveal that I had low ability" (Disagree / Agree)
How can "I'm concerned about doing a procedure on my unit if my performance would reveal that I had low ability" be answered with "agree or disagree" if you think you have good abilities in appropriate tasks for your area?

A leading question is phrased in such a way that it suggests to the respondent that a certain answer is expected:
– Example: Don't you agree that nurses should show more respect to residents and attendings?
  Yes, they should show more respect
  No, they should not show more respect

Use of Open-Ended Questions
– Comment boxes after negative ratings: to explain the reasoning and target areas for focus and improvement
– General, open-ended questions at the end of the evaluation: these can prove beneficial - often it is found that entire topics that should have been included have been omitted from the evaluation.

Evaluation Question Design Principles – Exercise 2 "Post Test"
1. Please rate the general surgery resident's communication and technical skills
2. Rate the resident's ability to communicate with patients and their families
3. Rate the resident's abilities with respect to case familiarization: effort in reading about the patient's disease process and familiarity with operative and post-op care
4. Residents deserve higher pay for all the hours they put in, don't they?
5. Explains and performs steps in resuscitation and stabilization
6. Do you agree or disagree that residents shouldn't have to pay for their meals when on call?
7. Demonstrates an awareness of and responsiveness to the larger context of health care
8. Demonstrates ability to communicate with faculty and staff

Bias in the Rating Scales for Questions
The scale you construct can also skew your data, much as we discussed for question construction.

Evaluation Design Principles: Rating Scales
By far the most popular scale asks respondents to rate their agreement with the evaluation questions or statements - the "stems". After you decide what you want respondents to rate (competence, agreement, etc.), you need to decide how many levels of rating you want them to be able to make.
Using too few levels gives less precise information, while using too many can make the question hard to read and answer (do you really need a 9- or 10-point scale?). Determine how fine a distinction you want to be able to make between agreement and disagreement.
Psychological research has shown that a 6-point scale with three levels of agreement and three levels of disagreement works best. An example would be:
Disagree Strongly / Disagree Moderately / Disagree Slightly / Agree Slightly / Agree Moderately / Agree Strongly
This scale affords you ample flexibility for data analysis. Depending on the questions, other scales may be appropriate, but the important thing to remember is that the scale must be balanced, or you will build in a biasing factor. Avoid "neutral" and "neither agree nor disagree" - you're just giving up 20% of your evaluation 'real estate'.

Evaluation Design Principles: Rating Scales
1. Please rate the volume and variety of patients available to the program for educational purposes.
   Poor / Fair / Good / Very Good / Excellent
2. Please rate the performance of your faculty members.
   Poor / Fair / Good / Very Good / Excellent
3. Please rate the competence and knowledge in general medicine.
   Poor / Fair / Good / Very Good / Excellent
The data will be artificially skewed in the positive direction using this scale because there are far more (4:1) positive than negative rating options… yet we see this scale being used all the time!

Gentle Words of Wisdom…
Avoid large numbers of questions:
– Respondent fatigue - the respondent tends to give similar ratings to all items without giving much thought to individual items, just wanting to finish
– In situations where many items are considered important, a large number can receive very similar ratings at the top end of the scale
– Items are not traded off against each other, and therefore many items that are not at the extreme ends of the scale, or that are considered similarly important, are given a similar rating

Gentle Words of Wisdom…
Avoid large numbers of questions… but ensure your evaluation is both valid and has enough questions to be reliable.
How many questions (raters) are enough?
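The slides that follow answer this with generalizability theory. As a rough preview under simpler classical-test-theory assumptions, the Spearman-Brown prophecy formula shows how the reliability of an averaged rating grows as raters are added. This is a minimal sketch, not the presenters' method; the single-rater reliability of 0.21 is a hypothetical value chosen only so the projections land near the rater D-study table shown on a later slide.

```python
# Minimal sketch (classical test theory, not the G-theory approach used on
# the following slides): the Spearman-Brown prophecy formula projects the
# reliability of the average of k parallel raters from the reliability of
# a single rater. The single-rater value below is hypothetical.

def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Projected reliability of the mean of k parallel raters."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)

r1 = 0.21  # hypothetical single-rater reliability
for k in (3, 6, 12, 18, 24, 30):
    print(f"{k:>2} raters -> projected reliability {spearman_brown(r1, k):.2f}")
```

Note the diminishing returns: the jump from 3 to 6 raters is much larger than the jump from 24 to 30, the same pattern the rater D-study table on a later slide shows.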
Not intuitive - a little bit of math is necessary (sorry).
True score = Observed score ± Error score

Why are we talking about reliability in a question-writing session?
– To create your own evaluation questions and ensure their reliability
– To share/use other evaluations that are assuredly reliable
– To read the evaluation literature

Reliability
Reliability is the "consistency" or "repeatability" of your measures. If you could create one perfect test question (unbiased and perfectly representative of the task), you would need only that one question; or, if you could find one perfect rater (unbiased and fully understanding the task), you would need only one rater.

Reliability Estimates
Test designers use four correlational methods to check the reliability of an evaluation:
1. the test-retest method (pre-test / post-test),
2. alternate forms,
3. internal consistency,
4. and inter-rater reliability.

Generalizability
One measure is based on score variances: generalizability theory.

Problems with Correlation Methods
Based on comparing portions of a test to one another (split-half, coefficient α, ICC):
– Assumes that all portions are strictly parallel (measuring the same skill, knowledge, attitude)
– Test-retest assumes no learning has occurred in the interim
– Inter-rater reliability only provides consistency of raters across an instrument of evaluation
UNLIKE A MATH TEST, ALL CLINICAL SITUATIONS ARE NOT PARALLEL…

Methods Based on Score Variance
Generalizability theory:
– Based in analysis of variance (ANOVA)
– Can parse out the differences in the sources of error - for example, capture the essence of differing clinical situations

Generalizability Studies
Two types:
– G study: the ANOVA is derived from the actual number of facets (factors) that you put into the equation; produces a G coefficient (similar to r or α)
– D study: allows you to extrapolate to other testing formats; produces a D coefficient

G Study Example
Facet (factor)                Label   #
Professors scoring activity   P       3
Students in class             S       10
Items tested                  I       2
Variance contributions: Professors 11.5%; Test items 0.09 (0.4%); error 52%. Coef G = 0.46

What can we do about this problem?
– Train the raters
– Increase the number of raters
– Would increasing the number of test items help?

D Study Example: Changing the Number of Raters
P       3     6     12    18    24    30
S       10    10    10    10    10    10
I       2     2     2     2     2     2
Coef G  0.45  0.61  0.75  0.82  0.85  0.89

D Study Example: Changing the Number of Items
P       3     3     3     3     3     3
S       10    10    10    10    10    10
I       2     4     8     16    32    40
Coef G  0.45  0.46  0.46  0.47  0.47  0.47

Reliability Goals
Reliability coefficients can be judged against the following rough benchmarks:
– < 0.50 poor
– 0.50–0.70 moderate
– 0.70–0.90 good
– > 0.90 excellent

Interrater Reliability (Kappa)
IRR is not really a measure of the test's reliability; rather, it is a property of the raters (a small worked example appears at the end of these slides).
– It does not tell us anything about the inherent variability within the questions themselves
– Rather, it reflects the quality of the raters, or the misalignment of one rater/examinee dyad

Reliability
Evaluation reliability (consistency) is an essential but not sufficient requirement for validity.

Validity
Validity is a property of evaluation scores. Valid evaluation scores are ones with which accurate inferences can be made about the examinee's performance. The inferences can be in the areas of:
– Content knowledge
– Performance ability
– Attitudes, behaviors and attributes

Three types of test score validity
1. Content – Inferences from the scores can be generalized to a larger domain of items similar to those on the test itself.
   Example (content validity): board scores
2. Criteria – Score inferences can be generalized to performance on some real behavior (present or anticipated) of practical importance.
   Example - present behavioral generalization (concurrent validity): OSCE
   Example - future behavioral generalization (predictive validity): MCAT
3. Construct – Score inferences apply where there is "no criterion or universe of content accepted as entirely adequate to define the quality to be measured" (Cronbach and Meehl, 1955), but the inferences can be drawn under the label of a particular psychological construct.
   Example: professionalism
   Example questions:
   – Does not demonstrate extremes of behavior
   – Communicates well
   – Uses lay terms when discussing issues
   – Is seen as a role model
   – Introduces oneself and one's role in the care team
   – Skillfully manages difficult patient situations
   – Sits down to talk with patients

Process of Validation
Define the intended purposes/use of the inferences to be made from the evaluation.
Five arguments for validity (Messick, 1995):
– Content
– Substance
– Structure
– Generalizability
– Consequence

Generalizability
Inferences from this performance task can be extended to like tasks:
– The task must be representative (not just simple to measure)
– It should represent the domain as fully as is practically possible
Example: Multiple Mini Interview (MMI)

Why are validity statements critical now?
Performance evaluation is at the crux of credentialing and certification decisions. We are asked to measure constructs… not just content and performance abilities.

Gentle Words of Wisdom: Begin with the End in Mind
– What do you want as your outcomes? What is the purpose of your evaluation?
– Be prepared to put in the time with pretesting for reliability and understandability.
– The faculty member, nurse, patient or resident has to be able to understand the intent of the question - and each must find it credible and interpret it in the same way.
– Adding more items to the test may not always be the answer to increased reliability.

Gentle Words of Wisdom Continued…
Relevancy and Accuracy
– If the questions aren't framed properly - if they are too vague or too specific - it's impossible to get any meaningful data.
– Question miswording can lead to skewed data with little or no usefulness.
– Ensure your response scales are balanced and appropriate.
– If you don't plan or know how you are going to use the data, don't ask the question!

Gentle Words of Wisdom Continued…
– Use an appropriate number of questions based on your evaluation's purpose.
– If you are using aggregated data, the statistical analyses must be appropriate for your evaluation; otherwise, however sophisticated and impressive, the numbers generated will look real but actually be false and misleading. Are differences really significant given your sample size?

Summary: Evaluation Do's and Don'ts
DO's
– Keep questions clear, precise and relatively short.
– Use a balanced response scale (4-6 scale points recommended).
– Use open-ended questions.
– Use an appropriate number of questions.
DON'Ts
– Do not use double-barreled questions.
– Do not use double-negative questions.
– Do not use loaded or leading questions.
– Don't assume there is no need for rater training.

Ready to Play the Game?
Questions?
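Closing illustration (referenced from the Interrater Reliability slide): a minimal sketch of Cohen's kappa for two raters scoring the same trainees on a pass/fail item. The ratings are invented for illustration and are not from the session; kappa corrects observed agreement for the agreement expected by chance.

```python
# Minimal sketch: Cohen's kappa for two raters over the same trainees.
# Ratings are hypothetical. kappa = (observed - expected) / (1 - expected),
# where "expected" is the agreement two independent raters would reach by chance.
from collections import Counter

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # about 0.43 for these ratings
```

A low kappa on well-written questions points toward rater training or a misaligned rater/examinee dyad rather than toward the instrument itself, which is the distinction the kappa slide draws.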