Fundamentals of Assessment and Grading
Alice CHUANG, MD
Department of Obstetrics and Gynecology
University of North Carolina-Chapel Hill
Chapel Hill, NC
AOE Basic Teaching Skills Curriculum
April 16, 12:00 PM, Bondurant G010
APGO Clerkship Directors’ School
• Neither I nor my spouse has any financial interests to disclose related to this talk.
Objectives
• Understand reliability and validity
• Contrast formative and summative evaluation
• Compare and contrast norm-referenced and criterion-referenced assessments
• Improve delivery of feedback
• Understand the NBME exam
• Be familiar with different testing formats, their uses and their limitations
Terminology
• Validity: Are we measuring what we think we're measuring?
  - Content: Does the instrument measure the depth and breadth of the content of the course? Does it inadvertently measure something else?
  - Construct: Do the evaluation criteria or grading construct allow for true measurement of the knowledge, skills, or attitudes taught in the course? Is any part of the grading construct irrelevant?
  - Criterion: Does the outcome correlate with true competencies? Does it relate to important current or future events? Is the assessment relevant to future performance?
http://pareonline.net/getvn.asp?v=7&n=10
Examples
• Validity
  - Content: A summative ob/gyn test which covered only obstetrics
  - Construct: You allow students to use their textbook for a knowledge-based multiple choice test of foundational information on prenatal care.
  - Criterion: New Coke v. Old Coke
Terminology
• Reliability: Are our measurements consistent? The score should be the same no matter when the test is taken, who scores it, or when it is scored.
  - Interrater reliability: Is a student's score consistent between evaluators?
  - Intrarater reliability: Is a student's score consistent with the same rater even if rated under different circumstances?
  - Scoring rubric: a standardized method of grading to increase interrater and intrarater reliability
http://pareonline.net/getvn.asp?v=7&n=10
Examples:
In general, if you repeat the same assessment, will you get the same answer?
• Interrater: 3 individuals are asked to go to the beach and estimate how many seagulls they see from 6-7 AM, and they come up with 200, 800, and 1200.
• Intrarater: A particular food critic always gives low scores for food quality if the server is female.
One common way to quantify this kind of agreement is sketched below.
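Agreement between raters can be put on a number. Below is a minimal sketch in Python of Cohen's kappa, which corrects raw interrater agreement for the agreement expected by chance; the two raters' scores are hypothetical, not data from the talk:

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Interrater agreement for categorical scores, corrected for chance."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance agreement: both raters independently pick the same category
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
        return (observed - expected) / (1 - expected)

    # Two evaluators scoring ten students on a 0-3 rubric (made-up data)
    a = [3, 2, 2, 1, 3, 0, 2, 1, 3, 2]
    b = [3, 2, 1, 1, 3, 0, 2, 2, 3, 2]
    print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.71: good, not perfect

A kappa of 1 is perfect agreement and 0 is chance-level; by convention, values above roughly 0.6 are read as substantial agreement.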
Examples: Show Choir Audition Rubric

Singing Skills
  Poor candidate (0 points): Sings with as much expression as a wet noodle; cannot identify which tune the candidate is singing; also cannot identify what the lyrics of the song are, secondary to poor pronunciation
  Fair candidate (1 point): Minimally expressive; pitch off significantly on occasion; diction unclear at times
  Good candidate (2 points): Very expressive; sings on pitch most of the time with minor errors; diction clear most of the time
  Superior candidate (3 points): Artistically expressive; sings on pitch; diction clear

Dancing Skills
  Poor candidate (0 points): Has 2 left feet; unable to learn new steps and continues to dance like MC Hammer despite different choreography demonstrated
  Fair candidate (1 point): Missteps despite multiple attempts; no artistic expression in dance moves; unable to learn new choreography after 3 demonstrations
  Good candidate (2 points): Occasionally missteps, but overall dance steps are accurate; adapts to choreography fairly rapidly
  Superior candidate (3 points): Quick and nimble; dances artistically; able to learn new choreography quickly

Enthusiasm for Show Choir
  Poor candidate (0 points): Freely admits not knowing what GLEE is
  Fair candidate (1 point): Endorses enjoyment of GLEE, but unable to identify a favorite character
  Good candidate (2 points): Has watched 70% of GLEE episodes
  Superior candidate (3 points): Has seen every episode of GLEE; all GLEE albums confirmed in iTunes library; has been to GLEE LIVE each summer

A minimal scoring sketch based on this rubric follows.
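Part of what makes a rubric reliable is that it converts judgment into fixed, checkable arithmetic. A minimal sketch in Python of scoring against the audition rubric above (the helper names are hypothetical):

    AUDITION_CRITERIA = ("singing skills", "dancing skills", "enthusiasm")

    def score_audition(ratings: dict) -> int:
        """Total a candidate's rubric score; each criterion must be rated 0-3."""
        assert set(ratings) == set(AUDITION_CRITERIA), "rate every criterion"
        assert all(0 <= p <= 3 for p in ratings.values()), "points are 0-3"
        return sum(ratings.values())

    print(score_audition({"singing skills": 2, "dancing skills": 3, "enthusiasm": 1}))  # 6 of 9

Because every rater works from the same cells and point values, two raters (interrater) or one rater on two occasions (intrarater) should land on similar totals.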
Formative v. summative assessments
• Formative: ongoing assessment, designed to help improve the educational program as well as learner progress
• Summative: designed to evaluate a student's overall performance at the end of an educational phase and to evaluate the effectiveness of teaching
http://fcit.usf.edu/assessment/basic/basica.html
Examples
• Formative: a short multiple choice exam written in-house that is pass/fail; answers are reviewed with the class at the end of the testing session
• Summative: NBME exam
Formative v. summative assessments
• ED30: The directors of all courses and clerkships must design and implement a system of formative and summative evaluation of student achievement in each course and clerkship.
Those responsible for the evaluation of student performance should understand the uses and limitations of various test formats, the purposes and benefits of criterion-referenced vs. norm-referenced grading, reliability and validity issues, formative vs. summative assessment, etc.
Formative v. summative assessments
• ED31: Each student should be evaluated early enough during a unit of study to allow time for remediation.
• ED32: Narrative descriptions of student performance and of non-cognitive achievement should be included as part of evaluations in all required courses and clerkships where teacher-student interaction permits this form of assessment.
Formative v. summative assessments

Uses for assessments          Formative                              Summative
Purpose                       Feedback for learning                  Certification/grading
Breadth of scope              Narrow focus on specific objectives    Broad focus on general goals
Scoring                       Explicit feedback                      Overall performance
Learner affective response    Little anxiety                         Moderate to high anxiety
Target audience               Learner                                Society
Characteristics of feedback
Effective feedback is given with the goal of improvement. It is:
• timely
• honest
• respectful
• clear
• issue-specific
• objective
• supportive
• motivating
• action-oriented
• solution-oriented
Destructive feedback is unhelpful. It is:
• accusatory
• personal
• judgmental
• subjective
It also undermines the self-esteem of the receiver, leaves the issue unresolved, and leaves the receiver unsure how to proceed.
http://www.expressyourselftosuccess.com/the-importance-of-providing-constructive-feedback/
Feedback…from APGO/CREOG 2011
• When you…
• You give the impression…
• I would stop…
• I would recommend…instead
Norm-referenced v. criterion-referenced assessments
• Norm-referenced
  - Purpose is to classify students in order of achievement from low to high
  - Allows comparisons of students
  - May not give accurate information regarding student abilities
  - Half of the students should score above the midpoint score and the other half should score below it
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ 2009;43(12).
Norm-referenced v. criterion-referenced assessments
• Criterion-referenced
  - Purpose is to evaluate students' knowledge and skills against a pre-determined goal performance level
  - Gives information about a student's achievement of certain objectives
  - It should be possible for everyone to earn a passing score
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ 2009;43(12).
Example
• Norm-referenced: Soccer tryouts where 11 players are chosen out of 40
• Criterion-referenced: Test for a driver's license
The operational difference between the two is sketched below.
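Operationally, the two approaches differ only in where the cutoff comes from. A minimal sketch in Python (hypothetical helper names; scores are plain numbers):

    def norm_referenced_pass(scores, n_pass):
        """Norm-referenced: rank the cohort and pass a fixed number of examinees."""
        cutoff = sorted(scores, reverse=True)[n_pass - 1]
        return [s >= cutoff for s in scores]  # ties at the cutoff all pass

    def criterion_referenced_pass(scores, standard):
        """Criterion-referenced: pass everyone who meets the pre-set standard."""
        return [s >= standard for s in scores]

    tryout = [88, 92, 75, 60, 81]
    print(norm_referenced_pass(tryout, n_pass=2))          # only the top 2 pass
    print(criterion_referenced_pass(tryout, standard=70))  # everyone at/above 70 passes

Under norm-referencing the number of passers is fixed in advance regardless of how well the cohort performs; under criterion-referencing the pass rate can legitimately run from 0% to 100%, which is why everyone can earn a passing score.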
Norm-referenced v. criterion-referenced assessments
• Be sure your assessment is appropriately norm-referenced or criterion-referenced, and that it is designed with this in mind.
• Most assessments in medical education are criterion-referenced.
• Norm-referenced tests should emphasize variability; criterion-referenced tests should emphasize accuracy of tested material.
NBME
• Exams
  - Developed by committees and content experts
  - Same protocol used to build Step 1 and Step 2
• In general
  - Subject exams provided to all 130 LCME-accredited medical schools in the US
  - 8 Canadian medical schools
  - 8 osteopathic medical schools
  - 22 international medical schools
NBME
• Scaled to have a mean of 70 and SD of 8, based on 9,000 first-time test takers from 80+ schools who took the exam as an end-of-clerkship exam in 1993-94
• Scores do not reflect the percentage of questions answered correctly.
A rough percentile conversion under a normality assumption is sketched below.
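Given the published scale, an approximate percentile can be computed if the score distribution is assumed roughly normal. A minimal sketch in Python; this is a back-of-envelope conversion, not an NBME formula:

    from statistics import NormalDist

    # Scale per the slide: mean 70, SD 8 in the 1993-94 norming cohort
    scale = NormalDist(mu=70, sigma=8)

    def approx_percentile(score):
        """Rough percentile rank of a scaled subject-exam score."""
        return 100 * scale.cdf(score)

    for s in (60, 70, 78, 86):
        print(f"score {s}: ~{approx_percentile(s):.0f}th percentile")  # ~11, 50, 84, 98

Note that the empirical 2011-2012 norms in the next table run lower (a 62 sits at the 6th percentile, not the ~16th this sketch gives), because current examinees outperform the 1993-94 reference cohort to which the scale is anchored.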
NBME: What do those scores mean?

Percentile ranks, 2011-2012:

Score          Total year   Q1   Q2   Q3   Q4
93 or above        98       99   98   97   97
92                 97       98   98   97   96
86                 90       93   91   89   88
80                 75       80   77   73   71
78                 67       71   69   63   62
74                 49       54   51   45   44
70                 29       33   32   26   25
62                  6        7    6    5    4
60                  3        4    4    3    2

A score of 60 in the fourth quarter means that 2% of the examinees in the fourth quarter scored 60 or below!
NBME: Academic purpose for exam

Purpose                       %
Advanced placement             5
Course/clerkship              95
Year-end                      12
Make-up                       21
Minimal competence            44
Identify at-risk students     23
Practice for USMLE            47
Promotion requirement         37
Review course                  1
Student self-assessment       26
Other                          4

Total responses: 78 (schools could report more than one purpose)
NBME: Weight given the subject exam

Weight given the subject exam     %
1-10%                              4
11-20%                            16
21-30%                            33
31-40%                            39
41-50%                            13
>50%                               0

Total number responding: 70
NBME 2008 Clerkship Survey Results

Assessment/Evaluation Method               Ob/gyn (%)
Computer Case Simulations                     0.5
Subject Exam                                 30
School's MCQ Exam                             9
Observation and evaluation by residents      28
Observation and evaluation by faculty        26
Oral exam                                    14
OSCE                                         12
Peer evaluation                               1
Standardized patient exam                     3
Other                                        18

Total number responding: 81
NBME
• 2004 and 2009 surveys of performance guidelines across clerkships
• Recommend setting an absolute rather than a relative standard for performance
  - Angoff procedure: item-based; judges estimate the proportion of minimally proficient examinees who would answer each question correctly (sketched below)
  - Hofstee method: judges set minimum and maximum acceptable passing scores and failure rates; these bounds are then plotted against the exam's actual score and failure-rate curve to locate the cut score
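Once judges have rated the items, the Angoff cut score is simple arithmetic. A minimal sketch in Python with hypothetical ratings from three judges on a five-item exam:

    # Each judge estimates the probability that a minimally proficient
    # (borderline) examinee answers each item correctly.
    judge_ratings = [
        [0.6, 0.8, 0.5, 0.9, 0.7],  # judge 1, items 1-5
        [0.5, 0.7, 0.6, 0.8, 0.6],  # judge 2
        [0.7, 0.9, 0.5, 0.9, 0.8],  # judge 3
    ]

    # A judge's ratings sum to the expected score of a borderline examinee;
    # the cut score is the mean of those sums across judges.
    cuts = [sum(r) for r in judge_ratings]
    cut_score = sum(cuts) / len(cuts)
    print(f"Angoff cut score: {cut_score:.1f} out of {len(judge_ratings[0])}")  # 3.5 out of 5

The Hofstee method, by contrast, does not rate individual items; it brackets acceptable passing scores and failure rates and reads the cut score off the exam's actual score distribution.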
Testing Formats
• Multiple choice exam (MCQ)
• Objective structured clinical examination (OSCE)
• Oral examination
• Direct observation
• Simulation
• Standardized patient
• Patient/procedure log
• Medical record reviews
• Written essay questions
Casey et al. To the point: reviews in medical education – the Objective Structured Clinical Examination. AJOG, Jan 2009.
Testing format: MCQ
• Use distractors which could plausibly represent the correct answer
• Use a question format, not a complete-the-statement format
• Emphasize higher-level thinking, not strict memorization
• Keep option length consistent within a question
• Balance the placement of the correct answer
• Use correct grammar
• Avoid clues to the correct answer
• Highly reliable and valid for assessing knowledge
http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Testing format: OSCE
• Examinees rotate through a circuit of stations (5-10 minutes each)
• One-on-one examination (with an examiner or a trained or simulated patient)
• List of criteria for successful completion of each station
• Each station tests a specific skill or competency
• Good for examining higher-order skills, clinical and technical skills
• Requires a large amount of resources
Testing format: Oral Exam
• Portfolio-based: similar to the case-based portion of Oral Boards
• Poor inter-rater and intra-rater reliability
• Scores are higher when scored live versus on video
• Teaching students how to do better on an oral exam does not improve scores
• Practicing oral exams does improve scores
• A mock public oral exam improves performance
• Limitations
  - Halo effect (grade reflects not only performance on the exam but also previous experience)
  - Subconscious consensus grading: examiners take subconscious cues from each other
Burch & Seggie, 2008; Kearney et al, 2002; Burchard et al, 2007; Jacobsohn et al, 2006
Testing format: Oral Exam
• Is an oral exam justified? Is there an advantage?
• Does the material lend itself to open questioning?
• How will communication skills and delivery of information be graded? Will only content be graded?
• Is the examiner experienced? Will he/she skew grades in any way?
• How will you prepare students for the exam?
• Is there enough time to examine every student adequately?
• How much prompting/assistance is allowed during the oral examination? How much time will you allow for "thinking"? How will you ensure consistency in these areas for all examinees?
http://www.cord.edu/faculty/ulnessd/oral/MCarlson/questions.html
Testing format: Direct observation
• Formalized criteria
• Various observers
• True-to-life clinical setting (versus simulated)
• Numerical scores
• Comment anchored
• Improve reliability with multiple perspectives
• Consider 360° evaluation (including self, patient and other staff members)
Testing format

                       MCQ    OSCE   Oral exam   Direct obs
Content                +++    ++     +           +
Construct              +++    ++     +           +
Criterion              +      ++     +           +
Reliability            +++    ++     +           +
Formative              Y      Y      Y           Y
Summative              Y      Y      Y           Y
Norm-referenced        Y      N      N           N
Criterion-referenced   Y      Y      Y           Y
General rules of thumb
Be sure your assessment
• Provides reliable data
• Provides valid data
• Provides valuable data
• Is feasible
• Can be incorporated into the systems in place (hospital, clinic, curriculum, etc.)
• Is consistent with course objectives
• Utilizes multiple instruments, multiple assessors, and multiple points of assessment
• Aligns with pre-specified criteria
• Is fair
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and
Implementing Assessment Systems. ACGME.
References
Bond, Linda A. (1996). Norm- and criterion-referenced testing. Practical Assessment, Research &
Evaluation, 5(2). Accessed at http://pareonline.net/getvn.asp?v=5&n=2
Burch VC, Seggie JL. Use of a structured interview to assess portfolio-based learning. Med Ed 2008:
42: 894-900.
Burchard K et al. Is it live or is it Memorex? Student oral examinations and the use of video for additional scoring. Am J Surg 2007;193:233-236.
Casey et al, To the point: reviews in medical education – the Objective Structured Clinical
Examination. AJOG, Jan 2009.
Jacobsohn E, Kock PA, Avidan M. Poor inter-rater reliability on mock anesthesia oral examinations.
Kearney RA et al. The inter-rater and intra-rater reliability of a new Canadian oral examination format in anesthesia is fair to good. Can J Anesth 2002;49(3):232-236.
Lynch and Swing. Key Considerations for Selecting Assessment Instruments and Implementing
Assessment Systems. ACGME.
Metheny WP, Espey EL, Bienstock J, et al. To the point: Medical education reviews evaluation in
context: Assessing learners, teachers, and training programs. Am J Obstet Gynecol.
2005;192(1):34-37.
Moskal, Barbara M. & Jon A. Leydens (2000). Scoring rubric development: validity and reliability.
Practical Assessment, Research & Evaluation, 7(10). Retrieved December 29, 2009 from
http://PAREonline.net/getvn.asp?v=7&n=10
Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ 2009;43(12).
References
14 Rules for Writing Multiple-Choice Questions. Brigham Young University 2001 Annual Conference. Accessed at http://testing.byu.edu/info/handbooks/14%20Rules%20for%20Writing%20Multiple-Choice%20Questions.pdf
Formative vs. Summative Assessments. Classroom Assessment. Accessed at:
http://fcit.usf.edu/assessment/basic/basica.html
NBME 2008 Clinical Clerkship Director Survey Results. Accessed at https://portal.nbme.org/web/medschools/home?p_p_id=62_INSTANCE_dOGM&p_p_action=0&p_p_state=maximized&p_p_mode=view&p_p_col_id=column1&p_p_col_count=1&_62_INSTANCE_dOGM_struts_action=%2Fjournal_articles%2Fview&_62_INSTANCE_dOGM_keywords=&_62_INSTANCE_dOGM_advancedSearch=false&_62_INSTANCE_dOGM_andOperator=true&_62_INSTANCE_dOGM_groupId=1172&_62_INSTANCE_dOGM_searchArticleId=&_62_INSTANCE_dOGM_version=1.0&_62_INSTANCE_dOGM_name=&_62_INSTANCE_dOGM_description=&_62_INSTANCE_dOGM_content=&_62_INSTANCE_dOGM_type=&_62_INSTANCE_dOGM_structureId=&_62_INSTANCE_dOGM_templateId=&_62_INSTANCE_dOGM_status=approved&_62_INSTANCE_dOGM_articleId=817480
Objective Structured Clinical Examination. Wikipedia. Accessed at
http://en.wikipedia.org/wiki/Objective_structured_clinical_examination
Reliability and Validity. Classroom Assessment. Accessed at:
http://fcit.usf.edu/assessment/basic/basicc.html
Talk about teaching: Significant issues in Oral Examinations. Contributed by Meryl Carlson,
Concordia College, Moorhead, MN. Accessed at
http://www.cord.edu/faculty/ulnessd/oral/MCarlson/questions.html