Confidence-based assessment in the
1st year medical end-of-year exam
Tony Gardner-Medwin
Physiology, UCL

• a useful study tool, but why in exams?
• it reflects the proper meaning of knowledge
• conventional marking disadvantages able students
• how did the students do in the exam?
• confidence-based assessment was a more reliable measure of student ability
• it saves on the number of questions required
What is Knowledge?
Knowledge depends on degree of belief, or confidence. For a true proposition, its lack can be measured as -log2(confidence*):

• knowledge: -log2(confidence*) = 0 (full confidence in the truth)
• uncertainty, then increasing ignorance: between 0 and 1
• nescience: = 1 (confidence = 0.5, a pure guess)
• misconception, delusion: >> 1 (confidence well below 0.5)
Measurement of knowledge requires the eliciting of
confidence (or *subjective probability) for the truth of
correct statements.
This requires a proper scheme of incentives
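As a sketch, the scale above can be written out directly (`surprisal` is an illustrative name, not from the poster; the argument is the subjective probability assigned to a true proposition):

```python
import math

def surprisal(confidence):
    """Lack-of-knowledge measure, in bits, for a true proposition:
    -log2 of the subjective probability assigned to its truth."""
    return -math.log2(confidence)

# Landmarks on the scale above:
assert surprisal(1.0) == 0     # knowledge: full confidence in the truth
assert surprisal(0.5) == 1     # nescience: a pure 50/50 guess
assert surprisal(0.1) > 3      # misconception/delusion: >> 1 bit
```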
LAPT confidence-based scoring scheme

Confidence level   Score if correct   Score if incorrect   Optimal when P(correct)   (odds)
C=1                      1                    0                  < 67%               < 2:1
C=2                      2                   -2                 67%-80%             2:1-4:1
C=3                      3                   -6                  > 80%               > 4:1

[Figure: subjective expectation of score plotted against subjective probability of being correct (0.5 to 1), one line per confidence level. The lines cross at P = 0.67 (odds 2:1) and P = 0.8 (odds 4:1): whichever C level matches the student's true confidence also maximises the expected score, so honest reporting is rewarded. A second panel shows -log2(subjective probability) on the same axis.]
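A minimal sketch of the scheme (my own code, not LAPT's) confirming that the break-even points fall exactly at the 2:1 and 4:1 odds quoted above:

```python
# C level -> (score if correct, score if incorrect), as in the LAPT scheme
SCHEME = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def expected_score(c, p):
    """Subjective expectation of score at level c, given P(correct) = p."""
    right, wrong = SCHEME[c]
    return p * right + (1 - p) * wrong

def best_level(p):
    """Confidence level that maximises the expected score."""
    return max(SCHEME, key=lambda c: expected_score(c, p))

assert best_level(0.60) == 1    # below 67% (odds < 2:1): C=1 is best
assert best_level(0.75) == 2    # between 67% and 80%: C=2 is best
assert best_level(0.90) == 3    # above 80% (odds > 4:1): C=3 is best
```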

conventional marking disadvantages able students
Suppose 4 students go for the same answer options in an exam: 75% correct, 25% incorrect.
• Ai is confident of all his answers
• Bo is very hesitant about all her answers
• Cy is realistic (expects 75%), but can't distinguish reliable & uncertain answers
• Di is confident of 50 answers (90% correct) and uncertain of the others (60% correct)
Clearly: Di > Cy > Bo, Ai. Di has extra insight, about her own knowledge or maybe about subtleties in the questions. How can she use this insight?

Conventional scoring: her only option is to omit uncertain answers:
• % correct: Ai = Bo = Cy = 75%, Di = 45%
• negative marking (±1): Ai = Bo = Cy = 50%, Di = 40%

Confidence-based scoring: she can moderate her confidence:
• Ai enters all at C=3, Bo all at C=1: Ai = Bo = 25%
• Cy enters all at C=2: Cy = 33%
• Di splits her answers between C=3 and C=1: Di = 48%
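The four students' confidence-based marks can be checked with a short sketch (assumed setup: 100 questions, expected scores divided by the maximum of 3 per question; the exact percentages depend on the scaling convention used, but the ordering is robust):

```python
# C level -> (score if correct, score if incorrect)
SCHEME = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def conf_score(answers):
    """answers: list of (C level, P(correct)) pairs, one per question.
    Returns the expected score as a fraction of the maximum (all correct at C=3)."""
    total = sum(p * SCHEME[c][0] + (1 - p) * SCHEME[c][1] for c, p in answers)
    return total / (3 * len(answers))

ai = conf_score([(3, 0.75)] * 100)                     # confident throughout
bo = conf_score([(1, 0.75)] * 100)                     # hesitant throughout
cy = conf_score([(2, 0.75)] * 100)                     # realistic but uniform
di = conf_score([(3, 0.90)] * 50 + [(1, 0.60)] * 50)   # discriminates

assert abs(ai - 0.25) < 1e-9 and abs(bo - 0.25) < 1e-9
assert abs(cy - 1/3) < 1e-9
assert di > cy > bo == ai      # Di's insight is rewarded
```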
Summary aims
• reward the ability to distinguish reliable and uncertain answers (whatever the reason for uncertainty)
• penalise confident errors more than errors from uncertainty

What people sometimes think is the aim:
• to penalise a general over-confidence or under-confidence (probably helped by practice & feedback, but not an exam issue; NB over-confidence can actually get you places!)
How well did students discriminate?
exam: 500 T/F Qs, in 2 sessions, each 2hrs
331 students: 190 F, 141 M
[Figure: % correct, with 5% and 95% percentiles (axis 50-100%), at each confidence level C=1, C=2 and C=3, shown separately for in-course use (i-c) and for the exam, female (exF) and male (exM) students.]
[Figure A: confidence-based score vs. conventional scaled score ("simple" score); both axes from 0% (50% correct) to 100%. Curve a: fitted power law y = x^1.67. Line b: equality, expected only for a pure mix of certain knowledge and total guesses. Curve c: scores if uncertainty is homogeneous and correctly reported. Curve d: theoretical scores for homogeneous uncertainty, based on an information-theoretic measure.]
Breakdown of credit and variance due to uncertainty
[Chart: at each confidence level (C=1, C=2, C=3), the proportion of answers, of correct answers, of credit and of variance, shown for simple scores and for confidence-based scores.]
Simple scores (scaled conventional scores): 65% of the variance came from answers at C=1, but only 18% of the credit.
Confidence-based scores give less weight to uncertain answers; uncertainty variance is then more nearly in proportion to credit, and was reduced by 46% (relative to the variation of student marks).
Exam marks are determined by:
1. the student's knowledge and skills in the subject area
2. the level of difficulty of the questions
3. chance factors in the way questions relate to details of the student's knowledge
4. chance factors in the way uncertainties are resolved (luck)

(1) = "signal" (its measurement is the object of the exam)
(3, 4) = "noise" (random factors obscuring the "signal")

Confidence-based marks improve the "signal-to-noise ratio". The most convincing test of this is to compare marks on one set of questions with marks for the same student on a different set. A good correlation means we are measuring something about the student, not just "noise".
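A toy simulation (assumed numbers, not the exam data) of this split-half test: students with a fixed latent ability answer two disjoint sets of T/F questions, and their half-scores are correlated across students:

```python
import random
import statistics

random.seed(1)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# 331 students, each with a latent ability = P(correct); 250 Qs per half.
abilities = [random.uniform(0.55, 0.95) for _ in range(331)]
half1 = [sum(random.random() < p for _ in range(250)) for p in abilities]
half2 = [sum(random.random() < p for _ in range(250)) for p in abilities]

r = pearson_r(half1, half2)
assert r > 0.8   # ability ("signal") dominates question-level luck ("noise")
```

Shrinking the spread of abilities or the number of questions per half lowers r, which is exactly the sense in which a more statistically efficient score needs fewer questions.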
[Figures B and C: scatter plots, one point per student, of score on question set 2 against score on set 1 (axes 0-100%). B: simple scores, R² = 0.735. C: confidence-based scores, R² = 0.814.]
The correlation, across students, between scores on one set of questions and another is higher for confidence-based than for simple scores. But perhaps they are just measuring ability to handle confidence? No: confidence scores are better than simple scores at predicting even the conventional scores on a different set of questions. This can only be because they are a statistically more efficient measure of knowledge.
[Figure D: each student's simple score on question set 2 against the confidence-based score on set 1 raised to the power 0.6 ("set 1 (conf^0.6)", linearising it against simple scores); R² = 0.776. Axes 0-100%.]
How should one handle students with poor calibration?
Significantly overconfident: 2 students (1%); e.g. 50% correct @C=1, 59% @C=2, 73% @C=3.
Significantly underconfident: 41 students (14%); e.g. 83% correct @C=1, 89% @C=2, 99% @C=3.
Maybe one shouldn't penalise such students.
Adjusted confidence-based score: mark the set of answers at each C level as if they were entered at the C level that gives the highest score.
Mean benefit = 1.5% ± 2.1% (median 0.6%).
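A minimal sketch (an assumed implementation, not LAPT's code) of that adjustment: each batch of answers entered at one C level is re-marked at whichever level would score it highest, so the mark can never go down:

```python
# C level -> (score if correct, score if incorrect)
SCHEME = {1: (1, 0), 2: (2, -2), 3: (3, -6)}

def batch_score(n_correct, n_wrong, c):
    right, wrong = SCHEME[c]
    return n_correct * right + n_wrong * wrong

def adjusted_score(batches):
    """batches: {entered C level: (n correct, n wrong)}. Each batch is
    re-marked at the C level that gives it the highest score."""
    return sum(max(batch_score(nc, nw, c) for c in SCHEME)
               for nc, nw in batches.values())

# An underconfident student (cf. 83% @C=1, 89% @C=2, 99% @C=3 above):
raw = {1: (83, 17), 2: (89, 11), 3: (99, 1)}
plain = sum(batch_score(nc, nw, c) for c, (nc, nw) in raw.items())
adj = adjusted_score(raw)
assert adj >= plain   # the adjustment never lowers the mark
```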
Mean values of r² for 16 random partitionings of the 500 questions: score on one set vs. score on the other.

[Chart: mean r² (scale 0.700-0.900) for the pairings simple:simple, conf:conf, conf(adj):conf(adj), simple:conf and simple:conf(adj).]

                                simple   conf   conf(adj)
Signal/noise variance ratio:      2.8     5.3      4.3
Saving in no. of Qs required:      -      48%      35%
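The quoted savings follow arithmetically from the signal/noise ratios: noise variance falls as 1/N with the number of questions, so matching the reliability of simple scoring needs N scaled by the ratio of per-question S/N values. A sketch under that assumption:

```python
def saving(snr_simple, snr_better):
    """Fraction of questions saved at equal reliability, assuming noise
    variance is inversely proportional to the number of questions."""
    return 1 - snr_simple / snr_better

assert round(saving(2.8, 5.3), 2) == 0.47   # close to the 48% quoted for conf
assert round(saving(2.8, 4.3), 2) == 0.35   # the 35% quoted for conf(adj)
```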
SUMMARY CONCLUSIONS
• Adjusted confidence scores seem the best scores to use (they don't discriminate on the basis of the calibration of a person's confidence judgements, and are also the best predictors of performance on a separate set of questions).
• Reliable discrimination of student knowledge can be achieved with one third fewer questions, compared with conventional scoring.
• Confidence scoring is not only fundamentally more fair (rewarding students who can correctly identify which answers are uncertain) but also more efficient at measuring performance.
• www.ucl.ac.uk/~cusplap
Principles that students seem readily to understand:
• both under- and over-confidence are impediments to learning
• confident errors are far worse than acknowledged ignorance, and are a wake-up call (-6!) to pay attention to explanations
• expressing uncertainty when you are uncertain is a good thing
• thinking about the basis and reliability of answers can help tie bits of knowledge together (to form "understanding")
• checking an answer and rereading the question are worthwhile
• sound confidence judgement is a valued intellectual skill in every context, and one they can improve