Statistical methods for assessment of agreement
Professor Dr. Bikas K Sinha
Applied Statistics Division
Indian Statistical Institute
Kolkata, INDIA
Organized by
Department of Statistics, RU
17 April, 2012
Lecture Plan
Agreement for Categorical Data [Part I]
09.00 – 10.15 hrs
Coffee Break: 10.15 – 10.30 hrs
Agreement for Continuous Data [Part II]
10.30 – 11.45 hrs
Discussion.....11.45 hrs – 12.00 hrs
Key References
Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational & Psychological Measurement, 20(1): 37–46. [Famous for Cohen’s Kappa]
Cohen, J. (1968). Weighted Kappa: Nominal Scale Agreement with Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70(4): 213–220.
References ….contd.
Banerjee, M., Capozzoli, M., McSweeney, L. & Sinha, D. (1999). Beyond Kappa: A Review of Interrater Agreement Measures. Canadian Journal of Statistics, 27(1): 3–23.
Lin, L. I. (1989). A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics, 45: 255–268.
References …contd.
Lin, L. I. (2000). Total Deviation Index for Measuring Individual Agreement: With Application in Lab Performance and Bioequivalence. Statistics in Medicine, 19: 255–270.
Lin, L. I., Hedayat, A. S., Sinha, Bikas & Yang, Min (2002). Statistical Methods in Assessing Agreement: Models, Issues, and Tools. Journal of the American Statistical Association, 97(457): 257–270.
Measurements : Provided by
Experts / Observers / Raters
• Could be two or more systems, assessors, chemists, psychologists, radiologists, clinicians, nurses, rating systems or raters, diagnoses or treatments, instruments or methods, processes, techniques or formulae……
Diverse Application Areas…
• Cross-checking of data for agreement; acceptability of a new or generic drug, of test instruments against standard instruments, or of a new method against a gold-standard method; statistical process control…
Nature of Agreement Problems…
• Assessment & Recording of Responses …
• Two Assessors for evaluation and recording…
• The Raters examine each “unit” independently of one another and report separately: “+” for “Affected” or “-” for “OK” : Discrete Type.
Summary Statistics
UNIT Assessment Table
Assessor I \ Assessor II      +       -
        +                    40%      3%
        -                     3%     54%
Q. What is the extent of agreement of the two assessors ?
Nature of Data
Assessor I \ Assessor II      +       -
        +                    93%      2%
        -                     4%      1%

Assessor I \ Assessor II      +       -
        +                     3%     40%
        -                    44%     13%

Same Question : Extent of Agreement / Disagreement ?
Cohen’s Kappa : Nominal Scales
Cohen (1960)
Proposed Kappa statistic for measuring agreement when the responses are nominal
Cohen’s Kappa
• Rater I vs Rater II : 2 x 2 Case
• Categories Yes (Y) & No (N) : cell proportions pi(i, j)
• (Y,Y) & (N,N) : Agreement proportions
• (Y,N) & (N,Y) : Disagreement proportions
• Theta_0 = pi(Y,Y) + pi(N,N) = P[Agreement]
• Theta_e = pi(Y,.) pi(.,Y) + pi(N,.) pi(.,N) = P[Chance Agreement]
• K = [Theta_0 – Theta_e] / [1 – Theta_e]
Chance-corrected Agreement Index
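A minimal Python sketch of this index (not part of the original lecture; the function name cohen_kappa and the list-of-lists input format are illustrative assumptions):

def cohen_kappa(table):
    # table: square list of lists (counts or proportions); rows = Rater I, columns = Rater II
    c = len(table)
    total = float(sum(sum(row) for row in table))
    theta_0 = sum(table[i][i] for i in range(c)) / total                       # observed agreement
    row_marg = [sum(row) / total for row in table]                             # Rater I marginals
    col_marg = [sum(table[i][j] for i in range(c)) / total for j in range(c)]  # Rater II marginals
    theta_e = sum(r * s for r, s in zip(row_marg, col_marg))                   # chance agreement
    return (theta_0 - theta_e) / (1.0 - theta_e)                               # K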
Kappa Computation….
                Rater II
Rater I        Yes     No    Total
   Yes        0.40   0.03     0.43
   No         0.03   0.54     0.57
   Total      0.43   0.57     1.00

Theta_0 = Observed Agreement = pi(Y,Y) + pi(N,N) = 0.40 + 0.54 = 0.94 … 94%
Theta_e = Chance Factor towards agreement
        = pi(Y,.) pi(.,Y) + pi(N,.) pi(.,N)
        = 0.43 x 0.43 + 0.57 x 0.57 = 0.5098 … 51%
K = [Theta_0 – Theta_e] / [1 – Theta_e] = 0.4302 / 0.4902
  = 0.8776 … 87.76% Chance-Corrected Agreement
Kappa Computations..
• Raters I vs II

  0.40   0.03
  0.03   0.54      K =  0.8776

  0.93   0.02
  0.04   0.01      K =  0.2208

  0.03   0.40
  0.54   0.03      K = - 0.8439

  0.02   0.93
  0.01   0.04      K = - 0.0184
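The four K values above can be reproduced with the cohen_kappa sketch given earlier (repeated here so the snippet runs on its own; purely illustrative):

def cohen_kappa(table):
    c = len(table)
    total = float(sum(sum(row) for row in table))
    theta_0 = sum(table[i][i] for i in range(c)) / total
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(table[i][j] for i in range(c)) / total for j in range(c)]
    theta_e = sum(r * s for r, s in zip(row_marg, col_marg))
    return (theta_0 - theta_e) / (1.0 - theta_e)

for t in ([[0.40, 0.03], [0.03, 0.54]],    # expect  0.8776
          [[0.93, 0.02], [0.04, 0.01]],    # expect  0.2208
          [[0.03, 0.40], [0.54, 0.03]],    # expect -0.8439
          [[0.02, 0.93], [0.01, 0.04]]):   # expect -0.0184
    print(round(cohen_kappa(t), 4))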
Nature of Categorical Data
Illustrative Example
Study on Diabetic Retinopathy Screening
Problem : Interpretation of
Single-Field Digital Fundus Images
Assessment of Agreement
WITHIN / ACROSS 4 EXPERT GROUPS
Retina Specialists / General Ophthalmologists /
Photographers / Nurses : 3 from each Group
Description of Study Material
400 Diabetic Patients
Selected randomly from a community hospital
One Good Single-Field Digital Fundus Image
Taken from each patient with Signed Consent
Approved by Ethical Committee on
Research with Human Subjects
Raters : Allowed to Magnify / Move the Images,
but NOT to Modify Brightness / Contrast
THREE Major Features
#1. Diabetic Retinopathy Severity [6 options]
No Retinopathy / Mild / Moderate NPDR /
Severe NPDR / PDR / Ungradable
#2. Macular Edema [3 options]
Presence / Absence / Ungradable
#3. Referrals to Ophthalmologists [3 options]
Referrals / Non-Referrals / Uncertain
Retina Specialists’ Ratings [ DR ]
RS1 \ RS2
CODES      0     1     2     3     4     9   Total
  0      247     2     2     1     0     0     252
  1       12    18     7     1     0     0      38
  2       22    10    40     8     0     1      81
  3        0     0     3     2     2     0       7
  4        0     0     0     1     9     0      10
  9        5     0     1     0     0     6      12
Total    286    30    53    13    11     7     400
Retina Specialists’ Ratings [ DR ]
RS1 \ RS3
CODES      0     1     2     3     4     9   Total
  0      249     2     0     1     0     0     252
  1       23     8     7     0     0     0      38
  2       31     4    44     2     0     0      81
  3        0     0     7     0     0     0       7
  4        0     0     0     0    10     0      10
  9        9     1     0     0     0     2      12
Total    312    15    58     3    10     2     400
Retina Specialists’ Ratings [ DR ]
RS2 \ RS3
CODES      0     1     2     3     4     9   Total
  0      274     5     6     1     0     0     286
  1       16     5     8     1     0     0      30
  2       15     2    35     0     0     1      53
  3        2     2     7     1     1     0      13
  4        0     0     2     0     9     0      11
  9        5     1     0     0     0     1       7
Total    312    15    58     3    10     2     400
Retina Specialists’
Consensus Rating [ DR ]
RS1 \ RSCR
CODES      0     1     2     3     4     9   Total
  0      252     0     0     0     0     0     252
  1       17    19     2     0     0     0      38
  2       15    19    43     2     1     1      81
  3        0     0     2     4     1     0       7
  4        0     0     0     0    10     0      10
  9        8     0     0     0     0     4      12
Total    292    38    47     6    12     5     400
Retina Specialists’ Ratings
[ Macular Edema ]
RS1 \ RS2
CODES        Presence  Absence  Subtotal  Ungradable  Total
Presence         326       11       337          1      338
Absence           18       22        40          3       43
Subtotal         344       33       377         --       --
Ungradable         9        0        --         10       19
Total            353       33        --         14      400
Retina Specialists’ Ratings [ ME ]
RS1 \ RS3
CODES        Presence  Absence  Subtotal  Ungradable  Total
Presence         322       13       335          3      338
Absence            8       32        40          3       43
Subtotal         330       45       375         --       --
Ungradable        12        0        --          7       19
Total            342       45        --         13      400
Retina Specialists’
Consensus Rating [ ME ]
RS1 \ RSCR
CODES        Presence  Absence  Subtotal  Ungradable  Total
Presence         335        2       337          1      338
Absence           10       33        43          0       43
Subtotal         345       35       380         --       --
Ungradable        10        0        --          9       19
Total            355       35        --         10      400
Photographers on Diabetic ME
Photographer 1 \ Photographer 2
Codes        Presence  Absence  Subtotal  Ungradable  Total
Presence         209        5       214         51      265
Absence           65       41       106          4      110
Subtotal         274       46       320         --       --
Ungradable         2        2        --         21       25
Total            276       48        --         76      400
Photographers’ Consensus Rating on Diabetic Macular Edema
Photographer 1 \ Photographers’ Consensus Rating
Codes        Presence  Absence  Subtotal  Ungradable  Total
Presence         257        5       262          3      265
Absence           74       30       104          6      110
Subtotal         331       35       366         --       --
Ungradable        24        0        --          1       25
Total            355       35        --         10      400
Study of RS’s Agreement [ME]
2 x 2 Table : Cohen’s Kappa (K) Coefficient
Retina Specialist 1 \ Retina Specialist 2
              Presence  Absence  Subtotal
Presence          326       11       337
Absence            18       22        40
Subtotal          344       33       377
’Ungradable’ ignored so as to work with a 2 x 2 table
Observed agreement : (326 + 22) / 377 = 0.9231 = Theta_0
Chance agreement : P(Yes) x P(Yes) + P(No) x P(No) from the two raters’ marginals
= (337/377)(344/377) + (40/377)(33/377) = 0.8250 = Theta_e
K = [Theta_0 – Theta_e] / [1 – Theta_e] = 56% only !
Net agreement after standardization.
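A hedged arithmetic check of these figures in plain Python (Python 3 division assumed; this is not the study’s own code):

n = 326.0 + 11 + 18 + 22                                 # 377 images gradable by both specialists
theta_0 = (326 + 22) / n                                 # observed agreement, about 0.9231
theta_e = (337 / n) * (344 / n) + (40 / n) * (33 / n)    # chance agreement, about 0.8250
print((theta_0 - theta_e) / (1 - theta_e))               # about 0.56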
Study of Photographers’
Agreement on Macular Edema
2 x 2 Table : Cohen’s Kappa (K) Coefficient
Photographer 1 \ Photographer 2
              Presence  Absence  Subtotal
Presence          209        5       214
Absence            65       41       106
Subtotal          274       46       320
’Ungradable’ ignored so as to work with a 2 x 2 table
Observed agreement : (209 + 41) / 320 = 0.7813 = Theta_0
Chance agreement : P(Yes) x P(Yes) + P(No) x P(No) from the two raters’ marginals
= (214/320)(274/320) + (106/320)(46/320) = 0.6202 = Theta_e
K = [Theta_0 – Theta_e] / [1 – Theta_e] = 42% only !
Net agreement after standardization.
What About Multiple Ratings like
Diabetic Retinopathy [DR] ?
Retina Specialist 1 \ Retina Specialist 2
CODES      0     1     2     3     4     9   Total
  0      247     2     2     1     0     0     252
  1       12    18     7     1     0     0      38
  2       22    10    40     8     0     1      81
  3        0     0     3     2     2     0       7
  4        0     0     0     1     9     0      10
  9        5     0     1     0     0     6      12
Total    286    30    53    13    11     7     400
K Computation……
% Agreement =(247+18+40+2+9+6)/400
= 322/400 =0.8050 = Theta_0
% Chance Agreement = (252/400)(286/400) +
….+(12/400)(7/400) = 0.4860 = Theta_e
K = [Theta_0 – Theta_e] / [ 1 – Theta_e] = 62% !
Note : 100% Credit for ’Hit’ & No Credit for ’Miss’.
Criticism : Heavy penalty even for ratings that are only narrowly missed !
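The same unweighted computation on the full 6 x 6 table, sketched in Python (illustrative only; cohen_kappa is the helper sketched earlier, repeated so this block is self-contained):

def cohen_kappa(table):
    c = len(table)
    total = float(sum(sum(row) for row in table))
    theta_0 = sum(table[i][i] for i in range(c)) / total
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(table[i][j] for i in range(c)) / total for j in range(c)]
    theta_e = sum(r * s for r, s in zip(row_marg, col_marg))
    return (theta_0 - theta_e) / (1.0 - theta_e)

rs1_vs_rs2_dr = [                      # DR codes 0, 1, 2, 3, 4, 9
    [247,  2,  2, 1, 0, 0],
    [ 12, 18,  7, 1, 0, 0],
    [ 22, 10, 40, 8, 0, 1],
    [  0,  0,  3, 2, 2, 0],
    [  0,  0,  0, 1, 9, 0],
    [  5,  0,  1, 0, 0, 6],
]
print(round(cohen_kappa(rs1_vs_rs2_dr), 2))   # about 0.62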
Concept of Unweighted Versus Weighted Kappa
Table of Weights for 6x6 Ratings
Ratings     1       2       3       4       5       6
   1         1     24/25   21/25   16/25    9/25     0
   2       24/25     1     24/25   21/25   16/25    9/25
   3       21/25   24/25     1     24/25   21/25   16/25
   4       16/25   21/25   24/25     1     24/25   21/25
   5        9/25   16/25   21/25   24/25     1     24/25
   6         0      9/25   16/25   21/25   24/25     1
Formula : w_ij = 1 – [(i – j)^2 / (6 – 1)^2]
Formula for Weighted Kappa
• Theta_0(w) = sum_i sum_j w_ij f_ij / n
• Theta_e(w) = sum_i sum_j w_ij (f_i. / n)(f_.j / n)
• The double sums run over ALL cells
• For unweighted Kappa we take into account only the cell frequencies along the main diagonal, each with 100% weight
Computations for
Weighted Kappa
• Theta_0(w) = …
• Theta_e(w) = …
• Weighted Kappa = [Theta_0(w) – Theta_e(w)] / [1 – Theta_e(w)]
• Unweighted Kappa = …
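A minimal Python sketch of weighted Kappa under the weights w_ij = 1 – (i – j)^2 / (c – 1)^2 shown earlier (illustrative only; note that it treats the last category as ordinal position c, which may not be appropriate for an ’ungradable’ code):

def weighted_kappa(table):
    # table: square c x c list of lists of frequencies f_ij; rows = Rater I, columns = Rater II
    c = len(table)
    n = float(sum(sum(row) for row in table))
    w = [[1.0 - (i - j) ** 2 / float((c - 1) ** 2) for j in range(c)] for i in range(c)]
    row_marg = [sum(row) / n for row in table]                                 # f_i. / n
    col_marg = [sum(table[i][j] for i in range(c)) / n for j in range(c)]      # f_.j / n
    theta_0w = sum(w[i][j] * table[i][j] / n for i in range(c) for j in range(c))
    theta_ew = sum(w[i][j] * row_marg[i] * col_marg[j] for i in range(c) for j in range(c))
    return (theta_0w - theta_ew) / (1.0 - theta_ew)

# e.g. weighted_kappa(rs1_vs_rs2_dr) for the 6 x 6 DR table shown earlier;
# replacing w by the identity matrix recovers the unweighted Kappa.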
K works for pairwise evaluation of Raters’ agreement …….
K statistics for Pairs of Raters…
Categories DR ME Referral
Retina Specialists
1 vs 2 0.63 0.58 0.65
1 vs 3 0.55 0.64 0.65
2 vs 3 0.56 0.51 0.59
1 vs CGroup 0.67 0.65 0.66
2 vs CGroup 0.70 0.65 0.66
3 vs CGroup 0.71 0.73 0.72
Unweighted Kappa......
K statistics for Pairs of Raters…
Categories DR ME Referral
General Ophthalmologists
1 vs 2 0.35 0.17 0.23
1 vs 3 0.44 0.27 0.27
2 vs 3 0.33 0.19 0.27
1 vs CGroup 0.33 0.16 0.18
2 vs CGroup 0.58 0.50 0.51
3 vs CGroup 0.38 0.20 0.24
K statistics for Pairs of Raters…
Categories DR ME Referral
Photographers…..
1 vs 2 0.33 0.35 0.23
1 vs 3 0.49 0.38 0.41
2 vs 3 0.34 0.45 0.32
1 vs CGroup 0.33 0.29 0.33
2 vs CGroup 0.26 0.29 0.20
3 vs CGroup 0.39 0.49 0.49
K statistics for Pairs of Raters…
Categories DR ME Referral
Nurses………..
1 vs 2 0.28 0.15 0.20
1 vs 3 0.32 NA NA
2 vs 3 0.23 NA NA
1 vs CGroup 0.29 0.27 0.28
2 vs CGroup 0.19 0.15 0.17
3 vs CGroup 0.50 NA NA
NA : Rater #3 did NOT rate ’ungradable’.
K for Multiple Raters’ Agreement
• Judgement on the simultaneous agreement of
multiple raters across multiple nominal categories…
# Raters = n
# Subjects = k
# Mutually Exclusive & Exhaustive
Nominal Categories = c
Example....Retina Specialists (n = 3),
Patients (k = 400) & DR (6 codes)
Formula for Kappa
• Set k_ij = # raters who assign the ith subject to the jth category
• P_j = sum_i k_ij / (nk) = proportion of all assignments made to the jth category
• Chance-corrected agreement for category j :
  K_j = [sum_i k_ij^2 – knP_j{1 + (n – 1)P_j}] / [kn(n – 1)P_j(1 – P_j)]
Computation of Kappa
• Chance-corrected measure of overall agreement :
  K = [Sum_j numerator of K_j] / [Sum_j denominator of K_j]
• Interpretation : an intraclass correlation
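A Python sketch of this multi-rater Kappa, following the K_j formula above; the subjects-by-categories input (entries k_ij = number of raters choosing that category) is an assumption made for illustration:

def multi_rater_kappa(assignments):
    # assignments[i][j] = k_ij = number of the n raters who put subject i in category j
    k = len(assignments)                       # number of subjects
    c = len(assignments[0])                    # number of categories
    n = sum(assignments[0])                    # raters per subject (assumed constant)
    P = [sum(assignments[i][j] for i in range(k)) / float(n * k) for j in range(c)]
    num = den = 0.0
    for j in range(c):
        if P[j] in (0.0, 1.0):                 # a category never or always used contributes nothing
            continue
        s = sum(assignments[i][j] ** 2 for i in range(k))
        num += s - k * n * P[j] * (1 + (n - 1) * P[j])      # numerator of K_j
        den += k * n * (n - 1) * P[j] * (1 - P[j])          # denominator of K_j
    return num / den                           # overall K = sum of numerators / sum of denominators

# e.g. for the 3 retina specialists, 400 patients and 6 DR codes the input
# would be a 400 x 6 matrix with every row summing to 3.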
K statistic for multiple raters…
CATEGORIES            DR     ME    Referral
Retina Specialists   0.58   0.58    0.63
Gen. Ophthalmo.      0.36   0.19    0.24
Photographers        0.37   0.38    0.30
Nurses               0.26   0.20    0.20
All Raters           0.34   0.27    0.28
Other than the Retina Specialists, the Photographers also show relatively good agreement for DR & ME…
Conclusion based on K-Study
• Of all 400 cases…..
• 44 warranted Referral to Ophthalmologists due to
Retinopathy Severity
• 5 warranted Referral to Ophthalmologists due to uncertainty in diagnosis
• A fourth Retina Specialist carried out a Dilated
Fundus Exam of these 44 patients, and substantial agreement [K = 0.68] was noticed for
DR severity……
• The exam confirmed Referral of 38 / 44 cases.
Discussion on the Study
• Retina Specialists : All in active clinical practice :
Most reliable for digital image interpretation
• Individual Rater’s background and experience play roles in digital image interpretation
• Unusually high % of ungradable images among nonphysician raters, though only 5 out of 400 were declared as ’ungradable’ by consensus of the Retina Specialists’ Group.
• This suggests lack of confidence of the nonphysicians rather than true image ambiguity !
• For this study, other factors [blood pressure, blood sugar, cholesterol etc] not taken into account……
Cohen’s Kappa : Need for
Further Theoretical Research
• COHEN’S KAPPA STATISTIC: A CRITICAL APPRAISAL AND SOME MODIFICATIONS
• BIKAS K. SINHA^1, PORNPIS YIMPRAYOON^2 AND MONTIP TIENSUWAN^2
• ^1 : ISI, Kolkata
• ^2 : Mahidol Univ., Bangkok, Thailand
• CSA BULLETIN, 2007
CSA Bulletin (2007) Paper…
• ABSTRACT : In this paper we consider the problem of assessing agreement between two raters when the ratings are given independently on a 2-point nominal scale, and we critically examine some features of Cohen’s Kappa statistic, widely and extensively used in this context. We point out some undesirable features of K and, in the process, propose three modified Kappa statistics. Properties and features of these statistics are explained with illustrative examples.
Further Theoretical Aspects of
Kappa – Statistics ….
• Recent Study on Standardization of Kappa
• Why standardization ?
• K = [Theta_0 – Theta_e] / [ 1 – Theta_e]
Range : -1 <= K <= 1
• K = 1 iff 100% Perfect Rankings
• = 0 iff 100% Chancy Ranking
• = -1 iff 100% Imperfect BUT Split-Half
Why Split Half ?
Example
              Presence   Absence
Presence         ---        30%
Absence          70%        ---
K_C ≈ – 72 % [& not – 100 %]
************************************
Only the Split-Half pattern
              Presence   Absence
Presence         ---        50%
Absence          50%        ---
provides K_C = – 100 %
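A quick numerical check of the two 2 x 2 patterns above, using the cohen_kappa sketch from earlier (repeated so this runs on its own):

def cohen_kappa(table):
    c = len(table)
    total = float(sum(sum(row) for row in table))
    theta_0 = sum(table[i][i] for i in range(c)) / total
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(table[i][j] for i in range(c)) / total for j in range(c)]
    theta_e = sum(r * s for r, s in zip(row_marg, col_marg))
    return (theta_0 - theta_e) / (1.0 - theta_e)

print(round(cohen_kappa([[0.0, 0.30], [0.70, 0.0]]), 3))   # about -0.724: total disagreement, yet K_C > -1
print(round(cohen_kappa([[0.0, 0.50], [0.50, 0.0]]), 3))   # -1.0: only the split-half pattern reaches -100%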
KModified….
• K_C(M) = [Theta_0 – Theta_e] / [ P_I[Marginal Y] . P_I[Marginal N] + P_II[Marginal Y] . P_II[Marginal N] ]
Y : ’Presence’ Category & N : ’Absence’ Category
’I’ & ’II’ represent the Raters I & II
K_C(M) Satisfies
K_C(M) = 1 iff 100% Perfect Rankings … whatever the marginals
• = 0 iff 100% Chancy Ranking … whatever the marginals
• = – 1 iff 100% Imperfect Ranking … whatever the marginals
Other Formulae..….
• What if it is known that there is 80% Observed
Agreement, i.e., Theta_0 = 80% ?
• Is K_max = 1 ? Is K_min = – 1 ?... NOT Really....
• So we need a standardization of K_C as
• K_C(M2) = [K_C – K_C(min)] / [K_C(max) – K_C(min)], where the Max. & Min. are to be evaluated under the stipulated value of the observed agreement
Standardization yields….
K_C(M2) = [ K_C + (1 – Theta_0)/(1 + Theta_0) ] / [ Theta_0^2 / {1 + (1 – Theta_0)^2} + (1 – Theta_0)/(1 + Theta_0) ]

K_C(M3) = [ K_C(M) + (1 – Theta_0)/(1 + Theta_0) ] / [ Theta_0 / (2 – Theta_0) + (1 – Theta_0)/(1 + Theta_0) ]
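A Python sketch of the three modified statistics exactly as displayed on these slides (illustrative only; not verified against the CSA Bulletin (2007) paper beyond the formulas shown here):

def modified_kappas(table):
    # table: 2 x 2 counts or proportions with rows/columns ordered (Presence, Absence)
    n = float(sum(sum(row) for row in table))
    p = [[x / n for x in row] for row in table]
    theta_0 = p[0][0] + p[1][1]
    pI = [p[0][0] + p[0][1], p[1][0] + p[1][1]]          # Rater I marginals (Y, N)
    pII = [p[0][0] + p[1][0], p[0][1] + p[1][1]]         # Rater II marginals (Y, N)
    theta_e = pI[0] * pII[0] + pI[1] * pII[1]
    k_c = (theta_0 - theta_e) / (1.0 - theta_e)
    k_m = (theta_0 - theta_e) / (pI[0] * pI[1] + pII[0] * pII[1])
    a = (1.0 - theta_0) / (1.0 + theta_0)
    k_m2 = (k_c + a) / (theta_0 ** 2 / (1.0 + (1.0 - theta_0) ** 2) + a)
    k_m3 = (k_m + a) / (theta_0 / (2.0 - theta_0) + a)
    return k_c, k_m, k_m2, k_m3

print([round(x, 2) for x in modified_kappas([[326, 11], [18, 22]])])
# roughly [0.56, 0.56, 0.68, 0.67] for the retina specialists' ME table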
Revisiting Cohen’s Kappa…..
2 x 2 Table : Cohen’s Kappa (K) Coefficient
Retina Specialist 1 \ Retina Specialist 2
              Presence  Absence  Subtotal
Presence          326       11       337
Absence            18       22        40
Subtotal          344       33       377
K_C = 56% [computed earlier]
Kappa - Modified
• K_C(M) = 56 % [same as K_C]
• Given Theta_0 = 92.30 %
• K_C(M2) = [0.5600 + 0.0400] / [0.8469 + 0.0400] ≈ 68 %
• K_C(M3) = [0.5600 + 0.0400] / [0.8570 + 0.0400] = 67 %
Beyond Kappa …..
• A Review of Inter-rater Agreement
Measures
• Banerjee et al. (1999), Canadian Journal of Statistics, 27(1): 3–23
• Modelling Patterns of Agreement :
• Log Linear Models
• Latent Class Models
That’s it for now……
• Thanks for your attention….
• This is the End of Part I of my talk.
Bikas Sinha
UIC, Chicago
April 29, 2011