McNemar vs. Cohen's Kappa

Dennis Mecham
References
• SAS Institute Inc., SAS/STAT 9.2 User's Guide, Second Edition: "The FREQ Procedure."
• Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.
Example from SAS Help
• Medical researchers are interested in evaluating
the efficacy of a new treatment for a skin
condition. Dermatologists from participating
clinics were trained to conduct the study and to
evaluate the condition. After the training, two
dermatologists examined patients with the skin
condition from a pilot study and rated the same
patients.
                  Dermatologist 2
Dermatologist 1    Bad   Good
  Bad               29     15
  Good               8     36
Do the Dermatologists Agree?
• Possible approaches
– McNemar’s test for agreement
• H0: the proportion of patients rated "bad" by dermatologist 1
equals the proportion of patients rated "bad" by dermatologist 2
– Cohen’s Kappa
• H0: κ = 0
• Observed agreement is just due to chance
McNemar’s Test
• Q = (n12 – n21)² / (n12 + n21)
• Q follows a chi-square distribution with 1
degree of freedom
• Only uses the two cells where there is
disagreement
• Easy to compute
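As a sanity check (this step is my addition, not part of the original slides), Q can be computed directly from the two disagreement cells of the pilot table; it reproduces the PROC FREQ output shown later:

* Hand computation of McNemar's Q from the disagreement cells;
data mcnemar_check;
   n12 = 15;                            * derm1 bad, derm2 good;
   n21 = 8;                             * derm1 good, derm2 bad;
   Q   = (n12 - n21)**2 / (n12 + n21);  * = 49/23 = 2.1304;
   p   = 1 - probchi(Q, 1);             * chi-square p-value, 1 df = 0.1444;
   put Q= p=;
run;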
Cohen’s Kappa (simple)
• κ̂ = (Po – Pe) / (1 – Pe)
Where Po = Σpii and Pe = Σpi.p.i
• Po is the proportion of agreement, Pe is the
expected proportion of agreement
• κ = 1 means perfect agreement
• κ = 0 means observed agreement is due to
chance
• κ < 0 means there is disagreement (unusual in
practice)
• Under H0, κ̂ is asymptotically normal with
Var(κ̂) = [Pe + Pe² – Σpi.p.i(pi. + p.i)] / [n(1 – Pe)²]
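These quantities can also be checked by hand; the data step below is a sketch (names such as kappa_check are mine, not from the slides) that reproduces the PROC FREQ kappa results shown later:

* Hand computation of kappa and its null SE for the pilot table;
data kappa_check;
   n   = 88;
   p11 = 29/n;  p12 = 15/n;              * cell proportions;
   p21 = 8/n;   p22 = 36/n;
   p1_ = p11 + p12;   p2_ = p21 + p22;   * row marginals (derm1);
   p_1 = p11 + p21;   p_2 = p12 + p22;   * column marginals (derm2);
   Po  = p11 + p22;                      * observed agreement = 65/88;
   Pe  = p1_*p_1 + p2_*p_2;              * chance agreement = 0.5;
   kappa = (Po - Pe) / (1 - Pe);         * = 0.4773;
   var0  = (Pe + Pe**2
            - (p1_*p_1*(p1_ + p_1) + p2_*p_2*(p2_ + p_2)))
           / (n * (1 - Pe)**2);
   z = kappa / sqrt(var0);               * SE under H0 = 0.1052, Z = 4.535;
   put kappa= z=;
run;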
SAS Code
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad bad 29
bad good 15
good good 36
good bad 8
;
run;

proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;
Results
McNemar's Test
  Statistic (S)    2.1304
  DF               1
  Pr > S           0.1444

Simple Kappa Coefficient
  Kappa                   0.4773
  ASE                     0.0925
  95% Lower Conf Limit    0.2960
  95% Upper Conf Limit    0.6585

Test of H0: Kappa = 0
  ASE under H0         0.1052
  Z                    4.5350
  One-sided Pr > Z     <.0001
  Two-sided Pr > |Z|   <.0001
Results
• McNemar’s test
– P-value = .1444
– We fail to reject H0; there is insufficient evidence of
disagreement, so the data are consistent with agreement.
• Cohen’s Kappa
– P-value < .0001
– There is significant agreement
– κ̂ = .4773, moderate agreement
• McNemar’s test
– Easy
– More intuitive
– Null is agreement
– Perhaps more powerful when disagreement is
small compared to the amount of agreement
• Cohen’s Kappa
– Gives estimate of level of agreement
– Accounts for possibility of disagreement
– Uses all of the data
– Null is no agreement
How do the tests respond to changes in data?
• What if there is more disagreement?
– Higher numbers in the disagreeing cells should
affect McNemar’s more than Cohen’s
• What if the agreeing cells are lower?
– Cohen’s affected, but not McNemar’s
• What if the sample size is larger?
– Is it possible to have enough power that both tests
reject their null hypotheses?
SAS Code
* Scenario 1: 5 more "bad good" subjects;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad bad 29
bad good 20
good good 36
good bad 8
;
run;

* Scenario 2: agreement cells cut roughly in half;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad bad 15
bad good 20
good good 18
good bad 8
;
run;

* Scenario 3: all cell counts 10x Scenario 2;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad bad 150
bad good 200
good good 180
good bad 80
;
run;

* Run after each data step (each one re-creates SkinCondition);
proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;
• With 5 more “bad good” subjects
– McNemar p-value = .0233
– Cohen p-value < .0001
– Both tests affected, but now McNemar’s detects
significant disagreement
• With half the counts in the agreement cells
– McNemar unaffected
– Cohen p-value = .1677 (one-sided)
– The agreement is no longer large enough to offset
the disagreement in Cohen's test
• With 10x the cell counts
– McNemar p-value < .0001
– Cohen p-value = .0012 (one-sided)
– With a large enough sample, both tests are significant.
Other Tests for Agreement
• Bowker’s Test of Symmetry
– Like McNemar’s, but not restricted to 2x2 tables
• Weighted Kappa Coefficient
– Uses weights to account for differences between
categories (eg: difference between very poor and
poor, versus difference between poor and good)
• Overall Kappa Coefficient
– Used with multiple strata, assumes common kappa
among strata
• Cochran’s Q Test
– Used for 2×2×…×2 tables (several binary ratings of the same subjects)
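All four are available through PROC FREQ; the sketch below uses hypothetical data set and variable names (Ratings, rater1, rater2, clinic). Per the FREQ documentation cited above, AGREE on a square table larger than 2×2 gives Bowker's test, a stratified AGREE analysis gives the overall kappa, and AGREE on a 2×2×…×2 table gives Cochran's Q:

proc freq data=Ratings;                   * hypothetical data set;
   tables rater1*rater2 / agree;          * Bowker's test for r x r tables;
   test kappa wtkap;                      * simple and weighted kappa tests;
run;

proc freq data=Ratings;
   tables clinic*rater1*rater2 / agree;   * stratified: overall kappa across clinics;
run;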