McNemar vs. Cohen’s Kappa
Dennis Mecham

References
• SAS/STAT(R) 9.2 User's Guide, Second Edition: The FREQ Procedure
• Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Example from SAS Help
• Medical researchers are interested in evaluating the efficacy of a new treatment for a skin condition. Dermatologists from participating clinics were trained to conduct the study and to evaluate the condition. After the training, two dermatologists examined patients with the skin condition from a pilot study and rated the same patients.

                    Dermatologist 2
  Dermatologist 1    Bad    Good
  Bad                 29      15
  Good                 8      36

Do the Dermatologists Agree?
• Possible approaches
  – McNemar’s test for agreement
    • H0: the proportion of patients rated as “bad” by dermatologist 1 equals the proportion of patients rated as “bad” by dermatologist 2
  – Cohen’s Kappa
    • H0: κ = 0 (observed agreement is just due to chance)

McNemar’s Test
• Q = (n12 − n21)² / (n12 + n21)
• Under the null, Q follows a chi-square distribution with 1 degree of freedom
• Only uses the two cells where there is disagreement
• Easy to compute

Cohen’s Kappa (simple)
• κ̂ = (Po − Pe) / (1 − Pe), where Po = Σ pii and Pe = Σ pi. p.i
• Po is the observed proportion of agreement; Pe is the proportion of agreement expected by chance
• κ = 1 means perfect agreement
• κ = 0 means observed agreement is due to chance
• κ < 0 means there is disagreement (unusual in practice)
• Under the null, κ̂ is asymptotically normal with
  Var(κ̂) = [Pe + Pe² − Σ pi. p.i (pi. + p.i)] / [n (1 − Pe)²]

SAS Code

data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  29
bad  good 15
good good 36
good bad   8
;

proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;

Results

McNemar's Test
Statistic (S)          2.1304
DF                     1
Pr > S                 0.1444

Simple Kappa Coefficient
Kappa                  0.4773
ASE                    0.0925
95% Lower Conf Limit   0.2960
95% Upper Conf Limit   0.6585

Test of H0: Kappa = 0
ASE under H0           0.1052
Z                      4.5350
One-sided Pr > Z       <.0001
Two-sided Pr > |Z|     <.0001

Results
• McNemar’s test
  – P-value = .1444
  – There is insufficient evidence of disagreement between the dermatologists’ marginal proportions.
• Cohen’s Kappa
  – P-value < .0001
  – There is significant agreement
  – κ̂ = .4773, moderate agreement

• McNemar’s test
  – Easy to compute
  – More intuitive
  – Null is agreement
  – Perhaps more powerful when disagreement is small compared to the amount of agreement
• Cohen’s Kappa
  – Gives an estimate of the level of agreement
  – Accounts for agreement expected by chance
  – Uses all of the data
  – Null is no agreement
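As a check on the formulas above, both statistics can be reproduced directly from the four cell counts. This is a minimal verification sketch; the dataset and variable names are our own, not part of the SAS example:

* Verification sketch: reproduce McNemar's Q and the simple kappa by hand;
data check;
   n11 = 29; n12 = 15; n21 = 8; n22 = 36;   * pilot study cell counts;
   n  = n11 + n12 + n21 + n22;
   Q  = (n12 - n21)**2 / (n12 + n21);       * McNemar statistic;
   p_mcnemar = 1 - probchi(Q, 1);           * chi-square with 1 df;
   po = (n11 + n22) / n;                    * observed agreement Po;
   p1dot = (n11 + n12) / n;                 * row margins (dermatologist 1);
   p2dot = (n21 + n22) / n;
   pdot1 = (n11 + n21) / n;                 * column margins (dermatologist 2);
   pdot2 = (n12 + n22) / n;
   pe = p1dot*pdot1 + p2dot*pdot2;          * chance agreement Pe;
   kappa = (po - pe) / (1 - pe);
   put Q= p_mcnemar= kappa=;
run;

The log shows Q = 2.1304 and kappa = 0.4773, matching the PROC FREQ output above.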
How do tests respond to changes in data?
• What if there is more disagreement?
  – Higher numbers in the disagreeing cells should affect McNemar’s more than Cohen’s
• What if the agreeing cells are lower?
  – Cohen’s affected, but not McNemar’s
• What if the sample size is larger?
  – Is it possible to have enough power that both tests reject the null hypothesis?

SAS Code

* Scenario 1: five more "bad good" subjects;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  29
bad  good 20
good good 36
good bad   8
;
run;

* Scenario 2: half the counts in the agreement cells;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  15
bad  good 20
good good 18
good bad   8
;
run;

* Scenario 3: ten times the cell counts;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  150
bad  good 200
good good 180
good bad   80
;
run;

* Rerun after each DATA step above;
proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;

Results
• With 5 more “bad good” subjects
  – McNemar p-value = .0233
  – Cohen p-value < .0001
  – Both tests affected, but now McNemar’s detects significant disagreement
• With half the counts in the agreement cells
  – McNemar unaffected, since it ignores the agreement cells
  – Cohen p-value = .1677 (one-sided)
  – Not enough agreement to make up for the disagreement in Cohen’s test anymore
• With 10x the cell counts
  – McNemar p-value < .0001
  – Cohen p-value = .0012 (one-sided)
  – With a large enough sample, both tests are significant

Other Tests for Agreement
• Bowker’s Test of Symmetry
  – Like McNemar’s, but not restricted to 2×2 tables
• Weighted Kappa Coefficient
  – Uses weights to account for differences between categories (e.g., the difference between “very poor” and “poor” versus the difference between “poor” and “good”)
• Overall Kappa Coefficient
  – Used with multiple strata; assumes a common kappa among the strata
• Cochran’s Q Test
  – Used for 2×2×…×2 tables
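For ratings with more than two levels, the same AGREE option produces Bowker’s test of symmetry and the weighted kappa. A sketch with a hypothetical three-level version of the rating scale; the counts and the dataset name SkinCondition3 are invented for illustration:

* Hypothetical three-level ratings (invented counts);
data SkinCondition3;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  20
bad  fair  6
bad  good  2
fair bad   5
fair fair 25
fair good  7
good bad   1
good fair  4
good good 30
;

proc freq data=SkinCondition3 order=data;
   weight count;
   tables derm1*derm2 / agree;  * for a 3x3 table, AGREE prints Bowker's test
                                  of symmetry plus simple and weighted kappa;
   test wtkap;                  * asymptotic test of H0: weighted kappa = 0;
run;

With only two categories, Bowker’s test reduces to McNemar’s test and the weighted kappa equals the simple kappa, so this is a direct generalization of the 2×2 analysis above.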