McNemar vs. Cohen’s Kappa
Dennis Mecham

References
• SAS/STAT(R) 9.2 User's Guide, Second Edition: The FREQ Procedure
• Cohen, J. (1960), "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement, 20, 37–46.

Example from SAS Help
• Medical researchers are interested in evaluating the efficacy of a new treatment for a skin condition. Dermatologists from participating clinics were trained to conduct the study and to evaluate the condition. After the training, two dermatologists examined patients with the skin condition from a pilot study and rated the same patients.

                    Dermatologist 2
  Dermatologist 1    Bad    Good
  Bad                 29      15
  Good                 8      36

Do the Dermatologists Agree?
• Possible approaches
  – McNemar’s test for agreement
    • H0: the proportion of patients rated as “bad” by dermatologist 1 equals the proportion of patients rated as “bad” by dermatologist 2
  – Cohen’s Kappa
    • H0: κ = 0 (observed agreement is just due to chance)

McNemar’s Test
• Q = (n12 − n21)² / (n12 + n21)
• Under the null, Q follows a chi-square distribution with 1 degree of freedom
• Only uses the two cells where there is disagreement
• Easy to compute

Cohen’s Kappa (simple)
• κ̂ = (Po − Pe) / (1 − Pe), where Po = Σ pii and Pe = Σ pi. p.i
• Po is the observed proportion of agreement; Pe is the proportion of agreement expected by chance
• κ = 1 means perfect agreement
• κ = 0 means observed agreement is due to chance
• κ < 0 means there is disagreement (unusual in practice)
• Under the null, κ̂ is asymptotically normal with
  Var(κ̂) = [Pe + Pe² − Σ pi. p.i (pi. + p.i)] / [n (1 − Pe)²]

SAS Code

data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  29
bad  good 15
good good 36
good bad   8
;

proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;

Results

McNemar's Test
Statistic (S)          2.1304
DF                     1
Pr > S                 0.1444

Simple Kappa Coefficient
Kappa                  0.4773
ASE                    0.0925
95% Lower Conf Limit   0.2960
95% Upper Conf Limit   0.6585

Test of H0: Kappa = 0
ASE under H0           0.1052
Z                      4.5350
One-sided Pr > Z       <.0001
Two-sided Pr > |Z|     <.0001

Results
• McNemar’s test
  – P-value = .1444
  – There is insufficient evidence of disagreement between the dermatologists’ marginal proportions.
• Cohen’s Kappa
  – P-value < .0001
  – There is significant agreement
  – κ̂ = .4773, moderate agreement

• McNemar’s test
  – Easy to compute
  – More intuitive
  – Null is agreement
  – Perhaps more powerful when disagreement is small compared to the amount of agreement
• Cohen’s Kappa
  – Gives an estimate of the level of agreement
  – Accounts for agreement expected by chance
  – Uses all of the data
  – Null is no agreement
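As a check on the formulas above, both statistics can be reproduced directly from the four cell counts. This is a minimal verification sketch; the dataset and variable names are our own, not part of the SAS example:

* Verification sketch: reproduce McNemar's Q and the simple kappa by hand;
data check;
   n11 = 29; n12 = 15; n21 = 8; n22 = 36;   * pilot study cell counts;
   n  = n11 + n12 + n21 + n22;
   Q  = (n12 - n21)**2 / (n12 + n21);       * McNemar statistic;
   p_mcnemar = 1 - probchi(Q, 1);           * chi-square with 1 df;
   po = (n11 + n22) / n;                    * observed agreement Po;
   p1dot = (n11 + n12) / n;                 * row margins (dermatologist 1);
   p2dot = (n21 + n22) / n;
   pdot1 = (n11 + n21) / n;                 * column margins (dermatologist 2);
   pdot2 = (n12 + n22) / n;
   pe = p1dot*pdot1 + p2dot*pdot2;          * chance agreement Pe;
   kappa = (po - pe) / (1 - pe);
   put Q= p_mcnemar= kappa=;
run;

The log shows Q = 2.1304 and kappa = 0.4773, matching the PROC FREQ output above.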
How do tests respond to changes in data?
• What if there is more disagreement?
  – Higher numbers in the disagreeing cells should affect McNemar’s more than Cohen’s
• What if the agreeing cells are lower?
  – Cohen’s affected, but not McNemar’s
• What if the sample size is larger?
  – Is it possible to have enough power that both tests reject the null hypothesis?

SAS Code

* Scenario 1: five more "bad good" subjects;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  29
bad  good 20
good good 36
good bad   8
;
run;

* Scenario 2: half the counts in the agreement cells;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  15
bad  good 20
good good 18
good bad   8
;
run;

* Scenario 3: ten times the cell counts;
data SkinCondition;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  150
bad  good 200
good good 180
good bad   80
;
run;

* Rerun after each DATA step above;
proc freq data=SkinCondition order=data;
   weight count;
   tables derm1*derm2 / agree noprint;
   test kappa;
run;

Results
• With 5 more “bad good” subjects
  – McNemar p-value = .0233
  – Cohen p-value < .0001
  – Both tests affected, but now McNemar’s detects significant disagreement
• With half the counts in the agreement cells
  – McNemar unaffected, since it ignores the agreement cells
  – Cohen p-value = .1677 (one-sided)
  – Not enough agreement to make up for the disagreement in Cohen’s test anymore
• With 10x the cell counts
  – McNemar p-value < .0001
  – Cohen p-value = .0012 (one-sided)
  – With a large enough sample, both tests are significant

Other Tests for Agreement
• Bowker’s Test of Symmetry
  – Like McNemar’s, but not restricted to 2×2 tables
• Weighted Kappa Coefficient
  – Uses weights to account for differences between categories (e.g., the difference between “very poor” and “poor” versus the difference between “poor” and “good”)
• Overall Kappa Coefficient
  – Used with multiple strata; assumes a common kappa among the strata
• Cochran’s Q Test
  – Used for 2×2×…×2 tables
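For ratings with more than two levels, the same AGREE option produces Bowker’s test of symmetry and the weighted kappa. A sketch with a hypothetical three-level version of the rating scale; the counts and the dataset name SkinCondition3 are invented for illustration:

* Hypothetical three-level ratings (invented counts);
data SkinCondition3;
   input derm1 $ derm2 $ count;
   datalines;
bad  bad  20
bad  fair  6
bad  good  2
fair bad   5
fair fair 25
fair good  7
good bad   1
good fair  4
good good 30
;

proc freq data=SkinCondition3 order=data;
   weight count;
   tables derm1*derm2 / agree;  * for a 3x3 table, AGREE prints Bowker's test
                                  of symmetry plus simple and weighted kappa;
   test wtkap;                  * asymptotic test of H0: weighted kappa = 0;
run;

With only two categories, Bowker’s test reduces to McNemar’s test and the weighted kappa equals the simple kappa, so this is a direct generalization of the 2×2 analysis above.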