Statistics of Contingency Tables
stat 557
Heike Hofmann

Outline
• Summary Statistics: Difference of Proportions, Relative Risk, Odds, Odds Ratio
• Visualizations: Mosaicplots
• Concordance & Discordance

Summaries of 2 x 2 Tables

        X=0          X=1
Y=0     $\pi_{00}$   $\pi_{01}$
Y=1     $\pi_{10}$   $\pi_{11}$

• Difference of Proportions: $\pi_{j|i=0} - \pi_{j|i=1} =: \pi_1 - \pi_2$
• Relative Risk: $r := \dfrac{\pi_{j=1|i=0}}{\pi_{j=1|i=1}}$
• Odds: $\pi : (1 - \pi)$
• Odds Ratio: $\theta := \dfrac{\pi_{00}\,\pi_{11}}{\pi_{10}\,\pi_{01}}$

asymptotics: Agresti pp 70-75, 77

2 x 2 Mosaics

            Men     Women
Died        1364    126
Survived    367     344

odds ratio: (1364*344)/(367*126) = 10.15
95% CI (Wald): (8.03, 12.83)

[Figures: mosaic plots "Gender by Survival" (No/Yes) and "Survival by Gender" (Male/Female)]

Mosaicplots
• John Hartigan (1980s)
• Area plots (i.e. the area of a tile represents the number of observations in that combination of levels)
• Built hierarchically, i.e.
order of variables matters
• based on conditional distributions

Mosaicplots
P(X|Y)*P(Y) = P(X,Y) = P(Y|X)*P(X)

P(X|Y)*P(Y) (horizontal axis: No, Yes):
prodplot(tc, Freq~Sex+Survived, c("vspine", "hspine"), subset=level==2)

P(Y|X)*P(X) (horizontal axis: Male, Female):
prodplot(tc, Freq~Survived+Sex, c("vspine", "hspine"), subset=level==2)

Visualizing Associations - 2 x 2 tables
Each table is shown as a Fourfold Display and as a Mosaicplot.

weakest association
X\Y1    0     1
0       37    20
1       25    18
odds ratio: 1.33 (0.590, 3.01)

medium association
X\Y2    0     1
0       34    14
1       23    29
odds ratio: 3.06 (1.337, 7.014)

strongest association
X\Y3    0     1
0       43    14
1       14    29
odds ratio: 6.36 (2.645, 15.306)

[Figures: Fourfold Displays and Mosaicplots of X by Y1, Y2, Y3]

Visualizing Associations - 2 x 2 tables
Three tables with the same odds ratio, again shown as Fourfold Displays and Mosaicplots.

X\Y1    0     1
0       30    20
1       10    40
odds ratio: 6.00 (2.453, 14.678)

X\Y2    0     1
0       36    14
1       15    35
odds ratio: 6.00 (2.528, 14.234)

X\Y3    0     1
0       40    10
1       20    30
odds ratio: 6.00 (2.453, 14.678)

[Figures: Fourfold Displays (Row: A/B, Col: A/B) and Mosaicplots for the three tables]

Reading the Odds Ratio
Assume that a + b = 1 and c + d = 1. Then for the odds ratio $\theta$:

$\log\theta = \log\dfrac{ad}{bc} = \log\dfrac{1-b}{b} + \log\dfrac{d}{1-d} \approx 4(d - b)$

The approximation is the first-order Taylor expansion $\log\dfrac{p}{1-p} \approx 4\,(p - \tfrac{1}{2})$ around $p = \tfrac{1}{2}$, applied to both terms.

[Figure: a, b, c, d on the probability scale (0 to 1) mapped to the log odds scale (-inf to +inf)]

Odds ratios in 2 x 2 x K tables
Survival by Gender plots for each Class

Class   odds ratio   CI
1st     67.09        (23.7, 189.9)
2nd     44.07        (21.5, 90.3)
3rd     4.07         (2.8, 5.9)
Crew    23.26        (6.8, 79.1)

Odds ratios in 2 x 2 x K tables
X and Y are binary variables, Z is categorical with K categories

Death Penalty in Florida:
X death penalty
(yes/no)
Y defendant's race (black/white)
Z victim's race (black/white)

Marginal Table of X/Y

Defendant   yes   no    Total
white       53    430   483
black       15    176   191
Total       68    606   674

Marginal odds ratio: 53*176/(430*15) = 1.45 (±0.59)
The odds of a death sentence are higher for white defendants: slight indication in favor of black defendants ?!

Odds ratios in 2 x 2 x K tables
Conditional Tables of X/Y

Z = white victim
Defendant   yes   no    Total
white       53    414   467
black       11    37    48
Total       64    451   515

Z = black victim
Defendant   yes   no    Total
white       0     16    16
black       4     139   143
Total       4     155   159

Conditional odds ratios: 0.43 (white victim), 0 (black victim)
very strong indication against black defendants

[Figure: Florida Data. Mosaic plots of the marginal association (death penalty yes/no by defendant black/white) and of the conditional associations for victim white/black]

Simpson's paradox
• Simpson's paradox: the marginal association between X and Y is opposite to the conditional associations between X and Y for each level of Z
• due to: a very strong marginal association between X and Z or between Y and Z

[Figure: Florida Data. Marginal and conditional associations as before, plus the strong interaction between defendant's race and victim's race]

Conditional Odds Ratios
• X,Y are conditionally independent for level k of Z, if the conditional log odds ratio at level k is 0 (i.e. the conditional odds ratio is 1)
• X,Y are conditionally independent given Z, if all conditional log odds ratios are 0. (Does not imply marginal independence)
• X,Y have homogeneous association, if the conditional odds ratios are the same for every level of Z.

Testing Independence
• An odds ratio of 1 indicates independence. A confidence interval helps to judge the deviation from independence, but the CI is only an approximation.
• Alternative solution: table tests

Testing independence
• null hypothesis: $\pi_{ij} = \pi_{i\cdot} \cdot \pi_{\cdot j} \quad \forall i, j$
• Score Test (Pearson, 1900): $X^2 = \sum_{i,j} \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$, where $\hat\mu_{ij}$ is the expected count under independence
• Likelihood-Ratio Test: $G^2 = 2 \sum_{i,j} n_{ij} \log\dfrac{n_{ij}}{\hat\mu_{ij}}$
• both $X^2$ and $G^2$ have the same limiting distribution, $\chi^2_{(I-1)(J-1)}$

Example: Cholesterol/Heart Disease
• 1329 patients of same age/sex

                        Coronary Disease
Cholesterol (mg/l)      present       absent
≤ 220                   y11 = 20      y12 = 553
> 220                   y21 = 72      y22 = 684

Cholesterol/Heart Disease
• Expected Values under independence

                        Coronary Disease
Cholesterol (mg/l)      present    absent     Total
≤ 220                   39.67      533.33     573
> 220                   52.33      703.67     756
Total                   92         1237       1329

Cholesterol/Heart Disease
• loglikelihood ratio test G2 = 19.8
• Pearson score test X2 = 18.4
• with df = (2-1)*(2-1) = 1, both statistics exceed the critical values $\chi^2_{1,0.05} = 3.84$ and $\chi^2_{1,0.01} = 6.634897$: independence seems to be violated

Extensions to I x J Contingency Tables

Local Odds Ratios
• Each set of four cells forming a rectangle yields one odds ratio
• Local Odds Ratio: use only neighboring cells

  a  b
  c  d

• the (I-1)(J-1) local odds ratios form a minimal sufficient set: every other odds ratio in the table is a product of local odds ratios

Example: Marijuana Use
• Study on Marijuana use (based on parental use)

            student
parent      never   occasional   regular
neither     141     54           40
one         68      44           51
both        17      11           19

• evidence of association?

Example: Marijuana Use
• Student by Parent Use

prodplot(mj, count~student+parent, c("vspine","hspine"), subset=level==2)

[Figure: mosaic plot of student use by parent use (neither, one, both)]

• positive association?

Summaries of I x J Tables (ordinal variables)

        X=1          X=2          ...   X=i
Y=1     $\pi_{11}$   $\pi_{12}$   ...   $\pi_{1i}$
Y=2     $\pi_{21}$   $\pi_{22}$   ...   $\pi_{2i}$
...
Y=J     $\pi_{J1}$   $\pi_{J2}$   ...   $\pi_{Ji}$

For each pair of subjects count the number of concordant/discordant pairs, where
• Concordance/Discordance:
  • a pair is concordant, if the subject ranked higher on X is also ranked higher on Y
  • a pair is discordant, if the subject ranked higher on X is ranked lower on Y

Concordance/Discordance
concordance to (i,j): cells with <i, <j or >i, >j
discordance to (i,j): cells with >i, <j or <i, >j

$\Pi_C = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \Big( \sum_{h>i} \sum_{k>j} \pi_{hk} \Big) = \sum_{i,j} \pi_{ij}\, \pi^{c}_{ij}$

Here $\pi^{c}_{ij}$ is the total probability of the cells concordant to (i,j) and $\pi^{d}_{ij}$ the total probability of the cells discordant to (i,j). $\Pi_D$ is defined analogously from the discordant cells.

Gamma Statistic
• let $\Pi_C$, $\Pi_D$ be the probabilities for concordance and discordance, resp.:

  $\gamma = \dfrac{\Pi_C - \Pi_D}{\Pi_C + \Pi_D}$

• $\hat\gamma$ is approx. normal with

  $\sigma^2(\hat\gamma)_\infty = \dfrac{16}{n\,(\Pi_C + \Pi_D)^4} \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \left( \Pi_C\, \pi^{d}_{ij} - \Pi_D\, \pi^{c}_{ij} \right)^2$
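The odds ratio and Wald interval quoted on the "2 x 2 Mosaics" slide can be reproduced in a few lines of base R. This is a minimal sketch: the helper name `odds_ratio_wald` is hypothetical (it is not from the slides), and the interval is assumed to be the usual Wald interval on the log scale with standard error sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22).

```r
# Titanic counts from the "2 x 2 Mosaics" slide (rows: outcome, columns: gender).
titanic <- matrix(c(1364, 126,
                     367, 344),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Outcome = c("Died", "Survived"),
                                  Gender  = c("Men", "Women")))

# Hypothetical helper (an assumption, not part of the slides):
# sample odds ratio with a Wald interval built on the log scale.
odds_ratio_wald <- function(tab, level = 0.95) {
  stopifnot(all(dim(tab) == c(2, 2)), all(tab > 0))
  or     <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  se_log <- sqrt(sum(1 / tab))               # standard error of log(or)
  z      <- qnorm(1 - (1 - level) / 2)       # normal quantile for the level
  ci     <- exp(log(or) + c(-1, 1) * z * se_log)
  list(or = or, lower = ci[1], upper = ci[2])
}

res <- odds_ratio_wald(titanic)
print(round(unlist(res), 2))
```

Working on the log scale keeps the interval inside (0, inf) and makes it symmetric around log(or), which is why the interval on the original scale is asymmetric around 10.15.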
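The reversal in the Florida death penalty data (Simpson's paradox) can be checked directly from the counts on the slides. The sketch below stores the 2 x 2 x 2 table as an array and compares the marginal odds ratio with the two conditional ones; the object and function names are illustrative assumptions.

```r
# Florida counts from the slides, filled in column-major order:
# defendant (white, black) x death penalty (yes, no) x victim (white, black).
florida <- array(c(53, 11, 414, 37,     # victim white
                    0,  4,  16, 139),   # victim black
                 dim = c(2, 2, 2),
                 dimnames = list(defendant = c("white", "black"),
                                 penalty   = c("yes", "no"),
                                 victim    = c("white", "black")))

# Sample odds ratio of a 2 x 2 table (illustrative helper).
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# One odds ratio per level of Z (victim's race).
conditional <- apply(florida, 3, odds_ratio)

# Collapse over Z, then compute the odds ratio of the marginal table.
marginal <- odds_ratio(apply(florida, c(1, 2), sum))

print(round(conditional, 2))
print(round(marginal, 2))
```

The marginal odds ratio is above 1 while both conditional odds ratios are below 1. The conditional value of exactly 0 for black victims comes from the empty cell (no white defendant received the death penalty in that stratum).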
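For the cholesterol example, the expected counts, Pearson's X2 and the likelihood-ratio G2 can be computed by hand and cross-checked against `chisq.test` from base R. This is a sketch under the independence null of the slides; the variable names are illustrative.

```r
# Observed counts from the Cholesterol/Heart Disease slide.
heart <- matrix(c(20, 553,
                  72, 684),
                nrow = 2, byrow = TRUE,
                dimnames = list(Cholesterol = c("<=220", ">220"),
                                Disease     = c("present", "absent")))

n        <- sum(heart)
expected <- outer(rowSums(heart), colSums(heart)) / n   # mu-hat under independence

X2 <- sum((heart - expected)^2 / expected)              # Pearson score statistic
G2 <- 2 * sum(heart * log(heart / expected))            # likelihood-ratio statistic
df <- (nrow(heart) - 1) * (ncol(heart) - 1)

# Upper-tail p-values from the chi-squared limit distribution.
p_X2 <- pchisq(X2, df, lower.tail = FALSE)
p_G2 <- pchisq(G2, df, lower.tail = FALSE)

# Cross-check: chisq.test without continuity correction gives the same X2.
builtin <- chisq.test(heart, correct = FALSE)

print(round(expected, 2))
print(round(c(X2 = X2, G2 = G2, p_X2 = p_X2, p_G2 = p_G2), 4))
```

Note that `chisq.test` applies Yates' continuity correction to 2 x 2 tables by default, so `correct = FALSE` is needed to match the uncorrected statistic used on the slides.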
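As an illustration of concordance, discordance and the gamma statistic, the sketch below counts concordant and discordant pairs in the marijuana table and forms the sample version of gamma. It assumes both variables are ordinal in the order shown on the slide. The function name `gamma_stat` is hypothetical, and pair counts are used in place of the probabilities $\Pi_C$ and $\Pi_D$ (the common scaling factor cancels in the ratio).

```r
# Marijuana use counts from the slides (rows: parent use, columns: student use).
mj_tab <- matrix(c(141, 54, 40,
                    68, 44, 51,
                    17, 11, 19),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(parent  = c("neither", "one", "both"),
                                 student = c("never", "occasional", "regular")))

# Hypothetical helper: count each concordant/discordant pair once and
# return the sample gamma (C - D) / (C + D).
gamma_stat <- function(tab) {
  I <- nrow(tab); J <- ncol(tab)
  conc <- 0; disc <- 0
  for (i in seq_len(I)) {
    for (j in seq_len(J)) {
      # Cells with higher row AND higher column are concordant to (i, j).
      below_right <- if (i < I && j < J) sum(tab[(i + 1):I, (j + 1):J]) else 0
      # Cells with higher row BUT lower column are discordant to (i, j).
      below_left  <- if (i < I && j > 1) sum(tab[(i + 1):I, 1:(j - 1)]) else 0
      conc <- conc + tab[i, j] * below_right
      disc <- disc + tab[i, j] * below_left
    }
  }
  list(concordant = conc, discordant = disc,
       gamma = (conc - disc) / (conc + disc))
}

res_gamma <- gamma_stat(mj_tab)
print(unlist(res_gamma))
```

A positive gamma means concordant pairs outnumber discordant ones, which matches the "positive association?" question raised on the mosaic plot slide.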