Measures of Association

  - Summarize strength of associations
  - Quantify relative risk

Types of measures
  - odds ratio
  - correlation
  - Pearson statistic
  - prediction
  - concordance/discordance

References:

Goodman, L. A. and Kruskal, W. H. (1979). Measures of Association for Cross Classifications, Springer, New York.

Bishop, Fienberg and Holland (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA (Chapter 11).

Brown, M. B. and Benedetti (1977). Sampling behavior of tests for correlation in two-way contingency tables, Journal of the American Statistical Association, 72, 309-315.

Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22, 719-748.

Agresti, A. (1984). Analysis of Ordinal Categorical Data, Wiley, New York (Chapters 2 & 3).

Agresti, A. (2002). Categorical Data Analysis, 2nd edition, Wiley, New York (Chapter 2).

The odds ratio: The most frequently used measure for 2 x 2 tables.

Observed counts:

                        Variable 2
                       j=1     j=2
  Variable 1   i=1     Y11     Y12
               i=2     Y21     Y22

Means or "expected" counts:

                       j=1     j=2
               i=1     m11     m12
               i=2     m21     m22

True proportions:

                       j=1     j=2
               i=1     π11     π12
               i=2     π21     π22

The odds that a sampled unit is in category 1 for variable 1, given that it is in category j for variable 2:

$$\frac{\Pr\{\text{variable 1 is in category 1} \mid \text{variable 2 is in category } j\}}{\Pr\{\text{variable 1 is in category 2} \mid \text{variable 2 is in category } j\}} = \frac{\pi_{1j}/(\pi_{1j}+\pi_{2j})}{\pi_{2j}/(\pi_{1j}+\pi_{2j})} = \frac{\pi_{1j}}{\pi_{2j}}$$

Odds ratio:

$$\alpha = \frac{\pi_{11}/\pi_{21}}{\pi_{12}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{21}\pi_{12}} = \frac{m_{11}m_{22}}{m_{21}m_{12}}$$

also called the cross-product ratio.
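As a quick numerical check of the definition, the following Python sketch computes the conditional odds in each column of a 2 x 2 table of proportions, takes their ratio, and compares it with the cross-product ratio. The function names and the example proportions are hypothetical, chosen only for illustration; they do not come from the notes.

```python
def column_odds(pi, j):
    """Odds that variable 1 is in category 1 rather than 2, given variable 2 is in column j."""
    col_total = pi[0][j] + pi[1][j]
    p1 = pi[0][j] / col_total   # Pr{variable 1 = 1 | variable 2 = j}
    p2 = pi[1][j] / col_total   # Pr{variable 1 = 2 | variable 2 = j}
    return p1 / p2


def odds_ratio(pi):
    """Ratio of the column-1 odds to the column-2 odds."""
    return column_odds(pi, 0) / column_odds(pi, 1)


def cross_product_ratio(pi):
    """pi11 * pi22 / (pi21 * pi12)."""
    return (pi[0][0] * pi[1][1]) / (pi[1][0] * pi[0][1])


# Hypothetical true proportions (they sum to 1); illustration only.
pi = [[0.30, 0.20],
      [0.10, 0.40]]

print("odds ratio          :", odds_ratio(pi))
print("cross-product ratio :", cross_product_ratio(pi))
```

The two printed values agree up to floating-point rounding, because the column totals cancel in each conditional odds.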
Example: Chinook Salmon

Early run (1999):

              Hook & Line     Net
  Female          172         165
  Male            119         202
  Total           291         367

Odds of capturing a female:

  hook & line: $\dfrac{172/291}{119/291} = \dfrac{.5911}{.4089} = 1.45$

  net: $\dfrac{165/367}{202/367} = \dfrac{.4496}{.5504} = 0.82$

Estimated odds ratio:

$$\hat\alpha = \frac{\text{odds of capturing a female with hook \& line}}{\text{odds of capturing a female with a net}} = \frac{1.45}{0.82} = 1.77$$

An approximate 95% confidence interval for $\alpha$ is (1.30, 2.42).

Conclusion: The odds that a captured fish is female are about 30 to 140 percent greater with hook & line than with a net.

Late run:

              Hook & Line     Net
  Female          199         168
  Male            162         268
  Total           361         436

Estimated odds ratio:

$$\hat\alpha = \frac{.5512/.4488}{.3853/.6147} = \frac{1.23}{0.63} = 1.96$$

An approximate 95% confidence interval for $\alpha$ is (1.48, 2.60).

Conclusion:

/* Program to analyze the 1999 Chinook salmon data.
   This program is stored in the file chinook2.sas */

data set1;
  infile 'chinook.dat';
  input (year month day biweek run gear age sexa length)
        (4. 2. 2. 1. 1. 1. 2. $1. 4.);
  rage=int(age/10);
  oage=age-(10*rage);
  if(sexa = 'F') then sex=1; else sex=2;
run;

/* Attach labels to categories */
proc format;
  value run  1 = 'Early'
             2 = 'Late';
  value sex  1 = 'Female'
             2 = 'Male';
  value gear 1 = 'Hook'
             2 = 'Net';
run;

proc sort data=set1; by run; run;

/* Examine partial association between sex and
   method of capture within each run. */
proc freq data=set1; by run;
  table gear*sex / chisq Fisher all nopercent nocol expected;
  format sex sex. gear gear. run run.;
run;

---------------- run=Early ----------------

The FREQ Procedure

Table of gear by sex

  gear        sex
  Frequency
  Expected
  Row Pct     Female     Male      Total
  Hook           172      119        291
              149.04   141.96
               59.11    40.89
  Net            165      202        367
              187.96   179.04
               44.96    55.04
  Total          337      321        658

Statistics for Table of gear by sex

  Statistic                       DF      Value     Prob
  Chi-Square                       1    13.0018   0.0003
  Likelihood Ratio Chi-Square      1    13.0545   0.0003
  Continuity Adj. Chi-Square       1    12.4417   0.0004
  Mantel-Haenszel Chi-Square       1    12.9820   0.0003
  Phi Coefficient                        0.1406
  Contingency Coefficient                0.1392
  Cramer's V                             0.1406

Fisher's Exact Test

  Cell (1,1) Frequency (F)          172
  Left-sided Pr <= F             0.9999
  Right-sided Pr >= F         2.050E-04
  Table Probability (P)       9.352E-05
  Two-sided Pr <= P           4.022E-04

  Statistic                           Value      ASE
  Gamma                              0.2778   0.0733
  Kendall's Tau-b                    0.1406   0.0385
  Stuart's Tau-c                     0.1396   0.0383
  Somers' D C|R                      0.1415   0.0388
  Somers' D R|C                      0.1397   0.0383
  Pearson Correlation                0.1406   0.0385
  Spearman Correlation               0.1406   0.0385
  Lambda Asymmetric C|R              0.1153   0.0561
  Lambda Asymmetric R|C              0.0241   0.0623
  Lambda Symmetric                   0.0719   0.0513
  Uncertainty Coefficient C|R        0.0143   0.0079
  Uncertainty Coefficient R|C        0.0145   0.0080
  Uncertainty Coefficient Sym        0.0144   0.0079

Estimates of the Relative Risk (Row1/Row2)

  Type of Study                  Value    95% Confidence Limits
  Case-Control (Odds Ratio)     1.7695      1.2961     2.4157
  Cohort (Col1 Risk)            1.3147      1.1336     1.5246
  Cohort (Col2 Risk)            0.7430      0.6292     0.8773

  Sample Size = 658

------------------ run=Late ------------------

Table of gear by sex

  gear        sex
  Frequency
  Expected
  Row Pct     Female     Male      Total
  Hook           199      162        361
              166.23   194.77
               55.12    44.88
  Net            168      268        436
              200.77   235.23
               38.53    61.47
  Total          367      430        797

Statistics for Table of gear by sex

  Statistic                       DF      Value     Prob
  Chi-Square                       1    21.8848   <.0001
  Likelihood Ratio Chi-Square      1    21.9550   <.0001
  Continuity Adj. Chi-Square       1    21.2221   <.0001
  Mantel-Haenszel Chi-Square       1    21.8574   <.0001
  Phi Coefficient                        0.1657
  Contingency Coefficient                0.1635
  Cramer's V                             0.1657

Fisher's Exact Test

  Cell (1,1) Frequency (F)          199
  Left-sided Pr <= F             1.0000
  Right-sided Pr >= F         1.980E-06
  Table Probability (P)       9.975E-07
  Two-sided Pr <= P           3.361E-06

  Statistic                           Value      ASE
  Gamma                              0.3242   0.0647
  Kendall's Tau-b                    0.1657   0.0350
  Stuart's Tau-c                     0.1645   0.0348
  Somers' D C|R                      0.1659   0.0350
  Somers' D R|C                      0.1655   0.0350
  Pearson Correlation                0.1657   0.0350
  Spearman Correlation               0.1657   0.0350
  Lambda Asymmetric C|R              0.1008   0.0491
  Lambda Asymmetric R|C              0.0859   0.0507
  Lambda Symmetric                   0.0934   0.0445
  Uncertainty Coefficient C|R        0.0200   0.0085
  Uncertainty Coefficient R|C        0.0200   0.0085
  Uncertainty Coefficient Sym        0.0200   0.0085

Estimates of the Relative Risk (Row1/Row2)

  Type of Study                  Value    95% Confidence Limits
  Case-Control (Odds Ratio)     1.9596      1.4763     2.6012
  Cohort (Col1 Risk)            1.4306      1.2305     1.6633
  Cohort (Col2 Risk)            0.7301      0.6370     0.8367

  Sample Size = 797

Properties

(i) The odds ratio is not "margin sensitive". Choose $t_1, t_2, s_1, s_2$ such that $t_1 + t_2 = 1$ and $s_1 + s_2 = 1$. Then the odds ratio for

      s1 t1 π11     s1 t2 π12
      s2 t1 π21     s2 t2 π22

is

$$\frac{(s_1 t_1 \pi_{11})(s_2 t_2 \pi_{22})}{(s_1 t_2 \pi_{12})(s_2 t_1 \pi_{21})} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \alpha$$

(ii) $0 \le \alpha \le \infty$

(iii) $\alpha = 1$ corresponds to independence.

(iv) Interchanging the rows of the table produces $1/\alpha$.

(v) Interchanging the columns of the table produces $1/\alpha$.

(vi) Interchanging both the rows and the columns of the 2 x 2 table produces $\alpha$.

So, $\alpha = 4$ indicates the same level of association as $\alpha = 1/4 = .25$.

Estimation:

Substitute $Y_{ij}$ for $m_{ij}$ in $\alpha$:

$$\hat\alpha = \frac{Y_{11}Y_{22}}{Y_{12}Y_{21}}$$

$\hat\alpha$ has a "large sample" $N(\alpha, \sigma^2_{\infty,\hat\alpha})$ distribution with

$$\sigma^2_{\infty,\hat\alpha} = \alpha^2\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right)$$

(when each $m_{ij}$ is large).

Estimation of the large sample variance:

$$\hat\sigma^2_{\infty,\hat\alpha} = \hat\alpha^2\left(\frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}}\right)$$
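The properties and the large-sample estimates can be checked numerically. The Python sketch below uses the early-run Chinook counts in the gear-by-sex layout of the SAS output; the function names and the scaling constants s and t are hypothetical, picked only to exercise property (i).

```python
import math


def odds_ratio(t):
    """Cross-product ratio Y11*Y22 / (Y12*Y21) of a 2x2 table."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])


def odds_ratio_se(t):
    """Estimated large-sample standard error: alpha_hat * sqrt(sum of 1/Yij)."""
    return odds_ratio(t) * math.sqrt(sum(1.0 / y for row in t for y in row))


# Early-run counts: rows = gear (Hook, Net), columns = sex (Female, Male).
early = [[172, 119],
         [165, 202]]

alpha_hat = odds_ratio(early)
se_hat = odds_ratio_se(early)

# Property (i): rescaling rows by s_i and columns by t_j leaves alpha unchanged.
s = (0.2, 0.8)   # hypothetical row multipliers, s1 + s2 = 1
t = (0.7, 0.3)   # hypothetical column multipliers, t1 + t2 = 1
scaled = [[s[i] * t[j] * early[i][j] for j in range(2)] for i in range(2)]

# Property (iv): interchanging the rows gives 1/alpha.
rows_swapped = [early[1], early[0]]

# Property (vi): interchanging rows and columns gives alpha again.
both_swapped = [row[::-1] for row in rows_swapped]

print("alpha_hat            :", round(alpha_hat, 4))
print("estimated std. error :", round(se_hat, 4))
print("after rescaling      :", round(odds_ratio(scaled), 4))
print("rows interchanged    :", round(odds_ratio(rows_swapped), 4))
print("rows and columns     :", round(odds_ratio(both_swapped), 4))
```

The value of alpha_hat should agree with the case-control odds ratio of 1.7695 reported by PROC FREQ for the early run.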
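For small samples use

$$\hat\alpha = \frac{(Y_{11} + .5)(Y_{22} + .5)}{(Y_{12} + .5)(Y_{21} + .5)}$$

and

$$\hat\sigma^2_{\infty,\hat\alpha} = (\hat\alpha)^2\left(\frac{1}{Y_{11} + .5} + \frac{1}{Y_{12} + .5} + \frac{1}{Y_{21} + .5} + \frac{1}{Y_{22} + .5}\right)$$

Other smooth functions of $\alpha$ are often used as measures of association, say $f(\alpha)$. The large sample distribution for $f(\hat\alpha)$ is

$$N\left(f(\alpha),\; [f'(\alpha)]^2\,\sigma^2_{\infty,\hat\alpha}\right)$$

The asymptotic variance is obtained from the delta method.

log-odds ratio: $\log(\alpha)$

(i) $-\infty < \log(\alpha) < \infty$

(ii) independence $\Leftrightarrow$ $\alpha = 1$ $\Leftrightarrow$ $\log(\alpha) = 0$

(iii) $\log(\alpha)$ is not "margin sensitive"

(iv) $\log(\alpha)$ and $-\log(\alpha) = \log(1/\alpha)$ imply equal levels of association

(v) $\log(\hat\alpha)$ has a more nearly symmetric distribution than $\hat\alpha$ for smaller sample sizes

$$\log(\hat\alpha) \overset{\text{dist}}{\longrightarrow} N\left(\log(\alpha),\; \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right) \quad \text{as } m_{ij} \to \infty \text{ for all } (i, j)$$

Approximate 95% confidence intervals:

For $\log(\alpha)$:

$$\log(\hat\alpha) \pm (1.96)\sqrt{\frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}}} = [A, B]$$

For $\alpha$: $[\exp(A), \exp(B)]$

Yule's Q: Yule (1900)

$$Q = \frac{\alpha - 1}{\alpha + 1} = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}$$

Properties:

1. $-1 \le Q \le 1$

2. $Q = 0$ for the independence model

3. $Q = 1$ when either $\pi_{12} = 0$ or $\pi_{21} = 0$

4. $Q = -1$ when either $\pi_{11} = 0$ or $\pi_{22} = 0$

5. Q is "symmetric". When the columns (or rows) of a 2 x 2 table are interchanged, then $\alpha \Rightarrow 1/\alpha$ and $Q \Rightarrow -Q$.

6. Q is a margin free measure.

Q is the Goodman-Kruskal Gamma statistic for 2 x 2 tables.

Estimation:

$$\hat Q = \frac{\hat\alpha - 1}{\hat\alpha + 1} = \frac{Y_{11}Y_{22} - Y_{12}Y_{21}}{Y_{11}Y_{22} + Y_{12}Y_{21}}$$

Large sample distribution: As $m_{ij} \to \infty$ for all $i = 1, 2$; $j = 1, 2$

$$\hat Q \sim N\left(\frac{\alpha - 1}{\alpha + 1},\; \frac{1}{4}(1 - Q^2)^2\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right)\right)$$

The variance is

$$[f'(\alpha)]^2\,\alpha^2\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right) \quad \text{where } f(\alpha) = \frac{\alpha - 1}{\alpha + 1}$$

What is a large or substantial association? What is a large value of $\alpha$? $\ln(\alpha)$? $Q$?

1. Use large sample distributions to construct confidence intervals or test hypotheses

   $H_0: \alpha = 1$ versus $H_A: \alpha \ne 1$.

   Reject $H_0$ if

   $$|Z| = \left|\frac{\ln(\hat\alpha) - 0}{\hat\sigma_{\ln(\hat\alpha)}}\right| > Z_{\alpha/2}$$

   (in the subscript of $Z_{\alpha/2}$, $\alpha$ denotes the significance level, not the odds ratio).

2. Suppose you discover that $\hat\alpha = 2.4$ is "significantly different" from one. Is an odds ratio of 2.4 large enough to have practical importance? There are no absolute guidelines. It depends on the subject matter or field of study.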
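The log-odds-ratio interval, Yule's Q with its large-sample standard error, and the Z statistic for testing an odds ratio of one can be sketched in Python as follows. This is a minimal illustration using the early-run Chinook counts (gear by sex); the function name and the returned dictionary keys are hypothetical, and 1.96 is used as the normal critical value, as in the notes.

```python
import math


def log_odds_ratio_summary(table, z_crit=1.96):
    """Large-sample summaries for a 2x2 table of counts [[Y11, Y12], [Y21, Y22]]."""
    (y11, y12), (y21, y22) = table
    alpha_hat = (y11 * y22) / (y12 * y21)
    log_alpha = math.log(alpha_hat)
    # Estimated standard error of log(alpha_hat).
    se_log = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    a = log_alpha - z_crit * se_log          # lower limit for log(alpha)
    b = log_alpha + z_crit * se_log          # upper limit for log(alpha)
    q_hat = (alpha_hat - 1) / (alpha_hat + 1)
    # Delta method: Var(Q_hat) is about (1/4)(1 - Q^2)^2 * sum(1/Yij).
    se_q = 0.5 * (1 - q_hat ** 2) * se_log
    return {
        "alpha_hat": alpha_hat,
        "ci_log_alpha": (a, b),
        "ci_alpha": (math.exp(a), math.exp(b)),
        "q_hat": q_hat,
        "se_q": se_q,
        "z": log_alpha / se_log,             # tests H0: alpha = 1
    }


# Early-run counts: rows = gear (Hook, Net), columns = sex (Female, Male).
early = [[172, 119],
         [165, 202]]
result = log_odds_ratio_summary(early)

for name, value in result.items():
    print(name, value)
```

Up to rounding and the use of 1.96 rather than an exact normal quantile, the interval for the odds ratio should match the PROC FREQ limits (1.2961, 2.4157), and the estimate of Q and its standard error should match the Gamma row of the SAS output (0.2778 with ASE 0.0733).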
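A useful application of measures of association for two-way tables is to assess differences in levels of association across time, or locations.

Example: one 2 x 2 table per year, classifying sex (male, female) by College Graduate (Yes, No).

  Year     Estimated odds ratio
  1950     $\hat\alpha_{1950} = 10.1$
  1960     $\hat\alpha_{1960} = 4.2$
  1970     $\hat\alpha_{1970} = 2.7$
  1980     $\hat\alpha_{1980} = 1.8$

Relative Risk

              Heart attack    No heart attack
  Placebo         π11              π12            n1
  Aspirin         π21              π22            n2

Relative risk of a heart attack

$$\frac{\Pr\{\text{heart attack} \mid \text{placebo}\}}{\Pr\{\text{heart attack} \mid \text{aspirin}\}} = \frac{\pi_{11}}{\pi_{21}} \doteq \left(\frac{\pi_{11}}{\pi_{21}}\right)\left(\frac{\pi_{22}}{\pi_{12}}\right) = \alpha$$

when $\pi_{11}$ and $\pi_{21}$ are small. In that case,

$\pi_{22} = 1 - \pi_{21} \doteq 1.0$ and $\pi_{12} = 1 - \pi_{11} \doteq 1.0$.

Data:

              Heart attack    No heart attack
  Placebo         189             10845          n1 = 11034
  Aspirin         104             10933          n2 = 11037

Estimated relative risk of heart attack for those taking the placebo versus those taking the aspirin:

$$\frac{189/11034}{104/11037} = \frac{.0171}{.0094} = 1.82$$

Odds ratio:

$$\hat\alpha = \frac{\text{odds of a heart attack for placebo users}}{\text{odds of a heart attack for aspirin users}} = \frac{(189)(10933)}{(104)(10845)} = 1.83$$

Confidence interval: (case-control)

$\hat\alpha = 1.83205$

$\log(\hat\alpha) = 0.6054377$

$S^2_{\log(\hat\alpha)} = \sum_i \sum_j \frac{1}{Y_{ij}} = .01509$

Then $\log(\hat\alpha) \pm z_{\alpha/2}\, S_{\log(\hat\alpha)}$ (with $z_{\alpha/2} = 1.96$ for a 95% interval)

  $\Rightarrow\; 0.6054 \pm (1.96)\sqrt{.01509}$
  $\Rightarrow\; (0.36467,\; 0.84621)$
  $\Rightarrow\; (e^{0.36467},\; e^{0.84621})$
  $\Rightarrow\; (1.440,\; 2.331)$

Relative risk:

$$RR_{col1} = \frac{189/11034}{104/11037} = 1.8178$$

$\log(RR_{col1}) = .5976275$

$$S^2_{\log(RR_{col1})} = \frac{1 - \hat\pi_{11}}{n_1\hat\pi_{11}} + \frac{1 - \hat\pi_{21}}{n_2\hat\pi_{21}} = \frac{Y_{12}}{n_1 Y_{11}} + \frac{Y_{22}}{n_2 Y_{21}} = .01472515$$

$\log(RR_{col1}) \pm (1.96)\, S_{\log(RR)}$

  $\Rightarrow\; .5976275 \pm (1.96)\sqrt{.01472515}$
  $\Rightarrow\; (.359788,\; .835467)$
  $\Rightarrow$ An approximate 95% confidence interval is $(e^{.359788},\; e^{.835467})$
  $\Rightarrow\; (1.433,\; 2.306)$

Derivation of the variance of log(RR):

$$\log(RR) = \log\left(\frac{\pi_{11}}{\pi_{21}}\right) = \log(\pi_{11}) - \log(\pi_{21}) = g(\pi_{11}, \pi_{21})$$

Independent binomial experiments:

  $Y_{11} \sim \text{Bin}(n_1, \pi_{11})$, $\quad \hat\pi_{11} = Y_{11}/n_1$

  $Y_{21} \sim \text{Bin}(n_2, \pi_{21})$, $\quad \hat\pi_{21} = Y_{21}/n_2$

$$V = \text{Var}\begin{pmatrix} \hat\pi_{11} \\ \hat\pi_{21} \end{pmatrix} = \begin{bmatrix} \dfrac{\pi_{11}(1 - \pi_{11})}{n_1} & 0 \\ 0 & \dfrac{\pi_{21}(1 - \pi_{21})}{n_2} \end{bmatrix}$$

Delta method:

$$\text{Var}\left(g(\hat\pi_{11}, \hat\pi_{21})\right) \approx \begin{bmatrix} \dfrac{\partial g}{\partial \pi_{11}} & \dfrac{\partial g}{\partial \pi_{21}} \end{bmatrix} V \begin{bmatrix} \dfrac{\partial g}{\partial \pi_{11}} \\ \dfrac{\partial g}{\partial \pi_{21}} \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\pi_{11}} & -\dfrac{1}{\pi_{21}} \end{bmatrix} V \begin{bmatrix} \dfrac{1}{\pi_{11}} \\ -\dfrac{1}{\pi_{21}} \end{bmatrix} = \frac{1 - \pi_{11}}{n_1 \pi_{11}} + \frac{1 - \pi_{21}}{n_2 \pi_{21}}$$

SAS Code

/* This program computes the odds ratio for a 2x2 table. */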
/* It is stored in the file aspirin.sas */

/* Assign labels to values */
PROC FORMAT;
  VALUE RFMT 1 = 'Placebo'
             2 = 'Aspirin';
  VALUE CFMT 1 = 'Yes'
             2 = 'No';
run;

DATA SET1;
  INPUT ROW COL COUNT;
  LABEL ROW = 'Treatment'
        COL = 'Heart attack';
  CARDS;
1 1 189
1 2 10845
2 1 104
2 2 10933
run;

/* Analyze the table of counts */
PROC FREQ DATA=SET1;
  TABLES ROW*COL / CHISQ ALL NOPERCENT NOCOL EXPECTED;
  WEIGHT COUNT;
  FORMAT ROW RFMT. COL CFMT.;
run;

The FREQ Procedure

Table of ROW by COL

  ROW(Treatment)     COL(Heart attack)
  Frequency
  Expected
  Row Pct         Yes        No      Total
  Placebo         189     10845      11034
               146.48     10888
                 1.71     98.29
  Aspirin         104     10933      11037
               146.52     10890
                 0.94     99.06
  Total           293     21778      22071

Statistics for Table of ROW by COL

  Statistic                       DF      Value     Prob
  Chi-Square                       1    25.0139   <.0001
  Likelihood Ratio Chi-Square      1    25.3720   <.0001
  Continuity Adj. Chi-Square       1    24.4291   <.0001
  Mantel-Haenszel Chi-Square       1    25.0128   <.0001
  Phi Coefficient                        0.0337
  Contingency Coefficient                0.0336
  Cramer's V                             0.0337

Fisher's Exact Test

  Cell (1,1) Frequency (F)          189
  Left-sided Pr <= F             1.0000
  Right-sided Pr >= F         3.253E-07
  Table Probability (P)       1.516E-07
  Two-sided Pr <= P           5.033E-07

Estimates of the Relative Risk (Row1/Row2)

  Type of Study                  Value    95% Confidence Limits
  Case-Control (Odds Ratio)     1.8321      1.4400     2.3308
  Cohort (Col1 Risk)            1.8178      1.4330     2.3059
  Cohort (Col2 Risk)            0.9922      0.9892     0.9953

  Sample Size = 22071

S-PLUS Code

# An S-PLUS function to compute
# an odds ratio and construct an
# approximate confidence interval
# This code is posted in the file
# oddsratio.ssc

oddsratio <- function(table,conf=.95, cont=0.0)
{level <- 1-(1-conf)/2
 tablec <- table + cont
 alpha <- tablec[1,1]*tablec[2,2]/
          (tablec[1,2]*tablec[2,1])
 la <- log(alpha)
 sla <- sqrt(sum(tablec^(-1)))
 sa <- alpha*sla
 lowera <-
   round(exp(la-qnorm(level)*sla),4)
 uppera <-
   round(exp(la+qnorm(level)*sla),4)
 confper <- round(conf*100,1)
 sar <- round(sa,4)
 alphar <- round(alpha,4)
 cat("\n", "odds-ratio = ", alphar)
 cat("\n", "std. error = ", sar)
 cat("\n",confper,"% confidence interval")
 cat("\n","  lower limit    upper limit")
 cat("\n","  ", lowera, "  ", uppera)
}

# To run this function, first create a
# table of counts
#
#   aspirin <- matrix(c(189, 10845, 104,
#                       10933), 2, 2, byrow=T)
#
# Then source this function into the
# command window
#
#   source("yourdirectory/oddsratio.ssc")
#
# Then execute the function
#
#   oddsratio(aspirin,conf=.95,cont=0.0)
#

 odds-ratio =  1.8321
 std. error =  0.2251
 95 % confidence interval
   lower limit    upper limit
   1.44           2.3308

The heart attack study is an example of a prospective study.

In such studies:

  - Patients are randomly assigned to treatment groups.
  - The treatments are administered.
  - The proportion giving a certain response is recorded for each treatment group.

placebo: $Y_{11}/n_1$ = observed proportion that experience a heart attack

aspirin: $Y_{21}/n_2$ = observed proportion that experience a heart attack

These are direct estimates of population proportions needed to determine relative risk.

Retrospective (case-control) studies: Examine what has happened in the past.

Example:

  - Take a simple random sample of $n_1$ patient records (cases), e.g. women who have experienced a heart attack.
  - Classify each woman according to whether or not she ever used oral contraception.
  - Take an independent simple random sample of $n_2$ controls, and classify each woman in the same way.

  Oral contraceptive use    Heart attack    No heart attack
  Yes                           Y11              Y12
  No                            Y21              Y22
  Total                         n1               n2

$Y_{11}/n_1$ estimates $\Pr\{\text{used oral contraceptive} \mid \text{experienced a heart attack}\}$

$Y_{12}/n_2$ estimates $\Pr\{\text{used oral contraceptive} \mid \text{never had a heart attack}\}$

These do not provide a direct estimate of

$$RR = \frac{\Pr\{\text{heart attack} \mid \text{used oral contraception}\}}{\Pr\{\text{heart attack} \mid \text{do not use oral contraception}\}}$$

Bayes Rule to the rescue?

$$\Pr\{\text{heart attack} \mid \text{use o.c.}\} = \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\,\Pr\{\text{heart attack}\}}{\Pr\{\text{use o.c.}\}}$$
Then

$$RR = \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{use o.c.}\}}{\Pr\{\text{do not use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{do not use o.c.}\}}$$

Relative risk of heart attack cannot be estimated without additional information on $\Pr\{\text{use o.c.}\}$, the proportion of women in the population who use oral contraceptives.

Approximate the relative risk with an odds ratio:

$$\frac{\Pr\{\text{heart attack} \mid \text{use o.c.}\} \,/\, \Pr\{\text{no heart attack} \mid \text{use o.c.}\}}{\Pr\{\text{heart attack} \mid \text{do not use o.c.}\} \,/\, \Pr\{\text{no heart attack} \mid \text{do not use o.c.}\}}$$

Which is equal to

$$\frac{\dfrac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\,\Pr\{\text{heart attack}\} \,/\, \Pr\{\text{use o.c.}\}}{\Pr\{\text{use o.c.} \mid \text{no heart attack}\}\,\Pr\{\text{no heart attack}\} \,/\, \Pr\{\text{use o.c.}\}}}{\dfrac{\Pr\{\text{do not use o.c.} \mid \text{heart attack}\}\,\Pr\{\text{heart attack}\} \,/\, \Pr\{\text{do not use o.c.}\}}{\Pr\{\text{do not use o.c.} \mid \text{no heart attack}\}\,\Pr\{\text{no heart attack}\} \,/\, \Pr\{\text{do not use o.c.}\}}}$$

Which is equal to

$$\frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\} \,/\, \Pr\{\text{do not use o.c.} \mid \text{heart attack}\}}{\Pr\{\text{use o.c.} \mid \text{no heart attack}\} \,/\, \Pr\{\text{do not use o.c.} \mid \text{no heart attack}\}}$$

An estimate is

$$\frac{(Y_{11}/n_1) \,/\, (Y_{21}/n_1)}{(Y_{12}/n_2) \,/\, (Y_{22}/n_2)} = \frac{Y_{11}Y_{22}}{Y_{12}Y_{21}}$$
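To see how closely an odds ratio can approximate a relative risk when the outcome is rare, the Python sketch below revisits the aspirin counts from the prospective study, where both quantities can be estimated directly. It also forms the delta-method interval for the relative risk. The function names are hypothetical, and 1.96 is used as the 95% normal critical value, as in the notes.

```python
import math


def odds_ratio(y11, y12, y21, y22):
    """Cross-product ratio Y11*Y22 / (Y12*Y21)."""
    return (y11 * y22) / (y12 * y21)


def relative_risk_ci(y11, n1, y21, n2, z_crit=1.96):
    """Relative risk (row 1 over row 2) with a delta-method interval on the log scale."""
    p1 = y11 / n1
    p2 = y21 / n2
    rr = p1 / p2
    # Var(log RR) is estimated by (1 - p1)/(n1 p1) + (1 - p2)/(n2 p2).
    var_log_rr = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)
    half_width = z_crit * math.sqrt(var_log_rr)
    log_rr = math.log(rr)
    return rr, math.exp(log_rr - half_width), math.exp(log_rr + half_width)


# Aspirin study counts: heart attack / no heart attack.
placebo = (189, 10845)
aspirin = (104, 10933)

rr, rr_lower, rr_upper = relative_risk_ci(
    placebo[0], sum(placebo), aspirin[0], sum(aspirin)
)
alpha_hat = odds_ratio(placebo[0], placebo[1], aspirin[0], aspirin[1])

print("relative risk :", round(rr, 4), (round(rr_lower, 3), round(rr_upper, 3)))
print("odds ratio    :", round(alpha_hat, 4))
```

Because heart attacks are rare in both groups, the two estimates are close (about 1.82 versus 1.83 in the notes), which is why the odds ratio from a case-control study is used in place of the relative risk.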