Contingency tables
Brian Healy, PhD

Types of analysis: independent samples

Outcome         Explanatory    Analysis
Continuous      Dichotomous    t-test, Wilcoxon test
Continuous      Categorical    ANOVA, linear regression
Continuous      Continuous     Correlation, linear regression
Dichotomous     Dichotomous    Chi-square test, logistic regression
Dichotomous     Continuous     Logistic regression
Time to event   Dichotomous    Log-rank test

Example
MS is known to have a genetic component
Several single nucleotide polymorphisms (SNPs) have been associated with susceptibility to MS
Question: Do patients with susceptibility SNPs experience more sustained progression than patients without susceptibility SNPs?

Data
Initially, we will focus on presence vs. absence of SNPs
Among our 190 GA-treated patients, 74 had the SNP and 116 did not
– 12 patients with the SNP experienced sustained progression
\[ \hat{p}_{SNP+} = \frac{12}{74} = 0.162 \]
– 13 patients without the SNP experienced sustained progression
\[ \hat{p}_{SNP-} = \frac{13}{116} = 0.112 \]

Another way to look at the data
Rather than investigating two proportions, we can look at a 2x2 table of the same data

          SNP+   SNP-   Total
Prog      12     13     25
No prog   62     103    165
Total     74     116    190

Question
In our analysis, we assume that the margins are fixed
If there were no relationship between the two variables, what would we expect the values in the table to be?

Example
As an example, use this table

          SNP+               SNP-               Total
Prog      50*100/200 = 25    50*100/200 = 25    50
No prog   150*100/200 = 75   150*100/200 = 75   150
Total     100                100                200

Expected table
Expected table for our analysis

          SNP+                SNP-                  Total
Prog      25*74/190 = 9.74    25*116/190 = 15.3     25
No prog   165*74/190 = 64.3   165*116/190 = 100.7   165
Total     74                  116                   190

How different are our observed data from the expected table?

Do our data show an effect?
To test for an association between the outcome and the predictor, we would like to know if our observed table is different from the expected table
How could we investigate if our table is different?
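The expected-table arithmetic above can be sketched in a few lines of Python. This is only an illustration under the fixed-margins assumption stated on the slide; the function name `expected_counts` and the list-of-rows layout are choices made here, not part of the original lecture.

```python
def expected_counts(table):
    """Expected cell counts under no association, holding the margins fixed.

    Each expected count is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed table from the slides: rows = (Prog, No prog), columns = (SNP+, SNP-)
observed = [[12, 13], [62, 103]]
expected = expected_counts(observed)

for label, row in zip(("Prog", "No prog"), expected):
    print(label, [round(x, 2) for x in row])
```

Because the margins are held fixed, each row and column of the expected table sums to the same total as in the observed table.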
Each of the four cells has an observed count O_i and an expected count E_i
\[ \chi^2 = \sum_{\text{cells}} \frac{(O_i - E_i)^2}{E_i} \]

Chi-square distribution
This statistic follows a chi-square distribution with 1 degree of freedom
Assume x is a normal random variable with mean=0 and variance=1
– x² has a chi-square distribution with 1 degree of freedom

[Figure: chi-square distribution with 1 degree of freedom; area = 0.05 to the right of χ² = 3.84]

Critical information for χ²
For 1 degree of freedom, cut-off for α=0.05 is 3.84
– For normal distribution, this is 1.96
– Note 1.96² = 3.84
Inherently two-sided since it is squared

Hypothesis test with χ²
1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) χ² test
4) Test statistic: χ² = 0.99
5) p-value = 0.32
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

[Output annotated with: p-value, χ² statistic]

Hypothesis test comparison
Yesterday, we completed this same test using a comparison of proportions
Let’s compare the results

Method                Test statistic   p-value
Test of proportions   z = 0.996        p = 0.32
χ² test               χ² = 0.992       p = 0.32

We get the same result!!!

Question: Continuity correction
What is a continuity correction and when should I use it?
– Continuity correction subtracts ½ from the numerator of the χ² statistic
\[ \chi^2 = \sum_{i=1}^{rc} \frac{(|O_i - E_i| - 0.5)^2}{E_i} \]
– Designed to improve performance of normal approximation
– Use default in STATA (or other stat package), but know which you are using
– Less important today since exact tests are easily used

Question: Why 1 degree of freedom?
We used a χ² distribution with 1 degree of freedom, but there are 4 numbers. Why?
– For our analysis, we assume that the margins are fixed.
– If we pick one number in the table, the rest of the numbers are known

          SNP+   SNP-   Total
Prog      3      22     25
No prog   71     94     165
Total     74     116    190

Question: Normal approximation
We are using a normal approximation, but yesterday we talked about this being less than perfect. When can we use this test?
– Rule of thumb: All expected cell counts larger than 5
– Large samples
What should I do if I do not have large samples?
– Fisher’s exact test

Fisher’s exact test
Remember that a p-value is the probability of the observed value or something more extreme
Fisher’s exact test looks at a table and determines how many tables are as extreme or more extreme than the observed table under the null hypothesis of no association
Same concept as the exact version of the Wilcoxon test
Easy to compute this in STATA

Hypothesis test with exact test
1) H0: No association between SNP and progression
2) Dichotomous outcome, dichotomous predictor
3) Exact test
4) Test statistic: NA
5) p-value = 0.38
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

[Output annotated with: two-sided p-value]

Results
Our results were very similar to the other tests, in part because we have a large sample size
– Normal approximation ok
In small samples, larger differences are possible

Types of studies
In a cohort study, people are enrolled based on exposure status, so we can somewhat control how many exposed and unexposed people we have
In a case-control study, people are enrolled based on disease status, so that we ensure that we have both diseased and non-diseased people

Measures of association
Risk difference
\[ RD = P(\text{Disease}^{+} \mid \text{Exposure}^{+}) - P(\text{Disease}^{+} \mid \text{Exposure}^{-}) \]
– Do these added together equal 1?
– Why?
– Under the null, what is the risk difference?
Relative risk (risk ratio)
\[ RR = \frac{P(\text{Disease}^{+} \mid \text{Exposure}^{+})}{P(\text{Disease}^{+} \mid \text{Exposure}^{-})} \]
– Under the null, what is the relative risk?

              Exposure
Disease   Y      N      Total
Y         a      b      n1
N         c      d      n2
Total     m1     m2     N

P(Disease+|Exposure+) = a/m1 = p1
– What is another name for this quantity?
– Prevalence in patients with exposure
P(Disease+|Exposure-) = b/m2 = p2
RD = a/m1 - b/m2
Difference between proportions

Confidence interval for RD
Several confidence intervals are available for the RD
– Asymptotic normal distribution
\[ \hat{p}_1 - \hat{p}_2 \sim N\left( p_1 - p_2,\ \frac{p_1(1-p_1)}{m_1} + \frac{p_2(1-p_2)}{m_2} \right) \]
– Confidence interval
\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{m_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{m_2} } \]

              Exposure
Disease   Y      N      Total
Y         a      b      n1
N         c      d      n2
Total     m1     m2     N

Estimate of RR:
\[ RR = \frac{a/m_1}{b/m_2} \]

Confidence interval for RR
To construct a confidence interval we use a normal approximation
In addition, the CI is based on a log transformation of the RR
– log(RR) = ln(RR)
– I will use ln and log to represent the natural logarithm
Quick math: e^{ln(RR)} = RR

ln(RR)
Why do we use the ln(RR)?
– It is generally easier to deal with subtraction rather than division
– ln(RR) = ln(p1/p2) = ln(p1) - ln(p2)
We can estimate the standard error for the ln(RR) using the following formula
\[ se(\ln RR) = \sqrt{ \frac{c}{a\,m_1} + \frac{d}{b\,m_2} } \]

Confidence interval
Now that we have an estimate of the variance, we can create a confidence interval for ln(RR) using our standard normal approximation
\[ \left( \ln RR - z_{\alpha/2} \sqrt{ \frac{c}{a\,m_1} + \frac{d}{b\,m_2} },\ \ln RR + z_{\alpha/2} \sqrt{ \frac{c}{a\,m_1} + \frac{d}{b\,m_2} } \right) \]
To create a confidence interval for RR, we exponentiate the endpoints of this interval
\[ \left( e^{ \ln RR - z_{\alpha/2} \sqrt{ \frac{c}{a\,m_1} + \frac{d}{b\,m_2} } },\ e^{ \ln RR + z_{\alpha/2} \sqrt{ \frac{c}{a\,m_1} + \frac{d}{b\,m_2} } } \right) \]

[Output annotated with: estimated proportions in two groups; p-value from chi-square test]
Given the confidence interval, would you reject the null hypothesis? Why?

Interpretation of RD
The estimated risk difference is 0.05.
– The interpretation of this is that the risk of progression for patients with the susceptibility allele is 5 percentage points higher than for patients without the allele
The 95% confidence interval for the risk difference is (-0.052, 0.152)
– Is there a significant difference between the allele groups?
What was the confidence interval for the difference between the proportions that we investigated two classes ago?
– 95% CI: (-0.052, 0.152)

Interpretation of RR
The estimated relative risk is 1.45.
– The interpretation of this is that the risk of progression for patients with the susceptibility allele is 1.45 times the risk for patients without the allele
The 95% confidence interval for the relative risk is (0.70, 3.00)
– Is there a significant difference between the allele groups?

RD and RR
Now that we know how to estimate these measures, can we estimate these with any study design?
– Not directly
– In a cohort study, the probabilities of interest, P(Disease|Exposure), are estimated
– In a case-control study, the probabilities cannot be estimated directly, so more information is required

Bayes theorem (technical)
The relationship between P(Disease|Exposure) and P(Exposure|Disease) can be shown using Bayes theorem
\[ P(D^{+} \mid E^{+}) = \frac{P(E^{+} \mid D^{+})\,P(D^{+})}{P(E^{+} \mid D^{+})\,P(D^{+}) + P(E^{+} \mid D^{-})\,P(D^{-})} \]
Therefore, if we knew P(D+), we could estimate P(D+|E+) from a case-control study
– P(D+) is prevalence
– Usually we do not know this, so we can’t directly estimate the relative risk or risk difference

Odds ratio
Odds:
\[ \text{Odds} = \frac{p}{1-p} \]
Odds ratio:
\[ OR = \frac{\text{Odds}_{\text{Exposure}+}}{\text{Odds}_{\text{Exposure}-}} = \frac{ P(\text{Disease}^{+} \mid \text{Exposure}^{+}) \,/\, \left(1 - P(\text{Disease}^{+} \mid \text{Exposure}^{+})\right) }{ P(\text{Disease}^{+} \mid \text{Exposure}^{-}) \,/\, \left(1 - P(\text{Disease}^{+} \mid \text{Exposure}^{-})\right) } \]
– Under the null, what is the OR?

              Exposure
Disease   Y      N      Total
Y         a      b      n1
N         c      d      n2
Total     m1     m2     N

\[ P(D^{+} \mid E^{+}) = \frac{a}{a+c} \]
\[ \text{Odds}_{D^{+} \mid E^{+}} = \frac{P(D^{+} \mid E^{+})}{1 - P(D^{+} \mid E^{+})} = \frac{a/(a+c)}{c/(a+c)} = \frac{a}{c}, \qquad \text{Odds}_{D^{+} \mid E^{-}} = \frac{b}{d} \]
\[ OR = \frac{\text{Odds}_{D^{+} \mid E^{+}}}{\text{Odds}_{D^{+} \mid E^{-}}} = \frac{a/c}{b/d} = \frac{ad}{bc} \]
This is the estimate of the odds ratio from a cohort study

Using the same table:
\[ P(E^{+} \mid D^{+}) = \frac{a}{a+b} \]
\[ \text{Odds}_{E^{+} \mid D^{+}} = \frac{a}{b}, \qquad \text{Odds}_{E^{+} \mid D^{-}} = \frac{c}{d} \]
\[ OR = \frac{\text{Odds}_{E^{+} \mid D^{+}}}{\text{Odds}_{E^{+} \mid D^{-}}} = \frac{a/b}{c/d} = \frac{ad}{bc} \]
This is the estimate of the odds ratio from a case-control study

Amazing!!
Estimated odds ratio from each kind of study ends up being the same thing!!!
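The equality of the two odds ratios can be checked numerically. The sketch below is illustrative only: the function names are invented here, and the SNP table (which came from a cohort study) is reused purely as arithmetic input for both orientations.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


def odds_ratio_cohort(a, b, c, d):
    """OR built from the odds of disease within each exposure group."""
    p_d_given_e_pos = a / (a + c)  # P(D+ | E+)
    p_d_given_e_neg = b / (b + d)  # P(D+ | E-)
    return odds(p_d_given_e_pos) / odds(p_d_given_e_neg)


def odds_ratio_case_control(a, b, c, d):
    """OR built from the odds of exposure within each disease group."""
    p_e_given_d_pos = a / (a + b)  # P(E+ | D+)
    p_e_given_d_neg = c / (c + d)  # P(E+ | D-)
    return odds(p_e_given_d_pos) / odds(p_e_given_d_neg)


# Cell labels follow the slides: a, b = diseased (exposed, unexposed);
# c, d = non-diseased (exposed, unexposed).
a, b, c, d = 12, 13, 62, 103

or_cohort = odds_ratio_cohort(a, b, c, d)
or_case_control = odds_ratio_case_control(a, b, c, d)
cross_product = (a * d) / (b * c)

print(round(or_cohort, 3), round(or_case_control, 3), round(cross_product, 3))
```

All three values agree up to floating-point rounding, which is the algebraic identity OR = ad/bc shown above.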
Therefore, we can complete a case-control study and get an estimate of what we really care about, which is the effect of the exposure on the disease
This relationship is why the odds ratio is so commonly used

Confidence interval for OR
In order to calculate a confidence interval for the OR, we will investigate Woolf’s approximation
– Other approximations and exact intervals are available in STATA (Exact is default)
Woolf’s approximation focuses on a log transformation of the OR, like for the RR
– log(OR) = ln(OR)
Quick math: e^{ln(OR)} = OR

Woolf’s approximation gives us
\[ se(\ln OR) = \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} } \]
Using our normal approximation, we can create a confidence interval for ln(OR) using
\[ \left( \ln OR - z_{\alpha/2} \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} },\ \ln OR + z_{\alpha/2} \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} } \right) \]
The confidence interval for OR
\[ \left( e^{ \ln OR - z_{\alpha/2} \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} } },\ e^{ \ln OR + z_{\alpha/2} \sqrt{ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d} } } \right) \]

Example
In yesterday’s class, we discussed a study in which we wanted to estimate the effect of a SNP on disease progression
– What type of study was this?
– Cohort study because we followed people forward over time
Let’s estimate the odds ratio and confidence interval for this study

CI for OR

          SNP+   SNP-   Total
Prog      12     13     25
No prog   62     103    165
Total     74     116    190

Based on this table, the estimated OR = (12*103)/(13*62) = 1.53
95% CI: (0.66, 3.57)
Should we reject the null hypothesis of OR=1?

Interpretation of OR
The estimated odds ratio is 1.53.
– The interpretation of this is that the ODDS of progression for patients with the susceptibility allele are 1.53 times the ODDS for patients without the allele
The 95% confidence interval for the odds ratio is (0.66, 3.57)
– Is there a significant association between SNP and disease?

[Output annotated with: estimated OR; estimated CI (Woolf)]

OR vs. RR
Although the odds ratio is interesting, the relative risk is more intuitive
If we have a rare disease, which is often the case for a case-control study,
\[ P(D^{-} \mid E^{+}) \approx P(D^{-} \mid E^{-}) \approx 1 \]
Therefore, in these cases, the odds ratio is also an estimate of the relative risk
\[ OR = \frac{ P(D^{+} \mid E^{+}) \,/\, P(D^{-} \mid E^{+}) }{ P(D^{+} \mid E^{-}) \,/\, P(D^{-} \mid E^{-}) } \approx \frac{P(D^{+} \mid E^{+})}{P(D^{+} \mid E^{-})} = RR \]
In other cases, the odds ratio can still provide a valid estimate of the relative risk (see other courses)

Hypothesis test with CI
1) H0: No association between SNP and progression (RD=0)
2) Dichotomous outcome, dichotomous predictor
3) Risk difference 95% confidence interval
4) Test statistic: Estimated RD = 0.05, 95% CI: (-0.052, 0.152)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

Hypothesis test with CI
1) H0: No association between SNP and progression (RR=1)
2) Dichotomous outcome, dichotomous predictor
3) Relative risk 95% confidence interval
4) Test statistic: Estimated RR = 1.45, 95% CI: (0.70, 3.00)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression

Hypothesis test with CI
1) H0: No association between SNP and progression (OR=1)
2) Dichotomous outcome, dichotomous predictor
3) Odds ratio 95% confidence interval
4) Test statistic: Estimated OR = 1.53, 95% CI: (0.66, 3.57)
5) p-value > 0.05
6) Since the p-value is greater than 0.05, we fail to reject the null hypothesis
7) We conclude that there is no significant association between SNP and progression
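The three interval-based tests above can be reproduced with a short Python sketch. It applies the asymptotic RD interval, the log-transformed RR interval, and Woolf's approximation for the OR to the SNP table. The function names and the hard-coded normal quantile are assumptions made for this illustration; STATA's defaults (for example, exact intervals for the OR) may give slightly different numbers.

```python
import math

# Standard normal quantile for a two-sided 95% interval (about 1.96)
Z_975 = 1.959963984540054


def risk_difference_ci(a, b, c, d, z=Z_975):
    """RD = p1 - p2 with the asymptotic normal interval."""
    m1, m2 = a + c, b + d            # column totals: exposed, unexposed
    p1, p2 = a / m1, b / m2
    rd = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / m1 + p2 * (1 - p2) / m2)
    return rd, rd - z * se, rd + z * se


def relative_risk_ci(a, b, c, d, z=Z_975):
    """RR = p1 / p2 with an interval built on the log scale."""
    m1, m2 = a + c, b + d
    log_rr = math.log((a / m1) / (b / m2))
    se_log = math.sqrt(c / (a * m1) + d / (b * m2))
    return tuple(math.exp(log_rr + k * z * se_log) for k in (0, -1, 1))


def odds_ratio_ci(a, b, c, d, z=Z_975):
    """OR = ad / bc with Woolf's interval on the log scale."""
    log_or = math.log((a * d) / (b * c))
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return tuple(math.exp(log_or + k * z * se_log) for k in (0, -1, 1))


# SNP table: a, b = progressed (SNP+, SNP-); c, d = did not progress (SNP+, SNP-)
a, b, c, d = 12, 13, 62, 103
rd_est = risk_difference_ci(a, b, c, d)
rr_est = relative_risk_ci(a, b, c, d)
or_est = odds_ratio_ci(a, b, c, d)

for name, (est, lo, hi) in (("RD", rd_est), ("RR", rr_est), ("OR", or_est)):
    print(f"{name}: {est:.3f}  95% CI: ({lo:.3f}, {hi:.3f})")
```

Each tuple holds the estimate followed by the lower and upper limits. The RD interval contains 0 and the RR and OR intervals contain 1, which matches the "fail to reject" conclusion in each of the three tests.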