Case-Control Studies: Statistical Analysis

Greg Stoddard
December 16, 2010
University of Utah School of Medicine

Rothman claims, “Properly carried out, case-control studies provide information that mirrors what could be learned from a cohort study, usually at considerably less cost and time.” [Rothman KJ, Epidemiology: An Introduction, 2002, p.73]

Goal: contrast the statistical approaches of the two study designs to verify Rothman’s claim.

Diagrammatically,

Cohort study (start from exposure, observe disease):
    E      -> D or not-D
    not-E  -> D or not-D

Case-control study (start from disease, look back at exposure):
    D      -> E or not-E
    not-D  -> E or not-E

Data Layout

Cohort Study
             E        Not-E
D            a        b         nD
Not-D        c        d         nnot-D
             NE       Nnot-E

Case-Control Study
             E        Not-E
D            a        b         ND
Not-D        c        d         Nnot-D
             nE       nnot-E

N = fixed, n = free to vary

Incidence proportion

Cohort study:        incidence proportion = disease cases / persons at risk
Case-control study:  incidence proportion = (not estimable)

The incidence proportion not being estimable is not much of a shortcoming. Given a study’s inclusion/exclusion criteria, the incidence proportion does not actually apply to a very wide patient population anyway. The goal is not to estimate incidence, but rather to assess an exposure-disease association. We can do that just fine with relative measures of effect, the risk ratio and odds ratio.

Relative measures of effect

Cohort study:
    risk ratio = (a/NE)/(b/Nnot-E)
    odds ratio = odds(D|E)/odds(D|not-E) = (a/c)/(b/d) = (ad)/(bc)

Case-control study:
    exposure odds ratio = odds(E|D)/odds(E|not-D) = (a/b)/(c/d) = (ad)/(bc)

So, as long as either E or D is free to vary, you get the same relative effect measure, the odds ratio, with both study designs.

Rare disease assumption

In the cohort study layout, NE = a + c and Nnot-E = b + d, so

RR = (a/NE)/(b/Nnot-E) = [a/(a+c)]/[b/(b+d)]

If the disease is rare (<10% in both the E and Not-E groups), then a is small relative to c and b is small relative to d, so c ≈ a + c and d ≈ b + d.
Substituting,

RR = [a/(a+c)]/[b/(b+d)] ≈ (a/c)/(b/d) = (ad)/(bc) = OR

So, the OR from a case-control study approximates the RR from a cohort study when the rare disease assumption is met.

Why the 10%, or 0.10, incidence proportion is a good cutpoint for “rare disease” is illustrated nicely in a figure published in: Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280(19):1690-91.

Aside: The formula in Zhang and Yu (1998) for converting an odds ratio to a risk ratio in cohort studies has been convincingly criticized as unreliable (Zou, 2004), so you should avoid using it. [Zou G. A modified Poisson regression approach to prospective studies with binary data. Am J Epidemiol 2004;159(7):702-706.]

Checking our progress

How far have we gotten, thus far, in verifying that a case-control study can mirror what can be learned in a cohort study?

We have seen that the OR is the same in both study designs. We have seen that the OR approximates the RR under the rare disease assumption, and so it has a straightforward interpretation.

However, cohort studies rarely use the odds ratio, nor do they use the risk ratio. Instead, cohort studies use survival analysis. Why?

Risk Ratio Analysis

This type of analysis ignores time-at-risk. That is, it assumes an equal follow-up time for every study subject.

             Exposed                           Non-Exposed
Follow-up    Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day          N       Cases     Risk            N       Cases     Risk            Risk Ratio
1            50      5         0.10            50      2         0.04            2.5
2            30      10        0.33            40      8         0.20            1.7
3            10      10        1.00            20      10        0.50            2.0
Total        90      25                        110     20

The risk ratio uses partial information (shown in blue on the original slide) from the complete data in the life table. Only the day-1 Begin N and the total disease cases enter the calculation.

Risk ratio analysis data

               Exposed     Not-Exposed
Disease        25 (50%)    20 (40%)
Not-Disease    25          30
N              50          50

Risk Ratio = (25/50)/(20/50) = 1.25
Chi-square test, p = 0.31

Analyzing these data in this way, we do not demonstrate a significant effect.
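The risk ratio arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes the reported chi-square test is the ordinary Pearson test without continuity correction (the slide does not say which variant was used); the function name is illustrative only.

```python
import math

def risk_ratio_and_chi2(cases_exp, n_exp, cases_unexp, n_unexp):
    """Crude risk ratio and Pearson chi-square test (assumed: no continuity correction)."""
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    observed = [
        [cases_exp, cases_unexp],                  # disease row
        [n_exp - cases_exp, n_unexp - cases_unexp] # not-disease row
    ]
    total = n_exp + n_unexp
    row_totals = [sum(row) for row in observed]
    col_totals = [n_exp, n_unexp]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return rr, chi2, p_value

rr, chi2, p = risk_ratio_and_chi2(25, 50, 20, 50)
print(f"RR = {rr:.2f}, chi-square = {chi2:.2f}, p = {p:.2f}")
# prints: RR = 1.25, chi-square = 1.01, p = 0.31
```

Under that assumption the sketch reproduces the RR of 1.25 and the p-value of 0.31 shown above.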
In fact, this crude RR underestimates each of the day-specific RR estimates.

Rate Ratio Analysis

Let’s see if we can do better with a rate ratio analysis. It uses a person-time denominator, so in that sense, it relaxes the equal time-at-risk assumption of the risk ratio analysis.

The rate ratio also uses partial information (shown in blue on the original slide) from the complete data in the same life table: the Total row, with 25 cases over 90 person-days in the exposed and 20 cases over 110 person-days in the non-exposed.

Rate ratio analysis data

               Exposed     Not-Exposed
Disease        25          20
Person-Days    90          110

Rate Ratio = (25/90)/(20/110) = 1.53
Binomial probability mid-p exact test for person-time data, p = 0.080

Analyzing these data in this way, we almost demonstrate a significant effect. Again, this crude rate ratio underestimates each of the day-specific risk ratio (rate ratio) estimates.

Inefficient Use of Time in Rate Ratio Analysis

The rate ratio analysis failed to convey the information in the life table because it considers only the ratio of cases to person-time, without distinguishing times to event from times to censoring.

person-time = total time for subjects = mean time x N

Suppose the individual times-at-risk for a sample are: 10, 20, and 30.
The person-time is computed as:

PT = total time for subjects = 10 + 20 + 30 = 60

which is equivalent to:

PT = mean time x N = (10 + 20 + 30)/3 x 3 = 20 x 3 = 60

So, a rate ratio analysis would find the following two scenarios equal, even though Group B outperforms Group A (let x----x denote time):

Group A
x-------------------------------------x (censored)
x-----x (died)
x--------x (died)
x--------------------------------------------x (censored)

Group B
x-------------------------------------x (died)
x-----x (censored)
x--------x (censored)
x--------------------------------------------x (died)

Hazard Ratio Analysis (Survival Analysis)

This analysis uses time-at-risk in a very complete way, using all of the information in the life table (the same life table shown under Risk Ratio Analysis).

From Cox regression, HR = 1.92, p = 0.032

The HR is identical to the Mantel-Haenszel summary risk ratio across the rows of the life table.

Aside

Showing a life table like this and pointing out that the HR is just the weighted average of the day-specific risk ratios, and so is a relative risk estimate, is a very clear way to explain the HR to a researcher.

Checking our progress

Recall, we are trying to verify that a case-control study can mirror what can be learned in a cohort study.

It appears, then, that we need to incorporate survival analysis into the case-control framework in order to keep up with what a cohort study can do.

It turns out we can do this, that is, use survival analysis in the case-control framework, if we tweak the study design slightly. The slight variant is called the case-cohort design (also called the density case-control design).

While presenting this design, I am going to show some simulation results. In this way, I can demonstrate that the case-cohort design really does perform as well as a cohort study design.
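Returning briefly to the life table used in the risk ratio, rate ratio, and hazard ratio slides, the arithmetic behind the crude rate ratio and the Mantel-Haenszel summary risk ratio can be sketched as follows. The sketch uses the standard Mantel-Haenszel risk ratio estimator with each follow-up day treated as a stratum. It does not fit a Cox model, and the names are illustrative only.

```python
# One tuple per follow-up day:
# (begin N exposed, cases exposed, begin N non-exposed, cases non-exposed)
LIFE_TABLE = [
    (50, 5, 50, 2),
    (30, 10, 40, 8),
    (10, 10, 20, 10),
]

def crude_rate_ratio(rows):
    """Cases per person-day in the exposed over cases per person-day in the non-exposed."""
    days_exp = sum(n1 for n1, _, _, _ in rows)      # each subject at risk contributes 1 day
    days_unexp = sum(n0 for _, _, n0, _ in rows)
    cases_exp = sum(a for _, a, _, _ in rows)
    cases_unexp = sum(b for _, _, _, b in rows)
    return (cases_exp / days_exp) / (cases_unexp / days_unexp)

def mantel_haenszel_rr(rows):
    """Mantel-Haenszel summary risk ratio, one stratum per follow-up day."""
    numerator = sum(a * n0 / (n1 + n0) for n1, a, n0, _ in rows)
    denominator = sum(b * n1 / (n1 + n0) for n1, _, n0, b in rows)
    return numerator / denominator

rate_ratio = crude_rate_ratio(LIFE_TABLE)
mh_rr = mantel_haenszel_rr(LIFE_TABLE)
print(f"crude rate ratio = {rate_ratio:.2f}, MH summary RR = {mh_rr:.2f}")
# prints: crude rate ratio = 1.53, MH summary RR = 1.92
```

The Mantel-Haenszel value agrees with the HR of 1.92 reported from the Cox regression, while the crude rate ratio of 1.53 falls below every day-specific ratio.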
Dataset

The dataset comes from Breslow and Day [Breslow NE, Day NE. Statistical Methods in Cancer Research, Vol II: The Design and Analysis of Cohort Studies. Lyon, France: IARC, 1987.]

Men (n=679) employed in a nickel refinery in South Wales were investigated to determine whether the risk of developing carcinoma of the bronchi and nasal sinuses (ICD = 160), which previous studies in the 1930s had associated with the refining of nickel, was present in this cohort.

Modified Dataset

I also modified the dataset, to create a second dataset that does not meet the rare disease assumption, by duplicating the cases five times.

Treating this dataset as the “population” and analyzing it, we know the answer that a case-control design sampling from this cohort is supposed to achieve.

The population relative measures are:

Population Relative    Actual Dataset with almost     Augmented Dataset with
Effect Measure         rare disease (3% in            frequent disease (15% in
                       unexposed, 12% in exposed)     unexposed, 60% in exposed)
Odds Ratio             3.76                           3.76
Risk Ratio             3.43                           2.65
Rate Ratio             4.76                           3.87
Hazard Ratio           5.02                           4.19

Classical Case-Control Study (controls are sampled from the population controls only)

Using a 2:1 sampling ratio

             Exposed to nickel    Not exposed to nickel    Total
Tumor        46                   10                       56      <- use all 56 cases
No Tumor     343                  280                      623     <- sample 56 x 2 controls
Total        389                  290                      679

Monte Carlo simulation, computing the OR from 1,000 samples, to get the long-run average of the OR. (Each sample keeps all 56 subjects from the tumor row of the population 2 x 2 table, and then randomly samples 112 subjects from the no-tumor row of the population 2 x 2 table.)
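The sampling scheme just described can be sketched in Python from the 2 x 2 counts alone. This is an illustration under stated assumptions, not the code used for the slides: controls are drawn without replacement, the seed is arbitrary, and all names are hypothetical.

```python
import random
import statistics

# Population 2 x 2 table (actual dataset).
CASES_EXPOSED, CASES_UNEXPOSED = 46, 10
NONCASES_EXPOSED, NONCASES_UNEXPOSED = 343, 280

def simulate_classical_case_control(n_samples=1000, ratio=2, seed=2010):
    """Mean OR over repeated samples: all cases kept, controls drawn from non-cases only."""
    rng = random.Random(seed)
    # 1 = exposed, 0 = unexposed; the sampling frame is the no-tumor row.
    noncases = [1] * NONCASES_EXPOSED + [0] * NONCASES_UNEXPOSED
    n_controls = ratio * (CASES_EXPOSED + CASES_UNEXPOSED)  # 112 controls
    odds_ratios = []
    for _ in range(n_samples):
        controls = rng.sample(noncases, n_controls)  # sampling without replacement (assumed)
        exposed_controls = sum(controls)
        unexposed_controls = n_controls - exposed_controls
        odds_ratios.append(
            (CASES_EXPOSED * unexposed_controls) / (CASES_UNEXPOSED * exposed_controls)
        )
    return statistics.mean(odds_ratios)

population_or = (CASES_EXPOSED * NONCASES_UNEXPOSED) / (CASES_UNEXPOSED * NONCASES_EXPOSED)
mean_or = simulate_classical_case_control()
print(f"population OR = {population_or:.2f}, mean simulated OR = {mean_or:.2f}")
```

The mean simulated OR should land close to the population OR of 3.76, typically slightly above it, because the OR is a ratio and its sampling distribution is right-skewed.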
The simulation results are (values in parentheses are the mean simulated ORs):

Classical case-control design (sample controls from no-tumor subjects only)

Population Relative    Actual Dataset with almost     Augmented Dataset with
Effect Measure         rare disease (3% in            frequent disease (15% in
                       unexposed, 12% in exposed)     unexposed, 60% in exposed)
Odds Ratio             3.76 (OR=3.81)                 3.76 (OR=3.77)
Risk Ratio             3.43                           2.65
Rate Ratio             4.76                           3.87
Hazard Ratio           5.02                           4.19

Case-Cohort Study Design

- In this design, we keep the cases. Then, we sample our controls from the total row of the population 2 x 2 table.
- For those cases that get mixed in with the controls, we set their status variable to 0, the control value.
- We then calculate the OR in the usual way.

Case-Cohort Study (controls are sampled from the population row totals, which include both cases and controls)

Using a 2:1 sampling ratio

             Exposed to nickel    Not exposed to nickel    Total
Tumor        46 (a)               10 (b)                   56      <- use all 56 cases
No Tumor     343                  280                      623
Total        389 (c)              290 (d)                  679     <- sample 56 x 2 controls

Here c and d denote the column totals (not the no-tumor counts), and the sampled controls are expected to contain kc exposed and kd unexposed subjects. The odds ratio is then a direct calculation of the risk ratio.

OR = (a x kd)/(b x kc) = (kad)/(kbc) = (ad)/(bc), where k = (56 x 2)/679
RR = (a/c)/(b/d) = (ad)/(bc) = OR

The simulation results are:

Case-cohort design (sample controls from total row of population 2 x 2 table)

Population Relative    Actual Dataset with almost     Augmented Dataset with
Effect Measure         rare disease (3% in            frequent disease (15% in
                       unexposed, 12% in exposed)     unexposed, 60% in exposed)
Odds Ratio             3.76                           3.76
Risk Ratio             3.43 (OR=3.48)                 2.65 (OR=2.67)
Rate Ratio             4.76                           3.87
Hazard Ratio           5.02                           4.19

For the case-cohort design, the rare-disease assumption is not required for the OR to be an estimate of the RR (Rothman and Greenland, 1998, p.110). We have demonstrated that to be the case. [Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA.]

It is nice to be able to use the OR to directly estimate the RR, and not worry about the rare disease assumption at all.
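The case-cohort sampling described above differs from the classical design only in the sampling frame for the controls. A minimal Python sketch, again built from the 2 x 2 counts, with sampling without replacement, an arbitrary seed, and hypothetical names as assumptions:

```python
import random
import statistics

# Counts from the population 2 x 2 table (actual dataset).
CASES_EXPOSED, CASES_UNEXPOSED = 46, 10
COHORT_EXPOSED, COHORT_UNEXPOSED = 389, 290  # total row: cases and non-cases together

def simulate_case_cohort(n_samples=1000, ratio=2, seed=2010):
    """Mean OR over repeated samples: all cases kept, controls drawn from the whole cohort."""
    rng = random.Random(seed)
    # 1 = exposed, 0 = unexposed. Any case drawn here simply counts as a control,
    # which mirrors setting its status variable to 0.
    cohort = [1] * COHORT_EXPOSED + [0] * COHORT_UNEXPOSED
    n_controls = ratio * (CASES_EXPOSED + CASES_UNEXPOSED)  # 112 controls
    odds_ratios = []
    for _ in range(n_samples):
        controls = rng.sample(cohort, n_controls)  # sampling without replacement (assumed)
        exposed_controls = sum(controls)
        unexposed_controls = n_controls - exposed_controls
        odds_ratios.append(
            (CASES_EXPOSED * unexposed_controls) / (CASES_UNEXPOSED * exposed_controls)
        )
    return statistics.mean(odds_ratios)

population_rr = (CASES_EXPOSED / COHORT_EXPOSED) / (CASES_UNEXPOSED / COHORT_UNEXPOSED)
mean_or = simulate_case_cohort()
print(f"population RR = {population_rr:.2f}, mean simulated OR = {mean_or:.2f}")
```

With controls drawn from the total row, the mean simulated OR should track the population RR of 3.43 rather than the population OR of 3.76, which is the convenience this design offers.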
Estimating the RR directly from the OR comes with a price, however. Since your controls are now “messy”, with cases mixed in, you do not have as clear a signal for the effect, so statistical power is reduced. You need to sample additional controls to make up the difference (to get back to the power of the classical case-control study).

Case-Cohort Study Design With Risk Set Sampling

In this design, you again keep all of the cases.

You then, again, sample controls from the total row of the population 2 x 2 table (sampled from cases and controls).

This time, however, you sample from total-row subjects who have the same or longer time-at-risk. This is called risk set sampling.

             Exposed                           Non-Exposed
Follow-up    Begin   Disease   Day-Specific    Begin   Disease   Day-Specific    Day-Specific
day          N       Cases     Risk            N       Cases     Risk            Risk Ratio
1            50      5         0.10            50      2         0.04            2.5
2            30      10        0.33            40      8         0.20            1.7
3            10      10        1.00            20      10        0.50            2.0

In this design, we also use a type of “total row” sampling. That is, we select our controls from the “Begin N” columns of the life table.

For the 5+2 cases that occurred on day 1, we sample our controls from the 50+50 persons still at risk on day 1.

For the 10+8 cases that occurred on day 2, we sample our controls from the 30+40 persons still at risk on day 2.

...and so on.

We do this by forming risk sets. For every case, we form a risk set that includes all subjects with an equal or longer follow-up time. Then, if we are using a 2:1 sampling ratio, we sample 2 controls from that risk set and match them with that case.
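The risk set construction can be sketched in Python on the small life table above. Several details are assumptions made only for illustration: subjects who leave the table without disease are treated as censored on their last day at risk (counts derived from the Begin N differences), the index case is excluded from its own risk set, and the function names are hypothetical.

```python
import random

def build_cohort():
    """Expand the life table into one record per subject: (exposed, last day at risk, is_case)."""
    # (exposed flag, day, cases on that day, subjects censored on that day)
    rows = [
        (1, 1, 5, 15), (1, 2, 10, 10), (1, 3, 10, 0),
        (0, 1, 2, 8), (0, 2, 8, 12), (0, 3, 10, 10),
    ]
    cohort = []
    for exposed, day, n_cases, n_censored in rows:
        cohort += [(exposed, day, True)] * n_cases
        cohort += [(exposed, day, False)] * n_censored
    return cohort

def risk_set_sample(cohort, controls_per_case=2, seed=2010):
    """For each case, draw controls from subjects with equal or longer follow-up."""
    rng = random.Random(seed)
    matched_sets = []
    for case_id, (_, day, is_case) in enumerate(cohort):
        if not is_case:
            continue
        # Risk set: everyone still at risk on the case's day, other cases included,
        # but not the index case itself (an assumption of this sketch).
        risk_set = [i for i, subj in enumerate(cohort) if subj[1] >= day and i != case_id]
        matched_sets.append((case_id, rng.sample(risk_set, controls_per_case)))
    return matched_sets

cohort = build_cohort()
matched_sets = risk_set_sample(cohort)
print(f"{len(matched_sets)} matched sets, each with 2 controls")
```

Each of the 45 cases ends up with its own matched set, and every control in a set was still at risk on the day that set's case occurred.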
Forming risk sets in this way is identical to sampling from the correct row of the Begin N columns, as we did above.

We have already seen that the OR from a case-cohort study design directly estimates the RR.

We are now doing a version of the case-cohort approach for each row of the life table.

We know that the HR is just the summary RR across the rows of the life table.

If we use conditional logistic regression, then, to account for the row-specific matching, it would seem the OR should directly estimate the HR.

Let’s see if that is true.

This time in the simulation, we will take the OR from the conditional logistic regression, rather than calculate it from a 2 x 2 table as we did for the previous simulations.

The mean of the 1,000 conditional logistic regression ORs will be our estimate of the HR.

The simulation results are:

Case-cohort design with risk set sampling

Population Relative    Actual Dataset with almost     Augmented Dataset with
Effect Measure         rare disease (3% in            frequent disease (15% in
                       unexposed, 12% in exposed)     unexposed, 60% in exposed)
Odds Ratio             3.76                           3.76
Risk Ratio             3.43                           2.65
Rate Ratio             4.76                           3.87
Hazard Ratio           5.02 (OR=5.42)                 4.19 (OR=4.43)

We were close, but the estimates appear to be biased.

The way it is really done is to use risk set sampling followed by an actual Cox regression.
To adjust the standard error for the way the sampling was done, there are three approaches:

    Prentice
    Self and Prentice
    Barlow

In Stata,

Prentice:
    stcascoh, alpha(.18)    // risk set sampling
    stcox nickel, robust

Self and Prentice:
    stcascoh, alpha(.18)    // risk set sampling with log weights (_wSelPre)
    stcox nickel, robust offset(_wSelPre)

Barlow:
    stcascoh, alpha(.18)    // risk set sampling with log weights (_wBarlow)
    stcox nickel, robust offset(_wBarlow)

The simulation results are:

Case-cohort design with risk set sampling (Prentice Method)

Population Relative    Actual Dataset with almost     Augmented Dataset with
Effect Measure         rare disease (3% in            frequent disease (15% in
                       unexposed, 12% in exposed)     unexposed, 60% in exposed)
Odds Ratio             3.76                           3.76
Risk Ratio             3.43                           2.65
Rate Ratio             4.76                           3.87
Hazard Ratio           5.02 (HR=5.08)                 4.19

Estimates appear unbiased using this approach.