Chapter 3-14. Case-Cohort Study Design Comparison of Three Relative Effect Measures in Cohort Studies (Risk Ratio, Rate Ratio, and Hazard Ratio) This comparison, pages 1 to 6, was presented in Chapter 11. It is repeated here for review. The topic of this chapter is how to do design and analyze case-control studies to obtain the same types of effect estimates. For illustration, we will use the following data in life table format from a hypothetical cohort study. Life Table of Hypothetical Data Exposed Non-Exposed Follow- Begin Disease DayBegin Disease DayDayup day N Cases Specific N Cases Specific Specific Risk Risk Risk Ratio 1 50 5 0.10 50 2 0.04 2.5 2 30 10 0.33 40 8 0.20 1.7 3 10 10 1.00 20 10 0.50 2.0 totals 90 25 110 20 These data can be entered into Stata using the following commands in the gregchapter4.do dofile. clear input day exposure disease count 1 1 1 5 1 1 0 15 2 1 1 10 2 1 0 10 3 1 1 10 1 0 1 2 1 0 0 8 2 0 1 8 2 0 0 12 3 0 1 10 3 0 0 10 end drop if count==0 expand count drop count _____________________ Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010. Chapter 3-14 (revision 16 May 2010) p. 1 To verify the data were corrected entered, we request a life table using ltable day disease ,by(exposure) noadjust intervals(1) hazard Beg. Cum. Std. Std. Interval Total Failure Error Hazard Error [95% Conf. Int.] ------------------------------------------------------------------------------exposure 0 1 2 50 0.0400 0.0277 0.0400 0.0283 0.0048 0.1114 2 3 40 0.2320 0.0646 0.2000 0.0707 0.0863 0.3606 3 4 20 0.6160 0.0917 0.5000 0.1581 0.2398 0.8542 exposure 1 1 2 50 0.1000 0.0424 0.1000 0.0447 0.0325 0.2048 2 3 30 0.4000 0.0825 0.3333 0.1054 0.1598 0.5695 3 4 10 1.0000 . 1.0000 0.3162 0.4795 1.7085 ------------------------------------------------------------------------------- which agrees with the original table. Chapter 3-14 (revision 16 May 2010) p. 2 Risk Ratio Analysis This type of analysis ignores time-at-risk. For that reason, it assumes an equal follow-up time for every study subject. Life Table of Hypothetical Data Exposed Non-Exposed Follow- Begin Disease DayBegin Disease DayDayup day N Cases Specific N Cases Specific Specific Risk Risk Risk Ratio 1 5 0.10 2 0.04 2.5 50 50 2 30 10 0.33 40 8 0.20 1.7 3 10 10 1.00 20 10 0.50 2.0 totals 90 110 25 20 The risk ratio analysis uses partial information (shown in blue) from the complete data in the life table. Risk Ratio Analysis Data Exposed Non-Exposed Disease 25 (50%) 20 (40%) Non-Disease 25 30 N 50 50 cs disease exposure | exposure | | Exposed Unexposed | Total -----------------+------------------------+---------Cases | 25 20 | 45 Noncases | 25 30 | 55 -----------------+------------------------+---------Total | 50 50 | 100 | | Risk | .5 .4 | .45 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Risk difference | .1 | -.0940265 .2940265 Risk ratio | 1.25 | .8064465 1.937512 Attr. frac. ex. | .2 | -.2400079 .4838742 Attr. frac. pop | .1111111 | +----------------------------------------------chi2(1) = 1.01 Pr>chi2 = 0.3149 Analyzing these data in this way, we do not demonstrate a significant effect (RR=1.25, p=0.315). In fact, this crude RR underestimates each of the day-specific RR estimates. Chapter 3-14 (revision 16 May 2010) p. 3 Rate Ratio Analysis This type of analysis uses time-a-risk, but in a crude way. It does not assume an equal follow-up time for each study subject. It assumes, however, that risk is constant across the follow-up time. Life Table of Hypothetical Data Exposed Non-Exposed Follow- Begin Disease DayBegin Disease DayDayup day N Cases Specific N Cases Specific Specific Risk Risk Risk (and Rate*) Ratio 1 50 5 0.10 50 2 0.04 2.5 2 30 10 0.33 40 8 0.20 1.7 3 10 10 1.00 20 10 0.50 2.0 totals 90 25 110 20 * Since the intervals are each one day, the day-specific person-time is just the day-specific beginning sample size, and so the day-specific risk is also the day-specific rate). The rate ratio analysis uses partial information (shown in blue) from the complete data in the life table. Rate Ratio Analysis Data Exposed Disease 25 (50%) Person-Days 90 Non-Exposed 20 (40%) 110 ir disease exposure day | exposure | | Exposed Unexposed | Total -----------------+------------------------+---------disease | 25 20 | 45 day | 90 110 | 200 -----------------+------------------------+---------| | Incidence Rate | .2777778 .1818182 | .225 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Inc. rate diff. | .0959596 | -.0389695 .2308887 Inc. rate ratio | 1.527778 | .8147248 2.900724 Attr. frac. ex. | .3454545 | -.2274083 .6552584 Attr. frac. pop | .1919192 | +----------------------------------------------(midp) Pr(k>=25) = 0.0800 (midp) 2*Pr(k>=25) = 0.1599 (exact) (exact) (exact) (exact) Analyzing these data in this way, we almost demonstrate a significant effect (IRR=1.53, p=0.080). Notice again, this crude IRR underestimates each of the day-specific IRR estimates. Chapter 3-14 (revision 16 May 2010) p. 4 Hazard Ratio Analysis (Survival Analysis) This type of analysis uses time-a-risk in a very complete way, using all of the information from the life table. It does not assume an equal follow-up time for each study subject. It allows for and models a changing risk across the follow-up time. Follow- Begin up day N 1 2 3 totals 50 30 10 90 Exposed Non-Exposed Disease DayBegin Disease DayDayCases Specific N Cases Specific Specific Risk Risk Risk Ratio 5 0.10 50 2 0.04 2.5 10 0.33 40 8 0.20 1.7 10 1.00 20 10 0.50 2.0 25 110 20 Analyzing these data using survival analysis, stset day ,failure(disease==1) stcox exposure Cox regression -- Breslow method for ties No. of subjects = No. of failures = Time at risk = 100 45 200 Number of obs = 100 LR chi2(1) = 4.65 Log likelihood = -174.40643 Prob > chi2 = 0.0310 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------exposure | 1.916208 .5796004 2.15 0.032 1.059198 3.466632 ------------------------------------------------------------------------------ This time, we observe a significant effect (HR = 1.92, p = 0.032). Chapter 3-14 (revision 16 May 2010) p. 5 Inefficient Use of Time in Rate Ratio Analysis The rate ratio analysis only considers the ratio of cases to average person-time, without distinguishing times to event and times to censored. person-time = total time for subjects = mean time N Suppose the individual times-at-risk for a sample are: 10, 20, and 30. The person-time is computed as: PT = total time for subjects =10+20+30 = 60 which is equivalent to PT = mean time N = (10+20+30)/3 3 = 20 3 = 60 Thus, if we had the scenario where events occurred early while censoring occurred later in one study group, while in the other study group an equal number of events occurred later while censoring occurred early, the person time could be equal for the two study groups and we would erroneously conclude no difference in rates (rate ratio =1) between the two groups. (let x----x denote time) x-------------------------------------x (censored) x-----x (died) x--------x (died) x--------------------------------------------x (censored) Group A x-------------------------------------x (died) x-----x (censored) x--------x (censored) x--------------------------------------------x (died) Group B In this example, the person-time is equal and the rate ratio = 1, yet clearly Group A shows greater risk for death (Group B survives longer). Conclusion: Cox regression is sensitive to a changing risk across time, while Poisson regression (or a rate ratio analysis) is not. Usually both approaches beat the risk ratio approach, since they have the advantage of using more information, namely time, in the analysis. Chapter 3-14 (revision 16 May 2010) p. 6 How This Relates to Case-Control Studies It would be nice if we could gain the additional power of a hazard ratio analysis somehow in a case-control study. In a case-control study, we do not use time in the analysis. It would seem, then, that a casecontrol study can do no better than the risk ratio analysis from a cohort study, which also does not use time in the analysis. If the rare-disease assumption is meet, the ordinary case-control study OR is approximately the RR that would be obtained in a cohort study. However, if a case-cohort study design is used, which we will see how to design in this chapter, the OR provides an unbiased estimate of the RR, without the need for the rare-disease assumption. Furthermore, if a density case-control study design is used, which is also referred to by many as a case-cohort study design, the OR provides an unbiased estimate of the HR, also without the need for the rare-disease assumption. Thus, we can obtain the benefit of a HR analysis, which improves our chances of demonstrating an exposure-disease association. We will see how to conduct the sampling required for these three variants of the case-control study design, use simulation to demonstrate what the OR estimates, and explain why it estimates these effect measures. Chapter 3-14 (revision 16 May 2010) p. 7 Dataset We will use the dataset found in the Breslow and Day (1987, Appendix VIII and Appendix ID). Men (n=679) employed in a nickel refinery in South Wales were investigated to determine whether the risk of developing carcinoma of the bronchi and nasal sinuses (ICD = 160), which had been associated with the refining of nickel from previous studies in the 1930s, was present in this cohort. The data are in the file nickelrefinary.csv. The variables are: CaseID Case ID PrimaryICD Primary ICD Code Exposure Nickel Exposure Level DateBirth Date of Birth AgeEmp Age First Employed AgeBegFol Age Follow-up Began AgeEndFol Age at Death or Loss Executing the following commands in the do-file editor, we see up the variables and save them to a new file (this has already been done, with nickelrefinary.dta in the datasets & do-files subdirectory. * -- set up variables ------------------------------clear set mem 10M cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "IntroEpiCourse\datasets & do-files" insheet using nickelrefinary.csv * -- set up tumor disease variable gen tumor = cond(primaryicd==160,1,0) replace tumor = . if primaryicd == . label define tumorlab 0 "0. no tumor" 1 "1. tumor" label values tumor tumorlab label variable tumor "carcinoma of the bronchi and nasal sinuses" tab tumor * -- set up nickel exposure variable gen nickel = cond(exposure>0,1,0) replace nickel = . if exposure == . label define nickellab 0 "0. no exposure" 1 "1. exposure" label values nickel nickellab label variable nickel "occupational exposure to nickel" tab nickel * -- set up time-at-risk variable gen timerisk = ageendfol - agebegfol label variable timerisk "time at risk" sum timerisk save nickelrefinary , replace * -- end set up variables --------------------------- Chapter 3-14 (revision 16 May 2010) p. 8 Creating a Dataset Without Rare Disease So that we have a dataset where the rare disease assumption is not met, we next duplicate the cases five times save this augmented dataset to a separate file. (This is for illustration only—of course you would not do this in an actual data analyis.) * -- set up data with 5 x cases use nickelrefinary, clear tab tumor keep if tumor==1 // reduce to cases only save tumorcases, replace // save cases to file use nickelrefinary, clear // bring data back in append using tumorcases append using tumorcases append using tumorcases append using tumorcases tab tumor save nickelrefinary5xcases, replace This has already been done. The file nickelrefinary5xcases.dta is in the datasets & do-files subdirectory. Population Effect Measures (Original Data With Rare Disease) For illustration, we will assume our N=679 represents the population that we will sample from. Doing this, we can determine the population effect measures that our samples will be estimates of. Reading in the original data (not the 5 x cases dataset), File Open Find the directory where you copied the course CD Change to the subdirectory datasets & do-files Single click on nickelrefinary.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\ nickelrefinary.dta ", clear * which must be all on one line, or use: cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use nickelrefinary.dta, clear Chapter 3-14 (revision 16 May 2010) p. 9 Computing the “population” (full cohort dataset before we take a sample) odds ratio, Statistics Observational/Epi. analysis Tables for epidemiologists case control odds ratio Main tab: Case variable: tumor Exposed variable: nickel OK cc tumor nickel cc tumor nickel Proportion | Exposed Unexposed | Total Exposed -----------------+------------------------+---------------------Cases | 46 10 | 56 0.8214 Controls | 343 280 | 623 0.5506 -----------------+------------------------+---------------------Total | 389 290 | 679 0.5729 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Odds ratio | 3.755102 | 1.824533 8.48588 (exact) Attr. frac. ex. | .7336957 | .4519145 .8821572 (exact) Attr. frac. pop | .6026786 | +----------------------------------------------chi2(1) = 15.41 Pr>chi2 = 0.0001 Chapter 3-14 (revision 16 May 2010) p. 10 Computing the “population” (full cohort dataset before we take a sample) risk ratio, Statistics Observational/Epi. analysis Tables for epidemiologists Cohort study: risk ratio etc. Main tab: Case variable: tumor Exposure variable: nickel OK cs tumor nickel | occupational exposure to| | nickel | | Exposed Unexposed | Total -----------------+------------------------+---------Cases | 46 10 | 56 Noncases | 343 280 | 623 -----------------+------------------------+---------Total | 389 290 | 679 | | Risk | .1182519 .0344828 | .0824742 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Risk difference | .0837692 | .0454195 .1221188 Risk ratio | 3.429306 | 1.760546 6.679826 Attr. frac. ex. | .7083958 | .4319943 .8502955 Attr. frac. pop | .5818966 | +----------------------------------------------chi2(1) = 15.41 Pr>chi2 = 0.0001 Chapter 3-14 (revision 16 May 2010) p. 11 Computing the “population” (full cohort dataset before we take a sample) rate ratio, Statistics Observational/Epi. analysis Tables for epidemiologists Incidence rate ratios Main tab: Case variable: tumor Exposed variable: nickel Person-time variable: timerisk OK ir tumor nickel timerisk | occupational exposure to| | nickel | | Exposed Unexposed | Total -----------------+------------------------+---------carcinoma of the | 46 10 | 56 time at risk | 7546.318 7801.739 | 15348.06 -----------------+------------------------+---------| | Incidence Rate | .0060957 .0012818 | .0036487 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Inc. rate diff. | .0048139 | .0028815 .0067463 Inc. rate ratio | 4.755696 | 2.367283 10.56925 Attr. frac. ex. | .7897259 | .5775749 .9053859 Attr. frac. pop | .6487034 | +----------------------------------------------(midp) Pr(k>=46) = 0.0000 (midp) 2*Pr(k>=46) = 0.0000 Chapter 3-14 (revision 16 May 2010) (exact) (exact) (exact) (exact) p. 12 Computing the “population” (full cohort dataset before we take a sample) hazard ratio, 1) informing Stata we have survival time variables: Statistics Survival analysis Set up and utilities Declare data to be survival time data Main tab: Time variable: timerisk Failure event: Failure variable: tumor Failure values: 1 OK stset timerisk , failure(tumor==1) 2) requesting a Cox regression, Statistics Survival analysis Regression models Cox proportional hazards model Model tab: Independent variables: nickel OK stcox nickel Cox regression -- Breslow method for ties No. of subjects = No. of failures = Time at risk = Log likelihood = 679 56 15348.05715 -321.86045 Number of obs = 679 LR chi2(1) Prob > chi2 = = 27.68 0.0000 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 5.022065 1.765996 4.59 0.000 2.520922 10.00472 ------------------------------------------------------------------------------ Chapter 3-14 (revision 16 May 2010) p. 13 Population Effect Measures (Augmented Data With Frequent Disease) We then do something similar with the augmented dataset * 5 x cases sample stats use nickelrefinary5xcases , clear cc tumor nickel cs tumor nickel ir tumor nickel timerisk stset timerisk , failure(tumor==1) stcox nickel . cc tumor nickel Proportion | Exposed Unexposed | Total Exposed -----------------+------------------------+---------------------Cases | 230 50 | 280 0.8214 Controls | 343 280 | 623 0.5506 -----------------+------------------------+---------------------Total | 573 330 | 903 0.6346 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Odds ratio | 3.755102 | 2.635221 5.405224 (exact) Attr. frac. ex. | .7336957 | .6205252 .8149938 (exact) Attr. frac. pop | .6026786 | +----------------------------------------------chi2(1) = 61.12 Pr>chi2 = 0.0000 . cs tumor nickel | occupational exposure to| | nickel | | Exposed Unexposed | Total -----------------+------------------------+---------Cases | 230 50 | 280 Noncases | 343 280 | 623 -----------------+------------------------+---------Total | 573 330 | 903 | | Risk | .4013962 .1515152 | .3100775 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Risk difference | .249881 | .1941373 .3056248 Risk ratio | 2.649215 | 2.013878 3.484987 Attr. frac. ex. | .6225296 | .5034455 .7130549 Attr. frac. pop | .5113636 | +----------------------------------------------chi2(1) = 61.12 Pr>chi2 = 0.0000 Chapter 3-14 (revision 16 May 2010) p. 14 . ir tumor nickel timerisk | occupational exposure to| | nickel | | Exposed Unexposed | Total -----------------+------------------------+---------carcinoma of the | 230 50 | 280 time at risk | 10233.57 8608.377 | 18841.95 -----------------+------------------------+---------| | Incidence Rate | .022475 .0058083 | .0148605 | | | Point estimate | [95% Conf. Interval] |------------------------+---------------------Inc. rate diff. | .0166667 | .0133458 .0199877 Inc. rate ratio | 3.869473 | 2.839303 5.365002 Attr. frac. ex. | .7415669 | .6478009 .8136068 Attr. frac. pop | .6091442 | +----------------------------------------------(midp) Pr(k>=230) = 0.0000 (midp) 2*Pr(k>=230) = 0.0000 (exact) (exact) (exact) (exact) . stset timerisk , failure(tumor==1) . stcox nickel Cox regression -- Breslow method for ties No. of subjects = No. of failures = Time at risk = Log likelihood 903 280 18841.95076 = -1683.4208 Number of obs = 903 LR chi2(1) Prob > chi2 = = 105.83 0.0000 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 4.188323 .660423 9.08 0.000 3.074829 5.705048 ------------------------------------------------------------------------------ Population Relative Effects Considering our total sample as our population, we observed the following population effect measures. Population Relative Effect Measure Odds Ratio (OR) Risk Ratio (RR) Incidence Rate Ratio (IRR) Hazard Ratio (HR) Chapter 3-14 (revision 16 May 2010) Actual Dataset with almost rare disease (3% in unexposed 12% in exposed) 3.76 3.43 4.76 5.02 Augmented Dataset with frequent disease (15% in unexposed 60% in exposed) 3.76 2.65 3.87 4.19 p. 15 Classical Case-Control Study (Controls Are Sampled From Population Controls Only) Most researchers choose their controls from the population controls only. First we will do this for one sample. We will use 2 controls for each case (2:1 sampling ratio). In real practice, you might choose a greater number, such as 8 controls for each case. We use 2:1 in this illustration, in order to keep the sample size much smaller than the population size, which makes the simulation more believable. Using the original dataset, File Open Find the directory where you copied the course CD Change to the subdirectory datasets & do-files Single click on nickelrefinary.dta Open use "C:\Documents and Settings\u0032770.SRVR\Desktop\ Biostats & Epi With Stata\datasets & do-files\ nickelrefinary.dta ", clear * which must be all on one line, or use: cd "C:\Documents and Settings\u0032770.SRVR\Desktop\" cd "Biostats & Epi With Stata\datasets & do-files" use nickelrefinary.dta, clear Chapter 3-14 (revision 16 May 2010) p. 16 Crosstabulating disease and exposure, Statistics Summaries, tables & tests Tables Two-way tables with measures of association Main tab: Row variable: tumor Column variable: nickel Cell contents: Within-column relative frequencies OK tabulate tumor nickel, column carcinoma | of the | bronchi and | occupational exposure nasal | to nickel sinuses | 0. no exp 1. exposu | Total ------------+----------------------+---------0. no tumor | 280 343 | 623 | 96.55 88.17 | 91.75 ------------+----------------------+---------1. tumor | 10 46 | 56 | 3.45 11.83 | 8.25 ------------+----------------------+---------Total | 290 389 | 679 | 100.00 100.00 | 100.00 From this population 2 × 2 table, we want to use all of the cases (n=56, the entire tumor row) and twice as many controls (sample 112 controls from the “no tumor” row), carcinoma | of the | bronchi and | occupational exposure nasal | to nickel sinuses | 0. no exp 1. exposu | Total ------------+----------------------+---------0. no tumor | 280 343 | 623 | 96.55 88.17 | 91.75 ------------+----------------------+---------1. tumor | 10 46 | 56 | 3.45 11.83 | 8.25 ------------+----------------------+---------Total | 290 389 | 679 | 100.00 100.00 | 100.00 <= sample 56 x 2 = 112 controls <= use all 56 cases First, we set the random number generator seed so we can get the same sample if we need to replicate our results later, set seed 999 We now want to sample n=112 if tumor = 0 and just keep all of the cases (tumor = 1), Chapter 3-14 (revision 16 May 2010) p. 17 Statistics Resampling Draw random sample Main tab: Sample size: 112 by/if/in tab: Restrict to observations (if expression): tumor == 0 OK sample 112 if tumor==0, count <or> sample 112 , count , if tumor==0 Seeing what we got, tabulate tumor nickel, column carcinoma | of the | bronchi and | occupational exposure nasal | to nickel sinuses | 0. no exp 1. exposu | Total ------------+----------------------+---------0. no tumor | 56 56 | 112 | 84.85 54.90 | 66.67 ------------+----------------------+---------1. tumor | 10 46 | 56 | 15.15 45.10 | 33.33 ------------+----------------------+---------Total | 66 102 | 168 | 100.00 100.00 | 100.00 which is just what we wanted. Chapter 3-14 (revision 16 May 2010) p. 18 Now, computing the sample odds ratio, cc tumor nickel Proportion | Exposed Unexposed | Total Exposed -----------------+------------------------+-----------------------Cases | 46 10 | 56 0.8214 Controls | 56 56 | 112 0.5000 -----------------+------------------------+-----------------------Total | 102 66 | 168 0.6071 | | | Point estimate | [95% Conf. Interval] |------------------------+-----------------------Odds ratio | 4.6 | 2.019615 11.17006 (exact) Attr. frac. ex. | .7826087 | .5048562 .910475 (exact) Attr. frac. pop | .6428571 | +------------------------------------------------chi2(1) = 16.17 Pr>chi2 = 0.0001 In our sample, we get an OR of 4.60 (in contrast to the population OR of 3.76), which seems rather off. However, we cannot judge if this is an unbiased estimate for the population OR, because this estimate is subject to sampling variability. Summarizing the steps to conducting a classical case-control study, * -- classical sampling for case-control study * sample controls from controls) use nickelrefinary , clear tab tumor nickel, col // observe 56 cases, 623 controls set seed 999 sample 112 , count , if tumor==0 * used 2:1 sampling ratio selecting controls from control row tab tumor nickel, col // 56 cases, 112 controls cc tumor nickel // compute odds ratio Chapter 3-14 (revision 16 May 2010) p. 19 Using a Monte Carlo simulation, we compute the OR from 1,000 separate samples, to determine the long-run average OR. This will inform us whether or not the approach we used produces unbiased estimates of the population OR. * -- long-run average ordinary case-control study (control row sampling) clear set obs 1 gen or=. * create a file with 1 missing observation (a blank file) save or_control_row, replace * set seed 999 forvalues i=1(1)1000{ quietly use nickelrefinary, clear quietly gen or=. // variable to hold odds ratio quietly sample 112 , count, if tumor==0 // control row sampling quietly cc tumor nickel quietly replace or=r(or) in 1/1 quietly keep or quietly keep in 1/1 quietly append using or_control_row quietly save or_control_row, replace } use or_control_row, clear histogram or sum or .4 0 .2 Density .6 .8 The result is: 2 4 6 8 or Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------or | 1000 3.814517 .6563894 2.090909 7.965854 Using the mean of the 1,000 ORs as the long-run average, we get OR = 3.81. Chapter 3-14 (revision 16 May 2010) p. 20 Performing a similar simulation for the augmented dataset, with a 1:1 sampling ratio to keep the sample as small as possible relative to the population size, using the mean of the 1,000 ORs as the long-run average, we get OR = 3.77. The simulation results are: Classic Case-Control Design (sample controls from controls only) Almost Rare Disease Frequent Disease (3% unexposed, 12% exposed) (15% unexposed, 60% exposed) Relative Population Simulation Long Population Simulation Long Effect Measures Run Average Measures Run Average Measure OR OR OR 3.76 3.81 3.76 3.77 RR 3.43 2.65 IRR 4.76 3.87 HR 5.02 4.19 We see that the OR is an unbiased estimate of the OR, regardless of the rare disease assumption. The OR is not an unbiased estimate of the RR, however. If the rare disease assumption was met, however, it would be a reasonable close estimate. Notice the OR=3.81 is much closer to the RR=3.43 in the “almost rare disease” column. Chapter 3-14 (revision 16 May 2010) p. 21 Let’s see why the odds ratio is not affected by our choice of sampling ratio (e.g, 2:1, 3:1, etc.). Data Layout for Case-Control Study (Stata’s cc and cci commands) Exposure Disease Exposed (1) Unexposed (0) Totals* cases (1) a b M1 noncases (0) c d M0 Totals* m1 m0 *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase Ns are observed. odds ratio = ad/bc With a 1:1 sampling ratio, 1 M1 = M0 = 1 (c+d) = 1c + 1d With a 2:2 sampling ratio, 2 M1 = M0 = 2 (c+d) = 2c + 2d We see that both c and d are multiplied by the same constant. This has no effect on the odds ratio, regardless of the constant k, odds ratio = a(kd)/b(kc) = ad/bc since the k’s cancel Sampling in general, Data Layout for Case-Control Study Exposure Disease Exposed (1) Unexposed (0) Totals* cases (1) a b M1 noncases (0) c d M0 Totals* m1 m0 *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase Ns are observed. <-select some n fraction of cases <- select some k fraction of controls odds ratio = (na)(kd)/(nb)(kc) = ad/bc since both the n’s and k’s cancel Chapter 3-14 (revision 16 May 2010) p. 22 Case-Cohort Study Design Rothman (2002, pp.84-86) suggests sampling controls from the entire population, regardless of case or control status. Thus some cases may be selected as controls as well. (Rothman does not even mention the classical case-control design, where controls are sampled only from nondiseased subjects.) When controls are selected this way, which is from the entire population at risk, than the study design is called a case-cohort design, rather than a case-control design (Rothmand and Greenland, 1998, p.108). This time we sample from the total row Data Layout for Case-Control Study Exposure Disease Exposed (1) Unexposed (0) Totals* cases (1) a b M1 noncases (0) c d M0 Totals* m1 m0 *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase Ns are observed. <- select some k fraction of controls so choosing our controls as some fraction k of the total row, we have c = km1 d = km0 Our odds ratio is then a a 1 a ad a(km0 ) a km0 km1 k m1 m1 OR b 1 b b RR bc b(km1 ) km1 b km0 k m0 m0 So if we choose our controls as some fraction of the total row, our odds ratio is identically the risk ratio. That is, in a case-cohort study, the OR directly estimates the RR, regardless of the rare disease assumption. Exercise. Look at the Cai methods paper. Notice in the second sentence of the abstract, they point out that the controls are sampled from the “total row of the full cohort 2 x 2 table” when they state, ...a case-cohort design, which consists of a small random sample of the whole cohort and all of the disease subjects...” In the second paragraph, they cite some studies that have used the case-cohort design. Chapter 3-14 (revision 16 May 2010) p. 23 Conducting a case-cohort study on this full-cohort, we use all of the cases and take a random sample of controls from the total row of the 2 x 2 table. This is a bit more complex, so we will commands rather than menus, since it will be easier to see what we are doing. * -- case-cohort study (sample controls from total sample) use nickelrefinary , clear tab tumor nickel, col // observe 56 cases, 623 controls keep if tumor==1 // reduce sample to cases save tumorcases, replace // save cases to file * use nickelrefinary , clear set seed 999 sample 112 , count // 2:1 sampling ratio, selecting controls from // total row since we do not use the “if // tumor==0” this time replace tumor=0 if tumor==1 // set these all to control status append using tumorcases // bring cases back in tab tumor nickel, col // 56 cases, 112 controls cc tumor nickel Then using the Monte Carlo method to obtain the long-run average OR from the original (rare disease) dataset * -- long-run average ordinary case-control study (total row sampling) clear set obs 1 gen or=. save or_control_row, replace // create a file with 1 missing observation * set seed 999 forvalues i=1(1)1000{ quietly use nickelrefinary, clear quietly gen or=. // variable to hold odds ratio quietly sample 112 , count // total row sampling quietly replace tumor=0 if tumor==1 // set all to control status quietly append using tumorcases // bring cases back in quietly cc tumor nickel quietly replace or=r(or) in 1/1 quietly keep or quietly keep in 1/1 quietly append using or_control_row quietly save or_control_row, replace } use or_control_row, clear histogram or sum or with a similar simulation using the augmented (frequent disease) dataset, also in the do-file. Chapter 3-14 (revision 16 May 2010) p. 24 The simulation results are: Classic Case-Control Design (sample controls from controls only) Almost Rare Disease Frequent Disease (3% unexposed, 12% exposed) (15% unexposed, 60% exposed) Relative Population Simulation Long Population Simulation Long Effect Measures Run Average Measures Run Average Measure OR OR OR 3.76 3.76 RR 3.43 3.48 2.65 2.67 IRR 4.76 3.87 HR 5.02 4.19 For this sampling approach, we see that the sample OR is an unbiased estimator of the population RR, as Rothman claims it should be. For the case-cohort design, the rare-disease assumption is not required for the OR to be an estimate of RR (Rothman and Greenland, 1998, p.110). We have demonstrated that to be the case. Chapter 3-14 (revision 16 May 2010) p. 25 Density Case-Control Study Where Controls Are Sampled From Cases & Controls Which Have Same Or Longer Time-At-Risk (Risk Set Sampling) Rothman (2002, pp.76-80) suggests sampling controls from the entire population, regardless of case or control status, but also select the controls from subjects with similar or longer time at risk as the cases, in a matched fashion. Again, some cases may be selected as controls as well. This is called risk-set sampling. In this study design, we want the OR to be an unbiased estimate of the hazard ratio, HR, where HR is a type of weighted average of the day-specific risk ratios. Using the hypothetical data table we used above, Life Table of Hypothetical Data Exposed Non-Exposed Follow- Begin Disease DayBegin Disease DayDayup day N Cases Specific N Cases Specific Specific Risk Risk Risk Ratio 1 50 5 0.10 50 2 0.04 2.5 2 30 10 0.33 40 8 0.20 1.7 3 10 10 1.00 20 10 0.50 2.0 totals 90 25 110 20 In this design, we also use a type of “total row sampling”. That is we select our controls from the “Begin N” column’s of the life table. For the 5+2 cases that occurred on day 1, we sample our controls from the 50+50 persons still at risk on day 1. For the 10+8 cases that occurred on day 2, we sample our controls from the 30+40 persons still at risk on day 2, and so on. We do this by forming risk sets. For every case, we form a risk set that includes all subjects with an equal or longer follow-up time. Then we sample 2 controls from that risk set, if we use a 2:1 sampling ratio, that we match with that case. This is identically sampling from the correct row of the Begin N column. Just as we saw in the case-cohort study, proportionality is maintained, which guarantees that our OR is an estimate of RR for each row of the life table. Chapter 3-14 (revision 16 May 2010) p. 26 Since we will analyze the data with conditional logistic regression, which maintains the matches with the case and controls on the same row of the life table, we maintain the day-specific RR analysis. Finally, the OR that comes out of the conditional logistic regression is a type of weighted average across the rows. Cox regression, which computes the HR directly, also summarizes the day-specific RR, (which is also called the day-specific HR), computing a type of weighted average which it reports as the HR. Thus, the conditional logistic regression from the case-control study does the same thing as the Cox regression from a cohort study. (NOTE: the conditional logistic approach is biased and so should not be used, as is pointed out below.) Let’s do it. Obtaining a risk-set sample (density sample) and computing the OR, * --- density case-control study ---use nickelrefinary, clear // begin with full dataset stset timerisk , failure(tumor==1) set seed 999 // set seed so can replicate the analysis * following command selects 2 controls per case from risk set of * subjects with same or longer time-at-risk sttocc, number(2) clogit _case nickel, group(_set) or we get Conditional (fixed-effects) logistic regression Log likelihood = -52.698159 Number of obs LR chi2(1) Prob > chi2 Pseudo R2 = = = = 168 17.65 0.0000 0.1434 -----------------------------------------------------------------------------_case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 4.951997 2.131085 3.72 0.000 2.130428 11.51049 ------------------------------------------------------------------------------ Notice we used conditional logistic regression to obtain the odds ratio. Given our matching of time-at-risk (from risk-set sampling), we had to obtain an odds ratio using a matched sample approach since matched studies require a matched analysis (Rothman and Greenland, 1998, p. 98; Greenland and Thomas, 1982). Chapter 3-14 (revision 16 May 2010) p. 27 Look at the data in Stata after running this. Notice Stata created three new variables, called _case, _set, and _time. Interpret these. list tumor nickel timerisk _case _set _time in 1/9 , sep(3) 1. 2. 3. 4. 5. 6. 7. 8. 9. +--------------------------------------------------------------------+ | tumor nickel timerisk _case _set _time | |--------------------------------------------------------------------| | 0. no tumor 1. exposure 35.6823 0 1 .70149994 | | 0. no tumor 1. exposure 16.3425 0 1 .70149994 | | 1. tumor 1. exposure .7014999 1 1 .70149994 | |--------------------------------------------------------------------| | 0. no tumor 1. exposure 9.000103 0 2 1.1797981 | | 0. no tumor 1. exposure 6.370903 0 2 1.1797981 | | 1. tumor 1. exposure 1.179798 1 2 1.1797981 | |--------------------------------------------------------------------| | 0. no tumor 1. exposure 19.6768 0 3 1.4220009 | | 0. no tumor 0. no exposure 41.7836 0 3 1.4220009 | | 1. tumor 0. no exposure 1.422001 1 3 1.4220009 | +--------------------------------------------------------------------+ Notice that for each risk set, the follow-up time of the matched controls was greater than or equal to the follow-up time of the case. Chapter 3-14 (revision 16 May 2010) p. 28 To demonstrate that the OR using this study design provides an unbiased estimate of the HR, we next obtain the long-run average. * --- density case-control study (long-run average)---clear set obs 1 gen or=. save or_density, replace // create a file with 1 missing observation * set more off // turn off scrolling prompt when display iteration number set seed 999 forvalues i=1(1)1000{ quietly use nickelrefinary, clear quietly stset timerisk , failure(tumor==1) quietly sttocc, number(2) quietly clogit _case nickel, group(_set) or quietly gen or = exp(_b[nickel]) in 1/1 // convert coefficient to OR quietly keep or quietly keep in 1/1 quietly append using or_density quietly save or_density, replace display `i' // display iteration number } set more on use or_density, clear *histogram or sum or We then do the same thing for the augmented dataset. The simulation results are: Density Case-Control Design Almost Rare Disease (3% unexposed, 12% exposed) Relative Population Simulation Long Effect Measures Run Average Measure OR OR 3.76 RR 3.43 IRR 4.76 HR 5.02 5.42 Frequent Disease (15% unexposed, 60% exposed) Population Simulation Long Measures Run Average OR 3.76 2.65 3.87 4.19 4.43 We see that the OR is a biased estimate of the HR, and so the conditional logistic regression model should not be used for the analysis. It is close though. Another approach is taught below. Chapter 3-14 (revision 16 May 2010) p. 29 Terminology Inconsistencies Not all authors use the study design names consistently. For example, A) Rothman (2002, p.84-86) uses the term case-cohort study design to refer to the situation when follow-up is assumed equal for all subjects, or just simply ignored, so controls are simply selected from all subjects at risk (from the total row of the 2 × 2 table). Prentice (1986, p.2) calls this study design a case-cohort design: binary response. B) Rothman (2002, pp. 76-80) uses the term density case-control study design to refer to the situation when follow-up is not equal for all subjects, so risk-set sampling is used to select controls from all subjects at risk with equal or longer follow-up times (from the appropriate row of a life table). Prentice (1986, p. 4) calls this study design a case-cohort design: time to response data. Some Methods Papers Jewell (2004, pp.51-53) presents risk-set sampling (density case-control study design) as a way to use a case-control study to obtain an estimate of the hazard ratio (HR). Rothman (2002, pp.76-80) presents the density case-control study design as a way to obtain an estimate of the incidence rate ratio (IRR). Although he does not say so, Rothman is apparently making the assumption that risk is constant across time. Under that assumption, HR = IRR. Prentice RL. (1986). A case-cohort design for epidemiologic cohort studies and diease prevention trials. Biometrika 73:1-11. King G, Zeng L. (2002). Estimating risk and rate levels, ratios and differences in case-control studies. Statist Med 21:1409-1427. Volovics A, van den Brandt PA. (1997). Methods for the analysis of case-cohort studies. Biom J 39(2):195-214. Chapter 3-14 (revision 16 May 2010) p. 30 Some Studies That Used the Case-Cohort Approach 1) Rossing MA, Daling JR, Weiss NS, Moore DE, Self SG. (1996). Risk of breast cancer in a cohort of infertile women. Gynecol Oncol 60(1):3-7. Abstract The purpose of this study was to assess: (1) the risk of breast cancer associated with use of ovulation-inducing agents (such as clomiphene citrate) as treatment for infertility; and (2) the risk associated with ovulatory abnormalities that result in infertility. We performed a case-cohort study among 3837 women evaluated for infertility at clinics in Seattle, Washington, at some time during 1974–1985. Computer linkage with a population-based tumor registry was used to identify women diagnosed with breast cancer before January 1, 1992. Data regarding infertility testing and treatment were abstracted from the infertility clinic medical records for women who developed breast cancer and a randomly selected subcohort. Twenty-seven women in the cohort developedin situor invasive breast cancer, in comparison with an expected number of 28.8 cases (standardized incidence ratio, 0.9; 95% confidence interval (CI), 0.6–1.4). Infertile women with evidence of an ovulatory abnormality were at a risk of breast cancer similar to that of women whose infertility was believed to be due to other causes. The risk among women who had taken clomiphene was reduced relative to infertile women who had not used this drug (adjusted relative risk, 0.5; 95% CI, 0.2–1.2), but the reduction in risk did not increase with duration of use. The possibility that use of clomiphene as treatment for infertility lowers the risk of breast cancer should be examined in other, larger studies. Notice this study has a long and unequal follow-up, so the density case-control design is wellsuited for this study. 2) Savitz DA, Cai J, van Wijngaarden E, et al (2000). Case-cohort analysis of brain cancer and leukemia in electric utility workers using a refined magnetic field job-exposure matrix. American Journal of Industrial Medicine 38:417-425. 3) Voorrips LE, Goldbohm RA, Brants HA, et al. (2000). A prospective cohort study on antioxidant and folate intake and male lung cancer risk. Cancer Epidemiol Biomarkers Prev 9:357-65. In the 3rd paragraph of their Data Analysis section, they state they did something special to adjust the variance estimates, using a software routine they developed: “Because standard software was not available for case-cohort analysis, specific macros were developed to account for the additional variance introduced by sampling from the cohort instead of using the entire cohort (29).” Chapter 3-14 (revision 16 May 2010) p. 31 Something similar is available in Stata, but you must update your Stata to get it. First use the help facility to search on “case cohort”. Then click on the sbe41 link when you see this: STB-59 sbe41 . . . . . . . . . . . . Ordinary case-cohort design and analysis (help stcascoh, stselpre if installed) . . . . . . . . . V. Coviello 1/01 pp.12--18; STB Reprints Vol 10, pp.121--129 selects a sample from a cohort, prepares the dataset for analysis using a Cox regression model, and computes the Self-Prentice variance estimator of the parameters This approach in Stata fits a Cox regression model to the data with an appropriate variance estimate, so the p values and confidence intervals are correct. What Researchers Actually Use Usually when sampling is from a larger cohort, follow-up times are available. Rather than using the risk set sampling and conditional regression approached described above, researchers instead using Cox regression model with a special variance estimator (at least three such estimators have been proposed). All of the example papers presented in this chapter followed this suitably adapted Cox regression analysis approach. After updating Stata to get the stcascho and stselpre commands, use nickelrefinary, clear stset timerisk , failure(tumor==1) id(caseid) stcox nickel // full cohort stcascoh, alpha(.2) seed(999) // sample 20% of the controls stselpre nickel The results are: Chapter 3-14 (revision 16 May 2010) p. 32 1) full cohort Cox regression -- Breslow method for ties No. of subjects = 676 Number of obs = 679 -----------------------------------------------------------------------------_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 5.020635 1.765523 4.59 0.000 2.520176 10.00199 ------------------------------------------------------------------------------ 2) sampled cohort . stcascoh, alpha(.2) seed(999) failure _d: analysis time _t: id: // .2 or 20% of the cohort tumor == 1 timerisk caseid Total sample = 174 -----------------------------------------------------------------------------191 total obs. 0 exclusions -----------------------------------------------------------------------------191 obs. remaining, representing 174 subjects 56 failures in single failure-per-subject data Self Prentice Variance Estimate for Case-Cohort Design Self Prentice Scheme -----------------------------------------------------------------------------| Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 4.652655 1.846243 3.87 0.000 2.137624 10.12676 -----------------------------------------------------------------------------Prentice Scheme -----------------------------------------------------------------------------| Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------nickel | 4.587959 1.82057 3.84 0.000 2.1079 9.98594 ------------------------------------------------------------------------------ Saving the Case-Cohort Sample At this point, only the sample of N=191 subjects are in the data editor. Be sure to save this file if you want to do something like a chart review to collect further predictor variables on this casecohort sample. Chapter 3-14 (revision 16 May 2010) p. 33 Using simulation to check the unbiasedness of this approach, with the Prentice Scheme clear set obs 1 gen or=. save hr_simulation, replace // create a file with 1 missing observation * set more off // turn off scrolling prompt * set seed 999 // doesn't work outside of shcascoh command forvalues i=1(1)1000{ quietly use nickelrefinary, clear quietly stset timerisk , failure(tumor==1) id(caseid) quietly stcascoh, alpha(.1798) // sample 18% (n=112) of controls quietly stselpre nickel // fit model quietly matrix A=e(b) quietly svmat A // creates variables from matrix columns quietly gen hr = exp(A1) in 1/1 // convert coefficient to OR quietly keep hr quietly keep in 1/1 quietly append using hr_simulation quietly save hr_simulation, replace display `i' // display iteration number } set more on use hr_simulation, clear sum hr we get . sum hr Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------hr | 1000 5.051037 1.065216 2.663914 9.633725 so the long-run HR=5.05, compared to the population HR=5.02, which is an unbiased estimate. In contrast, the risk-set sampling, followed by conditional logistic regression, which was demonstrated above and gave a long-run average HR=5.42, produces a biased estimate and so should not be used. Chapter 3-14 (revision 16 May 2010) p. 34 Sample Size Determination We have seen that subjects can be included as both cases and controls in the case-cohort approach. This overlap requires that the sample size be inflated to allow for this. Rothman, Greenland, and Lash (2008) comment, “Case-cohort designs have other advantages as well as disadvantags relative to alternative case-conrol designs (Wacholder, 991). One disadvantage is that, because of the overlap of membership in the case and control groups (controls who are sleeced may also develop disease and enter the study as cases), one will need to select more controls in a casecohort study than in an ordinary case-control study with the same number of cases, if one is to achieve the same amount of statistical precision. Extra controls are needed because the stratistical precision of a study is strongly determined by the numbers of distinct cases and noncases. Thus, if 20% of the source cohort members will become cases, and all cases will be included in the study, one will have to select 1.25 times as many controls as cases in a case-cohort study to ensure that there wil be as many controls who never become cases in the study. On average, only 80% of the controls in such a situation will remain noncases; the other 20% will become cases. Of course, if the disease is uncommon, the number of extra controls needed for a case-cohort study will be small.” -----Wacholder S. (1991). Practical considerations in choosing beteen the case-cohort and nested case-control design Epidemiology 2:155-158. The “1.25” comes from: 80%, or 4/5 of the controls are “controls only”. To get this sample size back up to 100% controls only, with equals number of cases, you (5/4)(4/5) = 1, where 5/4 = 1.25. Chapter 3-14 (revision 16 May 2010) p. 35 References Breslow NE, Day NE. (1987). Statistical Methods in Cancer Research, Vol II: The Design and Analysis of Cohort Studies, Lyon, France, IARC. Cai J, Zeng D. (2004). Sample size/power calculation for case-cohort studies. Biometrics 60:1015-1024. Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge UK, Cambridge University Press. Greenland S, Thomas DC. (1982). On the need for the rare disease assumption in case-control studies. Am J Epidemiol 116(3):547-553. with erratum in Am J Epidemiol 1990;131(6):1102. King G, Zeng L. (2002). Estimating risk and rate levels, ratios and differences in case-control studies. Statist Med 21:1409-1427. Jewell NP. (2004). Statistics for Epidemiology. New York, Chapman & Hall/CRC. National Heart, Lung, and Blood Institute. (1998). Clinical guidelines for the identification, evaluation, and treatment of overweight and obesity in adults: the evidence report. Bethesda, MD, National Heart, Lung, and Blood Institute. Onyike CU, Crum RM, Lee HB, Lyketsos CG, Eaton WW. (2003). Is obesity associated with major depression? Results from the third national health and nutrition examination survey. Am J Epidemiol 158(12):1139-1153. Prentice RL. (1986). A case-cohort design for epidemiologic cohort studies and diease prevention trials. Biometrika 73:1-11. Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University Press. Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA. Rothman KJ, Greenland S, Lash TL. (2008). Case-control studies. In Rothman KJ, Greenland S, Lash TL, Modern Epidemiology, Philadelphia, Lippincott Williams & Wilkins, 2008, pp.111-127. Savitz DA, Cai J, van Wijngaarden E, et al (2000). Case-cohort analysis of brain cancer and leukemia in electric utility workers using a refined magnetic field job-exposure matrix. American Journal of Industrial Medicine 38:417-425. Volovics A, van den Brandt PA. (1997). Methods for the analysis of case-cohort studies. Biom J 39(2):195-214. Voorrips LE, Goldbohm RA, Brants HA, et al. (2000). A prospective cohort study on antioxidant Chapter 3-14 (revision 16 May 2010) p. 36 and folate intake and male lung cancer risk. Cancer Epidemiol Biomarkers Prev 9:357-65. Chapter 3-14 (revision 16 May 2010) p. 37