Chapter 3-6. Study Designs

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010. Chapter 3-6 (revised 16 May 2010).

There are three primary study designs in epidemiologic research:

    cohort study (which includes clinical trials)
    case-control study
    cross-sectional study

Cohort Study (Rothman, 2002, p.57)

In epidemiology, a cohort is defined as a designated group of individuals who are followed over a period of time.

A cohort study involves measuring the occurrence of disease within one or more cohorts. Typically, a cohort comprises persons with a common characteristic, such as exposure to a risk factor; we refer to those with and without the exposure as the exposed and unexposed cohorts.

When measurements are taken on the subjects at multiple time points, the cohort study is often called a longitudinal study (Rothman and Greenland, 1998, p.422). The repeated measurements are usually for both exposures and outcomes.

Diagrammatically,

Cohort Study

    E  --- follow-up --->  D or D̄
    Ē  --- follow-up --->  D or D̄

where E = exposed cohort, Ē = unexposed cohort
      D = disease occurred, D̄ = disease not occurred

Data layout, Cohort Study

Data Layout for Incidence Proportions (Stata’s csi command)

                     Exposed   Not Exposed
    Disease             a           b
    Not Disease         c           d
    Total               N1          N0

or

Data Layout for Incidence Rates (Stata’s iri command)

                     Exposed   Not Exposed
    Disease Cases       a           b
    Person-Time         N1          N0

or

Life Table (survival analysis)

Experiments (Rothman, 2002, p.60)

Experiments are cohort studies, although not all cohort studies are experiments. In epidemiology, an experiment is a study in which the incidence rate or the risk of disease in two or more cohorts is compared, after assigning the exposure to the people who comprise the cohorts.

Diagrammatically,

Experiment

    E  --- follow-up --->  D or D̄
    Ē  --- follow-up --->  D or D̄
    (E and Ē assigned by investigator)

The data layouts are similar to the general cohort study described above.
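As a hedged illustration of the two cohort data layouts, the sketch below passes cell counts directly to Stata's immediate commands csi and iri. The csi counts (54, 17, 333, 205) are the smoking and CHD counts that appear in the evans.dta exercise later in this chapter; the person-time totals given to iri are invented for illustration only and do not come from any study.

```stata
* Incidence proportions: csi a b c d
* (cases exposed, cases unexposed, noncases exposed, noncases unexposed)
csi 54 17 333 205

* Incidence rates: iri a b N1 N0
* (cases exposed, cases unexposed, person-time exposed, person-time unexposed)
* NOTE: the person-time values 2700 and 1600 are hypothetical
iri 54 17 2700 1600
```

The first command reports risks, the risk difference, and the risk ratio; the second reports incidence rates, the rate difference, and the rate ratio.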
In an experiment, the reason for the exposure assignment is solely to suit the objectives of the study. If people receive their exposure assignment based on considerations other than the study protocol, it is not a true experiment.

Clinical Trials (Rothman, 2002, p.60)

Epidemiologic experiments are most frequently conducted in a clinical setting, with the aim of evaluating which treatment for a disease is better. Such studies are known as clinical trials. (The word trial is used as a synonym for experiment.)

All study subjects have been diagnosed with a specific disease, but that disease is not the disease event that is being studied. Rather, it is some consequence of that disease, such as death or spread of a cancer, that becomes the “disease” event studied in a clinical trial.

The aim of a clinical trial is to evaluate the incidence rate of disease complications (or improvements) in the cohorts assigned to the different treatment groups.

In most trials, treatments are assigned by randomization, using random number assignment. Randomization tends to produce comparability between the cohorts with respect to factors that might affect the rate of complications (it helps to satisfy the “otherwise comparable” assumption).

Diagrammatically,

Clinical Trial

    D  --- assigned by investigator --->  E  --- follow-up --->  C or C̄
                                          Ē  --- follow-up --->  C or C̄

where C = complication or no improvement, C̄ = no complication or improvement

Prospective Cohort Studies (Rothman, 2002, p.70)

A prospective cohort study is one in which the exposure information is recorded at the beginning of the follow-up and the period of time at risk for disease occurs during the conduct of the study. This is always the case with experiments and with many non-experimental cohort studies.
Retrospective Cohort Studies (Rothman, 2002, p.70)

In a retrospective cohort study (also known as a historical cohort study), the cohorts are identified from recorded information and the time during which they are at risk for disease (follow-up period) occurred before the beginning of the study.

Because a retrospective cohort study must rely on existing records, important information may be missing or otherwise unavailable. Nevertheless, when a retrospective cohort study is feasible, it offers the advantage of providing information that is usually much less costly than that from a prospective cohort study, and it may produce results much sooner because there is no need to wait for the disease to occur.

Case-Control Studies

In a cohort study, we:

    1) Define the exposed and unexposed cohorts.
    2) Determine the number of people in these cohorts. These N’s are used as the denominators for the risk, or as part of the person-time for incidence rates.
    3) Observe the number of cases occurring in each cohort during the follow-up period.
    4) Calculate the risk (cases/N) or incidence rate (cases/PT) for each cohort.

Cohort Study

    E  --- follow-up --->  D or D̄
    Ē  --- follow-up --->  D or D̄

where E = exposed cohort, Ē = unexposed cohort
      D = disease occurred, D̄ = disease not occurred

With cohort studies, you have to follow a large sample of subjects in order to measure the risk or rate of disease. This is because you start with healthy subjects, and only a small number of them develop the disease. Because of this, we generally measure the entire cohort. With just a sample, perhaps your study will end with no disease cases at all.

To avoid the need for a large study, with the cost and effort of a follow-up period, the case-control study design is frequently used. The case-control study only requires a relatively small sample, without the need for a follow-up period. If properly carried out, a case-control study mirrors what would be learned in a cohort study.
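To make the sample-size concern for cohort studies concrete, here is a small sketch of the arithmetic. It assumes a hypothetical disease risk of 1% over the follow-up period and independent subjects; both the risk and the sample size of 100 are illustrative assumptions, not values from the text.

```stata
* Hypothetical values chosen for illustration only:
* assumed risk of disease during follow-up, and number of subjects followed
local risk = 0.01
local n    = 100

* Expected number of cases in the sample
display "expected cases = " `n' * `risk'

* Probability that the sample ends with no cases at all
display "P(no cases)    = " (1 - `risk')^`n'
```

With these assumed values the expected count is a single case, and the chance of observing no cases at all is roughly 0.37, which is why a small cohort sample is often not workable for an uncommon disease.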
In a case-control study, we:

    1) Identify the cases that have occurred.
    2) Select a control group of non-cases.
    3) For both the case and control groups, retrospectively determine whether each subject was exposed or unexposed.
    4) Estimate the risk ratio or incidence rate ratio from the odds ratio (but we cannot estimate the incidence proportions or incidence rates themselves; see box below).

Case-Control Study

    E or Ē  <---  D
    E or Ē  <---  D̄

where E = exposure, Ē = no exposure
      D = disease cases, D̄ = non-case controls

Case-Control Study

Data Layout for case-control study

                              Exposed   Not Exposed   row totals
    Disease (cases)              a           b         Ncases
    Not Disease (controls)       c           d         Ncontrols

where the sample sizes are the row totals.

In contrast,

Cohort Study

Data Layout for cohort study

                              Exposed   Not Exposed
    Disease (cases)              a           b
    Not Disease (noncases)       c           d
    column totals               Nexp      Nnot exp

where the sample sizes are the column totals.

Prospective and Retrospective Case-Control Studies

Case-control studies, like cohort studies, can also be either prospective or retrospective. In a retrospective case-control study, cases have already occurred when the study begins; there is no waiting for new cases to occur. In a prospective case-control study, the investigator must wait, just as in a prospective cohort study, for new cases to occur.

Cross-Sectional Studies

A cross-sectional study simply collects a snapshot of the exposure-disease prevalence relationship. The only total that is fixed is the total sample size. The row and column totals are random. There is no fixed exposure cohort and no fixed number of cases. A survey sample is a cross-sectional study.
Cross-Sectional Study

Data Layout for cross-sectional study

                              Exposed   Not Exposed   row totals
    Disease (cases)              a           b         ndis
    Not Disease (noncases)       c           d         nnot dis
    column totals               nexp      nnot exp

where the sample sizes, both row and column totals, are shown in lowercase to denote they are observed, or random, rather than fixed by the study.

For this study design, the prevalence is estimated using the incidence proportion formula,

    prevalence among exposed     = a / nexp
    prevalence among not exposed = b / nnot exp

Why incidence proportions and incidence rates cannot be estimated with a case-control study

In the cohort study data layout

Data Layout for Cohort Study

                       Exposure
    Disease      Exposed   Unexposed   Totals*
    cases           a          b         n1
    noncases        c          d         n0
    Totals*         N1         N0

    *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase ns are observed.

we know the number at risk (the column totals, or sample sizes), so we can estimate the incidence proportion by

    incidence proportion = disease cases / persons at risk

Similarly, we will know the time at risk for these persons, so we can estimate the incidence rate by

    incidence rate = disease cases / person-time

In the case-control study, however,

Data Layout for Case-Control Study

                       Exposure
    Disease      Exposed   Unexposed   Totals*
    cases           a          b         N1
    noncases        c          d         N0
    Totals*         n1         n0

    *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase ns are observed.

we cannot form meaningful incidence proportion and incidence rate estimates because our column totals (n1 and n0) are simply observed. These column totals do not represent the number at risk. Also, the sum a + b, and so the number of cases relative to n1 and n0, is going to be artificially high. For example, suppose that half of the sample are cases, whereas in the population the incidence of disease is only 1%.
There is no possible way to assign the N1 cases into cells a and b for which both the Exposed and Unexposed columns could correctly estimate the population incidence.

The Odds Ratio

In the cohort study data layout, with fixed column totals for denominators,

Data Layout for Cohort Study (shown with 0-1 scores)

                          Exposure
    Disease         Exposed (1)   Unexposed (0)   Totals*
    cases (1)       a (col %)     b (col %)         n1
    noncases (0)    c             d                 n0
    Totals*         N1            N0

    *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase ns are observed.

we can define the risk ratio as the ratio of the two column percents

$$RR = \frac{\text{risk of disease if exposed}}{\text{risk of disease if unexposed}} = \frac{P(D=1 \mid E=1)}{P(D=1 \mid E=0)} = \frac{a/N_1}{b/N_0}$$

using conditional probability notation, P( . | . ), read as the probability of what is on the left of the “|” given what is on the right of the “|”. Thus P(D=1|E=1) represents “the probability of disease given exposure”.

In the case-control study data layout, with fixed row totals for denominators

Data Layout for Case-Control Study (shown with 0-1 scores)

                          Exposure
    Disease         Exposed (1)   Unexposed (0)   Totals*
    cases (1)       a (row %)     b                 N1
    noncases (0)    c (row %)     d                 N0
    Totals*         n1            n0

    *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase ns are observed.

we can form meaningful probabilities of the form P(E|D) and P(E|D̄), which are reversed from the cohort study probabilities of the form P(D|E) and P(D|Ē).

Then, defining “odds” as the ratio of the probability of some event occurring to the probability of it not occurring (such as the odds of heads vs tails on a coin flip), we define the odds ratio (also called the exposure odds ratio) as:

$$OR = \frac{P(E=1 \mid D=1)/P(E=0 \mid D=1)}{P(E=1 \mid D=0)/P(E=0 \mid D=0)} = \frac{(a/N_1)/(b/N_1)}{(c/N_0)/(d/N_0)} = \frac{a/b}{c/d} = \frac{ad}{bc}$$

For the cohort study, where N1 = a + c and N0 = b + d are the column totals, we have

$$OR = \frac{P(D=1 \mid E=1)/P(D=0 \mid E=1)}{P(D=1 \mid E=0)/P(D=0 \mid E=0)} = \frac{(a/N_1)/(c/N_1)}{(b/N_0)/(d/N_0)} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

which in its final form (ad/bc) is identical to the case-control study.

For the cross-sectional study, there are no fixed totals that can be used as denominators to calculate a probability (prevalence is not a probability).

Data Layout for Cross-Sectional Study (shown with 0-1 scores)

                          Exposure
    Disease         Exposed (1)   Unexposed (0)   Totals*
    cases (1)       a             b                 n1
    noncases (0)    c             d                 n0
    Totals*         n1            n0

    *The uppercase Ns (sample sizes) are fixed by the researcher, and the lowercase ns are observed.

In the cross-sectional study, however, we can still define the odds ratio in terms of odds only, and thus require no fixed denominator.

$$OR = \frac{\text{Odds}(D=1 \mid E=1)}{\text{Odds}(D=1 \mid E=0)} = \frac{a/c}{b/d} = \frac{ad}{bc}$$

which in its final form (ad/bc) is identical to the above two study designs.

We see, then, that the odds ratio in its final form (ad/bc) does not depend on the row or column totals for any of these three study designs, only the cell counts. The odds ratio is identically OR = ad/bc for all three study designs.

In the cohort study, with fixed column totals, the column cells, a & c, and b & d, are free to vary in the way they occur in the population. In the case-control study, with fixed row totals, the row cells, a & b, and c & d, are free to vary in the way they occur in the population. In the cross-sectional study, all cells are free to vary in the way they occur in the population. Because of this freedom to vary the cells proportional to the population frequencies, all three study designs can correctly measure an association between exposure and disease.

Rare Disease Assumption

Rule of Thumb (Rosner, 1995, p. 368)

If the data are collected using a case-control study design, and the disease under study is rare (disease incidence < .10), then we can estimate the risk ratio by the odds ratio.
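The rule of thumb can be explored numerically. The sketch below uses Stata's csi command with the or option on two made-up cohort tables, each with a risk ratio of exactly 2: one with risks of 2% and 1%, the other with risks of 40% and 20%. All counts are hypothetical and chosen only so that the arithmetic is easy to follow.

```stata
* Rare outcome: risks 20/1000 = 0.02 and 10/1000 = 0.01, so RR = 2
* OR = (20*990)/(10*980), about 2.02, close to the RR
csi 20 10 980 990, or

* Common outcome: risks 400/1000 = 0.40 and 200/1000 = 0.20, so RR = 2
* OR = (400*800)/(200*600), about 2.67, well above the RR
csi 400 200 600 800, or
```

In the first table both risks are far below .10 and the odds ratio is a close stand-in for the risk ratio; in the second, neither risk is rare and the odds ratio overstates the risk ratio noticeably.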
Woodward (1999, p.113) adds: The estimation is only good if both the exposed and unexposed groups have a rare disease incidence. (In other words, apply the < .10 rule to both groups.)

Exercise: where does the “< .10” rule of thumb come from?

Look at the figure in the Zhang and Yu (1998) article, ZhangJAMA1998CorrectingOR.pdf, which illustrates how, when the incidence of the outcome is low (<10%), the odds ratio is close to the risk ratio.

Note: The method of Zhang and Yu has been convincingly criticized as unreliable as a method to correct the ORs to obtain RRs. A better approach is given in Chapter 10.

Why OR ≈ RR under rare disease assumption (Woodward, 1999, p.113)

For the cohort study, we have

                       Exposure
    Disease      Exposed   Unexposed
    cases           a          b
    noncases        c          d
    Totals         a+c        b+d

$$RR = \frac{a/(a+c)}{b/(b+d)}$$

If the disease is rare, there will be few disease cases, so a and b are small relative to c and d, making c nearly equal to a+c and d nearly equal to b+d. That is,

$$c \approx a + c \quad \text{and} \quad d \approx b + d$$

Substituting,

$$RR = \frac{a/(a+c)}{b/(b+d)} \approx \frac{a/c}{b/d} = \frac{ad}{bc} = OR$$

We see, then, that the OR from the case-control study approximates the RR from the cohort study when the rare disease assumption is met.

Interpreting the Size of Relative Measures of Effect

One of Hill’s “causal criteria” was strength of association. Although Hill did not propose that his criteria were a checklist for evaluating causation, his criteria have frequently been applied that way. Rothman (Chapter 2, p.19) points out that strength depends on the prevalence of other causes and, thus, is not a biologic characteristic. The strength could be the result of confounding as well. Thus, Rothman makes no suggestion for interpreting the size of a risk ratio, rate ratio, or odds ratio.

Clinical researchers, unaware of Rothman’s argument, still desire to interpret the size of relative measures of effect (risk ratio, rate ratio, odds ratio).
This might, in part, be due to the wide acceptance of other rules of thumb for measures of association (e.g., reliability coefficients and correlation coefficients).

For example, the Pearson correlation coefficient has the range [-1 to 0] or [0 to 1], with perfect correlation being -1.0 or 1.0 and 0 being no linear association. A rule of thumb for interpreting the size of the correlation coefficient is found in Hinkle et al (1998, p.120):

Rule of Thumb for Interpreting the Size of a Correlation Coefficient

    Size of Correlation                  Interpretation
    0.90 to 1.00 (-0.90 to -1.00)        Very high correlation
    0.70 to 0.90 (-0.70 to -0.90)        High correlation
    0.50 to 0.70 (-0.50 to -0.70)        Moderate correlation
    0.30 to 0.50 (-0.30 to -0.50)        Low correlation
    0.00 to 0.30 ( 0.00 to -0.30)        Little if any correlation

The odds ratio is likewise considered to be a measure of association. The rate ratio and odds ratio have the range [0 to 1] and [1 to infinity].

Although comparing the strength of association between exposures is theoretically unwarranted, as Monson (1980, p. 94) himself admitted, he still suggested a rule of thumb for interpreting rate ratios. He did this to aid researchers and policy makers in taking action to protect workers from occupational exposures.

Monson’s Rule of Thumb for Interpreting the Rate Ratio

    rate ratio                     strength of association
    0.9 to 1.0 (1.0 to 1.2)        None
    0.7 to 0.9 (1.2 to 1.5)        Weak
    0.4 to 0.7 (1.5 to 3.0)        Moderate
    0.1 to 0.4 (3.0 to 10.0)       Strong
    0.0 to 0.1 (>10.0)             Infinite

It is problematic to apply this rule of thumb to the risk ratio, however. The risk ratio is the ratio of two proportions (RR=p1/p2) and its value is constrained by the denominator proportion. For example, if p2 = 0.5, then the RR can be no larger than 1/0.5 = 2; if p2 = 0.8, then the RR can be no larger than 1/0.8 = 1.25.
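The ceiling on the risk ratio follows from p1 being at most 1. A minimal Stata sketch of the arithmetic, using the two values of p2 from the text and, for contrast, an odds ratio computed at an arbitrarily chosen p1 = 0.99 (an illustrative assumption, not a value from the text):

```stata
* Largest possible risk ratio is 1/p2, reached when p1 = 1
display "max RR when p2 = 0.5: " 1/0.5
display "max RR when p2 = 0.8: " 1/0.8

* The odds ratio has no such ceiling: with p2 = 0.5 and p1 = 0.99
* (p1 chosen only for illustration), the odds ratio is about 99
display "OR when p1 = 0.99, p2 = 0.5: " (0.99/0.01)/(0.5/0.5)
```

As p1 approaches 1 the odds p1/(1 - p1) grow without bound, so the odds ratio can become arbitrarily large even though the risk ratio cannot exceed 1/p2.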
The rate ratio and the odds ratio do not have this restriction, which is one of the reasons why the odds ratio is more popular than the risk ratio (it saves you from having to explain to the reader that your risk ratios are constrained, making your associations seem so weak) (Rosner, 1995, p.365).

A rule of thumb for interpreting the odds ratio, however, cannot be proposed. This is because the value of the odds ratio is so dependent on the underlying disease rate, as shown in the Zhang and Yu (1998) article. Many researchers, however, have adopted the OR=3.0 cutoff as a guideline for judging a “strong” effect, similar to Monson’s rule of thumb for the rate ratio (see Taubes, 1995, for some informal examples of epidemiologists using this cutoff).

Exercise

Look at the Fowler et al (2005) article. Notice that the authors report the odds ratio, instead of the risk ratio, even though they have a cohort study design. This is very common. Logistic regression is the most widely used model for binary outcomes, and it always provides an odds ratio, rather than a risk ratio, even for cohort studies. (When the time to the disease outcome is available, Cox regression is also popularly used in cohort studies.)

Exercise

Look at the Cooper et al (2006) article.

    1) Notice in the Methods section of the abstract that they have a cohort study, and then follow 3 subcohorts composing the study cohort.
    2) In Table 2, they show these three cohorts. In their analysis they compare the ACE inhibitor cohort to the other two cohorts.
    3) For statistical comparisons, they use “modified Poisson regression” with a count (not person-time) denominator, and robust standard error. This is one of the models that is used in place of logistic regression. It permitted them to estimate a risk ratio, instead of an odds ratio, in contrast to what Fowler (2005) did.
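As a rough sketch of the contrast between the two modeling approaches (not the analysis from either article), the commands below fit a logistic regression and a Poisson regression with robust standard errors to the same binary outcome. The sketch assumes that the evans.dta dataset, with variables chd and smk, from the Stata exercise later in this chapter is in the current working directory; the course's own treatment of modified Poisson regression may differ in detail.

```stata
* Assumes evans.dta (used in the exercise later in this chapter)
* is in the current working directory
use evans.dta, clear

* Logistic regression: reports an odds ratio for smk
logistic chd smk

* Poisson regression with robust standard errors ("modified Poisson"):
* the irr option displays exponentiated coefficients, which for a
* binary outcome are interpreted as risk ratios
poisson chd smk, vce(robust) irr
```

With a single binary exposure and no other covariates, these two models should reproduce the odds ratio (ad/bc) and the risk ratio (ratio of the two risks) from the corresponding 2 x 2 table; the confidence intervals will generally differ somewhat from those of the table-based commands.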
Using alternatives to logistic regression that provide risk ratios rather than odds ratios is becoming more popular. How to compute a modified Poisson regression model is covered in Chapter 3-11.

Exercise

Look at the Lee (2006) article, where two study designs are reported in the same paper.

In the cross-sectional study, they report odds ratios and use multivariable logistic regression to compute adjusted odds ratios, adjusted for potential confounders (Table 4).

In the longitudinal study (cohort study), they report incidence proportions (which they downplay), incidence rates, and hazard ratios. The hazard ratios and adjusted hazard ratios come from time-dependent multivariable Cox regression (Table 5).

Exercise: Calculating the odds ratio in Stata

Start the Stata program and read in the data evans.dta, the same dataset used in Chapter 5.

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on evans.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\evans.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use evans.dta, clear

To look at the association between CHD and smoking using a risk ratio, we use

    Statistics
      Epidemiology and related
        Tables for epidemiologists
          Cohort study risk ratio etc.
    Main tab: Case variable: chd
              Exposed variable: smk
    OK

    cs chd smk

                     | smk                    |
                     |   Exposed   Unexposed  |      Total
    -----------------+------------------------+-----------
               Cases |        54          17  |         71
            Noncases |       333         205  |        538
    -----------------+------------------------+-----------
               Total |       387         222  |        609
                     |                        |
                Risk |  .1395349    .0765766  |   .1165846
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
     Risk difference |         .0629583       |    .0138116     .112105
          Risk ratio |         1.822161       |    1.083858    3.063382
     Attr. frac. ex. |         .4512012       |    .0773703    .6735634
     Attr. frac. pop |         .3431671       |
                     +-------------------------------------------------
                                   chi2(1) =     5.43  Pr>chi2 = 0.0198

This dataset is from a cohort study, so the risk ratio is the correct statistic to use.

For illustration, however, we will now compute an odds ratio for the smoking-CHD association.

    Statistics
      Epidemiology and related
        Tables for epidemiologists
          Case-control odds ratio
    Main tab: Case variable: chd
              Exposed variable: smk
    OK

    cc chd smk

                                                             Proportion
                     |   Exposed   Unexposed  |      Total     Exposed
    -----------------+------------------------+------------------------
               Cases |        54          17  |         71       0.7606
            Controls |       333         205  |        538       0.6190
    -----------------+------------------------+------------------------
               Total |       387         222  |        609       0.6355
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         1.955485       |    1.079872    3.695643 (exact)
     Attr. frac. ex. |         .4886179       |    .0739645    .7294111 (exact)
     Attr. frac. pop |         .3716249       |
                     +-------------------------------------------------
                                   chi2(1) =     5.43  Pr>chi2 = 0.0198

Stata calls this command cc to denote case-control study, a logical choice since only the odds ratio can be calculated for that study design (not a risk ratio or a rate ratio).

Notice we have RR = 1.82 and OR = 1.96. The OR is always farther from 1 than the RR (here, larger), the inflation being greater as we move away from the rare disease assumption. Was the rare disease assumption met?

This time, let’s compute the OR using the odds ratio calculator (cci command). The cci command uses the same data layout as the csi command.

Stata data layout for odds ratio (cci command)

                 Exposed   Unexposed
    Cases           a          b
    Noncases        c          d

    Statistics
      Epidemiology and related
        Tables for epidemiologists
          Case-control odds ratio calculator
             54    17
            333   205
    OK

    cci 54 17 333 205

                                                             Proportion
                     |   Exposed   Unexposed  |      Total     Exposed
    -----------------+------------------------+------------------------
               Cases |        54          17  |         71       0.7606
            Controls |       333         205  |        538       0.6190
    -----------------+------------------------+------------------------
               Total |       387         222  |        609       0.6355
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         1.955485       |    1.079872    3.695643 (exact)
     Attr. frac. ex. |         .4886179       |    .0739645    .7294111 (exact)
     Attr. frac. pop |         .3716249       |
                     +-------------------------------------------------
                                   chi2(1) =     5.43  Pr>chi2 = 0.0198

which, of course, provides identical output to what was produced before.

Prevalence Statistics in Cross-Sectional Study

In Chapter 5, it was stated that because prevalence is a mixture of incidence rate and disease duration, it is not as useful for studying the cause of disease (Rothman, 2002, p.42).

Clarifying further, its drawback in studying the cause of disease is that factors that increase prevalence may do so not by increasing the occurrence of disease, but by increasing the duration of the condition (Rothman, 2002, p.43).

Rothman (2002, p.43-44) gives an example, “...a factor associated with the prevalence of ventricular septal defect at birth could be a cause of ventricular septal defect, but it could also be a factor that does not cause the defect but instead enables embryos that develop the defect to survive until birth.”

However, under the assumption of a steady state and equal disease duration in the exposed and unexposed groups, the prevalence odds ratio directly estimates the incidence rate ratio. Pearce (2004, p.1048) provides an eloquent presentation of this relationship.

Exercise. Read the section Measures of effect in prevalence studies in the Pearce article (2004, p.1048).

Exercise.
Look at the Moran article (2006). Pearce (2004) points out that, under the assumption of a steady state and equal disease duration in the exposed and unexposed groups, the prevalence odds ratio directly estimates the incidence rate ratio. As interesting as that is, researchers usually do not take that approach, but merely report the prevalence odds ratios, or “odds ratios”, in the manner of correlation coefficients, without discussing the strength of the effect or implying that they may be estimates of the incidence rate ratio. The Moran article is a good example of this approach.

References

Cooper WO, Hernandez-Diaz S, Arbogast PG, et al. (2006). Major congenital malformations after first-trimester exposure to ACE inhibitors. NEJM 354(23):2443-2451.

Fowler VG, Miro JM, Hoen B, et al. (2005). Staphylococcus aureus endocarditis: a consequence of medical progress. JAMA 293(24):3012-3021.

Hinkle DE, Wiersma W, Jurs SG. (1998). Applied Statistics for the Behavioral Sciences, 4th ed. Boston, Houghton Mifflin Company.

Monson RR. (1980). Occupational Epidemiology. Boca Raton, FL, CRC Press, Inc.

Moran GJ, Krishnadasan A, Gorwitz RJ, et al. (2006). Methicillin-resistant S. aureus infections among patients in the emergency department. NEJM 355(7):666-674.

Pearce N. (2004). Effect measures in prevalence studies. Environmental Health Perspectives 112(10):1047-1050.

Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont, CA, Duxbury Press.

Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University Press.

Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA, Lippincott-Raven Publishers.

Taubes G. (1995). Epidemiology faces its limits. Science 269(July 14):164-169.

Woodward M. (1999). Epidemiology: Study Design and Data Analysis. New York, Chapman & Hall/CRC.

Zhang J, Yu KF. (1998). What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 280(19):1690-1691.