Chapter 5-17. Conditional Logistic Regression--Analyzing Dichotomous Outcomes from Matched Case-Control Study Designs

Matching is a design strategy to eliminate, or control for, confounding. When the data are from a matched case-control study design, conditional logistic regression is commonly used for the analysis. It could also be used in a matched cohort study; however, cohort study data are usually analyzed using a Cox regression or Poisson regression approach, since a time-at-risk variable is available.

Review of Confounding

A confounding variable must have two associations (Rothman, 2002, p.108):

1) A confounder must be associated with the disease (either as a cause or as a proxy for a cause, but not as an effect of the disease).
2) A confounder must be associated with exposure (imbalanced between the exposure groups).

Diagrammatically, the two necessary associations for confounding are:

                   Confounder
     association  /          \  association
                 /            \
          Exposure ---------- Disease
                confounded effect

There is also a third requirement. A factor that is an effect of the exposure and an intermediate step in the causal pathway from exposure to disease will have the above associations, but causal intermediates are not confounders; they are part of the effect that we wish to study. Thus, the third property of a confounder is as follows:

3) A confounder must not be an effect of the exposure.

_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2010.
Chapter 5-17 (revision 16 May 2010)

Motivation for Matching

The motivation for conducting a matched study is to control for confounding variables. Matching removes one of the association arrows in the diagram.

                   Confounder
     association  /          \  association
                 /            \
          Exposure ---------- Disease
                confounded effect

In a cohort study, an exposed subject is matched with one or more unexposed subjects, such as matching on age.
Creating this balance removes the confounder-exposure association. In a case-control study, a diseased subject is matched with one or more non-diseased subjects. Creating this balance removes the confounder-disease association.

It was originally thought that matching enhanced validity, where validity means obtaining the correct risk ratio or odds ratio (Miettinen, 1970).

Later it was shown that the real advantage is greater efficiency in the statistical analysis. That is, matching may lead to a smaller p value and a tighter confidence interval around the risk ratio or odds ratio than would be achieved without matching (Kupper et al, 1981).

Today, it is understood that for both cohort and case-control studies, matching can lead to either a gain or a loss in efficiency (Rothman and Greenland, 1998, p.148; Kupper et al, 1981; Smith and Day, 1981; Thompson et al, 1982; Thomas and Greenland, 1983; Howe and Choi, 1983; Greenland and Morgenstern, 1990).

Matching is clearly useful in certain well-defined circumstances, and clearly not useful in other well-defined circumstances. In between, it is probably not worth the effort.

A well-written chapter on matching can be found in Rothman and Greenland (1998, Chapter 10, "Matching") and in Rothman, Greenland, and Lash (2008, Chapter 11, "Design Strategies to Improve Study Accuracy").

A Matched Design Requires a Matched Analysis

If matching was used, for either a cohort study or a case-control study, one should perform a matched analysis as well. This is particularly important in a matched case-control study, where omitting the matched analysis can introduce a selection bias into the study, leading to invalid OR estimates (Rothman and Greenland, 1998, p.147). This will be illustrated below.

Matched Analysis Using Stratification

Matched sample data can be analyzed using stratification, where the individual strata are each a matched set.
For example, if we had a pair-matched case-control study (1 case matched with 1 control) involving 100 matched pairs (total n=200), the analysis would involve 100 strata, each stratum containing the two subjects forming the matched pair. These data could be analyzed using the Cochran-Mantel-Haenszel chi-square test for association based on the 100 strata. The summary Mantel-Haenszel odds ratio would be computed as the estimate of effect.

Example

We will use the mi2.dta dataset (see box).

    Dataset: mi2.dta (source: Kleinbaum and Klein, 2002, Chapter 8)

    This file is a 1:1 matched case-control study in which n=78 subjects are
    formed into 39 matched strata. Each stratum contains two subjects, one of
    whom is a case diagnosed with myocardial infarction and the other is a
    matched control. Matching was done on age, race, sex, and hospital status.

    Codebook (mi2.dta)

    outcome
        mi      myocardial infarction (1=presence, 0=absence)
    predictors
        smk     smoker (1=current smoker, 0=not current smoker)
        sbp     systolic blood pressure (continuous)
        ecg     electrocardiogram abnormality (1=presence, 0=absence)
    data management
        match   variable indicating subject's matched stratum (range 1 to 39)
        person  subject identifier (unique #, one observation per subject)

Reading the data in,

    File
      Open
        Find the directory where you copied the course CD
        Change to the subdirectory datasets & do-files
        Single click on mi2.dta
        Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\mi2.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use mi2, clear

Listing the first six lines of data with a separator between every two lines,

    Data
      Describe data
        List data
          by/if/in tab: Use a range of observations: From 1 to 6
          Options tab: Separators
            Place separators every N lines: 2
          OK

    list in 1/6 , sep(2)

         +---------------------------------------+
         | match   person   mi   smk   sbp   ecg |
         |---------------------------------------|
      1. |     1        1    1     0   160     1 |
      2. |     1        2    0     0   140     0 |
         |---------------------------------------|
      3. |     2        4    1     0   160     1 |
      4. |     2        5    0     0   140     0 |
         |---------------------------------------|
      5. |     3        7    1     0   160     0 |
      6. |     3        8    0     0   140     0 |
         +---------------------------------------+

Each pair of consecutive lines represents a matched pair (a matched stratum). A matched pair identifier, called "match" in this dataset, is included to identify a unique pair. This variable will be needed by Stata.

We notice there is 1 case (mi = 1) and 1 control (mi = 0) in each matched stratum.

The variable smk is our exposure variable of interest. The two covariates, sbp and ecg, are potential confounding variables.

Looking at the 2 × 2 table for each stratum (each value of match),

    Statistics
      Summaries, tables, & tests
        Tables
          Twoway tables with measures of association
            Main tab: row variable: mi
                      column variable: smk
            by/if/in tab: Repeat command by groups:
                      variables that define groups: match
            OK

    bysort match: tab mi smk

    -> match = 1

               |    smk
            mi |         0 |     Total
    -----------+-----------+----------
             0 |         1 |         1
             1 |         1 |         1
    -----------+-----------+----------
         Total |         2 |         2

    ...

    -> match = 17

               |          smk
            mi |         0          1 |     Total
    -----------+----------------------+----------
             0 |         1          0 |         1
             1 |         0          1 |         1
    -----------+----------------------+----------
         Total |         1          1 |         2

    ...

We find four possible permutations in the data:

                        smk present (1)   smk absent (0)   Total
    mi present (1)             0                 1            1
    mi absent  (0)             0                 1            1
    * have n = 19 of this pattern (matches 1-16, 26-27, 31)

                        smk present (1)   smk absent (0)   Total
    mi present (1)             1                 0            1
    mi absent  (0)             1                 0            1
    * have n = 3 of this pattern (matches 33, 38, 39)

                        smk present (1)   smk absent (0)   Total
    mi present (1)             1                 0            1
    mi absent  (0)             0                 1            1
    * have n = 12 of this pattern (matches 17-25, 32, 34, 35)

                        smk present (1)   smk absent (0)   Total
    mi present (1)             0                 1            1
    mi absent  (0)             1                 0            1
    * have n = 5 of this pattern (matches 28-30, 36, 37)

These tables follow the data layout for stratified case-control studies:

                  Exposed    Unexposed    Total
    Cases          a_i         b_i        M_1i
    Controls       c_i         d_i        M_0i
    Total          N_1i        N_0i       T_i

    where i denotes a specific stratum

To test for an association between exposure and disease (between smoking and MI), we can use the Cochran-Mantel-Haenszel chi-square test (also called the Mantel-Haenszel chi-square test). The formula is:

\[
\chi^2_{CMH} = \frac{\left( \sum_i a_i - \sum_i \frac{N_{1i} M_{1i}}{T_i} \right)^2}{\sum_i \frac{N_{1i} N_{0i} M_{1i} M_{0i}}{T_i^2 (T_i - 1)}}
\]

which is a chi-square statistic with 1 degree of freedom. Rothman's formula (2002, p.162) is the square root of this, as he chose to use the standard normal distribution for computing the p value (from the identity \( \chi^2_{df=1} = z^2 \)).

The CMH chi-square can be viewed as a weighted combination of the stratum-specific chi-square tests.

The Mantel-Haenszel summary odds ratio (also called the pooled odds ratio) is given by (Rothman, 2002, p.156):

\[
OR_{MH} = \frac{\sum_i a_i d_i / T_i}{\sum_i b_i c_i / T_i}
\]

The MH odds ratio is simply a weighted average of all the stratum-specific odds ratios.

Computing the Mantel-Haenszel summary odds ratio and testing it for significance with the Cochran-Mantel-Haenszel chi-square test (the author could not find mhodds in the menus, so only the command is shown),

    mhodds mi smk match

    Mantel-Haenszel estimate of the odds ratio
    Comparing smk==1 vs. smk==0, controlling for match

    note: only 17 of the 39 strata formed in this analysis contribute
          information about the effect of the explanatory variable

        ----------------------------------------------------------------
         Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
        ----------------------------------------------------------------
           2.400000       2.88        0.0896         0.845521   6.812364
        ----------------------------------------------------------------

The note in this output states that only 17 of the 39 strata contributed to the analysis. This comes from the fact that if a stratum's table has a column of zeros (everyone in the stratum was a smoker, or everyone was not), the exposure does not vary within the stratum, so there is no variability to be explained.
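The reason only 17 strata contribute can be seen by working the two formulas by hand. The following is a sketch of that arithmetic, not Stata output. It assumes the pair counts given with the four patterns in the text (19 pairs with neither subject a smoker, 3 with both smokers, 12 with only the case a smoker, 5 with only the control a smoker) and uses the fact that every stratum has M_1i = M_0i = 1 and T_i = 2.

```latex
% Concordant pairs (19 + 3 = 22 strata): N_{1i} = 0 or N_{0i} = 0, so
%   a_i d_i / T_i = b_i c_i / T_i = 0  and the variance term is 0.
% Discordant pairs (12 + 5 = 17 strata): N_{1i} = N_{0i} = 1.

\[
OR_{MH}
  = \frac{\sum_i a_i d_i / T_i}{\sum_i b_i c_i / T_i}
  = \frac{12 \times \tfrac{(1)(1)}{2}}{5 \times \tfrac{(1)(1)}{2}}
  = \frac{12}{5} = 2.4
\]

\[
\sum_i a_i = 3 + 12 = 15,
\qquad
\sum_i \frac{N_{1i} M_{1i}}{T_i}
  = 3 \times \frac{(2)(1)}{2} + 17 \times \frac{(1)(1)}{2} = 11.5
\]

\[
\sum_i \frac{N_{1i} N_{0i} M_{1i} M_{0i}}{T_i^2 (T_i - 1)}
  = 17 \times \frac{(1)(1)(1)(1)}{2^2 (2 - 1)} = 4.25,
\qquad
\chi^2_{CMH} = \frac{(15 - 11.5)^2}{4.25} = \frac{12.25}{4.25} \approx 2.88
\]
```

Both results agree with the mhodds output (odds ratio 2.4, chi-square 2.88), and the 22 concordant pairs drop out of every sum that affects the test statistic or the odds ratio.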
We see that the summary odds ratio is 2.4, with a p value of 0.0896.

We can also get these statistics using

    Statistics
      Observational/Epi. analysis
        Tables for epidemiologists
          Tabulate odds of failure by category
            Main tab: Case exposed variable: mi
                      Control exposed variable: smk
                      Report odds ratios adjusted for variables: match
            OK

    tabodds mi smk ,adjust(match)

    Mantel-Haenszel odds ratios adjusted for match

    ---------------------------------------------------------------------------
             smk |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
    -------------+-------------------------------------------------------------
               0 |    1.000000          .            .             .          .
               1 |    2.400000       2.88       0.0896      0.845521   6.812364
    ---------------------------------------------------------------------------
    Score test for trend of odds:   chi2(1) =   2.88
                                    Pr>chi2 = 0.0896

McNemar Test

The McNemar test can be found in any introductory statistics textbook. It turns out that for a 1:1 matched design, when we do not adjust for covariates beyond the matching variables, the Cochran-Mantel-Haenszel chi-square test is identically the McNemar test.

The McNemar test (also called the McNemar test for the significance of changes) is popularly used to analyze categorical data in a "before and after" design, where each subject is used as its own control (Siegel and Castellan, 1988, p.75).

The data layout for the McNemar test is

    Data Layout for McNemar Test

                        After
                      +       -
    Before    +       A       B
              -       C       D

The McNemar test statistic is

\[
\chi^2 = \frac{(B - C)^2}{B + C}, \quad \text{with } df = 1
\]

We see that it only uses information in the "discordant pairs" (+ on one, - on the other), that is, the cells where the Before and After differ, and ignores the other cells.

This was the case with the Cochran-Mantel-Haenszel chi-square test above as well, where it ignored the permutations for which both the case and control were smokers, or both were non-smokers.

The odds ratio is computed from this data layout as B/C.
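The McNemar statistic and the B/C odds ratio can be checked directly from the four cell counts with Stata's immediate command mcci, without any data in memory. This is a minimal sketch, assuming the pair counts reported for mi2.dta in the text (3 pairs with both subjects smokers, 12 with only the case a smoker, 5 with only the control a smoker, 19 with neither); the display lines simply redo the arithmetic by hand.

```stata
* Pair counts assumed from the four patterns reported for mi2.dta:
*   both exposed = 3, case only = 12, control only = 5, neither = 19
mcci 3 12 5 19

* Hand check using only the discordant pairs (B = 12, C = 5)
display "McNemar chi2 = " (12 - 5)^2 / (12 + 5)
display "Odds ratio   = " 12 / 5
```

With B = 12 and C = 5, the hand calculation gives 49/17 for the chi-square and 12/5 for the odds ratio, which should match what mcci reports.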
Small Expected Frequencies

The chi-square test requires a sufficiently large sample size to provide an accurate p value. The rule of thumb for the McNemar version of the chi-square test is that when (B + C) < 10, the exact form of the test should be used (Siegel and Castellan, 1988, p.79). Since the data are paired, Fisher's exact test is not appropriate, and so the binomial test is used. In Stata, this binomial test is labeled "Exact McNemar".

To compute the McNemar test, Stata expects to find the "before" and "after" values (here, the case and control of each pair) on the same line. We can get Stata to reshape the dataset into this form using the following commands in a do-file:

    * -- re-format for McNemar test
    use mi2, clear
    keep match mi smk                    // easier to follow if limit to needed variables
    list in -10/l                        // look at 10th from last to last
    reshape wide smk , i(match) j(mi)    // side by side, instead of line by line
    list in -5/l                         // look at 5th from last to last

Before the reshape, the data are in "long" format, with cases and controls on separate consecutive lines.

         +------------------+
         | match   mi   smk |
         |------------------|
     69. |    35    1     1 |
     70. |    35    0     0 |
     71. |    36    1     0 |
     72. |    36    0     1 |
     73. |    37    1     0 |
         |------------------|
     74. |    37    0     1 |
     75. |    38    1     1 |
     76. |    38    0     1 |
     77. |    39    1     1 |
     78. |    39    0     1 |
         +------------------+

After the reshape, the data are in "wide" format, with each case and its control on the same line (as if it were the same person), the smk0 variable being the smoking status of the control and the smk1 variable being the smoking status of the case (as if it were pretest and posttest data).

         +---------------------+
         | match   smk0   smk1 |
         |---------------------|
     35. |    35      0      1 |
     36. |    36      1      0 |
     37. |    37      1      0 |
     38. |    38      1      1 |
     39. |    39      1      1 |
         +---------------------+

Calculating the McNemar test on these reformatted data,

    Statistics
      Observational/Epi. analysis
        Tables for epidemiologists
          Matched case-control studies
            Main tab: Exposed case variable: smk1
                      Exposed control variable: smk0
            OK

    mcc smk1 smk0

                     | Controls               |
    Cases            |   Exposed   Unexposed  |      Total
    -----------------+------------------------+-----------
             Exposed |         3          12  |         15
           Unexposed |         5          19  |         24
    -----------------+------------------------+-----------
               Total |         8          31  |         39

    McNemar's chi2(1) =      2.88    Prob > chi2 = 0.0896
    Exact McNemar significance probability       = 0.1435

    Proportion with factor
            Cases       .3846154
            Controls    .2051282     [95% Conf. Interval]
                       ---------     --------------------
            difference  .1794872     -.0455585   .4045329
            ratio       1.875         .8966452   3.920865
            rel. diff.  .2258065     -.003563    .4551759

            odds ratio  2.4           .7870459   8.695981   (exact)

Comparing this with the output from above,

    Mantel-Haenszel estimate of the odds ratio
    Comparing smk==1 vs. smk==0, controlling for match

    note: only 17 of the 39 strata formed in this analysis contribute
          information about the effect of the explanatory variable

        ----------------------------------------------------------------
         Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
        ----------------------------------------------------------------
           2.400000       2.88        0.0896         0.845521   6.812364
        ----------------------------------------------------------------

we see that the odds ratios from the two approaches are identical, the chi-square statistics are identical, and the p values are identical. This verifies that the CMH chi-square test using the matched pairs as the stratification variable is identically the McNemar test, and that the MH summary odds ratio is identically the odds ratio computed from the discordant pairs (the off-diagonal cells) of the McNemar 2 × 2 table.

Conditional Logistic Regression

If the data were not matched, we could test the smoker-MI association using ordinary logistic regression (also called unconditional logistic regression). Since they are matched, we must use the matched version, which is called conditional logistic regression.
Now, let's compute the conditional logistic regression model for comparison.

Bringing the original data back in, which are in long format,

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use mi2, clear

Requesting a conditional logistic regression,

    Statistics
      Categorical outcomes
        Conditional logistic regression
          Model tab: Dependent variable: mi
                     Independent variables: smk
                     Group variable: match
          OK

    clogit mi smk, group(match)

    Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                      LR chi2(1)      =       2.97
                                                      Prob > chi2     =     0.0848
    Log likelihood = -25.547795                       Pseudo R2       =     0.0549

    ------------------------------------------------------------------------------
              mi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             smk |   .8754687   .5322906     1.64   0.100    -.1678018    1.918739
    ------------------------------------------------------------------------------

Unlike the "logistic" command, clogit reports the regression coefficient by default, rather than the odds ratio. This coefficient has to be exponentiated to convert it to an odds ratio, exp(coef) = OR. This is done by specifying the or option. Let's try again.

Requesting a conditional logistic regression with the or display option,

    Statistics
      Categorical outcomes
        Conditional logistic regression
          Model tab: Dependent variable: mi
                     Independent variables: smk
                     Group variable: match
          Reporting tab: Report odds ratio.
          OK

    clogit mi smk, group(match) or

    Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                      LR chi2(1)      =       2.97
                                                      Prob > chi2     =     0.0848
    Log likelihood = -25.547795                       Pseudo R2       =     0.0549

    ------------------------------------------------------------------------------
              mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             smk |        2.4   1.277498     1.64   0.100     .8455214    6.812364
    ------------------------------------------------------------------------------

The odds ratio and confidence interval are identical to those from the Mantel-Haenszel approach, but a different test statistic (the likelihood ratio chi-square) is used, which differs slightly from the CMH chi-square.

Conditional logistic regression is similar to the CMH chi-square in that it stratifies on the matched pairs (or, more generally, matched sets: 1:1 match, 1:2 match, etc.). It then uses an approach called conditional maximum likelihood estimation. The conditional maximum likelihood approach is necessary, since the model includes a large number of dummy variables (one less than the number of matched sets) relative to the sample size. For the example dataset, there are 39-1, or 38, dummy variables included in the model (behind the scenes) in addition to the smk main effect term. The conditional maximum likelihood estimation method is able to handle this without "overfitting" (Kleinbaum and Klein, 2002, pp. 235-238).

Something similar can be done with unconditional logistic regression, which uses "unconditional" maximum likelihood estimation. This is illustrated in the following model, where dummy variables are included to form the matched strata. We will use the "xi:" facility, which creates dummy variables for each variable that is preceded by "i.".

    xi: logistic mi smk i.match    // version 10
    * <or>
    logistic mi smk i.match        // version 11

    Logistic regression                               Number of obs   =         78
                                                      LR chi2(39)     =       5.94
                                                      Prob > chi2     =     1.0000
    Log likelihood = -51.095591                       Pseudo R2       =     0.0549

    ------------------------------------------------------------------------------
              mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             smk |   5.759999    4.33597     2.33   0.020     1.317229    25.18742
                 |
           match |
              2  |          1          2     0.00   1.000     .0198425    50.39681
              3  |          1          2     0.00   1.000     .0198425    50.39681
              4  |          1          2     0.00   1.000     .0198425    50.39681
              5  |          1          2     0.00   1.000     .0198425    50.39681
              6  |          1          2     0.00   1.000     .0198425    50.39681
              7  |          1          2     0.00   1.000     .0198425    50.39681
              8  |          1          2     0.00   1.000     .0198425    50.39681
              9  |          1          2     0.00   1.000     .0198425    50.39681
             10  |          1          2     0.00   1.000     .0198425    50.39681
             11  |          1          2     0.00   1.000     .0198425    50.39681
             12  |          1          2     0.00   1.000     .0198425    50.39681
             13  |          1          2     0.00   1.000     .0198425    50.39681
             14  |          1          2     0.00   1.000     .0198425    50.39681
             15  |          1          2     0.00   1.000     .0198425    50.39681
             16  |          1          2     0.00   1.000     .0198425    50.39681
             17  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             18  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             19  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             20  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             21  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             22  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             23  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             24  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             25  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             26  |          1          2     0.00   1.000     .0198425    50.39681
             27  |          1          2     0.00   1.000     .0198425    50.39681
             28  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             29  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             30  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             31  |          1          2     0.00   1.000     .0198425    50.39681
             32  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             33  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
             34  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             35  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             36  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             37  |   .4166667   .8887804    -0.41   0.681     .0063696     27.2561
             38  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
             39  |   .1736111   .3710028    -0.82   0.413     .0026338    11.44392
    ------------------------------------------------------------------------------

We see that the odds ratio for smk is much larger. In fact, it is exactly the square of the conditional odds ratio when pair matching is used (1:1 matching) (Kleinbaum and Klein, 2002, p.236).

    unconditional OR = (conditional OR)^2

    display 2.4^2

which returns

    5.76

exactly the odds ratio in the unconditional logistic model, where an indicator variable is included for each matched pair, except the one left out as the referent.

The unconditional OR is an overestimate. The conditional OR is the correct result (Kleinbaum and Klein, 2002, p.236). In other words, putting the indicator variables into an ordinary logistic regression is an incorrect analysis (only shown here for illustration).

The advantage of using conditional logistic regression over the McNemar test or the CMH chi-square test is that covariates that are not among the matching variables can be included in the model.

    Statistics
      Categorical outcomes
        Conditional logistic regression
          Model tab: Dependent variable: mi
                     Independent variables: smk sbp ecg
                     Group variable: match
          Reporting tab: Report odds ratio.
          OK

    clogit mi smk sbp ecg, group(match) or

    Conditional (fixed-effects) logistic regression   Number of obs   =         78
                                                      LR chi2(3)      =      12.56
                                                      Prob > chi2     =     0.0057
    Log likelihood = -20.752435                       Pseudo R2       =     0.2323

    ------------------------------------------------------------------------------
              mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             smk |   2.666869    1.76445     1.48   0.138     .7291737    9.753765
             sbp |   1.036602   .0173276     2.15   0.032     1.003191    1.071126
             ecg |   4.369953   3.953285     1.63   0.103     .7420546    25.73461
    ------------------------------------------------------------------------------

R-to-1 matching

We can also use any number of controls for each case, the R-to-1 matches simply being their own strata in the conditional logistic regression model.
It is not even necessary that the same number of controls be used for every matched set (for some it could be 3:1, for others 2:1 or 1:1). This occurs in practice, since it is not always possible to find the full complement of R controls in the same matching category for some cases (Kleinbaum and Klein, 2002, p.231).

A 2:1 match is illustrated with the mi.dta file.

Reading the data in,

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use mi, clear

Listing the first nine lines of data with a separator between every three lines,

    Data
      Describe data
        List data
          by/if/in tab: Use a range of observations: From 1 to 9
          Options tab: Separators
            Place separators every N lines: 3
          OK

    list in 1/9 , sep(3)
    * <or>
    list in 1/9 , sepby(match)    // better if not always have 2 controls

         +---------------------------------------+
         | match   person   mi   smk   sbp   ecg |
         |---------------------------------------|
      1. |     1        1    1     0   160     1 |
      2. |     1        2    0     0   140     0 |
      3. |     1        3    0     0   120     0 |
         |---------------------------------------|
      4. |     2        4    1     0   160     1 |
      5. |     2        5    0     0   140     0 |
      6. |     2        6    0     0   120     0 |
         |---------------------------------------|
      7. |     3        7    1     0   160     0 |
      8. |     3        8    0     0   140     0 |
      9. |     3        9    0     0   120     0 |
         +---------------------------------------+

We cannot analyze these data with the McNemar test (which requires a 1:1 match). We can use either the CMH chi-square or conditional logistic regression; either is appropriate (Kleinbaum and Klein, 2002, p.231).

However, the stratified approach (CMH chi-square) is known not to agree exactly with conditional logistic regression, except for the 1:1 matched design.

    mhodds mi smk match
    clogit mi smk , group(match) or

    Mantel-Haenszel estimate of the odds ratio
    Comparing smk==1 vs. smk==0, controlling for match

    note: only 21 of the 39 strata formed in this analysis contribute
          information about the effect of the explanatory variable

        ----------------------------------------------------------------
         Odds Ratio    chi2(1)        P>chi2        [95% Conf. Interval]
        ----------------------------------------------------------------
           2.200000       3.43        0.0641         0.934342   5.180115
        ----------------------------------------------------------------

    Conditional (fixed-effects) logistic regression   Number of obs   =        117
                                                      LR chi2(1)      =       3.37
                                                      Prob > chi2     =     0.0665
    Log likelihood = -41.162776                       Pseudo R2       =     0.0393

    ------------------------------------------------------------------------------
              mi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             smk |   2.324173   1.083284     1.81   0.070     .9322409    5.794406
    ------------------------------------------------------------------------------

Consistent with what is known from statistical theory, we no longer get identical odds ratios between the Mantel-Haenszel approach and conditional logistic regression for this 2:1 matched design. That is not a problem; it is correct to use either approach.

How a Nonmatched Analysis in a Matched Case-Control Study Can Introduce Selection Bias

We started out with Rothman and Greenland's claim (1998, p.147):

    If matching was used, for either a cohort study or case-control study, one
    should perform a matched analysis as well. This is particularly important
    in a matched case-control study, where if not used, a selection bias can be
    introduced into the study, leading to invalid OR estimates.

We will now see how this occurs with Rothman and Greenland's (1998, p.152) hypothetical data example.

This example shows how a crude analysis provides a biased estimate of effect in a case-control study when the matching variable is correlated with the exposure. Being correlated, the matching variable is basically a surrogate for exposure.
When we match, we make the cases and controls look similar on the matching variable, which simultaneously makes the cases and controls look similar on the exposure variable (since we matched on a surrogate for the exposure variable).

In the population, the data look like:

                           Males                      Females
                    Exposed    Unexposed       Exposed    Unexposed
    Disease  Yes        450           10            50           90
             No      89,550        9,990         9,950       89,910
    N                90,000       10,000        10,000       90,000
                         RR = 5                     RR = 5

The population crude risk ratio is:

    RR = (Disease/Exposed) / (Disease/Unexposed)
       = [(450+50)/(90,000+10,000)] / [(10+90)/(10,000+90,000)]
       = 5

Reading the data in,

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use RothmanP152, clear

and listing,

    list, abbrev(15)
    count

         +----------------------------------------+
         | female   disease   exposed   cellcount |
         |----------------------------------------|
      1. |      0         1         1         450 |
      2. |      0         1         0          10 |
      3. |      0         0         1       89550 |
      4. |      0         0         0        9990 |
      5. |      1         1         1          50 |
         |----------------------------------------|
      6. |      1         1         0          90 |
      7. |      1         0         1        9950 |
      8. |      1         0         0       89910 |
         +----------------------------------------+

    . count
        8

We see from the count command that there are 8 observations. The variable cellcount is the cell count from the Rothman and Greenland table shown above.

Expanding to create one observation per subject,

    expand cellcount
    drop cellcount
    count

    . expand cellcount
    (199992 observations created)

    . count
    200000

The count command informs us that there are now 200,000 observations in the data editor.
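If the RothmanP152 file is not at hand, an equivalent dataset can be typed in directly. The following is a sketch, assuming the eight rows and variable names shown in the listing; it is not the author's file, only a reconstruction from the printed cell counts.

```stata
* Rebuild the hypothetical population from the eight printed cell counts
clear
input female disease exposed cellcount
0 1 1 450
0 1 0 10
0 0 1 89550
0 0 0 9990
1 1 1 50
1 1 0 90
1 0 1 9950
1 0 0 89910
end

* One observation per subject, as in the text
expand cellcount
drop cellcount
count
```

The count at the end should report 200,000 observations if the rows were entered as printed.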
Verifying the data were entered correctly,

    bysort female: tab disease exposed

    -> female = 0

               |        exposed
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |     9,990     89,550 |    99,540
             1 |        10        450 |       460
    -----------+----------------------+----------
         Total |    10,000     90,000 |   100,000

    -> female = 1

               |        exposed
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |    89,910      9,950 |    99,860
             1 |        90         50 |       140
    -----------+----------------------+----------
         Total |    90,000     10,000 |   100,000

which agrees with the table from the text.

Verifying that the stratum-specific RRs are 5, and that the crude RR is 5 as well,

    cs disease exposed , by(female)

               female |       RR       [95% Conf. Interval]   M-H Weight
    -----------------+-------------------------------------------------
                    0 |        5      2.672822    9.353411           9
                    1 |        5      3.540793    7.060566           9
    -----------------+-------------------------------------------------
                Crude |        5      4.034626    6.196361
         M-H combined |        5      3.496972    7.149042
    -------------------------------------------------------------------
    Test of homogeneity (M-H)    chi2(1) =    0.000   Pr>chi2 = 1.0000

We will be fitting logistic regression models to these data, which compute odds ratios (ORs) rather than relative risks (RRs). To see what these are,

    cc disease exposed , by(female)

               female |       OR       [95% Conf. Interval]   M-H Weight
    -----------------+-------------------------------------------------
                    0 | 5.020101      2.703308    10.54258       8.955 (exact)
                    1 | 5.020101       3.47729    7.176783       8.955 (exact)
    -----------------+-------------------------------------------------
                Crude | 5.020101      4.041642    6.285267             (exact)
         M-H combined | 5.020101      3.508934    7.182071
    -------------------------------------------------------------------
    Test of homogeneity (M-H)    chi2(1) =     0.00   Pr>chi2 = 1.0000

    Test that combined OR = 1:
                    Mantel-Haenszel chi2(1) =     96.37
                                    Pr>chi2 =    0.0000

Fitting a logistic regression model to these data, to see the stratum-specific estimates,

    logistic disease exposed female

    Logistic regression                               Number of obs   =     200000
                                                      LR chi2(2)      =     291.91
                                                      Prob > chi2     =     0.0000
    Log likelihood = -3938.6321                       Pseudo R2       =     0.0357

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   5.020101   .7764691    10.43   0.000      3.70728    6.797817
          female |          1   .1363785    -0.00   1.000     .7654457    1.306428
    ------------------------------------------------------------------------------

Obtaining the crude estimates (not stratified by gender),

    logistic disease exposed

    Logistic regression                               Number of obs   =     200000
                                                      LR chi2(1)      =     291.91
                                                      Prob > chi2     =     0.0000
    Log likelihood = -3938.6321                       Pseudo R2       =     0.0357

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   5.020101   .5503828    14.72   0.000     4.049396    6.223498
    ------------------------------------------------------------------------------

Let's let these data be the population we will sample from for this illustration.
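Before sampling, the population odds ratio of 5.02 reported by cc and logistic can be confirmed by hand as a cross-product ratio. This is a short worked check, assuming the cell counts in the population table; the crude cells are obtained by summing the male and female tables.

```latex
\[
\begin{aligned}
OR_{\text{males}}   &= \frac{450 \times 9{,}990}{10 \times 89{,}550}
                     = \frac{4{,}495{,}500}{895{,}500} \approx 5.0201 \\[4pt]
OR_{\text{females}} &= \frac{50 \times 89{,}910}{90 \times 9{,}950}
                     = \frac{4{,}495{,}500}{895{,}500} \approx 5.0201 \\[4pt]
OR_{\text{crude}}   &= \frac{500 \times 99{,}900}{100 \times 99{,}500}
                     = \frac{49{,}950{,}000}{9{,}950{,}000} \approx 5.0201
\end{aligned}
\]
```

The odds ratio is only slightly larger than the risk ratio of 5 because the disease is rare in every cell of this population, so the odds and the risk nearly coincide.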
                          Males                    Females
                    Exposed   Unexposed      Exposed   Unexposed
    Disease  Yes        450          10           50          90
             No      89,550       9,990        9,950      89,910
    N                90,000      10,000       10,000      90,000
                          RR = 5                   RR = 5

Now, we will conduct a case-control study, where we do not match.

Displaying the “population” 2 × 2 table,

    tab disease exposed

               |        exposed
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |    99,900     99,500 |   199,400
             1 |       100        500 |       600
    -----------+----------------------+----------
         Total |   100,000    100,000 |   200,000

We will randomly sample 600 controls, which equals the number of available cases (600).

    set seed 777                        // so can replicate sample
    sample 600 if disease==0, count

Seeing what our sample now looks like,

    tab disease exposed

               |        exposed
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |       321        279 |       600   <- a sample of first row above
             1 |       100        500 |       600   <- same as second row above
    -----------+----------------------+----------
         Total |       421        779 |     1,200

Looking at the odds ratios,

    cc disease exposed , by(female)

              female |       OR       [95% Conf. Interval]   M-H Weight
    -----------------+-------------------------------------------------
                   0 |  5.60241      2.616844    13.00143    3.364865 (exact)
                   1 |  5.37037      3.127288    9.271097    5.869565 (exact)
    -----------------+-------------------------------------------------
               Crude | 5.752688       4.36561     7.59901             (exact)
        M-H combined | 5.454921      3.585552    8.298908
    -------------------------------------------------------------------
    Test of homogeneity (M-H)      chi2(1) =     0.01  Pr>chi2 = 0.9255

                       Test that combined OR = 1:
                                    Mantel-Haenszel chi2(1) =     73.12
                                                    Pr>chi2 =    0.0000

Or, doing the same thing by fitting logistic regression models to these data,

    logistic disease exposed if female==0   // male stratum
    logistic disease exposed if female==1   // female stratum
    logistic disease exposed                // crude
    logistic disease exposed female         // combined

    . logistic disease exposed if female==0 // male stratum

    Logistic regression                               Number of obs   =        740
    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |    5.60241   2.084942     4.63   0.000     2.701465    11.61851
    ------------------------------------------------------------------------------

    . logistic disease exposed if female==1 // female stratum

    Logistic regression                               Number of obs   =        460
    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |    5.37037   1.399316     6.45   0.000     3.222651    8.949427
    ------------------------------------------------------------------------------

    . logistic disease exposed // crude

    Logistic regression                               Number of obs   =       1200
    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   5.752688   .7866572    12.80   0.000       4.4002     7.52089
    ------------------------------------------------------------------------------

    . logistic disease exposed female // combined

    Logistic regression                               Number of obs   =       1200
    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   5.446019   1.160892     7.95   0.000     3.586198    8.270355
          female |   .9336106   .1920174    -0.33   0.738      .623875    1.397121
    ------------------------------------------------------------------------------

We get results close to the population estimate (OR = 5.02), differing only because of sampling variability.
(It would require Monte Carlo simulation to verify that the long-run average ORs match the population ORs.) That is, when a matched case-control study design is NOT used, the result does not appear to be biased.

Next we will conduct a matched case-control study, matching on gender. Starting with the full dataset,

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use RothmanP152, clear
    expand cellcount
    drop cellcount

First, take a close look at the data. Notice that gender is a close proxy for exposure in this dataset (almost all males are exposed, while almost all females are unexposed).

                          Males                    Females
                    Exposed   Unexposed      Exposed   Unexposed
    Disease  Yes        450          10           50          90
             No      89,550       9,990        9,950      89,910
    N                90,000      10,000       10,000      90,000
                          RR = 5                   RR = 5

This strong association can be revealed with a Pearson correlation coefficient.

    corr

    (obs=200000)

                 |   female  disease  exposed
    -------------+---------------------------
          female |   1.0000
         disease |  -0.0293   1.0000
         exposed |  -0.8000   0.0366   1.0000

We see that the female-exposed association is strong (r = -0.80). When we match on gender, then, we will introduce a selection bias if a nonmatched analysis is performed, but the bias will not exist if a matched analysis is performed (the point of this illustration).

We will conduct a “frequency match”, rather than a one-to-one match, on gender. Seeing what the gender distribution is for cases,

    tab disease female, row

               |        female
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |    99,540     99,860 |   199,400
               |     49.92      50.08 |    100.00
    -----------+----------------------+----------
             1 |       460        140 |       600
               |     76.67      23.33 |    100.00
    -----------+----------------------+----------
         Total |   100,000    100,000 |   200,000
               |     50.00      50.00 |    100.00

We see 140 females and 460 males in the diseased group.
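The control counts for a frequency match do not have to be typed in by hand. The following is a minimal sketch, assuming the expanded RothmanP152 data are in memory, that reads the number of cases in each gender stratum from the stored result r(N) and samples the same number of controls from that stratum. The local macro names ncase_f and ncase_m are hypothetical, chosen only for this illustration.

```stata
* Sketch: count the cases in each gender stratum and store the counts
quietly count if disease==1 & female==1
local ncase_f = r(N)                  // female cases (140 in these data)
quietly count if disease==1 & female==0
local ncase_m = r(N)                  // male cases (460 in these data)

* Sample that many controls within each stratum; observations that do
* not satisfy the if condition (including all cases) are kept untouched
set seed 777                          // so the sample can be replicated
sample `ncase_f' if disease==0 & female==1, count
sample `ncase_m' if disease==0 & female==0, count

tab disease female                    // cases and controls now balanced on gender
```

Computing the counts this way keeps the do-file correct if the case series changes, at the cost of a few extra lines.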
Sampling 140 females and 460 males from the control pool (the non-disease row of the population table),

    set seed 777
    sample 140 if disease==0 & female==1, count
    sample 460 if disease==0 & female==0, count
    tab disease female

               |        female
       disease |         0          1 |     Total
    -----------+----------------------+----------
             0 |       460        140 |       600
             1 |       460        140 |       600
    -----------+----------------------+----------
         Total |       920        280 |     1,200

Performing a nonmatched analysis of these data,

    cc disease exposed
    logistic disease exposed

    . cc disease exposed
                                                             Proportion
                     |   Exposed   Unexposed  |      Total     Exposed
    -----------------+------------------------+------------------------
               Cases |       500         100  |        600       0.8333
            Controls |       436         164  |        600       0.7267
    -----------------+------------------------+------------------------
               Total |       936         264  |       1200       0.7800
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         1.880734       |    1.409024    2.515956 (exact)
     Attr. frac. ex. |         .4682927       |    .2902887    .6025367 (exact)
     Attr. frac. pop |         .3902439       |
                     +-------------------------------------------------
                                   chi2(1) =    19.89  Pr>chi2 = 0.0000

    . logistic disease exposed

    Logistic regression                               Number of obs   =       1200
                                                      LR chi2(1)      =      20.05
                                                      Prob > chi2     =     0.0000
    Log likelihood = -821.75147                       Pseudo R2       =     0.0121

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   1.880734   .2685639     4.42   0.000     1.421602    2.488151
    ------------------------------------------------------------------------------

We see that the crude OR of 1.88 is a very biased estimate of the population crude OR of 5.02.

Next, performing a matched analysis of these data,

    cc disease exposed, by(female)
    logistic disease exposed female

    . cc disease exposed, by(female)

              female |       OR       [95% Conf. Interval]   M-H Weight
    -----------------+-------------------------------------------------
                   0 | 4.403341      2.132179    9.972196    4.554348 (exact)
                   1 | 4.019608      2.105171    7.908589    5.464286 (exact)
    -----------------+-------------------------------------------------
               Crude | 1.880734      1.409024    2.515956             (exact)
        M-H combined | 4.194048      2.639437    6.664316
    -------------------------------------------------------------------
    Test of homogeneity (M-H)      chi2(1) =     0.04  Pr>chi2 = 0.8479

                       Test that combined OR = 1:
                                    Mantel-Haenszel chi2(1) =     41.22
                                                    Pr>chi2 =    0.0000

    . logistic disease exposed female

    Logistic regression                               Number of obs   =       1200
                                                      LR chi2(2)      =      43.41
                                                      Prob > chi2     =     0.0000
    Log likelihood = -810.07334                       Pseudo R2       =     0.0261

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   4.183013   .9848099     6.08   0.000     2.636879    6.635723
          female |   2.833092   .6508466     4.53   0.000     1.805984     4.44434
    ------------------------------------------------------------------------------

Alternatively, performing a matched analysis of these data using conditional logistic regression,

    clogit disease exposed, group(female) or

    Conditional (fixed-effects) logistic regression   Number of obs   =       1200
                                                      LR chi2(1)      =      43.30
                                                      Prob > chi2     =     0.0000
    Log likelihood = -803.44316                       Pseudo R2       =     0.0262

    ------------------------------------------------------------------------------
         disease | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         exposed |   4.168447   .9801334     6.07   0.000     2.629238    6.608739
    ------------------------------------------------------------------------------

The OR of 4.2 is quite close to the population OR of 5.02, differing because of sampling variability. Monte Carlo simulation could verify that the long-run average of the sample ORs is the same as the population OR.
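One way such a Monte Carlo check might be set up is sketched below. It is an illustration under stated assumptions, not part of the original analysis: it assumes RothmanP152.dta is in the working directory, and the 200 replications, the seed, and the names sim, results, lnor_crude, and lnor_adj are all arbitrary choices made for this sketch. Each replication redraws the frequency-matched controls and records the log odds ratio for exposure from the nonmatched model and from the model that includes gender.

```stata
* Sketch only: repeat the frequency-matched sampling many times
use RothmanP152, clear
expand cellcount
drop cellcount

set seed 777                          // arbitrary seed, for replicability
tempname sim
tempfile results
postfile `sim' lnor_crude lnor_adj using `results'

forvalues i = 1/200 {                 // 200 replications (arbitrary)
    preserve
    * redraw controls to match the 140 female and 460 male cases
    quietly sample 140 if disease==0 & female==1, count
    quietly sample 460 if disease==0 & female==0, count
    quietly logistic disease exposed            // nonmatched analysis
    local crude = _b[exposed]                   // log OR for exposed
    quietly logistic disease exposed female     // matching variable in model
    post `sim' (`crude') (_b[exposed])
    restore                                     // back to the full population
}
postclose `sim'

* Summarize on the log scale, then back-transform the averages
use `results', clear
quietly summarize lnor_crude
display "nonmatched analysis: exp(mean log OR) = " exp(r(mean))
quietly summarize lnor_adj
display "gender in the model: exp(mean log OR) = " exp(r(mean))
```

The averages are taken on the log scale because the sampling distribution of an odds ratio is skewed. The back-transformed values can then be compared with the population OR of 5.02.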
This illustration shows that a matched analysis eliminates the selection bias that is present when a nonmatched analysis is used, a bias introduced by a matching factor that is correlated with the exposure.

Notice that the ordinary logistic regression, which included the matching variable as a predictor, gave a result nearly identical to that of the conditional logistic regression (OR = 4.18 versus 4.17). When the matching is done on only a few distinct levels of a matching variable, such as with sex in this example, the data can be analyzed with ordinary logistic regression as long as the variables that were used to form the match are included in the model (Jewell, 2004, p.258).

Consistent with Jewell’s statement, Cheung (2003) points out that frequency-matched data do not require conditional logistic regression, as long as the match variables are included in the unconditional model as covariates:

“Frequency-matched data do not require conditional logistic regression; individually matched data do. As Rahman et al. themselves pointed out, the unconditional logistic regression is a suitable analytic method if the number of parameters is small in relation to the number of subjects. This is usually the case in the analysis of frequency-matched case-control data. As long as the matching variables are included in the unconditional logistic regression model as covariates, the odds ratios will not be biased by the procedure of frequency matching. To avoid suspicion, frequency-matched case-control studies should always report whether they have included the matching variables in the analysis.”

When you report the model, however, you would not show the lines for the match variables. Because the design forced balance on these variables between cases and controls, their ORs do not estimate their effects on disease; merely mention in a footnote that they were included in the model. Frequency-matched studies, which generally have only a few distinct levels of a matching variable, are often analyzed this way.
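To see the agreement between the two approaches side by side, and to display only the exposure line as one would when reporting, something like the following sketch could be used. It assumes the frequency-matched sample from the illustration above is still in memory; the stored-estimate names uncond and cond are hypothetical labels chosen for this sketch.

```stata
* Sketch: compare unconditional and conditional fits for the exposure OR
quietly logistic disease exposed female       // matching variable as covariate
estimates store uncond
quietly clogit disease exposed, group(female) // conditional on matched strata
estimates store cond

* eform shows odds ratios; keep() restricts the table to the exposure line
estimates table uncond cond, keep(exposed) eform b(%9.4f)
```

Restricting the table with keep(exposed) mirrors the reporting advice above: the matching variable stays in the model but is left out of the displayed results.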
Analysis of Matched Cohort Studies

Usually cohort studies are analyzed with Cox regression, or with Poisson regression if the data are in aggregated form or are simply counts of events. Both models allow one or more stratification, or matching, variables to be specified, and both are available in Stata.

References

Cheung Y-B. (2003). Analysis of matched case-control data. J Clin Epidemiol 56:814.

Cochran WG. (1954). Some methods of strengthening the common chi-square test. Biometrics 10:417-51.

Gart JJ. (1976). Letter: Contingency tables. The American Statistician 30:204.

Greenland S, Morgenstern H. (1990). Matching and efficiency in cohort studies. Am J Epidemiol 131:151-159.

Howe GR, Choi BCK. (1983). Methodological issues in case-control studies: validity and power of various design/analysis strategies. Int J Epidemiol 12:238-245.

Jewell NP. (2004). Statistics for Epidemiology. New York, Chapman & Hall/CRC.

Kleinbaum DG, Klein M. (2002). Logistic Regression: A Self-Learning Text. 2nd ed. New York, Springer-Verlag.

Kupper LL, Karon JM, Kleinbaum DG et al. (1981). Matching in epidemiologic studies: validity and efficiency considerations. Biometrics 37:271-292.

Mantel N. (1977). Letter: Contingency tables - a reply. The American Statistician 31:135.

Mantel N, Haenszel W. (1959). Statistical aspects of the analysis of data from the retrospective studies of disease. J. National Cancer Inst. 22:719-48.

Miettinen OS. (1970). Matching and design efficiency in retrospective studies. Am J Epidemiol 91:111-118.

Monson RR. (1980). Occupational Epidemiology. Boca Raton, FL, CRC Press, Inc.

Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont, CA, Duxbury Press.

Rothman KJ. (2002). Epidemiology: An Introduction. Oxford, Oxford University Press.

Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA, Lippincott-Raven Publishers.

Rothman KJ, Greenland S, Lash TL. (2008). Design Strategies to Improve Study Accuracy. In: Rothman KJ, Greenland S, Lash TL, eds. Modern Epidemiology, 3rd ed. Philadelphia, PA, Lippincott Williams & Wilkins, pp. 168-182.

Siegel S, Castellan NJ Jr. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd ed. New York, McGraw-Hill.

Smith PG, Day NE. (1981). Matching and confounding in the design and analysis of epidemiological case-control studies. In: Bithell JF, Coppi R, eds. Perspectives in Medical Statistics. New York, Academic Press.

Thomas DC, Greenland S. (1983). The efficiency of matching case-control studies of risk factor interaction. J Chronic Dis 38:569-574.

Thompson WD, Kelsey JL, Walter SD. (1982). Cost and efficiency in the choice of matched and unmatched case-control studies. Am J Epidemiol 116:840-851.