Chapter 5-12. Variable Selection and Collinearity

In this chapter, we discuss how variable selection differs depending on the goal of the model. We will cover various approaches to variable selection and the concept of collinearity.

Goal of the Model

Researchers usually have more potential predictor variables than end up in the final model. Which variables to include is largely a question of the goal of the model. Vittinghoff et al (2005, p.134) list three possible goals:

“1. Prediction. Here the primary issue is minimizing prediction error rather than causal interpretation of the predictors in the model….
2. Evaluating a predictor of primary interest. In pursuing this inferential goal, a central problem in observational data is confounding, which relatively inclusive models are more likely to minimize. Predictors necessary for face validity as well as those that behave like confounders should be included in the model….
3. Identifying the important independent predictors of an outcome. This is the most difficult of the three inferential goals, and one in which both causal interpretation and statistical inference are most problematic. Pitfalls include false-positive associations, the potential complexity of causal pathways, and the difficulty of identifying a single best model. We also endorse inclusive models in this context, and recommend a selection procedure that affords increased protection against false-positive results. Cautious interpretation of weak associations is key to this approach.”

Evaluating a predictor of primary interest

If the goal is to evaluate a predictor of primary interest, then eliminating variables based solely on statistical significance is not the best approach. Vittinghoff et al (2005, p.146) state,

“However, we do not recommend ‘parsimonious’ models that only include predictors that are statistically significant at P < 0.05 or even stricter criteria, because the potential for residual confounding in such models is substantial.”

Maldonado and Greenland (1993) suggest that potential confounders be eliminated only if p > 0.20, in order to protect against residual confounding. For the other goals, identifying the list of important predictors or developing a prediction model, retaining only significant predictors makes sense.

“10% change in estimate” variable selection rule

Confounding is said to be present if the unadjusted effect differs from the effect adjusted for putative confounders (Rothman and Greenland, 1998). A variable selection rule consistent with this definition of confounding is the change-in-estimate method of variable selection. In this method, a potential confounder is included in the model if it changes the coefficient, or effect estimate, of the primary exposure variable by 10% or more. This method has been shown to produce more reliable models than variable selection methods based on statistical significance (Greenland, 1989).

Protocol Suggestion

Grant reviewers like to see some discussion about variable selection. Here is some suggested wording for the change-in-estimate method.

Given that the goal of the multivariable model is to assess the effect of the study intervention, while controlling for putative confounding variables, variable selection will be done using the change-in-estimate method. This method has been shown to produce more reliable models than variable selection methods based on statistical significance (Greenland, 1989). In this method, a potential confounder is included in the model if it changes the coefficient of the primary exposure variable, our study intervention, by 10 percent. This approach is consistent with the definition of confounding, where confounding is said to be present if the unadjusted effect differs from the effect adjusted for putative confounders (Rothman and Greenland, 1998).
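As a minimal sketch, not part of the suggested protocol wording, of how the change-in-estimate screen might be computed in Stata, using hypothetical variable names outcome, exposure, and conf1:

* Model without the candidate confounder
regress outcome exposure
scalar b_crude = _b[exposure]

* Model with the candidate confounder added
regress outcome exposure conf1
scalar b_adj = _b[exposure]

* Percent change in the exposure coefficient; retain conf1 if this is 10% or more
display 100*abs((b_adj - b_crude)/b_crude)

For a ratio measure, such as an odds ratio or hazard ratio, the same comparison is usually made on the coefficient (log) scale.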
Using Both 10% Rule and P Values

The most common approach is to just use significance for variable selection. It is rare to see just the 10% rule being used, even though it is a better approach. Frequently, authors use a combination of the two approaches. This is consistent with what was said above under the heading “2. Evaluating a predictor of primary interest”, where it was mentioned that variables needed for face validity should be included.

Example 1) “10% change in estimate” and statistical significance variable selection

Kulkarni et al (N Engl J Med, 2006) state in their Statistical Analysis section,

“In the multiple regression models, confounders were included if they were significant at a 0.05 level or they altered the coefficient of the main variable by more than 10 percent in cases in which the main association was significant.”

Example 2) “10% change in estimate” and statistical significance variable selection

Chaves et al (N Engl J Med, 2007) state in their Statistical Analysis section,

“We examined any association between potential predictors of increased severity of disease separately for subjects who were vaccinated and those who were not vaccinated, using a two-sided chi-square test. We constructed two unconditional logistic-regression models—one for vaccinated subjects and one for unvaccinated subjects—to determine which variables remained independent predictors that subjects would have moderate-to-severe disease. Variables that had a significant association with disease severity in the univariate analysis were included in the multivariate regression models. Variables that were not significantly associated with disease severity but that changed the odds ratio for severity by 10% or more when removed from the analysis were also kept in the final model.17”

17. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J Epidemiol 1993;138:923-936.

More Cautious Approach to Guard Against Confounding (10% Rule + conservative P value + a priori confounders)

Since confounding does not depend on statistical significance, nor is the 10% rule a definitive cutpoint for defining confounding, some investigators take a more cautious approach. Thompson et al (N Engl J Med, 2007) provide a good example in their Statistical Analysis section,

“We analyzed raw test scores adjusted for a priori confounders, including linear terms for age, family income, and score on the HOME scale14,15 and dummy-coded variables for sex, HMO, maternal IQ, maternal education, single-parent status, and birth weight. Other covariates were included in the full model if the P value was less than 0.20 or if their inclusion resulted in a change of 10% or more in the estimate of the main effect of mercury exposure19,20…”

19. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J Epidemiol 1993;138:923-936.
20. Budtz-Jørgensen E, Keiding N, Grandjean P, Weihe P. Confounder selection in environmental epidemiology: assessment of health effects of prenatal mercury exposure. Ann Epidemiol 2007;17:27-35.
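A minimal sketch of how the a priori confounders could be forced into every model while the remaining candidate covariates are screened at p < 0.20, using Stata's stepwise prefix with its lockterm1 option (variable names here are hypothetical placeholders; the 10%-change part of the rule would still be checked by hand, as sketched earlier):

* The parenthesized first term (exposure plus a priori confounders) is locked
* into all models; cand1-cand3 are screened for removal at p >= 0.20
stepwise, pr(.2) lockterm1: regress testscore (exposure age faminc sex) cand1 cand2 cand3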
Backwards Elimination

Backwards selection is considered superior to forwards selection (forward selection adds one variable at a time), because negatively confounded sets of variables are less likely to be omitted from the model (Sun et al, 1999), since the complete set is included in the initial model. In contrast, forward and stepwise selection procedures (stepwise is where variables can be added and subsequently removed) will only include such sets if at least one member meets the inclusion criterion in the absence of the others (Vittinghoff et al, 2005, p.151).

By “negatively confounded sets”, we are referring to the situation where two or more variables must be included in the model as a set to control for confounding. When one of the variables is dropped, confounding increases.

Budtz-Jørgensen et al (2006) recommend using p = 0.20 as the cut-off when backwards elimination is used. In the Thompson et al (2007) example shown above, the researchers use p = 0.20 and cite Budtz-Jørgensen.

Automated Variable Selection Procedures

Statistical software packages provide automated variable selection routines, giving you the choice of forward, backward, or stepwise. Although these were once popular, they have fallen under enough criticism that it is very rare to find an article that admits to using them. These automated routines, although finding a significant set of predictors, have no way to make decisions about collinearity or confounding, and they can even produce nonsensical models (Greenland, 1989).

A better approach, then, is to use “interactive backwards elimination”, where you, the researcher, make the decision at each step.

The automated variable selection routines are available in Stata for any type of regression model. We will practice with the Framingham Heart Study dataset.

Framingham Heart Study dataset (2.20.Framingham.dta)

This is a dataset distributed with Dupont (2002, p 77). The dataset comes from a long-term follow-up study of cardiovascular risk factors on 4699 patients living in the town of Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline exam (recruitment of patients started in 1948).

Data Codebook

Baseline exam:
  sbp       systolic blood pressure (SBP) in mm Hg
  dbp       diastolic blood pressure (DBP) in mm Hg
  age       age in years
  scl       serum cholesterol (SCL) in mg/100ml
  bmi       body mass index (BMI) = weight/height² in kg/m²
  sex       gender (1=male, 2=female)
  month     month of year in which baseline exam occurred
  id        patient identification variable (numbered 1 to 4699)

Follow-up information on coronary heart disease:
  followup  follow-up in days
  chdfate   CHD outcome (1=patient develops CHD at the end of follow-up, 0=otherwise)

Reading in the data,

File
  Open
    Find the directory where you copied the course CD
    Change to the subdirectory datasets & do-files
    Single click on 2.20.Framingham.dta
    Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\2.20.Framingham.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 2.20.Framingham.dta, clear

These are cohort data with follow-up times, so Cox regression is a good choice.
Informing Stata that these are survival analysis data, and recoding sex into a 0-1 variable,

stset followup , failure(chdfate==1)
recode sex 1=1 2=0 ,gen(male)
tab sex male, nolabel

Using all of the variables, we could potentially fit the following model,

stcox sbp dbp age scl bmi male

Cox regression -- Breslow method for ties

No. of subjects =       4658                     Number of obs   =      4658
No. of failures =       1465
Time at risk    =   37582433
                                                 LR chi2(6)      =    770.54
Log likelihood  = -11373.616                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.010413   .0018671     5.61   0.000     1.006761     1.01408
         dbp |   1.004468   .0033593     1.33   0.183     .9979051    1.011073
         age |   1.043218     .00364    12.13   0.000     1.036108    1.050377
         scl |   1.005465   .0005845     9.38   0.000      1.00432    1.006611
         bmi |    1.03364   .0069549     4.92   0.000     1.020098    1.047361
        male |   2.192773   .1199258    14.36   0.000     1.969882    2.440884
------------------------------------------------------------------------------

For illustration, let's reduce the sample size so that we do not get so much significance, making variable selection more of a challenge,

set seed 999
sample 200 , count
tab chdfate
stcox sbp dbp age scl bmi male

   Coronary |
      Heart |
    Disease |      Freq.     Percent        Cum.
------------+-----------------------------------
   Censored |        137       68.50       68.50
        CHD |         63       31.50      100.00
------------+-----------------------------------
      Total |        200      100.00

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.013399   .0087138     1.55   0.122     .9964632    1.030622
         dbp |   .9927938   .0157639    -0.46   0.649     .9623729    1.024176
         age |   1.041757     .01777     2.40   0.016     1.007504    1.077174
         scl |   1.007924   .0028503     2.79   0.005     1.002353    1.013526
         bmi |   1.041716   .0356593     1.19   0.233     .9741182    1.114005
        male |   2.312335   .6250733     3.10   0.002     1.361297    3.927793
------------------------------------------------------------------------------

We see that the 63 events allow for 6 predictors (m/10 rule), as discussed in the sample size chapter (Chapter 2-5, p.30). Also, only one-half of the predictors are now significant.

To select the variables using backward selection, removing variables in order of least significance until all retained variables have p < 0.05 (p for removal 0.05), we use

stepwise , pr(.05): stcox sbp dbp age scl bmi male

                      begin with full model
p = 0.6488 >= 0.0500  removing dbp
p = 0.2676 >= 0.0500  removing bmi

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199
No. of failures =         63
Time at risk    =    1647414
                                                 LR chi2(4)      =     33.62
Log likelihood  = -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------
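For comparison, here is a minimal sketch of the interactive version of the same backward elimination, where you refit the model yourself after each removal. At each step you would also look at how the coefficients of the remaining variables change, a confounding check that the automated routine does not make.

stcox sbp dbp age scl bmi male   // full model: dbp has the largest p-value (0.649)
stcox sbp age scl bmi male       // refit without dbp: bmi is now the least significant
stcox sbp age scl male           // refit without bmi: all retained variables have p < 0.05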
To select the variables using forward selection, adding variables in order of most significance until no remaining variable meets the p < 0.05 entry criterion (p for entry 0.05), we use

stepwise , pe(.05): stcox sbp dbp age scl bmi male

                      begin with empty model
p = 0.0001 < 0.0500   adding scl
p = 0.0054 < 0.0500   adding sbp
p = 0.0085 < 0.0500   adding male
p = 0.0142 < 0.0500   adding age

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199
No. of failures =         63
Time at risk    =    1647414
                                                 LR chi2(4)      =     33.62
Log likelihood  = -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
------------------------------------------------------------------------------

Using stepwise selection, where variables can enter and leave the model, which is a combination of the forward and backwards selection procedures, we use

stepwise , pe(.05) pr(.10): stcox sbp dbp age scl bmi male

                      begin with full model
p = 0.6488 >= 0.1000  removing dbp
p = 0.2676 >= 0.1000  removing bmi

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199
No. of failures =         63
Time at risk    =    1647414
                                                 LR chi2(4)      =     33.62
Log likelihood  = -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------

In this example, we ended up with the same final model with each of the variable selection procedures. That is frequently the case, but it does not have to occur.

Sometimes we need a set of indicator variables to be entered or removed as a set. If we converted BMI to four categories, that would be the case. Recoding BMI,

recode bmi 30/max=4 25/30=3 18.5/25=2 min/18.5=1 ,gen(bmicat)
tab bmicat , gen(bmicat)

  RECODE of |
  bmi (Body |
Mass Index) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        1.00        1.00
          2 |         98       49.00       50.00
          3 |         68       34.00       84.00
          4 |         32       16.00      100.00
------------+-----------------------------------
      Total |        200      100.00

We see that category 1, “underweight”, has only 2 observations. We will let this become part of the referent, category 2, “normal”, by leaving both bmicat1 and bmicat2 out of the model. To specify that the two remaining BMI indicators be entered or removed together as a set, we simply include them in parentheses,

stepwise , pr(.05): stcox sbp dbp age scl (bmicat3 bmicat4) male

                      begin with full model
p = 0.6846 >= 0.0500  removing dbp
p = 0.6545 >= 0.0500  removing bmicat3 bmicat4

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199
No. of failures =         63
Time at risk    =    1647414
                                                 LR chi2(4)      =     33.62
Log likelihood  = -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------
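If you instead keep the grouped indicators in a model and want a single joint test for the set, one option (a minimal sketch) is a joint Wald test with testparm after fitting the model:

stcox sbp dbp age scl bmicat3 bmicat4 male
testparm bmicat3 bmicat4   // joint Wald test that both BMI indicator coefficients are zero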
Example (Backwards Elimination Variable Selection). Itani et al (N Engl J Med, 2006) is an example of a paper that reports using a backwards elimination variable selection. In their Statistical Analysis section they state,

“An exploratory evaluation assessed whether preoperative and interoperative risk factors contributed to the development of surgical-site infection. For the univariate analysis, the significance level of each factor was tested alone. For the multivariate analysis, a backward-elimination approach in a multiple logistic-regression model was performed. In this model, the significant factors from the univariate analysis were removed one at a time, starting with the factor that had the largest P value, until all remaining factors had a two-sided P value of less than 0.10. Odds ratios and P values were reported for each factor alone and for the factors found to be significant from the backward elimination.”

In their Table 5, they report the univariate risk factors and the multivariate risk factors in separate columns of the same table. This is a very useful display. For example, if the reader expects to see a particular predictor in the model, because other papers have shown it to be significant, it is nice to be able to see that it was significant in a univariable model when it has been eliminated in the multivariable model; otherwise, the reader has more difficulty accepting your model.

Example (Stepwise Variable Selection). Weinstein et al (N Engl J Med, 2007) is an example of a paper that reports using a stepwise variable selection. Although they do not state whether it was automated or user interactive, it “reads” like it was automated. In their Statistical Analysis section they state,

“Baseline predictors of time until surgical treatment in both cohorts (including treatment crossovers) were determined by a stepwise proportional-hazards regression model with an inclusion criterion of P<0.1 to enter and P>0.05 to exit.”

Variable Selection Based on Significance: Wald Test vs Likelihood Ratio Test

The p value found in the regression model output is from a Wald test. It assumes a sufficiently large sample size in order to provide an accurate p value. Alternatively, if the model uses maximum likelihood estimation, which is what logistic regression and Cox regression use, significance can be tested using the likelihood ratio test. The p values of the two approaches are usually very close, particularly in moderate to large samples. In general, the likelihood-ratio test is more powerful than the Wald test, and so many statisticians advocate its use exclusively over the Wald test. However, the difference is usually small, so just using the Wald test because it is more convenient is still a reasonable choice.

For small sample sizes, Vittinghoff et al (2005, p.173) point out that the p values for the Wald test and likelihood-ratio test can differ substantially. They suggest that for small sample sizes the likelihood ratio test be used because it is, in general, more reliable.
The likelihood ratio test compares a “full” model to a “restricted” model, where the restricted model has one less predictor variable (Long and Freese, 2006, p.101-103). Other than that, the two models are the same. It is important to make sure that the same observations are used in both models; otherwise the two models are not comparable for the likelihood ratio test (Vittinghoff et al, 2005, p.173). Such a problem could arise from listwise deletion of missing data.

Practicing with the Framingham dataset, using a much smaller sample size, we will fit a model with age as a predictor and a second model omitting age.

use 2.20.Framingham.dta, clear
set seed 999
sample 50 , count
logistic chdfate age
estimates store fmodel    // store "full" model estimates
logistic chdfate          // just fit the baseline risk, or "intercept"
estimates store rmodel    // store "restricted" model estimates
lrtest fmodel rmodel

Logistic regression                               Number of obs   =        50
                                                  LR chi2(1)      =      4.25
                                                  Prob > chi2     =    0.0393
Log likelihood = -29.927666                       Pseudo R2       =    0.0663

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.08241   .0438029     1.96   0.050     .9998744    1.171759
------------------------------------------------------------------------------

Logistic regression                               Number of obs   =        50
                                                  LR chi2(0)      =     -0.00
                                                  Prob > chi2     =         .
Log likelihood = -32.051774                       Pseudo R2       =   -0.0000

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
------------------------------------------------------------------------------

Likelihood-ratio test                                  LR chi2(1)  =      4.25
(Assumption: rmodel nested in fmodel)                  Prob > chi2 =    0.0393

We see that the Wald test for age was just barely significant (p = 0.050), while the likelihood-ratio test provided more power (p = 0.039). Also, notice the note “(Assumption: rmodel nested in fmodel)”: Stata assumes you did it right, so it is up to you to make sure the reduced model is nested in the full model, which means the two models differ only by the one predictor you are testing. If you were testing race, with five indicator variables, then all five indicator variables would be omitted “as the one predictor” in the reduced model.

If you used a full model with three predictors and a reduced model with three different predictors, the likelihood-ratio test would give a meaningless p value. That is, you cannot use the likelihood ratio test to decide which of two models is the better model, unless one model is nested within the other.
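For reference, the likelihood-ratio statistic is computed from the two log likelihoods as LR chi2 = -2*(lnL_restricted - lnL_full), with degrees of freedom equal to the number of parameters dropped. In the example above, this works out to -2*(-32.051774 - (-29.927666)) = 4.25 on 1 degree of freedom, matching the lrtest output.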
Let’s try another example. This time we will use the same sample of n = 200 that we used above, which has one missing value for scl.

use 2.20.Framingham.dta, clear
set seed 999
sample 200 , count
sum chdfate sbp dbp age scl bmi sex

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     chdfate |       200        .315    .4656815          0          1
         sbp |       200     132.795    23.07608         80        225
         dbp |       200       81.33    12.91219         50        130
         age |       200       45.82    8.700349         32         65
         scl |       199    233.4472    42.53158        150        375
-------------+---------------------------------------------------------
         bmi |       200      25.678    4.277593       17.7       40.5
         sex |       200        1.57    .4963181          1          2

If we want to use the likelihood ratio test, the model with scl absent will have n = 200 observations and the model with it present will have n = 199. The likelihood ratio test requires that the same observations be present in both the full and restricted models. The best thing to do is to first impute the missing value, with hotdeck imputation or by replacing it with the median of scl, since <5% of the data were missing. Alternatively, you can reduce the dataset to those observations which are complete for all variables that will be modeled. To do this, you could use,

keep if chdfate~=. & sbp~=. & dbp~=. & age~=. & scl~=. & bmi~=. ///
   & sex~=.
sum chdfate sbp dbp age scl bmi sex

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     chdfate |       199    .3165829    .4663163          0          1
         sbp |       199    132.8593    23.11631         80        225
         dbp |       199    81.40704     12.8986         50        130
         age |       199    45.86432    8.699627         32         65
         scl |       199    233.4472    42.53158        150        375
-------------+---------------------------------------------------------
         bmi |       199    25.70201     4.27485       17.7       40.5
         sex |       199    1.567839    .4966258          1          2

In the keep statement, the “~=.” indicates not equal to missing.
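As an aside, and assuming a reasonably current version of Stata, the same complete-case restriction can be written more compactly with the missing() function, which returns 1 if any of its arguments is missing:

keep if !missing(chdfate, sbp, dbp, age, scl, bmi, sex)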
Illustrating the likelihood ratio test with more variables in the model,

logistic chdfate age scl
estimates store fmodel    // store "full" model estimates
logistic chdfate age      // fit the restricted model, omitting scl
estimates store rmodel    // store "restricted" model estimates
lrtest fmodel rmodel

Logistic regression                               Number of obs   =       199
                                                  LR chi2(2)      =     14.45
                                                  Prob > chi2     =    0.0007
Log likelihood = -117.00523                       Pseudo R2       =    0.0581

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.011431   .0192098     0.60   0.550      .974473    1.049791
         scl |   1.013314   .0040531     3.31   0.001     1.005401    1.021289
------------------------------------------------------------------------------

Logistic regression                               Number of obs   =       199
                                                  LR chi2(1)      =      2.75
                                                  Prob > chi2     =    0.0971
Log likelihood = -122.85266                       Pseudo R2       =    0.0111

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.0296   .0181825     1.65   0.099     .9945726    1.065861
------------------------------------------------------------------------------

Likelihood-ratio test                                  LR chi2(1)  =     11.69
(Assumption: rmodel nested in fmodel)                  Prob > chi2 =    0.0006

We see that the Wald test p value for scl, 0.001, is the same as the likelihood-ratio test p value, 0.0006, to three decimal places.

Increased False Positives (Type I Error) With Stepwise Variable Selection

The process of selecting variables for inclusion in your model based on significance, whether using a forward, backwards, or stepwise (combination) variable selection procedure, and whether automated or done manually by you, can produce unreliable results. That is, some of these variables will not be identified as significant predictors by other investigators. It is a type of multiplicity, or multiple comparison, problem. The p values assume that the variables were prespecified as important. Vittinghoff et al (2005, p.134) warn of this when fitting models to identify important independent predictors of an outcome, “…Pitfalls include false-positive associations...”

Steyerberg (2009, p.204) advises,

“The p-value of predictors in a stepwise model should generally not be trusted; the p-value is calculated as if the model was pre-specified.”

Most researchers are not even aware of this problem. Still, informing your reader of which of your predictors were identified in previous studies, and so pre-specified, and which are exploratory, would be helpful.

Multicollinearity

Multicollinearity is simply a linear relationship among the predictor variables. The term “collinearity” means two predictor variables are correlated, and “multicollinearity” means a predictor variable is correlated with two or more other predictor variables as a set. This can pose a problem, since regression models attempt to estimate the independent effects of each predictor variable. When a predictor is highly correlated with a set of other predictors, there is little variation left over for this predictor to describe.

Hamilton (2006, p.210) describes,

“When we add a new x variable that is strongly related to x variables already in the model, symptoms of possible trouble include the following:
1. Substantially higher standard errors, with correspondingly lower t statistics.
2. Unexpected changes in coefficient magnitudes or signs.
3. Nonsignificant coefficients despite a high R2.”

The best way to assess multicollinearity is to regress each predictor variable on all of the other predictor variables. Calculating 1 - R² then informs us of the fraction of that predictor’s variance that is independent of the rest.
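In symbols, for predictor j the variance inflation factor is VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared from regressing predictor j on the remaining predictors; this is the relationship behind the 1/VIF column in the Stata vif output shown below.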
To see what variables are collinearly related to sbp, by using sbp temporarily as the outcome variable, we use

use 2.20.Framingham.dta, clear
regress sbp dbp age scl bmi sex

      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------           F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374           Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411           R-squared     =  0.6610
-------------+------------------------------           Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908           Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000     1.263218    1.329481
         age |    .547101   .0245019    22.33   0.000     .4990657    .5951364
         scl |   .0074756   .0046025     1.62   0.104    -.0015475    .0164987
         bmi |   .1515357   .0517225     2.93   0.003      .050135    .2529364
         sex |   3.113267    .394185     7.90   0.000     2.340478    3.886057
       _cons |  -9.873957    1.88447    -5.24   0.000    -13.56841   -6.179502
------------------------------------------------------------------------------

Calculating 1 - R² gives the fraction of sbp’s variance that is independent of the rest: 1 - R² = 1 - 0.6610 = 0.3390, so 33.90% of sbp’s variance is independent of the remaining variables. This 33.9% is much smaller than 100%, so if we use sbp as a predictor variable along with the other variables in a regression model, we will have a multicollinearity problem.

[Note: for this diagnostic with a dichotomous variable as the dependent variable, we use the fact that linear regression with a dichotomous outcome is “approximately” okay, even though the model might produce predicted values outside of the 0-1 range. This was discussed in the logistic regression chapter.]

Rather than doing this for each variable separately, we can get it all at once using the vif (variance inflation factor) command after fitting the linear regression model of interest.

regress chdfate sbp dbp age scl bmi sex
vif

    Variable |       VIF       1/VIF
-------------+----------------------
         sbp |      2.95    0.339011
         dbp |      2.77    0.361306
         age |      1.27    0.789918
         bmi |      1.18    0.848778
         scl |      1.11    0.900151
         sex |      1.02    0.976823
-------------+----------------------
    Mean VIF |      1.72

Interpreting the VIF Output

The VIF column reflects the degree to which the other coefficients’ variances, and so standard errors, are increased due to the inclusion of that predictor. The 1/VIF column is the 1 - R² from regressing each predictor variable on the remaining predictor variables.

Hamilton (2006, p.212) provides a rule-of-thumb for interpreting the VIF table:

“How much variance inflation is too much? Chatterjee, Hadi, and Price (2000) suggest the following as guidelines for the presence of multicollinearity:
1. The largest VIF is greater than 10; or
2. the mean VIF is larger than 1.”

Some multicollinearity is okay in a model, so meeting points 1 or 2 of this rule-of-thumb does not mean the model “fails the multicollinearity diagnostic”. The VIF is simply informing you how the predictor variables are related, so you can make a decision about whether or not the predictor variables should be simultaneously included in your model. For example, if the predictors are all statistically significant, then you might want to retain them all, even though multicollinearity is present.

Protocol Suggestion

It is extremely rare to see any reference to how collinearity, or multicollinearity, is assessed in an article or discussed in a protocol. It is probably best to just leave such discussion out. If you wanted to do it, however, you could use something like,

The presence of collinearity in the linear regression model will be assessed with the variance inflation factor (VIF) diagnostic (Hamilton, 2006). If it is necessary to drop a variable due to collinearity, the decision will be made on the basis of clinical importance.
There is really no need for the vif command. You can tell whether collinearity is present simply by predicting any given variable from the list of other variables. To see which variables are collinearly related to sbp, use,

regress sbp dbp age scl bmi sex

      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------           F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374           Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411           R-squared     =  0.6610
-------------+------------------------------           Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908           Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000     1.263218    1.329481
         age |    .547101   .0245019    22.33   0.000     .4990657    .5951364
         scl |   .0074756   .0046025     1.62   0.104    -.0015475    .0164987
         bmi |   .1515357   .0517225     2.93   0.003      .050135    .2529364
         sex |   3.113267    .394185     7.90   0.000     2.340478    3.886057
       _cons |  -9.873957    1.88447    -5.24   0.000    -13.56841   -6.179502
------------------------------------------------------------------------------

Since all of these variables are significant, except scl, all of them except scl are collinearly related to sbp.

To see which variables are the most collinearly related, we can look at the standardized coefficients, or betas, which are the coefficients obtained after first converting all of the variables to standardized scores, or z scores.

regress sbp dbp age scl bmi sex, beta

      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------           F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374           Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411           R-squared     =  0.6610
-------------+------------------------------           Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908           Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000                 .7238852
         age |    .547101   .0245019    22.33   0.000                  .203824
         scl |   .0074756   .0046025     1.62   0.104                 .0146103
         bmi |   .1515357   .0517225     2.93   0.003                 .0271222
         sex |   3.113267    .394185     7.90   0.000                 .0677646
       _cons |  -9.873957    1.88447    -5.24   0.000                        .
------------------------------------------------------------------------------

With standardized variables, all of the variables are in the same units, so the beta coefficients, or standardized slopes, are directly comparable to each other. We see that diastolic blood pressure is highly correlated with systolic blood pressure, so you might want to include only one of the two in your model.

For a categorical variable, such as sex, you could use

logistic sex sbp dbp age scl bmi

outcome does not vary; remember:
   0 = negative outcome,
   all other nonmissing values = positive outcome
r(2000);

although you will get this error message because sex is not coded as 0-1. Instead use,

capture drop male
recode sex 1=1 2=0 ,gen(male)
tab sex male, nolabel
logistic male sbp dbp age scl bmi

Logistic regression                               Number of obs   =      4658
                                                  LR chi2(5)      =    109.95
                                                  Prob > chi2     =    0.0000
Log likelihood = -3137.7513                       Pseudo R2       =    0.0172

------------------------------------------------------------------------------
        male | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   .9820943    .002284    -7.77   0.000      .977628    .9865811
         dbp |   1.033619   .0041198     8.30   0.000     1.025576    1.041725
         age |   .9994883   .0039613    -0.13   0.897     .9917542    1.007283
         scl |   .9987112   .0007088    -1.82   0.069     .9973228    1.000101
         bmi |    1.03336    .008224     4.12   0.000     1.017366    1.049605
------------------------------------------------------------------------------
We discover that all of the variables, except age, are collinearly related to male.

Collinearity does not necessarily present a problem. The only time you really need to be concerned about it is when it makes you lose the significance of a variable when you add the related variable. You then need to stop and think about whether this is due to confounding, or simply because the two variables are expressions of the same thing. SBP and DBP do not confound each other, but are simply related expressions of blood pressure. If you lose the significance of SBP as a predictor when you add DBP, then simply drop DBP from the model. Since they are highly collinearly related, only one can be in the model at a time.

If the sample size is large, collinearity is even less of a problem. In a model fitted above,

Cox regression -- Breslow method for ties

No. of subjects =       4658                     Number of obs   =      4658

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.010413   .0018671     5.61   0.000     1.006761     1.01408
         dbp |   1.004468   .0033593     1.33   0.183     .9979051    1.011073
         . . .
------------------------------------------------------------------------------

the inclusion of dbp did not negate the significance of sbp. In the smaller sized sample model fitted above, however,

Cox regression -- no ties

No. of subjects =        199                     Number of obs   =       199

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.013399   .0087138     1.55   0.122     .9964632    1.030622
         dbp |   .9927938   .0157639    -0.46   0.649     .9623729    1.024176
         . . .
------------------------------------------------------------------------------

significance was lost. This is because smaller effects can be detected with larger sample sizes: in the larger sample there was still enough information for sbp to show an effect after controlling for dbp, only because the sample size was sufficient. So it is not that collinearity goes away with larger samples; it just does not detract from significance as much.
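If you want to see this symptom directly, a minimal sketch (after re-drawing the reduced sample as above; the exact numbers will depend on the sample drawn) is to fit the small-sample model with and without dbp and compare the standard error and p value of sbp:

use 2.20.Framingham.dta, clear
set seed 999
sample 200 , count
stset followup , failure(chdfate==1)
recode sex 1=1 2=0 ,gen(male)
stcox sbp age scl bmi male        // without dbp
stcox sbp dbp age scl bmi male    // with dbp: expect a noticeably larger standard error for sbp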
References

Budtz-Jørgensen E, Keiding N, Grandjean P, Weihe P. (2006). Confounder selection in environmental epidemiology: assessment of health effects of prenatal mercury exposure. Ann Epidemiol 17:27-35.

Chatterjee S, Hadi AS, Price B. (2000). Regression Analysis by Example. 3rd ed. New York, John Wiley and Sons.

Chaves SS, Gargiullo P, Zhang JX. (2007). Loss of vaccine-induced immunity to varicella over time. N Engl J Med 356(11):1121-1129.

Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to the Analysis of Complex Data. Cambridge UK, Cambridge University Press.

Greenland S. (1989). Modeling and variable selection in epidemiologic analysis. Am J Public Health 79(3):340-349.

Hamilton LC. (2006). Statistics With Stata: Updated for Version 9. Belmont CA, Thomson Brooks/Cole.

Kulkarni N, Pierse N, Rushton L, Grigg J. (2006). Carbon in airway macrophages and lung function in children. N Engl J Med 355(1):21-30.

Long JS, Freese J. (2006). Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College Station, TX, Stata Press.

Maldonado G, Greenland S. (1993). Simulation study of confounder-selection strategies. Am J Epidemiol 138:923-936.

Rothman KJ, Greenland S. (1998). Modern Epidemiology. 2nd ed. Philadelphia, PA, Lippincott-Raven Publishers.

Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer.

Sun GW, Shook TL, Kay GL. (1999). Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. Journal of Clinical Epidemiology 49:907-916.

Thompson WW, Price C, Goodson B, et al. (2007). Early thimerosal exposure and neuropsychological outcomes at 7 to 10 years. N Engl J Med 357(13):1281-1292.

Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. (2005). Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. New York, Springer.

Weinstein JN, Lurie JD, Tosteson TD, et al. (2007). Surgical versus nonsurgical treatment for lumbar degenerative spondylolisthesis. N Engl J Med 356(22):2257-2270.