Chapter 5-12. Variable Selection and Collinearity

In this chapter, we discuss how variable selection differs depending on the goal of the model.
We will cover various approaches to variable selection and the concept of collinearity.
Goal of the Model
Researchers usually have more potential predictor variables than end up in the final model. What
variables to include is largely a question of the goal of the model. Vittinghoff et al (2005, p.134)
list three possible goals:
“1. Prediction. Here the primary issue is minimizing prediction error rather than causal
interpretation of the predictors in the model….
2. Evaluating a predictor of primary interest. In pursuing this inferential goal, a central
problem in observational data is confounding, which relatively inclusive models are more
likely to minimize. Predictors necessary for face validity as well as those that behave like
confounders should be included in the model….
3. Identifying the important independent predictors of an outcome. This is the most
difficult of the three inferential goals, and one in which both causal interpretation and
statistical inference are most problematic. Pitfalls include false-positive associations, the
potential complexity of causal pathways, and the difficulty of identifying a single best
model. We also endorse inclusive models in this context, and recommend a selection
procedure that affords increased protection against false-positive results. Cautious
interpretation of weak associations is key to this approach.”
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Evaluating a predictor of primary interest
If the goal is to evaluate a predictor of primary interest, then eliminating variables based solely
on statistical significance is not the best approach.
Vittinghoff et al (2005, p.146) state,
“However, we do not recommend ‘parsimonious’ models that only include predictors
that are statistically significant at P < 0.05 or even stricter criteria, because the potential
for residual confounding in such models is substantial.”
Maldonado and Greenland (1993) suggest that potential confounders be eliminated only if p >
0.20, in order to protect against residual confounding.
For the other goals of identifying the important independent predictors or developing a prediction
model, retaining only significant predictors makes sense.
“10% change in estimate” variable selection rule
Confounding is said to be present if the unadjusted effect differs from the effect adjusted for
putative confounders (Rothman and Greenland, 1998).
A variable selection rule consistent with this definition of confounding is the change-in-estimate
method of variable selection. In this method, a potential confounder is included in the model if it
changes the coefficient, or effect estimate, of the primary exposure variable by 10%. This
method has been shown to produce more reliable models than variable selection methods based
on statistical significance (Greenland, 1989).
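As a minimal sketch of the method (using a hypothetical outcome y, exposure variable exposure,
and candidate confounder conf1, none of which appear in the datasets below):

* crude model: save the coefficient of the exposure
regress y exposure
scalar b_crude = _b[exposure]
* adjusted model: add the candidate confounder
regress y exposure conf1
scalar b_adj = _b[exposure]
* percent change in the exposure coefficient; retain conf1 if 10% or more
display 100*abs((b_adj - b_crude)/b_crude)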
Protocol Suggestion
Grant reviewers like to see some discussion about variable selection. Here is some suggested
wording for the change-in-estimate method.
Given that the goal of the multivariable model is to assess the effect of the study
intervention, while controlling for putative confounding variables, variable selection will
be done using the change-in-estimate method. This method has been shown to produce
more reliable models than variable selection methods based on statistical significance
(Greenland, 1989). In this method, a potential confounder is included in the model if it
changes the coefficient of the primary exposure variable, our study intervention, by 10
percent. This approach is consistent with the definition of confounding, where
confounding is said to be present if the unadjusted effect differs from the effect adjusted
for putative confounders (Rothman and Greenland, 1998).
Using Both 10% Rule and P Values
The most common approach is to use statistical significance alone for variable selection. It is
rare to see just the 10% rule used, even though it is a better approach. Frequently, authors use a
combination of the two approaches.
This is consistent with what was said above under the heading “2. Evaluating a predictor of
primary interest”, where it was mentioned that variables needed for face validity should be included.
Example 1) “10% change in estimate” and statistical significance variable selection
Kulkarni et al (N Engl J Med, 2006) state in their Statistical Analysis section,
“In the multiple regression models, confounders were included if they were significant at
a 0.05 level or they altered the coefficient of the main variable by more than 10 percent in
cases in which the main association was significant.”
Example 2) “10% change in estimate” and statistical significance variable selection
Chaves et al (N Engl J Med, 2007) state in their Statistical Analysis section,
“We examined any association between potential predictors of increased severity of
disease separately for subjects who were vaccinated and those who were not vaccinated,
using a two-sided chi-square test. We constructed two unconditional logistic-regression
models—one for vaccinated subjects and one for unvaccinated subjects—to determine
which variables remained independent predictors that subjects would have moderate-to-
severe disease. Variables that had a significant association with disease severity in the
univariate analysis were included in the multivariate regression models. Variables that
were not significantly associated with disease severity but that changed the odds ratio for
severity by 10% or more when removed from the analysis were also kept in the final
model.17”
___________________
17. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J
Epidemiol 1993;138:923-936.
More Cautious Approach to Guard Against Confounding (10% Rule + conservative P value + a priori confounders)
Since confounding does not depend on statistical significance, nor is the 10% rule a definitive
cutpoint for defining confounding, some investigators take a more cautious approach. Thompson
et al (N Engl J Med, 2007) provide a good example in their Statistical Analysis section,
“We analyzed raw test scores adjusted for a priori confounders, including linear terms for
age, family income, and score on the HOME scale14,15 and dummy-coded variables for
sex, HMO, maternal IQ, maternal education, single-parent status, and birth weight. Other
covariates were included in the full model if the P value was less than 0.20 or if their
inclusion resulted in a change of 10% or more in the estimate of the main effect of
mercury exposure19,20…”
___________________
19. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J
Epidemiol 1993;138:923-936.
20. Budtz-Jørgensen E, Keiding N, Grandjean P, Weihe P. Confounder selection in
environmental epidemiology: assessment of health effects of prenatal mercury exposure.
Ann Epidemiol 2007;17:27-35.
Backwards Elimination
Backwards selection is considered superior to forwards selection (forward selection adds one
variable at a time), because negatively confounded sets of variables are less likely to be omitted
from the model (Sun et al, 1999), since the complete set is included in the initial model. In
contrast, forward and stepwise (stepwise is where variables can be added and subsequently
removed) selection procedures will only include such sets if at least one member meets the
inclusion criterion in the absence of the others. (Vittinghoff et al, 2005, p.151).
By “negatively confounded sets”, we are referring to the situation where two or more variables
must be included in the model as a set to control for confounding. When one of the variables is
dropped, confounding increases.
Budtz-Jørgensen et al (2006) recommend using p=0.20 as the cut-off when backwards
elimination is used. In the Thompson et al (2007) example shown above, the
researchers use p=0.20 and cite Budtz-Jørgensen.
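In Stata, this criterion can be applied with the stepwise prefix, shown here as a sketch using the
Framingham predictors introduced later in this chapter (the data must first be declared survival
data with stset, as shown below):

stepwise , pr(.20): stcox sbp dbp age scl bmi male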
Automated Variable Selection Procedures
Statistical software packages provide automated variable selection routines, giving you the
choice of forward, backward, or stepwise. Although these were once popular, they have fallen
under enough criticism that it is very rare to find an article that admits to using them. These
automated routines, although finding a significant set of predictors, have no way to make
decisions about collinearity or confounding, and they can even produce nonsensical models
(Greenland, 1989).
A better approach, then, is to use “interactive backwards elimination”, where you, the researcher,
make the decision at each step.
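As a sketch of what interactive backwards elimination looks like with the Framingham predictors
used below (the decision at each step is yours, not the software’s):

* full model
stcox sbp dbp age scl bmi male
* suppose dbp has the largest p-value; refit without it
stcox sbp age scl bmi male
* before dropping dbp for good, check whether the coefficient of the
* primary exposure changed by 10% or more (change-in-estimate criterion)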
The automated variable selection routines are available in Stata for any type of regression model.
We will practice with the Framingham Heart Study dataset.
Framingham Heart Study dataset (2.20.Framingham.dta)
This is a dataset distributed with Dupont (2002, p 77). The dataset comes from a long-term
follow-up study of cardiovascular risk factors on 4699 patients living in the town of
Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline
exam (recruitment of patients started in 1948).
Data Codebook
Baseline exam:
  sbp       systolic blood pressure (SBP) in mm Hg
  dbp       diastolic blood pressure (DBP) in mm Hg
  age       age in years
  scl       serum cholesterol (SCL) in mg/100ml
  bmi       body mass index (BMI) = weight/height² in kg/m²
  sex       gender (1=male, 2=female)
  month     month of year in which baseline exam occurred
  id        patient identification variable (numbered 1 to 4699)

Follow-up information on coronary heart disease:
  followup  follow-up in days
  chdfate   CHD outcome (1=patient develops CHD at the end of follow-up, 0=otherwise)
Reading in the data,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on 2.20.Framingham.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
2.20.Framingham.dta", clear
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 2.20.Framingham.dta, clear
These are cohort data with follow-up times, so Cox regression is a good choice. Informing Stata
that these are survival analysis data, and recoding sex into a 0-1 variable,
stset followup , failure(chdfate==1)
recode sex 1=1 2=0 ,gen(male)
tab sex male, nolabel
Using all of the variables, we could potentially fit the following model,
stcox sbp dbp age scl bmi male
Cox regression -- Breslow method for ties

No. of subjects =         4658                     Number of obs   =      4658
No. of failures =         1465
Time at risk    =     37582433
                                                   LR chi2(6)      =    770.54
Log likelihood  =   -11373.616                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.010413   .0018671     5.61   0.000     1.006761     1.01408
         dbp |   1.004468   .0033593     1.33   0.183     .9979051    1.011073
         age |   1.043218     .00364    12.13   0.000     1.036108    1.050377
         scl |   1.005465   .0005845     9.38   0.000      1.00432    1.006611
         bmi |    1.03364   .0069549     4.92   0.000     1.020098    1.047361
        male |   2.192773   .1199258    14.36   0.000     1.969882    2.440884
------------------------------------------------------------------------------
For illustration, let’s reduce the sample size so that we do not get so much significance, making
variable selection more of a challenge.
set seed 999
sample 200 , count
tab chdfate
stcox sbp dbp age scl bmi male
    Coronary |
       Heart |
     Disease |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Censored |        137       68.50       68.50
         CHD |         63       31.50      100.00
-------------+-----------------------------------
       Total |        200      100.00

Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.013399   .0087138     1.55   0.122     .9964632    1.030622
         dbp |   .9927938   .0157639    -0.46   0.649     .9623729    1.024176
         age |   1.041757     .01777     2.40   0.016     1.007504    1.077174
         scl |   1.007924   .0028503     2.79   0.005     1.002353    1.013526
         bmi |   1.041716   .0356593     1.19   0.233     .9741182    1.114005
        male |   2.312335   .6250733     3.10   0.002     1.361297    3.927793
------------------------------------------------------------------------------
We see that the 63 events allow for 6 predictors (m/10 rule), as discussed in the sample size
chapter (Chapter 2-5, p.30). Also, only one-half of the predictors are now significant.
To select the variables using backward selection, removing variables in order of least
significance until all retained variables have p<0.05 (p for removal 0.05), we use
stepwise , pr(.05): stcox sbp dbp age scl bmi male
                      begin with full model
p = 0.6488 >= 0.0500  removing dbp
p = 0.2676 >= 0.0500  removing bmi
Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199
No. of failures =           63
Time at risk    =      1647414
                                                   LR chi2(4)      =     33.62
Log likelihood  =   -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------
To select the variables using forward selection, adding variables in order of most significance
until all retained variables have p<.05 (p for entry 0.05), we use
stepwise , pe(.05): stcox sbp dbp age scl bmi male
                      begin with empty model
p = 0.0001 <  0.0500  adding scl
p = 0.0054 <  0.0500  adding sbp
p = 0.0085 <  0.0500  adding male
p = 0.0142 <  0.0500  adding age
Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199
No. of failures =           63
Time at risk    =      1647414
                                                   LR chi2(4)      =     33.62
Log likelihood  =   -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
------------------------------------------------------------------------------
Using stepwise, where variables can enter and leave the model, which is a combination of the
forward and backwards selection procedures, we use
stepwise , pe(.05) pr(.10): stcox sbp dbp age scl bmi male
                      begin with full model
p = 0.6488 >= 0.1000  removing dbp
p = 0.2676 >= 0.1000  removing bmi
Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199
No. of failures =           63
Time at risk    =      1647414
                                                   LR chi2(4)      =     33.62
Log likelihood  =   -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------
In this example, we ended up with the same final model with each of the variable selection
procedures. That is frequently the case, but it does not have to occur.
Sometimes we need a group of indicator variables to be entered or removed together as a set. If we
converted BMI to four categories, that would be the case. Recoding BMI,
recode bmi 30/max=4 25/30=3 18.5/25=2 min/18.5=1 ,gen(bmicat)
tab bmicat , gen(bmicat)
 RECODE of   |
 bmi (Body   |
 Mass Index) |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |          2        1.00        1.00
           2 |         98       49.00       50.00
           3 |         68       34.00       84.00
           4 |         32       16.00      100.00
-------------+-----------------------------------
       Total |        200      100.00
We see that category 1 “underweight” has only 2 observations. We will let this become part of
the referent, category 2 “normal”, by leaving both bmicat1 and bmicat2 out of the model.
To specify the two BMI categories being removed together, we simply include them in
parentheses,
stepwise , pr(.05): stcox sbp dbp age scl (bmicat3 bmicat4) male
                      begin with full model
p = 0.6846 >= 0.0500  removing dbp
p = 0.6545 >= 0.0500  removing bmicat3 bmicat4
Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199
No. of failures =           63
Time at risk    =      1647414
                                                   LR chi2(4)      =     33.62
Log likelihood  =   -292.06721                     Prob > chi2     =    0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.012388   .0055913     2.23   0.026     1.001489    1.023406
        male |   2.348138   .6325655     3.17   0.002       1.3849    3.981338
         age |   1.043096   .0179411     2.45   0.014     1.008519     1.07886
         scl |   1.007679   .0028454     2.71   0.007     1.002117    1.013271
------------------------------------------------------------------------------
Example (Backwards Elimination Variable Selection). Itani et al (N Engl J Med, 2006) is an
example of a paper that reports using a backwards elimination variable selection. In their
Statistical Analysis section they state,
“An exploratory evaluation assessed whether preoperative and intraoperative risk factors
contributed to the development of surgical-site infection. For the univariate analysis, the
significance level of each factor was tested alone. For the multivariate analysis, a
backward-elimination approach in a multiple logistic-regression model was performed.
In this model, the significant factors from the univariate analysis were removed one at a
time, starting with the factor that had the largest P value, until all remaining factors had a
two-sided P value of less than 0.10. Odds ratios and P values were reported for each
factor alone and for the factors found to be significant from the backward elimination.”
In their Table 5, they report the univariate risk factors and the multivariate risk factors in separate
columns of the same table. This is a very useful display. For example, if the reader expects to
see a particular predictor in the model, because other papers have shown it to be significant, it is
nice to be able to see it was significant in a univariable model when it has been eliminated from the
multivariable model—otherwise, the reader has more difficulty accepting your model.
Example (Stepwise Variable Selection). Weinstein et al (N Engl J Med, 2007) is an example of
a paper that reports using a stepwise variable selection. Although they do not state whether it
was automated or user interactive, it “reads” like it was automated. In their Statistical Analysis
section they state,
“Baseline predictors of time until surgical treatment in both cohorts (including treatment
crossovers) were determined by a stepwise proportional-hazards regression model with an
inclusion criterion of P<0.1 to enter and P>0.05 to exit.”
Variable Selection Based on Significance: Wald Test vs Likelihood Ratio Test
The p value found in the regression model output is called a Wald test. It assumes a sufficiently
large sample size in order to provide an accurate p value. Alternatively, if the model uses
maximum likelihood estimation, which is what logistic regression and Cox regression use,
significance can be tested using the likelihood ratio test. The p values of the two approaches are
usually very close, particularly in moderate to large samples. In general, the likelihood-ratio test
is more powerful than the Wald test, and so many statisticians advocate its use exclusively
over the Wald test. However, the difference is usually small, so just using the Wald test because
it is more convenient is still a good choice.
For small sample sizes, Vittinghoff et al (2005, p.173) point out that the p values for the Wald
test and likelihood-ratio test can differ substantially. They suggest that for small sample sizes
the likelihood ratio test be used because it is, in general, more reliable.
The likelihood ratio test compares a “full” model to the “restricted” model, where the restricted
model has one less predictor variable (Long and Freese, 2006, p.101-103). Other than that, the
two models are similar. It is important to make sure that the same observations are used in both
models, otherwise the two models are not comparable for using the likelihood ratio test
(Vittinghoff et al, 2005, p.173). Such a problem could arise by listwise deletion of missing data.
Practicing with the Framingham dataset, using a much smaller sample size, we will fit a model
with age as a predictor and a second model omitting age.
use 2.20.Framingham.dta, clear
set seed 999
sample 50 , count
logistic chdfate age
estimates store fmodel // store "full" model estimates
logistic chdfate // just fit the baseline risk, or “intercept”
estimates store rmodel // store "restricted" model estimates
lrtest fmodel rmodel
Logistic regression                               Number of obs   =         50
                                                  LR chi2(1)      =       4.25
                                                  Prob > chi2     =     0.0393
Log likelihood = -29.927666                       Pseudo R2       =     0.0663

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.08241   .0438029     1.96   0.050     .9998744    1.171759
------------------------------------------------------------------------------

Logistic regression                               Number of obs   =         50
                                                  LR chi2(0)      =      -0.00
                                                  Prob > chi2     =          .
Log likelihood = -32.051774                       Pseudo R2       =    -0.0000

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
------------------------------------------------------------------------------

Likelihood-ratio test                                  LR chi2(1) =      4.25
(Assumption: rmodel nested in fmodel)                  Prob > chi2 =    0.0393
We see that the Wald test for age was just barely significant (p = 0.050), while the likelihood-ratio test provided more power (p = 0.039).
Also, notice the warning (Stata assumes you did it right) that it is up to you to make sure the
reduced model is nested in the full model, which means the two models differ only by the one
predictor you are testing.
If you were testing race, with five indicator variables, then all five indicator variables would be
omitted “as the one predictor” in the reduced model.
If you used a full model with three predictors and a reduced model with three different predictors,
the likelihood-ratio test would give a meaningless p value. That is, you cannot use the likelihood
ratio test to decide which of two models is the better model, unless one model is nested within
the other.
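A hypothetical sketch of that race example, assuming indicator variables race2-race6 have
already been created (no race variable exists in the Framingham dataset used here):

logistic chdfate age race2 race3 race4 race5 race6
estimates store fmodel
logistic chdfate age // all five indicators omitted as a set
estimates store rmodel
lrtest fmodel rmodel // likelihood-ratio test with 5 degrees of freedom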
Let’s try another example. This time we will use the same sample of n=200 that we used above,
which has one missing value for scl.
use 2.20.Framingham.dta, clear
set seed 999
sample 200 , count
sum chdfate sbp dbp age scl bmi sex
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     chdfate |       200        .315    .4656815          0          1
         sbp |       200     132.795    23.07608         80        225
         dbp |       200       81.33    12.91219         50        130
         age |       200       45.82    8.700349         32         65
         scl |       199    233.4472    42.53158        150        375
-------------+--------------------------------------------------------
         bmi |       200      25.678    4.277593       17.7       40.5
         sex |       200        1.57    .4963181          1          2
If we want to use the likelihood ratio test, the model with scl absent will be n=200 observations
and the model with it present will be n=199. The likelihood ratio test requires the same
observations be present in both the full and restricted model.
The best thing to do is to first impute the missing values with hotdeck imputation, or to replace
them with the median of scl, since <5% of the data were missing. Alternatively, you can reduce the
dataset to those observations which are complete for all variables that will be modeled. To do
this, you could use,
keep if chdfate~=. & sbp~=. & dbp~=. & age~=. & scl~=. & bmi~=. ///
& sex~=.
sum chdfate sbp dbp age scl bmi sex
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     chdfate |       199    .3165829    .4663163          0          1
         sbp |       199    132.8593    23.11631         80        225
         dbp |       199    81.40704     12.8986         50        130
         age |       199    45.86432    8.699627         32         65
         scl |       199    233.4472    42.53158        150        375
-------------+--------------------------------------------------------
         bmi |       199    25.70201     4.27485       17.7       40.5
         sex |       199    1.567839    .4966258          1          2
In the keep statement, the “~=.” indicates not equal to missing.
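A minimal sketch of the median-replacement option mentioned above (a simple form of single
imputation; hotdeck imputation requires a user-written command):

quietly summarize scl, detail
replace scl = r(p50) if scl == . // replace missing scl with the median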
Illustrating the likelihood ratio test with more variables in the model,
logistic chdfate age scl
estimates store fmodel // store "full" model estimates
logistic chdfate age // restricted model omitting scl
estimates store rmodel // store "restricted" model estimates
lrtest fmodel rmodel
Logistic regression                               Number of obs   =        199
                                                  LR chi2(2)      =      14.45
                                                  Prob > chi2     =     0.0007
Log likelihood = -117.00523                       Pseudo R2       =     0.0581

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.011431   .0192098     0.60   0.550      .974473    1.049791
         scl |   1.013314   .0040531     3.31   0.001     1.005401    1.021289
------------------------------------------------------------------------------

. estimates store fmodel // store "full" model estimates

. logistic chdfate age // restricted model omitting scl

Logistic regression                               Number of obs   =        199
                                                  LR chi2(1)      =       2.75
                                                  Prob > chi2     =     0.0971
Log likelihood = -122.85266                       Pseudo R2       =     0.0111

------------------------------------------------------------------------------
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.0296   .0181825     1.65   0.099     .9945726    1.065861
------------------------------------------------------------------------------

. estimates store rmodel // store "restricted" model estimates

. lrtest fmodel rmodel

Likelihood-ratio test                                  LR chi2(1) =     11.69
(Assumption: rmodel nested in fmodel)                  Prob > chi2 =   0.0006
We see that the Wald test p value, 0.001, is the same as the likelihood-ratio test p value, 0.0006,
to 3 decimal places.
Increased False Positives (Type I Error) With Stepwise Variable Selection
The process of selecting variables for inclusion in your model based on significance, whether
using a forward, backwards, or stepwise (combination) variable selection procedure, whether
automated or done manually by you, can produce unreliable results. That is, some of these
variables will not be identified as significant predictors by other investigators. It is a type of
muliplicity, or multiple comparison, problem. The p value assume that the variable was prespecified as important.
Vittinghoff et al (2005, p.134) warn of this when fitting models to identify important
independent predictors of an outcome,
“…Pitfalls include false-positive associations...”
Steyerberg (2009, p.204) advises,
“The p-value of predictors in a stepwise model should generally not be trusted; the
p-value is calculated as if the model was pre-specified.”
Most researchers are not even aware of this problem. Still, informing your reader of which of
your predictors were identified in previous studies, and so pre-specified, and which are
exploratory, would be helpful.
Multicollinearity
Multicollinearity is simply a linear relationship among the predictor variables. The term
“collinearity” means two predictor variables are correlated and “multicollinearity” means a
predictor variable is correlated with two or more other predictor variables as a set.
This can pose a problem since regression models attempt to estimate the independent effects of
each predictor variable. When a predictor is highly correlated with a set of other predictors, there
is little variation left over for this predictor to describe.
Hamilton (2006, p.210) describes,
“When we add a new x variable that is strongly related to x variables already in the
model, symptoms of possible trouble include the following:
1. Substantially higher standard errors, with correspondingly lower t statistics.
2. Unexpected changes in coefficient magnitudes or signs.
3. Nonsignificant coefficients despite a high R².”
The best way to assess multicollinearity is to regress each predictor variable on all of the other
predictor variables. Then calculating 1 - R² informs us of the fraction of the first predictor’s
variance that is independent of the rest.
To see what variables are collinearly related to sbp, by using sbp temporarily as the outcome
variable, we use
use 2.20.Framingham.dta, clear
regress sbp dbp age scl bmi sex
      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------          F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374          Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411          R-squared     =  0.6610
-------------+------------------------------          Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908          Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000     1.263218    1.329481
         age |    .547101   .0245019    22.33   0.000     .4990657    .5951364
         scl |   .0074756   .0046025     1.62   0.104    -.0015475    .0164987
         bmi |   .1515357   .0517225     2.93   0.003      .050135    .2529364
         sex |   3.113267    .394185     7.90   0.000     2.340478    3.886057
       _cons |  -9.873957    1.88447    -5.24   0.000    -13.56841   -6.179502
------------------------------------------------------------------------------
Then, calculating 1 - R² gives the fraction of sbp’s variance that is independent of the rest:

1 - R² = 1 - 0.6610 = 0.3390

so 33.90% of sbp’s variance is independent of the remaining variables. The 33.9% is much smaller
than 100%, so if we use sbp as a predictor variable along with the other variables as predictor
variables in a regression model, we will have a multicollinearity problem.
[Note: for this diagnostic with a dichotomous variable as the dependent variable, we use the fact
that linear regression with a dichotomous outcome is “approximately” okay, even though the
model might produce predicted values outside of the 0-1 range. This was discussed in the
logistic regression chapter.]
Rather than doing this for all variables, we can get this all at once using the vif command (for
variance inflation factor) after fitting the linear regression model of interest.
regress chdfate sbp dbp age scl bmi sex
vif
    Variable |       VIF       1/VIF
-------------+----------------------
         sbp |      2.95    0.339011
         dbp |      2.77    0.361306
         age |      1.27    0.789918
         bmi |      1.18    0.848778
         scl |      1.11    0.900151
         sex |      1.02    0.976823
-------------+----------------------
    Mean VIF |      1.72
Interpreting the VIF Output
The VIF column reflects the degree to which other coefficients’ variances, and so standard errors,
are increased due to the inclusion of that predictor.
The 1/VIF column is the 1 - R² from regressing each predictor variable on the remaining
predictor variables.
Hamilton (2006, p.212) provides a rule-of-thumb for interpreting the VIF table:
“How much variance inflation is too much? Chatterjee, Hadi, and Price (2000) suggest
the following as guidelines for the presence of multicollinearity:
1. The largest VIF is greater than 10; or
2. the mean VIF is larger than 1.”
Some multicollinearity is okay in a model, so meeting points 1 or 2 of this rule-of-thumb does
not mean the model “fails the multicollinearity diagnostic”. The VIF
is simply informing you how the predictor variables are related, so you can make a decision about
whether or not the predictor variables should be simultaneously included in your model. For
example, if the predictors are all statistically significant, then you might want to retain
them all, even though multicollinearity is present.
Protocol Suggestion
It is extremely rare to see any reference to how collinearity, or multicollinearity is assessed in an
article or discussed in a protocol. It is probably best to just leave such dicussion out. If you
wanted to do it, however, you could use something like,
The presence of collinearity in the linear regresison model will be assessed with the
variance inflation factor (VIF) diagnostic (Hamilton, 2006). If it is necessary to drop a
variable due to collinearity, the decision will be made on order of clinical importance.
There is really no need for the VIF command. You can tell if collinearity is present simply by
predicting any given variable by the list of other variables. To see which variables are collinearly
related to sbp, use,
regress sbp dbp age scl bmi sex
      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------          F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374          Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411          R-squared     =  0.6610
-------------+------------------------------          Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908          Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000     1.263218    1.329481
         age |    .547101   .0245019    22.33   0.000     .4990657    .5951364
         scl |   .0074756   .0046025     1.62   0.104    -.0015475    .0164987
         bmi |   .1515357   .0517225     2.93   0.003      .050135    .2529364
         sex |   3.113267    .394185     7.90   0.000     2.340478    3.886057
       _cons |  -9.873957    1.88447    -5.24   0.000    -13.56841   -6.179502
------------------------------------------------------------------------------
Since they are all significant, except scl, all of these variables are collinearly related to sbp.
To see which variables are the most collinearly related, we can look at the standardized
coefficients, or betas, which are the coefficients obtained after first converting all of the
variables to standardized scores, or z scores.
regress sbp dbp age scl bmi sex, beta
      Source |       SS       df       MS              Number of obs =    4658
-------------+------------------------------          F(  5,  4652) = 1814.05
       Model |  1599636.87     5  319927.374          Prob > F      =  0.0000
    Residual |  820428.634  4652  176.360411          R-squared     =  0.6610
-------------+------------------------------          Adj R-squared =  0.6606
       Total |   2420065.5  4657  519.661908          Root MSE      =   13.28

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         dbp |   1.296349   .0168998    76.71   0.000                 .7238852
         age |    .547101   .0245019    22.33   0.000                  .203824
         scl |   .0074756   .0046025     1.62   0.104                 .0146103
         bmi |   .1515357   .0517225     2.93   0.003                 .0271222
         sex |   3.113267    .394185     7.90   0.000                 .0677646
       _cons |  -9.873957    1.88447    -5.24   0.000                        .
------------------------------------------------------------------------------
With standardized variables, all of the variables are in the same units so the beta coefficients, or
standardized slopes, are directly comparable to each other. We see that diastolic blood pressure
is highly correlated to systolic blood pressure, so you might want to include only one of the two
in your model.
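A quick complementary check is the pairwise correlation matrix; pwcorr is a standard Stata
command, shown here as a sketch:

pwcorr sbp dbp age scl bmi, sig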
For a categorical variable, such as sex, you could use
logistic sex sbp dbp age scl bmi
outcome does not vary; remember:
0 = negative outcome,
all other nonmissing values = positive outcome
r(2000);
although you will get an error message because sex is not coded as 0-1.
Instead use,
capture drop male
recode sex 1=1 2=0 ,gen(male)
tab sex male, nolabel
logistic male sbp dbp age scl bmi
Logistic regression                               Number of obs   =       4658
                                                  LR chi2(5)      =     109.95
                                                  Prob > chi2     =     0.0000
Log likelihood = -3137.7513                       Pseudo R2       =     0.0172

------------------------------------------------------------------------------
        male | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   .9820943    .002284    -7.77   0.000      .977628    .9865811
         dbp |   1.033619   .0041198     8.30   0.000     1.025576    1.041725
         age |   .9994883   .0039613    -0.13   0.897     .9917542    1.007283
         scl |   .9987112   .0007088    -1.82   0.069     .9973228    1.000101
         bmi |    1.03336    .008224     4.12   0.000     1.017366    1.049605
------------------------------------------------------------------------------
We discover that all the variables, except age, are collinearly related to male.
Collinearity does not necessarily present a problem. The only time you really need to be
concerned about it is if it makes you lose significance of a variable when you add the related
variable. You then need to stop and think whether this is due to confounding, or simply because
the two variables are expressions of the same thing. SBP and DBP do not confound each other,
but are simply related expressions of blood pressure. If you lose the significance of SBP as a
predictor when you add DBP, then simply drop DBP from the model. Since they were highly
collinearly related, only one could be in the model at a time.
If the sample size is large, collinearity is even less of a problem. In a model fitted above,
Cox regression -- Breslow method for ties

No. of subjects =         4658                     Number of obs   =      4658

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.010413   .0018671     5.61   0.000     1.006761     1.01408
         dbp |   1.004468   .0033593     1.33   0.183     .9979051    1.011073
         . . .
------------------------------------------------------------------------------
the inclusion of dbp did not negate the significance of sbp.
In the smaller sized sample model fitted above, however,
Cox regression -- no ties

No. of subjects =          199                     Number of obs   =       199

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sbp |   1.013399   .0087138     1.55   0.122     .9964632    1.030622
         dbp |   .9927938   .0157639    -0.46   0.649     .9623729    1.024176
         . . .
------------------------------------------------------------------------------
significance was lost. This is because smaller effects can be detected with larger sample sizes:
in the larger sample there was still enough variability for sbp to explain, after controlling for
dbp, to reach significance. So, it is not that collinearity goes away with larger samples, it just
does not detract from significance as much.
References
Budtz-Jørgensen E, Keiding N, Grandjean P, Weihe P. (2006). Confounder selection in
environmental epidemiology: assessment of health effects of prenatal mercury exposure.
Ann Epidemiol 17:27-35.
Chatterjee S, Hadi AS, Price B. (2000). Regression Analysis by Example. 3rd ed. New York,
John Wiley and Sons.
Chaves SS, Gargiullo P, Zhang JX. (2007). Loss of vaccine-induced immunity to varicella over
time. N Engl J Med 356(11):1121-9.
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to
the Analysis of Complex Data. Cambridge UK, Cambridge University Press.
Greenland S. (1989). Modeling and variable selection in epidemiologic analysis. Am J
Public Health 79(3):340-349.
Hamilton LC. (2006). Statistics With Stata. Updated for Version 9. Belmont CA, Thomson
Brooks/Cole.
Kulkarni N, Pierse N, Rushton L, Grigg J. (2006). Carbon in airway macrophages and lung
function in children. N Engl J Med 355(1):21-30.
Long JS, Freese J. (2006). Regression Models for Categorical Dependent Variables Using Stata.
2nd ed. College Station, TX, Stata Press.
Maldonado G, Greenland S. (1993). Simulation study of confounder-selection strategies.
Am J Epidemiol 138:923-936.
Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA,
Lippincott-Raven Publishers.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
Sun GW, Shock TL, Kay GL. (1999). Inappropriate use of bivariable analysis to screen risk
factors for use in multivariable analysis. Journal of Clinical Epidemiology 49:907-916.
Thompson WW, Price C, Goodson B, et al. (2007). Early thimerosal exposure and
neuropsychological outcomes at 7 to 10 years. N Engl J Med 357(13):1281-1292.
Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. (2005). Regression Methods in
Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. New
York, Springer.
Weinstein JN, Lurie JD, Tosteson TD, et al. (2007). Surgical versus nonsurgical treatment for
lumbar degenerative spondylolisthesis. N Engl J Med 356(22):2257-70.