What You See May Not Be
What You Get:
A Primer on Regression Artifacts
Michael A. Babyak, PhD
Duke University Medical Center
Topics to Cover
1. Models: what and why?
2. Preliminaries: requirements for a good model
3. Dichotomizing a graded or continuous variable is dumb
4. Using degrees of freedom wisely
5. Covariate selection
6. Transformations and smoothing techniques for non-linear effects
7. Resampling as a superior method of model validation
What is a model?
Y = f(x1, x2, x3…xn)
Y = a + b1x1 + b2x2…bnxn
Y = e^(a + b1x1 + b2x2…bnxn)
Why Model? (instead of test)
1. Can capture theoretical/predictive
system
2. Estimates of population parameters
3. Allows prediction as well as hypothesis
testing
4. More information for replication
Preliminaries
1. Correct model
2. Measure well and don’t throw
information away
3. Adequate Sample Size
Correct Model
• Gaussian: General Linear Model
– Multiple linear regression
• Binary (or ordinal): Generalized Linear Model
– Logistic Regression
– Proportional Odds/Ordinal Logistic
• Time to event:
– Cox Regression
• Distribution of predictors generally not important
Measure well and don’t
throw information away
• Reliable, interpretable
• Use all the information about the variables
of interest
• Don’t create “clinical cutpoints” before
modeling
• Model with ALL the data first, then use
prediction to make decisions about
cutpoints
Dichotomizing for Convenience
Can Destroy a Model
Implausible measurement assumption
[Figure: depression score axis (0 to 44) divided at a cutpoint into “not depressed” and “depressed”, with example points A, B, and C]
Dichotomization, by definition,
reduces power by a minimum of about
30%
http://psych.colorado.edu/~mcclella/MedianSplit/
Dear Project Officer,
In order to facilitate analysis and interpretation, we
have decided to throw away about 30% of our
data. Even though this will waste about 3 or 4
hundred thousand dollars worth of subject
recruitment and testing money, we are confident
that you will understand.
Sincerely,
Dick O. Tomi, PhD
Prof. Richard Obediah Tomi, PhD
Examples from the WCGS Study:
Correlations with CHD Mortality (n = 750)

                           Continuous     Dichotomized at median   Reduction
Variable                   r      r2      r      r2                in r2
Systolic Blood Pressure    .15    .023    .12    .014              -39%
Hostility                  .15    .023    .08    .006              -74%
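The reduction figures follow directly from the tabled r2 values. A minimal Python sketch of that arithmetic (the function name is illustrative and not from the slides):

```python
def pct_change_r2(r2_continuous, r2_dichotomized):
    """Percent change in variance explained after dichotomizing."""
    return (r2_dichotomized - r2_continuous) / r2_continuous * 100

# r2 values as reported in the WCGS table
for name, r2_cont, r2_dich in [("Systolic blood pressure", 0.023, 0.014),
                               ("Hostility", 0.023, 0.006)]:
    print(f"{name}: {pct_change_r2(r2_cont, r2_dich):.0f}%")
```

The calculation uses the rounded r2 values shown on the slide, which is why it reproduces -39% and -74%.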
Dichotomizing does not reduce
measurement error
Gustafson, P. and Le, N.D. (2001). A comparison of continuous
and discrete measurement error: is it wise to dichotomize
imprecise covariates? Submitted. Available at
http://www.stat.ubc.ca/people/gustaf.
Simulation: Dichotomizing
makes matters worse when
measure is unreliable
[Path diagram: X1 -> Y, b1 = .4]
• True model: X1 continuous
• Same model with X1 dichotomized
• Both models with the reliability of X1 manipulated: .65, .75, .85, 1.00
Dichotomization of a variable
measured with error (y = .4x + e)
[Figure: % correct rejections of null hypothesis (50 to 100) versus reliability of x (1.00, 0.85, 0.75, 0.65), with one line for continuous x and one for dichotomized x]
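A simulation of this kind can be sketched in a few lines of Python. This is not the author's code: the sample size (n = 50), the unit error variance, the number of replications, and the use of a normal critical value are all assumptions made for illustration.

```python
import math
import random
import statistics

def pearson_r(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rejects_null(x, y, z_crit=1.96):
    """Test r = 0 with the usual t statistic, using a normal critical value."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return abs(t) > z_crit

def power(reliability, n=50, reps=400, seed=1):
    """Return (power with continuous x, power with median-split x)."""
    rng = random.Random(seed)
    # Error SD that gives the requested reliability for a unit-variance true score
    noise_sd = math.sqrt((1 - reliability) / reliability)
    hits_cont = hits_dich = 0
    for _ in range(reps):
        true_x = [rng.gauss(0, 1) for _ in range(n)]
        y = [0.4 * t + rng.gauss(0, 1) for t in true_x]      # y = .4x + e
        obs_x = [t + rng.gauss(0, noise_sd) for t in true_x]  # x measured with error
        cut = statistics.median(obs_x)
        dich_x = [1.0 if v > cut else 0.0 for v in obs_x]
        hits_cont += rejects_null(obs_x, y)
        hits_dich += rejects_null(dich_x, y)
    return hits_cont / reps, hits_dich / reps

for rel in (1.00, 0.85, 0.75, 0.65):
    cont, dich = power(rel)
    print(f"reliability {rel:.2f}: continuous {cont:.2f}, dichotomized {dich:.2f}")
```

Both versions of x are analyzed on the same simulated samples, so the gap between the two columns reflects the median split alone.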
Dichotomizing will obscure non-linearity
[Figure: Dichotomized at Median (CES-D = 7). Percent with Wall Motion Abnormality (0 to 30) for Not Depressed versus Depressed]
Dichotomizing will obscure non-linearity
[Figure: WMA on at Least 1 Task Using Cubic Spline. Probability of WMA (0.0 to 1.0) versus CES-D Score (0 to 40)]
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
[Path diagram: X1 -> Y with b1 = .4; X2 -> Y with b2 = .0; r12 = correlation between X1 and X2]
• X1 and X2 continuous
• X1 dichotomized
• X1 dichotomized; r12 manipulated: .0, .4, .7
Simulation 2: Dichotomizing a continuous
predictor that is correlated with another
predictor
[Figure: (%) incorrect rejections of X2 = 0 versus correlation between x1 and x2 (0, 0.4, 0.7). With X1 and X2 both continuous the axis runs from 0 to 4.5; with x1 dichotomous and x2 continuous it runs from 0 to 30]
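The inflated false-positive rate for X2 can be reproduced with a small Python sketch. The sample size (n = 200), unit error variance, replication count, and normal critical value are assumptions; the slides do not report the settings actually used.

```python
import math
import random
import statistics

def x2_rejected(x1, x2, y, z_crit=1.96):
    """Fit y ~ x1 + x2 by least squares and test the x2 coefficient against 0."""
    n = len(y)
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    sse = sum((v - b1 * a - b2 * b) ** 2 for a, b, v in zip(c1, c2, cy))
    se_b2 = math.sqrt(sse / (n - 3) * s11 / det)
    return abs(b2 / se_b2) > z_crit

def false_positive_rate(rho, dichotomize, n=200, reps=200, seed=2):
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 correlates rho with x1 but has no effect of its own on y (b2 = 0)
        x2 = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x1]
        y = [0.4 * a + rng.gauss(0, 1) for a in x1]
        if dichotomize:
            cut = statistics.median(x1)
            x1 = [1.0 if a > cut else 0.0 for a in x1]
        hits += x2_rejected(x1, x2, y)
    return hits / reps

for rho in (0.0, 0.4, 0.7):
    print(rho, false_positive_rate(rho, False), false_positive_rate(rho, True))
```

Once x1 is split, x2 carries information about x1 that the split discarded, so it picks up a spurious effect that grows with the correlation.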
Is it ever a good idea to categorize
quantitatively measured variables?
• Yes:
– when the variable is truly categorical
– for descriptive/presentational purposes
– for hypothesis testing, if enough categories
are made.
• However, using many categories can lead to problems of
multiple significance tests and still run the risk of
misclassification
CONCLUSIONS
• Cutting:
– Doesn’t always make measurement sense
– Almost always reduces power
– Can fool you with too much power in some
instances
– Can completely miss important features of the
underlying function
• Modern computing/statistical packages can
“handle” continuous variables
• Want to make good clinical
cutpoints? Model first, cut later.
Model first, cut later
[Figure: Clinical Events and LVEF Change during Mental Stress: 5 Year follow-up. Axis label: Maximum Change in LVEF (%)]
Requirements: Sample Size
• Linear regression
– minimum of N = 50 + 8 per predictor (Green, 1990)
• Logistic Regression
– Minimum of N = 10-15/predictor among
smallest group (Peduzzi et al., 1990a)
• Survival Analysis
– Minimum of N = 10-15/predictor (Peduzzi et al.,
1990b)
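These rules of thumb are simple arithmetic. A short Python sketch, with function names that are illustrative only:

```python
def min_n_linear(n_predictors):
    """Rule quoted on the slide for linear regression: N = 50 + 8 per predictor."""
    return 50 + 8 * n_predictors

def max_predictors(n_smaller_group, per_predictor=10):
    """Predictors supportable at 10 to 15 subjects (or events) per predictor."""
    return n_smaller_group // per_predictor

print(min_n_linear(6))         # 98
print(max_predictors(45))      # 4
print(max_predictors(45, 15))  # 3
```

For logistic and survival models the count that matters is the smaller outcome group (or the number of events), not the total sample size.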
Concept of Simulation
Y = b X + error
[Diagram: k samples drawn from this model, each giving an estimate of b (bs1, bs2, bs3, bs4, ..., bsk-1, bsk); the estimates are then evaluated]
Simulation Example
Y = .4 X + error
[Diagram: the same scheme with true b = .4; the estimates bs1 through bsk are evaluated]
[Figure: histogram of estimated beta under the true model Y = .4*x1 + e. X axis: value of beta for x1 (0.2 to 0.6); y axis: frequency of beta value (0 to 2500)]
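The simulation idea can be sketched in Python as follows. The sample size, number of samples, and unit error variance are assumptions chosen for illustration, not the values behind the histogram.

```python
import random
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_slopes(true_b=0.4, n=100, k=1000, seed=3):
    """Draw k samples from Y = true_b * X + error and estimate b in each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_b * a + rng.gauss(0, 1) for a in x]
        slopes.append(slope(x, y))
    return slopes

slopes = simulate_slopes()
print(round(statistics.fmean(slopes), 3), round(statistics.stdev(slopes), 3))
```

The mean of the estimates should sit near the true value of .4, and their spread shows how much any single sample can miss it.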
All-noise, but good fit
[Figure: density (0 to 16) of R-Square from Full Model (0.0 to 1.0) for n/p ~ 3, n/p ~ 6.6, n/p = 10, and n/p ~ 13.3]
Simulation: number of
events/predictor ratio
Y = .5*x1 + 0*x2 + .2*x3 + 0*x4
-- Where r x1 x4 = .4
-- N/p = 3, 5, 10, 20, 50
Parameter stability and n/p ratio
[Figure: four panels (x1, x2, x3, x4), each showing the density (0 to 8) of the parameter estimate (-2.0 to 2.0) for n/p = 3, 5, 10, 20, and 50]
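A rough Python sketch of this simulation, using the model stated on the earlier slide (Y = .5*x1 + 0*x2 + .2*x3 + 0*x4, with x1 and x4 correlated .4). The unit error variance, the replication count, and the small least-squares helper are assumptions for illustration.

```python
import math
import random
import statistics

def ols(X, y):
    """Least-squares coefficients via the normal equations (intercept first)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k + 1):
                A[r][j] -= f * A[c][j]
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (A[i][k] - tail) / A[i][i]
    return beta

def sd_of_b1(ratio, p=4, reps=200, seed=4):
    """Spread of the x1 estimate across simulated samples of size n = ratio * p."""
    rng = random.Random(seed)
    n = ratio * p
    estimates = []
    for _ in range(reps):
        X, y = [], []
        for _ in range(n):
            x1, x2, x3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
            x4 = 0.4 * x1 + math.sqrt(1 - 0.4 ** 2) * rng.gauss(0, 1)
            X.append((x1, x2, x3, x4))
            y.append(0.5 * x1 + 0.2 * x3 + rng.gauss(0, 1))
        estimates.append(ols(X, y)[1])  # index 0 is the intercept
    return statistics.stdev(estimates)

for ratio in (3, 10, 50):
    print(f"n/p = {ratio}: SD of b1 estimates = {sd_of_b1(ratio):.3f}")
```

The spread of the estimates narrows as n/p grows, which is the pattern the density panels display.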
Peduzzi’s Simulation: number of
events/predictor ratio
P(survival) =a + b1*NYHA + b2*CHF + b3*VES
+b4*DM + b5*STD + b6*HTN + b7*LVC
--Events/p = 2, 5, 10, 15, 20, 25
--% relative bias =
((estimated b - true b) / true b) * 100
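As a worked instance of the relative bias formula, here is a one-function Python sketch. The coefficient values are hypothetical and are not taken from Peduzzi's simulation.

```python
def pct_relative_bias(estimated_b, true_b):
    """Percent relative bias: how far the estimate is from truth, relative to truth."""
    return (estimated_b - true_b) / true_b * 100

# Hypothetical example: a true coefficient of 0.5 estimated as 0.6
print(f"{pct_relative_bias(0.6, 0.5):.1f}%")
```

A bias above 100% means the estimate is more than double the true coefficient.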
Simulation results: number of
events/predictor ratio
[Figure: % Relative Bias (-20 to 50) versus events per variable (2, 5, 10, 15, 20, 25) for NYHA, CHF, VES, DM, STD, HTN, and LVC]
[Figure: Proportion w/ Bias > 100% (0 to 0.7) versus events per variable (2, 5, 10, 15, 20, 25) for NYHA, CHF, VES, DM, STD, HTN, and LVC]
Predictor (covariate) selection
1. Theory, substantive knowledge, prior models
2. Testing for confounding
3. Univariate testing
4. Last (and least), automated methods, aka stepwise and best subset regression
Searching for Confounders
• Fundamental tension between
underfitting and overfitting
• Underfitting = not adjusting for
important confounders
• Overfitting = capitalizing on chance
relations (sample fluctuation)
Covariate selection
• Overfitting has been studied extensively
• “Scariest” study is by Faraway (1992)—
showed that any pre-modeling strategy
cost a df over and above df used later
in modeling.
• Premodeling strategies included:
variable selection, outlier detection,
linearity tests, residual analysis.
Covariate selection
• Therefore, if you transform, select, etc.,
you must include the DF in (i.e.,
penalize for) the “Final Model”
Covariate selection: Univariate
Testing
• Non-Significant tests also cost a DF
• Variables may not behave the same way
in a multivariable model—variable “not
significant” at univariate test may be
very important in the presence of other
variables
Covariate selection
• Despite the convention, testing for
confounding has not been
systematically studied—likely leads to
overadjustment and underestimate of
true effect of variable of interest.
• At the very least, pulling variables in
and out of models inflates the Type I
error rate, sometimes dramatically
SOME of the problems with
stepwise variable selection.
1. It yields R-squared values that are badly biased high
2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution
3. The method yields confidence intervals for effects and predicted values that are falsely narrow (see Altman and Andersen, Stat in Med)
4. It yields P-values that do not have the proper meaning, and the proper correction for them is a very difficult problem
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani, 1996)
6. It has severe problems in the presence of collinearity
7. It is based on methods (e.g. F tests for nested models) that were intended to be used to test pre-specified hypotheses
8. Increasing the sample size doesn't help very much (see Derksen and Keselman)
9. It allows us to not think about the problem
10. It uses a lot of paper
“I now wish I had never written the
stepwise selection code for SAS.”
--Frank Harrell, author of forward and backwards
selection algorithm for SAS PROC REG
Automated Selection:
Derksen and Keselman (1992) Simulation Study
• Studied backward and forward selection
• Some authentic variables and some
noise variables among candidate
variables
• Manipulated correlation among
candidate predictors
• Manipulated sample size
Automated Selection:
Derksen and Keselman (1992) Simulation Study
• “The degree of correlation between candidate
predictors affected the frequency with which the
authentic predictors found their way into the
model.”
• “The greater the number of candidate predictors,
the greater the number of noise variables were
included in the model.”
• “Sample size was of little practical importance in
determining the number of authentic variables
contained in the final model.”
Simulation results: Number of
Noise Variables Included
[Figure: % of samples (0 to 35) by number of noise variables in the final model (0 to 7), for sample sizes 100, 200, 500, 1000, and 10000. 20 candidate predictors; 100 samples]
Simulation results: R-Square From
Noise Variables
[Figure: % of samples (0 to 100) by % variance explained (0, 0-5, 5-10, 10-15, 15-20, 20-25, > 25), for sample sizes 100, 200, 500, 1000, and 10000. 20 candidate predictors; 100 samples]
[Figure: R-Square (0 to 0.3) by samples (deciles), for sample sizes 10,000, 1,000, 500, 200, and 100. 20 candidate predictors; 100 samples]
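The way noise variables slip into a model can be illustrated with a simplified Python sketch. It is a stand-in, not a replication: it screens each of 20 pure-noise candidates one at a time at roughly the .05 level, rather than running the forward and backward selection that Derksen and Keselman studied. The function names and settings are assumptions.

```python
import math
import random
import statistics

def pearson_r(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_noise_selected(n, rng, n_candidates=20, z_crit=1.96):
    """Count noise predictors that pass a univariate significance screen."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    selected = 0
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to y
        r = pearson_r(x, y)
        t = r * math.sqrt((n - 2) / (1 - r * r))
        selected += abs(t) > z_crit
    return selected

def average_noise_selected(n, samples=100, seed=5):
    rng = random.Random(seed)
    counts = [count_noise_selected(n, rng) for _ in range(samples)]
    return statistics.fmean(counts)

for n in (100, 500):
    print(f"n = {n}: average noise variables selected = {average_noise_selected(n):.2f}")
```

With 20 candidates tested at about .05, roughly one noise variable per sample passes regardless of sample size, consistent with the point that a larger n does not protect against this.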
Variable Selection
• Pick variables a priori
• Stick with them
• Penalize appropriately for any data-driven decision about how to model a
variable
Spending DF wisely
• Select variables of most importance
• Use DF to assess non-linearity using
flexible curve approach (more about this
later)
• If not enough N/predictor, combine
covariates using techniques that do not
look at Y in the sample: PCA, FA,
conceptual clustering, collapsing, scoring,
established indexes, propensity scores.
Can use data to determine where
to spend DF
• Use Spearman’s Rho to test “importance”
• Not peeking because we have chosen to
include the term in the model regardless
of relation to Y
• Use more DF for non-linearity
Example-Predict Survival from
age, gender, and fare on Titanic
If you have already decided to
include them (and promise to
keep them in the model) you can
peek at predictors in order to see
where to add complexity
Spearman Test
[Figure: dot chart of adjusted rho^2 (0.0 to 0.25) for each predictor]

        N      df
age     1046   1
fare    1308   1
sex     1309   1
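For readers who want to see what the Spearman statistic involves, here is a minimal Python sketch. It computes the plain squared rank correlation; the adjustment applied in the slide's output is not reproduced, and the data values are hypothetical rather than taken from the Titanic file.

```python
import statistics

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho2(x, y):
    """Squared Spearman correlation: Pearson r squared computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy * sxy / (sxx * syy)

# Hypothetical fares and survival indicators
fare = [7, 8, 13, 26, 31, 80]
survived = [0, 0, 1, 0, 1, 1]
print(round(spearman_rho2(fare, survived), 3))
```

Because it works on ranks, the statistic picks up any monotonic relation, which makes it a reasonable guide to where extra degrees of freedom might be spent.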
Non-linearity using splines
Linear Spline
(piecewise regression)
Y = a + b1(x<10) + b2(10<x<20) + b3(x>20)
[Figure: piecewise linear fit, Y (0 to 2.5) versus X (0 to 25)]
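One common way to code a linear spline is with hinge terms that let the slope change at each knot while keeping the curve continuous. The Python sketch below uses the knots at 10 and 20 from the slide; the coefficients are hypothetical and the parameterization is one of several equivalent choices, not necessarily the one behind the figure.

```python
def linear_spline_basis(x, knots=(10, 20)):
    """Return [x, (x - k1)+, (x - k2)+]: the slope may change at each knot."""
    return [x] + [max(x - k, 0.0) for k in knots]

def predict(x, intercept, coefs, knots=(10, 20)):
    basis = linear_spline_basis(x, knots)
    return intercept + sum(b * f for b, f in zip(coefs, basis))

# Hypothetical coefficients: slope 0.15 below 10, 0.05 from 10 to 20, flat above 20
coefs = [0.15, -0.10, -0.05]
for x in (0, 5, 10, 15, 20, 25):
    print(x, round(predict(x, 0.5, coefs), 3))
```

Each hinge coefficient is the change in slope at its knot, so testing those coefficients is a test of non-linearity.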
Cubic Spline
(non-linear piecewise regression)
[Figure: cubic spline fit with knots marked, Y (0 to 2.5) versus X]
Logistic regression model
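library(rms)  # lrm() and rcs(); assumes a data frame named titanic3 is already loaded
fitfare <- lrm(survived ~ (rcs(fare, 3) + age + sex)^2, data = titanic3, x = TRUE, y = TRUE)
anova(fitfare)  # Wald tests for each factor, its interactions, and nonlinear terms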
Spline with 3 knots
Wald Statistics          Response: survived

Factor                                        Chi-Square  d.f.  P
fare (Factor+Higher Order Factors)                55.1     6    <.0001
  All Interactions                                13.8     4    0.0079
  Nonlinear (Factor+Higher Order Factors)         21.9     3    0.0001
age (Factor+Higher Order Factors)                 22.2     4    0.0002
  All Interactions                                16.7     3    0.0008
sex (Factor+Higher Order Factors)                208.7     4    <.0001
  All Interactions                                20.2     3    0.0002
fare * age (Factor+Higher Order Factors)           8.5     2    0.0142
  Nonlinear                                        8.5     1    0.0036
  Nonlinear Interaction : f(A,B) vs. AB            8.5     1    0.0036
fare * sex (Factor+Higher Order Factors)           6.4     2    0.0401
  Nonlinear                                        1.5     1    0.2153
  Nonlinear Interaction : f(A,B) vs. AB            1.5     1    0.2153
age * sex (Factor+Higher Order Factors)            9.9     1    0.0016
TOTAL NONLINEAR                                   21.9     3    0.0001
TOTAL INTERACTION                                 24.9     5    0.0001
TOTAL NONLINEAR + INTERACTION                     38.3     6    <.0001
TOTAL                                            245.3     9    <.0001
Predictors of Survival on Titanic
[Figure: effect estimates on an axis from 0.50 to 12.00 with 0.95 confidence bars for fare (31:7.9), sex (female:male), and age (39:21). Adjusted to: fare=14 age=28 sex=male]
Fare and Age Interaction
[Figure: Prob. of Survival (0 to 1) as a surface over age (10 to 60) and Fare (0 to 250). Adjusted to: sex=male]
Fare and Gender Interaction
[Figure: Prob. of Survival (0 to 1.0) versus Fare (0 to 300), with separate curves for female and male. Adjusted to: age=28]
Validation
• Apparent
– too optimistic
• Internal
– cross-validation, bootstrap
– honest estimate for model performance
– provides an upper limit to what would be found on external validation
• External validation
– replication with new sample, different circumstances
Validation
• Steyerberg, et al. (1999)
compared validation methods
• Found that split-half was far too
conservative
• Bootstrap was equal or superior to
all other techniques
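Bootstrap
[Diagram: k resamples drawn WITH REPLACEMENT from My Sample, each giving an estimate (1, 2, 3, 4, ..., k-1, k); the estimates are then evaluated]
[Example: a small sample of values (1, 3, 4, 5, 7, 10 as extracted) with several resamples drawn with replacement, so individual values can repeat within a resample]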
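Resampling with replacement can be sketched in Python using only the standard library. The sample values are the ones legible on the slide; the number of resamples and the choice of the mean as the statistic are assumptions for illustration.

```python
import random
import statistics

def bootstrap_means(sample, k=1000, seed=6):
    """Draw k resamples of the same size, with replacement, and return their means."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(k)]

sample = [1, 3, 4, 5, 7, 10]
means = bootstrap_means(sample)
print("sample mean:", statistics.fmean(sample))
print("bootstrap SE of the mean:", round(statistics.stdev(means), 3))
```

The spread of the resampled means estimates the sampling variability of the statistic without collecting any new data.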
Bootstrap Validation

Index       Training   Corrected
Dxy         0.6565     0.646
R2          0.4273     0.407
Intercept   0.0000     -0.011
Slope       1.0000     0.952
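One common way to obtain a corrected index is to subtract the average bootstrap optimism from the apparent (training) value. The slides do not say which variant produced the table, so the Python sketch below is an assumption: it applies the optimism correction to R2 for a one-predictor linear model fitted to simulated data.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (v - my) for a, v in zip(x, y)) / sxx
    return my - b * mx, b

def r_squared(x, y, intercept, slope):
    my = statistics.fmean(y)
    sse = sum((v - intercept - slope * a) ** 2 for a, v in zip(x, y))
    sst = sum((v - my) ** 2 for v in y)
    return 1 - sse / sst

def optimism_corrected_r2(x, y, k=200, seed=7):
    """Return (apparent R2, apparent R2 minus average bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(x, y, *fit_line(x, y))
    optimism = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        bx, by = [x[i] for i in idx], [y[i] for i in idx]
        a, b = fit_line(bx, by)
        # Fit in the resample, then score in the resample and in the original data
        optimism.append(r_squared(bx, by, a, b) - r_squared(x, y, a, b))
    return apparent, apparent - statistics.fmean(optimism)

# Simulated data for illustration only
data_rng = random.Random(8)
x = [data_rng.gauss(0, 1) for _ in range(60)]
y = [0.4 * a + data_rng.gauss(0, 1) for a in x]
apparent, corrected = optimism_corrected_r2(x, y)
print(f"apparent R2 = {apparent:.3f}, corrected R2 = {corrected:.3f}")
```

The gap between the two numbers estimates how much the training value would shrink on new data.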
Summary
• Think about your model
• Collect enough data
• Measure well
• Don’t destroy what you’ve measured
• Pick your variables ahead of time and collect enough data to test the model you want
• Keep all your variables in the model unless extremely unimportant
• Use more df on important variables, fewer df on “nuisance” variables
• Don’t peek at Y to combine, discard, or transform variables
• Estimate validity and shrinkage with bootstrap
• By all means, tinker with the model later, but be aware of the costs of tinkering
• Don’t forget to say you tinkered
• Go collect more data
Web links for references,
software, and more
• Harrell’s regression modeling text
– http://hesweb1.med.virginia.edu/biostat/rms/
• SAS Macros for spline estimation
– http://hesweb1.med.virginia.edu/biostat/SAS/survrisk.txt
• Some results comparing validation methods
– http://hesweb1.med.virginia.edu/biostat/reports/logistic.val.pdf
• SAS code for bootstrap
– ftp://ftp.sas.com/pub/neural/jackboot.sas
• S-Plus home page
– insightful.com
• Mike Babyak’s e-mail
– michael.babyak@duke.edu
• This presentation
– http://www.duke.edu/~mbabyak