Tests for Continuous Outcomes II

Overview of common statistical tests

Are the observations independent or correlated?

| Outcome variable | Independent observations | Correlated observations | Assumptions |
|---|---|---|---|
| Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples). Outcome and predictor have a linear relationship. |
| Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5). |
| Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups. |
Continuous outcome (means)

Are the observations independent or correlated?

| Outcome variable | Independent | Correlated | Alternatives if the normality assumption is violated (and small sample size) |
|---|---|---|---|
| Continuous (e.g. pain scale, cognitive function) | Ttest: compares means between two independent groups. ANOVA: compares means between more than two independent groups. Pearson's correlation coefficient (linear correlation): shows linear correlation between two continuous variables. Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes. | Paired ttest: compares means between two related groups (e.g., the same subjects before and after). Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements). Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time. | Non-parametric statistics. Wilcoxon sign-rank test: non-parametric alternative to the paired ttest. Wilcoxon sum-rank test (=Mann-Whitney U test): non-parametric alternative to the ttest. Kruskal-Wallis test: non-parametric alternative to ANOVA. Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient. |
Divalproex vs. placebo for treating bipolar depression

Davis et al. "Divalproex in the treatment of bipolar depression: A placebo controlled study." J Affective Disorders 85 (2005): 259-266.
Repeated-measures ANOVA

Statistical question: Do subjects in the treatment group have greater reductions in depression scores over time than those in the control group?

- What is the outcome variable? Depression score
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? Yes, there are multiple measurements on each person
- How many time points are being compared? >2

→ repeated-measures ANOVA
Repeated-measures ANOVA

- For before-and-after studies, a paired ttest will suffice.
- For more than two time periods, you need repeated-measures ANOVA.
- Running serial paired ttests instead is incorrect, because this strategy inflates your type I error (see the sketch below).
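To see why, here is a minimal simulation in Python (all settings invented for illustration): under a null of no change across four time points, testing every pair of time points with paired ttests rejects at least one comparison far more often than the nominal 5%.

```python
# Sketch only: simulate the type I error inflation from serial paired ttests.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(6)
reps, n, t_points = 2000, 30, 4
any_sig = 0
for _ in range(reps):
    data = rng.normal(size=(n, t_points))      # null: no true change over time
    ps = [stats.ttest_rel(data[:, i], data[:, j]).pvalue
          for i, j in combinations(range(t_points), 2)]
    any_sig += min(ps) < 0.05                  # "significant" anywhere?
print(any_sig / reps)                          # well above the nominal 0.05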
Repeated-measures ANOVA

Answers the following questions, taking into account the correlation within subjects:

- Are there significant differences across time periods?
- Are there significant differences between groups (=your categorical predictor)?
- Are there significant differences between groups in their changes over time?
Two groups (e.g., treatment and placebo)

| id | group | time1 | time2 | time3 | time4 |
|----|-------|-------|-------|-------|-------|
| 1  | A     | 31    | 29    | 15    | 26    |
| 2  | A     | 24    | 28    | 20    | 32    |
| 3  | A     | 14    | 20    | 28    | 30    |
| 4  | B     | 38    | 34    | 30    | 34    |
| 5  | B     | 25    | 29    | 25    | 29    |
| 6  | B     | 30    | 28    | 16    | 34    |

Hypothetical data: measurements of depression scores over time in treatment (A) and placebo (B).
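Below is a minimal sketch of analyzing these hypothetical data in Python. It is illustrative only: statsmodels' repeated-measures ANOVA (AnovaRM) handles within-subject factors but not between-subject factors like treatment group, so the group, time, and group x time questions are fit here as a mixed model, the roadmap's regression analogue for correlated measurements.

```python
# Sketch only: the toy depression data from the table above, analyzed with
# a mixed model (random intercept per subject) rather than classic rANOVA.
import pandas as pd
import statsmodels.formula.api as smf

wide = pd.DataFrame({
    "id":    [1, 2, 3, 4, 5, 6],
    "group": ["A", "A", "A", "B", "B", "B"],
    "time1": [31, 24, 14, 38, 25, 30],
    "time2": [29, 28, 20, 34, 29, 28],
    "time3": [15, 20, 28, 30, 25, 16],
    "time4": [26, 32, 30, 34, 29, 34],
})

# Reshape to long format: one row per subject per time point.
long = wide.melt(id_vars=["id", "group"], var_name="time", value_name="score")

# group * time expands to group, time, and group x time effects; the random
# intercept on id accounts for the correlation of measurements within a person.
fit = smf.mixedlm("score ~ group * time", data=long, groups="id").fit()
print(fit.summary())
```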
[Figure: profile plots (individual trajectories) and mean plots over time, by group (A and B).]

Repeated-measures ANOVA tells you if and how these two profile plots differ…
Possible questions…

- Overall, are there significant differences between time points? From the plots: looks like some differences (time3 and time4 look different).
- Do the two groups differ at any time points? From the plots: certainly at baseline; some difference everywhere.
- Do the two groups differ in their responses over time? From the plots: their response profiles look similar over time, though A and B are closer by the end.
…repeated-measures ANOVA…

- Overall, are there significant differences between time points? → Time factor
- Do the two groups differ at any time points? → Group factor
- Do the two groups differ in their responses over time? → Group x time factor
From the rANOVA analysis…

- Overall, are there significant differences between time points? No; Time is not statistically significant (p=.1743).
- Do the two groups differ at any time points? No; Group is not statistically significant (p=.1408).
- Do the two groups differ in their responses over time? No, not even close; Group*Time p>.60.
rANOVA: possible result patterns

- Time is significant; Group*time is significant; Group is not significant.
- Time is not significant; Group*time is not significant; Group IS significant.
- Time is significant; Group is not significant; Time*group is not significant.
Homeopathy vs. placebo in treating pain after surgery

[Figure: mean pain assessments by visual analogue scales (VAS) on the day of surgery and days 1-7 after surgery (morning and evening). p>.05 for Group x Time by rANOVA.]

Lokken, P. et al. BMJ 1995;310:1439-1442. Copyright ©1995 BMJ Publishing Group Ltd.
Pint of milk vs. control on bone acquisition in adolescent females

[Figure: mean (SE) percentage increases in total body bone mineral and bone density over 18 months. P values are for the differences between groups by repeated-measures analysis of variance.]

Cadogan, J. et al. BMJ 1997;315:1255-1260. Copyright ©1997 BMJ Publishing Group Ltd.
Counseling vs. control on smoking in pregnancy

[Figure: P<.05 by rANOVA.]

Hovell, M. F. et al. BMJ 2000;321:337-342. Copyright ©2000 BMJ Publishing Group Ltd.
Review Question 1

In a study of depression, I measured depression score (a continuous, normally distributed variable) at baseline, 1 month, 6 months, and 12 months. What statistical test will best tell me whether or not depression improved between baseline and the end of the study?

a. Repeated-measures ANOVA.
b. One-way ANOVA.
c. Two-sample ttest.
d. Paired ttest.
e. Wilcoxon sum-rank test.
Review Question 2

In the same depression study, what statistical test will best tell me whether or not two treatments for depression had different effects over time?

a. Repeated-measures ANOVA.
b. One-way ANOVA.
c. Two-sample ttest.
d. Paired ttest.
e. Wilcoxon sum-rank test.
Continuous outcome (means)

(Roadmap table repeated; see above.)
Example: class data

[Figure: Political Leanings and Rating of Obama; r=0.39148, p=.07.]

Example: class data

[Figure: Political Leanings and Rating of Health Care Law; r=−0.00768, p=.97.]

Example 2: pain and injection pressure

[Figure: r=.75, p<.0001.]
Correlation coefficient

Statistical question: Is injection pressure related to pain?

- What is the outcome variable? VAS pain score
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? No
- Are groups being compared? No—the independent variable is also continuous

→ correlation coefficient
New concept: covariance

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
Interpreting covariance

Covariance between two random variables:

- cov(X,Y) > 0: X and Y tend to move in the same direction.
- cov(X,Y) < 0: X and Y tend to move in opposite directions.
- cov(X,Y) = 0: X and Y have no linear relationship (note: zero covariance does not by itself imply independence).
Correlation coefficient

Pearson's correlation coefficient is standardized covariance (unitless):

$$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$$
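As a quick illustration (all numbers invented), the covariance and correlation formulas above can be computed directly and checked against numpy's built-ins:

```python
# Hand computation of cov(X,Y) and Pearson's r, checked against numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # formula above
r = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))          # standardized covariance

print(cov_xy, r)
# Library equivalents for comparison:
print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```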
Correlation

- Measures the relative strength of the linear relationship between two variables.
- Unitless.
- Ranges between −1 and 1:
  - The closer to −1, the stronger the negative linear relationship.
  - The closer to +1, the stronger the positive linear relationship.
  - The closer to 0, the weaker any linear relationship.
Scatter plots of data with various correlation coefficients

[Figure: scatter plots of Y vs. X illustrating r = −1, r = −.6, r = 0, r = +.3, and r = +1.]

** Next 4 slides from "Statistics for Managers" 4th Edition, Prentice-Hall 2004
Linear correlation

[Figure: scatter plots contrasting linear relationships with curvilinear relationships.]

Linear correlation

[Figure: scatter plots contrasting strong relationships with weak relationships.]

Linear correlation

[Figure: scatter plots showing no relationship.]
Recall: correlation coefficient (large n)

Hypothesis test:

$$Z = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n}}}$$

Confidence interval:

$$\text{observed } r \pm Z_{\alpha/2}\sqrt{\frac{1 - r^2}{n}}$$
Correlation coefficient (small n)

Hypothesis test:

$$T_{n-2} = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Confidence interval:

$$\text{observed } r \pm t_{n-2,\,\alpha/2}\sqrt{\frac{1 - r^2}{n - 2}}$$
Review Problem 3

What's a good guess for the Pearson's correlation coefficient (r) for this scatter plot?

[Figure: scatter plot.]

a. −1.0
b. +1.0
c. 0
d. −.5
e. −.1
Continuous outcome (means)

(Roadmap table repeated; see above.)
Example: class data

[Figure: Political Leanings and Rating of Obama, with fitted line.] Expected Obama Rating = 50.5 + 0.28*politics.
Example 2: pain and injection pressure

[Figure: scatter plot with fitted line.] R-squared = the correlation coefficient squared. Meaning: the percent of variance in Y that is "explained by" X.
Simple linear regression

Statistical question: Does injection pressure "predict" pain?

- What is the outcome variable? VAS pain score
- What type of variable is it? Continuous
- Is it normally distributed? Yes
- Are the observations correlated? No
- Are groups being compared? No—the independent variable is also continuous

→ simple linear regression
Linear regression

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).
What is "Linear"?

Remember this: Y = mX + B, where m is the slope and B is the intercept.

What's slope?

A slope of 0.28 means that every 1-unit change in X yields a 0.28-unit change in Y.
Simple linear regression

The linear regression model: Ratings of Obama = 50.5 + 0.28*(political bent), where 50.5 is the intercept (the predicted value at x=0, not shown on the graph) and 0.28 is the slope.
Simple linear regression

Wake-up time versus exercise: Expected Wake-up Time = 8:06 − 0:11*Hours of exercise/week. Every additional hour of weekly exercise costs you about 11 minutes of sleep in the morning (p=.0015).
The linear regression model…

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

The fixed part, $\alpha + \beta x_i$, falls exactly on the line; the random error $\varepsilon_i$ follows a normal distribution.
Assumptions (or the fine print)

Linear regression assumes that…

1. The relationship between X and Y is linear.
2. Y is distributed normally at each value of X.
3. The variance of Y at every value of X is the same (homogeneity of variances).
4. The observations are independent.

The standard error of Y given X ($S_{y/x}$) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

[Figure: equal spread ($S_{y/x}$) around the regression line at every value of X.]
Recall example: cognitive function and vitamin D

- Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men.
- Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.
Distribution of vitamin D: mean = 63 nmol/L; standard deviation = 33 nmol/L.

Distribution of DSST: normally distributed; mean = 28 points; standard deviation = 10 points.
Four hypothetical datasets

I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):

- 0 points per 10 nmol/L (Dataset 1: no relationship)
- 0.5 points per 10 nmol/L (Dataset 2: weak relationship)
- 1.0 points per 10 nmol/L (Dataset 3: weak to moderate relationship)
- 1.5 points per 10 nmol/L (Dataset 4: moderate relationship)
The "best fit" lines

- Dataset 1: E(Yi) = 28 + 0*vit Di (in 10 nmol/L)
- Dataset 2: E(Yi) = 26 + 0.5*vit Di (in 10 nmol/L). Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is!
- Dataset 3: E(Yi) = 22 + 1.0*vit Di (in 10 nmol/L)
- Dataset 4: E(Yi) = 20 + 1.5*vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!
Estimating the intercept and slope: least squares estimation

A little calculus…

What are we trying to estimate? β, the slope. What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (the "residuals", or left-over unexplained variability):

$$\text{difference}_i = y_i - (\alpha + \beta x_i)$$
$$\text{difference}_i^2 = \left(y_i - (\alpha + \beta x_i)\right)^2$$

Find the β that gives the minimum sum of the squared differences. How do you find a minimum? Take the derivative, set it equal to zero, and solve—a typical max/min problem from calculus:

$$\frac{d}{d\beta}\sum_{i=1}^{n}\left(y_i - (\alpha + \beta x_i)\right)^2 = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)\,x_i = 0$$

From here it takes a little algebra to solve for β…
Resulting formulas…

Slope (beta coefficient):

$$\hat{\beta} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$

Intercept:

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

The regression line always goes through the point $(\bar{x}, \bar{y})$.
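A tiny numerical check of these formulas (all numbers made up): the closed-form slope and intercept match numpy's built-in least-squares fit.

```python
# Closed-form least-squares estimates vs. numpy's polyfit.
import numpy as np

x = np.array([10., 20., 30., 40., 50., 60.])
y = np.array([25., 24., 30., 28., 33., 35.])

beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # Cov(x,y) / Var(x)
alpha_hat = y.mean() - beta_hat * x.mean()          # line goes through (x̄, ȳ)

slope, intercept = np.polyfit(x, y, deg=1)          # library fit
print(beta_hat, alpha_hat)                          # should match...
print(slope, intercept)                             # ...these
```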
Relationship with correlation

$$\hat{r} = \hat{\beta}\,\frac{SD_x}{SD_y}$$

In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y).

Example: dataset 4

- SDx = 33 nmol/L; SDy = 10 points; Cov(X,Y) = 163 points*nmol/L
- Beta = 163/(33²) = 0.15 points per nmol/L = 1.5 points per 10 nmol/L
- r = 163/(10*33) = 0.49, or r = 0.15*(33/10) = 0.49
Significance testing…

Slope: the distribution of $\hat{\beta}$ is $T_{n-2}\left(\beta,\ s.e.(\hat{\beta})\right)$.

- H0: β1 = 0 (no linear relationship)
- H1: β1 ≠ 0 (a linear relationship does exist)

$$T_{n-2} = \frac{\hat{\beta} - 0}{s.e.(\hat{\beta})}$$

Formula for the standard error of beta (you will not have to calculate this by hand!):

$$s_{\hat{\beta}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 / (n-2)}{SS_x}} = \sqrt{\frac{s_{y/x}^2}{SS_x}}$$

where $SS_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\hat{y}_i = \hat{\alpha} + \hat{\beta}x_i$.
Example: dataset 4

- Standard error (beta) = 0.03
- T98 = 0.15/0.03 = 5, p<.0001
- 95% confidence interval: 0.09 to 0.21
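Here is a sketch reproducing this kind of slope test on simulated data (true slope 1.5 points per 10 nmol/L as in dataset 4; the noise level is an arbitrary choice). scipy's linregress returns the slope, its standard error, and the $T_{n-2}$ p-value:

```python
# Slope significance test on simulated "dataset 4"-style data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
vit_d = rng.normal(6.3, 3.3, size=100)            # in units of 10 nmol/L
dsst = 20 + 1.5 * vit_d + rng.normal(0, 9, 100)   # noise level arbitrary

res = stats.linregress(vit_d, dsst)
t = res.slope / res.stderr                        # T_{n-2} from the formula above
print(res.slope, res.stderr, t, res.pvalue)

# 95% CI for the slope: slope ± t_{98, .025} * s.e.
half = stats.t.ppf(0.975, df=98) * res.stderr
print(res.slope - half, res.slope + half)
```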
Review Problem 4

Researchers fit a regression equation to predict baby weights from weeks of gestation: E(Y|X) = 100 grams/week * X weeks. What is the expected weight of a baby born at 22 weeks?

a. 2000g
b. 2100g
c. 2200g
d. 2300g
e. 2400g
Review Problem 5

The model predicts that:

a. All babies born at 22 weeks will weigh 2200 grams.
b. Babies born at 22 weeks will have a mean weight of 2200 grams, with some variation.
c. Both of the above.
d. None of the above.
Residual analysis: check assumptions

$$e_i = Y_i - \hat{Y}_i$$

- The residual for observation i, $e_i$, is the difference between its observed and predicted value.
- Check the assumptions of regression by examining the residuals:
  - Examine for the linearity assumption
  - Examine for constant variance at all levels of X (homoscedasticity)
  - Evaluate the normal distribution assumption
  - Evaluate the independence assumption

Graphical analysis of residuals: you can plot residuals vs. X.
Predicted values…

$$\hat{y}_i = 20 + 1.5\,x_i$$

For vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L): $\hat{y}_i = 20 + 1.5(9.5) \approx 34$.

Residual = observed − predicted. At X = 95 nmol/L: $y_i = 48$, $\hat{y}_i = 34$, so $y_i - \hat{y}_i = 14$.
Residual analysis for linearity

[Figure: residual plots (residuals vs. x) contrasting a non-linear pattern with a linear one.]

Residual analysis for homoscedasticity

[Figure: residual plots contrasting non-constant variance with constant variance.]

Residual analysis for independence

[Figure: residual plots contrasting non-independent with independent residuals.]

Slides from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004 Prentice-Hall
Residual plot, dataset 4
Review Problem 6

A medical journal article reported the following linear regression equation: Cholesterol = 150 + 2*(age past 40). Based on this model, what is the expected cholesterol for a 60-year-old?

a. 150
b. 370
c. 230
d. 190
e. 200
Review Problem 7

If a particular 60-year-old in your study sample had a cholesterol of 250, what is his/her residual?

a. +50
b. -50
c. +60
d. -60
e. 0
Multiple linear regression…

What if age is a confounder here?

- Older men have lower vitamin D.
- Older men have poorer cognition.

"Adjust" for age by putting age in the model:

DSST score = intercept + slope1*vitamin D + slope2*age
2 predictors: age and vit D…

[Figure: 3D views of DSST vs. age and vitamin D.] With two predictors, we fit a plane rather than a line. On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.
Equation of the "best fit" plane…

DSST score = 53 + 0.0039*vitamin D (in 10 nmol/L) − 0.46*age (in years)

- P-value for vitamin D: >>.05
- P-value for age: <.0001

Thus, the relationship with vitamin D was due to confounding by age!
Multiple linear regression

More than one predictor:

$$E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z \ldots$$

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
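The age-confounding story above can be mimicked in a few lines. All numbers below are invented for illustration: age is built to drive both vitamin D and DSST, so the crude vitamin D slope is sizable but shrinks toward zero once age enters the model.

```python
# Sketch: "adjusting" for a simulated confounder (age) in multiple regression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
age = rng.uniform(40, 80, size=100)
vit_d = 110 - 0.7 * age + rng.normal(0, 15, 100)   # older men: lower vitamin D
dsst = 55 - 0.45 * age + rng.normal(0, 6, 100)     # older men: poorer cognition

df = pd.DataFrame({"dsst": dsst, "vit_d": vit_d, "age": age})
crude = smf.ols("dsst ~ vit_d", data=df).fit()
adjusted = smf.ols("dsst ~ vit_d + age", data=df).fit()

# The crude slope reflects confounding; the adjusted slope is near zero.
print(crude.params["vit_d"], adjusted.params["vit_d"])
```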
Review Problem 8

A medical journal article reported the following linear regression equation: Cholesterol = 150 + 2*(age past 40) + 10*(gender: 1=male, 0=female). Based on this model, what is the expected cholesterol for a 60-year-old man?

a. 150
b. 370
c. 230
d. 190
e. 200
A ttest is linear regression!

Divide vitamin D into two groups:

- Insufficient vitamin D (<50 nmol/L)
- Sufficient vitamin D (>=50 nmol/L), reference group

We can evaluate these data with a ttest or a linear regression…

$$T_{98} = \frac{40 - 32.5}{\sqrt{\dfrac{10.8^2}{54} + \dfrac{10.8^2}{46}}} = \frac{7.5}{2.17} = 3.46;\quad p = .0008$$
As a linear regression…

| Variable  | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|--------------------|----------------|---------|------------|
| Intercept | 40.07407           | 1.47511        | 27.17   | <.0001     |
| insuff    | -7.53060           | 2.17493        | -3.46   | 0.0008     |

The intercept represents the mean value in the sufficient group. The slope represents the difference in means between the groups; the difference is significant.
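A sketch of this equivalence on simulated data (group sizes, means, and SD loosely echoing the example above; all values invented): the two-sample ttest and a regression on a 0/1 dummy return the same t statistic and p-value.

```python
# Sketch: a two-group ttest and OLS on a group dummy are the same test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(3)
sufficient = rng.normal(40, 10.8, size=54)
insufficient = rng.normal(32.5, 10.8, size=46)

t, p = stats.ttest_ind(insufficient, sufficient)   # classic ttest (equal variances)

df = pd.DataFrame({
    "dsst": np.concatenate([sufficient, insufficient]),
    "insuff": [0] * 54 + [1] * 46,                 # 0/1 dummy for group
})
fit = smf.ols("dsst ~ insuff", data=df).fit()
print(t, p)
print(fit.tvalues["insuff"], fit.pvalues["insuff"])  # identical
```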
ANOVA is linear regression!

Divide vitamin D into three groups:

- Deficient (<25 nmol/L)
- Insufficient (>=25 and <50 nmol/L)
- Sufficient (>=50 nmol/L), reference group

DSST = α (=value for sufficient) + β_insufficient*(1 if insufficient) + β_deficient*(1 if deficient)

This is called "dummy coding"—multiple binary variables are created to represent being in each category (or not) of a categorical variable.

The picture…

[Figure: group means with the contrasts sufficient vs. insufficient and sufficient vs. deficient.]
Results…

Parameter estimates:

| Variable     | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|--------------|----|--------------------|----------------|---------|------------|
| Intercept    | 1  | 40.07407           | 1.47817        | 27.11   | <.0001     |
| deficient    | 1  | -9.87407           | 3.73950        | -2.64   | 0.0096     |
| insufficient | 1  | -6.87963           | 2.33719        | -2.94   | 0.0041     |

Interpretation:

- The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group.
- The insufficient group has a mean DSST 6.88 points lower than the reference (sufficient) group.
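Here is a minimal sketch of the dummy coding above in Python (simulated data; group sizes and means are invented). The Treatment(reference=...) contrast creates the two 0/1 indicators, and each coefficient estimates a difference in means from the sufficient group, mirroring the output above.

```python
# Sketch: a three-level category entered as two dummies, "sufficient" = reference.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
groups = np.repeat(["sufficient", "insufficient", "deficient"], [54, 26, 20])
means = {"sufficient": 40, "insufficient": 33, "deficient": 30}  # invented
dsst = [rng.normal(means[g], 10) for g in groups]

df = pd.DataFrame({"dsst": dsst, "vitd_group": groups})

# Each coefficient = difference in mean DSST from the sufficient group.
fit = smf.ols(
    "dsst ~ C(vitd_group, Treatment(reference='sufficient'))", data=df
).fit()
print(fit.params)
```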
Functions of multivariate analysis:

- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions

Other types of multivariate regression

- Multiple linear regression is for normally distributed outcomes.
- Logistic regression is for binary outcomes.
- Cox proportional hazards regression is used when time-to-event is the outcome.
Common multivariate regression models.

| Outcome (dependent variable) | Example outcome variable | Appropriate multivariate regression model | Example equation | What do the coefficients give you? |
|---|---|---|---|---|
| Continuous | Blood pressure | Linear regression | blood pressure (mmHg) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0) | slopes—tell you how much the outcome variable increases for every 1-unit increase in each predictor |
| Binary | High blood pressure (yes/no) | Logistic regression | ln(odds of high blood pressure) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0) | odds ratios—tell you how much the odds of the outcome increase for every 1-unit increase in each predictor |
| Time-to-event | Time-to-death | Cox regression | ln(rate of death) = α + β_salt*salt consumption (tsp/day) + β_age*age (years) + β_smoker*ever smoker (yes=1/no=0) | hazard ratios—tell you how much the rate of the outcome increases for every 1-unit increase in each predictor |
Multivariate regression pitfalls

- Multicollinearity
- Residual confounding
- Overfitting
Multicollinearity

- Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
- Model building and diagnostics are tricky business!
Residual confounding

- You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless the variables are measured with zero error (which is usually impossible).
- If measurement error is high, residual confounding can produce statistically significant effect sizes of moderate magnitude.
Residual confounding: example

Hypothetical example: In a case-control study of lung cancer, researchers identified a link between alcohol drinking and cancer in smokers only. The OR was 1.3 for 1-2 drinks per day (compared with none) and 1.5 for 3+ drinks per day. Though the authors adjusted for number of cigarettes smoked per day in multivariate (logistic) regression, we cannot rule out residual confounding by level of smoking (which may be tightly linked to alcohol drinking).
Overfitting

- In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model.
- The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.
Overfitting: class data example

I asked SAS to automatically find predictors of optimism in our class dataset. Here's the resulting linear regression model:

| Variable  | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F |
|-----------|--------------------|----------------|------------|---------|--------|
| Intercept | 11.80175           | 2.98341        | 11.96067   | 15.65   | 0.0019 |
| exercise  | -0.29106           | 0.09798        | 6.74569    | 8.83    | 0.0117 |
| sleep     | -1.91592           | 0.39494        | 17.98818   | 23.53   | 0.0004 |
| obama     | 1.73993            | 0.24352        | 39.01944   | 51.05   | <.0001 |
| Clinton   | -0.83128           | 0.17066        | 18.13489   | 23.73   | 0.0004 |
| mathLove  | 0.45653            | 0.10668        | 13.99925   | 18.32   | 0.0011 |

Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!), and high ratings for Obama and high love of math are positively related to optimism (highly significant!).
If something seems too good to be true…

Clinton, univariate:

| Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|----|--------------------|----------------|---------|------------|
| Intercept | 1  | 5.43688            | 2.13476        | 2.55    | 0.0188     |
| Clinton   | 1  | 0.24973            | 0.27111        | 0.92    | 0.3675     |

Sleep, univariate:

| Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|----|--------------------|----------------|---------|------------|
| Intercept | 1  | 8.30817            | 4.36984        | 1.90    | 0.0711     |
| sleep     | 1  | -0.14484           | 0.65451        | -0.22   | 0.8270     |

Exercise, univariate:

| Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|----|--------------------|----------------|---------|------------|
| Intercept | 1  | 6.65189            | 0.89153        | 7.46    | <.0001     |
| exercise  | 1  | 0.19161            | 0.20709        | 0.93    | 0.3658     |
More univariate models…

Obama, univariate:

| Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|----|--------------------|----------------|---------|------------|
| Intercept | 1  | 0.82107            | 2.43137        | 0.34    | 0.7389     |
| obama     | 1  | 0.87276            | 0.31973        | 2.73    | 0.0126     |

(Compare with the multivariate result: p<.0001.)

Love of math, univariate:

| Variable  | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|-----------|----|--------------------|----------------|---------|------------|
| Intercept | 1  | 3.70270            | 1.25302        | 2.96    | 0.0076     |
| mathLove  | 1  | 0.59459            | 0.19225        | 3.09    | 0.0055     |

(Compare with the multivariate result: p=.0011.)
Overfitting

Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model.

Pure noise variables still produce good R² values if the model is overfitted. [Figure: the distribution of R² values from a series of simulated regression models containing only noise variables. Figure 1 from: Babyak, MA. What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine 66:411-421 (2004).]
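In the same spirit as that figure, here is a short simulation (all settings arbitrary: n=20, seven pure-noise predictors) showing that R² is routinely far from zero when a small sample is overfitted.

```python
# Sketch: average R^2 from regressions of noise on noise with n=20, k=7.
import numpy as np

rng = np.random.default_rng(5)
n, k, reps = 20, 7, 2000
r2 = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # noise + intercept
    y = rng.normal(size=n)                                      # noise outcome
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2.append(1 - resid.var() / y.var())
print(np.mean(r2))   # roughly k/(n-1) ~ 0.37, despite zero true signal
```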
Overfitting example, class data…

Predictors of exercise hours per week (multivariate model):

| Variable    | Beta      | p-value |
|-------------|-----------|---------|
| Intercept   | -14.74660 | 0.0257  |
| Coffee      | 0.23441   | 0.0004  |
| wakeup      | -0.51383  | 0.0715  |
| engSAT      | -0.01025  | 0.0168  |
| mathSAT     | 0.03064   | 0.0005  |
| writingLove | 0.88753   | <.0001  |
| sleep       | 0.37459   | 0.0490  |

R-Square = 0.8192. N=20, with 7 parameters in the model!
Univariate models…

| Variable    | Beta        | p-value |
|-------------|-------------|---------|
| Coffee      | 0.05916     | 0.3990  |
| Wakeup      | -0.06587    | 0.8648  |
| MathSAT     | -0.00021368 | 0.9731  |
| EngSAT      | -0.01019    | 0.1265  |
| Sleep       | -0.41185    | 0.4522  |
| WritingLove | 0.38961     | 0.0279  |