Statistics for Health Research
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of session
• Recognise the need for multiple regression
• Understand methods of selecting variables
• Understand strengths and weaknesses of selection methods
• Carry out multiple regression in SPSS and interpret the output
Why do we need multiple regression?
Research is rarely as simple as the effect of one variable on one outcome, especially with observational data.
We need to assess many factors simultaneously to build more realistic models.
Consider the fitted line y = a + b1x1 + b2x2, with dependent variable (y) and explanatory variables (x1, x2).
[3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age]
When to use multiple
regression modelling (1)
Assess relationship between two
variables while adjusting or allowing
for another variable
Sometimes the second variable is
considered a ‘nuisance’ factor
Example: Physical Activity allowing
for age and medications
When to use multiple
regression modelling (2)
In an RCT, whenever there is imbalance between arms of the trial at baseline in the characteristics of subjects
e.g. survival in colorectal cancer on
two different randomised therapies
adjusted for age, gender, stage, and
co-morbidity at baseline
When to use multiple
regression modelling (2)
A special case of this is when
adjusting for baseline level of the
primary outcome in an RCT
The baseline level is added as a factor in the regression model
This will be covered in Trials part of
the course
When to use multiple
regression modelling (3)
With observational data in order to produce a prognostic equation for future prediction of risk of mortality, e.g. prediction of future risk of CHD used 10-year data from the Framingham cohort
When to use multiple
regression modelling (4)
With observational designs in order
to adjust for possible confounders
e.g. survival in colorectal cancer in
those with hypertension adjusted for
age, gender, social deprivation and
co-morbidity
Definition of Confounding
A confounder is a factor which
is related to both the variable
of interest (explanatory) and
the outcome, but is not an
intermediary in a causal
pathway
Example of Confounding
[Diagram: Smoking is related to both Deprivation (exposure) and Lung Cancer (outcome), so it confounds the Deprivation–Lung Cancer association]
But, also worth adjusting for factors only related to outcome
[Diagram: Exercise is related to Lung Cancer (outcome) but not to Deprivation (exposure)]
Not worth adjusting for an intermediate factor in a causal pathway
[Diagram: causal pathway Exercise → Blood viscosity → Stroke]
In a causal pathway each factor is merely a marker of the other factors, i.e. they are correlated (collinearity)
SPSS: Add both baseline LDL and age to the Independent(s) box in Linear Regression
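Equivalent SPSS syntax is sketched below; the variable names min_ldl, base_ldl and age are placeholders, so substitute the names used in your own dataset.

* Linear regression of Min LDL achieved on baseline LDL and age.
REGRESSION
  /STATISTICS COEFF CI R ANOVA COLLIN TOL
  /DEPENDENT min_ldl
  /METHOD=ENTER base_ldl age.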
Output from SPSS linear regression on Age at baseline
Dependent Variable: Min LDL achieved

                  B       Std. Error   Beta    t        Sig.   95% CI for B      Tolerance   VIF
(Constant)        2.024   .105         –       19.340   .000   1.819 to 2.229    –           –
Age at baseline   -.008   .002         -.121   -4.546   .000   -.011 to -.004    1.000       1.000
Output from SPSS linear regression on Baseline LDL
Dependent Variable: Min LDL achieved

                  B       Std. Error   Beta    t        Sig.   95% CI for B
(Constant)        .668    .066         –       10.091   .000   .538 to .798
Baseline LDL      .257    .018         .351    13.950   .000   .221 to .293
Output: Multiple regression

Model Summary (note: R² now improved to 13%)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .360   .130       .129                .6753538
Predictors: (Constant), Age at baseline, Baseline LDL

Coefficients (Dependent Variable: Min LDL achieved)
                  B       Std. Error   Beta    t        Sig.   95% CI for B
(Constant)        1.003   .124         –       8.086    .000   .760 to 1.246
Baseline LDL      .250    .019         .342    13.516   .000   .214 to .286
Age at baseline   -.005   .002         -.081   -3.187   .001   -.008 to -.002

Both variables remain significant INDEPENDENTLY of each other
How do you select which variables to enter the model?
• Usually consider what hypotheses you are testing
• If there is a main ‘exposure’ variable, enter it first and assess confounders one at a time
• For derivation of a clinical prediction rule (CPR) you want powerful predictors
• Also include clinically important factors, e.g. cholesterol in CHD prediction
• Significance is important, but it is acceptable to keep an ‘important’ variable without statistical significance
How do you decide what variables to enter in the model?
Correlations? With great difficulty!
[3-dimensional scatterplot from SPSS of Time from Surgery in relation to Duke’s staging and age]
Approaches to model building
1. Let Scientific or Clinical factors
guide selection
2. Use automatic selection algorithms
3. A mixture of above
1) Let Science or Clinical
factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?
1) Let Science or Clinical
factors guide selection
Results in model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a ‘good’ model?
1) Let Science or Clinical factors
guide selection: Final Model
Note: three variables were entered but are not statistically significant
1) Let Science or Clinical factors
guide selection
Is this the ‘best’ model?
Should I leave out the non-significant factors (Model 2)?
Model   Adj R²   F from ANOVA   No. of parameters, p
1       0.137    37.48          7
2       0.134    72.021         4

Adj R² is lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of ‘information’ – related to Fisher’s ‘sufficient statistics’
Basically we have reality f and a model g to approximate f, so the K-L information is I(f, g)
Kullback-Leibler Information
We want to minimise I(f, g) to obtain the best model over other models
I(f, g) is the information lost, or ‘distance’, between reality and a model, so we need to minimise:

I(f, g) = ∫ f(x) log( f(x) / g(x | θ) ) dx
Akaike’s Information Criterion
It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit:
Akaike’s Information Criterion or AIC
Selection Criteria
• With a large number of factors the type 1 error is large, so you are likely to end up with a model containing many variables
• Two standard criteria:
  1) Akaike’s Information Criterion (AIC)
  2) Schwarz’s Bayesian Information Criterion (BIC)
• Both penalise models with a large number of variables, particularly when the sample size is large
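For reference (these formulas are not on the slide), with log likelihood LL, p parameters and sample size n the two criteria are:

AIC = -2 × LL + 2 × p
BIC = -2 × LL + p × log(n)

so the BIC penalty exceeds the AIC penalty once log(n) > 2 (roughly n > 7), which is why BIC tends to select smaller models in large samples.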
Akaike’s Information Criterion

AIC = -2 × log likelihood + 2 × p

• Where p = number of parameters and the log likelihood is given in the output
• Hence AIC penalises models with a large number of variables
• Select the model that minimises (-2LL + 2p)
Generalized linear models
• Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics
• Need to use Analyze > Generalized Linear Models…
Generalized linear models: default is linear
• Add Min LDL achieved as the dependent variable, as in REGRESSION in SPSS
• Next go to Predictors…
Generalized linear models: Predictors
• WARNING! Make sure you add the predictors in the correct box
• Categorical variables go in the FACTORS box
• Continuous variables go in the COVARIATES box
Generalized linear models: Model
• Add all factors and covariates in the model as main effects
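The dialogue steps above correspond roughly to the GENLIN syntax sketched below. The variable names are placeholders, gender and smoking are assumed to be the categorical factors and the rest covariates, and subcommand details may vary between SPSS versions.

* Normal (linear) model for Min LDL achieved, fitted with GENLIN to obtain the AIC.
GENLIN min_ldl BY gender smoking WITH base_ldl age adherence bmi
  /MODEL gender smoking base_ldl age adherence bmi
    DISTRIBUTION=NORMAL LINK=IDENTITY
  /PRINT FIT SUMMARY SOLUTION.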
Generalized Linear Models
Parameter Estimates
Note identical to REGRESSION output
Generalized Linear Models Goodness-of-fit
Note the output gives the log likelihood and AIC = 2835 (AIC = -2 × -1409.6 + 2 × 7 = 2835)
The footnote explains that a smaller AIC is ‘better’
Let Science or Clinical factors guide selection: ‘Optimal’ model
• The log likelihood is a measure of GOODNESS-OF-FIT
• Seek the ‘optimal’ model that maximises the log likelihood or minimises the AIC

Model                                  Log likelihood   p   AIC
1 Full model                           -1409.6          7   2835.6
2 Non-significant variables removed    -1413.6          4   2837.2

Change in AIC is 1.6
1) Let Science or Clinical factors guide selection
Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, Smoking and BMI
3. AIC only changes by 1.6 when they are removed
4. Generally changes of 4 or more in AIC are considered important
1) Let Science or Clinical factors guide selection
Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower for the larger model, and Gender and BMI are considered important factors, so keep the larger model, but this has to be justified
3. Model building is manual, logical, transparent and under your control
2) Use automatic selection
procedures
These are based on automatic
mechanical algorithms usually related
to statistical significance
Common ones are stepwise, forward
or backward elimination
Can be selected in SPSS using ‘Method’ in the dialogue box
2) Use automatic selection
procedures (e.g. Stepwise)
Select
Method =
Stepwise
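In syntax, stepwise selection is requested with METHOD=STEPWISE. This is a sketch with placeholder variable names; PIN and POUT shown are the default entry and removal p-values.

REGRESSION
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT min_ldl
  /METHOD=STEPWISE base_ldl age gender adherence bmi smoking.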
2) Use automatic selection
procedures (e.g. Stepwise)
[SPSS output: 1st step, 2nd step, and final model of the stepwise selection]
2) Change in AIC with Stepwise selection
Note: only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters, p
1      Baseline LDL   -1423.1          2852.2   -               2
2      + Adherence    -1418.0          2844.1   8.1             3
3      + Age          -1413.6          2837.2   6.9             4
2) Advantages and disadvantages of stepwise
Advantages
• Simple to implement
• Gives a parsimonious model
• Selection is certainly objective
Disadvantages
• Non-stable selection – stepwise considers many models that are very similar
• The p-value on entry may be smaller once the procedure is finished, so p-values are exaggerated
• Predictions in an external dataset are usually worse for stepwise procedures – they tend to add bias
2) Automatic procedures: Backward elimination
Backward elimination starts by removing the least significant factor from the full model and has a few advantages over forward selection:
• The modeller has to consider the ‘full’ model and sees results for all factors simultaneously
• Correlated factors can remain in the model (in forward methods they may not even enter)
• Criteria for removal tend to be more lax in backward selection, so you end up with more parameters
2) Use automatic selection
procedures (e.g. Backward)
Select
Method =
Backward
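The corresponding syntax simply swaps the method keyword (again a sketch with placeholder variable names):

REGRESSION
  /DEPENDENT min_ldl
  /METHOD=BACKWARD base_ldl age gender adherence bmi smoking.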
2) Backward elimination in SPSS
[SPSS output: 1st step – Gender removed; 2nd step – BMI removed; final model]
Summary of automatic selection
• Automatic selection may not give the ‘optimal’ model (it may leave out important factors)
• Different methods may give different results (forward vs. backward elimination)
• Backward elimination is preferred as it is less stringent
• Too easily fitted in SPSS!
• Model assessment still requires some thought
3) A mixture of automatic
procedures and self-selection
• Use automatic procedures as a
guide
• Think about what factors are
important
• Add ‘important’ factors
• Do not blindly follow statistical
significance
• Consider AIC
Summary of Model
selection
• Selection of factors for multiple linear regression models requires some judgement
• Automatic procedures are available but
treat results with caution
• They are easily fitted in SPSS
• Check AIC or log likelihood for fit
Summary
• Multiple regression models are the most used analytical tool in quantitative research
• They are easily fitted in SPSS
• Model assessment requires some thought
• Parsimony is better – Occam’s Razor
• Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD,
Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering
response to statin treatment in diabetes: A Go-DARTS study.
Pharmacogenetics and Genomics, 2008; 18: 279-87.
Remember Occam’s Razor
‘Entia non sunt
multiplicanda
praeter
necessitatem’
‘Entities must not be
multiplied beyond
necessity’
William of Ockham, 14th-century friar and logician (1288–1347)
Practical on Multiple
Regression
Read in ‘LDL Data.sav’
1) Try fitting a multiple regression model for Min LDL achieved using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?