Linear Regression
Hein Stigum
Presentation, data and programs at:
http://folk.uio.no/heins/courses
Apr-15
H.S.
1
Linear regression
CONCEPTS
Outcome and regression types
• Numerical data
– Discrete
• number of partners: Poisson regression
– Continuous
• weight: Linear regression
• Categorical data
– Nominal
• disease/no disease: Logistic regression
– Ordinal
• small/medium/large: Ordinal regression
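Each outcome type above maps to a Stata estimation command. A minimal sketch on simulated data; every variable name and number here is an assumption made for illustration, not course data:

```stata
* Simulated outcomes of each type, all driven by one covariate x (hypothetical)
clear
set seed 1
set obs 300
generate x        = rnormal()
generate partners = rpoisson(exp(0.5 + 0.2*x))          // discrete count
generate weight   = 3500 + 100*x + rnormal(0, 400)      // continuous
generate disease  = runiform() < invlogit(-1 + 0.5*x)   // nominal, 0/1
generate size     = 1 + (weight > 3300) + (weight > 3700)   // ordinal, 1/2/3

poisson  partners x     // Poisson regression
regress  weight   x     // linear regression
logistic disease  x     // logistic regression (odds ratios)
ologit   size     x     // ordinal logistic regression
```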
Regression idea
[Figure: scatter plot of the outcome (y-axis, 2500 to 5000) against gestational age in days (x-axis, 250 to 310)]
model: $y = b_0 + b_1 x + e$
y = outcome
x = covariate
$b_0$ = constant
$b_1$ = coefficient, effect of x
e = error, residual
model with many covariates: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
$x_1, x_2$ = covariates
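The model can be tried out on simulated data where the true coefficients are known. A minimal sketch in Stata; the values chosen for b0, b1 and the error spread are assumptions for illustration only:

```stata
* Simulate y = b0 + b1*x + e, then recover b0 and b1 (all numbers are made up)
clear
set seed 12345
set obs 500
generate gest   = rnormal(280, 10)     // x = covariate, in days
generate e      = rnormal(0, 400)      // e = error
generate weight = 1900 + 6*gest + e    // y, with b0 = 1900 and b1 = 6

regress weight gest                              // _cons estimates b0, gest estimates b1
twoway (scatter weight gest) (lfit weight gest)  // data with the fitted line
```

The estimates should land near 1900 and 6, with the gap reflecting the error term.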
Measures and Assumptions
$\text{weight} = b_0 + b_1 \cdot \text{gest} + b_2 \cdot \text{sex} + e$
• Adjusted effects
– b1 is the increase in weight per day of gestational age
– b1 is adjusted for sex
• Assumptions
– Independent errors
– Linear effects
– Constant error variance
• Robustness
– Influence
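Why b1 is the increase per day can be read off the model directly. Comparing two children of the same sex whose gestational age differs by one day, every other term cancels:

```latex
\begin{aligned}
E(\text{weight} \mid \text{gest}+1, \text{sex}) - E(\text{weight} \mid \text{gest}, \text{sex})
  &= \bigl(b_0 + b_1(\text{gest}+1) + b_2\,\text{sex}\bigr) - \bigl(b_0 + b_1\,\text{gest} + b_2\,\text{sex}\bigr) \\
  &= b_1
\end{aligned}
```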
Workflow
• DAG
• Plots: distribution and scatter
• Bivariate analysis
• Regression
– Model estimation
– Test of assumptions
• Independent errors (discuss)
• Linear effects (plot)
• Constant error variance (plot)
– Robustness
• Influence (plot)
Continuous outcome: Linear regression, Birth weight
ANALYSIS
DAGs
[Figure: DAG with exposure E = gest age, outcome D = birth weight, and covariates C1 = sex and C2 = parity]
Associations: bivariate (unadjusted) analysis
Causal effects: multivariable (adjusted) analysis
Draw your assumptions before your conclusions
Plot outcome by exposure
Effects on linear regression:
OK
Be clear on the research question:
overall birth weight: linear regression
low birth weight: logistic regression
linear and logistic can give opposite results
May lead to non-constant error variance
May have highly influential outliers
Plot outcome by exposure, cont.
Linear effects?
Yes
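The plots in this step can be produced with standard Stata graph commands. A minimal sketch on simulated data (variable names and numbers are assumptions); the lowess smoother is one common way to judge linearity by eye, not necessarily what was used for the slides:

```stata
* Distribution of the outcome and outcome by exposure (simulated data)
clear
set seed 42
set obs 500
generate gest   = rnormal(280, 10)
generate weight = 3500 + 6*(gest - 280) + rnormal(0, 400)

histogram weight, normal                            // distribution with normal curve overlaid
twoway (scatter weight gest) (lowess weight gest)   // smoother follows the data; roughly straight suggests a linear effect
```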
Bivariate analysis
Outcome: birthweight

                      N     Mean   p-value
All                  564    3604
Gestational age                    <0.001
  <=280 days         230    3436
  >280 days          288    3744
Sex                                0.004
  Boy                291    3668
  Girl               273    3535
Parity                             <0.001
  0                  225    3485
  1                  215    3677
  2                  123    3695
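A bivariate table like this compares group means of the outcome one covariate at a time. The slides do not say which tests gave the p-values; the sketch below assumes a t-test for two groups and one-way ANOVA for three, run on simulated data with made-up coding and numbers:

```stata
* Unadjusted comparison of mean birth weight across groups (simulated data)
clear
set seed 7
set obs 500
generate sex    = 1 + (runiform() < 0.5)     // 1 = boy, 2 = girl (assumed coding)
generate parity = floor(3*runiform())        // 0, 1 or 2
generate weight = 3600 - 130*(sex == 2) + 150*(parity > 0) + rnormal(0, 450)

ttest  weight, by(sex)              // two groups: N, mean and p-value
oneway weight parity, tabulate      // three groups: N and mean per group, overall p-value
```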
Continuous outcome: Linear regression, Birth weight
REGRESSION
Categorical covariates
• 2 categories
– OK, but know the coding
• 3+ categories
– Use “dummies”
• “Dummies” are 0/1 variables used to create contrasts
• Want 3 categories for parity: 0, 1 and 2-7 children
• Choose 0 as reference
• Make dummies for the two other categories
generate Parity1 = (parity==1) if parity<.
generate Parity2_7 = (parity>=2) if parity<.
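The "if parity<." part matters because Stata treats a missing value as larger than any number, so (parity>=2) alone is true for missing parity. A small sketch on five typed-in values (the data and the variable unguarded are made up for illustration):

```stata
* Dummy coding with and without the missing-value guard
clear
input parity
0
1
2
5
.
end
generate Parity1   = (parity==1) if parity<.
generate Parity2_7 = (parity>=2) if parity<.
generate unguarded = (parity>=2)      // no "if parity<."
list
```

In the last row, where parity is missing, Parity1 and Parity2_7 stay missing while unguarded is 1, which would wrongly place that child in the 2-7 group.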
Model estimation
Syntax:
regress weight gest sex Parity1 Parity2_7
Create meaningful constant
Expected birth weight: $E(y) = \beta_0 + \beta_1 \cdot \text{gest} + \beta_2 \cdot \text{sex} + \beta_3 \cdot \text{Parity1} + \beta_4 \cdot \text{Parity2\_7}$
Expected birth weight at:
gest=0, sex=0, parity=0: $\beta_0 = 1972$ gr
gest=280, sex=1, parity=0: $\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1 = 3524$ gr
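The two constants are linked through the coefficients. A short check, assuming the estimates for gestational age (6.04 per day) and sex (-139.2) reported elsewhere in this presentation:

```latex
\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1
  = 1972 + 6.04 \cdot 280 - 139.2
  = 1972 + 1691.2 - 139.2
  = 3524 \text{ gr}
```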
Alternative: center variables
gen gest280=gest-280
gest280 has a meaningful zero at 280 days
gen sex0=sex-1
sex0 has a meaningful zero at boys
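Centering changes only the constant, not the slopes. A minimal sketch on simulated data (coding of sex and all numbers are assumptions) showing that the constant of the centered model equals the prediction at gest=280, sex=1 from the uncentered model:

```stata
* Centering gives the constant a meaningful reference point (simulated data)
clear
set seed 99
set obs 500
generate gest   = rnormal(280, 10)
generate sex    = 1 + (runiform() < 0.5)      // 1 = boy, 2 = girl (assumed coding)
generate weight = 3500 + 6*(gest - 280) - 140*(sex - 1) + rnormal(0, 400)

regress weight gest sex
lincom _cons + 280*gest + 1*sex      // expected weight at gest=280, sex=1

gen gest280 = gest - 280
gen sex0    = sex - 1
regress weight gest280 sex0          // _cons now matches the lincom estimate; slopes unchanged
```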
Model results
                       coeff    95% conf. int.
Birth weight at ref    3524.3
Gestational age
  per day              6.0      (3.9 , 8.2)
Sex
  Boy                  0
  Girl                 -139.2   (-228.9 , -49.5)
Parity
  0                    0
  1                    232.0    (130.6 , 333.5)
  2-7                  226.0    (106.9 , 345)
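The coefficients combine into a prediction by adding the relevant terms to the reference value. A worked example for a girl with parity 1 born at 290 days, assuming the reference is a boy with parity 0 born at 280 days (the centered coding):

```latex
E(y) = 3524.3 + 6.0 \cdot (290 - 280) - 139.2 + 232.0 = 3677.1 \text{ gr}
```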
Test of assumptions
• Discuss
– Independent residuals?
• Plot residuals versus predicted y
– Linear effects?
– Constant variance?
[Figure: residuals (y-axis, -1000 to 1500) plotted against the linear prediction (x-axis, 3200 to 4000); outlier not included]
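The residual plot can be built from two predict calls after the regression. A minimal sketch on simulated data; the variable names p and res and all numbers are assumptions:

```stata
* Residuals versus predicted values after a linear regression (simulated data)
clear
set seed 2015
set obs 500
generate gest   = rnormal(280, 10)
generate sex    = 1 + (runiform() < 0.5)
generate weight = 3500 + 6*(gest - 280) - 140*(sex - 1) + rnormal(0, 400)

regress weight gest sex
predict p, xb              // linear prediction
predict res, residuals     // residuals
scatter res p, yline(0)    // curvature suggests non-linearity; a fan shape suggests non-constant variance
```

The built-in rvfplot, run right after regress, draws the same kind of plot in one step.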
Violations of assumptions
• Dependent residuals
– Use linear mixed models
• Non-linear effects
– Add square term
– Or use piecewise linear
• Non-constant variance
– Use robust variance estimation
[Figure: two plots, one against gest (x-axis, 200 to 300) and one of residuals res against predicted values p (x-axis, 3400 to 3800)]
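Each remedy corresponds to a small change in the regression command. A minimal sketch on simulated data; the knot at 280 days and all variable names are assumptions for illustration:

```stata
* Remedies for non-linear effects and non-constant variance (simulated data)
clear
set seed 31
set obs 500
generate gest   = rnormal(280, 10)
generate weight = 3500 + 6*(gest - 280) + rnormal(0, 400)

* Non-linear effects: add a square term
regress weight c.gest c.gest#c.gest

* Non-linear effects: piecewise linear, separate slopes below and above 280 days
mkspline gest_lo 280 gest_hi = gest
regress weight gest_lo gest_hi

* Non-constant variance: robust variance estimation
regress weight gest, vce(robust)
```

For dependent residuals, Stata's mixed command fits linear mixed models; it needs a variable identifying the clusters, which this simulated data does not contain.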
Influence
[Figure: scatter plot against gestational age (x-axis, 200 to 700; y-axis, 2000 to 6000) showing one outlier and two fitted lines: regression with outlier and regression without outlier]
Measures of influence
• Remove obs 1, see change
• Remove obs 2, see change
[Figure: change (y-axis, -.6 to .2) plotted by observation Id (1, 2, ..., 10)]
• Measure change in:
– Predicted outcome
– Deviance
– Coefficients (beta)
• Delta beta
Delta beta for gestational age
[Figure: delta beta for gestational age (y-axis, -10 to 0) plotted against weight (x-axis, 2000 to 6000); observation 539 is labelled]
beta for gestational age = 6.04
If obs nr 539 is removed, beta will change from 6 to 16
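The idea of removing one observation and watching the coefficient can be reproduced directly. A minimal sketch on simulated data with one planted outlier (all values are assumptions). Note that Stata's dfbeta measure is scaled by the standard error, which may differ from the scale used in the plot above:

```stata
* Influence of a single observation on the slope (simulated data)
clear
set seed 539
set obs 200
generate id     = _n
generate gest   = rnormal(280, 10)
generate weight = 3500 + 17*(gest - 280) + rnormal(0, 400)
replace gest = 650 in 200                // one implausible value (hypothetical outlier)

regress weight gest
scalar b_full = _b[gest]
predict dbeta, dfbeta(gest)              // standardized delta beta for gest
scatter dbeta weight, mlabel(id)         // influential observations stand apart

regress weight gest if id != 200         // refit without the outlier
display "beta with outlier: " b_full "   without: " _b[gest]
```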
Removing outlier
Full data
                       coeff    95% conf. int.
Birth weight at ref    3524
Gestational age
  per day              6        (4 , 8)
Sex
  Boy                  0
  Girl                 -139     (-229 , -49)
Parity
  0                    0
  1                    232      (131 , 333)
  2-7                  226      (107 , 345)

Outlier removed (final model)
                       coeff    95% conf. int.
Birth weight at ref    3531
Gestational age
  per day              17       (13 , 20)
Sex
  Boy                  0
  Girl                 -166     (-252 , -80)
Parity
  0                    0
  1                    229      (132 , 326)
  2-7                  225      (112 , 339)

One outlier affected two estimates
Summing up
• DAGs
– Guide analysis
• Plots
– Unequal variance, non-linearity, outliers
• Bivariate analysis
• Linear regression
– Fit model
– Check assumptions
– Check robustness
– Make meaningful constant