Simple Linear Regression & Correlation

Section VIII: Simple Linear Regression & Correlation
Ex: Riddle, J. of Perinatology (2006) 26, 556–561
50th percentile for birth weight (BW) in g as a function of gestational age:
BW (g) = 42 exp(0.1155 × gest age)
or, equivalently,
loge(BW) = 3.74 + 0.1155 × gest age
In general, BW = A exp(B × gest age), where A and B change for different percentiles.
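A minimal sketch (Python, standard library only) that evaluates the 50th-percentile formula quoted above; the gestational ages in the loop are illustrative, not values from the paper.

```python
import math

def bw_50th_percentile(gest_age_weeks):
    """50th-percentile birth weight (g) from the quoted fit:
    BW = 42 * exp(0.1155 * gestational age)."""
    return 42.0 * math.exp(0.1155 * gest_age_weeks)

# The log form gives the same prediction: log(BW) = 3.74 + 0.1155 * gest age
for ga in (28, 32, 36, 40):
    print(ga, round(bw_50th_percentile(ga)))
```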
Example: Nishio et al., Cardiovascular Revascularization Medicine 7 (2006) 54–60
Simple Linear Regression statistics
Statistics for the association between a continuous X and a continuous Y.
A linear relation is given by the equation
Y = a + bX + error, where error = e = Y − Ŷ
Ŷ = predicted Y = a + bX
a = intercept, b = slope = rate of change
r = correlation coefficient, R² = r²
R² = proportion of Y's variation due to X
SDe = residual SD = RMSE = √(mean square error)
Ex: X=age (yrs) vs Y=SBP (mmHg)
[Scatter plot: SBP (mmHg), 80–220, vs age (yrs), 20–90]
SBP = 81.5 + 1.22 × age + error
SDe = 18.6 mm Hg, r = 0.718, R² = 0.515
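As a check on the quoted fit, here is a minimal sketch (Python; numpy and scipy assumed available) that refits the model to the 33 age/SBP pairs listed in the table further below. It should reproduce the intercept, slope, r, R², and SDe shown here up to rounding.

```python
import numpy as np
from scipy import stats

# Age (yrs) and SBP (mm Hg) for the 33 women in the table below
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40, 41, 41, 46, 47, 48,
                49, 49, 50, 51, 51, 51, 52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147, 139, 171,
                137, 111, 115, 133, 128, 183, 130, 133, 144, 128, 105, 145, 141,
                153, 157, 155, 176, 172, 178, 217])

fit = stats.linregress(age, sbp)            # simple linear regression Y = a + bX
pred = fit.intercept + fit.slope * age      # predicted SBP (Y-hat)
resid = sbp - pred                          # residuals e = Y - Y-hat
sde = np.sqrt(np.sum(resid**2) / (len(age) - 2))   # residual SD (root MSE)

print(f"a = {fit.intercept:.1f}, b = {fit.slope:.2f}")
print(f"r = {fit.rvalue:.3f}, R^2 = {fit.rvalue**2:.3f}, SDe = {sde:.1f}")
```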
“Residual” error
Residual error = e = Y − Ŷ
The sum and mean of the ei's will always be zero. Their standard deviation, SDe, is a measure of how close the observed Y values are to their equation-predicted values (Ŷ). When R² = 1 (r = ±1), SDe = 0.
Age vs SBP in women – Predicted SBP (mmHg) = 81.5 + 1.22 × age, r = 0.72, R² = 0.515
Patient   X    Y   Predicted  Error=e  |  Patient   X    Y   Predicted  Error=e
      1  22  131     108.41     22.59  |       17  49  133     141.41     -8.41
      2  23  128     109.63     18.37  |       18  49  128     141.41    -13.41
      3  24  116     110.85      5.15  |       19  50  183     142.64     40.36
      4  27  106     114.52     -8.52  |       20  51  130     143.86    -13.86
      5  28  114     115.74     -1.74  |       21  51  133     143.86    -10.86
      6  29  123     116.97      6.03  |       22  51  144     143.86      0.14
      7  30  117     118.19     -1.19  |       23  52  128     145.08    -17.08
      8  32  122     120.63      1.37  |       24  54  105     147.53    -42.53
      9  33   99     121.86    -22.86  |       25  56  145     149.97     -4.97
     10  35  121     124.30     -3.30  |       26  57  141     151.19    -10.19
     11  40  147     130.41     16.59  |       27  58  153     152.42      0.58
     12  41  139     131.64      7.36  |       28  59  157     153.64      3.36
     13  41  171     131.64     39.36  |       29  63  155     158.53     -3.53
     14  46  137     137.75     -0.75  |       30  67  176     163.42     12.58
     15  47  111     138.97    -27.97  |       31  71  172     168.31      3.69
     16  48  115     140.19    -25.19  |       32  77  178     175.64      2.36
                                       |       33  81  217     180.53     36.47

          X      Y   Predicted   Error
Mean   46.7  138.6       138.6     0.0
SD     15.5   26.4        18.9    18.3
Mean error is always zero
Confidence intervals (CI)
Prediction intervals (PI)
Model: predicted SBP = Ŷ = 81.5 + 1.22 × age
For age = 50, Ŷ = 81.5 + 1.22(50) = 142.6 mm Hg
95% CI: Ŷ ± 2 SEM;  95% PI: Ŷ ± 2 SDe
SEM = 3.3 mm Hg ↔ 95% CI is (136.0, 149.2)
SDe = 18.6 mm Hg ↔ 95% PI is (104.8, 180.4)
Ŷ = 142.6 is both the predicted mean SBP for age 50 and the predicted value for one individual of age 50; the CI reflects uncertainty about the mean, while the wider PI reflects uncertainty about a single individual's value.
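A minimal sketch of these interval calculations (Python), plugging in the intercept, slope, SEM, and SDe quoted above rather than re-deriving them; small differences from the slide's numbers reflect rounding of the coefficients.

```python
# 95% CI and approximate 95% PI for predicted SBP at age 50,
# using the quoted intercept, slope, SEM, and SDe.
a, b = 81.5, 1.22
sem, sde = 3.3, 18.6    # SE of the predicted mean, residual SD

age = 50
y_hat = a + b * age                      # predicted mean / individual value
ci = (y_hat - 2 * sem, y_hat + 2 * sem)  # 95% CI for the mean SBP at age 50
pi = (y_hat - 2 * sde, y_hat + 2 * sde)  # approximate 95% PI for one individual

print(f"Y-hat = {y_hat:.1f} mm Hg")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI: ({pi[0]:.1f}, {pi[1]:.1f})")
```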
R² interpretation
R² is the proportion of the total (squared) variation in Y that is "accounted for" by X.
R² = r² = (SDy² − SDe²)/SDy² = 1 − (SDe²/SDy²)
Equivalently, SDe = SDy √(1 − r²).
Under Gaussian theory, 95% of the errors are within ±2 SDe of their corresponding predicted Y value, Ŷ.
How big should R² be?
SBP SD = 26.4 mm Hg, SDe = 18.6 mm Hg
95% PI: Ŷ ± 2(18.6), i.e. Ŷ ± 37.2 mm Hg
How big does R² have to be to make the 95% PI Ŷ ± 10 mm Hg? We would need SDe ≈ 5 mm Hg, so
R² = 1 − (SDe/SDy)² = 1 − (5/26.4)² = 1 − 0.036 = 0.964, or 96.4%
(with age alone, R² = 0.515)
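The same back-calculation as a small sketch (Python), using SDe = SDy√(1 − R²) from the previous slide; the ±10 mm Hg target and SDy = 26.4 are the values quoted above.

```python
# Sketch: the R^2 needed so that an approximate 95% PI (Y-hat +/- 2*SDe)
# has a given half-width, using SDe = SDy * sqrt(1 - R^2).
def required_r_squared(pi_half_width, sd_y):
    sde_needed = pi_half_width / 2.0          # since half-width ~= 2 * SDe
    return 1.0 - (sde_needed / sd_y) ** 2

print(required_r_squared(10, 26.4))   # about 0.964 for a +/- 10 mm Hg interval
```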
Correlation interpretation: |r| ≤ 1
Pearson r vs Spearman rs
Pearson r – Assumes the relationship between Y and X is linear except for noise. "Parametric" (inspired by the bivariate normal model). Strongly affected by outliers.
Spearman rs – Based on the ranks of Y and X. Assumes the relation between Y and X is monotone (either non-decreasing or non-increasing). "Nonparametric". Less affected by outliers.
Pearson r vs Spearman rs
BMI vs HbA1c
[Scatter plot: HbA1c (0–6) vs BMI (25–50)]
r = 0.25, rs = 0.48
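A minimal sketch (Python with scipy) of computing both coefficients. The BMI/HbA1c data are not given on the slide, so the values below are simulated, with one outlier added to show how Pearson r and Spearman rs can diverge.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
bmi = rng.uniform(25, 50, size=40)
hba1c = 3.5 + 0.03 * bmi + rng.normal(0, 0.4, size=40)
hba1c[0] = 0.5                      # a single low outlier drags Pearson r down

r_pearson, _ = stats.pearsonr(bmi, hba1c)     # linear (parametric) correlation
r_spearman, _ = stats.spearmanr(bmi, hba1c)   # rank-based (nonparametric) correlation
print(f"Pearson r = {r_pearson:.2f}, Spearman rs = {r_spearman:.2f}")
```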
Slope is related to correlation (simple regression)
Slope = correlation × (SDy/SDx):
b = r (SDy/SDx)
1.22 = 0.7178 (26.4/15.5)
where SDy is the SD of the Y variable and SDx is the SD of the X variable.
Equivalently, r = b (SDx/SDy):
0.7178 = 1.22 (15.5/26.4)
Also, r = b SDx / √(b² SDx² + SDe²)
where SDe is the residual SD and SDx is the SD of the X variable.
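A quick numerical check of these identities (Python), using the rounded summary statistics quoted for the age vs SBP example; the results agree with b ≈ 1.22 and r ≈ 0.72 up to rounding.

```python
import math

# Rounded summary statistics from the age vs SBP slides
r, sd_x, sd_y, sde = 0.7178, 15.5, 26.4, 18.3

b = r * (sd_y / sd_x)                    # b = r * SDy/SDx   -> about 1.22
r_back = b * (sd_x / sd_y)               # r = b * SDx/SDy   -> about 0.72
r_alt = b * sd_x / math.sqrt(b**2 * sd_x**2 + sde**2)   # same r, via SDe

print(round(b, 2), round(r_back, 3), round(r_alt, 3))
```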
Limitations of Linear Statistics
Example of a nonlinear relationship
[Plot: Receptor Binding – percent receptors bound (0–50%) vs ligand concentration (0–7); the relationship is nonlinear]
Pathological Behavior
Ŷ = 3 + 0.5 X, r = 0.817, SDe = 13.75 (the same for all four datasets below), n = 11
Weisberg, Applied Linear Regression, p. 108
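The four datasets themselves are not reproduced on the slide. The quoted summaries (Ŷ = 3 + 0.5X, r ≈ 0.82, n = 11) match Anscombe's classic quartet, used here as a stand-in in a short sketch (Python with scipy) to show that identical regression summaries can hide very different data patterns, so always plot the data.

```python
import numpy as np
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# All four fits give nearly the same intercept, slope, and r
for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"{name}: Y-hat = {fit.intercept:.2f} + {fit.slope:.2f} X, r = {fit.rvalue:.3f}")
```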
Ecologic Fallacy
Truncating X: true r = 0.9, R² = 0.81
[Scatter plots: full data vs data with X truncated]
Interpreting correlation in experiments
Since r=b(SDx/SDy), an artificially lowered SDx will also lower r.
R², b, and SDe when X is systematically changed

Data                                                 R²     b    SDe
Complete data ("truth")                            0.81  0.90   0.43
Truncated (X < −1 SD deleted)                      0.47  1.03   0.43
Center deleted (−1 SD < X < 1 SD deleted)          0.91  0.90   0.45
Extremes deleted (X < −1 SD and X > 1 SD deleted)  0.58  0.92   0.42
Assumes intrinsic relation between X and Y is linear.
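A simulation sketch (Python with numpy/scipy) of the idea behind the table: the true model is taken to be Y = 0.9X + e with X ~ N(0, 1) and residual SD 0.43, parameters inferred from the "complete data" row (the original dataset is not shown). Restricting or reshaping the X range changes R² substantially, while SDe stays about the same and b stays near the true slope apart from sampling noise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 5000)
y = 0.9 * x + rng.normal(0, 0.43, 5000)

def summarize(label, keep):
    fit = stats.linregress(x[keep], y[keep])
    resid = y[keep] - (fit.intercept + fit.slope * x[keep])
    print(f"{label:17s} R2={fit.rvalue**2:.2f}  b={fit.slope:.2f}  SDe={resid.std(ddof=2):.2f}")

summarize("complete", np.full(x.shape, True))
summarize("truncated", x >= -1)                      # X < -1 SD deleted
summarize("center deleted", (x <= -1) | (x >= 1))    # -1 SD < X < 1 SD deleted
summarize("extremes deleted", (x >= -1) & (x <= 1))  # |X| > 1 SD deleted
```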
Attenuation of regression coefficients when there is error in X (true slope β = 4.0)
[Two scatter plots of Y (0–70) vs X: left panel, X measured with negligible error; right panel, noisy X]
Negligible errors in X: Y = 1.149 + 3.959 X, SE(b) = 0.038
Noisy errors in X: Y = −2.132 + 3.487 X, SE(b) = 0.276
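A simulation sketch (Python) of this attenuation effect. The true slope of 4 matches the slide; the sample size and noise levels are illustrative assumptions, not the slide's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x_true = rng.uniform(0, 15, 200)
y = 4.0 * x_true + rng.normal(0, 2, 200)      # true model: Y = 4*X + e

x_clean = x_true + rng.normal(0, 0.1, 200)    # negligible measurement error in X
x_noisy = x_true + rng.normal(0, 3.0, 200)    # substantial measurement error in X

# The estimated slope with noisy X is biased toward zero (attenuation)
for label, x in [("clean X", x_clean), ("noisy X", x_noisy)]:
    fit = stats.linregress(x, y)
    print(f"{label}: slope = {fit.slope:.2f} (SE = {fit.stderr:.3f})")
```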
Mean diffs do not imply correlation
Just because there is a positive mean X change and a positive mean Y change does not necessarily imply that X and Y are correlated.
[Scatter plot: mean change in Y vs mean change in X]
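A small illustrative simulation (Python): both mean changes are clearly positive, yet the changes are generated independently, so their correlation is near zero. The data are made up, not from the slide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
chg_x = 5 + rng.normal(0, 2, 100)   # mean change in X is clearly positive
chg_y = 3 + rng.normal(0, 2, 100)   # mean change in Y is clearly positive, but independent of X

r, p = stats.pearsonr(chg_x, chg_y)
print(f"mean chg X = {chg_x.mean():.1f}, mean chg Y = {chg_y.mean():.1f}, r = {r:.2f}")
```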
Checking for linearity – smoothing & splines
Basic idea: in a plot of Y vs X, also plot Ŷ vs X, where
Ŷi = Σj Wij Yj, with Σj Wij = 1 and Wij > 0.
The "weights" Wij are larger for observations whose Xj is near Xi and smaller for those far away.
Smooth: define a moving "window" of a given width around the ith data point and fit a mean (weighted moving average) within this window.
Spline: break the X axis into non-overlapping bins and fit a polynomial within each bin such that the "ends" all "match".
The size of the window or bins controls the amount of smoothing.
We smooth until we obtain a smooth curve, but go no further.
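A sketch of one simple way to implement the moving-window idea above (Python). In practice a lowess smoother or a spline routine (e.g., from statsmodels or scipy) would normally be used; the data here are made up for illustration.

```python
import numpy as np

def moving_average_smooth(x, y, window):
    """At each x_i, average the y_j whose x_j lie within +/- window/2,
    weighting nearer points more heavily (triangular weights)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_hat = np.empty_like(y)
    half = window / 2.0
    for i, xi in enumerate(x):
        dist = np.abs(x - xi)
        w = np.clip(1.0 - dist / half, 0.0, None)   # weight 1 at xi, 0 at the window edge
        y_hat[i] = np.sum(w * y) / np.sum(w)        # normalized so the weights sum to 1
    return y_hat

# Example: a noisy curved relationship; a wider window gives a smoother curve.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(15, 50, 120))               # BMI-like range, for illustration
y = 1500 * np.exp(-0.05 * (x - 15)) + rng.normal(0, 100, 120)
smooth_narrow = moving_average_smooth(x, y, window=2)   # undersmoothed (too wiggly)
smooth_wide = moving_average_smooth(x, y, window=10)    # smoother fit
```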
Smoothing example: IGFBP by BMI
[Scatter plots of IGFBP (0–2000) vs BMI (15–50): the raw data, an appropriately smoothed curve, an insufficiently smoothed (too wiggly) curve, and an oversmoothed curve]
Check linearity: ANDRO by BMI
[Scatter plots of ANDRO (0–4000) vs BMI (15–50), with a smoothed curve overlaid to check whether the relation is linear]