Section VIII: Simple Linear Regression & Correlation

Example: Riddle, Journal of Perinatology (2006) 26, 556–561.
50th percentile for birth weight (BW) in g as a function of gestational age:
  Birth Wt (g) = 42 exp(0.1155 × gest age)
or equivalently, on the log scale,
  loge(BW) = 3.74 + 0.1155 × gest age
In general, BW = A exp(B × gest age), where A and B change for different percentiles.

Example: Nishio et al., Cardiovascular Revascularization Medicine 7 (2006) 54–60.

Simple linear regression statistics
Statistics for the association between a continuous X and a continuous Y. A linear relation is given by the equation
  Y = a + b X + error, where error = e = Y − Ŷ
  Ŷ = predicted Y = a + b X
  a = intercept, b = slope = rate of change
  r = correlation coefficient, R² = r²
  R² = proportion of Y's variation due to X
  SDe = residual SD = RMSE = √(mean square error)

Example: X = age (yrs) vs Y = SBP (mmHg)
[Figure: scatter plot of SBP (mmHg) vs age (yrs) with fitted line]
  SBP = 81.5 + 1.22 age + error
  SDe = 18.6 mm Hg, r = 0.718, R² = 0.515

"Residual" error
Residual error = e = Y − Ŷ. The sum and mean of the errors are always zero. Their standard deviation, SDe, measures how close the observed Y values are to their equation-predicted values Ŷ. When r = R² = 1, SDe = 0.

Age vs SBP in women: predicted SBP (mmHg) = 81.5 + 1.22 age, r = 0.72, R² = 0.515

Patient  X=age  Y=SBP  Predicted  Error=e
   1      22    131     108.41     22.59
   2      23    128     109.63     18.37
   3      24    116     110.85      5.15
   4      27    106     114.52     -8.52
   5      28    114     115.74     -1.74
   6      29    123     116.97      6.03
   7      30    117     118.19     -1.19
   8      32    122     120.63      1.37
   9      33     99     121.86    -22.86
  10      35    121     124.30     -3.30
  11      40    147     130.41     16.59
  12      41    139     131.64      7.36
  13      41    171     131.64     39.36
  14      46    137     137.75     -0.75
  15      47    111     138.97    -27.97
  16      48    115     140.19    -25.19
  17      49    133     141.41     -8.41
  18      49    128     141.41    -13.41
  19      50    183     142.64     40.36
  20      51    130     143.86    -13.86
  21      51    133     143.86    -10.86
  22      51    144     143.86      0.14
  23      52    128     145.08    -17.08
  24      54    105     147.53    -42.53
  25      56    145     149.97     -4.97
  26      57    141     151.19    -10.19
  27      58    153     152.42      0.58
  28      59    157     153.64      3.36
  29      63    155     158.53     -3.53
  30      67    176     163.42     12.58
  31      71    172     168.31      3.69
  32      77    178     175.64      2.36
  33      81    217     180.53     36.47

          X      Y    Predicted  Error
Mean    46.7  138.6    138.6      0.0
SD      15.5   26.4     18.9     18.3

The mean error is always zero. (The SD of the errors above, 18.3, uses the usual n − 1 divisor; the regression SDe = 18.6 divides by n − 2.)

Confidence intervals (CI) and prediction intervals (PI)
Model: predicted SBP = Ŷ = 81.5 + 1.22 age.
For age = 50, Ŷ = 81.5 + 1.22(50) = 142.6 mm Hg.
  95% CI (for the mean): Ŷ ± 2 SEM; SEM = 3.3 mm Hg, so the 95% CI is (136.0, 149.2).
  95% PI (for an individual): approximately Ŷ ± 2 SDe (more exactly Ŷ ± 2 √(SDe² + SEM²)); SDe = 18.6 mm Hg, giving the 95% PI (104.8, 180.4).
Ŷ = 142.6 is both the predicted mean SBP for women of age 50 and the predicted value for one individual of age 50; the CI describes the mean, the PI describes an individual.

R² interpretation
R² is the proportion of the total (squared) variation in Y that is "accounted for" by X:
  R² = r² = (SDy² − SDe²)/SDy² = 1 − SDe²/SDy²
  SDe = SDy √(1 − r²)
Under Gaussian theory, about 95% of the errors are within ±2 SDe of their corresponding predicted value Ŷ.

How big should R² be?
SBP SD = SDy = 26.4 mm Hg and SDe = 18.6 mm Hg, so the 95% PI is Ŷ ± 2(18.6), i.e. Ŷ ± 37.2 mm Hg.
How big would R² have to be to make the 95% PI equal to Ŷ ± 10 mm Hg? That requires SDe ≈ 5 mm Hg, so
  R² = 1 − (SDe/SDy)² = 1 − (5/26.4)² = 1 − 0.036 = 0.964, or 96.4%
(with age alone, R² = 0.515).
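The quantities above can be recomputed directly. Below is a minimal Python sketch (not part of the original lecture; it assumes only numpy) that refits the regression from the age/SBP table and forms the age-50 intervals; up to rounding it should reproduce a ≈ 81.5, b ≈ 1.22, r ≈ 0.72, SDe ≈ 18.6, and the CI/PI quoted above.

```python
import numpy as np

# Age (yrs) and SBP (mmHg) transcribed from the 33-patient table above.
age = np.array([22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40, 41, 41, 46, 47, 48,
                49, 49, 50, 51, 51, 51, 52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81])
sbp = np.array([131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147, 139, 171,
                137, 111, 115, 133, 128, 183, 130, 133, 144, 128, 105, 145, 141,
                153, 157, 155, 176, 172, 178, 217])

n = len(age)
r = np.corrcoef(age, sbp)[0, 1]               # Pearson correlation
b = r * sbp.std(ddof=1) / age.std(ddof=1)     # slope: b = r * SDy / SDx
a = sbp.mean() - b * age.mean()               # intercept
yhat = a + b * age                            # predicted values
e = sbp - yhat                                # residuals (their mean is ~0)
sde = np.sqrt(np.sum(e**2) / (n - 2))         # residual SD = RMSE
r2 = r**2                                     # = 1 - SDe^2/SDy^2 (approximately)

# Intervals at age 50: the CI describes the mean, the PI one individual.
x0 = 50
yhat0 = a + b * x0
sem = sde * np.sqrt(1/n + (x0 - age.mean())**2 / np.sum((age - age.mean())**2))
se_pred = np.sqrt(sde**2 + sem**2)            # SE for a single new observation
ci = (yhat0 - 2*sem, yhat0 + 2*sem)           # ~95% CI for mean SBP at age 50
pi = (yhat0 - 2*se_pred, yhat0 + 2*se_pred)   # ~95% PI for one woman aged 50

print(f"a={a:.1f}  b={b:.2f}  r={r:.3f}  R^2={r2:.3f}  SDe={sde:.1f}")
print(f"age 50: Yhat={yhat0:.1f}  95% CI={ci}  95% PI={pi}")
```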
Correlation: interpretation (|r| ≤ 1)

Pearson vs Spearman correlation
Pearson r: assumes the relationship between Y and X is linear except for noise; "parametric" (inspired by the bivariate normal model); strongly affected by outliers.
Spearman rs: based on the ranks of Y and X; assumes only that the relation between Y and X is monotone (non-increasing or non-decreasing); "nonparametric"; less affected by outliers.

Pearson r vs Spearman rs: BMI vs HbA1c
[Figure: scatter plot of HbA1c vs BMI]
  r = 0.25, rs = 0.48

The slope is related to the correlation (simple regression)
  slope = correlation × (SDy/SDx), i.e. b = r (SDy/SDx): b = 1.22 = 0.7178 (26.4/15.5),
where SDy is the SD of the Y variable and SDx is the SD of the X variable. Equivalently,
  r = b (SDx/SDy): 0.7178 = 1.22 (15.5/26.4)
  r = b SDx / √(b² SDx² + SDe²),
where SDe is the residual SD.

Limitations of linear statistics: a nonlinear relationship
[Figure: receptor binding; percent of receptors bound vs ligand concentration]

Pathological behavior
Ŷ = 3 + 0.5 X, r = 0.817, residual sum of squares = 13.75 for each of the four datasets (n = 11 each).
(Weisberg, Applied Linear Regression, p. 108)
[Figure: four scatter plots sharing the same fitted line and summary statistics]

Ecologic fallacy
[Figure]

Truncating X: true r = 0.9, R² = 0.81
[Figure: full data vs data with part of the X range removed]

Interpreting correlation in experiments
Since r = b (SDx/SDy), an artificially lowered SDx will also lower r.

R², b and SDe when X is systematically changed

Data                                                 R²     b      SDe
Complete data ("truth")                              0.81   0.90   0.43
Truncated (X < −1 SD deleted)                        0.47   1.03   0.43
Center deleted (−1 SD < X < +1 SD deleted)           0.91   0.90   0.45
Extremes deleted (X < −1 SD and X > +1 SD deleted)   0.58   0.92   0.42

This assumes the intrinsic relation between X and Y is linear.

Attenuation of regression coefficients when there is error in X (true slope β = 4.0)
[Figure: Y vs X and Y vs noisy X]
  Negligible error in X: Y = 1.149 + 3.959 X, SE(b) = 0.038
  Noisy error in X:      Y = −2.132 + 3.487 X, SE(b) = 0.276

Mean differences do not imply correlation
Just because there is a positive mean change in X and a positive mean change in Y does not necessarily imply that X and Y are correlated.
[Figure: mean change in X vs mean change in Y]

Checking for linearity: smoothing and splines
Basic idea: in a plot of Y vs X, also plot the smoothed curve Ŷ(x) = Σᵢ Wᵢ(x) Yᵢ, where Σᵢ Wᵢ(x) = 1 and Wᵢ(x) ≥ 0. The weights Wᵢ(x) are larger for observations whose Xᵢ is near x and smaller for those far away.
Smooth: define a moving "window" of a given width around each data point and fit a mean (weighted moving average) within the window.
Spline: break the X axis into non-overlapping bins and fit a polynomial within each bin such that the ends all "match" (join smoothly).
The size of the window or of the bins controls the amount of smoothing: smooth until the curve looks smooth, but no further.

Smoothing example: IGFBP by BMI
[Figure: scatter of IGFBP vs BMI with insufficient smoothing, appropriate smoothing, and over-smoothing]

Checking linearity: ANDRO by BMI
[Figure: scatter of ANDRO vs BMI with smooth curve]
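The Pearson/Spearman contrast and the moving-window smooth described above can be illustrated with a short Python sketch. The data here are simulated (a hypothetical monotone but curved X-Y relation, not the lecture's BMI/HbA1c or IGFBP data), scipy is assumed for the correlation functions, and window_smooth is a toy helper written for this example, not a library routine.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(25, 50, 200)                    # a BMI-like predictor (simulated)
y = np.exp(0.15 * x + rng.normal(0, 0.3, 200))  # monotone, curved response with noise

r, _ = pearsonr(x, y)    # assumes a linear relation; pulled down by the curvature
rs, _ = spearmanr(x, y)  # rank-based; only assumes a monotone relation
print(f"Pearson r = {r:.2f}, Spearman rs = {rs:.2f}")

def window_smooth(x, y, width):
    """Moving-window mean: for each x value, average the y's whose x lies
    within +/- width/2 of it (equal weights; a wider window smooths more)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    yhat = np.array([ys[np.abs(xs - xi) <= width / 2].mean() for xi in xs])
    return xs, yhat

# Too narrow a window under-smooths, too wide over-smooths; overlaying
# (xs, smooth) on the raw scatter shows whether a straight line is adequate.
xs, smooth = window_smooth(x, y, width=5.0)
```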