Chapter 11 Linear Regression and Correlation Linear Regression and Correlation • Explanatory and Response Variables are Numeric • Relationship between the mean of the response variable and the level of the explanatory variable assumed to be approximately linear (straight line) • Model: Y 0 1 x ~ N (0, ) • 1 > 0 Positive Association • 1 < 0 Negative Association • 1 = 0 No Association 2 e Least Squares Estimation of 0, 1 0 Mean response when x=0 (y-intercept) 1 Change in mean response when x increases by 1 unit (slope) • 0, 1 are unknown parameters (like m) • 0+1x Mean response when explanatory variable takes on the value x • Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of observed values to the straight-line: ^ ^ ^ y 0 1 x n SSE i 1 yi y i i 1 yi 0 1 xi n ^ 2 ^ ^ 2 Example - Pharmacodynamics of LSD • Response (y) - Math score (mean among 5 volunteers) • Predictor (x) - LSD tissue concentration (mean of 5 volunteers) • Raw Data and scatterplot of Score vs LSD concentration: 80 70 60 LSD Conc (x) 1.17 2.97 3.26 4.69 5.83 6.00 6.41 50 40 SCORE Score (y) 78.93 58.20 67.47 37.47 45.65 32.92 29.97 30 20 1 2 LSD_CONC Source: Wagner, et al (1968) 3 4 5 6 7 Least Squares Computations Parameter Estimates Summary Calculations x x y y y y S xx x x S xy S yy 2 2 2 SSE S yy S xy S xx ^ 1 x x y y S S x x xy 2 ^ xx ^ 0 y 1 x 2 yi y i SSE 2 i 1 se n2 n2 n ^ Example - Pharmacodynamics of LSD Score (y) 78.93 58.20 67.47 37.47 45.65 32.92 29.97 350.61 LSD Conc (x) 1.17 2.97 3.26 4.69 5.83 6.00 6.41 30.33 x-xbar -3.163 -1.363 -1.073 0.357 1.497 1.667 2.077 -0.001 y-ybar 28.843 8.113 17.383 -12.617 -4.437 -17.167 -20.117 0.001 Sxx 10.004569 1.857769 1.151329 0.127449 2.241009 2.778889 4.313929 22.474943 Sxy -91.230409 -11.058019 -18.651959 -4.504269 -6.642189 -28.617389 -41.783009 -202.487243 Syy 831.918649 65.820769 302.168689 159.188689 19.686969 294.705889 404.693689 2078.183343 (Column totals given in bottom row of table) 350.61 30.33 y 50.087 x 4.333 7 7 ^ ^ ^ 202.4872 1 9.01 0 y 1 x 50.09 (9.01)( 4.33) 89.10 22.4749 ^ y 89.10 9.01x se2 50.72 SPSS Output and Plot of Equation Coefficientsa Model 1 Uns tandardized Coefficients B Std. Error 89.124 7.048 -9.009 1.503 (Cons tant) LSD_CONC Standardized Coefficients Beta -.937 t 12.646 -5.994 a. Dependent Variable: SCORE Math Score vs LSD Concentration (SPSS) 80.00 Linear Regression 70.00 60.00 score 50.00 40.00 30.00 1.00 2.00 score = 89.12 + -9.01 * lsd_conc R-Square4.00 = 0.88 5.00 3.00 6.00 lsd_conc Sig. .000 .002 Inference Concerning the Slope (1) • Parameter: Slope in the population model (1) ^ • Estimator: Least squares estimate: 1 • Estimated standard error: SE ^ se 1 S xx 1 • Methods of making inference regarding population: – Hypothesis tests (2-sided or 1-sided) – Confidence Intervals Hypothesis Test for 1 • 2-Sided Test – H0: 1 = 0 – HA: 1 0 • 1-sided Test – H0: 1 = 0 – HA+: 1 > 0 or – HA-: 1 < 0 ^ 1 T .S . : tobs se 1 S xx 1 T .S . : tobs se 1 S xx R.R. : | tobs | t / 2,n 2 R.R. : tobs t ,n 2 P value : 2 P(t | tobs |) P val : P(t tobs ) P val : P(t tobs ) ^ R.R. : tobs t ,n 2 (1-)100% Confidence Interval for 1 ^ ^ 1 t / 2 SE 1 t / 2 se ^ 1 1 S xx • Conclude positive association if entire interval above 0 • Conclude negative association if entire interval below 0 • Cannot conclude an association if interval contains 0 • Conclusion based on interval is same as 2-sided hypothesis test Example - Pharmacodynamics of LSD ^ n 7 1 9.01 se 50.72 7.12 S xx 22.475 SE ^ 1 1 7.12 1.50 22.475 • Testing H0: 1 = 0 vs HA: 1 0 9.01 T .S . : tobs 6.01 1.50 R.R. :| tobs | t.025,5 2.571 • 95% Confidence Interval for 1 : 9.01 2.571(1.50) 9.01 3.86 (12.87,5.15) Confidence Interval for Mean When x=x* • Mean Response at a specific level x* is E ( y | x*) m y 0 1 x * • Estimated Mean response and standard error (replacing unknown 0 and 1 with estimates): ^ ^ ^ m y 0 1 x* SE ^ se m 1 x * x n S xx 2 • Confidence Interval for Mean Response: ^ m y t / 2,n2SE ^ ^ m m y t / 2,n 2 se 1 x * x n S xx 2 Prediction Interval of Future Response @ x=x* • Response at a specific level x* is y x* m y 0 1 x * • Estimated response and standard error (replacing unknown 0 and 1 with estimates): ^ ^ ^ y 0 1 x* 1 x * x SE ^ se 1 y n S xx 2 • Prediction Interval for Future Response: ^ y t / 2,n 2SE ^ y 1 x * x y t / 2,n 2 se 1 n S xx ^ 2 Correlation Coefficient • Measures the strength of the linear association between two variables • Takes on the same sign as the slope estimate from the linear regression • Not effected by linear transformations of y or x • Does not distinguish between dependent and independent variable (e.g. height and weight) • Population Parameter: ryx • Pearson’s Correlation Coefficient: ryx S xy S xx S yy 1 r 1 Correlation Coefficient • Values close to 1 in absolute value strong linear • • • • association, positive or negative from sign Values close to 0 imply little or no association If data contain outliers (are non-normal), Spearman’s coefficient of correlation can be computed based on the ranks of the x and y values Test of H0:ryx = 0 is equivalent to test of H0:1=0 Coefficient of Determination (ryx2) - Proportion of variation in y “explained” by the regression on x: r (ryx ) 2 yx 2 S yy SSE S yy SS (Total ) SS (Residual ) SS (Total ) 0 r2 1 Example - Pharmacodynamics of LSD S xx 22.475 S xy 202.487 S yy 2078.183 SSE 253.89 202.487 0.94 (22.475)( 2078.183) ryx 2078.183 253.89 r 0.88 (0.94) 2 2078.183 2 yx Syy 80.00 Mean 70.00 80.00 60.00 Linear Regression 70.00 score score SSE 50.00 Mean = 50.09 40.00 60.00 50.00 40.00 1.00 30.00 30.00 2.00 3.00 4.00 lsd_conc 5.00 6.00 1.00 score = 89.12 + -9.01 * lsd_conc 2.00 R-Square 3.00 = 0.88 4.00 5.00 6.00 lsd_conc Example - SPSS Output Pearson’s and Spearman’s Measures l a O C S P 1 7 * S 2 . N 7 7 L P 7 1 * S 2 . N 7 7 * C l a O C S S C 0 9 * S 3 . N 7 7 L C 9 0 * S 3 . N 7 7 * C Hypothesis Test for ryx • 2-Sided Test – H0: ryx = 0 – HA: ryx 0 T .S . : tobs ryx • 1-sided Test – H0: ryx = 0 – HA+: ryx > 0 or – HA-: ryx < 0 n2 1 ryx2 T .S . : tobs ryx n2 1 ryx2 R.R. : | tobs | t / 2,n 2 R.R. : tobs t ,n 2 P value : 2 P(t | tobs |) P val : P(t tobs ) P val : P(t tobs ) R.R. : tobs t ,n 2 Analysis of Variance in Regression • Goal: Partition the total variation in y into variation “explained” by x and random variation ^ ^ ( yi y ) ( yi y i ) ( y i y ) 2 ^ ^ ( y y) ( y y ) ( y y) 2 i i i 2 i • These three sums of squares and degrees of freedom are: •Total (TSS) DFT = n-1 • Error (SSE) DFE = n-2 • Model (SSR) DFR = 1 Analysis of Variance for Regression Source of Variation Model Error Total Sum of Squares SSR SSE TSS Degrees of Freedom 1 n-2 n-1 • Analysis of Variance - F-test • H0: 1 = 0 T .S . : Fobs R.R. : Fobs HA: 1 0 MSR MSE F ,1, n 2 P value : P ( F Fobs ) Mean Square MSR = SSR/1 MSE = SSE/(n-2) F F = MSR/MSE Example - Pharmacodynamics of LSD • Total Sum of squares: TSS ( yi y) 2 2078.183 DFT 7 1 6 • Error Sum of squares: ^ SSE ( yi y i ) 2 253.890 DFE 7 2 5 • Model Sum of Squares: ^ SSR ( y i y) 2 2078.183 253.890 1824.293 DFR 1 Example - Pharmacodynamics of LSD Source of Variation Model Error Total Sum of Squares 1824.293 253.890 2078.183 Degrees of Freedom 1 5 6 Mean Square 1824.293 50.778 •Analysis of Variance - F-test • H0: 1 = 0 T .S . : Fobs R.R. : Fobs HA: 1 0 MSR 35.93 MSE F.05,1, 5 6.61 P val : P ( F 35.93) F 35.93 Example - SPSS Output b O m d F S M i a g f a 1 R 2 1 2 8 2 R 1 5 6 T 3 6 a P b D Linearity of Regression (SLR) F -Test for Lack-of-Fit (n j observations at c distinct levels of "X") H 0 : E Yi 0 1 X i H A : E Yi mi 0 1 X i Compute fitted value Y j and sample mean Y j for each distinct X level c nj Lack-of-Fit: SS LF Y j Y j j 1 i 1 c nj Pure Error: SS PE Yij Y j j 1 i 1 2 2 df LF c 2 df PE n c SS ( LF ) c 2 MS ( LF ) ~ MS ( PE ) SS ( PE ) n c H0 Test Statistic: FLOF Reject H 0 if FLOF F 1 ; c 2, n c Fc 2,n c