CHAPTER 7  Linear Correlation & Regression Methods
• 7.1 - Motivation
• 7.2 - Correlation / Simple Linear Regression
• 7.3 - Extensions of Simple Linear Regression

Testing for association between two POPULATION variables X and Y…

• Categorical variables: cross-classify the categories of X against the categories of Y and apply the Chi-squared Test.
  Examples:  X = Disease status (D+, D–), Y = Exposure status (E+, E–)
             X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

• Numerical variables: the association is described by POPULATION PARAMETERS
  Means:       $\mu_X = E[X]$,   $\mu_Y = E[Y]$
  Variances:   $\sigma_X^2 = E[(X - \mu_X)^2]$,   $\sigma_Y^2 = E[(Y - \mu_Y)^2]$
  Covariance:  $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$

Parameter Estimation via SAMPLE DATA …

Given a sample of n paired observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, the parameters are estimated by the corresponding SAMPLE STATISTICS:
  Means:       $\bar{x} = \dfrac{\sum x}{n}$,   $\bar{y} = \dfrac{\sum y}{n}$
  Variances:   $s_x^2 = \dfrac{\sum (x - \bar{x})^2}{n-1}$,   $s_y^2 = \dfrac{\sum (y - \bar{y})^2}{n-1}$
  Covariance:  $s_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$   (can be +, –, or 0)

A scatterplot of the n data points (X horizontal, Y vertical; figure source: JAMA. 2003;290:1486-1493) displays the sample. Does it suggest a linear trend between X and Y? If so, how do we measure it?

For numerical variables, the strength of linear association in the population is measured by the Linear Correlation Coefficient

  $\rho = \dfrac{\sigma_{XY}}{\sigma_X \, \sigma_Y}$,   always between –1 and +1,

estimated from the sample by

  $r = \dfrac{s_{xy}}{s_x \, s_y}$,   also always between –1 and +1.

Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
   1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
> y = sample(pop, 10)
  13.1 18.3 17.6 19.1 19.3  3.2  5.6  13.6   8.0   3.0

> c(mean(x), mean(y))
  7.05 12.08
> var(x)
  29.48944
> var(y)
  43.76178
> cov(x, y)
  -25.86667
> cor(x, y)
  -0.7200451
> plot(x, y, pch = 19)    # scatterplot, n = 10
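As a quick check, the sample correlation can be reproduced directly from its definition. This is a minimal sketch; the explicit vectors below simply restate the sampled x and y values from the example so the code runs without rerunning sample().

# Sample correlation from its definition r = s_xy / (s_x * s_y);
# should agree with cor(x, y) above (about -0.72).
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
r <- cov(x, y) / (sd(x) * sd(y))
r                         # -0.7200451
all.equal(r, cor(x, y))   # TRUE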
The sample linear correlation coefficient

  $r = \dfrac{s_{xy}}{s_x \, s_y}$,   always between –1 and +1,

measures the strength of linear association visible in the scatterplot:

  r near –1  →  strong negative linear correlation
  r near  0  →  little or no linear correlation
  r near +1  →  strong positive linear correlation

For our sample,  cor(x, y) = -0.7200451.

Testing for linear association between two numerical population variables X and Y…

Now that we have r, we can conduct HYPOTHESIS TESTING on the population linear correlation coefficient $\rho = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$:

  $H_0: \rho = 0$   "No linear association between X and Y."
  $H_A: \rho \neq 0$   "Linear association between X and Y."

Test Statistic for the p-value:

  $T = \dfrac{r\,\sqrt{n-2}}{\sqrt{1-r^2}} \;\sim\; t_{n-2}$

Here $\hat{\rho} = r = -0.72$, so

  $T = \dfrac{-0.72\,\sqrt{10-2}}{\sqrt{1-(-0.72)^2}} = -2.935$  on $t_8$,

and the two-sided p-value is  2 * pt(-2.935, 8) = 0.0189 < 0.05, so we reject $H_0$.

If such a linear association between X and Y exists, then we can model the response with some intercept $\beta_0$ and slope $\beta_1$:

  $Y = \beta_0 + \beta_1 X + \varepsilon$    “Response = Model + Error”

Each observed point $(x_i, y_i)$ has a fitted point $(x_i, \hat{y}_i)$ on the line, with residual $e_i = y_i - \hat{y}_i$.

Find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the “best” line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ … best in what sense? The one that minimizes  $SS_{Err} = \sum e_i^2$.
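A minimal R check of this correlation test, continuing with the same x and y as above: the statistic is computed from the formula and then compared with base R's cor.test(), which uses the same t statistic.

n <- length(x)
r <- cor(x, y)
T <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic from the formula
T                                      # about -2.935
2 * pt(-abs(T), df = n - 2)            # two-sided p-value, about 0.0189
cor.test(x, y)                         # t = -2.935, df = 8, p-value = 0.01886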
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

“Least Squares Regression Line”

  $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2} = \dfrac{-25.86667}{29.48944} = -0.87715$

  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 12.08 - (-0.87715)(7.05) = 18.26391$

Check: $(\bar{x}, \bar{y})$ is on the line. Hence

  $\hat{Y} = 18.26391 - 0.87715\,X$

  predictor X           1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
  observed response Y  13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
  fitted response Ŷ     ~ E X E R C I S E ~
  residuals Y – Ŷ       ~ E X E R C I S E ~

Filling in the fitted responses and residuals (exercise) and summing the squared residuals gives

  $SS_{Err} = \sum e_i^2 = 189.6555$,

the minimum value achievable by any straight line.

Testing for linear association between two numerical population variables X and Y…

Linear Regression Coefficients

  $Y = \beta_0 + \beta_1 X + \varepsilon$    “Response = Model + Error”

Now that we have these estimates, we can conduct HYPOTHESIS TESTING on $\beta_0$ and $\beta_1$:

  $H_0: \beta_1 = 0$   "No linear association between X and Y."
  $H_A: \beta_1 \neq 0$   "Linear association between X and Y."

  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$,   with   $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

Test Statistic for the p-value?
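One way to fill in the exercise rows above (a minimal sketch, continuing with the same x and y): compute the least squares coefficients from the sample statistics, then the fitted values, residuals, and SS_Err.

b1 <- cov(x, y) / var(x)        # -0.87715
b0 <- mean(y) - b1 * mean(x)    #  18.26391
y_hat <- b0 + b1 * x            # fitted responses (first "EXERCISE" row)
e     <- y - y_hat              # residuals       (second "EXERCISE" row)
SS_Err <- sum(e^2)              # 189.6555
round(cbind(x, y, y_hat, e), 3)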
Linear Regression Coefficients: Test Statistic for the p-value

  $T = \dfrac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{MS_{Err}}{(n-1)\,s_x^2}}} \;\sim\; t_{n-2}$,   where   $MS_{Err} = \dfrac{SS_{Err}}{n-2}$   (here $df_{Err} = 8$).

  $T = \dfrac{-0.87715 - 0}{\sqrt{\dfrac{189.6555/8}{(9)(29.48944)}}} = -2.935$  on $t_8$

Same t-score as the test of $H_0: \rho = 0$!   p-value = 0.0189.

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185,  Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???  Because this second method generalizes…

ANOVA Table     $H_0: \beta_1 = 0$  vs.  $H_A: \beta_1 \neq 0$   for   $Y = \beta_0 + \beta_1 X + \varepsilon$

  Source        df    SS    MS    F-ratio    p-value
  Regression     1
  Error          8
  Total          9     –

The degrees of freedom are fixed by the design: $df_{Reg} = 1$, $df_{Err} = n - 2 = 8$, $df_{Total} = n - 1 = 9$. The sums of squares come from the sample statistics:

  $SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2$

SS_Tot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).
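A minimal check of the slope test, continuing from the previous sketches (it reuses n, b1, and SS_Err defined there); the standard error formula matches the one displayed above.

MS_Err <- SS_Err / (n - 2)                    # 23.707
se_b1  <- sqrt(MS_Err / ((n - 1) * var(x)))   # 0.2989
t_b1   <- (b1 - 0) / se_b1                    # -2.935
2 * pt(-abs(t_b1), df = n - 2)                # 0.0189, same p-value as before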
After model-fitting, the total variability decomposes into two pieces:

  $SS_{Reg} = \sum (\hat{y} - \bar{y})^2$ :  a measure of the amount of variability in the fitted responses (i.e., after model-fitting);

  $SS_{Err} = \sum (y - \hat{y})^2$ :  a measure of the amount of variability in the resulting residuals (i.e., after model-fitting);

  $SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2$ :  the total variability in the observed responses.

For the example:

  $SS_{Reg} = 204.200$
  $SS_{Err} = 189.656$
  $SS_{Tot} = 9\,(43.76178) = 393.856$

and indeed  $SS_{Tot} = SS_{Reg} + SS_{Err}$,  with $SS_{Err}$ at its minimum for the least squares line.

ANOVA Table     $H_0: \beta_1 = 0$  vs.  $H_A: \beta_1 \neq 0$

  Source        df      SS        MS        F-ratio    p-value
  Regression     1    204.200   204.200    8.61349    0.018857
  Error          8    189.656    23.707
  Total          9    393.856      –

The F-ratio is $MS_{Reg}/MS_{Err}$ on $F_{1,\,8}$; the p-value is the same as before!

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707

Coefficient of Determination

  $\dfrac{SS_{Reg}}{SS_{Tot}} = \dfrac{204.2}{393.856} = 0.5185 = r^2 = (-0.72)^2$

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. This matches “Multiple R-squared: 0.5185” in summary(lsreg).
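Continuing the same sketch, the sum-of-squares decomposition and its link to r² can be verified numerically:

SS_Reg <- sum((y_hat - mean(y))^2)   # 204.20
SS_Tot <- sum((y - mean(y))^2)       # 393.856, also (n - 1) * var(y)
c(SS_Reg + SS_Err, SS_Tot)           # the two totals agree
SS_Reg / SS_Tot                      # 0.5185
cor(x, y)^2                          # 0.5185, the same value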
Summary of Linear Correlation and Simple Linear Regression

Given:  X: $x_1, x_2, x_3, x_4, \dots, x_n$   and   Y: $y_1, y_2, y_3, y_4, \dots, y_n$

  Means $\bar{x}, \bar{y}$;   Variances $s_x^2, s_y^2$;   Covariance $s_{xy}$.

Linear Correlation Coefficient:  $r = \dfrac{s_{xy}}{s_x s_y}$,  with $-1 \le r \le +1$;  r measures the strength of linear association.

Least Squares Regression Line:  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$  with  $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$,  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$;  it minimizes  $SS_{Err} = \sum (y - \hat{y})^2 = SS_{Tot} - SS_{Reg}$.

Coefficient of Determination:  $r^2 = \dfrac{SS_{Reg}}{SS_{Tot}}$ = proportion of the total variability modeled by the regression line.

All point estimates can be upgraded to 95% confidence intervals, used for hypothesis testing, etc. (ANOVA). In particular, the fitted value $\hat{y}$ at each x carries upper and lower 95% confidence bands around the regression line (see notes for the wider “95% prediction intervals” for an individual new response; a short R sketch of these bands appears below, after the multilinear-regression notes).

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression     “Response = Model + Error”

  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$    (“main effects”)

  $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$   "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."
  $H_A: \beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$   "Linear association between Y and at least one of its predictors."

  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$

For now, assume the “additive model,” i.e., main effects only. Geometrically, each true response $y_i$ at predictor values $(x_{1i}, x_{2i})$ has a fitted response $\hat{y}_i$ on the regression plane and a residual $e_i = y_i - \hat{y}_i$. Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)! Once calculated, how do we then test the null hypothesis? ANOVA.

  R code example:  lsreg = lm(y ~ x1 + x2 + x3)    # “main effects”

The model can be extended with quadratic terms, cubes, etc. (“polynomial regression”):

  $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{k-1} X_{k-1} + \beta_{1,1} X_1^2 + \beta_{2,2} X_2^2 + \dots + \beta_{k-1,k-1} X_{k-1}^2 + \dots + \varepsilon$

  R code example:  lsreg = lm(y ~ x + I(x^2) + I(x^3))    # I() is needed so that ^ is treated arithmetically inside a formula
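Returning to the simple-regression fit, here is the promised sketch of the 95% confidence and prediction bands. It is a minimal illustration: lsreg_simple and newx are names introduced here, and predict() with interval = "confidence" / "prediction" is the standard R call for these intervals.

lsreg_simple <- lm(y ~ x)    # the simple regression fit from before
newx <- data.frame(x = seq(min(x), max(x), length.out = 50))
conf <- predict(lsreg_simple, newdata = newx, interval = "confidence")   # for the mean response
pred <- predict(lsreg_simple, newdata = newx, interval = "prediction")   # for a new observation
plot(x, y, pch = 19)
abline(lsreg_simple)
lines(newx$x, conf[, "lwr"], lty = 2); lines(newx$x, conf[, "upr"], lty = 2)   # 95% confidence band
lines(newx$x, pred[, "lwr"], lty = 3); lines(newx$x, pred[, "upr"], lty = 3)   # 95% prediction band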
Interaction terms between predictors can also be added:

  $\dots + \beta_{1,2} X_1 X_2 + \beta_{1,3} X_1 X_3 + \dots + \beta_{1,k-1} X_1 X_{k-1} + \beta_{2,3} X_2 X_3 + \beta_{2,4} X_2 X_4 + \dots + \beta_{2,k-1} X_2 X_{k-1} + \dots$    (“interactions”)

  R code example:  lsreg = lm(y ~ x1*x2)    # equivalent to lm(y ~ x1 + x2 + x1:x2)

Recall… Example in R (reformatted for brevity): Multiple Linear Regression with an interaction involving an indicator (“dummy”) variable. Suppose the data are actually two subgroups, requiring two distinct linear regressions:

  I = 1:  $\hat{Y} \approx 13.37 + 1.62\,X$
  I = 0:  $\hat{Y} \approx 6.56 + 0.01\,X$

A single model with an interaction term,  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 I + \hat{\beta}_3 X I$,  recovers both lines:

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)   6.56463
x             0.00998
I             6.80422
x:I           1.60858

  $\hat{Y} = 6.56 + 0.01\,X + 6.80\,I + 1.61\,X I$

ANOVA Table (revisited)

  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$

  $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$   "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."
    Note that if $H_0$ is true, then the model reduces to $Y = \beta_0 + \varepsilon$, so $\mu_Y = \beta_0$.
  $H_A: \beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$   "Linear association between Y and at least one of its predictors."

From a sample of n data points, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$; note that if $H_0$ is true, then it would follow that $\hat{\beta}_0 = \bar{y}$. But how are these regression coefficients calculated in general? Via the “normal equations,” solved by computer (intensive).

ANOVA Table (based on n data points):

  Source        df       SS                                          MS = SS/df    F-ratio                                        p-value
  Regression    k – 1    $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$    $MS_{Reg}$    $\dfrac{MS_{Reg}}{MS_{Err}} \sim F_{k-1,\,n-k}$   0 < p < 1
  Error         n – k    $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$        $MS_{Err}$
  Total         n – 1    $\sum_{i=1}^{n} (y_i - \bar{y})^2$             –

*** How are only the statistically significant variables determined? ***

“MODEL SELECTION” (BE, backward elimination)

  Step 0.  Conduct an overall F-test of significance (via ANOVA) of the full model. If significant, then…
  Step 1.  Conduct individual t-tests:  $H_0: \beta_1 = 0$ (p1 < .05, reject $H_0$);  $H_0: \beta_2 = 0$ (p2 < .05, reject $H_0$);  $H_0: \beta_3 = 0$ (p3 ≥ .05, accept $H_0$);  $H_0: \beta_4 = 0$ (p4 < .05, reject $H_0$); ……
  Step 2.  Are all coefficients significant at level α? If not, delete that (nonsignificant) term, e.g. X3, and recompute new coefficients for the remaining predictors!
  Step 3.  Repeat Steps 1-2 as necessary until all remaining coefficients are significant → reduced model.

Recall ~ for $k \ge 2$ independent, equivariant, normally-distributed “treatment groups” $Y_1, Y_2, \dots, Y_k$, one-way ANOVA tests $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ under $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$. When the model assumptions fail, transformations can help: re-plot the data on a “log-log” scale, or on a “log” scale (of Y only).

Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

The “log-odds” (“logit”) is an example of a general “link function” $g(\pi)$:

  $\ln\!\left(\dfrac{\hat{\pi}}{1 - \hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X \quad\Longleftrightarrow\quad \hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X)}}$

The coefficients are obtained by “MAXIMUM LIKELIHOOD ESTIMATION.” (Note: not based on Least Squares; this implies a “pseudo-$R^2$,” etc.)
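A minimal R sketch of fitting such a logistic model. The variables age and surg and their values are hypothetical, made up for illustration; glm() with family = binomial is the standard R call for logistic regression, fitted by maximum likelihood.

age  <- c(23, 35, 41, 48, 52, 57, 61, 66, 70, 74)   # hypothetical ages
surg <- c( 0,  0,  0,  1,  0,  1,  1,  1,  1,  1)   # 1 = ever had surgery, 0 = never (hypothetical)
fit  <- glm(surg ~ age, family = binomial)          # logit link by default
coef(fit)                                           # estimates of beta0-hat and beta1-hat
predict(fit, newdata = data.frame(age = 50), type = "response")   # pi-hat at age 50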
Binary outcome, e.g., “Have you ever had surgery?” (Yes / No), with multiple predictors

“log-odds” (“logit”):

  $\ln\!\left(\dfrac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k \quad\Longleftrightarrow\quad \hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k)}}$

Suppose one of the predictor variables is binary, say

  $X_1 = \begin{cases} 1, & \text{Age} \ge 50 \\ 0, & \text{Age} < 50 \end{cases}$

  $X_1 = 1:\quad \ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

  $X_1 = 0:\quad \ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

SUBTRACT!  With the other predictors held fixed,

  $\ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) - \ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \hat{\beta}_1$,   i.e.,   $\ln\!\left(\dfrac{\text{odds of surgery given Age} \ge 50}{\text{odds of surgery given Age} < 50}\right) = \hat{\beta}_1$,

so  $\ln(\widehat{OR}) = \hat{\beta}_1$  ………….. implies …………..  $\widehat{OR} = e^{\hat{\beta}_1}$.

The “logit” in population dynamics

Unrestricted population growth (e.g., bacteria): the population size y obeys the following law, with constant $a > 0$:

  $\dfrac{dy}{dt} = a\,y$

Separating variables:  $\dfrac{1}{y}\,dy = a\,dt \;\Rightarrow\; \ln|y| = at + b \;\Rightarrow\; y = e^{at+b} = e^{at}\,e^{b} = C\,e^{at}$.

With initial condition $y(0) = y_0$:   $y = y_0\, e^{at}$    (Exponential growth).

Restricted population growth (disease, predation, starvation, etc.): with constant $a > 0$ and “carrying capacity” M, the population size obeys

  $\dfrac{dy}{dt} = \dfrac{a}{M}\, y\,(M - y)$.

Let the survival probability be $\pi = \dfrac{y}{M}$. Then

  $\dfrac{d\pi}{dt} = a\,\pi\,(1 - \pi)$.

Separating variables:  $\dfrac{d\pi}{\pi(1-\pi)} = a\,dt \;\Rightarrow\; \left(\dfrac{1}{\pi} + \dfrac{1}{1-\pi}\right) d\pi = a\,dt \;\Rightarrow\; \ln|\pi| - \ln|1-\pi| = at + b \;\Rightarrow\; \ln\!\left(\dfrac{\pi}{1-\pi}\right) = at + b$.

With initial condition $\pi(0) = \pi_0$:   $\pi = \dfrac{\pi_0}{\pi_0 + (1 - \pi_0)\,e^{-at}}$    (Logistic growth).
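A minimal R sketch of the odds-ratio interpretation above. The surg and age50 vectors are hypothetical illustration data; exp() of the fitted coefficient gives the estimated OR, and confint.default() gives a Wald interval on the log-odds scale.

surg  <- c(0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1)   # 1 = ever had surgery (hypothetical)
age50 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1)   # 1 if Age >= 50, 0 if Age < 50 (hypothetical)
fit <- glm(surg ~ age50, family = binomial)
coef(fit)["age50"]                    # beta1-hat, the estimated log odds ratio
exp(coef(fit)["age50"])               # OR-hat = e^(beta1-hat)
exp(confint.default(fit)["age50", ])  # Wald 95% CI for the OR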