CHAPTER 7  Linear Correlation & Regression Methods
• 7.1 - Motivation
• 7.2 - Correlation / Simple Linear Regression
• 7.3 - Extensions of Simple Linear Regression

Testing for association between two POPULATION variables X and Y…

• Categorical variables: cross-classify the categories of X against the categories of Y and apply the Chi-squared Test.
  Examples:  X = Disease status (D+, D–), Y = Exposure status (E+, E–)
             X = # children in household (0, 1-2, 3-4, 5+), Y = Income level (Low, Middle, High)

• Numerical variables: the association is described by POPULATION PARAMETERS
  Means:       $\mu_X = E[X]$,   $\mu_Y = E[Y]$
  Variances:   $\sigma_X^2 = E[(X - \mu_X)^2]$,   $\sigma_Y^2 = E[(Y - \mu_Y)^2]$
  Covariance:  $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$

Parameter Estimation via SAMPLE DATA …

Given a sample of n paired observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, the parameters are estimated by the corresponding SAMPLE STATISTICS:
  Means:       $\bar{x} = \dfrac{\sum x}{n}$,   $\bar{y} = \dfrac{\sum y}{n}$
  Variances:   $s_x^2 = \dfrac{\sum (x - \bar{x})^2}{n-1}$,   $s_y^2 = \dfrac{\sum (y - \bar{y})^2}{n-1}$
  Covariance:  $s_{xy} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n-1}$   (can be +, –, or 0)

A scatterplot of the n data points (X horizontal, Y vertical; figure source: JAMA. 2003;290:1486-1493) displays the sample. Does it suggest a linear trend between X and Y? If so, how do we measure it?

For numerical variables, the strength of linear association in the population is measured by the Linear Correlation Coefficient

  $\rho = \dfrac{\sigma_{XY}}{\sigma_X \, \sigma_Y}$,   always between –1 and +1,

estimated from the sample by

  $r = \dfrac{s_{xy}}{s_x \, s_y}$,   also always between –1 and +1.

Example in R (reformatted for brevity):

> pop = seq(0, 20, 0.1)
> x = sort(sample(pop, 10))
   1.1  1.8  2.1  3.7  4.0  7.3  9.1  11.9  12.4  17.1
> y = sample(pop, 10)
  13.1 18.3 17.6 19.1 19.3  3.2  5.6  13.6   8.0   3.0

> c(mean(x), mean(y))
  7.05 12.08
> var(x)
  29.48944
> var(y)
  43.76178
> cov(x, y)
  -25.86667
> cor(x, y)
  -0.7200451
> plot(x, y, pch = 19)    # scatterplot, n = 10
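As a quick check, the sample correlation can be reproduced directly from its definition. This is a minimal sketch; the explicit vectors below simply restate the sampled x and y values from the example so the code runs without rerunning sample().

# Sample correlation from its definition r = s_xy / (s_x * s_y);
# should agree with cor(x, y) above (about -0.72).
x <- c(1.1, 1.8, 2.1, 3.7, 4.0, 7.3, 9.1, 11.9, 12.4, 17.1)
y <- c(13.1, 18.3, 17.6, 19.1, 19.3, 3.2, 5.6, 13.6, 8.0, 3.0)
r <- cov(x, y) / (sd(x) * sd(y))
r                         # -0.7200451
all.equal(r, cor(x, y))   # TRUE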
The sample linear correlation coefficient

  $r = \dfrac{s_{xy}}{s_x \, s_y}$,   always between –1 and +1,

measures the strength of linear association visible in the scatterplot:

  r near –1  →  strong negative linear correlation
  r near  0  →  little or no linear correlation
  r near +1  →  strong positive linear correlation

For our sample,  cor(x, y) = -0.7200451.

Testing for linear association between two numerical population variables X and Y…

Now that we have r, we can conduct HYPOTHESIS TESTING on the population linear correlation coefficient $\rho = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$:

  $H_0: \rho = 0$   "No linear association between X and Y."
  $H_A: \rho \neq 0$   "Linear association between X and Y."

Test Statistic for the p-value:

  $T = \dfrac{r\,\sqrt{n-2}}{\sqrt{1-r^2}} \;\sim\; t_{n-2}$

Here $\hat{\rho} = r = -0.72$, so

  $T = \dfrac{-0.72\,\sqrt{10-2}}{\sqrt{1-(-0.72)^2}} = -2.935$  on $t_8$,

and the two-sided p-value is  2 * pt(-2.935, 8) = 0.0189 < 0.05, so we reject $H_0$.

If such a linear association between X and Y exists, then we can model the response with some intercept $\beta_0$ and slope $\beta_1$:

  $Y = \beta_0 + \beta_1 X + \varepsilon$    “Response = Model + Error”

Each observed point $(x_i, y_i)$ has a fitted point $(x_i, \hat{y}_i)$ on the line, with residual $e_i = y_i - \hat{y}_i$.

Find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ for the “best” line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ … best in what sense? The one that minimizes  $SS_{Err} = \sum e_i^2$.
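A minimal R check of this correlation test, continuing with the same x and y as above: the statistic is computed from the formula and then compared with base R's cor.test(), which uses the same t statistic.

n <- length(x)
r <- cor(x, y)
T <- r * sqrt(n - 2) / sqrt(1 - r^2)   # test statistic from the formula
T                                      # about -2.935
2 * pt(-abs(T), df = n - 2)            # two-sided p-value, about 0.0189
cor.test(x, y)                         # t = -2.935, df = 8, p-value = 0.01886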
SIMPLE LINEAR REGRESSION via the METHOD OF LEAST SQUARES

“Least Squares Regression Line”

  $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2} = \dfrac{-25.86667}{29.48944} = -0.87715$

  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 12.08 - (-0.87715)(7.05) = 18.26391$

Check: $(\bar{x}, \bar{y})$ is on the line. Hence

  $\hat{Y} = 18.26391 - 0.87715\,X$

  predictor X           1.1   1.8   2.1   3.7   4.0   7.3   9.1  11.9  12.4  17.1
  observed response Y  13.1  18.3  17.6  19.1  19.3   3.2   5.6  13.6   8.0   3.0
  fitted response Ŷ     ~ E X E R C I S E ~
  residuals Y – Ŷ       ~ E X E R C I S E ~

Filling in the fitted responses and residuals (exercise) and summing the squared residuals gives

  $SS_{Err} = \sum e_i^2 = 189.6555$,

the minimum value achievable by any straight line.

Testing for linear association between two numerical population variables X and Y…

Linear Regression Coefficients

  $Y = \beta_0 + \beta_1 X + \varepsilon$    “Response = Model + Error”

Now that we have these estimates, we can conduct HYPOTHESIS TESTING on $\beta_0$ and $\beta_1$:

  $H_0: \beta_1 = 0$   "No linear association between X and Y."
  $H_A: \beta_1 \neq 0$   "Linear association between X and Y."

  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$,   with   $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$,   $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

Test Statistic for the p-value?
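One way to fill in the exercise rows above (a minimal sketch, continuing with the same x and y): compute the least squares coefficients from the sample statistics, then the fitted values, residuals, and SS_Err.

b1 <- cov(x, y) / var(x)        # -0.87715
b0 <- mean(y) - b1 * mean(x)    #  18.26391
y_hat <- b0 + b1 * x            # fitted responses (first "EXERCISE" row)
e     <- y - y_hat              # residuals       (second "EXERCISE" row)
SS_Err <- sum(e^2)              # 189.6555
round(cbind(x, y, y_hat, e), 3)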
Linear Regression Coefficients: Test Statistic for the p-value

  $T = \dfrac{\hat{\beta}_1 - \beta_1}{\sqrt{\dfrac{MS_{Err}}{(n-1)\,s_x^2}}} \;\sim\; t_{n-2}$,   where   $MS_{Err} = \dfrac{SS_{Err}}{n-2}$   (here $df_{Err} = 8$).

  $T = \dfrac{-0.87715 - 0}{\sqrt{\dfrac{189.6555/8}{(9)(29.48944)}}} = -2.935$  on $t_8$

Same t-score as the test of $H_0: \rho = 0$!   p-value = 0.0189.

> plot(x, y, pch = 19)
> lsreg = lm(y ~ x)    # or lsfit(x, y)
> abline(lsreg)
> summary(lsreg)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.6607 -3.2154  0.8954  3.4649  5.7742

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  18.2639     2.6097   6.999 0.000113 ***
x            -0.8772     0.2989  -2.935 0.018857 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.869 on 8 degrees of freedom
Multiple R-squared: 0.5185,  Adjusted R-squared: 0.4583
F-statistic: 8.614 on 1 and 8 DF,  p-value: 0.01886

BUT WHY HAVE TWO METHODS FOR THE SAME PROBLEM???  Because this second method generalizes…

ANOVA Table     $H_0: \beta_1 = 0$  vs.  $H_A: \beta_1 \neq 0$   for   $Y = \beta_0 + \beta_1 X + \varepsilon$

  Source        df    SS    MS    F-ratio    p-value
  Regression     1
  Error          8
  Total          9     –

The degrees of freedom are fixed by the design: $df_{Reg} = 1$, $df_{Err} = n - 2 = 8$, $df_{Total} = n - 1 = 9$. The sums of squares come from the sample statistics:

  $SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2$

SS_Tot is a measure of the total amount of variability in the observed responses (i.e., before any model-fitting).
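A minimal check of the slope test, continuing from the previous sketches (it reuses n, b1, and SS_Err defined there); the standard error formula matches the one displayed above.

MS_Err <- SS_Err / (n - 2)                    # 23.707
se_b1  <- sqrt(MS_Err / ((n - 1) * var(x)))   # 0.2989
t_b1   <- (b1 - 0) / se_b1                    # -2.935
2 * pt(-abs(t_b1), df = n - 2)                # 0.0189, same p-value as before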
After model-fitting, the total variability decomposes into two pieces:

  $SS_{Reg} = \sum (\hat{y} - \bar{y})^2$ :  a measure of the amount of variability in the fitted responses (i.e., after model-fitting);

  $SS_{Err} = \sum (y - \hat{y})^2$ :  a measure of the amount of variability in the resulting residuals (i.e., after model-fitting);

  $SS_{Tot} = \sum (y - \bar{y})^2 = (n-1)\,s_y^2$ :  the total variability in the observed responses.

For the example:

  $SS_{Reg} = 204.200$
  $SS_{Err} = 189.656$
  $SS_{Tot} = 9\,(43.76178) = 393.856$

and indeed  $SS_{Tot} = SS_{Reg} + SS_{Err}$,  with $SS_{Err}$ at its minimum for the least squares line.

ANOVA Table     $H_0: \beta_1 = 0$  vs.  $H_A: \beta_1 \neq 0$

  Source        df      SS        MS        F-ratio    p-value
  Regression     1    204.200   204.200    8.61349    0.018857
  Error          8    189.656    23.707
  Total          9    393.856      –

The F-ratio is $MS_{Reg}/MS_{Err}$ on $F_{1,\,8}$; the p-value is the same as before!

> summary(aov(lsreg))
            Df Sum Sq Mean Sq F value  Pr(>F)
x            1 204.20 204.201  8.6135 0.01886 *
Residuals    8 189.66  23.707

Coefficient of Determination

  $\dfrac{SS_{Reg}}{SS_{Tot}} = \dfrac{204.2}{393.856} = 0.5185 = r^2 = (-0.72)^2$

The least squares regression line accounts for 51.85% of the total variability in the observed response, with 48.15% remaining. This matches “Multiple R-squared: 0.5185” in summary(lsreg).
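Continuing the same sketch, the sum-of-squares decomposition and its link to r² can be verified numerically:

SS_Reg <- sum((y_hat - mean(y))^2)   # 204.20
SS_Tot <- sum((y - mean(y))^2)       # 393.856, also (n - 1) * var(y)
c(SS_Reg + SS_Err, SS_Tot)           # the two totals agree
SS_Reg / SS_Tot                      # 0.5185
cor(x, y)^2                          # 0.5185, the same value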
Summary of Linear Correlation and Simple Linear Regression

Given:  X: $x_1, x_2, x_3, x_4, \dots, x_n$   and   Y: $y_1, y_2, y_3, y_4, \dots, y_n$

  Means $\bar{x}, \bar{y}$;   Variances $s_x^2, s_y^2$;   Covariance $s_{xy}$.

Linear Correlation Coefficient:  $r = \dfrac{s_{xy}}{s_x s_y}$,  with $-1 \le r \le +1$;  r measures the strength of linear association.

Least Squares Regression Line:  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$  with  $\hat{\beta}_1 = \dfrac{s_{xy}}{s_x^2}$,  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$;  it minimizes  $SS_{Err} = \sum (y - \hat{y})^2 = SS_{Tot} - SS_{Reg}$.

Coefficient of Determination:  $r^2 = \dfrac{SS_{Reg}}{SS_{Tot}}$ = proportion of the total variability modeled by the regression line.

All point estimates can be upgraded to 95% confidence intervals, used for hypothesis testing, etc. (ANOVA). In particular, the fitted value $\hat{y}$ at each x carries upper and lower 95% confidence bands around the regression line (see notes for the wider “95% prediction intervals” for an individual new response; a short R sketch of these bands appears below, after the multilinear-regression notes).

Testing for linear association between a population response variable Y and multiple predictor variables X1, X2, X3, … etc.

Multilinear Regression     “Response = Model + Error”

  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$    (“main effects”)

  $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$   "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."
  $H_A: \beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$   "Linear association between Y and at least one of its predictors."

  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$

For now, assume the “additive model,” i.e., main effects only. Geometrically, each true response $y_i$ at predictor values $(x_{1i}, x_{2i})$ has a fitted response $\hat{y}_i$ on the regression plane and a residual $e_i = y_i - \hat{y}_i$. Least Squares calculation of the regression coefficients is computer-intensive; the formulas require Linear Algebra (matrices)! Once calculated, how do we then test the null hypothesis? ANOVA.

  R code example:  lsreg = lm(y ~ x1 + x2 + x3)    # “main effects”

The model can be extended with quadratic terms, cubes, etc. (“polynomial regression”):

  $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{k-1} X_{k-1} + \beta_{1,1} X_1^2 + \beta_{2,2} X_2^2 + \dots + \beta_{k-1,k-1} X_{k-1}^2 + \dots + \varepsilon$

  R code example:  lsreg = lm(y ~ x + I(x^2) + I(x^3))    # I() is needed so that ^ is treated arithmetically inside a formula
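Returning to the simple-regression fit, here is the promised sketch of the 95% confidence and prediction bands. It is a minimal illustration: lsreg_simple and newx are names introduced here, and predict() with interval = "confidence" / "prediction" is the standard R call for these intervals.

lsreg_simple <- lm(y ~ x)    # the simple regression fit from before
newx <- data.frame(x = seq(min(x), max(x), length.out = 50))
conf <- predict(lsreg_simple, newdata = newx, interval = "confidence")   # for the mean response
pred <- predict(lsreg_simple, newdata = newx, interval = "prediction")   # for a new observation
plot(x, y, pch = 19)
abline(lsreg_simple)
lines(newx$x, conf[, "lwr"], lty = 2); lines(newx$x, conf[, "upr"], lty = 2)   # 95% confidence band
lines(newx$x, pred[, "lwr"], lty = 3); lines(newx$x, pred[, "upr"], lty = 3)   # 95% prediction band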
Interaction terms between predictors can also be added:

  $\dots + \beta_{1,2} X_1 X_2 + \beta_{1,3} X_1 X_3 + \dots + \beta_{1,k-1} X_1 X_{k-1} + \beta_{2,3} X_2 X_3 + \beta_{2,4} X_2 X_4 + \dots + \beta_{2,k-1} X_2 X_{k-1} + \dots$    (“interactions”)

  R code example:  lsreg = lm(y ~ x1*x2)    # equivalent to lm(y ~ x1 + x2 + x1:x2)

Recall… Example in R (reformatted for brevity): Multiple Linear Regression with an interaction involving an indicator (“dummy”) variable. Suppose the data are actually two subgroups, requiring two distinct linear regressions:

  I = 1:  $\hat{Y} \approx 13.37 + 1.62\,X$
  I = 0:  $\hat{Y} \approx 6.56 + 0.01\,X$

A single model with an interaction term,  $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 I + \hat{\beta}_3 X I$,  recovers both lines:

> I = c(1,1,1,1,1,0,0,0,0,0)
> lsreg = lm(y ~ x*I)
> summary(lsreg)

Coefficients:
             Estimate
(Intercept)   6.56463
x             0.00998
I             6.80422
x:I           1.60858

  $\hat{Y} = 6.56 + 0.01\,X + 6.80\,I + 1.61\,X I$

ANOVA Table (revisited)

  $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{k-1} X_{k-1} + \varepsilon$

  $H_0: \beta_1 = \beta_2 = \beta_3 = \dots = \beta_{k-1} = 0$   "No linear association between Y and any of its predictors $X_1, X_2, X_3, \dots, X_{k-1}$."
    Note that if $H_0$ is true, then the model reduces to $Y = \beta_0 + \varepsilon$, so $\mu_Y = \beta_0$.
  $H_A: \beta_i \neq 0$ for some $i = 1, 2, \dots, k-1$   "Linear association between Y and at least one of its predictors."

From a sample of n data points, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_{k-1} X_{k-1}$; note that if $H_0$ is true, then it would follow that $\hat{\beta}_0 = \bar{y}$. But how are these regression coefficients calculated in general? Via the “normal equations,” solved by computer (intensive).

ANOVA Table (based on n data points):

  Source        df       SS                                          MS = SS/df    F-ratio                                        p-value
  Regression    k – 1    $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$    $MS_{Reg}$    $\dfrac{MS_{Reg}}{MS_{Err}} \sim F_{k-1,\,n-k}$   0 < p < 1
  Error         n – k    $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$        $MS_{Err}$
  Total         n – 1    $\sum_{i=1}^{n} (y_i - \bar{y})^2$             –

*** How are only the statistically significant variables determined? ***

“MODEL SELECTION” (BE, backward elimination)

  Step 0.  Conduct an overall F-test of significance (via ANOVA) of the full model. If significant, then…
  Step 1.  Conduct individual t-tests:  $H_0: \beta_1 = 0$ (p1 < .05, reject $H_0$);  $H_0: \beta_2 = 0$ (p2 < .05, reject $H_0$);  $H_0: \beta_3 = 0$ (p3 ≥ .05, accept $H_0$);  $H_0: \beta_4 = 0$ (p4 < .05, reject $H_0$); ……
  Step 2.  Are all coefficients significant at level α? If not, delete that (nonsignificant) term, e.g. X3, and recompute new coefficients for the remaining predictors!
  Step 3.  Repeat Steps 1-2 as necessary until all remaining coefficients are significant → reduced model.

Recall ~ for $k \ge 2$ independent, equivariant, normally-distributed “treatment groups” $Y_1, Y_2, \dots, Y_k$, one-way ANOVA tests $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ under $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$. When the model assumptions fail, transformations can help: re-plot the data on a “log-log” scale, or on a “log” scale (of Y only).

Binary outcome, e.g., “Have you ever had surgery?” (Yes / No)

The “log-odds” (“logit”) is an example of a general “link function” $g(\pi)$:

  $\ln\!\left(\dfrac{\hat{\pi}}{1 - \hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X \quad\Longleftrightarrow\quad \hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X)}}$

The coefficients are obtained by “MAXIMUM LIKELIHOOD ESTIMATION.” (Note: not based on Least Squares; this implies a “pseudo-$R^2$,” etc.)
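A minimal R sketch of fitting such a logistic model. The variables age and surg and their values are hypothetical, made up for illustration; glm() with family = binomial is the standard R call for logistic regression, fitted by maximum likelihood.

age  <- c(23, 35, 41, 48, 52, 57, 61, 66, 70, 74)   # hypothetical ages
surg <- c( 0,  0,  0,  1,  0,  1,  1,  1,  1,  1)   # 1 = ever had surgery, 0 = never (hypothetical)
fit  <- glm(surg ~ age, family = binomial)          # logit link by default
coef(fit)                                           # estimates of beta0-hat and beta1-hat
predict(fit, newdata = data.frame(age = 50), type = "response")   # pi-hat at age 50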
Binary outcome, e.g., “Have you ever had surgery?” (Yes / No), with multiple predictors

“log-odds” (“logit”):

  $\ln\!\left(\dfrac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k \quad\Longleftrightarrow\quad \hat{\pi} = \dfrac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k)}}$

Suppose one of the predictor variables is binary, say

  $X_1 = \begin{cases} 1, & \text{Age} \ge 50 \\ 0, & \text{Age} < 50 \end{cases}$

  $X_1 = 1:\quad \ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

  $X_1 = 0:\quad \ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \hat{\beta}_0 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

SUBTRACT!  With the other predictors held fixed,

  $\ln\!\left(\dfrac{\hat{\pi}_1}{1-\hat{\pi}_1}\right) - \ln\!\left(\dfrac{\hat{\pi}_0}{1-\hat{\pi}_0}\right) = \hat{\beta}_1$,   i.e.,   $\ln\!\left(\dfrac{\text{odds of surgery given Age} \ge 50}{\text{odds of surgery given Age} < 50}\right) = \hat{\beta}_1$,

so  $\ln(\widehat{OR}) = \hat{\beta}_1$  ………….. implies …………..  $\widehat{OR} = e^{\hat{\beta}_1}$.

The “logit” in population dynamics

Unrestricted population growth (e.g., bacteria): the population size y obeys the following law, with constant $a > 0$:

  $\dfrac{dy}{dt} = a\,y$

Separating variables:  $\dfrac{1}{y}\,dy = a\,dt \;\Rightarrow\; \ln|y| = at + b \;\Rightarrow\; y = e^{at+b} = e^{at}\,e^{b} = C\,e^{at}$.

With initial condition $y(0) = y_0$:   $y = y_0\, e^{at}$    (Exponential growth).

Restricted population growth (disease, predation, starvation, etc.): with constant $a > 0$ and “carrying capacity” M, the population size obeys

  $\dfrac{dy}{dt} = \dfrac{a}{M}\, y\,(M - y)$.

Let the survival probability be $\pi = \dfrac{y}{M}$. Then

  $\dfrac{d\pi}{dt} = a\,\pi\,(1 - \pi)$.

Separating variables:  $\dfrac{d\pi}{\pi(1-\pi)} = a\,dt \;\Rightarrow\; \left(\dfrac{1}{\pi} + \dfrac{1}{1-\pi}\right) d\pi = a\,dt \;\Rightarrow\; \ln|\pi| - \ln|1-\pi| = at + b \;\Rightarrow\; \ln\!\left(\dfrac{\pi}{1-\pi}\right) = at + b$.

With initial condition $\pi(0) = \pi_0$:   $\pi = \dfrac{\pi_0}{\pi_0 + (1 - \pi_0)\,e^{-at}}$    (Logistic growth).
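A minimal R sketch of the odds-ratio interpretation above. The surg and age50 vectors are hypothetical illustration data; exp() of the fitted coefficient gives the estimated OR, and confint.default() gives a Wald interval on the log-odds scale.

surg  <- c(0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1)   # 1 = ever had surgery (hypothetical)
age50 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1)   # 1 if Age >= 50, 0 if Age < 50 (hypothetical)
fit <- glm(surg ~ age50, family = binomial)
coef(fit)["age50"]                    # beta1-hat, the estimated log odds ratio
exp(coef(fit)["age50"])               # OR-hat = e^(beta1-hat)
exp(confint.default(fit)["age50", ])  # Wald 95% CI for the OR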