PROBABILITY AND STATISTICS FOR SCIENTISTS AND ENGINEERS

Correlation and Regression Analysis: An Application

Jerrell T. Stracener, Ph.D.

Montgomery, Peck, and Vining (2001) present data concerning the performance of the 28 National Football League teams in 1976. It is suspected that the number of games won (y) is related to the number of yards gained rushing by an opponent (x). The data are shown in the following table:

Team               Games Won (y)   Yards Rushing by Opponent (x)
Washington               10              2205
Minnesota                11              2096
New England              11              1847
Oakland                  13              1903
Pittsburgh               10              1457
Baltimore                11              1848
Los Angeles              10              1564
Dallas                   11              1821
Atlanta                   4              2577
Buffalo                   2              2476
Chicago                   7              1984
Cincinnati               10              1917
Cleveland                 9              1761
Denver                    9              1709
Detroit                   6              1901
Green Bay                 5              2288
Houston                   5              2072
Kansas City               5              2861
Miami                     6              2411
New Orleans               4              2289
New York Giants           3              2203
New York Jets             3              2592
Philadelphia              4              2053
St. Louis                10              1979
San Diego                 6              2048
San Francisco             8              1786
Seattle                   2              2876
Tampa Bay                 0              2560

Correlation Analysis

Statistical analysis used to obtain a quantitative measure of the strength of the relationship between a dependent variable and one or more independent variables.

Scatter Plot

[Figure: scatter plot of the data]

Sample correlation coefficient

$$\hat{\rho} = r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\left\{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]\right\}^{1/2}}$$

Notes: $-1 \le r \le 1$

For the football data:

$$r = \frac{28(386{,}127) - (59{,}084)(195)}{\left\{\left[28(128{,}284{,}292) - (59{,}084)^2\right]\left[28(1{,}685) - (195)^2\right]\right\}^{1/2}} = -0.738$$

Correlation

To test for no linear association between x and y, calculate

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

where r is the sample correlation coefficient and n is the sample size.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.738\sqrt{28-2}}{\sqrt{1-(-0.738)^2}} = -5.5766$$

Correlation

Conclude no linear association if

$$-t_{\alpha/2,\,n-2} < t < t_{\alpha/2,\,n-2}$$

and then treat $y_1, y_2, \ldots, y_n$ as a random sample.
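The correlation calculation and the t test above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names `sample_correlation` and `t_statistic` are illustrative, the data are copied from the table, and the critical value 2.0555 is hard-coded from a t-table (an assumption of this sketch) rather than computed.

```python
import math

# Games won (y) and opponents' rushing yards (x), in the table's team order.
y = [10, 11, 11, 13, 10, 11, 10, 11, 4, 2, 7, 10, 9, 9,
     6, 5, 5, 5, 6, 4, 3, 3, 4, 10, 6, 8, 2, 0]
x = [2205, 2096, 1847, 1903, 1457, 1848, 1564, 1821, 2577, 2476,
     1984, 1917, 1761, 1709, 1901, 2288, 2072, 2861, 2411, 2289,
     2203, 2592, 2053, 1979, 2048, 1786, 2876, 2560]


def sample_correlation(x, y):
    """Sample correlation coefficient r from the raw-sum formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


def t_statistic(r, n):
    """Test statistic for H0: no linear association, with n - 2 d.f."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = sample_correlation(x, y)
t = t_statistic(r, len(x))

# Assumed table value of t_{0.025, 26}; not computed here.
T_CRIT = 2.0555
linear_association = abs(t) >= T_CRIT

# r is about -0.738 and t about -5.58 for these data.
print(f"r = {r:.3f}, t = {t:.3f}, linear association: {linear_association}")
```

Working from the raw sums mirrors the hand calculation; small differences in the last digits of t come from rounding r before substituting it.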
Correlation

Take $\alpha = 0.05$. From the t-table, we get

$$-t_{\alpha/2,\,n-2} = -t_{0.025,\,26} = -2.0555$$

Since $t = -5.5766 < -2.0555$, we conclude that there is a linear association between x and y. Therefore, proceed with regression analysis.

Linear Regression Model

Simple linear regression model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where Y is the response (or dependent) variable, $\beta_0$ and $\beta_1$ are the unknown parameters, $\varepsilon \sim N(0, \sigma)$, and the data are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Least squares estimates of $\beta_0$ and $\beta_1$

$$b_1 = \hat{\beta}_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$b_0 = \hat{\beta}_0 = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i\right)$$

Estimate of $\beta_1$

$$b_1 = \frac{28(386{,}127) - (59{,}084)(195)}{28(128{,}284{,}292) - (59{,}084)^2} = -0.00703$$

Estimate of $\beta_0$

$$b_0 = \frac{1}{28}\left[195 - (-0.00703)(59{,}084)\right] = 21.7883$$

Least squares regression equation

The point estimate of the linear model $Y = \beta_0 + \beta_1 x + \varepsilon$ is

$$\hat{Y} = 21.78825 - 0.00703x$$

Regression Fitted Line Plot

[Figure: fitted line plot]

Point estimate of $\sigma^2$

$$\hat{\sigma}^2 = S^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{Y}_i\right)^2 = \frac{1}{n-2}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 - \frac{b_1}{n}\left(n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i\right)\right] = 5.726$$

so that $S = \sqrt{5.726} = 2.3929$.

Interval Estimates for y intercept ($\beta_0$)

A $(1-\alpha)\cdot 100\%$ confidence interval for $\beta_0$ is $(\beta_{0L}, \beta_{0U})$, where

$$\beta_{0L} = b_0 - t_{\alpha/2,\,n-2}\, S_{b_0} \quad\text{and}\quad \beta_{0U} = b_0 + t_{\alpha/2,\,n-2}\, S_{b_0}$$

and

$$S_{b_0} = S\left[\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

Interval Estimates for y intercept ($\beta_0$)

Take $\alpha = 0.05$. For the 95% confidence interval for $\beta_0$,

$$S_{b_0} = 2.3929\left[\frac{128{,}284{,}292}{28(128{,}284{,}292) - (59{,}084)^2}\right]^{1/2} = 2.696$$
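The least squares estimates, the variance estimate, and the standard error of the intercept can be reproduced from the raw sums. The following is a minimal sketch under the assumption that the 28 data pairs are entered in the order of the data table; the function name `fit_simple_linear_regression` is illustrative and does not come from the slides.

```python
import math

# Games won (y) and opponents' rushing yards (x), in the table's team order.
y = [10, 11, 11, 13, 10, 11, 10, 11, 4, 2, 7, 10, 9, 9,
     6, 5, 5, 5, 6, 4, 3, 3, 4, 10, 6, 8, 2, 0]
x = [2205, 2096, 1847, 1903, 1457, 1848, 1564, 1821, 2577, 2476,
     1984, 1917, 1761, 1709, 1901, 2288, 2072, 2861, 2411, 2289,
     2203, 2592, 2053, 1979, 2048, 1786, 2876, 2560]


def fit_simple_linear_regression(x, y):
    """Return (b0, b1, s2, s_b0) computed from the raw-sum formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)

    d = n * sxx - sx ** 2        # n*sum(x^2) - (sum x)^2
    c = n * sxy - sx * sy        # n*sum(xy) - sum(x)*sum(y)

    b1 = c / d                   # slope estimate
    b0 = (sy - b1 * sx) / n      # intercept estimate
    # Residual variance: [sum(y^2) - (sum y)^2/n - (b1/n)*c] / (n - 2)
    s2 = (syy - sy ** 2 / n - (b1 / n) * c) / (n - 2)
    # Standard error of the intercept: S * sqrt(sum(x^2) / d)
    s_b0 = math.sqrt(s2) * math.sqrt(sxx / d)
    return b0, b1, s2, s_b0


b0, b1, s2, s_b0 = fit_simple_linear_regression(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, S^2 = {s2:.3f}, S_b0 = {s_b0:.3f}")
```

The code carries the unrounded slope through every step, so the printed intercept agrees with 21.7883 even though substituting the rounded slope -0.00703 by hand gives a slightly different value.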
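Interval Estimates for y intercept ($\beta_0$)

Apply $S_{b_0}$ to the equations to get the lower and upper bounds for $\beta_0$:

$$\beta_{0L} = b_0 - t_{\alpha/2,\,n-2}\, S_{b_0} = 21.7883 - 2.056(2.696) = 16.246$$

$$\beta_{0U} = b_0 + t_{\alpha/2,\,n-2}\, S_{b_0} = 21.7883 + 2.056(2.696) = 27.33$$

Interval Estimates for slope ($\beta_1$)

A $(1-\alpha)\cdot 100\%$ confidence interval for $\beta_1$ is $(\beta_{1L}, \beta_{1U})$, where

$$\beta_{1L} = b_1 - t_{\alpha/2,\,n-2}\, S_{b_1} \quad\text{and}\quad \beta_{1U} = b_1 + t_{\alpha/2,\,n-2}\, S_{b_1}$$

and

$$S_{b_1} = \frac{S}{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]^{1/2}}$$

Interval Estimates for slope ($\beta_1$)

$$S_{b_1} = \frac{2.3929}{\left[128{,}284{,}292 - \frac{(59{,}084)^2}{28}\right]^{1/2}} = 0.00126$$

$$\beta_{1L} = b_1 - t_{\alpha/2,\,n-2}\, S_{b_1} = -0.00703 - 2.056(0.00126) = -0.00961$$

$$\beta_{1U} = b_1 + t_{\alpha/2,\,n-2}\, S_{b_1} = -0.00703 + 2.056(0.00126) = -0.00444$$

Confidence interval for conditional mean of Y, given x = 2205

Given x = 2205, we can calculate the confidence interval for the conditional mean of Y. Here $\hat{Y}(2205) = 6.298$ and $(x - \bar{x})^2 = (2205 - 2110.14)^2 = 8997.878$.

$$\hat{\mu}_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[\frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

$$\hat{\mu}_L(2205) = 6.298 - 2.056(2.3929)\left[\frac{1}{28} + \frac{28(8997.878)}{28(128{,}284{,}292) - (59{,}084)^2}\right]^{1/2} = 6.298 - 0.962 = 5.336$$

and

$$\hat{\mu}_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[\frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

$$\hat{\mu}_U(2205) = 6.298 + 2.056(2.3929)\left[\frac{1}{28} + \frac{28(8997.878)}{28(128{,}284{,}292) - (59{,}084)^2}\right]^{1/2} = 6.298 + 0.962 = 7.260$$

Prediction interval for a single future value of Y, given x

$$\hat{Y}_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

and

$$\hat{Y}_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

Prediction interval for a single future value of Y, given x = 2000

Given x = 2000,

$$\hat{Y}(2000) = 21.7883 - 0.00703 \times 2000 = 7.738$$

(evaluated with the unrounded slope $b_1 = -0.0070251$).

$$\hat{Y}_L(x) = \hat{Y}(x) - t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

$$\hat{Y}_L(2000) = 7.738 - 2.056(2.3929)\left[1 + \frac{1}{28} + \frac{28(12130.82)}{28(128{,}284{,}292) - (59{,}084)^2}\right]^{1/2} = 2.723$$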
Prediction interval for a single future value of Y, given x = 2000

and

$$\hat{Y}_U(x) = \hat{Y}(x) + t_{\alpha/2,\,n-2}\, S\left[1 + \frac{1}{n} + \frac{n(x-\bar{x})^2}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\right]^{1/2}$$

$$\hat{Y}_U(2000) = 7.738 + 2.056(2.3929)\left[1 + \frac{1}{28} + \frac{28(12130.82)}{28(128{,}284{,}292) - (59{,}084)^2}\right]^{1/2} = 12.75$$

Prediction Interval

[Figure: prediction interval plot]
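The prediction interval for a single future value, and the narrower confidence interval for the conditional mean, differ only by the leading 1 inside the bracket. The following is a minimal sketch of both, not part of the original slides: the function name `interval_at` and its `new_observation` flag are illustrative, and the critical value 2.056 is hard-coded from a t-table (an assumption of this sketch). Results may differ in the last digits from hand calculations that round intermediate quantities.

```python
import math

# Games won (y) and opponents' rushing yards (x), in the table's team order.
y = [10, 11, 11, 13, 10, 11, 10, 11, 4, 2, 7, 10, 9, 9,
     6, 5, 5, 5, 6, 4, 3, 3, 4, 10, 6, 8, 2, 0]
x = [2205, 2096, 1847, 1903, 1457, 1848, 1564, 1821, 2577, 2476,
     1984, 1917, 1761, 1709, 1901, 2288, 2072, 2861, 2411, 2289,
     2203, 2592, 2053, 1979, 2048, 1786, 2876, 2560]


def interval_at(x0, x, y, t_crit=2.056, new_observation=True):
    """Interval for Y at x0.

    new_observation=True  -> prediction interval for one future value
    new_observation=False -> confidence interval for the conditional mean
    t_crit is an assumed table value of t_{alpha/2, n-2}.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)

    d = n * sxx - sx ** 2
    c = n * sxy - sx * sy
    b1 = c / d
    b0 = (sy - b1 * sx) / n
    s = math.sqrt((syy - sy ** 2 / n - (b1 / n) * c) / (n - 2))

    y_hat = b0 + b1 * x0
    x_bar = sx / n
    extra = 1.0 if new_observation else 0.0
    half_width = t_crit * s * math.sqrt(
        extra + 1.0 / n + n * (x0 - x_bar) ** 2 / d
    )
    return y_hat - half_width, y_hat + half_width


# Prediction interval at x = 2000: approximately (2.72, 12.75).
pred_lo, pred_hi = interval_at(2000, x, y)
print(f"prediction interval at x=2000: ({pred_lo:.3f}, {pred_hi:.3f})")

# Confidence interval for the conditional mean at x = 2205.
mean_lo, mean_hi = interval_at(2205, x, y, new_observation=False)
print(f"mean interval at x=2205: ({mean_lo:.3f}, {mean_hi:.3f})")
```

Keeping both intervals in one function makes the relationship explicit: the prediction interval is always wider because it also accounts for the variance of the new observation itself.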