Stat 2560 Simple Linear Regression 9 March, 2010 Estimating σ 2 I Fitted (predicted) values ŷ1 , ŷ2 , . . . , ŷn can be obtained by successively substituting x1 , x2 , . . . , xn into the estimated regression line. I ie. ŷi = β̂0 + β̂1 xi , i = 1, 2, . . . , n I Residuals are the vertifcal deviations, yi − ŷi , i = 1, 2, . . . , n. I Using these residuals, we can estimae the σ 2 as Error Sum Pnof Squares,2 Pn SSE = i=1 (yi − ŷi ) = i=1 [yi − (β̂0 + β̂1 xi )]2 Pn (yi − ŷi )2 SSE = i=1 σ̂ 2 = s 2 = n−2 n−2 Estimation of σ 2 I Computation of SSE as per the formula given involves much tedious calculation I Instead, we can substitute the β values and simplify the formula as follows: SSE = n X i=1 I yi2 − β̂0 n X i=1 yi − β̂1 n X xi yi i=1 Computational formula is sensitive to effects of rounding in β̂0 and β̂1 , so try to get as many digits so that these estimate will protect against the rounding error. Coefficient of Determination I By fitting a regression model, we would like to know how good it is. I If the variation is Y can be explained by the regression model, then the regression model is good. I Coefficient of determination is the proportion of the variation in Y is explained by the regression model I Higher the coefficient of variation, the model is better. ie The regression model fitted is better. Coefficient of Determination I The coefficient of determination, denoted as r 2 , is given by r2 = 1 − I I I SSE SST where, SSE - sum of squares for error and SST - sum of squares for Total P P P SSE = ni=1 yi2 − β̂0 ni=1 yi − β̂1 ni=1 xi yi P P P SST = ni=1 (yi − ȳ )2 = ni=1 yi2 − ( ni=1 yi )2 /n Higher the value of r 2 , the more successful is the simple regression model in explaining the Y variation. Coefficient of Determination I Some software will present either r 2 or 100r 2 (percentage of variation explained) I r 2 > 0.8 is considered as satisfactory I If r 2 is small, an analyst will usually want to search for an alternative model (non-linear model or multiple regression model) I Note that r 2 will lie between 0 and 1, i.e 0 ≤ r 2 ≤ 1. Revisiting Example Consider the data actual (work-study) time (X) & booked time (Y) Actual time (mins):X= (112,145,208,192,184,223,108,89,47,160) Booked time (mins):Y=(150,180,210,150,180,180,120,60,60,150) Calculate P P various P sums P 2 P xi yi xi2 yi xi yi 1468 1440 244676 230400 233760 By substituting, we get β̂0 = 31.45 0.7667 Pn and2 β̂1 = P P SSE = i=1 yi − β̂0 ni=1 yi − β̂1 ni=1 xi yi = 230400- 31.45 x 1440 - 0.7667 x 233760 = 5888.200 2 σ̂ = SSE P/(n − 2) =P5888.200/8 = 736.025 SST = ni=1 yi2 − ( ni=1 yi )2 /n =230400 − 14402 /10 = 23040 SSE 2 = 1 − 5888.200/23040 = 0.744783 r = 1 − SST R -output Residuals: Min 1Q -39.684 -18.688 Median 0.814 3Q 16.177 Max 37.380 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 31.4454 24.8492 1.265 0.24132 x 0.7667 0.1589 4.826 0.00131 ** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 27.13 on 8 degrees of freedom Multiple R-squared: 0.7444, Adjusted R-squared: 0.7124 F-statistic: 23.29 on 1 and 8 DF, p-value: 0.001311 UFFI Example : Minitab Output Welcome to Minitab, press F1 for help. Retrieving worksheet from file: ’H:\My Documents\Desktop\uf Worksheet was saved on Wed Mar 03 2010 Regression Analysis: CH2O versus UFFI The regression equation is CH2O = 46.1 + 9.07 UFFI Predictor Constant UFFI S = 9.86755 Coef 46.122 9.074 SE Coef 2.849 4.028 R-Sq = 18.7% T 16.19 2.25 P 0.000 0.035 R-Sq(adj) = 15.0% Inference about the slope parameter β1 I The mean of β̂1 is E (β̂1 ) = β1 i.e β̂1 is an unbiased estimator of β I In the regression model, we assume that X is fixed and only Y is random. So variance of β̂1 can be estimated as V (β̂1 ) = σ̂β̂2 = 1 where Sxx = I P2 i=1 (xi − x̄)2 = Pn s2 Sxx 2 i=1 xi P − ( ni=1 xi )2 /n Standard deviation is the square root of variance σ̂β̂1 = sβ̂1 = √Ss xx I The estimator β̂1 follow normal distribution Testing and Confidence Interval for slope β1 I The assumption of normality helps us to make inference on the β1 I The standardized variable T = with n − 2 degrees of freedom I 100(1 − α)% confidence interval for the slope β1 is β̂1√ −β1 §/ Sxx follow t distribution β̂1 +tα/2,n−2 sβ̂1 I For example P P10 2 2 Sxx = 10 i=1 xi − ( i=1 xi ) /10 = 244676 − 215502.4 = 29173.6 p p sβ̂1 = s 2 /sxx = 736.025/29173.6 = 0.1589 95% confidence interval for β1 is β̂1 +t0.05/2,8 sβ̂1 = 0.7667+2.306x0.1589 = (0.4003, 1.1331) Testing and Confidence Interval for slope β1 I To test H0 : β1 = β , we compute the test statistic t= I I I β̂1 − β sβ̂1 The rejection region is Ha : β1 > β0 t > tα,n−2 Ha : β1 < β0 t < −tα,n−2 Ha : β1 6= β0 |t| > tα/2,n−2 We can also compute p-value. Most of the software report p-value of testing H0 : β1 = 0 against Ha : β1 6= 0 Testing β1 = 0 for our example, t = sβ̂1 = 0.7667 0.1589 = 4.825 > t0.025,8 = 2.306 β̂1 i.e slope is significantly different from zero R -output lm(formula = y ~ x) Residuals: Min 1Q -39.684 -18.688 Median 0.814 3Q 16.177 Max 37.380 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 31.4454 24.8492 1.265 0.24132 x 0.7667 0.1589 4.826 0.00131 ** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 Residual standard error: 27.13 on 8 degrees of freedom Multiple R-squared: 0.7444, Adjusted R-squared: 0.7124 F-statistic: 23.29 on 1 and 8 DF, p-value: 0.001311