Chapter 6: Simple Linear Regression

Regression Line
The line that best fits a collection of X-Y data -- the line that minimizes the sum of squared vertical distances (errors or residuals) from the points to the line. The line is also known as the "least squares line". The fitted straight line is of the form Ŷ = b0 + b1X.

Least Squares Method
Y = Observation, Ŷ = Fit
Error (residual) = Y - Ŷ
SSE = Sum of squared errors = Σ(Y - Ŷ)² = Σ(Y - b0 - b1X)²
The least squares method chooses the values of b0 and b1 that minimize the sum of squared errors (SSE). Also, Y = Ŷ + (Y - Ŷ), i.e. Y = Fit + Residual.

Statistical Model for Straight-Line Regression
Population model: Y = β0 + β1X + ε, i.e. Y = μY|X + ε,
where μY|X = β0 + β1X = the mean of Y for the given X value.
Assumption: the deviations ε are assumed to be independent and normally distributed with mean 0 and standard deviation σ.

Estimation
The unknowns to be estimated are β0, β1, and σ.
Estimate for β0 = b0 = INTERCEPT
Estimate for β1 = b1 = SLOPE
Estimate for σ = s = sy.x = standard error of the estimate = STEYX
(INTERCEPT, SLOPE, and STEYX are the corresponding Excel functions.)

Decomposition of Variance
Y = Fit + Residual = Ŷ + (Y - Ŷ). Subtracting Ȳ from both sides:
(Y - Ȳ) = (Ŷ - Ȳ) + (Y - Ŷ)
Squaring and summing (the cross-product term vanishes for the least-squares fit):
Σ(Y - Ȳ)² = Σ(Ŷ - Ȳ)² + Σ(Y - Ŷ)², i.e. SST = SSR + SSE
SST = Sum of Squares Total = Σ(Y - Ȳ)²
SSR = Sum of Squares Regression = Σ(Ŷ - Ȳ)²
SSE = Sum of Squares due to Error = Σ(Y - Ŷ)² = SST - SSR

ANOVA Table

Source     | Sum of squares | df  | Mean square     | F-test
Regression | SSR            | 1   | MSR = SSR/1     | F = MSR/MSE
Error      | SSE            | n-2 | MSE = SSE/(n-2) |
Total      | SST            | n-1 |                 |

(Note: MSE = s²y.x, and √MSE = sy.x = the standard error of the estimate.)

Coefficient of Determination
r² = SSR/SST

Sample Correlation Coefficient
rxy = (the sign of b1) · √r², where b1 = the slope of the regression equation.

Hypothesis Testing
Given the four model assumptions (reviewed below), the model is Y = β0 + β1X + ε. This regression model is statistically significant only if β1 ≠ 0. Therefore, the hypotheses for testing whether or not the regression is significant are:
H0: β1 = 0; Ha: β1 ≠ 0
To test these hypotheses, either a t-test or an F-test may be used.
t-test: t = b1/sb1, where sb1 = the standard error of the slope, with df = n-2
F-test: F = MSR/MSE; also note that F = t²

Forecasting Y
Point prediction: Ŷ = b0 + b1X
Interval prediction: Ŷ ± t·sf, where sf = the standard error of the forecast and t has df = n-2.

Estimated Simple Linear Regression Equation
y = b0 + b1x + e, where b0 = the y-intercept and b1 = the slope of the line.

Review of Assumptions
1. The mean of Y (μY) = β0 + β1X
2. For a given X, the Y values follow a normal distribution
3. The dispersion (variance) of the Y values remains constant everywhere along the line
4. The error terms (ε) are independent

Analysis of Residuals
Recall the assumptions made for statistical analysis of regression:
1. The underlying relation is linear (μY = β0 + β1X)
2. The errors are independent
3. The errors have constant variance
4. The errors are normally distributed

Residual Plots Used for Verifying Assumptions

Plot                                            | What it checks
Histogram of residuals                          | Normality assumption -- moderate deviation from a bell shape is permitted
Residuals (y-axis) vs. fitted values Ŷ (x-axis) | Linearity assumption -- if the plot is not completely random, a transformation may be considered
Residuals vs. explanatory variable (x)          | Also checks the linear model and the constant-variance assumption
Residuals over time (for time series)           | Used for time-series data; checks all assumptions
Autocorrelation of residuals                    | Independence of the residuals

Variable Transformations
1. 1/x:    Y = β0 + β1(1/X) + ε
2. log(x): Y = β0 + β1 log(X) + ε
3. √x:     Y = β0 + β1√X + ε
4. x²:     Y = β0 + β1X² + ε

The short Python sketches that follow work through these computations on made-up illustration data.
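To make the estimation and decomposition formulas concrete, here is a minimal numpy sketch; the x and y arrays are made-up illustration data, and the variable names are mine rather than from the notes.

```python
import numpy as np

# Made-up illustration data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()

# Least squares estimates (Excel: SLOPE and INTERCEPT)
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

yhat = b0 + b1 * x              # fitted values
resid = y - yhat                # residuals

# Decomposition of variance: SST = SSR + SSE
SST = np.sum((y - ybar) ** 2)
SSR = np.sum((yhat - ybar) ** 2)
SSE = np.sum(resid ** 2)

s_yx = np.sqrt(SSE / (n - 2))   # standard error of the estimate (Excel: STEYX)
r2 = SSR / SST                  # coefficient of determination

print(f"b0={b0:.4f}  b1={b1:.4f}  s_yx={s_yx:.4f}  r^2={r2:.4f}")
print(f"SST={SST:.4f}  SSR+SSE={SSR + SSE:.4f}")   # the two should match
```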
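A companion sketch of the significance tests on the same made-up data; it verifies numerically that the F statistic equals t² and that the two tests give identical p-values (scipy is assumed to be available).

```python
import numpy as np
from scipy import stats

# Same made-up illustration data as the previous sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
SSE = np.sum((y - yhat) ** 2)
SSR = np.sum((yhat - y.mean()) ** 2)

# t-test for H0: beta1 = 0 vs. Ha: beta1 != 0, df = n - 2
s_yx = np.sqrt(SSE / (n - 2))          # standard error of the estimate
s_b1 = s_yx / np.sqrt(Sxx)             # standard error of the slope
t_stat = b1 / s_b1
p_t = 2 * stats.t.sf(abs(t_stat), n - 2)

# F-test from the ANOVA table: F = MSR/MSE with 1 and n-2 df
F_stat = (SSR / 1) / (SSE / (n - 2))
p_F = stats.f.sf(F_stat, 1, n - 2)

print(f"t = {t_stat:.3f}, t^2 = {t_stat**2:.3f}, F = {F_stat:.3f}")   # F equals t^2
print(f"p-value (t-test) = {p_t:.3g}, p-value (F-test) = {p_F:.3g}")  # identical
```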
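The notes give the prediction interval as Ŷ ± t·sf but do not spell out sf. The sketch below assumes the standard textbook formula for the standard error of forecasting an individual Y, sf = sy.x·√(1 + 1/n + (x* - x̄)²/Σ(x - x̄)²), again on the same made-up data; the forecast point x_new is an arbitrary choice.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

x_new = 6.5                         # X value to forecast at (arbitrary)
y_hat = b0 + b1 * x_new             # point prediction

# Assumed textbook formula for the standard error of the forecast
s_f = s_yx * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / Sxx)

t_crit = stats.t.ppf(0.975, n - 2)  # 95% two-sided, df = n - 2
print(f"point = {y_hat:.3f}, 95% PI = ({y_hat - t_crit * s_f:.3f}, "
      f"{y_hat + t_crit * s_f:.3f})")
```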
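A short matplotlib sketch of the first two diagnostic plots from the table above (histogram of residuals, and residuals vs. fitted values), on the same made-up data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
resid = y - yhat

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].hist(resid, bins=5)                  # check: roughly bell-shaped?
axes[0].set_title("Histogram of residuals")
axes[1].scatter(yhat, resid)                 # check: random scatter around 0?
axes[1].axhline(0, linestyle="--")
axes[1].set_xlabel("Fitted values (Y-hat)")
axes[1].set_ylabel("Residuals")
axes[1].set_title("Residuals vs. fitted values")
plt.tight_layout()
plt.show()
```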
Growth Curves
Growth curves are used for long-term forecasts based on annual data.

Exponential growth: sales increase over time by the same percentage each time period.
Y = b0·b1^t
Taking logs: Log(Y) = Log(b0) + Log(b1)·t = b0' + b1'·t, where b0' = Log(b0) and b1' = Log(b1). Conversely, b0 = 10^b0' and b1 = 10^b1'.

Linear growth: sales increase over time by the same amount each period.

The log transformation above may be used to convert exponential growth into a linear model, so the line can be fitted by least squares; a sketch follows.
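A small sketch of fitting an exponential growth curve by the log transformation described above, using base-10 logs as in the notes; the annual sales series is simulated (roughly 10% growth per year) purely for illustration.

```python
import numpy as np

# Simulated annual sales with roughly 10% growth per year (illustration only)
t = np.arange(1, 9)                                      # years 1..8
rng = np.random.default_rng(0)
sales = 100 * 1.10 ** t * rng.normal(1.0, 0.02, size=t.size)

# Fit the straight line Log(Y) = b0' + b1'*t  (base-10 logs)
b1p, b0p = np.polyfit(t, np.log10(sales), 1)

# Convert back: b0 = 10^b0', b1 = 10^b1'
b0, b1 = 10 ** b0p, 10 ** b1p
print(f"Y-hat = {b0:.2f} * {b1:.4f}^t  (about {100 * (b1 - 1):.1f}% growth/year)")
```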