The simple linear regression model makes the following assumptions:

i) The relationship between the predictor variable and the response variable is linear, apart from random error;
ii) The random error terms in the model are independent and identically distributed, following a normal distribution with mean 0 and variance $\sigma^2$.

In any situation in which we want to use simple linear regression, these assumptions need to be checked so that we can be confident that the model is appropriate for the data.

We check the first assumption using the scatterplot of the response variable against the predictor variable. We check the second assumption using the residuals from the model:

If the data are collected in a time sequence, we may check the assumption of independence using a time series plot of the residuals. If we see any pattern, then we will not accept the assumption of independence.

We will do a normal q-q plot of the residuals to check the assumption of normality.

We will plot the residuals against the predictor variable to check the assumption of constant variance. The residuals should be scattered randomly about the 0 line for all x. If we see any pattern, then we will not accept the assumption of constant variance.

Example: (stainless steel stress fracture example, continued)

We have already done a scatterplot and seen that the relationship between applied tensile stress and time to fracture seems to be linear. The residuals for the model are given in the table below:

RESIDUAL OUTPUT

Observation    Predicted Y     Residuals
 1             64.16548673     -1.165486726
 2             61.91327434     -3.913274336
 3             57.40884956     -2.408849558
 4             52.90442478      8.095575221
 5             50.65221239     11.34778761
 6             48.4           -11.4
 7             43.89557522     -5.895575221
 8             39.39115044      5.608849558
 9             34.88672566     11.11327434
10             30.38230088    -11.38230088

Excel gives residual plots and normal q-q plots as options for regression. The normal q-q plot from this option actually is not very informative.
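Hence we will do a normal q-q plot of the residuals using the handout. The plots are shown below:

[Figure: X Variable 1 Residual Plot. Residuals plotted against X Variable 1 (applied tensile stress).]

The plot of the residuals v. tensile stress shows no obvious pattern. Hence we will accept the assumption of homoscedasticity (constant variance). From the normal q-q plot, we see no reason to doubt the assumption of normality.

[Figure: Normal Q-Q Plot of Residuals. Standardized Order Statistics plotted against Standardized Normal Scores.]

Confidence Intervals in Simple Linear Regression

If the error terms in the model satisfy the assumptions of being i.i.d. normal, then we have

$$\hat{\beta}_1 \sim \text{Normal}\left(\beta_1,\ \frac{\sigma^2}{SS_{xx}}\right), \quad\text{and}\quad \hat{\beta}_0 \sim \text{Normal}\left(\beta_0,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right]\right).$$

Hence,

$$\frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)} \sim t_{n-2}, \quad\text{where}\quad se(\hat{\beta}_1) = \sqrt{\frac{MSE}{SS_{xx}}};$$

and

$$\frac{\hat{\beta}_0 - \beta_0}{se(\hat{\beta}_0)} \sim t_{n-2}, \quad\text{where}\quad se(\hat{\beta}_0) = \sqrt{MSE\left[\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}\right]}.$$

Then a $100(1-\alpha)\%$ confidence interval estimate for $\beta_1$ is

$$\hat{\beta}_1 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_1),$$

and a $100(1-\alpha)\%$ confidence interval estimate for $\beta_0$ is

$$\hat{\beta}_0 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_0).$$

Example: (stainless steel stress fracture example, continued)

A 95% confidence interval estimate for the slope of the line of best fit is

$$\hat{\beta}_1 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_1) = -0.900884956 \pm (2.306)(0.242775962) = \left(-1.4607\ \tfrac{\text{hrs.}}{\text{kg/mm}^2},\ -0.3410\ \tfrac{\text{hrs.}}{\text{kg/mm}^2}\right),$$

and a 95% confidence interval estimate for the intercept of the line of best fit is

$$\hat{\beta}_0 \pm t_{\alpha/2;\,n-2}\, se(\hat{\beta}_0) = 66.41769912 \pm (2.306)(5.648129399) = (53.3931\ \text{hrs.},\ 79.4423\ \text{hrs.}).$$

Sometimes we want an estimate of the conditional mean of the response variable at a particular value of the predictor variable. An unbiased estimator of the conditional mean at $x = x_0$ is

$$\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0,$$

which has variance

$$V\left(\hat{\mu}_{Y|x_0}\right) = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}\right].$$

Then a $100(1-\alpha)\%$ confidence interval estimate of the conditional mean at $x = x_0$ is

$$\hat{\mu}_{Y|x_0} \pm t_{\alpha/2;\,n-2}\, se\left(\hat{\mu}_{Y|x_0}\right), \quad\text{where}\quad se\left(\hat{\mu}_{Y|x_0}\right) = \sqrt{MSE\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}\right]}.$$

Example: (stainless steel stress fracture example, continued)

A point estimate of the mean time to fracture when the stress is at 45 kg/mm² is

$$\hat{\mu}_{Y|45} = \hat{\beta}_0 + \hat{\beta}_1 x_0 = 66.41769912 - (0.900884956)(45) = 25.8779\ \text{hrs.}$$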
The mean stress for the sample is $\bar{x} = 20$ kg/mm², and $SS_{xx} = (n-1)S^2 = 1412.5$, where $S^2$ is the sample variance of the stress values. Also, MSE = 83.25298673. Then a 95% confidence interval estimate for the mean time to fracture when the stress is at 45 kg/mm² is

$$\hat{\mu}_{Y|x_0} \pm t_{\alpha/2;\,n-2}\, se\left(\hat{\mu}_{Y|x_0}\right) = 25.8779\ \text{hrs.} \pm 2.306\sqrt{83.25298673\left[0.10 + \frac{(45-20)^2}{1412.5}\right]} = (10.3808\ \text{hrs.},\ 41.3750\ \text{hrs.}).$$

Note: The standard error of $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is an increasing function of the squared difference between $x_0$ and $\bar{x}$. Hence the confidence interval will be narrowest at the mean of x, and will increase in width as the distance from the mean increases. (See p. 279).
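As an arithmetic check on the intervals in this example, the short Python sketch below recomputes them from the summary statistics quoted in the notes (n = 10, the fitted intercept and slope, MSE, the mean stress, SSxx, and the tabled critical value 2.306). The function and variable names are illustrative assumptions; they are not part of the notes or of Excel's regression output.

```python
import math

# Summary statistics quoted in the stainless steel example.
N = 10
B0 = 66.41769912     # estimated intercept (hrs)
B1 = -0.900884956    # estimated slope (hrs per kg/mm^2)
MSE = 83.25298673
X_BAR = 20.0         # mean stress (kg/mm^2)
SS_XX = 1412.5
T_CRIT = 2.306       # t critical value for 95% confidence, 8 d.f., as used in the notes


def se_slope(mse, ss_xx):
    """Standard error of the slope: sqrt(MSE / SSxx)."""
    return math.sqrt(mse / ss_xx)


def se_mean_response(x0, mse, n, x_bar, ss_xx):
    """Standard error of the estimated conditional mean at x0."""
    return math.sqrt(mse * (1.0 / n + (x0 - x_bar) ** 2 / ss_xx))


def interval(estimate, se, t_crit):
    """Two-sided interval: estimate +/- t_crit * se."""
    half_width = t_crit * se
    return (estimate - half_width, estimate + half_width)


slope_ci = interval(B1, se_slope(MSE, SS_XX), T_CRIT)

# The intercept is the conditional mean at x0 = 0, so the same formula applies.
intercept_ci = interval(B0, se_mean_response(0.0, MSE, N, X_BAR, SS_XX), T_CRIT)

mean_45_ci = interval(B0 + B1 * 45.0,
                      se_mean_response(45.0, MSE, N, X_BAR, SS_XX), T_CRIT)

# Interval widths at a few stress values, to illustrate the closing note.
widths = {x0: 2 * T_CRIT * se_mean_response(x0, MSE, N, X_BAR, SS_XX)
          for x0 in (20.0, 30.0, 45.0)}

print("slope CI:", slope_ci)
print("intercept CI:", intercept_ci)
print("mean response CI at 45:", mean_45_ci)
print("interval widths by x0:", widths)
```

The widths printed in the last line grow as x0 moves away from the mean stress of 20 kg/mm², which is the behaviour described in the note above.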