Least Squares Regression Models

© 2010 Pearson Prentice Hall. All rights reserved

The least-squares regression model is given by

$$y_i = \beta_1 x_i + \beta_0 + \varepsilon_i$$

where
• $y_i$ is the value of the response variable for the ith individual
• $\beta_0$ and $\beta_1$ are the parameters to be estimated based on sample data
• $x_i$ is the value of the explanatory variable for the ith individual
• $\varepsilon_i$ is a random error term with mean 0 and variance $\sigma_{\varepsilon_i}^2 = \sigma^2$; the error terms are independent and normally distributed
• $i = 1, \dots, n$, where n is the sample size (number of ordered pairs in the data set)

Formulas for the Slope and Intercept Estimates

For the estimated regression equation

$$\hat{y}_i = b_0 + b_1 x_i$$

the slope $b_1$ is calculated by

$$b_1 = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$

and the intercept $b_0$ can be found with

$$b_0 = \bar{y} - b_1 \bar{x}$$

The standard error of the estimate, $s_e$, is found using the formula

$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$

Parallel Example 2: Compute the Standard Error

Compute the standard error of the estimate for the drilling data presented in the table below.

Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
35                                             5.88
50                                             5.99
75                                             6.74
95                                             6.1
120                                            7.47
130                                            6.93
145                                            6.42
155                                            7.97
160                                            7.92
175                                            7.62
185                                            6.89
190                                            7.9

Solution

Step 1: Using technology (e.g., Minitab), we find the least-squares regression line to be

$$\hat{y} = 0.0116x + 5.5273$$

Steps 2 and 3: The predicted values and the residuals for the 12 observations are given in the table below.

Depth, x    Time, y    $\hat{y}$    $y - \hat{y}$    $(y - \hat{y})^2$
35          5.88       5.9333       -0.0533          0.0028
50          5.99       6.1073       -0.1173          0.0138
75          6.74       6.3973        0.3427          0.1174
95          6.1        6.6293       -0.5293          0.2802
120         7.47       6.9193        0.5507          0.3033
130         6.93       7.0353       -0.1053          0.0111
145         6.42       7.2093       -0.7893          0.6230
155         7.97       7.3253        0.6447          0.4156
160         7.92       7.3833        0.5367          0.2880
175         7.62       7.5573        0.0627          0.0039
185         6.89       7.6733       -0.7833          0.6136
190         7.9        7.7313        0.1687          0.0285
                                     $\sum \text{residuals}^2$ = 2.7012

Step 4: We find the sum of the squared residuals by summing the last column of the table:

$$\sum \text{residuals}^2 = 2.7012$$

Step 5: The standard error of the estimate is then given by

$$s_e = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\frac{2.7012}{10}} = 0.5197$$

CAUTION! Be sure to divide by n − 2 when computing the standard error of the estimate.

Parallel Example 4: Verifying That the Residuals Are Normally Distributed

Verify that the residuals from the drilling example are normally distributed.

[Figure: normal probability plot of the residuals]

Conclusion: We have insufficient evidence at the 5% level of significance to support the claim that the residual errors from this model are not normally distributed.

Hypothesis Test Regarding the Slope Coefficient, $\beta_1$

To test whether two quantitative variables are linearly related, we use the following steps, provided that
1. the sample is obtained using random sampling;
2. the residuals are normally distributed with constant error variance.

Step 1: Determine the null and alternative hypotheses. The hypotheses can be structured in one of three ways:

Two-Tailed                Left-Tailed               Right-Tailed
$H_0: \beta_1 = 0$        $H_0: \beta_1 = 0$        $H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$     $H_1: \beta_1 < 0$        $H_1: \beta_1 > 0$

Step 2: Select a level of significance, $\alpha$, depending on the seriousness of making a Type I error.

Step 3: Compute the test statistic

$$t_0 = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1}{s_{b_1}}$$

which follows Student's t-distribution with n − 2 degrees of freedom.
Remember, when computing the test statistic, we assume the null hypothesis to be true. So, we assume that $\beta_1 = 0$.

P-Value Approach

Step 4: Use Table VI to estimate the P-value using n − 2 degrees of freedom.
• Two-tailed: the P-value is the sum of the area to the left of $-|t_0|$ and the area to the right of $|t_0|$.
• Left-tailed: the P-value is the area to the left of $t_0$.
• Right-tailed: the P-value is the area to the right of $t_0$.

Step 5: If the P-value < $\alpha$, reject the null hypothesis.

Step 6: State the conclusion.

CAUTION! Before testing $H_0: \beta_1 = 0$, be sure to draw a residual plot to verify that a linear model is appropriate.

Parallel Example 5: Testing for a Linear Relation

Test the claim that there is a linear relation between drill depth and drill time at the $\alpha = 0.05$ level of significance using the drilling data.

Solution

Verify the requirements:
• We assume that the experiment was randomized so that the data can be assumed to represent a random sample.
• In Parallel Example 4 we confirmed that the residuals were normally distributed by constructing a normal probability plot.
• To verify the requirement of constant error variance, we plot the residuals against the explanatory variable, drill depth.

[Figure: residuals plotted against drill depth]

There is no discernible pattern.

Step 1: We want to determine whether a linear relation exists between drill depth and drill time without regard to the sign of the slope. This is a two-tailed test with

$H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$

Step 2: The level of significance is $\alpha = 0.05$.

Step 3: Using technology, we obtained an estimate of $\beta_1$ in Parallel Example 2, $b_1 = 0.0116$. To determine the standard deviation of $b_1$, we compute $\sum (x_i - \bar{x})^2$, where $\bar{x} = 126.25$. The calculations are in the table below.

Depth, x    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
35          -91.25             8326.5625
50          -76.25             5814.0625
75          -51.25             2626.5625
95          -31.25              976.5625
120          -6.25               39.0625
130           3.75               14.0625
145          18.75              351.5625
155          28.75              826.5625
160          33.75             1139.0625
175          48.75             2376.5625
185          58.75             3451.5625
190          63.75             4064.0625
                               $\sum (x_i - \bar{x})^2$ = 30006.25

Step 3, cont'd: We have

$$s_{b_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{0.5197}{\sqrt{30006.25}} = 0.0030$$

The test statistic is

$$t_0 = \frac{b_1}{s_{b_1}} = \frac{0.0116}{0.003} = 3.867$$

Solution: P-Value Approach

Step 4: Since this is a two-tailed test, the P-value is the sum of the area under the t-distribution with 12 − 2 = 10 degrees of freedom to the left of $-t_0 = -3.867$ and to the right of $t_0 = 3.867$. Using Table VI we find that with 10 degrees of freedom, the value 3.867 is between 3.581 and 4.144, corresponding to right-tail areas of 0.0025 and 0.001, respectively. Doubling these areas for the two tails, the P-value is between 0.002 and 0.005.

Step 5: Since the P-value is less than the level of significance, 0.05, we reject the null hypothesis.

Step 6: There is sufficient evidence at the $\alpha = 0.05$ level of significance to conclude that a linear relation exists between drill depth and drill time.

Confidence Intervals for the Slope of the Regression Line

A $(1 - \alpha) \cdot 100\%$ confidence interval for the slope of the true regression line, $\beta_1$, is given by the following formulas:

Lower bound: $$b_1 - t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = b_1 - t_{\alpha/2} \cdot s_{b_1}$$

Upper bound: $$b_1 + t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}} = b_1 + t_{\alpha/2} \cdot s_{b_1}$$

Here, $t_{\alpha/2}$ is computed using n − 2 degrees of freedom.

Note: The confidence interval formula for $\beta_1$ can be used only if the data are randomly obtained, the residuals are normally distributed, and there is constant error variance.

Parallel Example 7: Constructing a Confidence Interval for the Slope of the True Regression Line

Construct a 95% confidence interval for the slope of the least-squares regression line for the drilling example.

Solution

The requirements for using the confidence interval formula were verified in previous examples. We also determined in previous examples that
• $b_1 = 0.0116$
• $s_{b_1} = 0.0030$

Since $t_{0.025} = 2.228$ for 10 degrees of freedom, we have

Lower bound = 0.0116 − 2.228 · 0.003 = 0.0049
Upper bound = 0.0116 + 2.228 · 0.003 = 0.0183
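Before interpreting the interval, the figures from Parallel Examples 2, 5, and 7 can be re-derived numerically. The following is a minimal Python sketch, not the method used on the slides (which rely on Minitab and Table VI). The function name `slope_inference` and the variable names are hypothetical, the slope is computed as $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ (algebraically equivalent to the slope formula given earlier), and the critical value 2.228 is simply passed in as read from the table.

```python
import math

# Drilling data from Parallel Example 2.
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]    # x, feet
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92,
           7.62, 6.89, 7.9]                                          # y, minutes


def slope_inference(x, y, t_crit):
    """Least-squares fit plus the slope test statistic and confidence bounds.

    t_crit is the table value t_{alpha/2} with n - 2 degrees of freedom;
    it is supplied by the caller rather than computed here.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squared deviations and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx                # slope estimate
    b0 = y_bar - b1 * x_bar       # intercept estimate

    # Standard error of the estimate: divide by n - 2, not n.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))

    s_b1 = s_e / math.sqrt(sxx)   # standard deviation of b1
    t0 = b1 / s_b1                # test statistic under H0: beta_1 = 0

    return {
        "b1": b1,
        "b0": b0,
        "s_e": s_e,
        "s_b1": s_b1,
        "t0": t0,
        "lower": b1 - t_crit * s_b1,
        "upper": b1 + t_crit * s_b1,
    }


result = slope_inference(depth, minutes, t_crit=2.228)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because the sketch carries unrounded coefficients through every step, its values differ slightly from the slide figures, which round $b_1$ and $s_{b_1}$ first. The test statistic comes out near 3.85 rather than 3.867, which still falls between the table values 3.581 and 4.144, so the conclusion of the test is unchanged.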
We are 95% confident that the mean increase in the time it takes to drill 5 feet for each additional foot of depth at which the drilling begins is between 0.005 and 0.018 minutes.

The Coefficient of Determination

The coefficient of determination, $R^2$, is the proportion of the variability in the response variable that can be attributed to the least-squares regression model.

How to Calculate $R^2$

Using the sum of squares technique:

$$R^2 = 1 - \frac{\sum \text{residuals}^2}{\sum (y - \bar{y})^2}$$

For simple linear regression (SLR) models, however, we can simplify the calculation slightly:

$$R^2 = r^2$$

where r is the correlation between the response and predictor variables.

Parallel Example 8: Calculating the Coefficient of Determination

Using technology for our drilling example, we can calculate the correlation between the response and predictor to be 0.772822. Using the simplified calculation for the coefficient of determination, that means

$$R^2 = (0.772822)^2 = 0.5973$$

Interpretation: Our model using the depth at which drilling begins as a predictor is able to explain 59.73% of the natural variability in the time it takes to drill 5 feet.
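The agreement between the two ways of computing the coefficient of determination can be checked on the drilling data. The sketch below is illustrative only: the function name `r_squared_two_ways` is hypothetical, and the correlation is computed directly from the data with the usual Pearson formula rather than taken from software output.

```python
import math

# Drilling data from Parallel Example 2.
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]    # x, feet
minutes = [5.88, 5.99, 6.74, 6.1, 7.47, 6.93, 6.42, 7.97, 7.92,
           7.62, 6.89, 7.9]                                          # y, minutes


def r_squared_two_ways(x, y):
    """Return (R^2 from sums of squares, correlation r) for a simple linear fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)    # total variability in y
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares line and its sum of squared residuals.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    r2_from_ss = 1 - sse / syy                  # sum of squares technique
    r = sxy / math.sqrt(sxx * syy)              # Pearson correlation
    return r2_from_ss, r


r2_from_ss, r = r_squared_two_ways(depth, minutes)
print(f"R^2 from sums of squares: {r2_from_ss:.4f}")
print(f"r: {r:.6f}, r^2: {r ** 2:.4f}")
```

The two routes agree only because the model has a single predictor and the line is fitted by least squares; with the residuals taken from rounded coefficients, as in the residual table of Parallel Example 2, the sum of squares version would differ in the later decimal places.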