Chapter 13 Regression, Inference, and Model Building

HAWKES LEARNING SYSTEMS Students Matter. Success Counts. Copyright © 2013 by Hawkes Learning Systems/Quant Systems, Inc. All rights reserved.

Building a Simple Linear Regression Model

Simple Linear Regression Model

Definition: The simple linear regression model is given by the linear equation

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \]

where
$\beta_0$ is the y-intercept for the population data,
$\beta_1$ is the slope coefficient for the population data,
$x_i$ is the value of the independent (or predictor) variable for observation i,
$\varepsilon_i$ is the random error in y for observation i, and
$y_i$ is the value of the dependent (or response) variable for observation i.

Estimated Simple Linear Regression Equation

Definition: The estimated simple linear regression equation is

\[ \hat{y}_i = b_0 + b_1 x_i, \]

where $b_0$ and $b_1$ are estimates of their population counterparts. Specifically, $b_0$ is an estimate of $\beta_0$, and $b_1$ is an estimate of $\beta_1$. $\hat{y}_i$ is the predicted value of y for a given value of $x_i$, and is pronounced "y-hat." The symbol $y_i$ is reserved for the observed value of y.

Defining a Linear Relationship

How Do We Measure How Close a Line Is to the Data?
Definition: The difference between the observed value of y and the predicted value of y is called the error, estimated error, or residual ($e_i$). The error for each observation is given by

\[ \text{Error} = e_i = \text{Observed } y - \text{Predicted } y = y_i - \hat{y}_i. \]

Sum of Squared Errors

Formula: Sum of Squared Errors (SSE)
The sum of squared errors (SSE) is given by

\[ \text{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2. \]

Least Squares Line

Definition: The least squares line is the line that has the smallest sum of squared errors. This is the line of best fit.

Finding the Least Squares Line

Slope and y-Intercept of the Least Squares Line

Formula: Slope and y-Intercept of the Least Squares Line
The equation for finding the slope is given by

\[ b_1 = \frac{SS_{xy}}{SS_{xx}}, \]

where

\[ SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} \]

and

\[ SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}. \]

The slope can also be calculated using

\[ b_1 = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}. \]

The estimate of the intercept is given by

\[ b_0 = \bar{y} - b_1 \bar{x} = \frac{1}{n}\left(\sum y_i - b_1 \sum x_i\right). \]

The $x_i$ and $y_i$ referred to in the expressions are the observed data values of x and y, respectively.
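The slope, intercept, and SSE formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `least_squares` and `sum_squared_errors` are hypothetical, and the data are the weekly production figures from Table 13.7 later in this chapter.

```python
# Weekly production data (Table 13.7 of this chapter).
items = [22, 30, 36, 41, 27, 45, 30, 37, 32, 31]  # x: items produced
cost = [3500, 3800, 4500, 4200, 3700, 4600, 3600, 4550, 3990, 3675]  # y: cost ($)


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SS_xy and SS_xx in their mean-centered (definitional) form.
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """SSE: sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(items, cost)
sse = sum_squared_errors(items, cost, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Rounded to the nearest hundredth, the coefficients agree with the estimates 2227.96 and 53.88 reported in Example 13.3. Nudging either coefficient away from these values and recomputing `sum_squared_errors` gives a larger SSE, which is what "least squares" means.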
Intercept and Slope Coefficients

Definition: The intercept coefficient, $b_0$, is the average value of the dependent variable, y, when the independent variable, x, is equal to zero. The slope coefficient, $b_1$, is the average change in the dependent variable, y, for a one unit change in the independent variable, x.

Mean Square Error

Formula: Mean Square Error
The variance of the error terms is also known as the mean square error and is given by

\[ s_e^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} = \frac{\text{SSE}}{n - 2}. \]

The square root of the mean square error is the standard error, or the standard deviation of the error terms.

Evaluating the Fit of a Model

Total Sum of Squares (TSS)

Formula: Total Sum of Squares (TSS)
The total variation in y is given by the total sum of squares (TSS).

\[ \text{TSS} = \sum (y_i - \bar{y})^2 \]

Sum of Squares of Regression

Definition: The sum of squares of regression (SSR) denotes the explained variation in the model.

\[ \text{SSR} = \text{TSS} - \text{SSE} \]

(The explained variation, SSR, is equal to the total variation minus the unexplained variation.)

Coefficient of Determination

Formula: Coefficient of Determination
The coefficient of determination, $R^2$, is given by

\[ R^2 = \frac{\text{SSR}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}. \]

The coefficient of determination is a value between 0 and 1, inclusive. That is, $0 \le R^2 \le 1$.

Example 13.1

The SAT Reasoning Test has been used for years as a predictor of academic success. If SAT scores are predictors of academic success, they should be positively related to the grade point average upon graduation. Twenty-seven graduates of a state college were sampled, and their grade point averages (GPA) upon graduation and the SAT scores reported upon admission were recorded. The data are given in Table 13.5.

Table 13.5 – SAT Scores and Graduating GPA

Student   SAT Critical Reading   SAT Math   SAT Writing   SAT Total   Graduating GPA
   1              440               550         495          1485          2.105
   2              390               480         435          1305          2.484
   3              410               360         385          1155          2.537
   4              390               350         370          1110          2.969
   5              490               590         540          1620          3.619
   6              400               550         475          1425          2.303
   7              450               430         440          1320          2.602
   8              420               350         385          1155          2.195
   9              370               390         380          1140          2.112
  10              460               600         530          1590          3.482
  11              370               400         385          1155          2.367
  12              410               530         470          1410          2.082
  13              470               570         520          1560          2.346
  14              490               610         550          1650          3.484
  15              540               630         585          1755          2.446
  16              560               620         590          1770          2.820
  17              440               470         455          1365          2.556
  18              440               530         485          1455          3.357
  19              580               670         625          1875          3.269
  20              360               420         390          1170          2.964
  21              440               460         450          1350          2.642
  22              290               410         350          1050          2.297
  23              440               500         470          1410          2.388
  24              510               570         540          1620          2.850
  25              320               540         430          1290          2.742
  26              470               580         525          1575          2.347
  27              550               550         550          1650          3.025

Figure 13.14 (scatterplot of total SAT score and graduating GPA)

The scatterplot in Figure 13.14 suggests that as SAT scores increase, the GPA tends to increase, although there is a substantial amount of variability in the relationship. The upward sloping pattern of the data suggests a linear model could be constructed. However, a great deal of variation in the model's errors should be expected. What percent of the variation in final grade point average can be explained by the model relating total SAT score to graduating GPA?

Solution

Using the least squares method, the estimated model is given by

\[ \text{Estimated Graduating GPA} = 1.355 + 0.00094\,(\text{Total SAT Score}). \]

Figure 13.15

Figure 13.16

One of the differences between the production model and the SAT/GPA model is the manner in which the data seem to fit the model.
In the production model, the data seemed to fit closely around the line, while in the SAT/GPA model the data are loosely clustered about the line. While "tight" and "loose" are interesting portrayals of the relative fit of the data, it would be desirable to have a numerical measure to describe fit. $R^2$ is such a measure.

\[ R^2 = \frac{\text{SSR}}{\text{TSS}} = \frac{1.1862}{6.1013} \approx 0.1944 \]

Thus, approximately 19% of the variation in graduating GPA is explained by this model.

What is a good $R^2$?

Fitting a Linear Time Trend

Example 13.2

Many analysts believe that college tuition prices may soon be in the same situation as housing prices were when the housing bubble burst (causing home prices to drop significantly). Table 13.6 contains data for the Tuition Consumer Price Index (TCPI) from 1978 to 2009. Use a linear time trend to model the data.

Table 13.6 – Tuition Consumer Price Index

Year    TCPI      Year    TCPI
1978    60.89     1994   250.80
1979    65.66     1995   249.17
1980    71.80     1996   264.15
1981    80.58     1997   278.42
1982    91.33     1998   307.51
1983   100.73     1999   319.63
1984   110.94     2000   307.80
1985   112.61     2001   324.73
1986   130.63     2002   348.54
1987   140.41     2003   371.42
1988   125.98     2004   343.05
1989   137.86     2005   476.08
1990   175.93     2006   507.90
1991   193.73     2007   456.30
1992   181.14     2008   531.55
1993   234.48     2009   607.60

Solution

Figure 13.17 (Tuition Consumer Price Index)

A graph of the data reveals an upward trend in the tuition consumer price index. The data appear to be a nonstationary time series with an upward trend. To describe the data, we will model the trend by fitting a line through the data with the notion of capturing how fast (on average) the series is changing over time. Estimating the slope of the line will provide the average rate of change per year in the TCPI. The line is fitted using least squares estimates in exactly the same way as other regression models have been constructed.

The independent variable in a linear trend model is always time. In this case, the dependent variable is TCPI. The estimated least squares equation is

\[ \text{Estimated TCPI} = -30609.4994 + 15.4794\,(\text{Year}). \]

Figure 13.18 (Tuition Consumer Price Index)

The computer output for the problem is given in Figure 13.19.
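The trend fit in Example 13.2 can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the software used for the slides: the variable names are hypothetical, the data are typed in from Table 13.6, and the fit uses the mean-centered least squares formulas together with $R^2 = 1 - \text{SSE}/\text{TSS}$.

```python
# TCPI values for 1978 through 2009, in year order (Table 13.6).
years = list(range(1978, 2010))
tcpi = [
    60.89, 65.66, 71.80, 80.58, 91.33, 100.73, 110.94, 112.61,
    130.63, 140.41, 125.98, 137.86, 175.93, 193.73, 181.14, 234.48,
    250.80, 249.17, 264.15, 278.42, 307.51, 319.63, 307.80, 324.73,
    348.54, 371.42, 343.05, 476.08, 507.90, 456.30, 531.55, 607.60,
]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(tcpi) / n

# Least squares slope and intercept, with time as the independent variable.
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, tcpi))
ss_xx = sum((x - x_bar) ** 2 for x in years)
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Fit measures: unexplained (SSE), total (TSS), and R^2 = 1 - SSE/TSS.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(years, tcpi))
tss = sum((y - y_bar) ** 2 for y in tcpi)
r2 = 1 - sse / tss

print(f"Estimated TCPI = {b0:.4f} + {b1:.4f} * Year, R^2 = {r2:.4f}")
```

The slope is the estimated average change in the TCPI per year. Centering the years on their mean before squaring keeps the arithmetic stable even though the raw year values are large.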
Figure 13.19 (computer output for the linear trend model)

The estimate of the slope, 15.4794, tells us that on average the TCPI is increasing at a rate of 15.4794 per year. Given how well the line fits the data ($R^2 = 0.9309$), the trend line is a good descriptor of the data. The trend line can also be used for short-term prediction. Suppose you wanted to estimate the TCPI for 2010. If the data are not available, the trend model can be used.

\[ \text{Estimated TCPI} = -30609.4994 + 15.4794\,(2010) = 504.0946 \quad \text{(prediction of the TCPI for 2010)} \]

Confidence Interval

The Confidence Interval for $\beta_1$

Formula: $100(1 - \alpha)\%$ Confidence Interval for $\beta_1$
The $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is given by

\[ b_1 \pm t_{\alpha/2,\,df}\; s_{b_1}, \]

where $s_{b_1}$ is the standard deviation of $b_1$ and $df = n - 2$.

Example 13.3

Table 13.7 – Weekly Production

Week   Items Produced   Cost ($)
  1          22           3500
  2          30           3800
  3          36           4500
  4          41           4200
  5          27           3700
  6          45           4600
  7          30           3600
  8          37           4550
  9          32           3990
 10          31           3675

In Section 13.3, a model relating the number of items produced to total cost was constructed.

\[ \text{Cost} = \beta_0 + \beta_1\,(\text{Items Produced}) + \varepsilon_i \]

If the relationship is to be applicable for the entire production process, then a substantial amount of data will be required, more than we could hope to collect. If the data given in Table 13.7 are considered a random sample of weekly production, then a relationship can be constructed from the sample data.
Specifically, the estimated least squares regression line relating items produced to total cost is

\[ \text{Estimated Cost} = \$2227.96 + \$53.88\,(\text{Items Produced}), \]

where $b_0$ = $2227.96 (the sample estimate of $\beta_0$, the y-intercept), and $b_1$ = $53.88 (the sample estimate of $\beta_1$, the slope).

Note: Both estimates were determined using Microsoft Excel and rounded to the nearest hundredth.

Figure 13.20 (summary output from Microsoft Excel)

The manual calculation of $s_{b_1}$ is tedious. Virtually every statistical analysis program that performs regression analysis calculates $s_{b_1}$. The summary output from Microsoft Excel is given in Figure 13.20. Most software packages will automatically include a confidence interval for $\beta_1$, or the pieces required to compute one. Microsoft Excel automatically displays the 95% confidence interval for $\beta_1$, and is capable of displaying an interval for any level of confidence you choose.

95% Confidence Interval for $\beta_1$

\[ b_1 \pm t_{\alpha/2,\,df}\; s_{b_1} = 53.88 \pm 2.306\,(10.9778) = 53.88 \pm 25.3148, \quad \text{or } 28.57 \text{ to } 79.19 \]

Example 13.4 (cont.)

99% Confidence Interval for $\beta_1$

\[ b_1 \pm t_{\alpha/2,\,df}\; s_{b_1} = 53.88 \pm 3.355\,(10.9778) = 53.88 \pm 36.8305, \quad \text{or } 17.05 \text{ to } 90.71 \]
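The interval calculations above can be checked with a small Python sketch. Two assumptions are worth flagging: the slides do not give a formula for $s_{b_1}$, so the code uses the standard textbook expression $s_{b_1} = \sqrt{s_e^2 / SS_{xx}}$, and the critical values 2.306 and 3.355 are copied from the worked intervals rather than computed. The function names are hypothetical.

```python
import math

# Weekly production data (Table 13.7).
items = [22, 30, 36, 41, 27, 45, 30, 37, 32, 31]
cost = [3500, 3800, 4500, 4200, 3700, 4600, 3600, 4550, 3990, 3675]


def slope_and_standard_error(x, y):
    """Return (b1, s_b1) for the least squares line fitted to x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # mean square error, s_e^2
    # Assumption: standard formula for the standard deviation of b1.
    s_b1 = math.sqrt(mse / ss_xx)
    return b1, s_b1


def confidence_interval(b1, s_b1, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * s_b1."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin


b1, s_b1 = slope_and_standard_error(items, cost)
ci95 = confidence_interval(b1, s_b1, 2.306)  # t_{0.025, 8}
ci99 = confidence_interval(b1, s_b1, 3.355)  # t_{0.005, 8}
print(f"s_b1 = {s_b1:.4f}")
print(f"95% CI: {ci95[0]:.2f} to {ci95[1]:.2f}")
print(f"99% CI: {ci99[0]:.2f} to {ci99[1]:.2f}")
```

Because the sketch carries unrounded values of $b_1$ and $s_{b_1}$ through the calculation, its endpoints can differ in the last decimal place from intervals worked with rounded inputs.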
Note: The confidence interval in this example was calculated using rounded values from the summary output. Microsoft Excel calculates the confidence interval using unrounded values as 17.05 to 90.72.

Testing a Hypothesis Concerning $\beta_1$

Example 13.6

Using the data in Example 13.3, determine if there is overwhelming evidence at the $\alpha = 0.05$ level of a relationship between the number of items produced and the total production cost.

Solution

Step 1: State the hypotheses in plain English.
• Null Hypothesis: There is not a linear relationship between the number of items produced and total production cost.
• Alternative Hypothesis: There is a linear relationship between the number of items produced and total production cost.

Step 2: Select the appropriate statistical measure.
Since we are interested in determining if $\beta_1 = 0$, the sample estimate of $\beta_1$, namely $b_1$, will be used to evaluate whether the hypothesis $\beta_1 = 0$ is reasonable.

Step 3: Determine whether the hypothesis should be one-sided or two-sided.
The alternative hypothesis should be two-sided since we are interested in discovering any relationship (positive or negative) between items produced and total cost.

Step 4: Specify the hypotheses using the appropriate statistical measure.
$H_0: \beta_1 = 0$ (No linear relationship exists between items produced and total cost.)
$H_a: \beta_1 \ne 0$ (A linear relationship exists between items produced and total cost.)

Step 5: Specify the level of the test.
The level of the test has been given in the problem statement as the 0.05 level.

Step 6: Select the appropriate test statistic.

Formula: Test Statistic for Testing the Hypothesis $\beta_1 \ne 0$
The test statistic for testing the hypothesis $\beta_1 \ne 0$ is given by

\[ t = \frac{b_1 - 0}{s_{b_1}} = \frac{b_1}{s_{b_1}}. \]

The test statistic follows a t-distribution with $n - 2$ degrees of freedom.

The test statistic is similar in nature to the other test statistics developed in Chapter 10. It measures how far $b_1$ is from the hypothesized value of $\beta_1$, which is 0. This distance is measured in standard deviation units. If t is close to 0, then $b_1$ is close to 0 and $H_0: \beta_1 = 0$ is the more reasonable conclusion. However, if t is far from zero, then $b_1$ is far from its hypothesized value and $H_a: \beta_1 \ne 0$ would seem more reasonable. How far from zero t must be is defined by the critical value of the test statistic.

Step 7: Determine the critical value.
The test is two-tailed and the level of the test is specified to be 0.05, which implies $\alpha = 0.05$ and $\alpha/2 = 0.05/2 = 0.025$. The test statistic has a t-distribution with $df = n - 2 = 10 - 2 = 8$. The critical value corresponds to $t_{0.025,\,8} = 2.306$.
Figure 13.22 (t-Distribution, df = 8)

Step 8: Compute the test statistic.

Table 13.9 – Regression Results

Predictor        Coefficient   Standard Deviation of Coefficient   t-value
Intercept          2227.96                370.1488                  6.019
Items Produced       53.88                 10.9778                  4.908

\[ t = \frac{b_1 - 0}{s_{b_1}} = \frac{b_1}{s_{b_1}} = \frac{53.88}{10.9778} = 4.908 \]

The estimated value of $\beta_1$ is almost five standard deviations above zero. This is very persuasive evidence that $\beta_1 \ne 0$.

Step 9: Make the decision.
Since the value of the test statistic falls into the rejection region, reject the null hypothesis in favor of the alternative.

Step 10: State the conclusion in terms of the original question.
There is overwhelming evidence at the 0.05 level that $\beta_1 \ne 0$, so we reject the null hypothesis in favor of the alternative. This implies that it is reasonable to believe (at the 0.05 level) that there is a linear relationship between the number of items produced and total cost. In fact, there appears to be a positive linear relationship between items produced and production cost. However, our hypothesis test did not address the issue of a positive relationship, so we cannot make this conclusion.
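Steps 7 through 9 can be traced in a short Python sketch. As before, this is an illustration rather than the procedure used to produce the slides: the formula $s_{b_1} = \sqrt{s_e^2 / SS_{xx}}$ is the standard textbook expression and does not appear in the source, the critical value 2.306 is taken from Step 7 rather than computed, and the variable names are hypothetical.

```python
import math

# Weekly production data (Table 13.7).
items = [22, 30, 36, 41, 27, 45, 30, 37, 32, 31]
cost = [3500, 3800, 4500, 4200, 3700, 4600, 3600, 4550, 3990, 3675]

n = len(items)
x_bar = sum(items) / n
y_bar = sum(cost) / n
ss_xx = sum((x - x_bar) ** 2 for x in items)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(items, cost))

# Least squares estimates.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# Standard deviation of b1 (assumed standard formula, see lead-in).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(items, cost))
s_b1 = math.sqrt((sse / (n - 2)) / ss_xx)

# Test statistic for H0: beta1 = 0 against Ha: beta1 != 0.
t_stat = (b1 - 0) / s_b1

# Two-tailed critical value t_{0.025, 8}, copied from Step 7.
t_crit = 2.306
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

The comparison uses the absolute value of t because the alternative hypothesis is two-sided: a large negative t would also fall in the rejection region.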