DSCI 5340: Predictive Modeling and Business Forecasting
Spring 2013 – Dr. Nick Evangelopoulos
Lecture 2: Review of Multiple Regression (Ch. 4-5)
Material based on: Bowerman-O’Connell-Koehler, Brooks/Cole
slide 1

DSCI 5340 FORECASTING
Review of textbook HW
Page 127-128 Ex 3.12 (Use Excel)
Page 128 Ex 3.13, 3.17
Page 132 Ex 3.25
Page 134 Ex 3.35
slide 2

Excel Data Analysis Add-in
In Excel, make sure the Analysis ToolPak add-in is enabled.
slide 3

Ex 3.12 Page 128: Scatter Plot
An accountant wishes to predict direct labor cost (y) based on the batch size (x) of a product produced in a job shop. Data for 12 production runs are given.
slide 4

Ex 3.13 Page 128: Interpretation of Mean of Y Given X
Fitted model: Ŷ = 18.49 + 10.15X
a. $\mu_{y|x=60} = \beta_0 + \beta_1(60)$: the mean value of y over repeated observations at X = 60. This is the point on the regression line predicted for Y at X = 60.
b. $\mu_{y|x=30} = \beta_0 + \beta_1(30)$: the mean value of y over repeated observations at X = 30. This is the point on the regression line predicted for Y at X = 30. The distribution of values around X = 30 should be similar to that for X = 60.
c. Interpretation of slope: as the batch size increases by one unit, the direct labor cost increases by b1 = 10.1463 on average.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.99963578
R Square            0.999271693
Adjusted R Square   0.999198862
Standard Error      8.641541386
Observations        12

ANOVA
             df   SS             MS         F          Significance F
Regression   1    1024592.904    1024593    13720.47   5.04436E-17
Residual     10   746.7623752    74.67624
Total        11   1025339.667

               Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept      18.48750754    4.676579789      3.953211   0.002716   8.067438459   28.90757661
Batch_Size_X   10.14625896    0.086620659      117.1344   5.04E-17   9.953256104   10.33926181
slide 5

Ex 3.13 Page 128: Interpretation of Model
Intercept b0: 18.49 is the estimated labor cost when the batch size is 0. In theory this cost would be 0, but it can be interpreted as fixed costs.
Interpretation of Error Term: There may be other factors that determine direct labor costs, such as benefits to employees, type of product, number of employees, etc. The model might therefore be more accurate with additional independent variables; the error term stands in for the factors that are left out.
slide 6

Ex 3.17 Page 128
Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard service contract, the company agrees to perform routine service on the copier. To obtain information about the time it takes to perform routine service, Accu-Copiers has collected data for 11 service calls, shown in Table 3.7 (p. 126).
slide 7

Ex 3.17 Page 128
slide 8

Ex 3.25 Page 132: Test for correlation
The test for correlation between X and Y,
$H_0: \rho = 0$ vs. $H_a: \rho \neq 0$,
has the same test statistic and p-value as the test for significance of the regression slope coefficient. However, the two tests rest on different assumptions.
slide 9

Ex 3.35 Page 134
A State Department of Taxation asked taxpayers to report the time y (in hours) required to complete a tax form and the number of times x (including this one) the taxpayer has filled out this form.
slide 10

Ex 3.35 Page 134
To understand this model, note that as x increases, 1/x decreases and thus $\mu_{y|x}$ decreases.
slide 11

Multiple Regression Graphically
slide 12

Residuals
The residuals will be denoted $\hat{e}_i$:
$\hat{e}_i = y_i - \hat{y}_i$
They represent the distance of each observed value of the dependent variable from the estimated regression line, that is, the portion of the variation in y that cannot be “explained” with the data available.
What assumptions can we test using these residuals?
slide 13

Regression model assumptions
What are the assumptions of regression analysis? How can these assumptions be checked?
The relationship is linear.
The disturbances $e_i$ have constant variance $\sigma_e^2$.
The disturbances are independent.
The disturbances are normally distributed.
slide 14

Graphical Techniques
scatterplots
residual plots
histograms
(not an exact science)
slide 15

Properties of residual plots
Property 1: The average of the residuals will be equal to zero. This property holds regardless of whether the assumptions are true or not and is a direct result of the way the least-squares method works.
Property 2: There should be no systematic pattern in a residual plot. (What is a systematic pattern?)
Property 3: Residuals should look like random numbers chosen from a normal distribution. (How close to normality should the chart look?)
slide 16

Residual plots
In a residual analysis it is suggested that the following plots be used:
1. Plot the residuals versus each explanatory variable.
2. Plot the residuals versus the predicted or fitted values.
3. If the data are measured over time, plot the residuals versus some variable representing the time sequence.
Which assumptions can each of these plots support, or show to be violated?
slide 17

Residual plots
Plots may be constructed using the actual residuals, $\hat{e}_i$, or the standardized residuals. The standardized residuals are simply the residuals divided by their standard deviation.
Why do you think standardized residuals are sometimes used instead of regular residuals?
slide 18

No Violations of the Assumptions of Regression
Plot shows random residuals.
slide 19

Does This Plot Suggest That One of the Assumptions of Regression Analysis Is Violated?
slide 20

Plot of Residuals
Standardized values are small.
slide 21

Outliers
The method of least squares estimation chooses the regression coefficient estimates so that the error sum of squares, SSE, is a minimum.
In doing this, the squared distances from the observed y values, $y_i$, to the points on the regression line or surface, $\hat{y}_i$, are minimized in total. Least squares thus tries to avoid any large distances from $y_i$ to $\hat{y}_i$.
slide 22

Outliers
OUTLIER: A sample data point whose y value is much different from the y values of the other points in the sample.
As a rule of thumb, an outlier is any value whose studentized residual is greater than 2 in absolute value.
An outlier does not have to be influential. That is, removing the outlier may not change the regression coefficients very much.
slide 23

No influential observations
slide 24

A High Leverage Observation That Is Not Influential
slide 25

Leverages
The slope of the line appears to be determined almost entirely by this one point. The sixth observation is said to have high leverage and is referred to as a leverage point.
What do you think the term “leverage point” means?
slide 26

Studentized residuals
Another measure sometimes used in place of the standardized residual is the standardized residual computed after deleting the ith observation. This measure is called the studentized residual or studentized deleted residual.
(Note that SAS refers to the standardized residual as the studentized residual.)
slide 27

Checking Model Assumptions
Checking Assumption 1 - Normal distribution: construct a histogram
Checking Assumption 2 - Constant variance: plot residuals versus predicted Y values
Checking Assumption 3 - Errors are independent: Durbin-Watson statistic; plot of errors against time
slide 28

Detecting Sample Outliers
Sample leverages
Standardized residuals
Cook’s distance measure
Cook’s distance measure: $D_i = \frac{1}{k+1} \cdot \frac{h_i}{1 - h_i} \cdot (\text{standardized residual}_i)^2$, where $h_i$ is the leverage of observation $i$ and $k$ is the number of independent variables.
slide 29

Example of an Influential Observation
slide 30

Should an unusual observation be deleted?
If an observation is exerting undue influence on the fit of the model, then from an exploratory and data-mining standpoint, removing the observation may reveal substantial changes in the model.
An observation may be miscoded, or may not belong with the rest of the collected data.
No more than 10% of the data should be deleted to improve the model.
slide 31

Dummy Variables
slide 32

Test of Null Hypothesis (F-test)
Tests the null hypothesis:
$H_0: \beta_2 = \beta_3 = \cdots = \beta_p = 0$
$H_a$: at least one $\beta_i$ is not zero
The null hypothesis is known as a joint or simultaneous hypothesis, because it restricts the values of all the $\beta_i$ simultaneously.
This F test assesses the overall significance of the regression model.
slide 33

Model building: Backward Selection
A “deconstruction” approach
1. Begin with the saturated (full) regression model.
2. Compute the drop in $R^2$ as a consequence of eliminating each predictor variable, and the partial F-test value; treat each variable as if it were the last to enter the regression equation.
3. Compare the lowest partial F-test value (designated FL) to the critical value of F (designated FC).
a. If FL < FC, remove the variable, recompute the regression equation using the remaining predictor variables, and return to step 2.
b. If FL > FC, adopt the regression equation as calculated.
slide 34

Model building: Stepwise Selection
1) Calculate the correlations of all predictors with the response variable. Select the predictor variable with the highest correlation. Regress Y on Xi. Retain the predictor if there is a significant F-test value.
2) Calculate the partial correlations of all variables not in the equation with the response variable. Select as the next predictor to enter the one with the highest partial correlation. Call this predictor Xj.
3) Compute the regression equation with both Xi and Xj entered. Retain Xj if its partial F-value exceeds the tabulated F value with (1, n-2-1) df.
4) Now determine whether Xi warrants retention.
Compare its partial F-value as if Xj had been entered into the equation first.
slide 35

Stepwise Continued
Retain Xi if its F-value exceeds the tabulated F value.
5) Enter a new variable Xk. Compute the regression with three predictors. Compute partial F-values for Xi, Xj and Xk. Determine whether each should be retained by comparing its observed partial F with the critical F.
6) Retain the regression equation when no other predictor can be entered or removed from the model.
slide 36
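
The stepwise procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the textbook's or any package's implementation. The function names (`sse`, `partial_f`, `stepwise`) and the parameters `f_crit` and `max_steps` are assumptions introduced here. For simplicity a single fixed critical value is used for both entry and removal instead of a tabulated F value, and the entering variable is chosen by the highest partial F, which selects the same variable as the highest partial correlation.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus the columns in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def partial_f(X, y, cols, j):
    """Partial F for column j, treated as the last variable to enter the model `cols`."""
    sse_full = sse(X, y, cols)
    sse_reduced = sse(X, y, [c for c in cols if c != j])
    df_resid = len(y) - len(cols) - 1  # n minus number of estimated coefficients
    return (sse_reduced - sse_full) / (sse_full / df_resid)

def stepwise(X, y, f_crit=4.0, max_steps=100):
    """Stepwise selection sketch; f_crit is an assumed fixed critical F value."""
    selected = []
    for _ in range(max_steps):  # guard against cycling (an added safety assumption)
        changed = False
        # Entry step: try the candidate with the highest partial F.
        candidates = [c for c in range(X.shape[1]) if c not in selected]
        if candidates:
            f_in = {c: partial_f(X, y, selected + [c], c) for c in candidates}
            best = max(f_in, key=f_in.get)
            if f_in[best] > f_crit:
                selected.append(best)
                changed = True
        # Removal step: re-test every variable as if it had entered last.
        if len(selected) > 1:
            f_out = {c: partial_f(X, y, selected, c) for c in selected}
            worst = min(f_out, key=f_out.get)
            if f_out[worst] < f_crit:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)
```

Calling `stepwise(X, y)` on an `n × p` predictor array `X` and response vector `y` returns the indices of the retained columns. In practice the critical value would be read from an F table with the appropriate degrees of freedom at each step rather than fixed in advance.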