DS 533 Fall 2004 Final Exam
Name: _______Key____________
Show All Your Work

1. A realtor in a local area is interested in being able to predict the selling price for a newly listed home or for someone considering listing their home. This realtor would like to attempt to predict the selling price by using the size of the home (X1, in square feet), the number of rooms (X2), the age of the home (X3, in years) and whether the home has an attached garage (X4). Use the Excel output below to determine if this realtor will be able to use this information to predict the selling price (in $1000).

Summary measures
  Multiple R           0.9439
  R-Square             0.8910
  Adj. R-Square        0.8474
  StErr of Estimate    22.241

Regression coefficients
                      Coefficient    Std Err    t-value    p-value
  Constant                -19.026     54.769    -0.3474     0.7355
  Size                      7.494      1.529     4.9010     0.0006
  Number of Rooms           7.153      9.211     0.7767     0.4553
  Age                      -0.673      0.992    -0.6789     0.5126
  Attached Garage           0.453     20.192     0.0224     0.9826

85. Use the information above to estimate the linear regression model.
ANSWER: Ŷ = 7.494 X1 + 7.153 X2 - 0.673 X3 + 0.453 X4 - 19.026

86. Interpret each of the estimated regression coefficients of the regression model in Question 85.
ANSWER: Holding the other variables constant, the selling price (in $1000) increases by 7.5 for each additional square foot of size, increases by 7.15 for each additional room, decreases by 0.67 for each additional year of age, and increases by 0.453 if the home has an attached garage.

87. Do the variables presented above seem to be significant in predicting the selling price? Explain your answer.
ANSWER: No; the only variable that is significant in this model is the size of the home in square feet (p-value = 0.0006). The other variables are not significant.

88. Would any of the variables in this model be considered a dummy variable? Explain your answer.
ANSWER: Yes; the attached garage is a dummy (0, 1) variable. It records a yes or no response.

89. Identify and interpret the coefficient of determination (R²) and the standard error of the estimate (se) for the model in Question 85.
ANSWER: R² = 0.8910; 89.1% of the variation in the selling price is explained by this regression equation. se = 22.241; this is the standard deviation of the residuals (about $22,241, since price is measured in $1000).

90. Would you recommend that the realtor use this model to predict the selling price of a home? Would you want to make any changes to this model before using it to predict the selling price of a home? Explain.
ANSWER: The size of the home has a fairly strong relationship with the selling price, but the other variables do not seem to be significant in predicting the selling price. If you want to consider another variable, the appraised value of the home may be useful. You may also want to check whether multicollinearity exists in this model. In the current model, the size of the home and the number of rooms could be highly correlated with one another, which could cause problems when predicting the selling price of the home.

Give a 95% confidence interval for the average selling price for

2. Below you will find a regression model that compares the relationship between the average utility bill (Y, in $) for homes of a particular size and the average monthly temperature (X, in Fahrenheit). The data represent monthly values for the past year. Also, the value of the Durbin-Watson statistic = 1.244, and a residual plot is shown below.

Summary measures
  Multiple R           0.0295
  R-Square             0.0009
  StErr of Estimate    24.8184

ANOVA table
  Source         df          SS          MS         F    p-value
  Explained       1      5.3575      5.3575    0.0087     0.9275
  Unexplained    10   6159.5125    615.9512

Regression coefficients
                          Coefficient    Std Err    t-value    p-value
  Constant                    112.547     28.815     3.9059     0.0029
  Average Monthly Temp         0.0403     0.4316     0.0933     0.9275

[Residual plot: residuals (vertical axis, -40 to 40) by month (horizontal axis, 1 to 12)]

48. Estimate the regression model. How well does this model fit the given data?
ANSWER: Ŷ = 0.0403 X1 + 112.547; this is not a very good fit. R² = 0.0009.

49. Is there a linear relationship between X and Y? Explain how you arrived at your answer.
ANSWER: No; the p-value for the F-statistic is 0.9275. There is not a significant linear relationship between these two variables.

50. In looking at the graph of the residuals, do you see any evidence of any violations of the assumptions regarding the errors of the regression model?
ANSWER: There seems to be a pattern to the residuals, and this violates the assumption that the residuals are probabilistically independent. The data appear to be autocorrelated.

51. Given the Durbin-Watson value presented above, what would you conclude about the data?
ANSWER: The Durbin-Watson statistic = 1.244 seems to indicate that there is lag 1 autocorrelation present in this data. Because the value is well below 2, it indicates positive autocorrelation.

52. Given your answer in Question 51, would you recommend modifying the original regression model? If so, how would you modify it?
ANSWER: There is not an easy fix to the autocorrelation problem. In this case, you could use the average temperature to predict the next month’s utility bill. Also, you could look for other variables that may affect the utility bill, such as appliances in the house, number of people living in the house, whether the house has central air/heat, etc. You may be able to identify another variable that has a linear relationship with the average utility bill.

3. TOD Chevy is using Holt’s Method to forecast weekly car sales. Currently, the level is estimated to be 50 cars per week, and the trend is estimated to be 6 cars per week. During the current week 30 cars are sold. Forecast the number of cars 3 weeks from now. Use α = β = 0.3.

3. The following specific percentage seasonal factors are given for the month of December: 75.4, 86.8, 96.9, 72.6, 80.0, 85.4. Assume a multiplicative decomposition model.
If the expected trend-cycle for December is $900, and the mean seasonal factor is used, what is the forecast for December?

Multiple Choice Questions
Select the best answer

1. If you are going to use a regression equation for prediction, you hope to have a reasonably _____ R² and a reasonably _____ se.
   a. small; large
   b. large; small
   c. small; small
   d. large; large
   e. none of the above
   ANSWER: b

2. In choosing the “best-fitting” line through a set of points in linear regression, we choose the one with the:
   a. smallest sum of squared residuals
   b. largest sum of squared residuals
   c. smallest number of outliers
   d. largest number of points on the line
   e. none of the above

3. In a multiple regression analysis, there are 20 data points and 3 independent variables, and the sum of the squared differences between observed and predicted values of y is 160. The multiple standard error of estimate will be:
   a. 3.162
   b. 10
   c. 9.41
   d. 8.42
   e. none of the above

4. The F-ratio from the ANOVA table is calculated by:
   a. MSR / MSE
   b. MSE / MSR
   c. SST / SSE
   d. SSR / SSE
   e. none of the above
   ANSWER: a

5. The _____ can be used to test for autocorrelation.
   a. regression coefficient
   b. correlation coefficient
   c. Durbin-Watson statistic
   d. F-test
   e. t-test
   ANSWER: c

5. A multiple regression equation includes 6 independent variables, and the coefficient of multiple determination is 0.91. The percentage of the variation in y that is explained by the regression equation is:
   a. 91%
   b. 95%
   c. 83%
   d. about 15%
   e. none of the above

6. In regression analysis, multicollinearity refers to:
   a. the response variables being highly correlated
   b. the explanatory variables being highly correlated
   c. the response variable(s) and the explanatory variable(s) being highly correlated with one another
   d. the response variables being highly correlated over time
   e. none of the above
   ANSWER: b

7. When determining whether to include or exclude a variable in regression analysis, if the p-value associated with the variable’s t-value is above some accepted significance value, such as 0.05, then:
   a. the variable is a candidate for inclusion
   b. the variable is a candidate for exclusion
   c. the variable is redundant
   d. the variable does not fit the guidelines of parsimony
   e. none of the above
   ANSWER: b

8. The following are the values of a time series for the first four time periods:

   t      1    2    3    4
   yt    24   25   26   27

   Using a three-period moving average, the forecasted value for time period 5 is:
   a. 20.4
   b. 25.5
   c. 26
   d. none of the above

9. When using exponential smoothing, a smoothing constant must be used. The smoothing constant is a value that:
   a. ranges between 0 and 1
   b. ranges between -1 and +1
   c. is equal to the largest observed value in the series
   d. represents the strength of the association between the forecasted and observed values
   e. none of the above

10. Winters' model differs from simple exponential smoothing in that it includes a term for:
   a. seasonality
   b. trend
   c. residuals
   d. cyclical fluctuations
   e. none of the above

Questions 11 through 14 refer to the following table.

Seasonal indexes of sales revenue of People's Bank are:
  January      1.20
  February      .90
  March        1.00
  April        1.08
  May          1.02
  June         1.10
  July         1.05
  August        .90
  September     .85
  October      1.00
  November     1.10
  December      .80

11. Total revenue for People's Bank in 1999 is forecasted to be $60,000. Based on the seasonal indexes above, sales in the first three months of 1999 should be:
   a. $4,800
   b. $15,500
   c. $14,723
   d. $13,500
   e. None of the above.

12. If December 1999 revenue for People's Bank amounted to $5,000, a reasonable estimate of revenue for January 2000, based on the seasonal indexes given above, would be:
   a. $3,000
   b. $4,500
   c. $4,800
   d. $7,500
   e. None of the above.
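
The arithmetic behind Questions 11 and 12 can be sketched in a few lines of Python. This is only an illustration, not part of the original key. It assumes the usual convention that the twelve indexes average to 1 (they sum to 12.00 here), so each month's baseline is the annual total divided by 12. The function and variable names are hypothetical.

```python
# Seasonal indexes copied from the People's Bank table (January to December).
SEASONAL_INDEX = {
    "January": 1.20, "February": 0.90, "March": 1.00, "April": 1.08,
    "May": 1.02, "June": 1.10, "July": 1.05, "August": 0.90,
    "September": 0.85, "October": 1.00, "November": 1.10, "December": 0.80,
}


def monthly_forecast(annual_total, month):
    """Assumed convention: spread the annual total evenly over 12 months,
    then scale the even share by that month's seasonal index."""
    return annual_total / 12 * SEASONAL_INDEX[month]


def deseasonalize(actual, month):
    """Remove the seasonal effect by dividing by the month's index."""
    return actual / SEASONAL_INDEX[month]


def reseasonalize(level, month):
    """Apply a month's seasonal index to a seasonally adjusted level."""
    return level * SEASONAL_INDEX[month]


# Question 11 setup: $60,000 forecast for the year, first three months.
first_quarter = sum(
    monthly_forecast(60_000, m) for m in ("January", "February", "March")
)

# Question 12 setup: December revenue of $5,000 carried into January.
jan_2000 = reseasonalize(deseasonalize(5_000, "December"), "January")

print(round(first_quarter, 2), round(jan_2000, 2))
```

Under these assumptions the sketch gives 15,500 for the first quarter and 7,500 for January 2000. The same `deseasonalize` helper expresses the seasonal adjustment that the next question asks about.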
13. If revenue of People's Bank amounted to $5,500 in November 1999, the November 1999 sales revenue, after adjustment for seasonal variation using the indexes given above, would be:
   a. $6,500
   b. $6,050
   c. $5,500
   d. $4,500
   e. None of the above.

14. Suppose that a simple exponential smoothing model is used (with α = 0.40) to forecast monthly sandwich sales at a local sandwich shop. The forecasted demand for September was 1560 and the actual demand was 1480 sandwiches. Given this information, what would be the forecast for October in number of sandwiches?
   a. 1480
   b. 1528
   c. 1560
   d. 1592
   e. cannot be determined from the information given

15. Which of the following is not an attribute of a normal probability distribution?
   a. It is symmetrical about the mean.
   b. Most observations cluster around the mean.
   c. Most observations cluster around zero.
   d. The distribution is completely determined by the mean and variance.
   e. All the above are correct.

16. When a time series contains no trend, it is said to be
   a. nonstationary.
   b. seasonal.
   c. nonseasonal.
   d. stationary.
   e. filtered.

17. The difference between seasonal and cyclical components is:
   a. Duration.
   b. Source.
   c. Predictability.
   d. Frequency.
   e. All the above.

18. A linear trend means that the time series variable changes by:
   a. a constant amount each time period
   b. a constant percentage each time period
   c. a positive amount each time period
   d. a negative amount each time period
   e. none of the above
   ANSWER: a

19. When using the moving average method, you must select _____ which represent(s) the number of terms in the moving average.
   a. a smoothing constant
   b. the explanatory variables
   c. an alpha value
   d. a span
   e. none of the above
   ANSWER: d

20. The forecast error is:
   a. the difference between this period’s value and the next period’s value
   b. the difference between the average value and the expected value of the response variable
   c. the difference between the explanatory variable value and the response variable value
   d. the difference between the actual value and the forecast
   e. none of the above
   ANSWER: d

21. A regression approach can also be used to deal with seasonality by using _____ variables for the seasons.
   a. smoothing
   b. response
   c. residual
   d. dummy
   e. none of the above
   ANSWER: d

22. In a random series, successive observations are independent of one another. If this property is violated, the observations are said to be:
   a. autocorrelated
   b. intercorrelated
   c. causal
   d. seasonal
   e. none of the above
   ANSWER: a
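
As a rough illustration of how autocorrelation in residuals can be checked, the Durbin-Watson statistic mentioned earlier in the exam can be computed directly from a residual series. The sketch below is not part of the original key. The two residual lists are hypothetical, made up only to show how the statistic behaves, and the function name is an assumption. Values near 2 suggest no lag 1 autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic:
    sum over t = 2..n of (e_t - e_{t-1})^2, divided by sum over t = 1..n of e_t^2.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    # Pair each residual with its successor to form the lag 1 differences.
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    return numerator / denominator


# Hypothetical residuals (NOT taken from the exam data).
# A slowly drifting series: neighbors are similar, so DW is well below 2.
drifting = [4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0]
# A series that flips sign every period: DW is well above 2.
alternating = [4.0, -4.0, 4.0, -4.0, 4.0, -4.0, 4.0, -4.0]

dw_drifting = durbin_watson(drifting)
dw_alternating = durbin_watson(alternating)
print(round(dw_drifting, 4), round(dw_alternating, 4))
```

The drifting series yields a statistic close to 0 and the alternating series a statistic close to 4, which brackets the value of 1.244 reported in Problem 2 on the positive-autocorrelation side of 2.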