4/25/03 252x0342
ECO252 QBA2 FINAL EXAM
May 7, 2003

Name
Hour of Class Registered (Circle)

I. (18 points) Do all the following. Note that answers without reasons receive no credit.

A researcher wishes to explain the selling price of a house in thousands on the basis of its assessed valuation, whether it was new and the time period. New is 1 if the house is new construction, zero otherwise. The researcher assembles the following data for a random sample of 30 home sales. Use α = .10 in this problem.

----- 4/25/2003 9:58:00 PM -----
Welcome to Minitab, press F1 for help.
MTB > Retrieve "C:\Documents and Settings\RBOVE\My Documents\Drive D\MINITAB\2x03421.MTW".
Retrieving worksheet from file: C:\Documents and Settings\RBOVE\My Documents\Drive D\MINITAB\2x0342-1.MTW
# Worksheet was saved on Fri Apr 25 2003

Results for: 2x0342-1.MTW

MTB > print c1 - c4

Data Display

Row    Price   Value   New   Time
  1    69.00   66.28     0      1
  2   115.50   86.31     0      2
  3   100.80   84.78     1      2
  4    96.90   79.74     1      3
  5    72.00   65.54     0      4
  6    61.90   59.93     0      4
  7    97.00   79.98     1      4
  8    87.50   75.22     0      5
  9    96.90   81.88     1      5
 10    81.50   72.94     0      5
 11    69.34   60.80     0      6
 12    97.90   81.61     1      6
 13    96.00   79.11     0      7
 14    92.00   77.96     0      9
 15    94.10   78.17     1     10
 16   101.90   80.24     1     10
 17   109.50   85.88     1     10
 18    88.65   74.03     0     11
 19    93.00   75.27     0     11
 20    83.00   74.31     0     11
 21   106.70   84.36     0     12
 22    97.90   77.90     1     12
 23    97.30   79.85     1     12
 24    90.50   74.92     0     12
 25    95.90   79.07     1     12
 26   113.90   85.61     0     13
 27    94.50   76.50     1     14
 28    86.50   72.78     0     14
 29    91.50   72.43     0     17
 30    93.75   76.64     0     17

1. Looking for a place to start, the researcher does individual regressions of price against each of the independent variables.
   a. Explain why the researcher concludes from the regressions that valuation (‘value’) is the most important independent variable. Consider the values of R² and the significance tests on the slope of the equation. (2)
   b.
      What kind of variable is ‘new’? Explain why the regression of ‘price’ against ‘new’ is equivalent to a test of the equality of 2 sample means, and what the conclusion would be. (2)

MTB > regress c1 1 c2

Regression Analysis: Price versus Value

The regression equation is
Price = - 44.2 + 1.78 Value

Predictor        Coef    SE Coef        T       P
Constant      -44.172      7.346    -6.01   0.000
Value         1.78171    0.09546    18.66   0.000

S = 3.475    R-Sq = 92.6%    R-Sq(adj) = 92.3%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         1    4206.7    4206.7   348.37   0.000
Residual Error    28     338.1      12.1
Total             29    4544.8

Unusual Observations
Obs   Value     Price       Fit   SE Fit   Residual   St Resid
  6    59.9    61.900    62.606    1.719     -0.706     -0.23 X
 11    60.8    69.340    64.156    1.642      5.184      1.69 X

X denotes an observation whose X value gives it large influence.

MTB > regress c1 1 c3

Regression Analysis: Price versus New

The regression equation is
Price = 88.5 + 9.93 New

Predictor        Coef    SE Coef        T       P
Constant       88.458      2.759    32.07   0.000
New             9.926      4.362     2.28   0.031

S = 11.70    R-Sq = 15.6%    R-Sq(adj) = 12.6%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         1     709.3     709.3     5.18   0.031
Residual Error    28    3835.5     137.0
Total             29    4544.8

Unusual Observations
Obs    New     Price       Fit   SE Fit   Residual   St Resid
  2   0.00    115.50     88.46     2.76      27.04      2.38R
  6   0.00     61.90     88.46     2.76     -26.56     -2.33R
 26   0.00    113.90     88.46     2.76      25.44      2.24R

R denotes an observation with a large standardized residual

MTB > regress c1 1 c4

Regression Analysis: Price versus Time

The regression equation is
Price = 86.4 + 0.698 Time

Predictor        Coef    SE Coef        T       P
Constant       86.355      4.942    17.47   0.000
Time           0.6980     0.5057     1.38   0.178

S = 12.33    R-Sq = 6.4%    R-Sq(adj) = 3.0%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         1     289.6     289.6     1.91   0.178
Residual Error    28    4255.2     152.0
Total             29    4544.8

Unusual Observations
Obs   Time     Price       Fit   SE Fit   Residual   St Resid
  2    2.0    115.50     87.75     4.07      27.75      2.38R
  6    4.0     61.90     89.15     3.27     -27.25     -2.29R

R denotes an observation with a large standardized residual

MTB > regress c1 2 c2 c4;
SUBC> dw;
SUBC> vif.

2. The researcher now adds time. Compare this regression with the regression with Value alone. Are the coefficients significant? Does this explain the variation in Y better than the regression with value alone? What would the predicted selling price be for an old house with a valuation of 80 in time 17? (3)

Regression Analysis: Price versus Value, Time

The regression equation is
Price = - 45.0 + 1.75 Value + 0.368 Time

Predictor        Coef    SE Coef        T       P    VIF
Constant      -44.988      6.553    -6.87   0.000
Value         1.75060    0.08576    20.41   0.000    1.0
Time           0.3680     0.1281     2.87   0.008    1.0

S = 3.097    R-Sq = 94.3%    R-Sq(adj) = 93.9%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         2    4285.8    2142.9   223.46   0.000
Residual Error    27     258.9       9.6
Total             29    4544.8

Source    DF    Seq SS
Value      1    4206.7
Time       1      79.2

Unusual Observations
Obs   Value     Price       Fit   SE Fit   Residual   St Resid
  2    86.3   115.500   106.842    1.385      8.658      3.13R
 11    60.8    69.340    63.656    1.474      5.684      2.09R
 20    74.3    83.000    89.146    0.680     -6.146     -2.03R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.73

3. The researcher now adds the variable ‘new’. Remember that there is nothing wrong with a negative coefficient unless there is some reason why it should not be negative.
   a. What two reasons would I find to doubt that this regression is an improvement on the regression with just value and time by just looking at the t tests and the sign of the coefficients? What does the change in R² adjusted tell me about this regression? (3)
   b. We have done 5 ANOVAs so far. What was the null hypothesis in these ANOVAs, and what does the one where the null hypothesis was accepted tell us? (2)
   c. What selling price does this equation predict for an old home with a valuation of 80 in time 17? What percentage difference is this from the selling price predicted in the regression with just time and value? (2)
   d. The last two regressions have a Durbin-Watson statistic computed.
      What did this test for, what should our conclusion be, and why is it important? (3)
   e. The column marked VIF (variance inflation factor) is a test for (multi)collinearity. The rule of thumb is that if any of these exceeds 5, we have a multicollinearity problem. None does. What is multicollinearity and why am I worried about it? (2)
   f. Do an F test to show whether the regression with ‘value’, ‘time’ and ‘new’ is an improvement over the regression with ‘value’ alone. (3)

MTB > regress c1 3 c2 c4 c3;
SUBC> dw;
SUBC> vif.

Regression Analysis: Price versus Value, Time, New

The regression equation is
Price = - 47.7 + 1.79 Value + 0.351 Time - 1.22 New

Predictor        Coef    SE Coef        T       P    VIF
Constant      -47.675      7.190    -6.63   0.000
Value         1.79394    0.09804    18.30   0.000    1.3
Time           0.3508     0.1298     2.70   0.012    1.0
New            -1.218      1.322    -0.92   0.366    1.3

S = 3.105    R-Sq = 94.5%    R-Sq(adj) = 93.8%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         3    4294.0    1431.3   148.42   0.000
Residual Error    26     250.7       9.6
Total             29    4544.8

Source    DF    Seq SS
Value      1    4206.7
Time       1      79.2
New        1       8.2

Unusual Observations
Obs   Value     Price       Fit   SE Fit   Residual   St Resid
  2    86.3   115.500   107.862    1.777      7.638      3.00R
 11    60.8    69.340    63.502    1.487      5.838      2.14R
 20    74.3    83.000    89.492    0.778     -6.492     -2.16R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 2.60

MTB >

II. Do at least 4 of the following 7 Problems (at least 15 each) (or do sections adding to at least 60 points; anything extra you do helps, and grades wrap around). Show your work! State H0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests.

1. (Berenson et al. 1220) A firm believes that fewer than 15% of people remember its ads.
   A survey is taken to see what recall occurs, with the following results. (In these problems calculating proportions won’t help you unless you do a statistical test.)

   Medium        Mag    TV   Radio   Total
   Remembered     25    10       8      43
   Forgot         73    93     107     273
   Total          98   103     115     316

   a. Test the hypothesis that the recall rate is less than 15% by using proportions calculated from the ‘Total’ column. Find a p-value for this result. (5)
   b. Test the hypothesis that the proportion recalling was lower for Radio than TV. (4)
   c. Test to see if there is a significant difference in the proportion that remembered according to the medium. (6)
   d. The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) |p2 - p3| > sqrt(χ²) · s_Δp, where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio. Try it! (5)

2. (Berenson et al. 1142) A manager is inspecting a new type of battery. These are subjected to 4 different pressure levels and their time to failure is recorded. The manager knows from experience that such data is not normally distributed. Ranks are provided.

                            PRESSURE
   Use     low   rank   normal   rank   high   rank   whee!   rank
     1     8.2     11      7.9      9    6.2      4     5.3      1
     2     8.3     12      8.4     13    6.5      5     5.8      2
     3     9.4     15     10.0     17    7.3      7     6.1      3
     4     9.6     16     11.1     18    7.8      8     6.9      6
     5    11.9     19     12.5     20    9.1     14     8.0     10

   a. At the 5% level analyze the data on the assumption that each column represents a random sample. Do the column medians differ? (5)
   b. Rerank the data appropriately and repeat a) on the assumption that the data is non-normal but cross-classified by use. (5)
   c. This time I want to compare high pressure (H) against low-moderate pressure (L). I will write out the numbers 1-20 and label them according to pressure.

       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
       H  H  H  H  H  H  H  H  L  H  L  L  L  H  L  L  L  L  L  L

      Do a runs test to see if the H’s and L’s appear randomly.
      This is called a Wald-Wolfowitz test for the equality of means in two nonnormal samples. The null hypothesis is that the sequence is random and the means are equal. What is your conclusion? (5)

3. A researcher studies the relationship of numbers of subsidiaries and numbers of parent companies in 11 metropolitan areas and finds the following:

   Area   parents x   subsidiaries y        x²        xy         y²
      1         658             2602    432964   1712116    6770404
      2         396             1709    156816    676764    2920681
      3         357             1852    127449    661164    3429904
      4         266             1223     70756    325318    1495729
      5         231              875     53361    202125     765625
      6         223              666     49729    148518     443556
      7         207             1519     42849    314433    2307361
      8         156              884     24336    137904     781456
      9         146              477     21316     69642     227529
     10         143              564     20449     80652     318096
     11         139              657     19321     91323     431649
    Sum        2922            13028   1019346   4419959   19891990

   a. Compute Spearman’s rank correlation between x and y and test it for significance. (6)
   b. Compute the sample correlation between x and y and test it for significance. (6)
   c. Compute the sample standard deviation of x and test to see if it equals 200. (4)

4. Data from the previous page is repeated:

   Area   parents x   subsidiaries y        x²        xy         y²
      1         658             2602    432964   1712116    6770404
      2         396             1709    156816    676764    2920681
      3         357             1852    127449    661164    3429904
      4         266             1223     70756    325318    1495729
      5         231              875     53361    202125     765625
      6         223              666     49729    148518     443556
      7         207             1519     42849    314433    2307361
      8         156              884     24336    137904     781456
      9         146              477     21316     69642     227529
     10         143              564     20449     80652     318096
     11         139              657     19321     91323     431649
    Sum        2922            13028   1019346   4419959   19891990

   a. Test the hypothesis that the correlation between x and y is .7. (5)
   b. Test the hypothesis that x has the Normal distribution. (9)
   c. Test the hypothesis that x and y have equal variances. (4)

5.
   Data from the previous page is repeated:

   Area   parents x   subsidiaries y        x²        xy         y²
      1         658             2602    432964   1712116    6770404
      2         396             1709    156816    676764    2920681
      3         357             1852    127449    661164    3429904
      4         266             1223     70756    325318    1495729
      5         231              875     53361    202125     765625
      6         223              666     49729    148518     443556
      7         207             1519     42849    314433    2307361
      8         156              884     24336    137904     781456
      9         146              477     21316     69642     227529
     10         143              564     20449     80652     318096
     11         139              657     19321     91323     431649
    Sum        2922            13028   1019346   4419959   19891990

   a. Compute a simple regression of subsidiaries against parents as the independent variable. (5)
   b. Compute s_e. (3)
   c. Predict how many subsidiaries will appear in a city with 60 parent corporations. (1)
   d. Make your prediction in c) into a confidence interval. (3)
   e. Compute s_b0 and make it into a confidence interval for β0. (3)
   f. Do an ANOVA for this regression and explain what it says about β1. (3)

6. A chain has the following data on prices, promotion expenses and sales of one product. (You can do Σx1x2.)

   Store   sales y   price x1   promotion x2     x1²       x2²          y²       x1y        x2y
       1      3842         59            200    3481     40000    14760964    226678     768400
       2      3754         59            400    3481    160000    14092516    221486    1501600
       3      5000         59            600    3481    360000    25000000    295000    3000000
       4      1916         79            200    6241     40000     3671056    151364     383200
       5      3224         79            200    6241     40000    10394176    254696     644800
       6      2618         79            400    6241    160000     6853924    206822    1047200
       7      3746         79            600    6241    360000    14032516    295934    2247600
       8      3825         79            600    6241    360000    14630625    302175    2295000
       9      1096         99            200    9801     40000     1201216    108504     219200
      10      1882         99            400    9801    160000     3541924    186318     752800
      11      2159         99            400    9801    160000     4661281    213741     863600
      12      2927         99            600    9801    360000     8567329    289773    1756200
     Sum     35989        968           4800   80852   2240000   121407527   2752491   15479600

   ȳ = 2999.08, x̄1 = 80.6667 and x̄2 = 400.000.

   a. Do a multiple regression of sales against x1 and x2. (10)
   b. Compute R² and R² adjusted for degrees of freedom. Use a regression ANOVA to test the usefulness of this regression. (6)
   d.
      Use your regression to predict sales when price is 79 cents and promotion expenses are $200. (2)
   e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)
   f. If the regression of Price alone had the following output:

      The regression equation is
      sales = 7391 - 54.4 price

      Predictor        Coef    SE Coef        T       P
      Constant         7391       1133     6.52   0.000
      price          -54.44      13.81    -3.94   0.003

      S = 726.2    R-Sq = 60.9%    R-Sq(adj) = 56.9%

      Analysis of Variance
      Source            DF          SS         MS       F       P
      Regression         1     8200079    8200079   15.55   0.003
      Residual Error    10     5273437     527344
      Total             11    13473517

      Do an F-test to see if adding x2 helped. (4)

   Please show your work.

7. The Lees present the following data on college students’ summer wages vs. years of work experience, blocked by location.

               Years of Work Experience
   Region        1      2      3
        1       16     19     24
        2       21     20     21
        3       18     21     22
        4       14     21     25

   a. Do a 2-way ANOVA on these data and explain what hypotheses you test and what the conclusions are. (9) (Or do a 1-way ANOVA for 6 points.)
      The following column sums are done for you: Σx1 = 69, Σx2 = 81, n1 = 4, n2 = 4, Σx1² = 1217 and Σx2² = 1643. So x̄1 = 17.25 and x̄2 = 20.25.
   b. Do a test of the equality of the means in columns 2 and 3 assuming that the columns are random samples from Normal populations with equal variances. (4)
   c. Assume that columns 2 and 3 do not come from a Normal distribution and are not paired data and do a test for equal medians. (4)
   d. Test the following data for uniformity. n = 20.

      Category    1    2    3    4    5
      Numbers     0    2    0   10    8
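
The arithmetic behind a uniformity test such as the one in 7d can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exam's official solution: it assumes that "uniformity" means each of the 5 categories is equally likely, and it computes two candidate statistics, a Kolmogorov-Smirnov style maximum gap between cumulative relative frequencies and the chi-squared goodness-of-fit sum. The variable names (`observed`, `ks_d`, `chi_sq`) are hypothetical, and the comparison with a tabled critical value is left to the reader.

```python
from itertools import accumulate

# Observed counts for categories 1-5, copied from part d (n = 20).
observed = [0, 2, 0, 10, 8]
n = sum(observed)
k = len(observed)

# Assumption: under the null hypothesis every category is equally likely,
# so each category expects n / k observations.
expected = [n / k] * k

# Kolmogorov-Smirnov style statistic: the largest absolute gap between the
# cumulative observed and cumulative expected relative frequencies.
cum_obs = [c / n for c in accumulate(observed)]
cum_exp = [c / n for c in accumulate(expected)]
ks_d = max(abs(o - e) for o, e in zip(cum_obs, cum_exp))

# Chi-squared goodness-of-fit statistic, with k - 1 degrees of freedom.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"n = {n}, D = {ks_d:.3f}, chi-squared = {chi_sq:.2f}")
```

With only 4 observations expected per category, the usual rule of thumb (expected counts of at least 5) is not met, which is one reason the cumulative-gap statistic may be the more suitable of the two here.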