4/25/03 252x0341
ECO252 QBA2
FINAL EXAM
May 6, 2003

Name
Hour of Class Registered (Circle)

I. (18 points) Do all the following. Note that answers without reasons receive no credit.

A researcher wishes to use demographic information to predict sales of a large chain of nationwide sports stores. The researcher assembles the following data for a random sample of 38 stores. Use $\alpha = .10$ in this problem.

Row    Sales      Age   Growth   Income       HS  College
  1  1695713  33.1574   0.8299  26748.5  73.5949  17.8350
  2  3403862  32.6667   0.6619  53063.8  88.4557  31.9439
  3  2710353  35.6553   0.9688  36090.1  73.5362  18.6198
  4   529215  33.0728   0.0821  32058.1  79.1780  20.6284
  5   663687  35.7585   0.4646  47843.4  84.1838  35.2032
  6  2546324  33.8132   2.1796  50181.0  93.4996  41.7057
  7  2787046  30.9797   1.8048  30710.1  78.0234  28.0250
  8   612696  30.7843  -0.0569  29141.7  70.2949  15.0882
  9   891822  32.3164  -0.1577  25980.2  70.6674  10.9829
 10  1124968  32.5312   0.3664  18730.9  63.7395  13.2458
 11   909501  31.4400   2.2256  31109.2  76.9059  19.5500
 12  2631167  33.1613   1.5158  35614.1  82.9452  20.8135
 13   882973  31.8736   0.1413  23038.4  65.2127  16.9796
 14  1078573  33.4072  -1.0400  34531.7  73.4944  32.9920
 15   844320  34.0470   1.6836  30350.4  80.2201  22.3185
 16  1849119  28.8879   2.3596  38964.9  87.5973  24.5670
 17  3860007  36.1056   0.7840  49392.8  85.3041  30.8790
 18   826574  32.8083   0.1164  25595.7  65.5884  17.4545
 19   604683  33.0538   1.1498  29622.6  80.6176  18.6356
 20  1903612  33.4996   0.0606  31586.1  80.3790  38.3249
 21  2356808  32.6809   1.6338  39674.6  79.8526  23.7780
 22  2788572  28.5166   1.1256  28879.0  81.2371  16.9300
 23   634878  32.8945   1.4884  24287.1  70.2244  19.1429
 24  2371627  30.5024   4.7937  46711.2  87.1046  30.8843
 25  2627838  30.2922   1.8922  33449.8  80.2057  26.5570
 26  1868116  31.2911   1.8667  31694.5  75.2914  28.3600
 27  2236797  33.0498   1.7896  25459.2  77.6162  19.2490
 28  1318876  32.9348   0.2707  47047.3  85.1753  35.4994
 29  1868098  31.8381   3.0129  26433.2  74.1792  18.6375
 30  1695219  31.0794  23.4630  33396.7  81.6991  41.1130
 31  2700194  32.1807   0.7041  26179.4  73.4140  17.8566
 32  1156050  31.6944  -0.1569  33454.6  73.7161  26.5426
 33   643858  34.0263   0.7084  42271.5  78.6493  29.8734
 34  2188687  34.7315   0.1353  46514.8  80.9503  24.5374
 35   830352  30.5613   0.3848  27030.8  66.8057  14.1390
 36  1226906  33.5183   0.7417  42910.1  77.8905  20.8340
 37   566904  32.3952   0.6693  40561.4  79.3622  19.0309
 38   826518  29.9108   0.1111  22326.0  58.3610  10.6729

In the data above ‘Sales’ is the total sales in the last month, ‘Age’ is the median customer age, ‘Growth’ is the population growth rate in the last ten years, ‘Income’ is median family income, ‘HS’ is percent of potential customers with a high school diploma, and ‘College’ is percent of potential customers with a college degree.

To start with, the researcher runs ‘Sales’ against each independent variable individually with the following results.

MTB > regress c1 1 c2

Regression Analysis: Sales versus Age

The regression equation is
Sales = 931626 + 21783 Age

Predictor     Coef  SE Coef     T      P
Constant    931626  2851421  0.33  0.746
Age          21783    87750  0.25  0.805

S = 919493   R-Sq = 0.2%   R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  52099324721  52099324721  0.06  0.805
Residual Error  36  3.04368E+13  8.45467E+11
Total           37  3.04889E+13

Unusual Observations
Obs   Age    Sales      Fit  SE Fit  Residual  St Resid
 17  36.1  3860007  1718106  353724   2141901     2.52R
 22  28.5  2788572  1552797  376045   1235775     1.47 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
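The summary figures in a printout like the one for Sales versus Age can be recomputed from one another, which is a quick way to check a reading of the output. The following is a minimal sketch in Python and is not part of the original exam. The helper name `check_simple_regression` is hypothetical, and the inputs are copied from the Sales versus Age printout.

```python
def check_simple_regression(coef, se_coef, ss_reg, ss_total, n):
    """Recompute t, F and R-squared for a regression with one predictor."""
    ss_err = ss_total - ss_reg            # error SS = total SS - regression SS
    df_err = n - 2                        # one slope plus the constant
    t = coef / se_coef                    # t ratio for the slope
    f = (ss_reg / 1) / (ss_err / df_err)  # F = MS regression / MS error
    r_sq = ss_reg / ss_total              # share of variation explained
    return t, f, r_sq


# Inputs copied from the Sales versus Age output (n = 38 stores).
t_age, f_age, r_sq_age = check_simple_regression(
    coef=21783, se_coef=87750,
    ss_reg=52099324721, ss_total=3.04889e13, n=38,
)
print(round(t_age, 2), round(f_age, 2), round(100 * r_sq_age, 1))
```

With one predictor the F ratio is the square of the slope's t ratio, so the two tests in the printout carry the same p-value.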
MTB > regress c1 1 c3

Regression Analysis: Sales versus Growth

The regression equation is
Sales = 1595571 + 26834 Growth

Predictor     Coef  SE Coef     T      P
Constant   1595571   161301  9.89  0.000
Growth       26834    39601  0.68  0.502

S = 914467   R-Sq = 1.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  3.83946E+11  3.83946E+11  0.46  0.502
Residual Error  36  3.01050E+13  8.36249E+11
Total           37  3.04889E+13

Unusual Observations
Obs  Growth    Sales      Fit  SE Fit  Residual  St Resid
 17     0.8  3860007  1616609  151819   2243398     2.49R
 30    23.5  1695219  2225167  878449   -529948    -2.09RX

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

MTB > regress c1 1 c4

Regression Analysis: Sales versus Income

The regression equation is
Sales = 299877 + 39.2 Income

Predictor     Coef  SE Coef     T      P
Constant    299877   554447  0.54  0.592
Income       39.17    15.71  2.49  0.017

S = 849860   R-Sq = 14.7%   R-Sq(adj) = 12.3%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  4.48747E+12  4.48747E+12  6.21  0.017
Residual Error  36  2.60014E+13  7.22262E+11
Total           37  3.04889E+13

Unusual Observations
Obs  Income    Sales      Fit  SE Fit  Residual  St Resid
 17   49393  3860007  2234579  276038   1625428     2.02R

R denotes an observation with a large standardized residual

MTB > regress c1 1 c5

Regression Analysis: Sales versus HS

The regression equation is
Sales = - 2969741 + 59660 HS

Predictor      Coef  SE Coef      T      P
Constant   -2969741  1370956  -2.17  0.037
HS            59660    17669   3.38  0.002

S = 802004   R-Sq = 24.1%   R-Sq(adj) = 21.9%

Analysis of Variance
Source          DF           SS           MS      F      P
Regression       1  7.33335E+12  7.33335E+12  11.40  0.002
Residual Error  36  2.31556E+13  6.43210E+11
Total           37  3.04889E+13

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2119509  192928   1740498     2.24R
 38  58.4   826518   512081  358068    314437     0.44 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
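The R-Sq(adj) values in these printouts penalize R-Sq for the number of predictors. The sketch below is an illustration in Python, not part of the original exam, and assumes the usual definition $1 - (1 - R^2)(n - 1)/(n - k - 1)$ with $n$ observations and $k$ predictors. The function name `adjusted_r_sq` is hypothetical; the sums of squares are copied from the HS and Income printouts.

```python
def adjusted_r_sq(ss_reg, ss_total, n, k):
    """Adjusted R-squared from the regression and total sums of squares."""
    r_sq = ss_reg / ss_total
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)


# n = 38 stores, k = 1 predictor in each simple regression.
adj_hs = adjusted_r_sq(7.33335e12, 3.04889e13, n=38, k=1)
adj_income = adjusted_r_sq(4.48747e12, 3.04889e13, n=38, k=1)
print(round(100 * adj_hs, 1), round(100 * adj_income, 1))
```

Because the penalty grows with $k$, adding a weak predictor can raise R-Sq slightly while lowering R-Sq(adj).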
MTB > regress c1 1 c6

Regression Analysis: Sales versus College

The regression equation is
Sales = 789847 + 35854 College

Predictor     Coef  SE Coef     T      P
Constant    789847   439508  1.80  0.081
College      35854    17582  2.04  0.049

S = 871330   R-Sq = 10.4%   R-Sq(adj) = 7.9%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       1  3.15714E+12  3.15714E+12  4.16  0.049
Residual Error  36  2.73318E+13  7.59216E+11
Total           37  3.04889E+13

Unusual Observations
Obs  College    Sales      Fit  SE Fit  Residual  St Resid
  6     41.7  2546324  2285170  347197    261154     0.33 X
 17     30.9  3860007  1896988  189865   1963020     2.31R

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

1. On the basis of the material above the researcher decides that ‘HS’ is the best single predictor of sales. Please explain why. Consider the values of $R^2$ and the significance tests on the slope of the equation. According to the equation showing the response of sales to HS, how much will sales rise if there is a 1% increase in people with a high school diploma? The average store has sales of $1638487. Relative to this, what percent increase in sales would be caused by a 1 (per cent) increase in ‘HS’? (4)

2. The researcher tries to improve the prediction by adding another variable. Since there were 4 variables other than ‘HS,’ there are four regressions below. Do any of them represent an improvement on ‘HS’ alone? Why? Look at the significance tests on the coefficients of the new variables and the adjusted $R^2$. In order to put this in perspective, the average values of the independent variables are shown below.

   Age  Growth  Income  College     HS
32.450   1.599   34175    23.67  77.24

Take the best of the four regressions below and give the value of sales that would be predicted for a store with average values of the independent variables, and explain by what percent sales would rise if ‘HS’ went up by 1. How much does this differ from the prediction using ‘HS’ alone?
(4)

MTB > regress c1 2 c5 c2

Regression Analysis: Sales versus HS, Age

The regression equation is
Sales = - 2126081 + 60953 HS - 29076 Age

Predictor      Coef  SE Coef      T      P
Constant   -2126081  2678378  -0.79  0.433
HS            60953    18226   3.34  0.002
Age          -29076    78952  -0.37  0.715

S = 811809   R-Sq = 24.3%   R-Sq(adj) = 20.0%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.42273E+12  3.71137E+12  5.63  0.008
Residual Error  35  2.30662E+13  6.59034E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Age      1  89382656452

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2023658  325389   1836350     2.47R

R denotes an observation with a large standardized residual

MTB > regress c1 2 c5 c3

Regression Analysis: Sales versus HS, Growth

The regression equation is
Sales = - 2959336 + 59494 HS + 1506 Growth

Predictor      Coef  SE Coef      T      P
Constant   -2959336  1412551  -2.10  0.043
HS            59494    18355   3.24  0.003
Growth         1506    36079   0.04  0.967

S = 813360   R-Sq = 24.1%   R-Sq(adj) = 19.7%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.33450E+12  3.66725E+12  5.54  0.008
Residual Error  35  2.31544E+13  6.61555E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Growth   1   1152089260

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2116944  205088   1743063     2.21R
 30  81.7  1695219  1936614  786380   -241395    -1.16 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
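The sequential SS table shows how much regression sum of squares each added variable contributes, and that contribution can be turned into an F ratio for the added variable. The sketch below is an illustration in Python, not part of the original exam. The function name `partial_f` is hypothetical, and the inputs are copied from the Sales versus HS, Age printout.

```python
def partial_f(seq_ss_added, df_added, ms_error_full):
    """F ratio for variables added last: (added SS / added df) / MSE of the larger model."""
    return (seq_ss_added / df_added) / ms_error_full


# Age added after HS: Seq SS for Age and the MS for Residual Error with both variables.
f_add_age = partial_f(89382656452, 1, 6.59034e11)

# t ratio for Age from the same printout (Coef / SE Coef).
t_add_age = -29076 / 78952

print(round(f_add_age, 3), round(t_add_age ** 2, 3))
```

When a single variable is added last, this F ratio equals the square of that variable's t ratio, so the two tests agree.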
MTB > regress c1 2 c5 c4

Regression Analysis: Sales versus HS, Income

The regression equation is
Sales = - 3089379 + 62540 HS - 3.0 Income

Predictor      Coef  SE Coef      T      P
Constant   -3089379  1715239  -1.80  0.080
HS            62540    30098   2.08  0.045
Income        -3.01    25.26  -0.12  0.906

S = 813216   R-Sq = 24.1%   R-Sq(adj) = 19.7%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.34273E+12  3.67136E+12  5.55  0.008
Residual Error  35  2.31462E+13  6.61320E+11
Total           37  3.04889E+13

Source  DF       Seq SS
HS       1  7.33335E+12
Income   1   9375516635

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2096954  272313   1763053     2.30R

R denotes an observation with a large standardized residual

MTB > regress c1 2 c5 c6

Regression Analysis: Sales versus HS, College

The regression equation is
Sales = - 3193739 + 64448 HS - 6161 College

Predictor      Coef  SE Coef      T      P
Constant   -3193739  1627759  -1.96  0.058
HS            64448    25486   2.53  0.016
College       -6161    23343  -0.26  0.793

S = 812572   R-Sq = 24.2%   R-Sq(adj) = 19.9%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       2  7.37935E+12  3.68967E+12  5.59  0.008
Residual Error  35  2.31096E+13  6.60273E+11
Total           37  3.04889E+13

Source   DF       Seq SS
HS        1  7.33335E+12
College   1  45998746314

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2113692  196709   1746316     2.22R

R denotes an observation with a large standardized residual

3. In desperation the researcher tries to add all the variables at once.
a. What does the ANOVA show? (2)
b. Do any of the coefficients have a wrong sign? (Remember there is nothing wrong with a negative coefficient unless you can give a reason why it shouldn’t be negative.) (1)
c. Which of the coefficients are significant? (2)
d. Do an F test to show if addition of all the variables improved the regression. To do this, drop a few zeros.
Take the regression sum of squares in the regression with ‘HS’ alone as 7.333, the regression sum of squares after adding all the new variables as 7.454 and the error sum of squares as 23.035. I’m getting this from the ANOVA table below and the sequential SS table below it by dividing all the SS’s by $10^{12}$, since only their relative size matters. (3)
e. To put the results in perspective, try again to predict the sales that a store with the mean values of the independent variables would have and what percent increase in sales would come from an increase of 1 in ‘HS.’ How does this compare with our prediction when we used ‘HS’ alone?
f. The column marked VIF (variance inflation factor) is a test for (multi)collinearity. The rule of thumb is that if any of these exceeds 5, we have a multicollinearity problem. None does. What is multicollinearity and why am I worried about it? (2)

MTB > regress c1 5 c5 c2 c6 c4 c3;
SUBC> vif.

Regression Analysis: Sales versus HS, Age, College, Income, Growth

The regression equation is
Sales = - 2270706 + 62735 HS - 27384 Age - 5702 College + 2.4 Income + 2084 Growth

Predictor      Coef  SE Coef      T      P  VIF
Constant   -2270706  3696533  -0.61  0.543
HS            62735    35090   1.79  0.083  3.5
Age          -27384    93046  -0.29  0.770  1.3
College       -5702    28359  -0.20  0.842  2.7
Income         2.45    30.53   0.08  0.937  3.8
Growth         2084    44098   0.05  0.963  1.4

S = 848433   R-Sq = 24.4%   R-Sq(adj) = 12.6%

Analysis of Variance
Source          DF           SS           MS     F      P
Regression       5  7.45407E+12  1.49081E+12  2.07  0.095
Residual Error  32  2.30348E+13  7.19839E+11
Total           37  3.04889E+13

Source   DF       Seq SS
HS        1  7.33335E+12
Age       1  89382656452
College   1  26200610077
Income    1   3524785887
Growth    1   1608397623

Unusual Observations
Obs    HS    Sales      Fit  SE Fit  Residual  St Resid
 17  85.3  3860007  2038662  360453   1821346     2.37R
 30  81.7  1695219  1899886  826437   -204667    -1.07 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

II.
Do at least 4 of the following 7 Problems (at least 15 each) (or do sections adding to at least 60 points. Anything extra you do helps, and grades wrap around). Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests.

1. (Berenson et al. 1220) A firm believes that less than 15% of people remember their ads. A survey is taken to see what recall occurs with the following results (in these problems calculating proportions won’t help you unless you do a statistical test):

Medium      Mag   TV  Radio  Total
Remembered   25   10      7     42
Forgot       73   93    108    274
Total        98  103    115    316

a. Test the hypothesis that the recall rate is less than 15% by using proportions calculated from the ‘Total’ column. Find a p-value for this result. (5)
b. Test the hypothesis that the proportion recalling was lower for Radio than TV. (4)
c. Test to see if there is a significant difference in the proportion that remembered according to the medium. (6)
d. The Marascuilo procedure says that if (i) equality is rejected in c) and (ii) $|p_2 - p_3| \ge \sqrt{\chi^2}\, s_{\Delta p}$, where the chi-squared is what you used in c) and the standard deviation is what you would use in a confidence interval solution to b), you can say that you have a significant difference between TV and Radio. Try it! (5)

2. (Berenson et al. 1142) A manager is inspecting a new type of battery. These are subjected to 4 different pressure levels and their time to failure is recorded. The manager knows from experience that such data is not normally distributed. Ranks are provided.

                      PRESSURE
Use   low  rank  normal  rank  high  rank  whee!  rank
  1   8.0    11     7.6     8   6.0     4    5.1     1
  2   8.1    12     8.2    13   6.3     5    5.6     2
  3   9.2    15     9.8    17   7.1     7    5.9     3
  4   9.4    16    10.9    18   7.7     9    6.7     6
  5  11.7    19    12.3    20   8.9    14    7.8    10

a. At the 5% level analyze the data on the assumption that each column represents a random sample. Do the column medians differ? (5)
b.
Rerank the data appropriately and repeat a) on the assumption that the data is non-normal but cross-classified by use. (5)
c. This time I want to compare high pressure (H) against low-moderate pressure (L). I will write out the numbers 1-20 and label them according to pressure.

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 H  H  H  H  H  H  H  L  H  H  L  L  L  H  L  L  L  L  L  L

Do a runs test to see if the H’s and L’s appear randomly. This is called a Wald-Wolfowitz test for the equality of means in two nonnormal samples. The null hypothesis is that the sequence is random and the means are equal. What is your conclusion? (5)

3. A researcher studies the relationship of numbers of subsidiaries and numbers of parent companies in 11 metropolitan areas and finds the following:

Area  parents x  subsidiaries y     x^2       xy       y^2
   1        653            2607  426409  1702371   6796449
   2        391            1714  152881   670174   2937796
   3        352            1857  123904   653664   3448449
   4        261            1228   68121   320508   1507984
   5        226             880   51076   198880    774400
   6        218             671   47524   146278    450241
   7        202            1524   40804   307848   2322576
   8        151             889   22801   134239    790321
   9        141             482   19881    67962    232324
  10        138             569   19044    78522    323761
  11        134             662   17956    88708    438244
 Sum       2867           13083  990401  4369154  20022545

a. Do Spearman’s rank correlation between x and y and test it for significance. (6)
b. Compute the sample correlation between x and y and test it for significance. (6)
c. Compute the sample standard deviation of x and test to see if it equals 200. (4)

4. Use the data from problem 3.
a.
Test the hypothesis that the correlation between x and y is .8. (5)
b. Test the hypothesis that x has the Normal distribution. (9)
c. Test the hypothesis that x and y have equal variances. (4)

5. Use the data from problem 3.
a. Compute a simple regression of subsidiaries against parents as the independent variable. (5)
b. Compute $s_e$. (3)
c. Predict how many subsidiaries will appear in a city with 50 parent corporations. (1)
d. Make your prediction in c) into a confidence interval. (3)
e. Compute $s_{b_0}$ and make it into a confidence interval for $\beta_0$. (3)
f. Do an ANOVA for this regression and explain what it says about $\beta_1$. (3)

6. A chain has the following data on prices, promotion expenses and sales of one product (you can do $\sum x_1 x_2$). Here y is sales, x1 is price and x2 is promotion:

Store      y   x1    x2   x1^2     x2^2        y^2     x1 y      x2 y
    1   4141   59   200   3481    40000   17147881   244319    828200
    2   3754   59   400   3481   160000   14092516   221486   1501600
    3   5000   59   600   3481   360000   25000000   295000   3000000
    4   4011   59   600   3481   360000   16088121   236649   2406600
    5   3224   79   200   6241    40000   10394176   254696    644800
    6   2618   79   400   6241   160000    6853924   206822   1047200
    7   3746   79   600   6241   360000   14032516   295934   2247600
    8   3825   79   600   6241   360000   14630625   302175   2295000
    9   1096   99   200   9801    40000    1201216   108504    219200
   10   1882   99   400   9801   160000    3541924   186318    752800
   11   2159   99   400   9801   160000    4661281   213741    863600
   12   2927   99   600   9801   360000    8567329   289773   1756200
  Sum  38383  948  5200  78092  2560000  136211509  2855417  17562800

$\bar{y} = 3198.58$, $\bar{x}_1 = 79.0000$ and $\bar{x}_2 = 433.333$.

a.
Do a multiple regression of sales against $x_1$ and $x_2$. (10)
b. Compute $R^2$ and $R^2$ adjusted for degrees of freedom. Use a regression ANOVA to test the usefulness of this regression. (6)
d. Use your regression to predict sales when price is 79 cents and promotion expenses are $200. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)
f. If the regression of Price alone had the following output:

The regression equation is
sales = 7564 - 55.3 price

Predictor    Coef  SE Coef      T      P
Constant   7564.3    863.6   8.76  0.000
price      -55.26    10.71  -5.16  0.000

S = 605.6   R-Sq = 72.7%   R-Sq(adj) = 70.0%

Analysis of Variance
Source          DF        SS       MS      F      P
Regression       1   9772621  9772621  26.65  0.000
Residual Error  10   3667664   366766
Total           11  13440285

Do an F-test to see if adding $x_2$ helped. (4)

Please show your work.

7. The Lees present the following data on college students’ summer wages vs. years of work experience, blocked by location.

          Years of Work Experience
Region     1     2     3
     1    16    19    24
     2    21    20    21
     3    18    21    22
     4    14    21    26

a. Do a 2-way ANOVA on these data and explain what hypotheses you test and what the conclusions are. (9) (Or do a 1-way ANOVA for 6 points.) The following column sums are done for you: $\sum x_1 = 69$, $\sum x_2 = 81$, $n_1 = 4$, $n_2 = 4$, $\sum x_1^2 = 1217$ and $\sum x_2^2 = 1643$. So $\bar{x}_1 = 17.25$ and $\bar{x}_2 = 20.25$.
b. Do a test of the equality of the means in columns 1 and 3 assuming that the columns are random samples from Normal populations with equal variances. (4)
c. Assume that columns 1 and 3 do not come from a Normal distribution and are not paired data and do a test for equal medians. (4)
d. Test the following data for uniformity. $n = 20$. (6)

Category   1   2   3   4   5
Numbers    0   2   0  10   8
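The column sums quoted in problem 7a can be checked against the wage table, and the same arithmetic gives the corresponding figures for column 3. The sketch below is an illustration in Python and is not part of the original exam. The variable names are hypothetical; the data are the four regions by three experience levels from problem 7.

```python
# Rows are regions 1-4; columns are years of work experience 1-3.
wages = [
    [16, 19, 24],
    [21, 20, 21],
    [18, 21, 22],
    [14, 21, 26],
]

# Transpose so each tuple holds one experience column.
columns = list(zip(*wages))

col_sums = [sum(col) for col in columns]                   # sum of x per column
col_sq_sums = [sum(v * v for v in col) for col in columns] # sum of x squared per column
col_means = [s / len(col) for s, col in zip(col_sums, columns)]

print(col_sums, col_sq_sums, col_means)
```

The first two entries of each list should agree with the sums and means given in 7a; the third entry supplies the column 3 figures needed for parts b and c.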