252x0641 12/15/05
ECO252 QBA2 Final EXAM December , 2006 Version 1
Name and Class hour:_________________________

I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else).

Regression A seeks to explain the selling price of a home in terms of a group of variables explained on the output sheet. Note that Regressions 1 and 7 are identical. Look at the definitions of the variables carefully and, in particular, notice which are interaction variables.

a) The homes in this regression are in three different areas. There are dummy variables to indicate that the homes are in Area 1 or Area 2. Why isn’t there a dummy variable for Area 3? (1)

b) In Regression 1, what coefficients are significant at the 5% level? (2)

c) What independent variables did I remove from the problem to get to Regression 2 from Regression 1? Why? (2)

d) Following the same process, I went on to remove one or more variables each time until I got to Regression 5. When I got to Regression 5 I ran the ‘best subsets regression,’ Regression 6. I concluded that it was time to quit removing variables. Between the best subsets regression and the characteristics of the coefficients of the results in Regression 5, I felt that I had gone as far as was reasonable in removing independent variables. What are the three things that led me to think that Regression 5 was the best that I could do? (3)

e) Using Regression 5 and assuming that all homes have two baths, Regression 5 effectively becomes 3 regressions relating price to living area. Take the coefficient of bath, multiply it by two and add it to the constant to get the effective intercept for homes with two baths.
Using L or any other symbol that you find convenient for living area, what are the equations relating living area to price in Area 1? Area 2? Area 3? (3) [11]

f) Continuing with Regression 5 and assuming that a home has 2 (thousand) square feet of living area and 2 baths, what would it sell for in Area 1? Area 2? Area 3? What is the percent difference between the lowest and highest price? (2)

g) We have not yet dealt with the question of whether the coefficients in Regression 5 are reasonable. In order to do this, look at two homes in Area 1 that have two baths. If one has 2 (thousand) square feet of living area and the other 3, how would their prices differ? Does that seem reasonable? Try the same for a home in Area 3. (3) [16]

h) As I warned you, I now repeated Regression 1 as Regression 7, without using the VIFs. Much to my surprise, I ended up dropping the same variables as I did after Regression 1. Why? (1)

i) Continuing in the same way, I worked myself to Regression 9. Looking at the things I usually check, this looked pretty good. Then I tried to check the coefficients in the same way that I did in g). Why was I very unhappy? What is there in Regression 8 that could explain these results? (4)

j) Regression 11 is a stepwise regression. The printout, which continues on page 7, presents four different possible regressions in column form. Note that in each case a coefficient has a t-value under it and a p-value for a significance test. After the fourth try, the computer refused to add any more independent variables. The only regression here that I thought was worth looking at was the one with four independent variables. What can you tell me about its acceptability? (3) [24]

k) Do an F test to compare Regressions 2 and 3 and to find out if lot 1 and lot 2 have any explanatory power. (3)

II. Hand in your third computer problem. (2 to 7 points)

III.
Do at least 4 of the following 7 Problems (at least 12 each), or do sections adding to at least 50 points. (Anything extra you do helps, and grades wrap around.) You must do parts a) and b) of problem 1. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests; that is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has about 151 points, but 70 is considered a perfect score. Don’t waste our time by telling me that two means, proportions, variances or medians don’t look the same to you. You need statistical tests! There are two blank pages below.

1. a) If I want to test to see if the mean of $x_2$ is larger than the mean of $x_1$, my null hypotheses are: (Note: $D = \mu_1 - \mu_2$) (2)

i) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0       v) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0
ii) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0      vi) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0
iii) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0     vii) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0
iv) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0      viii) $\mu_1$ ___ $\mu_2$ and $D$ ___ 0

The first two columns below represent times for 25 workers on an industrial task. The third column, d, is the difference between them.

Row    x1     x2     d
  1   6.11   4.81   1.30
  2   5.13   4.19   0.94
  3   6.42   5.17   1.25
  4   4.65   4.07   0.58
  5   5.82   4.58   1.24
  6   4.08   2.97   1.11
  7   4.01   3.39   0.62
  8   5.26   4.14   1.12
  9   5.25   4.31   0.94
 10   7.66   6.68   0.98
 11   6.29   5.37   0.92
 12   5.41   3.95   1.46
 13   6.17   4.93   1.24
 14   5.50   4.04   1.46
 15   4.06   2.40   1.66
 16   6.19   4.71   1.48
 17   6.71   5.93   0.78
 18   4.41   2.93   1.48
 19   5.25   4.25   1.00
 20   4.85   4.41   0.44
 21   6.50   4.68   1.82
 22   5.24   3.50   1.74
 23   7.29   6.09   1.20
 24   4.99   2.87   2.12
 25   4.26   3.06   1.20

Assume that $\alpha = .05$. Minitab gives us the following summary (edited).
Descriptive Statistics: x1, x2, d

Variable   N  N*  Mean  SE Mean  StDev  Minimum      Q1  Median      Q3  Maximum
x1        25   0  5.50    0.200   1.00    4.010   4.750   5.260   6.240    7.660
x2        25   0  4.30    0.212   1.06    2.400   3.445   4.250   4.870    6.680
d         25   0  1.20  …………  ………   0.4400  0.9400   1.200  1.4700    2.120

In the d column, the column sum is 30.08 and the sum of the first 24 numbers squared is 38.585. Do not recompute things that have been done for you if you want to ever get much done on this exam. Clearly label parts b, c, d etc. The null hypothesis is the same for parts c, d and e, so state it clearly.

b) Find the sample variance for the d column. (2)

c) On the assumption that the underlying distributions are Normal and that the first two columns represent independent samples from populations that represent plants 1 and 2 and come from populations with similar variances, can we conclude that average workers in plant 2 complete the task faster than those in plant 1? (4)

d) (Extra credit) Repeat part c) after dropping the assumption that the variances are similar. (5)

e) Actually, these data supposedly represent performance of a single sample of 25 workers on two administrations of a standard test of manual dexterity. The question was ‘Did the time for the test improve between the first and second administration?’ (3) [11]

f) Assume that the means above come from independent samples, but that the data represent samples for populations with known population variances of 1.00 and 1.06. Test the null hypothesis that you used in part c) and find an exact p-value. (3) [14]

g) Using the value of $s_d$ that you used in e), make a confidence interval with a confidence level of 92%. You must find the value of z needed to do this first. Of course, it is not on the t-table. (2) [16]

2. Let us expand the problem of question 1 by adding another column. The full data set with lots done for you looks like this. The first three columns represent the given data.
In the next three columns I have taken the first three columns and squared them. I have added the first three columns to get the seventh column. I have computed row means in the 8th column and their squares in the 9th column. The tenth column is a row sum of squares. In the 11th to the 13th columns the numbers in the first three columns are ranked from 1 to 75. Sums are provided for all 13 columns.

        (1)    (2)    (3)       (4)       (5)       (6)
Row      x1     x2     x3      x1sq      x2sq      x3sq
  1    6.11   4.81   5.46   37.3321   23.1361   29.8116
  2    5.13   4.19   4.66   26.3169   17.5561   21.7156
  3    6.42   5.17   5.80   41.2164   26.7289   33.6400
  4    4.65   4.07   4.36   21.6225   16.5649   19.0096
  5    5.82   4.58   5.20   33.8724   20.9764   27.0400
  6    4.08   2.97   3.53   16.6464    8.8209   12.4609
  7    4.01   3.39   3.70   16.0801   11.4921   13.6900
  8    5.26   4.14   4.70   27.6676   17.1396   22.0900
  9    5.25   4.31   4.78   27.5625   18.5761   22.8484
 10    7.66   6.68   7.17   58.6756   44.6224   51.4089
 11    6.29   5.37   5.83   39.5641   28.8369   33.9889
 12    5.41   3.95   4.68   29.2681   15.6025   21.9024
 13    6.17   4.93   5.55   38.0689   24.3049   30.8025
 14    5.50   4.04   4.77   30.2500   16.3216   22.7529
 15    4.06   2.40   3.23   16.4836    5.7600   10.4329
 16    6.19   4.71   5.45   38.3161   22.1841   29.7025
 17    6.71   5.93   6.32   45.0241   35.1649   39.9424
 18    4.41   2.93   3.67   19.4481    8.5849   13.4689
 19    5.25   4.25   4.75   27.5625   18.0625   22.5625
 20    4.85   4.41   4.63   23.5225   19.4481   21.4369
 21    6.50   4.68   5.59   42.2500   21.9024   31.2481
 22    5.24   3.50   4.37   27.4576   12.2500   19.0969
 23    7.29   6.09   6.69   53.1441   37.0881   44.7561
 24    4.99   2.87   3.93   24.9001    8.2369   15.4449
 25    4.26   3.06   3.66   18.1476    9.3636   13.3956

         (7)     (8)       (9)      (10)    (11)    (12)    (13)
Row     rsum   rmean      rmsq      rssq   rank1   rank2   rank3
  1    16.38    5.46   29.8116    90.280    63.0    40.0    54.0
  2    13.98    4.66   21.7156    65.589    44.0    21.0    32.0
  3    17.39    5.80   33.6013   101.585    68.0    45.0    58.0
  4    13.08    4.36   19.0096    57.197    31.0    18.0    25.0
  5    15.60    5.20   27.0400    81.889    59.0    29.0    46.0
  6    10.58    3.53   12.4374    37.928    19.0     4.0     9.0
  7    11.10    3.70   13.6900    41.262    15.0     7.0    12.0
  8    14.10    4.70   22.0900    66.897    50.0    20.0    35.0
  9    14.34    4.78   22.8484    68.987    48.5    24.0    39.0
 10    21.51    7.17   51.4089   154.707    75.0    70.0    73.0
 11    17.49    5.83   33.9889   102.390    66.0    51.0    60.0
 12    14.04    4.68   21.9024    66.773    52.0    14.0    33.5
 13    16.65    5.55   30.8025    93.176    64.0    42.0    56.0
 14    14.31    4.77   22.7529    69.325    55.0    16.0    38.0
 15     9.69    3.23   10.4329    32.676    17.0     1.0     6.0
 16    16.35    5.45   29.7025    90.203    65.0    36.0    53.0
 17    18.96    6.32   39.9424   120.131    72.0    61.0    67.0
 18    11.01    3.67   13.4689    41.502    27.5     3.0    11.0
 19    14.25    4.75   22.5625    68.188    48.5    22.0    37.0
 20    13.89    4.63   21.4369    64.408    41.0    27.5    30.0
 21    16.77    5.59   31.2481    95.401    69.0    33.5    57.0
 22    13.11    4.37   19.0969    58.805    47.0     8.0    26.0
 23    20.07    6.69   44.7561   134.988    74.0    62.0    71.0
 24    11.79    3.93   15.4449    48.582    43.0     2.0    13.0
 25    10.98    3.66   13.3956    40.907    23.0     5.0    10.0

The sums of the columns will not fit on the table so they are printed here.
(1) Sum of x1 = 137.51; (2) Sum of x2 = 107.43; (3) Sum of x3 = 122.48; (4) Sum of x1sq = 780.400; (5) Sum of x2sq = 488.725; (6) Sum of x3sq = 624.649; (7) Sum of rsum = 367.42; (8) Sum of rmean = 122.473; (9) Sum of rmsq = 624.587; (10) Sum of rssq = 1893.77; (11) Sum of rank1 = 1236.5; (12) Sum of rank2 = 662; (13) Sum of rank3 = 951.5.

You are left to find column means and the grand mean. Please avoid recomputing stuff that I have done for you. Life is not that long. Almost everything else is done for you.

a) Consider the first three columns to be three independent random samples from Normal distributions with similar variances. Compare the means using an appropriate statistical test or tests. (6)

b) Actually, as in 1e), these data represent three tests of a single random sample of 25 workers. Consider the data blocked by worker and compare means. (4)

c) Consider the first three columns to be three independent random samples from a distribution that is not Normal. Compare the medians using an appropriate statistical test or tests. (5) [31]

3.
A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). The data is below. (Use $\alpha = .01$.) All circulation data is in millions sold.

Row    y   x1   x2   x3
       S   AM   PM    T
  1   54   27   34    1
  2   55   29   33    2
  3   55   30   31    3
  4   56   33   29    4
  5   56   34   29    5
  6   57   35   28    6
  7   58   36   26    7
  8   59   37   25    8
  9   60   39   24    9

The quantities below are given: $\sum y = 510$, $\sum x_1 = 300$, $\sum x_2 = 259$, $n = 9$, $\sum x_1^2 = 10126$, $\sum x_2^2 = 7549$, $\sum y^2 = 28932$, $\sum x_1 y = ?$, $\sum x_2 y = 14623$ and $\sum x_1 x_2 = 8525$. Yes, you will have to compute $\sum x_1 y$ as part of a). You do not need all of these.

a) Compute a simple regression of Sunday circulation against morning circulation. (8)

b) Compute $R^2$. (4)

c) Compute $s_e$. (3)

d) Compute $s_{b_1}$ (the std deviation of the slope) and do a confidence interval for $\beta_1$. (3)

e) Do a prediction interval for units when morning circulation rises to 45 million. (3) Why is this interval likely to be larger than other prediction intervals we might compute for morning circulation we have actually observed? (1) [53]

4. Data from problem 3 is repeated. (Use $\alpha = .01$.) A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). The data is below. All circulation data is in millions sold.

Row    y   x1   x2   x3
       S   AM   PM    T
  1   54   27   34    1
  2   55   29   33    2
  3   55   30   31    3
  4   56   33   29    4
  5   56   34   29    5
  6   57   35   28    6
  7   58   36   26    7
  8   59   37   25    8
  9   60   39   24    9

The quantities below are given: $\sum y = 510$, $\sum x_1 = 300$, $\sum x_2 = 259$, $n = 9$, $\sum x_1^2 = 10126$, $\sum x_2^2 = 7549$, $\sum y^2 = 28932$, $\sum x_1 y = ?$, $\sum x_2 y = 14623$ and $\sum x_1 x_2 = 8525$.

a) Do a multiple regression of Sunday circulation against morning and evening circulation. (12)

b) Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem.
Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. (6)

c) Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)

d) Use your regression to predict Sunday circulation when AM circulation is 40 and PM circulation is 23. (2)

e) Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4) [82]

5. Data from problem 3 is repeated. (Use $\alpha = .01$.) A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). All circulation data is in millions sold. The time variable is now added with the following results.

MTB > Regress c1 3 c2 c3 c10;
SUBC> VIF;
SUBC> DW.

Regression Analysis: S versus AM, PM, T

The regression equation is
S = 62.2 - 0.253 AM - 0.071 PM + 0.991 T

Predictor     Coef  SE Coef      T      P   VIF
Constant     62.19    17.23   3.61  0.015
AM         -0.2533   0.2960  -0.86  0.431  53.6
PM         -0.0707   0.3642  -0.19  0.854  61.6
T           0.9913   0.4957   2.00  0.102  71.7

S = 0.453612   R-Sq = 96.8%   R-Sq(adj) = 94.9%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  30.971  10.324  50.17  0.000
Residual Error   5   1.029   0.206
Total            8  32.000

Durbin-Watson statistic = 2.13279

a) What do the significance tests on the coefficients reveal? Give reasons. (2)

b) Can you explain why the coefficients of AM and PM seem unreasonable? What is the apparent reason for this? (2)

c) Do a 2% two-sided Durbin-Watson test on the result as suggested in class. What is the hypothesis tested and what is the result? (3)

d) Reuse your spare parts from the previous regression if possible to compute the correlation between AM and PM circulation and test it for significance. (4)

e) Compute a rank correlation between AM and PM circulation and test it for significance. Can you explain why it is larger than the correlation in d)?
(4)

f) Test the hypothesis that the correlation that you computed in d) is -.99. (4) [101]

g) (Extra credit) If AM, PM and T are $x_1$, $x_2$ and $x_3$, find the partial correlation coefficient (square root of the coefficient of partial determination) $r_{Y3.12}$. (2)

6. The following times were recorded for 6 skiers on 3 slopes. In order to assess their difficulty we look at the median time for each slope. We do not assume a Normal distribution. Do not compute the median or mean time for any slope.

Skier   Slope 1   Slope 2   Slope 3
  1       4.9       6.1       5.2
  2       4.5       6.0       5.1
  3       4.1       5.4       4.9
  4       4.4       4.7       5.1
  5       4.5       4.9       4.5
  6       3.3       3.8       3.9

a) Test the hypothesis that the median time on slope 1 is 4 minutes. (3 or 2 depending on method) (3)

b) Test the hypothesis that slope 1 and slope 2 have the same median times. (4)

c) Test the hypothesis that the slopes all have the same median time. (4)

d) Explain what methods you would use in b) and c) if the columns were independent random samples. (1)

e) Rank the skiers’ times on each slope from 1 (fastest) to 6. Use these as rankings of the skiers and test to see if the ranks agree between slopes. (4) [117]

7. Clarence Sales is a marketing major and knows that national soft drink market shares are as below.

Classic Coke    15.6%
Pepsi           13.2%
Diet Coke        5.1%
Diet Pepsi       3.5%
Other brands    62.6%

He gets in a bit of trouble here and is sentenced to 20 hours of public service. After he finishes his public service he takes off for Maine, gets caught littering and is sentenced to another 20 hours of public service. During his public service, he picks up 100 cans in each state. The cans are as below.

Brand           PA   ME
Classic Coke    22   17
Pepsi           15   11
Diet Coke       13   10
Diet Pepsi       6    5
Other brands    44   57

Use a 1% significance level throughout this problem. Don’t waste our time by just computing percents and saying that they are different. Each problem requires a statistical test or the equivalent.
State your null and alternative hypotheses in each problem.

a) Regard the cans picked up as a random sample of sales in the two states. Can we say that the proportions of soft drink cans discarded in Pennsylvania are the same as the national market shares? (5)

b) Clarence knows that Maine is Moxie country, so he believes that the proportion of other brands sold is higher in Maine than in Pennsylvania. Is this true? (4)

c) Create a 0.2% 2-sided confidence interval for the difference between the proportions of other brands sold in Maine. Using your Normal table, make this into a 0.1% 2-sided interval. (3)

d) Actually, Clarence’s mother owns the Coke franchise for Maine, and last year her sales of Classic Coke and Diet Coke together accounted for 25% of the soft drink market in Maine. She tells Clarence that her sales are now above 25%. On the basis of Clarence’s Maine sample, is that true? (2) [131]
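
For part a) of problem 7, one possible approach is a chi-squared goodness-of-fit test of the Pennsylvania counts against the national market shares. The Python sketch below is only an illustration under that assumption, not a prescribed solution. The function name `chi_squared_gof` and the variable names are hypothetical, and the critical value is the usual table value for 4 degrees of freedom at the 1% level. Note that one expected count (Diet Pepsi, 3.5) is below 5, so some treatments would merge cells before testing; the sketch does not do this.

```python
# Hypothetical sketch: chi-squared goodness-of-fit statistic for problem 7 a).
# H0: Pennsylvania proportions equal the national market shares.

# National market shares from the problem statement (as proportions).
national_shares = {
    "Classic Coke": 0.156,
    "Pepsi": 0.132,
    "Diet Coke": 0.051,
    "Diet Pepsi": 0.035,
    "Other brands": 0.626,
}

# Cans picked up in Pennsylvania (n = 100).
observed_pa = {
    "Classic Coke": 22,
    "Pepsi": 15,
    "Diet Coke": 13,
    "Diet Pepsi": 6,
    "Other brands": 44,
}


def chi_squared_gof(observed, proportions):
    """Return the sum of (O - E)^2 / E, with E = n * hypothesized proportion."""
    n = sum(observed.values())
    statistic = 0.0
    for category, o in observed.items():
        e = n * proportions[category]
        statistic += (o - e) ** 2 / e
    return statistic


chi2 = chi_squared_gof(observed_pa, national_shares)

# Assumed table value: upper 1% point of chi-squared with 5 - 1 = 4 df.
CRITICAL_01_DF4 = 13.2767
reject_h0 = chi2 > CRITICAL_01_DF4

print(f"chi-squared = {chi2:.4f}, critical value = {CRITICAL_01_DF4}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The same function could be applied to the Maine column by swapping in those counts; the degrees of freedom stay at 4 as long as no cells are merged.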