252x0644 12/14/05
ECO252 QBA2 Final EXAM December , 2006 Version 4
Name and Class hour:_________________________

I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else).

Regression B seeks to explain the selling price of a home in terms of a group of variables explained on the output sheet. Note that regressions 12 and 17 are identical. Look at the definitions of the variables carefully and, in particular, notice which are interaction variables.

a) The homes in this regression have been rated High, Med or Low by realtors. There are dummy variables to indicate the ratings. Why didn’t I use High or AH in regression 12? (1)
b) In Regression 12, what coefficients are significant at the 1% level? (2)
c) What independent variables did I remove from the problem to get to Regression 13 from Regression 12? Why? (2)
d) Following the same process, I went on to remove one or more variables to get to Regression 13. When I got to Regression 13 I ran the ‘best subsets regression,’ 14. I concluded that it was time to quit removing variables. Between the best subsets regression and the characteristics of the coefficients of the results in Regression 13, I felt that I had gone as far as was reasonable in removing independent variables. What are the three things that led me to think that regression 5 was almost the best that I could do? Remember that a close relationship between Sq.ft and Sqftsq is excusable. What in the printout might make you question my judgment?
(3)
e) Using Regression 13 and assuming that all homes have areas of 1000 sq. ft., Regression 13 effectively becomes 3 regressions relating Market price to Assessment. Take the coefficient of Sq.ft multiplied by 1000 and the coefficient of Sqftsq multiplied by $1000^2$. Add them to the constant to get the effective intercept for homes with areas of 1000 sq. ft. Using A or any other symbol that you find convenient for living area, what are the equations relating assessment to Market price for Low homes? Med homes? High homes? (4 points) Is the difference between the slopes of these three equations relative to market significant? Why? [12]
f) Continuing with Regression 13 and assuming that a home has 1000 square feet of living area and an assessment of 24, what would it sell for if it were rated Low? Med? High? What is the percent difference between the lowest and highest price? (2)
g) We have not yet dealt with the question of whether the coefficients in Regression 5 are reasonable. In order to do this, look at two homes, one with an area of 1000 and the second with an area of 1001. By how much will their Market prices differ? Does that seem reasonable? (3) [17]
h) As I warned you, I now repeated Regression 12 as Regression 15, without using the VIFs. I decided to drop 1 variable. Why? (1)
i) I could now add AH to the independent variables and did equation 16. I dropped it immediately. Why? (1)
j) I now ran Regression 17 with one fewer independent variable than Regression 15 and did the same thing to get to Regression 18. How does Regression 18 compare with Regression 13? (2)
j) Regression 17 is a stepwise regression. The printout presents four different possible regressions in column form. Note that in each case a coefficient has a t-value under it and a p-value for a significance test. After the fourth try, the computer refused to add any more independent variables. The only regression here that I thought was worth looking at was the one with four independent variables. What can you tell me about its acceptability? (3) [24]
k) Do an F test to compare regressions 15 and 18 and to see if the two variables removed had any explanatory power.

II. Hand in your third computer problem. (2 to 7 points)

III. Do at least 4 of the following 7 Problems (at least 12 each) (or do sections adding to at least 50 points). Anything extra you do helps, and grades wrap around. You must do parts a) and b) of problem 1. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. That is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has about 151 points, but 70 is considered a perfect score. Don’t waste our time by telling me that two means, proportions, variances or medians don’t look the same to you. You need statistical tests! There are two blank pages below.

1. a) If I want to test to see if the mean of $x_2$ is smaller than the mean of $x_1$, my null hypotheses are: (Note: $D = \mu_1 - \mu_2$) (2)
i) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
ii) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
iii) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
iv) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
v) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
vi) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
vii) $\mu_1$ ? $\mu_2$ and $D$ ? $0$
viii) $\mu_1$ ? $\mu_2$ and $D$ ? $0$

The first two columns below represent times for 25 workers on an industrial task.
The third column is the difference between them.

Row    x1    x2      d
  1  5.11  4.81   0.30
  2  4.13  4.19  -0.06
  3  5.42  5.17   0.25
  4  3.65  4.07  -0.42
  5  4.82  4.58   0.24
  6  3.08  2.97   0.11
  7  3.01  3.39  -0.38
  8  4.26  4.14   0.12
  9  4.25  4.31  -0.06
 10  6.66  6.68  -0.02
 11  5.29  5.37  -0.08
 12  4.41  3.95   0.46
 13  5.17  4.93   0.24
 14  4.50  4.04   0.46
 15  3.06  2.40   0.66
 16  5.19  4.71   0.48
 17  5.71  5.93  -0.22
 18  3.41  2.93   0.48
 19  4.25  4.25   0.00
 20  3.85  4.41  -0.56
 21  5.50  4.68   0.82
 22  4.24  3.50   0.74
 23  6.29  6.09   0.20
 24  3.99  2.87   1.12
 25  3.26  3.06   0.20

Assume that $\alpha = .05$. Minitab gives us the following summary (edited).

Descriptive Statistics: x1, x2, d

Variable   N  N*  Mean  SE Mean   StDev  Minimum       Q1  Median      Q3  Maximum
x1        25   0  4.50    0.200   1.001    3.010    3.750   4.260   5.240    6.660
x2        25   0  4.30    0.212   1.062    2.400    3.445   4.250   4.870    6.680
d         25   0  0.20    ……………   ……………  -0.5600  -0.0600  0.2000  0.4700   1.1200

In the d column, the column sum is 5.08 and the sum of the first 24 numbers squared is 4.825. Do not recompute things that have been done for you if you want to ever get much done on this exam. Clearly label parts b, c, d etc. The null hypothesis is the same for parts c, d and e, so state it clearly.

b) Find the sample variance for the d column. (2)
c) On the assumption that the underlying distributions are Normal and that the first two columns represent independent samples from populations that represent plants 1 and 2 and come from populations with similar variances, can we conclude that average workers in plant 2 complete the task faster than those in plant 1? (4)
d) (Extra credit) Repeat part c) after dropping the assumption that the variances are similar. (5)
e) Actually, these data supposedly represent performance of a single sample of 25 workers on two administrations of a standard test of manual dexterity. The question was ‘Did the time for the test improve between the first and second administration?’ (3) [11]
f) Assume that the means above come from independent samples, but that the data represent samples for populations with known population variances of 1.00 and 1.06. Test the null hypothesis that you used in part c) and find an exact p-value. (3) [14]
g) Using the value of $s_d$ that you used in e), make a confidence interval with a confidence level of 94%. You must find the value of z needed to do this first. Of course, it is not on the t-table. (2) [16]

2. Let us expand the problem of question 1 by adding another column. The full data set with lots done for you looks like this. The first three columns represent the given data. In the next three columns I have taken the first three columns and squared them. I have added the first three columns to get the seventh column. I have computed row means in the 8th column and squared them in the 9th. The tenth column is a row sum of squares. In the 11th to the 13th columns the numbers in the first three columns are ranked from 1 to 75. Sums are provided for all 13 columns.
       (1)   (2)   (3)      (4)      (5)      (6)    (7)      (8)      (9)     (10)   (11)   (12)   (13)
Row     x1    x2    x3     x1sq     x2sq     x3sq   rsum    rmean     rmsq     rssq  rank1  rank2  rank3
  1   5.11  4.81  4.96  26.1121  23.1361  24.6016  14.88  4.96000  24.6016   73.850   57.0   50.0   54.0
  2   4.13  4.19  4.16  17.0569  17.5561  17.3056  12.48  4.16000  17.3056   51.919   27.5   32.0   30.0
  3   5.42  5.17  5.29  29.3764  26.7289  27.9841  15.88  5.29333  28.0194   84.089   65.0   58.5   61.5
  4   3.65  4.07  3.86  13.3225  16.5649  14.8996  11.58  3.86000  14.8996   44.787   19.0   26.0   21.0
  5   4.82  4.58  4.70  23.2324  20.9764  22.0900  14.10  4.70000  22.0900   66.299   51.0   46.0   48.0
  6   3.08  2.97  3.02   9.4864   8.8209   9.1204   9.07  3.02333   9.1405   27.428   10.0    5.0    7.0
  7   3.01  3.39  3.20   9.0601  11.4921  10.2400   9.60  3.20000  10.2400   30.792    6.0   15.0   13.0
  8   4.26  4.14  4.20  18.1476  17.1396  17.6400  12.60  4.20000  17.6400   52.927   39.0   29.0   33.0
  9   4.25  4.31  4.28  18.0625  18.5761  18.3184  12.84  4.28000  18.3184   54.957   36.5   42.0   41.0
 10   6.66  6.68  6.67  44.3556  44.6224  44.4889  20.01  6.67000  44.4889  133.467   73.0   75.0   74.0
 11   5.29  5.37  5.33  27.9841  28.8369  28.4089  15.99  5.33000  28.4089   85.230   61.5   64.0   63.0
 12   4.41  3.95  4.18  19.4481  15.6025  17.4724  12.54  4.18000  17.4724   52.523   43.5   23.0   31.0
 13   5.17  4.93  5.05  26.7289  24.3049  25.5025  15.15  5.05000  25.5025   76.536   58.5   52.0   55.0
 14   4.50  4.04  4.27  20.2500  16.3216  18.2329  12.81  4.27000  18.2329   54.805   45.0   25.0   40.0
 15   3.06  2.40  2.73   9.3636   5.7600   7.4529   8.19  2.73000   7.4529   22.577    8.5    1.0    2.0
 16   5.19  4.71  4.95  26.9361  22.1841  24.5025  14.85  4.95000  24.5025   73.623   60.0   49.0   53.0
 17   5.71  5.93  5.82  32.6041  35.1649  33.8724  17.46  5.82000  33.8724  101.641   67.0   69.0   68.0
 18   3.41  2.93  3.17  11.6281   8.5849  10.0489   9.51  3.17000  10.0489   30.262   16.0    4.0   12.0
 19   4.25  4.25  4.25  18.0625  18.0625  18.0625  12.75  4.25000  18.0625   54.188   36.5   36.5   36.5
 20   3.85  4.41  4.13  14.8225  19.4481  17.0569  12.39  4.13000  17.0569   51.327   20.0   43.5   27.5
 21   5.50  4.68  5.09  30.2500  21.9024  25.9081  15.27  5.09000  25.9081   78.061   66.0   47.0   56.0
 22   4.24  3.50  3.87  17.9776  12.2500  14.9769  11.61  3.87000  14.9769   45.205   34.0   18.0   22.0
 23   6.29  6.09  6.19  39.5641  37.0881  38.3161  18.57  6.19000  38.3161  114.968   72.0   70.0   71.0
 24   3.99  2.87  3.43  15.9201   8.2369  11.7649  10.29  3.43000  11.7649   35.922   24.0    3.0   17.0
 25   3.26  3.06  3.16  10.6276   9.3636   9.9856   9.48  3.16000   9.9856   29.977   14.0    8.5   11.0

The sums of the columns will not fit on the table so they are printed here.
(1) Sum of x1 = 112.51; (2) Sum of x2 = 107.43; (3) Sum of x3 = 109.96; (4) Sum of x1sq = 530.380; (5) Sum of x2sq = 488.725; (6) Sum of x3sq = 508.253; (7) Sum of rsum = 329.9; (8) Sum of rmean = 109.967; (9) Sum of rmsq = 508.308; (10) Sum of rssq = 1527.36; (11) Sum of rank1 = 1010.5; (12) Sum of rank2 = 892; (13) Sum of rank3 = 947.5.

You are left to find column means and the grand mean. Please avoid recomputing stuff that I have done for you. Life is not that long. You will need to get column and overall means. Almost everything else is done for you.

a) Consider the first three columns to be three independent random samples from Normal distributions with similar variances. Compare the means using an appropriate statistical test or tests. (6)
b) Actually, as in 1e), these data represent three tests of a single random sample of 25 workers. Consider the data blocked by worker and compare means. (4)
c) Consider the first three columns to be three independent random samples from a distribution that is not Normal. Compare the medians using an appropriate statistical test or tests. (5) [31]

3. A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). The data is below (use $\alpha = .01$). All circulation data is in millions sold.
       y   x1   x2   x3
Row    S   AM   PM    T
  1   54   32   34    0
  2   55   35   33    1
  3   55   36   31    2
  4   56   40   29    3
  5   56   41   29    4
  6   58   42   28    5
  7   58   43   26    6
  8   60   44   25    7
  9   61   47   24    8

The quantities below are given: $\sum y = 513$, $\sum x_1 = 360$, $\sum x_2 = 259$, $n = 9$, $\sum x_1^2 = 14584$, $\sum x_2^2 = 7549$, $\sum y^2 = 29287$, $\sum x_1 y = ?$, $\sum x_2 y = 14700$, $\sum x_1 x_2 = 10230$. Yes, you will have to compute $\sum x_1 y$ as part of a). You do not need all of these.

a) Compute a simple regression of Sunday circulation against morning circulation. (8)
b) Compute $R^2$. (4)
c) Compute $s_e$. (3)
d) Compute $s_{b_0}$ (the std deviation of the intercept) and do a confidence interval for $\beta_0$. (3)
e) Do a prediction interval for units when morning circulation rises to 50 million. (3) Why is this interval likely to be larger than other prediction intervals we might compute for morning circulation we have actually observed? (1) [53]

4. Data from problem 3 is repeated. (Use $\alpha = .01$.) A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). The data is below. All circulation data is in millions sold.

       y   x1   x2   x3
Row    S   AM   PM    T
  1   54   32   34    0
  2   55   35   33    1
  3   55   36   31    2
  4   56   40   29    3
  5   56   41   29    4
  6   58   42   28    5
  7   58   43   26    6
  8   60   44   25    7
  9   61   47   24    8

The quantities below are given: $\sum y = 513$, $\sum x_1 = 360$, $\sum x_2 = 259$, $n = 9$, $\sum x_1^2 = 14584$, $\sum x_2^2 = 7549$, $\sum y^2 = 29287$, $\sum x_1 y = ?$, $\sum x_2 y = 14700$, $\sum x_1 x_2 = 10230$. Yes, you will have to compute $\sum x_1 y$ as part of a).

a) Do a multiple regression of Sunday circulation against morning and evening circulation. (12)
b) Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. (6)
c) Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)
d) Use your regression to predict the number of units sold when AM circulation is 40 and PM circulation is 25. (2)
e) Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4) [82]

5. Data from problem 3 is repeated. (Use $\alpha = .01$.) A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). All circulation data is in millions sold. The time variable is now added with the following results.

MTB > Regress c1 3 c2 c3 c10;
SUBC> VIF;
SUBC> DW.

Regression Analysis: S versus AM, PM, T

The regression equation is
S = 48.8 - 0.163 AM + 0.302 PM + 1.51 T

Predictor     Coef  SE Coef      T      P   VIF
Constant     48.79    22.22   2.20  0.080
AM         -0.1631   0.2621  -0.62  0.561  29.1
PM          0.3024   0.5228   0.58  0.588  60.2
T           1.5081   0.6578   2.29  0.070     5

S = 0.658831   R-Sq = 95.3%   R-Sq(adj) = 92.5%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  43.830  14.610  33.66  0.001
Residual Error   5   2.170   0.434
Total            8  46.000

Durbin-Watson statistic = 2.51601

a) What do the significance tests on the coefficients reveal? Give reasons. (2)
b) Can you explain why the coefficient of AM seems unreasonable? What is the apparent reason for this? (2)
c) Do a 10% two-sided Durbin-Watson test on the result as suggested in class. What is the hypothesis tested and what is the result? (3)
d) Reuse your spare parts from the previous regression if possible to compute the correlation between AM and PM circulation and test it for significance. (4)
e) Compute a rank correlation between AM and PM circulation and test it for significance. Can you explain why it is larger than the correlation in d)? (4)
f) Test the hypothesis that the correlation that you computed in d) is -.99.
(4) [101]
g) (Extra credit) If AM, PM and T are $x_1$, $x_2$ and $x_3$, find the partial correlation coefficient (square root of the coefficient of partial determination) $r_{Y3.12}$. (2)

6. The following times were recorded for 6 skiers on 3 slopes. In order to assess their difficulty we look at the median time for each slope. We do not assume a Normal distribution. Do not compute the median or mean time for any slope.

Skier  Slope 1  Slope 2  Slope 3
  1      4.7      5.6      4.9
  2      4.4      5.6      4.9
  3      4.0      5.0      4.7
  4      4.3      4.3      4.9
  5      4.4      4.5      4.3
  6      3.2      3.4      3.7

a) Test the hypothesis that the median time on slope 2 is 5 minutes. (3 or 2 depending on method) (3)
b) Test the hypothesis that slope 2 and slope 3 have the same median times. (4)
c) Test the hypothesis that the slopes all have the same median time. (4)
d) Explain what methods you would use in b) and c) if the columns were independent random samples. (1)
e) Rank the skiers’ times on each slope from 1 (fastest) to 6. Use these as rankings of the skiers and test to see if the ranks agree between slopes. (4) [117]

7. Clarence Sales is a marketing major and knows that national soft drink market shares are as below.

Classic Coke   15.6%
Pepsi          13.2%
Diet Coke       5.1%
Diet Pepsi      3.5%
Other brands   62.6%

He gets in a bit of trouble here and is sentenced to 20 hours of public service. After he finishes his public service he takes off for Maine, gets caught littering and is sentenced to another 20 hours of public service. During his public service, he picks up 100 cans in each state. The cans are as below.

Brand          PA   ME
Classic Coke   21   16
Pepsi          15   11
Diet Coke      13   10
Diet Pepsi      6    5
Other brands   45   58

Use a 1% significance level throughout this problem. Don’t waste our time by just computing percents and saying that they are different. Each problem requires a statistical test or the equivalent. State your null and alternative hypotheses in each problem.

a) Regard the cans picked up as a random sample of sales in the two states. Can we say that the proportions of soft drink cans discarded in Maine are the same as the national market shares? (5)
b) Clarence knows that Maine is Moxie country, so he believes that the proportion of other brands sold is higher in Maine than in Pennsylvania. Is this true? (4)
c) Create a 2% 2-sided confidence interval for the difference between the proportions of other brands sold in Maine and Pennsylvania. Using your Normal table, make this into a 2.5% 2-sided interval. (3)
d) Actually Clarence’s mother owns the Pepsi franchise for Maine, and last year her sales of Pepsi and Diet Pepsi between them accounted for 15% of the soft drink market in Maine. She tells Clarence that her sales are now above 15%. On the basis of Clarence’s Maine sample, is that true? (2) [131]
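
The arithmetic behind a chi-square goodness-of-fit test of the kind problem 7a) calls for can be sketched as follows. This is only an illustrative sketch under stated assumptions, not an official solution: the names `chi_square_gof`, `NATIONAL_SHARES` and `MAINE_CANS` are made up for the example, the critical value 13.277 is assumed to be the usual table value for the upper 1% point of chi-square with 4 degrees of freedom, and no small expected counts are merged, which a course rule on minimum expected frequencies might require.

```python
# Illustrative sketch for problem 7a: chi-square goodness-of-fit statistic.
# All names here are hypothetical helpers, not part of the exam.

# National market shares (as proportions), copied from the problem statement.
NATIONAL_SHARES = {
    "Classic Coke": 0.156,
    "Pepsi": 0.132,
    "Diet Coke": 0.051,
    "Diet Pepsi": 0.035,
    "Other brands": 0.626,
}

# Cans picked up in Maine (ME column of the table in problem 7).
MAINE_CANS = {
    "Classic Coke": 16,
    "Pepsi": 11,
    "Diet Coke": 10,
    "Diet Pepsi": 5,
    "Other brands": 58,
}


def chi_square_gof(observed, shares):
    """Return (statistic, degrees of freedom) for H0: counts follow `shares`."""
    n = sum(observed.values())
    statistic = 0.0
    for category, count in observed.items():
        expected = n * shares[category]  # E = n * p under H0
        statistic += (count - expected) ** 2 / expected
    return statistic, len(observed) - 1


statistic, df = chi_square_gof(MAINE_CANS, NATIONAL_SHARES)

# Assumption: table value for the upper 1% point of chi-square with 4 df.
CRITICAL_VALUE_01_DF4 = 13.277
reject_h0 = statistic > CRITICAL_VALUE_01_DF4

print(f"chi-square = {statistic:.3f}, df = {df}, reject H0 at 1%: {reject_h0}")
```

The same function could be pointed at the PA column by swapping in a dictionary of the Pennsylvania counts; the comparison with a table value is what turns the computed statistic into the statistical test the problem asks for.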