252x0581 12/14/05 ECO252 QBA2 Final EXAM December 14, 2005 Name and Class hour:_________________________ I. (12+ points) Do all the following. Note that answers without reasons receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven’t done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else) Berenson et. al. present a file called RESTRATE. It contains Zagat ratings for 50 restaurants in New York City and 50 more restaurants on Long Island. The data columns are described below. ‘Location’ New York or Long Island ‘Food’ Quality of food on a 0-25 scale ‘Décor’ Quality of décor on a 0-25 scale ‘Service’ Quality of service on a 0-25 scale ‘Summated Rating’ The sum of the food, décor, and service variables. ‘Locate’ A dummy variable – 1 if the restaurant is on Long Island ‘Price’ Average price for a meal ($), the dependent variable. ‘Inter’ The product of ‘Location’ and ‘Summated Rating’ The values of these data and the correlation matrix between them appear at the end of this section. We are trying to explain price based on the Zagat rating and the dummy variable for location. The regression is run as follows. Assume a 5% significance level. Regression Analysis: Price versus Locate, Summated rating The regression equation is Price = - 13.7 - 7.54 Locate + 0.961 Summated rating Predictor Constant Locate Summated rating S = 5.94216 Coef -13.699 -7.537 0.96079 SE Coef 5.054 1.197 0.08960 R-Sq = 59.2% Analysis of Variance Source DF SS Regression 2 4960.2 Residual Error 97 3425.0 Total 99 8385.2 Source Locate Summated rating DF 1 1 T -2.71 -6.30 10.72 P 0.008 0.000 0.000 VIF 1.0 1.0 R-Sq(adj) = 58.3% MS 2480.1 35.3 F 70.24 P 0.000 Seq SS 900.0 4060.2 a) What is the difference (in dollars) in expected price of a meal at restaurants of similar quality in New York and Long Island? (1) b) How much would you expect to pay for a meal at a New York restaurant with a Zagat (summated) rating of 50? (1) c) Which of the coefficients are significant? Do not answer this without evidence from the printout. (2) d) Compute the coefficient of partial determination (partial correlation squared) for ‘locate’ and explain its meaning. (2) 1 252x0581 12/14/05 Now the authors suggest that we add the interaction term to the equation. The new result is. Regression Analysis: Price versus Locate, Summated rating, Inter The regression equation is Price = - 26.3 + 13.1 Locate + 1.19 Summated rating - 0.368 Inter Predictor Constant Locate Summated rating Inter S = 5.84913 Coef -26.291 13.13 1.1872 -0.3676 SE Coef 7.957 10.26 0.1423 0.1813 R-Sq = 60.8% DF 1 1 1 P 0.001 0.204 0.000 0.045 R-Sq(adj) = 59.6% Analysis of Variance Source DF SS Regression 3 5100.9 Residual Error 96 3284.4 Total 99 8385.2 Source Locate Summated rating Inter T -3.30 1.28 8.34 -2.03 MS 1700.3 34.2 F 49.70 P 0.000 Seq SS 900.0 4060.2 140.6 d) Is this a better regression than the previous one? In order to answer this comment on R-squared, Rsquared adjusted and the sign and significance of the coefficients. (3) [9] At this point I was on my own. First I ran a stepwise regression and got the following. Stepwise Regression: Price versus Food, Décor, ... Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is Price on 6 predictors, with N = 100 Step Constant Summated rating T-Value P-Value 1 -13.66 2 -18.40 3 -15.14 4 -15.82 0.893 8.50 0.000 1.047 11.51 0.000 1.323 9.14 0.000 1.953 4.53 0.000 -0.137 -6.57 0.000 -0.137 -6.74 0.000 -0.139 -6.85 0.000 -0.93 -2.42 0.018 -1.85 -2.62 0.010 Inter T-Value P-Value Food T-Value P-Value Décor T-Value P-Value S R-Sq R-Sq(adj) -0.93 -1.55 0.125 7.02 42.46 41.87 5.87 60.16 59.34 5.73 62.44 61.27 5.69 63.37 61.83 More? (Yes, No, Subcommand, or Help) SUBC> n The results here don’t seem very practical. For example the fourth version of the regression this gives me is Price = - 15.82 + 1.953 Summated rating – 0.139 Inter – 1.85 Food -0.93 Décor Note that the number under the coefficient is a t-ratio and the number under that is a p-value for the significance test. e) What are the two ‘best’ variables the stepwise regression picks? Why might I be reluctant to add ‘Food’ and ‘Décor’ in spite of evidence in the significance tests? (2) [11] 2 252x0581 12/14/05 So, the next version of the regression that I did was. MTB > regress c7 2 locate inter; SUBC> VIF. Regression Analysis: Price versus Locate, Inter The regression equation is Price = 39.7 - 52.9 Locate + 0.820 Inter Predictor Constant Locate Inter Coef 39.740 -52.896 0.8196 S = 7.64274 SE Coef 1.081 8.541 0.1469 R-Sq = 32.4% Analysis of Variance Source DF SS Regression 2 2719.3 Residual Error 97 5665.9 Total 99 8385.2 Source Locate Inter DF 1 1 T 36.77 -6.19 5.58 P 0.000 0.000 0.000 VIF 31.2 31.2 R-Sq(adj) = 31.0% MS 1359.7 58.4 F 23.28 P 0.000 Seq SS 900.0 1819.3 This was pretty much as I expected, so I added the components of Zagat’s rating to the equation. MTB > regress c7 5 c6 c8 c2 c3 c4; SUBC> VIF. Regression Analysis: Price versus Locate, Inter, Food, Décor, Service The regression equation is Price = - 21.1 + 8.8 Locate - 0.294 Inter + 0.251 Food + 1.15 Décor + 1.97 Service Predictor Constant Locate Inter Food Décor Service Coef -21.130 8.84 -0.2937 0.2505 1.1465 1.9678 S = 5.69379 SE Coef 7.979 10.22 0.1804 0.3766 0.2798 0.4322 R-Sq = 63.7% Analysis of Variance Source DF SS Regression 5 5337.8 Residual Error 94 3047.4 Total 99 8385.2 Source Locate Inter Food Décor Service DF 1 1 1 1 1 T -2.65 0.86 -1.63 0.67 4.10 4.55 P 0.009 0.390 0.107 0.508 0.000 0.000 VIF 80.6 84.9 2.7 2.3 3.2 R-Sq(adj) = 61.7% MS 1067.6 32.4 F 32.93 P 0.000 Seq SS 900.0 1819.3 371.6 1574.8 672.1 f) Use an F test to compare Price = 39.7 - 52.9 Locate + 0.820 Inter with Price = - 21.1 + 8.8 Locate - 0.294 Inter + 0.251 Food + 1.15 Décor + 1.97 Service What does the F test show about the 3 variables that we added? (3) [14] 3 252x0581 12/14/05 g) Now compare Price = - 21.1 + 8.8 Locate - 0.294 Inter + 0.251 Food + 1.15 Décor + 1.97 Service with Price = - 26.3 + 13.1 Locate + 1.19 Summated rating - 0.368 Inter You can’t use an F test here but you can look at R-squared and significance. Does the equation that I just fitted look like an improvement? As always, give reasons. (2). [16] h) I don’t think I am finished. Should I drop some variables from ? [18] Price = - 21.1 + 8.8 Locate - 0.294 Inter + 0.251 Food + 1.15 Décor + 1.97 Service Why? Which would you suggest that I drop first? Why? Check VIFs and significance. (2) 4 252x0581 12/14/05 ————— 12/6/2005 8:46:21 PM ———————————————————— Welcome to Minitab, press F1 for help. MTB > print c1-c8 Data Display Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 Location NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC NYC LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI Food 19 18 19 23 23 23 20 20 19 21 20 21 24 20 17 21 21 20 17 21 23 17 22 19 21 19 19 21 24 19 17 19 22 22 14 22 20 18 18 24 21 18 20 21 19 18 20 21 19 21 21 17 17 23 23 21 21 21 22 22 23 23 21 17 23 15 Décor 21 17 16 18 20 18 17 15 18 19 17 23 20 17 18 17 19 16 11 16 20 19 14 19 19 14 17 13 21 16 15 16 19 18 15 22 15 14 20 18 17 17 19 10 14 17 16 12 17 20 18 14 17 19 22 18 19 18 18 20 20 18 14 17 23 17 Service 18 17 19 21 21 20 16 17 18 19 16 21 22 20 14 20 21 19 13 20 23 16 15 18 20 16 19 21 21 19 15 19 21 20 15 21 18 17 16 21 18 17 19 17 19 17 17 14 19 20 21 17 18 18 21 19 23 18 20 20 22 20 19 17 22 15 Summated rating Locate 58 0 52 0 54 0 62 0 64 0 61 0 53 0 52 0 55 0 59 0 53 0 65 0 66 0 57 0 49 0 58 0 61 0 55 0 41 0 57 0 66 0 52 0 51 0 56 0 60 0 49 0 55 0 55 0 66 0 54 0 47 0 54 0 62 0 60 0 44 0 65 0 53 0 49 0 54 0 63 0 56 0 52 0 58 0 48 0 52 0 52 0 53 0 47 0 55 0 61 0 60 1 48 1 52 1 60 1 66 1 58 1 63 1 57 1 60 1 62 1 65 1 61 1 54 1 51 1 68 1 47 1 Price 50 38 43 56 51 36 25 33 41 44 34 39 49 37 40 50 50 35 22 45 44 38 14 44 51 27 44 39 50 35 31 34 48 48 30 42 26 35 32 63 36 38 53 23 39 45 37 31 39 53 37 37 29 38 37 38 39 29 36 38 44 27 24 34 44 23 Inter 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 60 48 52 60 66 58 63 57 60 62 65 61 54 51 68 47 5 252x0581 12/14/05 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI LI 19 20 20 20 23 19 15 20 21 23 27 17 22 20 20 25 17 25 19 27 21 19 20 23 24 18 15 16 18 20 21 21 23 19 14 19 15 12 19 21 13 17 17 20 16 17 11 16 12 25 17 22 18 20 11 18 21 19 27 18 16 20 16 12 24 18 15 14 17 18 17 18 20 19 15 22 18 21 19 16 17 19 16 24 18 23 19 24 17 19 20 21 23 20 14 17 17 18 21 19 20 16 50 57 52 50 62 59 43 59 56 64 62 50 50 55 48 74 52 70 56 71 49 56 61 63 74 56 45 53 51 50 66 58 58 49 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 32 25 29 43 31 26 34 23 41 32 30 28 33 26 51 26 48 39 55 24 38 31 30 51 30 27 38 26 28 33 38 32 25 50 57 52 50 62 59 43 59 56 64 62 50 50 55 48 74 52 70 56 71 49 56 61 63 74 56 45 53 51 50 66 58 58 49 MTB > Correlation c2 c3 c4 c5 c6 c8. Correlations: Food, Décor, Service, Summated rating, Locate, Inter Food 0.374 0.000 Décor Service 0.727 0.000 0.633 0.000 Summated rat 0.800 0.000 0.824 0.000 0.914 0.000 Locate 0.089 0.381 0.084 0.406 0.137 0.175 0.120 0.235 Inter 0.203 0.043 0.201 0.045 0.253 0.011 0.257 0.010 Décor Service Summated rat Locate 0.984 0.000 6 252x0581 12/14/05 II. Do at least 4 of the following 6 Problems (at least 10 each) (or do sections adding to at least 38 points – (Anything extra you do helps, and grades wrap around). You must do parts a) and b) of problem 1. Show your work! State H 0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests – That is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has 151 points, but 70 is considered a perfect score. 1. a) If I want to test to see if the mean of x 2 is larger than the mean of x1 my null hypotheses are: i) 1 2 and D 0 v) 1 2 and D 0 ii) 1 2 and D 0 iii) 1 2 and D 0 vi) 1 2 and D 0 vii) 1 2 and D 0 iv) 1 2 and D 0 viii) 1 2 and D 0 (2) Let us revisit Problem B in the take-home. We are going to evaluate the evaluators. Candidate 1 2 3 4 5 6 7 8 9 Moore 52 25 29 33 24 36 42 49 20 Gaston Difference 38 14 31 -6 24 5 29 4 27 -3 28 8 41 1 27 22 31 -11 Assume that we can use the Normal distribution for these data. There are 4 columns here: first, the number of the candidate; second Moore’s evaluation of the candidate; third Gaston’s evaluation of the same candidates. Finally, we have the difference between the ratings. Don’t forget that the data are cross-classified. Note that x1 310 the sums for Moore’s column are and they are x12 11696 . For the difference column, d 34 and d 2 952 b) Compute the mean and standard deviation for Gaston. (2) Show your work! c) The mean rating that has been given to hundreds of job candidates interviewed over the last year is 35. Regard Gaston’s ratings as a random sample from a Normal population and test that his mean rating is below 35. Use (i) Either a test ratio or a critical value for the mean (3) and (ii) an appropriate confidence interval. (2) d) Test the hypothesis that the population mean for Moore is higher than for Gaston. Don’t forget that the data is cross-classified. (3) e) To see how well they agree, compute a correlation between Moore’s and Gaston’s ratings and check it for significance. (5) [17] 7 252x0581 12/14/05 2. Of course it was nonsense to assume that the data was Normally distributed in Problem 1. Candidate 1 2 3 4 5 6 7 8 9 Moore 52 25 29 33 24 36 42 49 20 Gaston Difference 38 14 31 -6 24 5 29 4 27 -3 28 8 41 1 27 22 31 -11 a. But, just in case, test Moore’s column for Normality. Remember that the sample mean and variance were calculated from the data. (5) b. Now test that the median for Gaston is below 35. If you can do it, use the Wilcoxon signed rank test. (4) c. Now test to see if the median for Moore is higher than the median for Gaston. Don’t forget that the data is cross-classified. (4) d. Compute a rank correlation between Moore and Gaston’s ratings and test it for significance. (5) [13] 8 252x0581 12/14/05 3. (Lind et al -190) We are trying to locate a new day-care center. We want to know if the proportion of parents who are eligible to put children in day care is larger on the south side of town than on the east side of town. a) If the south side is Area 1 and the north side is area 2, state the null and alternate hypotheses and do the test. Answer the question “Should we put the center on the south side?” with a yes, no or maybe depending on your result. (4) b) Find a p-value for the null hypothesis in a) (2) c) Do a 2-sided 99% confidence interval for the South East proportion eligible on the South side. (2) Number eligible 88 57 Sample size 200 150 d) Do a 2-sided 89% confidence interval for the proportion eligible on the South side. Do not use a value of z from the t table! (1) e) We suddenly realize that out of 100 parents surveyed on the North side 51 are eligible. Do a test of the hypothesis that the proportion is the same in all 3 areas. (5) f) Use the Marascuilo procedure to determine whether there is a significant difference between the areas with the largest and second largest proportion of eligible parents. (3) [17] 9 252x0581 12/14/05 4. From the data on the right find the following. xy (1) a) b) R 2 (2) c) The sample correlation rxy (1) d) Test the hypothesis that the population correlation between x and y is 0.75 (5) e) Compute a simple regression of x against y . Remember that y is the dependent variable. (4) [13] Row 1 2 3 4 5 6 7 8 9 10 X 3 8 6 9 6 9 6 4 8 5 Y 22 68 68 96 46 80 52 38 78 48 X 64, X Y 596. Y 2 448 2 40000 10 252x0581 12/14/05 5. From the data in Problem 4 find the following a) s e (3) b) Do a significance test for the slope b1 (3) c) Find a confidence interval for the constant b0 (3) d) Do an ANOVA for this regression and explain its meaning. (3) e) Find the value of Y when X is 9 and build a prediction interval around it (3) [15] Row 1 2 3 4 5 6 7 8 9 10 X 3 8 6 9 6 9 6 4 8 5 Y 22 68 68 96 46 80 52 38 78 48 X 64, X Y 596. Y 2 448 2 40000 11 252x0581 12/14/05 6. Do the following a) Assume that n1 10, n 2 12, n3 15, s12 20, and s 22 15, s32 10, .10 and that all come from a Normal distribution. Test the following: (i) 1 15 (2) (ii) 1 2 (2) (iii) Explain what test would be appropriate to test 1 2 3 (1) b) Read both parts of this question before you start. To test the response of an 800 number, I make 20 attempts to reach the number, continuing to call until I get through. (i) I hypothesize that the results follow a Poisson distribution with an unknown mean. Test this – data are at right. (6) (ii) I hypothesize that the results follow a Poisson distribution with a mean of 2 (5) Do not do these two parts using the same method. [16] Number of Unsuccessful Tries before success. 0 1 2 3 4 5 6 7 Observed Frequency 4 2 8 12 10 0 2 2 40 12 252x0581 12/07/05 (Blank) 13 252x0581 12/07/05 ECO252 QBA2 Final EXAM December 14-16, 2005 TAKE HOME SECTION Name: _________________________ Student Number: _________________________ Class days and time: _________________________ III Take-home Exam (20+ points) A) 4th computer problem (5+) This is an internet project. You should do only one of the following 2 problems. Problem 1: In his book, Statistics for Economists: An Intuitive Approach (New York, HarperCollins, 1992), Alan S. Caniglia presents data for 50 states and the District of Columbia. These data are presented as an appendix at the end of this section. The Data consists of six variables. The dependent variable, MIM, the mean income of males (having income) who are 18 years of age or older. PMHS, the percent of males 18 and older who are high school graduates. PURBAN, the percent of total population living in an urban area. MAGE, the median age of males. Using his data, I got the results below. Regression Analysis: MIM versus PMHS The regression equation is MIM = 2736 + 180 PMHS Predictor Constant PMHS Coef 2736 180.08 S = 1430.91 SE Coef 2174 31.31 R-Sq = 40.3% T 1.26 5.75 P 0.214 0.000 R-Sq(adj) = 39.1% Analysis of Variance Source DF SS Regression 1 67720854 Residual Error 49 100328329 Total 50 168049183 MS 67720854 2047517 F 33.07 P 0.000 Unusual Observations Obs PMHS MIM Fit SE Fit Residual St Resid 1 69.1 12112 15180 200 -3068 -2.17R 3 71.6 12711 15630 215 -2919 -2.06R 50 81.9 21552 17485 447 4067 2.99R R denotes an observation with a large standardized residual. His only comment is that a 1% increase in the percent of males that are college graduates results is associated with about a $180 increase in male income and that there is evidence here that the relationship is significant. He then describes three dummy variables: NE = 1 if the state is in the Northeast (Maine through Pennsylvania in his listing); MW = 1 if the state is in the Midwest (Ohio through Kansas) and SO = 1 if the state is in the South (Delaware through Texas). If all of the dummy variables are zero, the state is in the West (Montana through Hawaii). I ran the regression with all six independent variables. To check these variables, look at his data. MTB > regress c2 6 c3-c8; SUBC> VIF; SUBC> brief 2. Regression Analysis: MIM versus PMHS, PURBAN, MAGE, NE, MW, SO The regression equation is MIM = - 1294 + 198 PMHS + 49.4 PURBAN - 42 MAGE + 247 NE + 757 MW + 1269 SO 14 252x0581 12/07/05 Predictor Constant PMHS PURBAN MAGE NE MW SO Coef -1294 198.13 49.36 -42.1 246.6 756.7 1268.9 S = 1271.71 SE Coef 5394 53.97 14.27 151.6 723.7 608.2 863.0 R-Sq = 57.7% T -0.24 3.67 3.46 -0.28 0.34 1.24 1.47 DF 1 1 1 1 1 1 VIF 3.8 1.4 1.5 2.4 2.1 5.2 R-Sq(adj) = 51.9% Analysis of Variance Source DF SS Regression 6 96890414 Residual Error 44 71158768 Total 50 168049183 Source PMHS PURBAN MAGE NE MW SO P 0.811 0.001 0.001 0.783 0.735 0.220 0.149 MS 16148402 1617245 F 9.99 P 0.000 Seq SS 67720854 23781889 281110 1416569 193443 3496549 Unusual Observations Obs PMHS MIM Fit SE Fit Residual St Resid 50 81.9 21552 16999 543 4553 3.96R R denotes an observation with a large standardized residual. He has asked whether region affects the independent variable, on the strength of the significance tests in the output above, he concludes that the regional variables do not have any affect on male income. (Median Age looks pretty bad too.) There are two ways to confirm these conclusions. Caniglia does one of these, an F test that shows whether the regional variables as a group have any effect. He says that they do not. Another way to test this is by using a stepwise regression. MTB > stepwise c2 c3-c8 Stepwise Regression: MIM versus PMHS, PURBAN, MAGE, NE, MW, SO Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 Response is MIM on 6 predictors, with N = 51 Step Constant 1 2736 2 2528 PMHS T-Value P-Value 180 5.75 0.000 134 4.46 0.000 PURBAN T-Value P-Value S R-Sq R-Sq(adj) Mallows C-p 50 3.86 0.000 1431 40.30 39.08 15.0 1263 54.45 52.55 2.3 More? (Yes, No, Subcommand, or Help) SUBC> y No variables entered or removed More? (Yes, No, Subcommand, or Help) SUBC> n What happens is that the computer picks PMHS as the most valuable independent variable, and gets the same result that appeared in the simple regression above. It then adds PURBAN and gets MIM = 2528 + 134 PMHS + 50 PURBAN. The coefficients of the 2 independent variables are significant, the adjusted R-Sq is higher than the adjusted R-sq with all 6 predictors and the computer refuses to add any more independent variables. So it looks like we have found our ‘best’ regression. (See the text for interpretation VIFs and C-p’s.) 15 252x0581 12/07/05 So here is your job. Update this work. Use any income per person variable, a mean or a median for men, women or everybody. Find measures of urbanization or median age. Fix the categorization of states if you don’t like it. Regress state incomes against the revised data. Remove the variables with insignificant coefficients. If you can think of new variables add them. (Last year I suggested trying percent of output or labor force in manufacturing.) Make sure that you pick variables that can be compared state to state. Though you can legitimately ask whether size of a state affects per capita income, using total amount produced in manufacturing is poor because it’s just going to be big for big states. Similarly the fraction of the workforce with a certain education level is far better then the number. For instructions on how to do a regression, try the material in Doing a Regression. For data sources, try the sites mentioned in 252Datalinks. Use F tests for adding the regional variables and use stepwise regression. Don’t give me anything you don’t understand. Problem 2: Recently the Heritage Foundation produced the graph below. What I want to know is if you can develop an equation relating per capita income (the dependent variable) and Economic freedom x . Because it is pretty obvious that a straight line won’t work, you will probably need to create a x 2 variable too. But I would like to know what parts of ‘economic freedom’ affect per capita income. In addition to the Heritage Foundation Sources, the CIFP site mentioned in 252datalinks, and the CIA Factbook might provide some interesting independent variables. You should probably use a sample of no more than 50 countries and it’s up to you what variables to use. You are, of course, looking for significant coefficients and high R-squares. For instructions on how to do a regression, try the material in Doing a Regression. 16 252x0581 12/07/05 B. Do only Problem 1 or problem 2. (Problem Due to Donald R Byrkit). Four different job candidates are interviewed by seven executives. These are rated for 7 traits on a scale of 1-10 and the scores are added together to create a total score for each candidate-rater pair that is between 0 and 70. The results appear below. Row 1 2 3 4 5 6 7 Sum Sum Sum Sum Sum Sum Raters Moore Gaston Heinrich Seldon Greasy Waters Pierce of of of of of of Lee 52 38 54 43 58 36 52 Candidates Jacobs 25 31 38 30 44 28 41 Wilkes 29 24 40 31 46 22 37 Delap 33 29 39 28 47 25 45 Jacobs = 237 squares (uncorrected) of Jacobs = 8331 Wilkes = 229 squares (uncorrected) of Wilkes = 7947 Delap = 246 squares (uncorrected) of Delap = 9094 Personalize the data by adding the second to last digit of your student number to Lee’s column. For example Roland Dough’s student number is 123689, so he uses 52 + 8 = 60, 38 + 8 = 46, 62 etc. If the second to last digit of your student number is zero, add 10. Problem 1: a) Assume that a Normal distribution applies and use a statistical procedure to compare the column means, treating each column as an independent random sample. If you conclude that there is a difference between the column means, use an individual confidence interval to see if there is a significant difference between the best and second-best candidate. If you conclude that there is no difference between the means, use an individual confidence interval to see if there is a significant difference between the best and worst candidate. (6) b) Now assume that a Normal distribution does not apply but that the columns are still independent random samples and use an appropriate procedure to compare the column medians. (4) [16] Problem 2: a) Assume that a Normal distribution applies and use a statistical procedure to compare the column means, taking note of the fact that each row represents one executive. If you conclude that there is a difference between the column means, use an individual confidence interval to see if there is a significant difference between the best and second-best candidate. If you conclude that there is no difference between the column means, use an individual confidence interval to see if there is a significant difference between the kindest and least kind executive. (8) b) Now assume that a Normal distribution does not apply but that each row represents the opinion of one rater and use an appropriate procedure to compare the column medians. (4) c) Use Kendall’s coefficient of concordance to show how the raters differ and do a significance test. (3) Problem 3: (Extra Credit) Decide between the methods used in Problem 1 and Problem 2. To do this, test for equal variances and for Normality on the computer. What is your decision? Why? (4) You can do most of this with the following commands in Minitab if you put your data in 3 columns of Minitab with A, B, C and D above them. MTB > MTB > SUBC> SUBC> MTB > MTB > AOVOneway A B C D stack A B C D C11; subscripts C12; UseNames. rank C11 C13 vartest C11 C12 MTB > Unstack (c13); SUBC> Subscripts c12; SUBC> After; SUBC> VarNames. MTB > NormTest 'A'; SUBC> KSTest. #Does a 1-way ANOVA # Stacks the data in c12, col.no. in c12. #Puts the ranks of the stacked data in c13 #Does a bunch of tests, including Levene’s On stacked data in c11 with IDs in c12. #Unstacks the ranks in the next available # columns. Uses IDs in c12. #Does a test (apparently Lilliefors) for Normality # on column A. 17 252x0581 12/07/05 C. You may do both problems. These are intended to be done by hand. A table version of the data for problem 2 is provided in 2005data1 which can be downloaded to Minitab. I do not want Minitab results for these data except for Problem 2e. Problem 1: Using data from the 1970s and 1980s, Alan S. Caniglia calculated a regression of nonresidential investment on the change in level of final sales to verify the accelerator model of investment. This theory says that because capital stock must be approximately proportional to production, investment will be driven by changes in output. In order to check his work I put together a data set 2005series. The last two years of the series are in Exhibit C1 below. Exhibit C1 Row Date 73 1988 01 74 1988 02 75 1988 03 76 1988 04 77 1989 01 78 1989 02 79 1989 03 80 1989 04 RPFI 862.406 879.330 882.704 891.502 900.401 901.643 917.375 902.298 Sales 6637.22 6716.38 6749.47 6835.07 6873.33 6933.55 7015.34 7026.76 Sales-4Q 6344.41 6431.37 6510.82 6542.55 6637.22 6716.38 6749.47 6835.07 Change 292.815 285.006 238.644 292.522 236.106 217.171 265.876 191.695 DEFL %Y 2.897 3.318 3.699 3.724 4.013 4.016 3.596 3.537 MINT % 9.88 9.67 9.96 9.51 9.62 9.79 8.93 8.92 RINT 6.983 6.352 6.261 5.786 5.607 5.774 5.334 5.383 ‘Date’ consists of the year and the quarter. ‘RPFI’ consists of real fixed private investment from 2005InvestSeries1. ‘Sales’ consists of sales data (actually a version of gross domestic product) from 2005SalesSeries1. ‘Sales-4Q’ (Sales 4 Quarters earlier’ is also sales data from 2005SalesSeries1, but is the data of one year earlier. (Note that the 1989 numbers in ‘Sales-4Q’ are identical to the 1988 numbers in ‘Sales.’ ‘Change’ is ‘Sales’ – ‘Sales-4Q. ‘DEFL %Y’ is the percent change in the gross domestic deflator over the last year (a measure of inflation) taken from 2005deflSeries1. ‘MINT %’ is an estimate of the percent return on Aaa bonds taken from 2005intSeries1. Only the values for January, April, July and October are used since quarterly data was not available. ‘RINT’ (an estimate of the real interest rate) is ‘MINT %’ - ‘DEFL %Y’. These are manipulated in the input to the regression program as in Exhibit C2 below. Exhibit C2 Row Time 73 1988 01 74 1988 02 75 1988 03 76 1988 04 77 1989 01 78 1989 02 79 1989 03 80 1989 04 Y 86.2406 87.9330 88.2704 89.1502 90.0401 90.1643 91.7375 90.2298 X1 29.2815 28.5006 23.8644 29.2522 23.6106 21.7171 26.5876 19.1695 X2 6.98 6.35 6.26 5.79 5.61 5.77 5.33 5.38 Here Y is ‘RFPI’ divided by 10. X1 is ‘Change’ divided by 10. X2 is ‘RINT’ rounded to eliminate the last decimal place. If you don’t understand how I got Exhibit C2 from Exhibit C1 find out before you go any further, Personalize the data by adding one year (four values) to the data in 2005 series. Pick the year to be added by adding the last digit of your student number to 1990. Make sure that I know the year you are using. Then get, for your year, ‘RPFI’ from 2005InvestSeries1, ‘Sales’ from 2005SalesSeries1, ‘Sales-4Q’ from 2005SalesSeries1 (Make sure that you use the sales of one year earlier, not 1989 unless your year is 1990.), ‘DEFL %Y’ 2005deflSeries1 and ‘MINT %’ from 2005intSeries1. Calculate ‘Change’ by subtracting ‘Sales-4Q’ from ‘Sales.’ If you are going to do Problem 2, calculate ‘RINT’ by subtracting ‘DEFL %Y’ from ‘MINT %.’ Present your four rows of new values in the format of Exhibit C1. Now manipulate your numbers to the form in Exhibit C2 and again present your four rows of numbers. These are observations 81 through 84. Now it’s time to compute your spare parts. The following are computed for you from the data for 1970 through 1989. 18 252x0581 12/07/05 Sum of Y = 5323.20 Sum of X1 = 1283.42 Sum of X2 = 328.33 Sum of Ysq = 371032 Sum of X1sq = 30307.57 Sum of X2sq = 2080.65 Sum of X1Y = 92676.9 Sum of X2Y = 24188.2 Sum of X1X2 = 6324.09 Y 5323 .2 X 1 1283.42 X 2 328.33 Y 371032 X 30307 .6 X 2080 .65 X 1 Y 92676.9 X 2 Y 24188.2 X 1 X 2 6324.09 n 80 2 2 1 2 2 Add the results of your data to these sums (You only need the sums involving X1 and Y if you are not doing Problem 2.) (Show your work!) and do the following. a. Compute the regression equation Yˆ b0 b1 x1 to predict investment on the basis of change in sales only. (2) b. Compute R 2 . (2) c. Compute s e . (2) d. Compute s b0 and do a significance test on b0 (1.5) e. Compute s b1 and do a significance test on b1 (2) f. In the first quarter of 2001 sales were 9883.167, the interest rate was 7.15% and the gdp inflation rate was 2.176%. In the first quarter of 2000 sales were 9668.827. Get values of Y and X1 from this and predict the level of investment for 2001. Using this, create a confidence interval or a prediction interval for investment in 2001 as appropriate. (3) g. Do an ANOVA table for the regression. What conclusion can you draw from the hypothesis test in the ANOVA? (2) [30] Problem 2: Continue with the data in problem 1. a. Compute the regression equation Yˆ b0 b1 x1 b2 x 2 to predict investment on the basis of real interest rates and change in sales. Do not attempt to use the value of b1 you got in problem 1. Is the sign of the coefficient what you expected? Why? (5) b. Compute R-squared and R-squared adjusted for degrees of freedom for this regression and compare them with the values for the previous problem. (2) c. Using either R – squares or SST, SSR and SSE do an F tests (ANOVA). First check the usefulness of the multiple regression and show whether the use of real interest rates gives a significant improvement in explanatory power of the regression? (Don’t say a word without referring to a statistical test.) (3) d. Use the values in 1f to compute a prediction for 2001 investment. By what percent does the predicted investment change if you add real interest rates? (2) e. If you are prepared to explain the results of VIF and Durbin-Watson (Check the text!), run the regression of Y on X1 and X2 using MTB > Regress Y 2 X1 X2; SUBC> VIF; SUBC> DW; SUBC> Brief 2. Explain your results. (2) [44] 19 252x0581 12/07/05 Caniglia’s Original Data Data Display Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 STATE ME NH VT MA RI CT NY NJ PA OH IN IL MI WI MN IA MO ND SD NE KS DE MD DC VA WV NC SC GA FL KY TN AL MS AR LA OK TX MT ID WY CO NM AZ UT NV WA OR CA AK HI MIM 12112 14505 12711 15362 13911 17938 15879 17639 15225 16164 15793 17551 17137 15417 15878 15249 14743 13835 12406 14873 15504 16081 17321 15861 15506 13998 12529 12660 13966 14651 13328 13349 13301 11968 12274 15365 14818 16135 14256 14297 17615 16672 14057 15269 15788 16820 17042 15833 17128 21552 15268 PMHS 69.1 73.0 71.6 74.0 65.1 71.8 68.9 70.0 68.0 69.0 68.8 68.9 69.3 70.9 73.5 71.9 66.2 68.0 68.3 74.2 74.5 70.4 69.2 67.9 64.3 58.6 58.2 58.2 60.4 68.0 55.8 59.0 59.9 57.2 58.3 61.3 68.7 65.3 73.8 73.5 77.9 79.1 70.6 73.4 80.4 76.0 77.5 75.1 74.3 81.9 76.9 PURBAN 47.5 52.2 33.8 83.8 87.0 78.8 84.6 89.0 69.3 73.3 64.2 83.3 70.7 64.2 66.9 58.6 68.1 48.8 46.4 62.9 66.7 70.6 80.3 100.0 66.0 36.2 48.0 54.1 62.4 84.3 50.9 60.4 60.0 47.3 51.6 68.6 67.3 79.6 52.9 54.0 62.7 80.6 72.1 83.8 84.4 85.3 73.5 67.9 91.3 64.3 86.5 MAGE 29.2 29.2 28.4 29.6 30.1 30.6 30.3 30.7 30.4 28.6 28.0 28.6 27.8 28.3 28.3 28.7 29.3 27.5 27.9 28.6 28.7 28.7 29.2 29.9 28.6 29.1 28.1 26.7 27.3 32.9 27.8 28.7 27.8 26.1 29.2 26.2 28.6 27.1 28.4 27.0 26.7 27.9 26.6 28.2 23.8 30.0 29.0 29.5 28.9 26.3 27.6 NE 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MW 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SO 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 20