252y0242 5/07/02 ECO252 QBA2 FINAL EXAM May 6, 2002 Name: KEY Hour of Class Registered (Circle)

Note: If this is the only thing you look at before taking the final, you are badly cheating yourself. People who used last year's final and did not read the problems carefully got very wrong answers to them. If you can't be bothered to think, there is not much point to taking this course or this exam.

Note: Have you reread "Things that You Should Never Do On a Statistics Exam"? I think I could have graded this exam by just looking for violations of these rules. The following comments are cut and pasted from previous exams. People made the same mistakes on this exam.

If you still think that a large p-value means that a coefficient is significant, you need a conference with an audiologist. Further note that a p-value is a probability and can only be compared with another probability (like the significance level). The rule on p-values: if the p-value is less than the significance level, reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.

Don't tell me that a negative coefficient in a regression doesn't look right. There is nothing wrong with a negative regression coefficient, unless you have a good reason to believe that it shouldn't be negative.

How many times do I have to tell you:
1) Null hypotheses almost always contain parameters, like μ, σ², p, λ, ρ and μ₁ − μ₂. They never contain sample statistics like x̄, s, p̂, d̄, x̄₁ − x̄₂ or p̂₁ − p̂₂.
2) Null hypotheses almost always contain equalities. (This means that if you want to test μ > 5, it's an alternate hypothesis and the null hypothesis is μ ≤ 5.) When you find that you have a 1-sided alternate hypothesis, you do not use 2-sided confidence intervals or tests.

Many people were too lazy or ignorant to compute Σx₁y₁ from the x₁ and y₁ columns, or to get Σx₂² and Σx₂y from x₂ and y. There is no way in the universe to get Σx₁y₁ from Σx₁ and Σy₁, or Σx₁² from Σx₁. You will always be asked to compute a sum of this sort on an exam, so figure out how to do it in advance.

A test of multiple proportions is a χ² test! Every year I see people trying to compare more than two proportions by a method appropriate for b) below. It doesn't work! Δp = p₁ − p₂ is defined as a difference between two proportions; when you have more than two, that definition doesn't work. Also, simply computing the proportions and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.

A test of multiple means is an Analysis of Variance! Every year I see people trying to compare more than two means by a method appropriate for comparing two means. It doesn't work! μ₁ − μ₂ is defined as a difference between two means; when you have more than two, that definition doesn't work. Also, simply computing the means and telling me that they are different is just a way of making me suspect that you don't know what a statistical test is.

Most people groan when I say that the final exam is cumulative. On this exam someone claimed it wasn't cumulative enough, perhaps because the student had hardly been to class after the 3rd exam. The questions and the outline sections they covered follow. I: I2, J3, K2, J4, K2. II.1: K4, K3. II.2: D7, D2, F1, E3. II.3: D5, E4, B6 or D5. II.4: E1, D6, B5. II.5: G, H and I. II.6: J, J4, K2, J2, K2, F2. II.7: L2, L5, L6, K5.

I. (16+ points) Do all the following.

1. Hand in your fourth regression problem (2 points) Remember: Y = 'Vol' = volatility (Std.
Deviation of return) , X1 = 'CR' = Credit rating on a zero to 100 (per cent) scale, X2 = 'emd' = a dummy variable that is 1 if a country has an emerging market , 0 if the country has a developed market, X3 = 'ecr' = the product of 'CR' and 'emd', X4 = 'gdp' = per capita income in thousands of US dollars in the late '90's, X5 = 'gd-cr' = the product of 'CR' and 'gdp.' We would expect foreign exchange rates to become less volatile as i) credit rating improves, ii) markets become developed, and iii) per capita income rises. Remember saying 'yes' or 'no' to a question is useless unless you cite statistical tests. Use a significance level of 1% in this problem except when you are told otherwise. 2. Answer the following questions: a. For the regression of 'Vol' against 'CR', 'emd', 'ecr' 'gdp' and 'gd-cr' , what coefficients are significant at the 5% level? Why? What about the 1% level? (3) b. Given the comments at the beginning of this page, what signs would you expect the coefficients to have. Do they have the expected signs? (4) c. For the same regression, what does the ANOVA tell us? Why? (2) d. In view of the analysis above, is there a regression that seems to work better than the one mentioned in a) above? Why? (2) The problem in the text says "Write a model that describes the relationship between volatility (Y) and credit rating as two nonparallel lines, one for each type of market ……. Is there evidence to conclude that the slope of the linear relationship between volatility (Y) and credit rating (X1) depends on market type?" a. What equation did you fit that answers the questions in the text? Given the coefficients that you found, what are the two equations (and coefficients) that your equation implies for these two market types? (3) b. Using the 1% confidence level, what evidence can you present as to whether the slope depends on market type? (2) What equation was suggested by your stepwise regression. 
Does this seem to work as well as the one suggested by the textbook authors? Why? If you compare the slope of the regression line relating volatility to the credit rating for countries with gdps of 2 (thousand) and 20 (thousand), what seems to be happening to the slope as per capita gdp rises? (5)

Solution: 2a. The printout from the computer for this equation says:

Regression Analysis

* NOTE * gd-cr is highly correlated with other predictor variables

The regression equation is
Vol = 40.1 - 0.227 CR + 16.9 emd - 0.332 ecr - 0.066 gdp + 0.00100 gd-cr

Predictor      Coef     Stdev   t-ratio       p
Constant     40.096     9.654      4.15   0.000
CR          -0.2268    0.1494     -1.52   0.142
emd          16.904     9.442      1.79   0.086
ecr         -0.3316    0.1494     -2.22   0.036
gdp         -0.0659    0.4245     -0.16   0.878
gd-cr      0.001000  0.005436      0.18   0.856

s = 2.768    R-sq = 96.2%    R-sq(adj) = 95.4%

The following coefficients have p-values below .05: the constant (.000) and ecr (.036). These are significant at the 5% level. Since the p-value for the constant is the only one below .01, only the constant is significant at the 1% level.

2b. From the comments at the top of the page:
CR: We would expect foreign exchange rates to become less volatile as credit rating improves, so this would be negative. (It was!)
emd: This variable equals 1 if a country does not have a developed market. We would expect foreign exchange rates to become less volatile as markets become developed, so its sign should be positive. (It was.)
ecr: This is a product of the two variables above. It is zero for developed markets. In emerging markets, we might expect more sensitivity to the credit rating, since they are often thinner markets. So this is most likely negative. (It was.)
gdp: We would expect foreign exchange rates to become less volatile as per capita income rises. This would have a negative sign. (It was.)
gd-cr: Since this variable will have its highest value for countries with high credit ratings and high incomes, this should have a negative sign too.
(Note that it does not!)

2c. The ANOVA is below:

Analysis of Variance

SOURCE        DF        SS       MS        F      p
Regression     5   4596.79   919.36   120.01  0.000
Error         24    183.86     7.66
Total         29   4780.65

SOURCE   DF   SEQ SS
CR        1  4388.03
emd       1    55.70
ecr       1   152.78
gdp       1     0.02
gd-cr     1     0.26

The most important conclusion here is that since the p-value is tiny, we can reject the null hypothesis that the independent variables have no ability to explain variation of Y.

2d. We have a high R-squared here (96.2%, or 95.4% adjusted), but the coefficients are not very significant and one seems to have the wrong sign. The regression recommended by the authors, 'Vol' against 'CR', 'emd' and 'ecr', seems to have as good an R-squared (96.1%, or 95.7% adjusted), significant coefficients at both the 1% and 5% levels, and the expected signs on the coefficients.

3a. The equation mentioned in 2d does the job. It is: Vol = 38.6 - 0.204 CR + 18.3 emd - 0.354 ecr. For a developed market, 'emd' and 'ecr' are zero, so the equation becomes Vol = 38.6 - 0.204 CR. But for an emerging market 'emd' is 1 and 'ecr' is equal to 'CR', so the equation becomes Vol = 38.6 - 0.204 CR + 18.3(1) - 0.354 CR, or Vol = 56.9 - 0.558 CR.

3b. All the p-values are below 1%, especially that of the coefficient of 'ecr'. So the slope depends on market type.

4. The stepwise regression suggested the equation Vol = 56.0 - 0.523 CR + 0.00442 gd-cr. All the coefficients have a zero p-value and are thus highly significant; however, 'gd-cr' has the wrong sign. Remember that 'gd-cr' is the product of 'gdp' and 'CR'. If a country has a 'gdp' of 2, the equation becomes Vol = 56.0 - 0.523 CR + 0.00442(2) CR, or Vol = 56.0 - 0.514 CR. But if the gdp is 20, the equation becomes Vol = 56.0 - 0.523 CR + 0.00442(20) CR, or Vol = 56.0 - 0.435 CR. This seems to indicate that volatility is less responsive to credit rating in rich countries. This doesn't sound reasonable to me.

II.
Do at least 4 of the following 7 Problems (at least 15 each) (or do sections adding to at least 60 points Anything extra you do helps, and grades wrap around) . Show your work! State H 0 and H1 where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests. 1. A researcher is investigating the behavior of the Dow-Jones Transportation, Industrial and Utility averages. Data is presented below for closing numbers for 14 days in May 2001. Because the researcher believes that the underlying distributions are not Normal, she computes rank correlations instead of standard correlations. For your convenience, ranks have been computed for Transportation and Industry. a. Check the utilities for rises and falls in value, marking rises with + and falls with -. Using a statistical test, find out if the pattern of rises and falls is random. (5) b. Compute a rank correlation between industry and utility prices and test it for significance. (5) c. Compute a measurement of concordance between the three series and test it for significance. Express it on a zero to one scale. (6) Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Date 5 5 5 5 5 5 5 5 5 5 5 5 5 5 07 08 09 10 11 14 15 16 17 18 21 22 23 24 Trans Indust x1 r1 x2 2850.64 2865.54 2865.73 2899.76 2879.56 2874.59 2880.24 2925.50 2957.58 2978.95 3004.35 2990.97 2969.16 2951.01 1 2 3 7 5 4 6 8 10 12 14 13 11 9 10935.2 10888.5 10867.0 10910.4 10821.3 10877.3 10873.0 11215.9 11248.6 11301.7 11337.9 11257.2 11105.5 11122.4 Util r2 7 5 2 6 1 4 3 10 11 13 14 12 8 9 x3 383.93 378.74 383.74 383.52 386.64 391.04 385.70 387.52 387.84 391.54 394.43 394.67 398.31 397.68 Solution: Data is repeated with ranks and signs added for x3 . It's remarkable how many people thought that they could do this without ranking x3 . 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 5 5 5 5 5 5 5 5 5 5 5 5 5 5 07 08 09 10 11 14 15 16 17 18 21 22 23 24 x1 r1 x2 2850.64 2865.54 2865.73 2899.76 2879.56 2874.59 2880.24 2925.50 2957.58 2978.95 3004.35 2990.97 2969.16 2951.01 1 2 3 7 5 4 6 8 10 12 14 13 11 9 10935.2 10888.5 10867.0 10910.4 10821.3 10877.3 10873.0 11215.9 11248.6 11301.7 11337.9 11257.2 11105.5 11122.4 r2 7 5 2 6 1 4 3 10 11 13 14 12 8 9 x3 383.93 378.74 383.74 383.52 386.64 391.04 385.70 387.52 387.84 391.54 394.43 394.67 398.31 397.68 r3 + + + + + + + + + - 4 1 3 2 6 9 5 7 8 10 11 12 14 13 a) This is a runs test. The null hypothesis is randomness. r 7, n1 4, n 2 9 and n n1 n 2 13 . Since the numbers given on the 5% runs test table are 3 and a blank, and r 6 is above 3, we cannot reject the null hypothesis. 5 252y0242 5/07/02 b) row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 sum r3 d r2 r3 r2 7 5 2 6 1 4 3 10 11 13 14 12 8 9 105 4 1 3 2 6 9 5 7 8 10 11 12 14 13 105 3 4 -1 4 -5 -5 -2 3 3 3 3 0 -6 -4 0 d2 9 16 1 16 25 25 4 9 9 9 9 0 36 16 184 From the outline: d 1 6184 1 nn 1 14 14 1 2 6 rs 2 1 2 1104 1 0.4044 0.596 14195 The 14 line from the rank correlation table has n .050 .025 .010 .005 14 .4593 .5341 .6220 .6978 H 0 : s 0 If you tested at the 5% level, reject the null hypothesis of no relationship if rs is above .4593 H 1 : s 0 H 0 : s 0 or, if you tested at the 5% level, reject the null hypothesis if rs is above .5341. So, in this H 1 : s 0 case we have a significant rank correlation. c) To compute S write row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 sum SR 2 SR r1 r2 r3 1 2 3 7 5 4 6 8 10 12 14 13 11 9 105 7 5 2 6 1 4 3 10 11 13 14 12 8 9 105 4 1 3 2 6 9 5 7 8 10 11 12 14 13 105 12 8 8 15 12 17 14 25 29 35 39 37 33 31 315 144 64 64 225 144 289 196 625 841 1225 1521 1369 1089 961 8757 From the outline: Take k columns with n items in each and rank each column from 1 to n . The null hypothesis is that the rankings disagree Compute a sum of ranks SRi for each row. 
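If you want to check the rank correlation in b) and the coefficient of concordance in c) by machine, a few lines of Python (not required on the exam; the variable names are mine) reproduce the numbers from the rank columns:

```python
# Rank columns from the table: r1 = Transportation, r2 = Industrials, r3 = Utilities.
r1 = [1, 2, 3, 7, 5, 4, 6, 8, 10, 12, 14, 13, 11, 9]
r2 = [7, 5, 2, 6, 1, 4, 3, 10, 11, 13, 14, 12, 8, 9]
r3 = [4, 1, 3, 2, 6, 9, 5, 7, 8, 10, 11, 12, 14, 13]
n, k = 14, 3

# b) Spearman rank correlation between Industrials and Utilities.
d2 = sum((a - b) ** 2 for a, b in zip(r2, r3))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 4))          # 184 0.5956

# c) Kendall coefficient of concordance for all three series.
SR = [a + b + c for a, b, c in zip(r1, r2, r3)]   # row sums of ranks
mean_SR = k * (n + 1) / 2                         # 22.5
S = sum((s - mean_SR) ** 2 for s in SR)
W = 12 * S / (k ** 2 * n * (n ** 2 - 1))
chi2_stat = k * (n - 1) * W                       # compare with chi-squared, n - 1 = 13 d.f.
print(S, round(W, 4), round(chi2_stat, 1))        # 1669.5 0.8154 31.8
```

The 0.5956 exceeds even the .025 critical value of .5341 from the rank correlation table, and 31.8 exceeds the 5% chi-squared critical value for 13 degrees of freedom (22.36), matching the conclusions in the hand calculation.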
Then S SR 2 n SR 2 8757 14 22 .52 1669 .5 , 315 n 1k 15 3 22 .5 is the mean of the SRi s. If H 0 is disagreement, S can be 14 2 2 checked against a table for this test. If S S reject H 0 . Since n is too large for the table use where SR 2n1 22 k n 1W 313 .8154 W S 1 k2 12 n 3 n 1669 .5 3 1 12 2 14 3 14 S n 1 1 kn 12 1669 .5 31 .8 , where 314 15 1 12 20034 .8154 is the Kendall Coefficient of Concordance and 24570 must be between 0 and 1. Since this is below .2052 5.9915 , it is significant and we reject H 0 . 6 252y0242 5/07/02 2. (Pelosi and Sandifer) A diaper company is testing three filler materials for diapers. Eight diapers were tested with each of the three filler materials making a total of 24 diapers put on 24 toddlers. Each column ( x1 , x 2 , and x3 ) can be considered a random sample of eight taken from a Normally distributed population. As each toddler played, fluid was injected into the diaper until the product leaked. Each number below in x1 , x 2 , and x3 represents the capacity of the diaper. The remaining columns ( r1 , r2 , and r3 ) are a ranking of the 24 numbers. In this entire problem we assume that the underlying distributions are Normal. Row 1 2 3 4 5 6 7 8 x1 r1 792 790 797 803 811 791 801 791 5.0 2.0 6.0 8.5 13.5 3.5 7.0 3.5 x2 r2 809 818 803 781 813 808 805 811 12.0 17.0 8.5 1.0 15.5 11.0 10.0 13.5 x3 r3 826 813 854 843 846 847 835 872 18.0 15.5 23.0 20.0 21.0 22.0 19.0 24.0 The following are computed for you: x 6376.00, x 6448.00, x x 1 3 2 2 6736.00, 5197954, n1 n2 n3 8 . x x 2 2 1 5082066, 2 3 5673944 and a. Compute the sample variances of x1 and x3 and test the hypothesis that the population variances for these two columns are equal. (4) b. Assume that the variances of the populations from which x 2 and x3 come are equal and test the hypothesis that 3 is greater than 2 i) First state your null and alternate hypotheses (2) and then test the hypotheses using a (ii) test ratio, (iii) a critical value and (iv) a confidence interval. (6) c. 
Test if the hypothesis that the means of all three populations are equal holds water (7) d. Use a test of goodness of fit to see if x 2 has the Normal distribution. (5) Solution: a) As I'm sure we all know: x1 6376 x12 n1x12 5082066 8797 2 x1 797 s12 s1 7.5024 56 .286 n1 8 n1 1 7 x2 x3 x 2 n2 x n3 3 6448 806 s22 8 6736 842 s32 8 x 2 2 n2 x2 2 n2 1 x 2 3 n3 x32 n3 1 5197954 8806 2 123 .714 7 s 2 11 .1227 5673944 8842 2 318 .857 7 s 3 17.8566 If we follow 252meanx4 in the outline: Our Hypotheses are H 0 : 12 32 and H1 : 12 32 . DF1 n1 1 7 and DF3 n3 1 7 , Since the table is set up for one sided tests, if we wish to test H 0 : 12 32 , we must do two separate one-sided tests. First test DF1, DF 3 F 7,7 4.99 F.025 and then test .025 s 22 s12 s12 s 32 56 .286 0.177 against 318 .857 318 .857 DF 3, DF1 F 7,7 4.99 . If 5.665 against F.025 .025 56 .286 either test is failed, we reject the null hypothesis. Since 5.665 is above the table F, we reject the null hypothesis. We really should not be using a method for comparing the means that requires equal variances. 7 252y0242 5/07/02 b) From Table 3 of the Syllabus Supplement: Interval for Confidence Hypotheses Interval Difference H0 : 0 d t 2 sd between Two H1 : 0 1 1 Means ( sd s p 1 2 n1 n2 unknown, variances DF n1 n2 2 assumed equal) Test Ratio t sˆ 2p Critical Value d cv 0 t 2 sd d 0 sd n1 1s12 n2 1s22 n1 n2 2 x 2 806 , s22 123.714 , s 2 11 .1227 , x3 842 s32 318.857, s 2 17 .8566 and n 2 n3 8 . d x 2 x3 806 842 36 sˆ 2p n1 1s 22 n2 1s32 n 2 n3 2 .05, 7123 .714 7318 .857 123 .714 318 .857 221 .2855 14 2 221 .2855 1 1 221 .2855 .25 1 1 n 2 n3 s d sˆ p = DF n 2 n3 2 8 8 2 14 8 8 14 t .05 1.761 55 .321375 7.4378 H 0 : 3 2 H 0 : 2 3 H 0 : 0 (i) Hypotheses: or or 2 3 H1 : 0 H 1 : 3 2 H 1 : 2 3 (ii) Test Ratio: t d 0 36 0 4.840 . Make a diagram of an almost-Normal curve with a mean at 7.4378 sd zero and a 'reject' region below -1.761. Since -4.840 is in the reject region, reject the null hypothesis. 
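All four parts of this problem can be cross-checked numerically. The sketch below (optional; scipy is used for the ANOVA and the normal CDF) works from the raw columns rather than the precomputed sums:

```python
from statistics import mean, variance, stdev
from math import sqrt
from scipy import stats

x1 = [792, 790, 797, 803, 811, 791, 801, 791]
x2 = [809, 818, 803, 781, 813, 808, 805, 811]
x3 = [826, 813, 854, 843, 846, 847, 835, 872]
n = 8

# a) F ratio for equal variances (larger sample variance over the smaller).
F_ratio = variance(x3) / variance(x1)
print(round(variance(x1), 3), round(variance(x3), 3), round(F_ratio, 3))
# 56.286 318.857 5.665 -- above F.025(7,7) = 4.99, so reject equal variances

# b) Pooled-variance t test of H1: mu3 > mu2.
sp2 = ((n - 1) * variance(x2) + (n - 1) * variance(x3)) / (2 * n - 2)
sd = sqrt(sp2 * (2 / n))
t = (mean(x2) - mean(x3)) / sd
print(round(sd, 4), round(t, 2))      # 7.4378 -4.84 -- well below -1.761, reject H0

# c) One-way ANOVA across the three fillers.
F, p = stats.f_oneway(x1, x2, x3)
print(round(F, 2), p < 0.05)          # 27.28 True -- reject equal means

# d) Lilliefors-style check of x2 against a fitted normal: empirical CDF i/n
#    versus the fitted normal CDF, as in the hand calculation.
xs = sorted(x2)
xbar, s = mean(xs), stdev(xs)
D = max(abs((i + 1) / n - stats.norm.cdf((v - xbar) / s)) for i, v in enumerate(xs))
print(round(D, 3))                    # 0.144 -- below the 5% critical value 0.285
```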
(iii) Critical Value: d cv 0 t 2 sd becomes d cv 0 t s d 0 1.7617.4378 13 .098 . Make a diagram of an almost-Normal curve with a mean at zero and a 'reject' region below -13.098. Since -36 is in the reject region, reject the null hypothesis. (iv) Confidence Interval: d t 2 sd becomes d t s d 36 1.7617.4378 22 .092 since 22 .092 contradicts H 0 : 0 , reject the null hypothesis. Solution: H 0 : 1 2 3 H 1 : not all means equal .05 Sum 1 792 790 797 803 811 791 801 791 2 809 818 803 781 813 808 805 811 3 826 813 854 843 846 847 835 872 Sum 6376 + 6448 + 6736 + nj 8+ 8+ 8+ 19560 x ij 24 n x j 797 806 842 SS 5082066 + 5197954 + 5673944 + 19560 815 x 24 15953964 x 2j 635209+ 649636+ 708964 = 1993809 x 2 ij 8 252y0242 5/07/02 x nx 15953964 24815 12564 SSB n x nx 8797 8806 8842 24815 SST 2 ij 2 j .j 2 2 2 2 2 2 2 81993809 15941400 9072 Note that none of the items in the SS column can be negative. Source SS Between Within Total DF MS 9072 2 4536 3492 12564 21 23 166 F F.05 27.28 F 2,21 3.47 s H0 Column means equal Since the value of F we calculated is more than the table value, we reject the null hypothesis and conclude that there is a significant difference between column means. d) H 0 : N ?, ? H 1 : Not Normal Because the mean and standard deviation are unknown, this is a Lilliefors problem. Note that data must be in order for the Lilliefors or K-S method to work. From the data we found that x 806 and xx . F t actually is computed from the Normal table. For example s 11 .1227 . t s Fe 781 Px 781 Pz 2.25 Pz 0 P2.25 z 0 .5 .4878 .0122 . D is the difference (absolute value) between the two cumulative distributions. 
Row 1 2 3 4 5 6 7 8 x 781 803 805 808 809 811 813 818 t -2.25 -0.27 -0.09 0.18 0.27 0.45 0.63 1.08 O 1 1 1 1 1 1 1 1 O n 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125 FO 0.125 0.250 0.375 0.500 0.625 0.750 0.875 1.000 Fe 0.0122 0.3936 0.4641 0.5714 0.6064 0.6736 0.7357 0.8599 D 0.1128 0.1436 0.0891 0.0714 0.0186 0.0764 0.1393 0.1401 The maximum deviation is 0.1436. The Lilliefors table for .05 and n 8 gives a critical value of 0.285. Since our maximum deviation does not exceed the critical value, we do not reject H 0 . 9 252y0242 5/07/02 3. Data from the previous problem is repeated. In this problem assume that the underlying distributions are not Normal. Remember that each column is an independent sample. Row 1 2 3 4 5 6 7 8 x1 792 790 797 803 811 791 801 791 r1 5.0 2.0 6.0 8.5 13.5 3.5 7.0 3.5 x2 r2 x3 809 818 803 781 813 808 805 811 12.0 17.0 8.5 1.0 15.5 11.0 10.0 13.5 r3 826 813 854 843 846 847 835 872 18.0 15.5 23.0 20.0 21.0 22.0 19.0 24.0 The following are computed for you: x 6376.00, x 6448.00, x x 1 3 2 2 6736.00, 5197954, n1 n2 n3 8 . x x 2 2 1 5082066, 2 3 5673944 and a. Test the hypothesis that the median of the population underlying x3 is larger than the median of the population underlying x 2 . (6) b. Test the hypothesis that all three columns come from populations with equal medians. (7) c. Test the hypothesis that x 2 comes from a population with a median of 804 using either a sign test (4) or a Wilcoxon signed rank test (5). Solution: a) The null hypothesis is H 0 : The median of the population underlying x3 is larger than the median of the population underlying x 2 . The text below is largely repeated from 252review. The data are repeated in order. The ranks that appear above appear in parentheses, since they make it easier to find r2 and r3 , the ranks of the numbers among the 16 numbers in the two groups. 
Row 1 2 3 4 5 6 7 8 x2 809 818 803 781 813 808 805 811 ( r2 ) 12.0 17.0 8.5 1.0 15.5 11.0 10.0 13.5 r2 5 9 2 1 7.5 4 3 6 37.5 x3 826 813 854 843 846 847 835 872 ( r3 ) 18.0 15.5 23.0 20.0 21.0 22.0 19.0 24.0 r3 10 7.5 15 12 13 14 11 16__ 98.5 Since this refers to medians instead of means and if we assume that the underlying distribution is not Normal, we use the nonparametric (rank test) analogue to comparison of two sample means of independent samples, the Wilcoxon-Mann-Whitney Test. (Note that data is not cross-classified so that the Wilcoxon Signed Rank Test is not applicable. ) H 0 : 2 3 H1 : 2 3 . We get TL 37 .5 and TU 98.5 . Check: Since the total amount of data is 8 + 8 = 16 n , 37.5 + 98.5 nn 1 16 17 136 .They do. 2 2 For a 5% one-tailed test with n1 8 and n 2 8 , Table 6 says that the critical values are 52 and 84. We accept the null hypothesis in a 1-sided test if the smaller of thee two rank sums lies between the critical values. The lower of the two rank sums, W 37.5 is not between these values, so reject H 0 . b) Since this involves comparing three apparently random samples from a non-normal distribution, we use a Kruskal-Wallis test. The null hypothesis is H 0 : Columns come from same distribution or medians are equal. If we repeat the table once again and add the rank sums we get: must equal 10 252y0242 5/07/02 Row 1 2 3 4 5 6 7 8 x1 792 790 797 803 811 791 801 791 r1 5.0 2.0 6.0 8.5 13.5 3.5 7.0 3.5 49.0 x2 r2 809 818 803 781 813 808 805 811 12.0 17.0 8.5 1.0 15.5 11.0 10.0 13.5 88.5 x3 r3 826 813 854 843 846 847 835 872 18.0 15.5 23.0 20.0 21.0 22.0 19.0 24.0 162.5 Sums of ranks are given above. To check the ranking, note that the sum of the three rank sums is 49.5 + 88.5 + 162.5 = 300, that the total number of items is 24 and that the sum of the first n numbers is nn 1 24 25 300 . Now, compute the Kruskal-Wallis statistic 2 2 2 2 2 12 SRi 2 3n 1 12 49 .0 88 .5 162 .5 325 H 8 8 nn 1 i ni 24 25 8 1 36639 .5 75 16 .56875 . 
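If you want to verify the ranking and the test statistics by machine, a short Python sketch (optional; scipy's rankdata reproduces the tied ranks automatically) gives the same numbers:

```python
from scipy.stats import rankdata, binom

x1 = [792, 790, 797, 803, 811, 791, 801, 791]
x2 = [809, 818, 803, 781, 813, 808, 805, 811]
x3 = [826, 813, 854, 843, 846, 847, 835, 872]

# a) Wilcoxon-Mann-Whitney rank sums for x2 versus x3 (rank the 16 values together).
r = rankdata(x2 + x3)
T2, T3 = float(r[:8].sum()), float(r[8:].sum())
print(T2, T3)                      # 37.5 98.5 -- 37.5 is outside [52, 84], reject H0

# b) Kruskal-Wallis statistic from the rank sums over all 24 observations
#    (computed without the tie correction, as in the hand calculation).
R = rankdata(x1 + x2 + x3)
SR = [float(R[:8].sum()), float(R[8:16].sum()), float(R[16:].sum())]
N, n_i = 24, 8
H = 12 / (N * (N + 1)) * sum(s ** 2 / n_i for s in SR) - 3 * (N + 1)
print(SR, round(H, 3))             # [49.0, 88.5, 162.5] 16.599 -- above 5.9915 (2 d.f.)

# c) Sign test of median 804 for x2: 2 of the 8 values fall below 804.
pval = 2 * float(binom.cdf(2, 8, 0.5))
print(round(pval, 5))              # 0.28906 -- above .05, do not reject H0
```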
If we try to look up this result in the (8, 8, 8) section of the Kruskal-Wallis 50 8 table (Table 9) , we find that the problem is to large for the table. Thus we must use the chi-squared table with 2 degrees of freedom. Since .2052 5.9915 reject H 0 . c) We repeat the column with the alleged median next to it. H 0 : 2 804 and d x 2 . r is the rank of the absolute values of d and r * is the rank with signs and corrections for ties. d Row r r* x2 d 1 2 3 4 5 6 7 8 809 818 803 781 813 808 805 811 804 804 804 804 804 804 804 804 5 14 -1 -23 9 4 1 7 5 14 1 23 9 4 1 7 4 7 1 8 6 3 2 5 36 4+ 7+ 1.586+ 3+ 1.5+ 5+ 89 36 . 2 The sum of the + numbers T 26 .5 , while T 9.5 . If we check these against Table 7 for n 8 , we find that the smaller of the two numbers must be 4 or below for a rejection in a 2-sided 5% test. Since 9.5 is above 4, we do not reject the null hypothesis. If, instead, we do the simpler and less powerful sign test, we simply look at how many numbers (2) are above 804 or how many numbers (6) are above 204. Since this is a 2-sided test, the p-value is found by checking the binomial table with p .5 . 2Px 2 2.14453 ..28906 or 2Px 6 21 Px 5 21 .85547 .28906 . Because this p-value is above our confidence level, we do not reject the null hypothesis. Our check this time is that the sum of the ranks is the sum of the numbers 1 through 8, which is 11 252y0242 5/07/02 4. (Pelosi and Sandifer) A survey on student drinking revealed the following: Residence Nonbinge Infrequent Frequent Total Drinker Binge Drinker Binge Drinker On Campus 35 29 47 111 Off Campus 49 31 24 104 Total 84 60 71 215 a. Test the hypothesis that the proportion in each of the three drinking categories is the same regardless of where a student lives. (7) b. Test the hypothesis that the proportion of infrequent binge drinkers is higher off campus than on campus. (4) c. The researcher believes that, nationwide, the proportion of frequent binge drinkers is 30%. 
Test to see if the proportion on the campus profiled above is higher. (3) d. Find a p-value for the result in c (2) Solution: DF r 1c 1 12 2 H 0 : Homogeneousor p1 p 2 p 3 H 1 : Not homogeneousNot all ps are equal O On Off Total 1 35 49 84 2 3 29 47 31 24 60 71 Total pr 111 .516279 104 215 .483721 1.000000 .2052 5.9915 E 1 2 On 43 .3674 30 .9767 Oft 40 .6326 29 .0233 Total 84 .0000 60 .0000 3 Total pr 36 .6558 111 .000 .516279 34 .3442 104 .000 .483721 71 .0000 215 .000 1.000000 The proportions in rows, p r , are used with column totals to get the items in E . Note that row and column sums in E are the same as in O . (Note that 2 9.63298 224.6329 215 is computed two different ways here - only one way is needed.) O2 O E 2 E O2 E O Row O E E E 1 35 43.3674 8.3674 70.013 1.61442 28.2470 2 29 30.9767 1.9767 3.907 0.12614 27.1494 3 47 36.6558 -10.3442 107.002 2.91911 60.2633 4 49 40.6326 -8.3674 70.013 1.72308 59.0905 5 31 29.0233 -1.9767 3.907 0.13463 33.1113 6 24 34.3442 10.3442 107.002 3.11559 16.7714 215 215.000 0.0000 9.63298 224.6329 Since the 2 computed here is greater than the 2 from the table, we reject H 0 . 31 29 .29808 , n2 104 . .26162 , n1 111 and p2 104 111 Confidence Hypotheses Test Ratio Critical Value Interval pcv p0 z 2 p p p 0 p p z 2 sp H 0 : p p0 z If p0 0 p H 1 : p p0 p p1 p2 p0 q 0 1 n 1 n If p 0 p p 0 p 01 p 02 p1q1 p2 q 2 s p p01q 01 p02 q 02 n p n2 p2 p n1 n2 or p 0 0 p 1 1 n n b) We are comparing p1 Interval for Difference between proportions q 1 p 1 1 Or use s p 2 0 2 n1 n 2 12 252y0242 5/07/02 sp p1q1 p2q2 .26126 .73874 .29808 .70192 .0017388 .0020118 .00374704 .0612419 n1 n2 111 104 29 31 n p n2 p2 111.26162 104 .29808 1 1 .27907 , 111 104 n1 n2 111 104 1.645 . Note that q 1 p and that q and p are between 0 and 1. p p1 p2 .036816 , p0 .05, z z.05 p p0q0 1 n1 1 n3 .27907 .72093 1111 1104 .0037470 .061213 H0 : p 0 H 0 : p1 p2 H 0 : p1 p2 0 Our hypotheses are or or H1 : p 0 H1 : p1 p2 H1 : p1 p2 0 There are three ways to do this problem. 
Only one is needed p p0 .036816 0 0.6015 Make a Diagram showing a 'reject' (i) Test Ratio: z p .061231 region below -1.645. Since -0.6015 is above this value, do not reject H 0 . (ii) Critical Value: pcv p0 z p becomes pcv p0 z p 2 0 1.645 .061231 .10069 . Make a Diagram showing a 'reject' region below - 0.10069. Since p .036816is not below this value, do not reject H 0 . (iii) Confidence Interval:: p p z s p becomes p p z sp 2 .036816 1.645 .0612419 0.0639 . Since not reject H 0 . c) From the formula table we have: Interval for Confidence Hypotheses Interval Proportion p p z 2 s p H 0 : p p0 H1 : p p0 pq n q 1 p sp p .0639 does not contradict p 0 , do Test Ratio z p p0 p Critical Value pcv p0 z 2 p p0 q0 n q0 1 p0 p H1 : p .30 . It is an alternate hypothesis because it does not contain an equality. The null hypothesis is thus H 0 : p .30. The problem says that .05 , n 215 , x 71 . so that p0 q0 .33023 .66977 x 71 .03207 . This is a one-sided test and .33023 . p n 215 n 215 z z.01 2.327 . This problem can be done in one of three ways. p (i) The test ratio is z p p0 p .33023 .30 .03207 .9426 . Make a diagram of a normal curve with a mean at zero and a reject zone above z z.05 1.645 . Since z 0.9464 is not in the 'reject' zone, do not reject H 0 . We cannot say that the proportion of binge drinkers significantly above 30%. (ii) Since the alternative hypothesis says p .30, we need a critical value that is above .30. We use pcv p0 z p .30 1.645.03207 .3528. Make a diagram of a normal curve with a mean at .30 and a reject zone above .3528. Since p .33023 is not in the 'reject' zone, do not reject H 0 . We cannot say that the proportion is significantly above 30%. 13 252y0242 5/07/02 pq . To make the 2-sided confidence interval, n p p z 2 s p , into a 1-sided interval, go in the same direction as H1 : p .30 . We get (iii) To do a confidence interval we need s p p p z s p . This is not a good use of our time here., but should not contradict the null hypothesis. 
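The chi-squared and proportion tests in this problem can be checked with a short script (optional; scipy's chi2_contingency builds the expected table for you):

```python
from math import sqrt
from scipy import stats

# Survey counts: rows = on campus / off campus; columns = nonbinge,
# infrequent binge, frequent binge.
observed = [[35, 29, 47],
            [49, 31, 24]]

# a) Chi-squared test of homogeneity.
chi2, p, df, expected = stats.chi2_contingency(observed)
print(round(chi2, 3), df, round(p, 4))   # 9.633 2 0.0081 -- reject homogeneity at 5%

# b) Two-proportion z test: infrequent binge drinkers, on campus vs. off campus.
x1, n1, x2, n2 = 29, 111, 31, 104
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z_b = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print(round(z_b, 4))                     # -0.6014 -- not below -1.645, do not reject

# c, d) One-proportion z test of H1: p > .30 for frequent binge drinkers on campus.
#       The estimated standard deviation is used here, as in the key's 0.9426;
#       the formula-table version sqrt(p0*q0/n) gives about 0.967 instead,
#       with the same conclusion.
x, n, p0 = 71, 215, 0.30
p_hat = x / n
z_c = (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)
pval = float(1 - stats.norm.cdf(z_c))
print(round(z_c, 4), round(pval, 4))     # 0.9426 0.1729 -- do not reject H0
```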
d) In this case the p-value would be (from (i) above) Pz .9426 Pz 0.94 .5 .3264 .1736 . Of course, you could answer c) by observing that this is above .05 . 14 252y0242 5/07/02 5. A fast food corporation wishes to predict its mean weekly sales as a function of weekly traffic flow on the street where the restaurant is and the city in which it is located. In the first version of the study, the data is as below. y is 'sales' in thousands, x1 is 'flow', traffic flow in thousands of cars per week, x 2 is 1 if the store is in city 2, zero otherwise. (Use .01) . y Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 6.4 6.7 7.7 2.9 9.5 6.0 6.2 5.0 3.5 8.4 5.2 3.9 5.5 4.1 3.2 5.4 x1 x2 59.3 60.3 82.1 32.3 98.0 54.1 54.4 51.4 36.7 75.9 48.4 41.5 52.6 41.1 29.6 49.5 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 The following data is computed for you y 89.6000, x 867.200, x 2 7.0, n 16, x 52023.3, x ?, y 554.760, x y 5358.62, x y ?, x x 338.60. 1 2 1 2 2 2 1 2 1 2 You do not need all of these on this page. a. Compute a simple regression of sales against flow. (7) b. Given your equation, what sales do you expect when the flow is 60.00? (1) c. Compute R 2 (4) d. Compute s e (3) e. Compute s b1 ( the std deviation of the slope) and do a significance test for 1 .(3) f. Do a prediction interval for sales when the flow is 60. (3) Solution: a) Spare Parts Computation: x1 867 .2 x1 54 .2 n 16 y SSx1 b1 Sx1 y Sx1 y SSx1 x nx12 53023 .3 16 54 .22 x y nx y 5358 .62 1654.25.6 1 1 502 .3 SSy y 2 ny 554 .76 16 5.62 2 53 .00 x1 y nx1 y 2 1 2 1 5021 .07 89 .6 5.6 n 16 Not that SSx1 and SSy cannot be negative! y x nx12 502 .3 0.1000 5021 .07 b0 y b1x1 5.6 0.1000 54.2 0.1800 Yˆ b0 b1 x1 becomes Yˆ 0.18 0.100 x1 . b) If x 60, Yˆ 0.18 0.10060 6.18 1 SSR 50 .23 x y nx y 0.100 502 .3 50.23 R SST 0.948 or 53 .00 x y nx y Sx y 502 .3 0.9481 SSx SSy x nx y ny 5021 .07 53 .00 c) SSR b1Sx1 y b1 2 1 2 2 R 2 1 1 1 1 2 1 2 1 2 1 2 2 ( 0 R 2 1 always!) 
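The simple-regression hand calculation in a) through c) can be reproduced from the raw data (an optional numpy sketch; small differences from the printed values reflect rounding b1 to 0.100 in the hand computation):

```python
import numpy as np

y  = np.array([6.4, 6.7, 7.7, 2.9, 9.5, 6.0, 6.2, 5.0,
               3.5, 8.4, 5.2, 3.9, 5.5, 4.1, 3.2, 5.4])
x1 = np.array([59.3, 60.3, 82.1, 32.3, 98.0, 54.1, 54.4, 51.4,
               36.7, 75.9, 48.4, 41.5, 52.6, 41.1, 29.6, 49.5])
n = len(y)

# The "spare parts": sums of squares and cross products about the means.
Sxy = float(np.sum(x1 * y) - n * x1.mean() * y.mean())   # 502.30
SSx = float(np.sum(x1 ** 2) - n * x1.mean() ** 2)        # 5021.06
SSy = float(np.sum(y ** 2) - n * y.mean() ** 2)          # 53.00

b1 = Sxy / SSx                                  # slope
b0 = float(y.mean()) - b1 * float(x1.mean())    # intercept
r2 = b1 * Sxy / SSy                             # coefficient of determination
print(round(b1, 4), round(b0, 2), round(r2, 3), round(b0 + 60 * b1, 2))
# 0.1 0.18 0.948 6.18

# Standard error of estimate and the t ratio for the slope.
se = ((SSy - b1 * Sxy) / (n - 2)) ** 0.5
t_b1 = b1 / (se / SSx ** 0.5)
print(round(se, 3), round(t_b1, 1))             # the slope's t is about 16
```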
d) SSE = SST − SSR = 53.00 − 50.23 = 2.77. se² = SSE/(n − 2) = 2.77/14 = 0.1979, so se = √0.1979 = 0.4448. (se² is always positive!)
e) sb1² = se²/(Σx1² − n·x̄1²) = se²/SSx1 = 0.1979/5021.07 = 0.00003941, so sb1 = √.00003941 = .00628. The usual significance test is H0: β1 = 0 against H1: β1 ≠ 0. So t = b1/sb1 = 0.1000/.00628 = 15.92. Make a diagram. We accept the null hypothesis if our t ratio is between −t(14, .005) = −2.977 and t(14, .005) = 2.977. Since 15.92 is not between these numbers, reject the null hypothesis and say that the slope is significant. Note that the same people who found sb1 instead of sb0 on the last exam found sb0 instead of sb1 on this one, proving that it is easier to copy than to think.
f) We have already found that if x10 = 60, Ŷ = 0.18 + 0.100(60) = 6.18. From the regression formula outline the Prediction Interval is Y0 = Ŷ0 ± t·sY, where
sY² = se²[1 + 1/n + (x0 − x̄1)²/SSx1] = 0.1979[1 + 1/16 + (60 − 54.2)²/5021.07] = 0.1979[1 + 0.0625 + 33.64/5021.07] = 0.1979(1.069) = 0.2116.
So sY = √0.2116 = 0.4600 and Y0 = Ŷ0 ± t·sY = 6.18 ± 2.977(0.4600) = 6.18 ± 1.37. Note that the same people who found a prediction interval instead of a confidence interval on the last exam found a confidence interval instead of a prediction interval on this one, proving that it is easier to copy than to think.

6. Data from the previous problem is repeated below. y is 'sales' in thousands, x1 is 'flow', traffic flow in thousands of cars per week, x2 is 1 if the store is in city 2, zero otherwise. (Use α = .01.)

Row    y     x1    x2
 1    6.4   59.3   0
 2    6.7   60.3   0
 3    7.7   82.1   0
 4    2.9   32.3   0
 5    9.5   98.0   0
 6    6.0   54.1   0
 7    6.2   54.4   0
 8    5.0   51.4   0
 9    3.5   36.7   0
10    8.4   75.9   1
11    5.2   48.4   1
12    3.9   41.5   1
13    5.5   52.6   1
14    4.1   41.1   1
15    3.2   29.6   1
16    5.4   49.5   1

The following data is computed for you: Σy = 89.6000, Σx1 = 867.200, Σx2 = 7.0, n = 16, Σx1² = 52023.3, Σx2² = ?, Σy² = 554.760, Σx1y = 5358.62, Σx2y = ?, Σx1x2 = 338.60.
a. Do a multiple regression of sales against x1 and x2. (12)
b. Compute R² and R² adjusted for degrees of freedom for both this and the previous problem. Compare the values of R² adjusted between this and the previous problem. Use an F test to compare R² here with the R² from the previous problem. What does your F-test suggest about the significance of the coefficient of x2? (5)
c. Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)
d. Use your regression to predict sales in city 2 when flow is 60.00. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)

Solution: You should be able to compute all the sums below. The only ones that you were asked to compute here are Σx2², which was identical to Σx2, Σx2y, which was mostly zeroes, and Σx1x2.

Row    y     x1    x2    x1²       x2²   y²      x1y      x2y    x1x2
 1    6.4   59.3   0    3516.49    0    40.96    379.52    0.0     0.0
 2    6.7   60.3   0    3636.09    0    44.89    404.01    0.0     0.0
 3    7.7   82.1   0    6740.41    0    59.29    632.17    0.0     0.0
 4    2.9   32.3   0    1043.29    0     8.41     93.67    0.0     0.0
 5    9.5   98.0   0    9604.00    0    90.25    931.00    0.0     0.0
 6    6.0   54.1   0    2926.81    0    36.00    324.60    0.0     0.0
 7    6.2   54.4   0    2959.36    0    38.44    337.28    0.0     0.0
 8    5.0   51.4   0    2641.96    0    25.00    257.00    0.0     0.0
 9    3.5   36.7   0    1346.89    0    12.25    128.45    0.0     0.0
10    8.4   75.9   1    5760.81    1    70.56    637.56    8.4    75.9
11    5.2   48.4   1    2342.56    1    27.04    251.68    5.2    48.4
12    3.9   41.5   1    1722.25    1    15.21    161.85    3.9    41.5
13    5.5   52.6   1    2766.76    1    30.25    289.30    5.5    52.6
14    4.1   41.1   1    1689.21    1    16.81    168.51    4.1    41.1
15    3.2   29.6   1     876.16    1    10.24     94.72    3.2    29.6
16    5.4   49.5   1    2450.25    1    29.16    267.30    5.4    49.5
Sum  89.6  867.2   7   52023.3     7   554.76   5358.62   35.7   338.6

So Σy = 89.6000, Σx1 = 867.200, Σx2 = 7.0, n = 16, Σx1² = 52023.3, Σx2² = 7, Σy² = 554.760, Σx1y = 5358.62, Σx2y = 35.7, Σx1x2 = 338.60. Of course, many of you decided that, since Σx2 = 7, Σx2² = 49 - after a whole year of statistics, too.

a) First, we compute or copy from the last problem ȳ = Σy/n = 89.6/16 = 5.60, x̄1 = Σx1/n = 867.2/16 = 54.2 and x̄2 = Σx2/n = 7/16 = 0.4375.
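The sums in the table above are easy to check mechanically. A minimal sketch (mine, with my own variable names) that reproduces the three sums you were asked to compute:

```python
# Problem 6: rebuild the sums involving the dummy x2 from the 16 raw rows.
y = [6.4, 6.7, 7.7, 2.9, 9.5, 6.0, 6.2, 5.0, 3.5,
     8.4, 5.2, 3.9, 5.5, 4.1, 3.2, 5.4]
x1 = [59.3, 60.3, 82.1, 32.3, 98.0, 54.1, 54.4, 51.4, 36.7,
      75.9, 48.4, 41.5, 52.6, 41.1, 29.6, 49.5]
x2 = [0] * 9 + [1] * 7            # dummy: 1 if the store is in city 2

sum_x2sq = sum(v * v for v in x2)               # 7: squaring 0s and 1s changes nothing
sum_x2y = sum(a * b for a, b in zip(x2, y))     # 35.7: just the y values in city 2
sum_x1x2 = sum(a * b for a, b in zip(x1, x2))   # 338.6
print(sum_x2sq, sum_x2y, sum_x1x2)
```

The comments make the point the text makes: for a 0-1 dummy, Σx2² equals Σx2, and Σx2y is simply the sum of the sales figures in city 2.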
Then, we compute or copy our spare parts:
SSy = Σy² − n·ȳ² = 554.76 − 16(5.6)² = 53.000*
Sx1y = Σx1y − n·x̄1·ȳ = 5358.62 − 16(54.2)(5.6) = 502.30
Sx2y = Σx2y − n·x̄2·ȳ = 35.7 − 16(0.4375)(5.6) = −3.50
SSx1 = Σx1² − n·x̄1² = 52023.3 − 16(54.2)² = 5021.07*
SSx2 = Σx2² − n·x̄2² = 7 − 16(0.4375)² = 3.9375*
Sx1x2 = Σx1x2 − n·x̄1·x̄2 = 338.60 − 16(54.2)(0.4375) = −40.8
* indicates quantities that must be positive. (Note that some of these were computed for the last problem. Can you believe that some people copied the '*' from last year's exam?)

Then we substitute these numbers into the Simplified Normal Equations:
ΣX1Y − nX̄1Ȳ = b1(ΣX1² − nX̄1²) + b2(ΣX1X2 − nX̄1X̄2)
ΣX2Y − nX̄2Ȳ = b1(ΣX1X2 − nX̄1X̄2) + b2(ΣX2² − nX̄2²),
which are
 502.30 = 5021.07 b1 − 40.8 b2
 −3.50 = −40.8 b1 + 3.9375 b2
and solve them as two equations in two unknowns for b1 and b2. These are a fairly tough pair of equations to solve. The choices are, essentially, to multiply the second equation by 40.8/3.9375 = 10.3619 to eliminate b2, or to multiply it by 5021.07/40.8 = 123.06544 to eliminate b1. Let's try the first. The equations become
 502.30 = 5021.07 b1 − 40.8 b2
 −36.267 = −422.766 b1 + 40.8 b2
If we add these together, we get 466.033 = 4598.304 b1. This means that b1 = 466.033/4598.304 = 0.10135. The first of the two normal equations can now have our new value substituted into it to get 502.3 = 5021.07(0.10135) − 40.8 b2 or 6.5854 = 40.8 b2. If we solve this for b2, we get b2 = 0.1614. Finally we get b0 by solving b0 = Ȳ − b1X̄1 − b2X̄2 = 5.6 − 0.10135(54.2) − 0.1614(0.4375) = 0.0362. Thus our equation is Ŷ = b0 + b1X1 + b2X2 = 0.0362 + 0.10135X1 + 0.1614X2.

b) In the previous problem we had SSR = b1·Sx1y = 0.100(502.3) = 50.23 and R² = SSR/SST = 50.23/53.00 = 0.948. In this problem SSR = b1·Sx1y + b2·Sx2y = 0.10135(502.3) + 0.1614(−3.5) = 50.34 and R² = SSR/SST = 50.34/53.00 = 0.950. If we use R̄², which is R² adjusted for degrees of freedom, we get, for the first regression R̄² = [(n − 1)R² − k]/(n − k − 1) = [15(0.948) − 1]/14 = .944, and for the second R̄² = [15(0.950) − 2]/13 = .942. This is evidence that the second independent variable didn't help.

A better way of doing this is to look at (from the outline)
F(r, n−k−r−1) = [(R²(k+r) − R²(k))/r] / [(1 − R²(k+r))/(n − k − r − 1)], where k = 1, r = 1 and n is still 16.
F(1,13) = [(.950 − .948)/1]/[(1 − .950)/13] = 0.52. If we check the F table, F.01(1,13) = 9.07. Our null hypothesis is essentially that x2 doesn't help, and we cannot reject it.

c) The same thing can be done using ANOVA. The ANOVA table is, for the first regression:
Source    SS*     DF*   MS*     F*      F.01
X1        50.23    1    50.23   255s    8.86
Error      2.77   14     0.197
Total     53.00   15
The ANOVA table is, for the second regression:
Source    SS*     DF*   MS*     F*      F.01
X1, X2    50.34    2    25.16   123s    6.70
Error      2.66   13     0.205
Total     53.00   15
Both of these show that the Xs have a significant relationship to Y. However, when we combine them, the story is not as positive.
Source    SS*     DF*   MS*     F*       F.01
X1        50.23    1    50.23   245s     9.07
X2         0.11    1     0.11   0.53ns   9.07
Error      2.66   13     0.205
Total     53.00   15
Since our computed F is smaller than the table F, we do not reject our null hypothesis that X2 has no effect.

d) Our regression is Ŷ = b0 + b1X1 + b2X2 = 0.0362 + 0.10135X1 + 0.1614X2. If X1 = 60 and X2 = 1, we have Ŷ0 = 0.0362 + 0.10135(60) + 0.1614(1) = 6.2786.

e) From the ANOVA table, se = √0.205 = 0.4528. Since k = 2, t(n−k−1, α/2) = t(13, .005) = 3.012. The outline says that an approximate confidence interval is Y0 = Ŷ0 ± t·se/√n = 6.28 ± 3.012(0.4528)/√16 = 6.28 ± 0.34 and an approximate prediction interval is Y0 = Ŷ0 ± t·se = 6.28 ± 3.012(0.4528) = 6.28 ± 1.36.

7. The regression in the previous problem was run again, using data from four cities. Remember, y is 'sales' in thousands, x1 is 'flow', traffic flow in thousands of cars per week. (Use α = .05.) First it was run in the form Y = b0 + b1X1 with the following results.
The regression equation is
sales = 0.010 + 0.109 flow

Predictor    Coef       Stdev      t-ratio    p
Constant     0.0104     0.3583      0.03      0.977
flow         0.108570   0.006077   17.87      0.000

s = 0.5947    R-sq = 93.6%    R-sq(adj) = 93.3%

Analysis of Variance
SOURCE       DF    SS       MS       F        p
Regression    1    112.87   112.87   319.17   0.000
Error        22      7.78     0.35
Total        23    120.65

Then it was run again in the form Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 with the following results:

The regression equation is
sales = - 0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4

Predictor    Coef       Stdev      t-ratio    p
Constant    -0.1782     0.2941     -0.61      0.552
flow         0.105002   0.004475   23.47      0.000
city2        0.1991     0.2049      0.97      0.343
city3        0.6751     0.2745      2.46      0.024
city4        1.1717     0.2245      5.22      0.000

s = 0.3960    R-sq = 97.5%    R-sq(adj) = 97.0%

Analysis of Variance
SOURCE       DF    SS        MS       F        p
Regression    4    117.674   29.418   187.61   0.000
Error        19      2.979    0.157
Total        23    120.653

SOURCE       DF    SEQ SS
flow          1    112.873
city2         1      0.274
city3         1      0.254
city4         1      4.272

a) What does the ANOVA show? (2)
b) Do an F test to show if location (adding x2 'city2', x3 'city3', and x4 'city4' all at once) improves our explanation of weekly sales. (4)
c) We have added dummy variables for cities 2, 3 and 4. Why didn't we add one for city 1? (1)
d) What is the sales predicted for a flow of 60 in city 3? What does it mean to say that the coefficient of 'city3' is .6751? (2)
e) Explain how the model would be modified to show interaction between city and traffic flow. (2)
f) An ANOVA was run to determine if management style affected the number of sick days taken by employees. The research was done using 3 different management styles in five separate departments. The dependent variable was the number of sick days taken by each employee. The Minitab output follows (Pelosi and Sandifer):

Source        DF    SS        MS
Department     4    208.187   52.047
Mgt. Style     2    101.440   50.720
Interaction    8     44.293    5.537
Error         60     42.000    0.700
Total         74    395.920

Solution: a) The ANOVAs both show the same thing.
You can test these using values from the F table if you wish, but it is enough to say that the p-values of zero should lead us to reject the null hypothesis that there is no relationship between Y and the Xs.
b) The easiest way is to use the F test F(r, n−k−r−1) = [(R²(k+r) − R²(k))/r] / [(1 − R²(k+r))/(n − k − r − 1)]. Since the total degrees of freedom are n − 1, n = 24. r = 3 is the number of variables added. k = 1 is the number of independent variables we started with. The original R² was .936 and it grew to .975. This time R-squared adjusted grew, which is a good sign. F(3,19) = [(.975 − .936)/3]/[(1 − .975)/19] = 9.88. The table says that F.05(3,19) = 3.13, so, taken as a whole, the dummy variables seem to be beneficial.
c) You cannot add a variable that is a linear combination of others. If x5 were added to represent 'city 1', it would be equal to 1 − x2 − x3 − x4 and would make computation impossible.
d) In city 3, sales would be sales = −0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4 = −0.178 + 0.105(60) + 0.199(0) + 0.675(1) + 1.17(0) = 6.797. The .6751 coefficient tells us that if we compare locations with the same traffic in cities 1 and 3, the location in city 3 will have sales .6751 higher.
e) We could modify the model for interaction by adding 3 new variables: X5 = X1X2, X6 = X1X3 and X7 = X1X4.
f) Finish the Minitab table and explain what it shows. In particular, citing numbers in the table or from the F table, does management style make a difference in the number of sick days that employees take and does what department management style is changed in seem to have an effect? (5)

Source        DF    SS        MS       F         F.05
Department     4    208.187   52.047   74.353s   F.05(4,60) = 2.53
Mgt. Style     2    101.440   50.720   72.457s   F.05(2,60) = 3.15
Interaction    8     44.293    5.537    7.910s   F.05(8,60) = 2.10
Error         60     42.000    0.700
Total         74    395.920

This is your basic Minitab printout for 2-way ANOVA.
To finish it, divide the MSs by the Error (Within) mean square (0.700) to get the values of F, look up the corresponding values of F on the table and declare those lines that have Fs that are larger than the table F to imply a rejection of a null hypothesis. In this case all the F's that we computed are larger than the table Fs, so we conclude (i) that department affects the number of sick days, (ii) that management style affects the number of sick days and (iii) that changes of management style have different effects in different departments.

8. Extra Credit - Questions on correlation. Go back to problem 7. Use the R-sq in the first regression to find the correlation between sales and traffic flow. Use the same significance level that you used on that problem.
a. Test the correlation between sales and flow for significance. (3)
b. Test the hypothesis that the correlation between sales and traffic flow is .9. (4)
c. Compute the partial correlation between sales and 'city4', rY4.123. (2)
d. It's no secret that not all the coefficients of the second regression in problem 7 were very significant. I checked for (multi)collinearity by doing the following Minitab command:

MTB > corr c2 c3 c4 c5

Correlations (Pearson)
         flow     city2    city3
city2   -0.228
city3   -0.256   -0.243
city4    0.313   -0.329   -0.194

These results were also printed out as:

Matrix CORR1
          flow      city2     city3     city4
flow    1.00000  -0.22820  -0.25624   0.31345
city2  -0.22820   1.00000  -0.24254  -0.32918
city3  -0.25624  -0.24254   1.00000  -0.19389
city4   0.31345  -0.32918  -0.19389   1.00000

Explain what collinearity is and whether it is likely that collinearity influenced my results. (3)
e. Aczel reports the following regression results:

MTB > REGRESS 'export' on 4 'm1' 'lend' 'price' 'exch';
SUBC > DW.
………… (Most of output omitted)
Durbin-Watson statistic = 2.58

If n = 67, explain, telling your significance level, what we ought to conclude from this printout. (3)

Solution: a) Since R² = .936, r = √.936 = .9675.
If we want to test H0: ρxy = 0 against H1: ρxy ≠ 0 and x and y are normally distributed, we use
t(n−2) = r/sr = r√(n−2)/√(1−r²) = .9675√(24−2)/√(1−.936) = 17.9458.
Compare this with ±t(22, .025) = ±2.074. Since 17.9458 does not lie between these two values, reject the null hypothesis.
b) H0: ρxy = .9. If we are testing H0: ρxy = ρ0 against H1: ρxy ≠ ρ0, and ρ0 ≠ 0, we use Fisher's z-transformation. Let z̃ = ½ln[(1 + r)/(1 − r)] = ½ln[(1 + .9675)/(1 − .9675)] = ½ln(60.538) = 2.05164. This has an approximate mean of μz = ½ln[(1 + ρ0)/(1 − ρ0)] = ½ln[(1 + .9)/(1 − .9)] = ½ln(19) = 1.47222 and a standard deviation of sz = √[1/(n − 3)] = √(1/21) = 0.218218, so that t = (z̃ − μz)/sz = (2.05164 − 1.47222)/0.218218 = 2.655. Compare this with ±t(22, .025) = ±2.074. Since 2.655 does not lie between these two values, reject the null hypothesis.
c) The example given in the outline is from the computer printout, rY3.12² = t3²/(t3² + df), where df = n − k − 1 and k is the number of independent variables. The printout says

The regression equation is
sales = - 0.178 + 0.105 flow + 0.199 city2 + 0.675 city3 + 1.17 city4

Predictor    Coef       Stdev      t-ratio    p
Constant    -0.1782     0.2941     -0.61      0.552
flow         0.105002   0.004475   23.47      0.000
city2        0.1991     0.2049      0.97      0.343
city3        0.6751     0.2745      2.46      0.024
city4        1.1717     0.2245      5.22      0.000

so rY4.123² = (5.22)²/[(5.22)² + 19] = .589, and rY4.123 = √.589 = .768.
d) Collinearity is a condition that occurs when highly correlated independent variables appear in a regression. The result is large standard deviations and low values of t. Though we do have some insignificant coefficients, the cause is unlikely to be collinearity, since the correlations are not high.
e) This is a Durbin-Watson test and we are given the Durbin-Watson statistic from the printout. Use a Durbin-Watson table with n = 67 and k = 4 to fill in the diagram below.

0 —(+)— dL —(?)— dU —(0)— 2 —(0)— 4−dU —(?)— 4−dL —(−)— 4

Here + marks the region indicating positive autocorrelation, 0 no autocorrelation, − negative autocorrelation, and ? the inconclusive zones. If you used the 5% table, you got dL = 1.48 and dU = 1.73. If you used the 1% table, you got dL = 1.32 and dU = 1.57.
The printout gives DW = 2.58, which lies above 2, so the question is about negative autocorrelation. At the 5% level, 2.58 is above 4 − dL = 2.52, indicating negative autocorrelation; at the 1% level, 2.58 falls between 4 − dU = 2.43 and 4 − dL = 2.68, so the test is inconclusive.
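The R²-comparison F test used in problems 6b) and 7b) follows a single formula, so it is worth wrapping up. A sketch in Python (the helper name is mine):

```python
# F test for whether r added regressors improve a model that had k regressors:
# F(r, n-k-r-1) = [(R2_new - R2_old)/r] / [(1 - R2_new)/(n - k - r - 1)]
def f_added_vars(r2_old, r2_new, n, k, r):
    """R-squared comparison F statistic from the outline."""
    return ((r2_new - r2_old) / r) / ((1.0 - r2_new) / (n - k - r - 1))

f_city2 = f_added_vars(0.948, 0.950, n=16, k=1, r=1)   # about 0.52: x2 did not help
f_cities = f_added_vars(0.936, 0.975, n=24, k=1, r=3)  # about 9.88: the dummies helped
print(round(f_city2, 2), round(f_cities, 2))
```

Each value is then compared with the table F with (r, n−k−r−1) degrees of freedom: 0.52 < F.01(1,13) = 9.07, while 9.88 > F.05(3,19) = 3.13, matching the conclusions above.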