252y0141 5/4/01
ECO252 QBA2    FINAL EXAM    May 1, 2001    Name: KEY    Hour of Class Registered (Circle)

Read "Things That You Should Never Do on a Statistics Exam (or Anywhere Else)!" before taking exams!

I. (16+ points) Do all of the following.

1. Hand in your fourth regression problem. (3 points)
Remember: $Y$ = company profit in millions of dollars; $X_1$ = CEO's yearly income in thousands of dollars ($X_1 = 1000$ means a million-dollar annual income); $X_2$ = percentage of stock owned by the CEO ($X_2 = 3$ means the CEO owns 3.0% of the stock). Use a significance level of 10% in this problem.

2. Answer the following questions.
a. For the regression of $Y$ against $X_1$ and $X_2$ only, what does the ANOVA tell us? Which of the coefficients are significant? What tells you this? (3)
b. Do an F test to show whether the addition of $X_2$ and $X_3$ improves the regression over your results with $X_1$ alone. (4)
c. Based on your regression of $Y$ against $X_1$, $X_2$, and $X_3$:
(i) What evidence is there that CEO income and stock percentage interact? (1)
(ii) What change does this equation predict for every one thousand dollars of CEO income when the CEO owns 3% of the company's stock? (3)
(iii) What profit does the equation predict for a firm where the CEO earns $1 million and owns 44% of the stock? What might this lead you to suspect about this equation? (2)
(iv) Based only on the adjusted R-squared and the significance of the coefficients, is there an equation that seems to work better than the equation with three independent variables? Why? (3)

Solution: a) The regression of $Y$ against $X_1$ and $X_2$ is probably useless. The ANOVA gives a p-value of 0.685 to the null hypothesis that there is no relation between $Y$ and the $X$s. This leads us to expect insignificant coefficients, and the high p-values, well above .10, confirm that. If you compute or copy the t-ratios and compare them against $t^{(9)}_{.05}$, you will find that they are in the 'accept' zone.

b) From my printout, if we regress $Y$ against $X_1$, $X_2$, and $X_3$, we find:

Analysis of Variance
SOURCE        DF        SS        MS     F      p
Regression     3  56926416  18975472  3.56  0.067
Error          8  42689452   5336182
Total         11  99615872

SOURCE  DF    SEQ SS
x1       1   2063334
x2       1   5962759
x3       1  48900324

Compare this to the original regression with $X_1$ only:

Analysis of Variance
SOURCE        DF        SS       MS     F      p
Regression     1   2063334  2063334  0.21  0.655
Error         10  97552536  9755254
Total         11  99615872

If we use either the sequential sums of squares in the top printout or the regression sum of squares in the bottom printout, we find that $X_1$ explains 2063334, leaving 56926416 - 2063334 = 54863082 for $X_2$ and $X_3$. If we combine these two we get:

SOURCE      DF        SS        MS     F    F.10
x1           1   2063334   2063334
x2 and x3    2  54863082  27431541  5.14   3.11
Error        8  42689452   5336182
Total       11  99615872

Since our F of 5.14 is larger than $F^{(2,8)}_{.10} = 3.11$, we reject the null hypothesis of no relation between $Y$ and $X_2$ and $X_3$.
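The partial F computation above can be verified in a few lines of Python; this is a minimal sketch using the sums of squares quoted from the printouts (the variable names are mine, not part of the exam):

```python
# Partial F test: does adding X2 and X3 improve on the model with X1 alone?
# Sums of squares are taken from the printouts quoted above.
ssr_full = 56926416      # regression SS with X1, X2, X3
ssr_reduced = 2063334    # regression SS with X1 alone
mse_full = 42689452 / 8  # error SS / error DF from the full model

extra_ss = ssr_full - ssr_reduced   # 54863082, explained by X2 and X3
f_stat = (extra_ss / 2) / mse_full  # divide by 2, the number of added variables

print(f"F = {f_stat:.2f}")          # about 5.14; compare with F(2,8) at the 10% level
```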
c) Our regression reads
$\hat{Y} = 1159 + 0.12198^{*} X_1 + 6.10 X_2 - 0.03534^{*} X_3$
         (982.7)  (0.04232)      (61.16)     (0.01167)
where $X_3 = X_1 X_2$, the numbers in parentheses are standard deviations of the coefficients, and asterisks indicate coefficients significant at the 10% level.

(i) One of the coefficients that is significant is that of $X_3$, the interaction term. This tells us that $X_1$ and $X_2$ interact.

(ii) If we substitute $X_2 = 3$ and $X_3 = X_1 X_2 = 3X_1$, our equation becomes $\hat{Y} = 1159 + 0.12198^{*} X_1 + 6.10(3) - 0.03534^{*}(3X_1)$, so that $X_1$ is multiplied by $0.12198 - 3(0.03534) = 0.0161$. This is the amount that $Y$ will rise every time $X_1$ goes up by one, that is, for every additional thousand dollars of CEO income.

(iii) Substitute 1000 for $X_1$ and 44 for $X_2$. You should get a value for $Y$ of about -5280. It is doubtful that either a million-dollar executive or high stock ownership by the CEO would produce a loss. We should realize that in our data million-dollar salaries and really high stock ownership do not appear together, so we might suspect that this equation has poor predictive powers.

(iv) Given the fact that the coefficient of $X_2$ is not significant, perhaps we ought to limit ourselves to the equation with $X_1$ and $X_3$ alone. It has a higher adjusted R-squared, and the coefficients of both the remaining $X$s are significant at the 10% level.

II. Do at least 4 of the following 7 problems (at least 15 points each), or do sections adding to at least 60 points; anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests.

1. (Black, p. 532) A researcher wishes to predict the price of a meal in New Orleans ($y$) on the basis of location ($x_1$, a dummy variable: 1 if the restaurant is in the French Quarter, 0 otherwise) and the probability of being seated on arrival ($x_2$). The data are below. (Use $\alpha = .10$.)

Row  price (y)  FQ (x1)  prob (x2)
 1      8.52       0        0.62
 2     21.45       1        0.43
 3     16.18       1        0.58
 4      6.21       0        0.74
 5     12.19       1        0.19
 6     25.62       1        0.49
 7     13.90       0        0.80
 8     18.66       1        0.75
 9      5.25       0        0.37
10      7.98       0        0.63

The following are given to help you; you do not need all of these.
$\sum y = 135.96$, $\sum y^2 = 2270.68$, $\sum x_1 = 5$, $\sum x_1^2 = 5$, $\sum x_2 = 5.6$, $\sum x_2^2 = 3.4658$, $\sum x_1 y = ?$, $\sum x_2 y = 75.4405$, $\sum x_1 x_2 = ?$, and $n = 10$.

a. Compute a simple regression of price against $x_1$. (7)
b. On the basis of this regression, what price do you expect to pay for a meal in the French Quarter? Outside the French Quarter? (2)
c. Compute $R^2$. (4)
d. Compute $s_e$. (3)
e. Compute $s_{b_0}$ (the standard deviation of the intercept) and do a confidence interval for $\beta_0$. (3)
f. Do a confidence interval for the price of a meal in the French Quarter. (3)

Solution: a) $\sum x_1 y = 94.1$. (See the computation of sums at the end of Problem 2.)
Spare-parts computation:
$\bar{x}_1 = \frac{\sum x_1}{n} = \frac{5}{10} = 0.5$ and $\bar{y} = \frac{\sum y}{n} = \frac{135.96}{10} = 13.596$
$SS_{x_1} = \sum x_1^2 - n\bar{x}_1^2 = 5 - 10(0.5)^2 = 2.50$
$S_{x_1 y} = \sum x_1 y - n\bar{x}_1\bar{y} = 94.1 - 10(0.5)(13.596) = 26.12$
$SS_y = \sum y^2 - n\bar{y}^2 = 2270.68 - 10(13.596)^2 = 422.166$
$b_1 = \frac{S_{x_1 y}}{SS_{x_1}} = \frac{26.12}{2.50} = 10.44$
$b_0 = \bar{y} - b_1\bar{x}_1 = 13.596 - 10.44(0.5) = 8.376$
so $\hat{Y} = b_0 + b_1 x_1$ becomes $\hat{Y} = 8.376 + 10.44 x_1$.

b) If $x_1 = 1$ (in the French Quarter), $\hat{Y} = 8.376 + 10.44(1) = 18.82$. If $x_1 = 0$ (outside the Quarter), $\hat{Y} = 8.376$, about 8.38.

c) $SSR = b_1 S_{x_1 y} = 10.44(26.12) = 272.69$, so
$R^2 = \frac{SSR}{SST} = \frac{272.69}{422.166} = 0.646$, or equivalently
$R^2 = \frac{S_{x_1 y}^2}{SS_{x_1} SS_y} = \frac{(26.12)^2}{(2.50)(422.166)} = 0.646$. ($0 \le R^2 \le 1$ always!)

d) $SSE = SST - SSR = 422.166 - 272.69 = 149.476$, so
$s_e^2 = \frac{SSE}{n-2} = \frac{149.476}{8} = 18.685$ and $s_e = \sqrt{18.685} = 4.323$. ($s_e^2$ is always positive!)

e) $s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{x}_1^2}{SS_{x_1}}\right) = 18.685\left(\frac{1}{10} + \frac{(0.5)^2}{2.50}\right) = 18.685(0.2) = 3.737$, so $s_{b_0} = 1.933$. With $t^{(n-2)}_{.05} = t^{(8)}_{.05} = 1.860$, the confidence interval is $\beta_0 = b_0 \pm t\,s_{b_0} = 8.376 \pm 1.860(1.933) = 8.38 \pm 3.60$.

f) We have already found that if $x_1 = 1$, $\hat{Y} = 18.82$. From the regression formula outline, the confidence interval is $Y_0 = \hat{Y}_0 \pm t\,s_{\hat{Y}}$, where
$s_{\hat{Y}}^2 = s_e^2\left(\frac{1}{n} + \frac{(x_{10} - \bar{x}_1)^2}{SS_{x_1}}\right) = 18.685\left(\frac{1}{10} + \frac{(1 - 0.5)^2}{2.50}\right) = 18.685(0.2) = 3.737$,
so $s_{\hat{Y}} = 1.933$ and $Y_0 = 18.82 \pm 1.860(1.933) = 18.82 \pm 3.60$.
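The "spare parts" arithmetic in a) through d) can be cross-checked from the raw data with a short Python sketch (variable names are mine):

```python
# Simple regression of price (y) on the French Quarter dummy (x1),
# using the same sums-of-squares "spare parts" as the hand solution.
y  = [8.52, 21.45, 16.18, 6.21, 12.19, 25.62, 13.90, 18.66, 5.25, 7.98]
x1 = [0, 1, 1, 0, 1, 1, 0, 1, 0, 0]
n = len(y)

ybar, x1bar = sum(y) / n, sum(x1) / n
ss_x1 = sum(v * v for v in x1) - n * x1bar**2                 # 2.50
s_x1y = sum(a * b for a, b in zip(x1, y)) - n * x1bar * ybar  # 26.12
ss_y  = sum(v * v for v in y) - n * ybar**2                   # 422.166

b1 = s_x1y / ss_x1                    # about 10.45 (10.44 with the key's rounding)
b0 = ybar - b1 * x1bar                # about 8.37
r_sq = (b1 * s_x1y) / ss_y            # about 0.646
se = ((ss_y - b1 * s_x1y) / (n - 2)) ** 0.5   # about 4.32
print(b0, b1, r_sq, se)
```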
2. Data from the previous problem are repeated below. (Use $\alpha = .10$.)

Row  price (y)  FQ (x1)  prob (x2)
 1      8.52       0        0.62
 2     21.45       1        0.43
 3     16.18       1        0.58
 4      6.21       0        0.74
 5     12.19       1        0.19
 6     25.62       1        0.49
 7     13.90       0        0.80
 8     18.66       1        0.75
 9      5.25       0        0.37
10      7.98       0        0.63

The following are given to help you; you do not need all of these.
$\sum y = 135.96$, $\sum y^2 = 2270.68$, $\sum x_1 = 5$, $\sum x_1^2 = 5$, $\sum x_2 = 5.6$, $\sum x_2^2 = 3.4658$, $\sum x_1 y = ?$, $\sum x_2 y = 75.4405$, $\sum x_1 x_2 = ?$, and $n = 10$.

a. Do a multiple regression of price against $x_1$ and $x_2$. (12)
b. Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of adjusted $R^2$ between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. (4)
c. Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)
d. Use your regression to predict the price of a meal in the French Quarter when the probability of being seated on arrival is 40%. (2)
e. Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)

Solution: Note: Deciding that, since $b_1 = \frac{S_{x_1 y}}{SS_{x_1}}$ in simple regression, it must be true that $b_2 = \frac{S_{x_2 y}}{SS_{x_2}}$ in multiple regression won't get you an ounce of credit on this type of problem.

a) First, we compute $\bar{y} = 13.596$, $\bar{x}_1 = 0.50$, and $\bar{x}_2 = \frac{5.6}{10} = 0.56$. Second, we compute or copy $\sum x_1 y = 94.1$, $\sum x_2 y = 75.4405$, $\sum y^2 = 2270.68$, $\sum x_1^2 = 5$, $\sum x_2^2 = 3.4658$, and $\sum x_1 x_2 = 2.44$. Third, we compute or copy our spare parts:
$SS_y = \sum y^2 - n\bar{y}^2 = 2270.68 - 10(13.596)^2 = 422.166$*
$S_{x_1 y} = \sum x_1 y - n\bar{x}_1\bar{y} = 94.1 - 10(0.5)(13.596) = 26.12$
$S_{x_2 y} = \sum x_2 y - n\bar{x}_2\bar{y} = 75.4405 - 10(0.56)(13.596) = -0.6971$
$SS_{x_1} = \sum x_1^2 - n\bar{x}_1^2 = 5 - 10(0.5)^2 = 2.50$*
$SS_{x_2} = \sum x_2^2 - n\bar{x}_2^2 = 3.4658 - 10(0.56)^2 = 0.3298$*
$S_{x_1 x_2} = \sum x_1 x_2 - n\bar{x}_1\bar{x}_2 = 2.44 - 10(0.50)(0.56) = -0.36$
(* indicates quantities that must be positive. Note that some of these were computed for the last problem.)

Fourth, we substitute these numbers into the simplified normal equations
$S_{x_1 y} = b_1 SS_{x_1} + b_2 S_{x_1 x_2}$
$S_{x_2 y} = b_1 S_{x_1 x_2} + b_2 SS_{x_2}$,
which are
$26.12 = 2.50 b_1 - 0.36 b_2$
$-0.6971 = -0.36 b_1 + 0.3298 b_2$,
and solve them as two equations in two unknowns for $b_1$ and $b_2$. We do this by multiplying the first equation by 0.144, which is 0.36 divided by 2.50; the purpose of this is to match the coefficients of $b_1$ in the two equations. (We could do just as well by multiplying the second equation by 0.36 divided by 0.3298 and matching the coefficients of $b_2$.) The first equation becomes $3.7613 = 0.36 b_1 - 0.0518 b_2$. We then add the two equations to get $3.0642 = 0.278 b_2$, so that $b_2 = \frac{3.0642}{0.278} = 11.022$. The first normal equation can now have this new value substituted into it to get $26.12 = 2.50 b_1 - 0.36(11.022)$, or $26.12 + 3.97 = 2.50 b_1$, which gives us $b_1 = 12.035$. Finally we get $b_0$ by solving
$b_0 = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 = 13.596 - 12.035(0.50) - 11.022(0.56) = 1.4062$.
Thus our equation is $\hat{Y} = b_0 + b_1 x_1 + b_2 x_2 = 1.4062 + 12.035 x_1 + 11.022 x_2$.
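The same coefficients can be checked by solving the two normal equations directly; a minimal sketch with numpy (array names are mine):

```python
import numpy as np

# Normal equations for the two-variable regression, using the spare parts above:
#   26.12   =  2.50 b1 - 0.36   b2
#   -0.6971 = -0.36 b1 + 0.3298 b2
A = np.array([[2.50, -0.36],
              [-0.36, 0.3298]])
rhs = np.array([26.12, -0.6971])

b1, b2 = np.linalg.solve(A, rhs)     # about 12.04 and 11.02
b0 = 13.596 - b1 * 0.50 - b2 * 0.56  # about 1.41
print(b0, b1, b2)
```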
b) The regression sum of squares is
$SSR = b_1 S_{x_1 y} + b_2 S_{x_2 y} = 12.035(26.12) + 11.022(-0.6971) = 306.671$*
and is used in the ANOVA below. The coefficient of determination is
$R^2 = \frac{SSR}{SST} = \frac{b_1 S_{x_1 y} + b_2 S_{x_2 y}}{SS_y} = \frac{306.671}{422.166} = .726$*.
Our results can be summarized as follows:

                Problem 1   Problem 2
R-squared          .646        .726
n                   10          10
k                    1           2
Adjusted R-sq      .602        .647

$\bar{R}^2$, which is $R^2$ adjusted for degrees of freedom, has the formula $\bar{R}^2 = \frac{(n-1)R^2 - k}{n - k - 1}$, where $k$ is the number of independent variables. $\bar{R}^2$ went up, which seems to show that our second regression is better.

Our previous regression had SSR at 272.69. If it rose to 306.671, the new variable must explain 306.671 - 272.69 = 33.981. $SSE = SST - SSR = 422.166 - 306.671 = 115.495$. The ANOVA table is:

Source      SS       DF      MS        F        F.10
X1        272.69      1   272.69
X2         33.981     1    33.981    2.0595   F(1,7) = 3.59
Error     115.495     7    16.4993
Total     422.166     9

Since our computed F is smaller than the table F, we do not reject our null hypothesis that $X_2$ has no effect.

A faster way to do this is to use the $R^2$s directly. The difference between $R^2 = 72.6\%$ and $R^2 = 64.6\%$ is 8.0%.

Source      SS       DF      MS        F        F.10
X1         64.6       1    64.6
X2          8.0       1     8.0      2.0438   F(1,7) = 3.59
Error      27.4       7     3.91428
Total     100.0       9

The numbers are a bit different because of rounding, but the conclusion is the same.

c) We computed the regression sum of squares in the previous section.

Source      SS       DF      MS        F        F.10
X1, X2    306.671     2   153.3355   9.293    F(2,7) = 3.26
Error     115.495     7    16.49939
Total     422.166     9

Since our computed F is larger than the table F, we reject our null hypothesis that $X_1$ and $X_2$ do not explain $Y$.

d) $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 = 1.4062 + 12.035(1) + 11.022(.40) = 17.85$.

e) From the ANOVA table, $s_e = \sqrt{16.49939} = 4.062$. Since $k = 2$, $t^{(n-k-1)}_{.05} = t^{(7)}_{.05} = 1.895$. The outline says that an approximate confidence interval is
$Y_0 = \hat{Y}_0 \pm t\frac{s_e}{\sqrt{n}} = 17.85 \pm 1.895\frac{4.062}{\sqrt{10}} = 17.85 \pm 2.43$
and an approximate prediction interval is
$Y_0 = \hat{Y}_0 \pm t\,s_e = 17.85 \pm 1.895(4.062) = 17.85 \pm 7.70$.

Computation of sums follows.

Row   price    FQ   prob   x1sq   x2sq      ysq      x1y      x2y    x1x2
 1     8.52     0   0.62     0   0.3844    72.590    0.00    5.2824   0.00
 2    21.45     1   0.43     1   0.1849   460.103   21.45    9.2235   0.43
 3    16.18     1   0.58     1   0.3364   261.792   16.18    9.3844   0.58
 4     6.21     0   0.74     0   0.5476    38.564    0.00    4.5954   0.00
 5    12.19     1   0.19     1   0.0361   148.596   12.19    2.3161   0.19
 6    25.62     1   0.49     1   0.2401   656.384   25.62   12.5538   0.49
 7    13.90     0   0.80     0   0.6400   193.210    0.00   11.1200   0.00
 8    18.66     1   0.75     1   0.5625   348.196   18.66   13.9950   0.75
 9     5.25     0   0.37     0   0.1369    27.562    0.00    1.9425   0.00
10     7.98     0   0.63     0   0.3969    63.680    0.00    5.0274   0.00
     135.96     5   5.60     5   3.4658  2270.68    94.10   75.4405   2.44
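The prediction and the two approximate intervals in d) and e) are easy to script; a minimal sketch under the outline's approximation (names are mine):

```python
from math import sqrt

# Approximate interval estimates for the predicted meal price (parts d and e),
# following the rule of thumb quoted from the outline above.
b0, b1, b2 = 1.4062, 12.035, 11.022
n, se, t = 10, sqrt(16.49939), 1.895   # t(7) at the 10% level, two-sided

y_hat = b0 + b1 * 1 + b2 * 0.40        # French Quarter, 40% chance of being seated
ci = t * se / sqrt(n)                  # about 2.43 (confidence interval half-width)
pi = t * se                            # about 7.70 (prediction interval half-width)
print(f"{y_hat:.2f} +/- {ci:.2f} (mean), +/- {pi:.2f} (single meal)")
```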
3. An airline wants to select a computer package for its reservation system. Over 20 weeks it tries the four commercially available reservation-system packages and records, as $x_1$, $x_2$, $x_3$, and $x_4$, the number of passengers bumped by each system. It will choose the package with the smallest average bumps, assuming that there is a significant difference between the median or average number of bumps. The data below are in the columns labeled x (the original numbers) and, in the r columns, their ranks on a 1 to 20 scale. Below this I have given you the sums of the columns, the number of items in each column, the means for each column, and the sums of the squared numbers (ssq) in each column. The columns are independent samples. Use a 5% significance level.

Row     x1    r1      x2    r2      x3    r3      x4    r4
 1      12   16.0      2    2.0     10   12.0      7    7.5
 2      14   18.0      4    4.0      9    9.5      6    5.5
 3       9    9.5      7    7.5      6    5.5     15   19.0
 4      11   14.0      3    3.0     10   12.0     12   16.0
 5      16   20.0      1    1.0     12   16.0
 6                                  10   12.0
sum     62            17            57            40
count    5             5             6             4
mean    12.4           3.4           9.5          10
ssq    798            79           561           454

a. Assume that the underlying distribution is Normal and test for a significant difference between the means. (7)
b. Assume that the underlying distribution is not Normal and test for a significant difference between the medians. (5)
c. Find the mean and standard deviation for column P3 and test column P3 for a Normal distribution. (5)

Solution: I followed the posted solution to Exercise 14.24. $\alpha = .05$.

a) $\sum x = 62 + 17 + 57 + 40 = 176$, $n = 5 + 5 + 6 + 4 = 20$, so $\bar{x} = \frac{176}{20} = 8.8$.
$\sum\sum x_{ij}^2 = 798 + 79 + 561 + 454 = 1892$, and the squared column means $\bar{x}_j^2$ are 153.76, 11.56, 90.25, and 100.
$SST = \sum\sum x_{ij}^2 - n\bar{x}^2 = 1892 - 20(8.8)^2 = 1892 - 1548.8 = 343.2$
$SSB = \sum n_j\bar{x}_j^2 - n\bar{x}^2 = 5(153.76) + 5(11.56) + 6(90.25) + 4(100) - 1548.8 = 768.8 + 57.8 + 541.5 + 400 - 1548.8 = 219.3$

Source     SS      DF     MS        F       F.05              H0
Between   219.3     3    73.10     9.440   F(3,16) = 3.24   Column means equal
Within    123.9    16     7.74375
Total     343.2    19

Since the value of F we calculated is more than the table value, we reject the null hypothesis and conclude that there is a significant difference between the column means.

b) Since this involves comparing four apparently random samples from a non-Normal distribution, we use a Kruskal-Wallis test. The null hypothesis is $H_0$: the columns come from the same distribution (or the medians are equal). The ranks are repeated below.

Row    r1     r2     r3     r4
 1    16      2     12      7.5
 2    18      4      9.5    5.5
 3     9.5    7.5    5.5   19
 4    14      3     12     16
 5    20      1     16
 6                  12
Sum   77.5   17.5   67     48

To check the ranking, note that the sum of the four rank sums is 77.5 + 17.5 + 67 + 48 = 210, that the total number of items is 5 + 5 + 6 + 4 = 20, and that the sum of the first $n$ numbers is $\frac{n(n+1)}{2} = \frac{20(21)}{2} = 210$. Now compute the Kruskal-Wallis statistic
$H = \frac{12}{n(n+1)}\sum_i \frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{20(21)}\left[\frac{77.5^2}{5} + \frac{17.5^2}{5} + \frac{67^2}{6} + \frac{48^2}{4}\right] - 3(21)$
$= \frac{12}{420}\left[1201.25 + 61.25 + 748.17 + 576\right] - 63 = 10.905$.
If we try to look up this result in the (5, 5, 6, 4) section of the Kruskal-Wallis table (Table 9), we find that the problem is too large for the table. Thus we must use the chi-squared table with 3 degrees of freedom. Since $10.905 > \chi^2_{.05}(3) = 7.8147$, reject $H_0$.

c) $H_0$: Normal (mean and variance unspecified); $H_1$: not Normal. Because the mean and standard deviation are unknown, this is a Lilliefors problem. From the data we can find that $\bar{x} = 9.5$ and $s = 1.9748$, and we compute $t = \frac{x - \bar{x}}{s}$ for each observation. $F(t)$ is computed from the Normal table. For example, $F(-0.25) = P(z \le -0.25) = P(z \le 0) - P(-0.25 \le z \le 0) = .5 - .0987 = .4013$ and $F(0.25) = P(z \le 0.25) = P(z \le 0) + P(0 \le z \le 0.25) = .5 + .0987 = .5987$.

  x      t      F(t)    O    Fo (cum O/n)   D = |F(t) - Fo|
  6    -1.77   .0384    1      .1667            .1283
  9    -0.25   .4013    1      .3333            .0680
 10     0.25   .5987    1      .5000            .0987
 10     0.25   .5987    1      .6667            .0680
 10     0.25   .5987    1      .8333            .2346
 12     1.26   .8962    1     1.0000            .1038

Max D = .2346. Since the critical value for $n = 6$ and $\alpha = .05$ is .319, do not reject $H_0$.
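Both tests in a) and b) have standard library equivalents; here is a quick cross-check with scipy (the four lists are the data columns above):

```python
from scipy import stats

p1 = [12, 14, 9, 11, 16]
p2 = [2, 4, 7, 3, 1]
p3 = [10, 9, 6, 10, 12, 10]
p4 = [7, 6, 15, 12]

# One-way ANOVA: F should be about 9.44 on (3, 16) degrees of freedom.
f_stat, f_p = stats.f_oneway(p1, p2, p3, p4)

# Kruskal-Wallis: about 10.9 by hand; scipy applies a tie correction,
# so its H may differ slightly. Judge against chi-squared with 3 DF.
h_stat, h_p = stats.kruskal(p1, p2, p3, p4)

print(f"ANOVA F = {f_stat:.3f} (p = {f_p:.4f}); "
      f"Kruskal-Wallis H = {h_stat:.3f} (p = {h_p:.4f})")
```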
4. The data from the previous problem are repeated. Use a 5% significance level.

Row     x1    r1      x2    r2      x3    r3      x4    r4
 1      12   16.0      2    2.0     10   12.0      7    7.5
 2      14   18.0      4    4.0      9    9.5      6    5.5
 3       9    9.5      7    7.5      6    5.5     15   19.0
 4      11   14.0      3    3.0     10   12.0     12   16.0
 5      16   20.0      1    1.0     12   16.0
 6                                  10   12.0
sum     62            17            57            40
count    5             5             6             4
mean    12.4           3.4           9.5          10
ssq    798            79           561           454

a. Assume that the underlying distribution is Normal and test columns 1 and 3 for differences in means. Assume identical variances. Use (i) a test ratio, (ii) a critical value, and (iii) a confidence interval. (6)
b. Assume that the underlying distribution is not Normal and test for a significant difference between the medians of columns 1 and 3. (4)
c. Assume again that the distributions are Normal and test that the variances are the same. (3)
d. Test column P3 to see if its standard deviation is 5. (3)

Solution: a) First, we need the variances of $x_1$ and $x_3$. We know that $\bar{x}_1 = 12.4$, $n_1 = 5$ and $\bar{x}_3 = 9.5$, $n_3 = 6$. Recall that
$s_1^2 = \frac{\sum x_1^2 - n_1\bar{x}_1^2}{n_1 - 1} = \frac{798 - 5(12.4)^2}{4} = 7.3$ and
$s_3^2 = \frac{\sum x_3^2 - n_3\bar{x}_3^2}{n_3 - 1} = \frac{561 - 6(9.5)^2}{5} = 3.9$.
We wish to test $H_0: \mu_1 = \mu_3$ against $H_1: \mu_1 \ne \mu_3$, or equivalently $H_0: \mu_1 - \mu_3 = 0$ against $H_1: \mu_1 - \mu_3 \ne 0$. From Table 3 in the Syllabus Supplement (difference between two means, variances unknown but assumed equal):
$d = \bar{x}_1 - \bar{x}_3$, $\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_3 - 1)s_3^2}{n_1 + n_3 - 2}$, $s_d = \hat{s}_p\sqrt{\frac{1}{n_1} + \frac{1}{n_3}}$, $DF = n_1 + n_3 - 2$.
(Methods for comparing two sample means differ greatly from methods for comparing one sample mean with a population mean!)
Here $d = 12.4 - 9.5 = 2.9$, $DF = 5 + 6 - 2 = 9$, $\alpha = .05$,
$\hat{s}_p^2 = \frac{4(7.3) + 5(3.9)}{9} = 5.4111$,
$s_d = \sqrt{5.4111\left(\frac{1}{5} + \frac{1}{6}\right)} = \sqrt{1.984} = 1.4085$, and $t^{(9)}_{.025} = 2.262$.
Use one of the following methods.
(i) Test ratio: $t = \frac{d - 0}{s_d} = \frac{2.9 - 0}{1.4085} = 2.059$. The 'do not reject' region is between $-t^{(9)}_{.025} = -2.262$ and $+t^{(9)}_{.025} = 2.262$. Since our t is between these two numbers, do not reject the null hypothesis.
(ii) Critical value: $d_{cv} = 0 \pm t_{.025}\,s_d = \pm 2.262(1.4085) = \pm 3.186$. The 'do not reject' region is between -3.186 and 3.186. Since $d = 2.9$ is between these values, we do not reject $H_0$.
(iii) Confidence interval: the two-sided interval is $d \pm t_{.025}\,s_d = 2.9 \pm 2.262(1.4085) = 2.9 \pm 3.19$. Since this interval includes zero, we do not reject $H_0$.

b) Since we have two independent samples from non-Normal populations, we use the Wilcoxon-Mann-Whitney test for two independent samples: $H_0: \eta_1 = \eta_3$ against $H_1: \eta_1 \ne \eta_3$ (equal medians).

 x1    r1    r1*        x3    r3    r3*
 12    3.5   8.5        10     7    5
 14    2    10           9     9.5  2.5
  9    9.5   2.5         6    11    1
 11    5     7          10     7    5
 16    1    11          12     3.5  8.5
                        10     7    5
Sum   21    39          Sum   45   27

For the purposes of this test, $n_2 = 6$ is the size of the larger sample (actually sample 3), $n_1 = 5$ is the size of the smaller sample, and we wish to compare their medians. Our first step is to rank the numbers from 1 to $n = n_1 + n_2 = 5 + 6 = 11$. Note that there are a number of ties that must receive an average rank. The numbers can be ordered from the largest to the smallest or from the smallest to the largest. To decide which to do, look at the smaller sample: if the smallest number is in the smaller sample, order from smallest to largest; if the largest number is in the smaller sample, order from the largest to the smallest. Since 16 is the largest number and sits in the smaller sample, let 16 have rank 1. Now compute the sums of the ranks: $SR_1 = 21$ and $SR_2 = 45$. As a check, note that these two rank sums must add to the sum of the first $n$ numbers, which is $\frac{n(n+1)}{2} = \frac{11(12)}{2} = 66$, and indeed $SR_1 + SR_2 = 21 + 45 = 66$.
The smaller of $SR_1$ and $SR_2$ is called $W$ and is compared with Table 5 or 6. To use Table 5, first find the part for $n_2 = 6$ and then the column for $n_1 = 5$; then locate $W = 21$ in that column. In this case the p-value is .0628, which should be doubled for a two-sided test. Since this is above the significance level, we cannot reject the null hypothesis. This can also be compared against the critical values $T_L$ and $T_U$ ($T_U$ is actually only needed for a two-sided test) in Table 14a; these are 19 and 41. Since $W = 21$ is between these values, we cannot reject the null hypothesis. The starred rankings (r1* and r3*) are what you would get if you ranked from the bottom to the top. They are incorrect here, but they would cause you to correctly not reject the null hypothesis.
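Both of these two-sample tests also have scipy equivalents; a quick check (column lists as above):

```python
from scipy import stats

x1 = [12, 14, 9, 11, 16]
x3 = [10, 9, 6, 10, 12, 10]

# Pooled-variance two-sample t test: t should be about 2.06 on 9 DF.
t_stat, t_p = stats.ttest_ind(x1, x3, equal_var=True)

# Mann-Whitney U test, equivalent to the rank-sum comparison used above.
u_stat, u_p = stats.mannwhitneyu(x1, x3, alternative='two-sided')

print(f"t = {t_stat:.3f} (p = {t_p:.4f}); U = {u_stat} (p = {u_p:.4f})")
```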
c) From Table 3 in the Syllabus Supplement (ratio of variances): $H_0: \sigma_1^2 = \sigma_3^2$ against $H_1: \sigma_1^2 \ne \sigma_3^2$, tested with $F = \frac{s_1^2}{s_3^2}$ against $F^{(DF_1, DF_2)}_{\alpha/2}$, where $DF_1 = n_1 - 1$ and $DF_2 = n_3 - 1$. Since this is a two-sided test, we use both
$F^{(4,5)} = \frac{s_1^2}{s_3^2} = \frac{7.3}{3.9} = 1.8718$ and $F^{(5,4)} = \frac{s_3^2}{s_1^2} = \frac{3.9}{7.3} = 0.534$.
We find that $F^{(4,5)}_{.025} = 7.39$ and $F^{(5,4)}_{.025} = 9.36$. Since our computed Fs are less than the corresponding table Fs, we cannot reject the null hypothesis. (Because the smaller F is below 1, and there are no values on the F table below 1, we actually only look at $F^{(4,5)}_{.025} = 7.39$.)

d) From Table 3 (variance, small sample): $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 \ne \sigma_0^2$, tested with $\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$. Our hypotheses are $H_0: \sigma = 5$ (that is, $\sigma^2 = 25$) against $H_1: \sigma \ne 5$. We know that $\bar{x}_3 = 9.5$, $n_3 = 6$, and $s_3^2 = \frac{561 - 6(9.5)^2}{5} = 3.9$. Here
$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2} = \frac{5(3.9)}{25} = 0.78$ and $DF = n - 1 = 5$.
From the chi-squared table, $\chi^2_{.975}(5) = 0.8312$ and $\chi^2_{.025}(5) = 12.8325$. We would not reject the null hypothesis if our $\chi^2$ were between these values. It is not (0.78 falls below 0.8312), so reject the null hypothesis.

5. a. A machine fills a sample of 100 one-pound boxes of a product, and they are later tested to see how many are over or under the desired one-pound size. The manufacturer wishes to test whether exactly half of the population of boxes is over the one-pound mark and whether the occurrence of boxes that are 'over' and 'under' is random. In the sample there are 55 boxes that are 'over' and 45 that are 'under', and there are 45 runs of 'overs' or 'unders'.
(i) Test that the proportion of 'overs' is 50%. (2)
(ii) Test that the sequence of 'overs' and 'unders' is random. (5)
b. A series of 24 observations is used to calculate a simple regression with only one independent variable. We calculate a Durbin-Watson statistic of 0.471. Is autocorrelation present? Is it positive or negative? (3)
c. We are testing to see if the mean of a Normally distributed population with a known standard deviation of 20 is 5. We take a sample of 100 and find that the mean is 9. Given these results, what is the p-value of our result if (i) the null hypothesis is $H_0: \mu \ge 5$, (ii) the null hypothesis is $H_0: \mu \le 5$, (iii) the null hypothesis is $H_0: \mu = 5$? (6)

Solution: a) (i) From Table 3 (proportion): our hypotheses are $H_0: p = .5$ against $H_1: p \ne .5$, with $n = 100$, $\bar{p} = \frac{55}{100} = .55$, and $\alpha = .05$.
$\sigma_p = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{.5(.5)}{100}} = .05$
If we use a test ratio, $z = \frac{\bar{p} - p_0}{\sigma_p} = \frac{.55 - .5}{.05} = 1$. Since this is between -1.96 and 1.96, we do not reject the null hypothesis. If we use a critical value, $p_{cv} = p_0 \pm z_{.025}\,\sigma_p = .5 \pm 1.960(.05)$, or .402 to .598. Since .55 lies between these numbers, do not reject the null hypothesis.

(ii) This is a runs test. The null hypothesis is randomness. $r = 45$, $n_1 = 55$, $n_2 = 45$, and $n = n_1 + n_2 = 100$. Our values are too large for the runs test table, but we know that for a larger problem (if $n_1$ and $n_2$ are too large for the table) $r$ follows the Normal distribution with
$\mu = \frac{2n_1 n_2}{n} + 1 = \frac{2(55)(45)}{100} + 1 = 50.5$ and $\sigma^2 = \frac{2n_1 n_2 (2n_1 n_2 - n)}{n^2(n-1)} = \frac{4950(4950 - 100)}{100^2(99)} = 24.25$.
So $z = \frac{45 - 50.5}{\sqrt{24.25}} = -1.12$. Since this value of z is between $-1.960$ and $+1.960$, we do not reject $H_0$: randomness.
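The Normal approximation in (ii) is short to script; a minimal sketch of the same formulas (names are mine):

```python
from math import sqrt

# Runs test, Normal approximation: r runs among n1 'overs' and n2 'unders'.
r, n1, n2 = 45, 55, 45
n = n1 + n2

mu = 2 * n1 * n2 / n + 1                                  # 50.5
var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n**2 * (n - 1))  # 24.25
z = (r - mu) / sqrt(var)                                  # about -1.12

print(f"z = {z:.2f}")  # compare with +/-1.96 at the 5% level
```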
b) This is a Durbin-Watson test, and we are given the Durbin-Watson statistic $DW = 0.471$. Use a Durbin-Watson table with $n = 24$ and $k = 1$ to mark off the regions of the 0-to-4 scale: positive autocorrelation below $d_L$, an inconclusive zone from $d_L$ to $d_U$, no autocorrelation from $d_U$ to $4 - d_U$, another inconclusive zone from $4 - d_U$ to $4 - d_L$, and negative autocorrelation above $4 - d_L$. If you used the 5% table, you got $d_L = 1.27$ and $d_U = 1.45$; if you used the 1% table, you got $d_L = 1.04$ and $d_U = 1.20$. In either case the given value of $DW = 0.471$ falls well below $d_L$, indicating positive autocorrelation.

c) From Table 3 (mean, $\sigma$ known): the test ratio is $z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}$. The problem states $\sigma = 20$, $n = 100$, and $\bar{x} = 9$, so $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} = 2$. To get a p-value, we must use the test ratio: $z = \frac{9 - 5}{2} = 2.00$.
(i) $H_0: \mu \ge 5$, $H_1: \mu < 5$: p-value $= P(\bar{x} \le 9) = P(z \le 2.00) = .5 + .4772 = .9772$.
(ii) $H_0: \mu \le 5$, $H_1: \mu > 5$: p-value $= P(\bar{x} \ge 9) = P(z \ge 2.00) = .5 - .4772 = .0228$.
(iii) $H_0: \mu = 5$: in the case of a two-sided hypothesis, find the probability beyond the observed value in the nearer tail and double it: p-value $= 2P(\bar{x} \ge 9) = 2P(z \ge 2.00) = 2(.0228) = .0456$.

6. An electronics chain reports the following data on number of households, sales volume, and number of customers for 10 stores.

Row   hshlds (x1)   sales (x2)   cust (x3)
 1        161           157          305
 2         99            93           55
 3        135           136          205
 4        120           123          105
 5        164           153          255
 6        221           241          505
 7        179           201          355
 8        204           207          455
 9        214           230          405
10        101           135          155

$\sum x_1 = 1598.0$, $\sum x_1^2 = 273738$, $\sum x_2 = 1676.0$, $\sum x_2^2 = 302788$, $\sum x_1 x_2 = 287019$

a) Compute the correlation between households and sales and test it for significance. (5)
b) Test the same correlation to see if it is .9. (5)
c) Compute the rank correlation between households and sales and test it for significance. (5)
d) Compute Kendall's W for households, sales, and customers and test it for significance. (6)

Solution: a) From the outline, the simple sample correlation coefficient is
$r = \frac{S_{x_1 x_2}}{\sqrt{SS_{x_1} SS_{x_2}}} = \frac{\sum x_1 x_2 - n\bar{x}_1\bar{x}_2}{\sqrt{(\sum x_1^2 - n\bar{x}_1^2)(\sum x_2^2 - n\bar{x}_2^2)}}$.
In this case
$S_{x_1 x_2} = 287019 - 10(159.8)(167.6) = 19194.2$
$SS_{x_1} = 273738 - 10(159.8)^2 = 18377.6$
$SS_{x_2} = 302788 - 10(167.6)^2 = 21890.4$
so $r = \frac{19194.2}{\sqrt{(18377.6)(21890.4)}} = .9570$ and $r^2 = .9158$.
If we want to test $H_0: \rho = 0$ against $H_1: \rho \ne 0$, and $x_1$ and $x_2$ are Normally distributed, we use
$t^{(n-2)} = \frac{r}{s_r}$, where $s_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - .9158}{8}} = .10259$, so $t = \frac{.9570}{.10259} = 9.328$.
Compare this with $t^{(8)}_{.025} = 2.306$. Since 9.328 does not lie between -2.306 and 2.306, reject the null hypothesis.

b) If we are testing $H_0: \rho = \rho_0$ against $H_1: \rho \ne \rho_0$, with $\rho_0 \ne 0$, we use Fisher's z-transformation. Let
$\tilde{z} = \frac{1}{2}\ln\frac{1+r}{1-r} = \frac{1}{2}\ln\frac{1.9570}{0.0430} = \frac{1}{2}\ln 45.5116 = 1.90898$.
This has an approximate mean of
$\mu_z = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0} = \frac{1}{2}\ln\frac{1.9}{0.1} = \frac{1}{2}\ln 19 = 1.47222$
and a standard deviation of
$s_z = \sqrt{\frac{1}{n-3}} = \sqrt{\frac{1}{7}} = 0.37796$,
so that $t = \frac{\tilde{z} - \mu_z}{s_z} = \frac{1.90898 - 1.47222}{0.37796} = 1.156$. Compare this with $t^{(8)}_{.025} = 2.306$. Since 1.156 lies between -2.306 and 2.306, do not reject the null hypothesis.
Note: to do the above with logarithms to the base 10, try $\tilde{z}_{10} = \frac{1}{2}\log\frac{1+r}{1-r} = \frac{1}{2}\log 45.5116 = 0.82906$. This has an approximate mean of $\frac{1}{2}\log 19 = 0.63938$ and a standard deviation of $s_{z10} = \frac{0.37796}{\ln 10} = 0.16415$, so that $t = \frac{0.82906 - 0.63938}{0.16415} = 1.156$.
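Part b) in Python, a minimal sketch of the Fisher transformation (names are mine; `atanh` is exactly the half-log transformation above):

```python
from math import sqrt, atanh

# Fisher z test of H0: rho = 0.9 against a two-sided alternative.
r, rho0, n = 0.9570, 0.9, 10

z_r   = atanh(r)         # 0.5 * ln((1+r)/(1-r)), about 1.909
z_rho = atanh(rho0)      # about 1.472
s_z   = 1 / sqrt(n - 3)  # about 0.378

t = (z_r - z_rho) / s_z  # about 1.16; compare with the table value 2.306
print(f"t = {t:.3f}")
```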
c) The data are repeated below with the ranking needed for the remainder of the problem. $r_1$, $r_2$, and $r_3$ are bottom-to-top rankings within each column; $d = r_1 - r_2$ and $d^2$ are needed for Spearman's rank correlation; $SR$ is the sum of the ranks in each row, and $SR^2$ is required for Kendall's W.

Row  hshlds  sales  cust   r1   r2   r3    d    d^2   SR   SR^2
 1     161    157    305    5    6    6   -1     1    17    289
 2      99     93     55    1    1    1    0     0     3      9
 3     135    136    205    4    4    4    0     0    12    144
 4     120    123    105    3    2    2    1     1     7     49
 5     164    153    255    6    5    5    1     1    16    256
 6     221    241    505   10   10   10    0     0    30    900
 7     179    201    355    7    7    7    0     0    21    441
 8     204    207    455    8    8    9    0     0    25    625
 9     214    230    405    9    9    8    0     0    26    676
10     101    135    155    2    3    3   -1     1     8     64
Sum                                              4   165   3453

To compute Spearman's rank correlation coefficient, take a set of $n$ points $(x, y)$ and rank both $x$ and $y$ from 1 to $n$ to get $r_x$ and $r_y$. (Do not attempt to compute a rank correlation without first replacing the original numbers by ranks.) Then compute $d = r_x - r_y$ and
$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(4)}{10(100 - 1)} = 0.9758$.
In this case we have a one-sided test of $H_0: \rho_s = 0$ against $H_1: \rho_s > 0$. If we check the table 'Critical Values of $r_s$, the Spearman Rank Correlation Coefficient,' we find that the critical value for $n = 10$ and $\alpha = .05$ is .5515 (for a two-sided test we would use the 2.5% value instead). Since .9758 exceeds the critical value, we reject the null hypothesis and conclude that the rankings show significant agreement.

d) For Kendall's coefficient of concordance, take $k$ columns with $n$ items in each and rank each column from 1 to $n$. The null hypothesis is that the rankings disagree. Compute a sum of ranks $SR_i$ for each row. The mean of the $SR_i$ is $\overline{SR} = \frac{\sum SR_i}{n} = \frac{165}{10} = 16.5$, which must equal $\frac{k(n+1)}{2} = \frac{3(11)}{2} = 16.5$. Then
$S = \sum SR_i^2 - n\,\overline{SR}^2 = 3453 - 10(16.5)^2 = 730.5$
(note that the originally circulated key carried $SR^2 = 196$ instead of $16^2 = 256$ in row 5, giving $S = 670.5$), and
$W = \frac{S}{\frac{1}{12}k^2(n^3 - n)} = \frac{730.5}{\frac{1}{12}(3^2)(10^3 - 10)} = \frac{730.5}{742.5} = 0.9838$
is the Kendall coefficient of concordance, which must be between 0 and 1. Since $n$ is too large for the Kendall table, use $\chi^2_{(n-1)} = k(n-1)W = 3(9)(0.9838) = 26.56$. Since this exceeds $\chi^2_{.05}(9) = 16.919$, we reject the null hypothesis and say that there is significant agreement.
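Spearman's coefficient has a scipy equivalent, and W is short to compute from the rank sums; a sketch (names are mine):

```python
from scipy import stats

hshlds = [161, 99, 135, 120, 164, 221, 179, 204, 214, 101]
sales  = [157, 93, 136, 123, 153, 241, 201, 207, 230, 135]
cust   = [305, 55, 205, 105, 255, 505, 355, 455, 405, 155]

# Spearman rank correlation between households and sales (about 0.976).
rs, p = stats.spearmanr(hshlds, sales)

# Kendall's W from the row sums of ranks across the k = 3 columns.
k, n = 3, len(hshlds)
ranks = [stats.rankdata(col) for col in (hshlds, sales, cust)]
sr = [sum(col[i] for col in ranks) for i in range(n)]
s = sum((v - k * (n + 1) / 2) ** 2 for v in sr)  # squared deviations from mean SR
w = s / (k**2 * (n**3 - n) / 12)

print(f"r_s = {rs:.4f} (p = {p:.4g}), W = {w:.4f}, chi2 = {k*(n-1)*w:.2f}")
```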
7. A producer of filters is getting complaints about the quality of the filters it is producing. It thus examines 1000 filters from each of its three shifts and discovers 36 defects for shift 1, 40 defects for shift 2, and 55 defects for shift 3.
a) Test the hypothesis that the proportion of defective filters is the same for all three shifts at the 95% level. (7)
b) Test the hypothesis that the defect rate is higher for the third shift than the first. (3)
c) Find a p-value for your result in b). (2)
d) Do a confidence interval for the difference between the proportion defective for shift 1 and shift 2. (4)

Solution: a) $H_0$: homogeneous (or $p_1 = p_2 = p_3$) against $H_1$: not homogeneous (not all the ps are equal). $DF = (r-1)(c-1) = (1)(2) = 2$, and $\chi^2_{.05}(2) = 5.9915$. The observed table O, with row proportions $p_r$, is:

O           Shift 1   Shift 2   Shift 3   Total     pr
Defective      36        40        55       131   .04367
Not           964       960       945      2869   .95633
Total        1000      1000      1000      3000   1.0000

The row proportions $p_r$ are applied to the column totals to get the expected table E. Note that the row and column sums in E are the same as in O.

E           Shift 1   Shift 2   Shift 3   Total     pr
Defective    43.67     43.67     43.67      131   .04367
Not         956.33    956.33    956.33     2869   .95633
Total      1000.00   1000.00   1000.00     3000   1.0000

(Note that $\chi^2$ is computed two different ways here; only one way is needed.)

Row     O        E       O-E     (O-E)^2   (O-E)^2/E     O^2/E
 1      36     43.67    -7.67     58.829     1.34712     29.677
 2      40     43.67    -3.67     13.469     0.30842     36.638
 3      55     43.67    11.33    128.369     2.93952     69.270
 4     964    956.33     7.67     58.829     0.06151    971.732
 5     960    956.33     3.67     13.469     0.01408    963.684
 6     945    956.33   -11.33    128.369     0.13423    933.804
      3000   3000.00     0.00                4.8049    3004.8049

$\chi^2 = \sum\frac{(O-E)^2}{E} = 4.8049$, or equivalently $\chi^2 = \sum\frac{O^2}{E} - n = 3004.8049 - 3000 = 4.8049$. Since this is less than 5.9915, do not reject $H_0$.

b) We are comparing $\bar{p}_1 = .036$ ($n_1 = 1000$) and $\bar{p}_3 = .055$ ($n_3 = 1000$). From Table 3 (difference between proportions): when $H_0$ sets $\Delta p = 0$, the test ratio is $z = \frac{\Delta\bar{p} - 0}{\sigma_{\Delta p}}$, where $\sigma_{\Delta p} = \sqrt{\bar{p}_0\bar{q}_0\left(\frac{1}{n_1} + \frac{1}{n_3}\right)}$ and $\bar{p}_0$ pools the two samples. Our hypotheses are $H_0: p_3 \le p_1$ against $H_1: p_3 > p_1$, or, writing $\Delta p = p_3 - p_1$, $H_0: \Delta p \le 0$ against $H_1: \Delta p > 0$. Note that $\Delta\bar{p} = \bar{p}_3 - \bar{p}_1 = .019$,
$\bar{p}_0 = \frac{n_1\bar{p}_1 + n_3\bar{p}_3}{n_1 + n_3} = \frac{36 + 55}{2000} = .0455$,
$\alpha = .05$, $z_{.05} = 1.645$, and $z_{.025} = 1.960$. (Remember that $\bar{q} = 1 - \bar{p}$ and that q and p are always between 0 and 1.)
$\sigma_{\Delta p} = \sqrt{.0455(.9545)\left(\frac{1}{1000} + \frac{1}{1000}\right)} = \sqrt{.00008686} = .0093198$
(Only one of the following methods is needed!)
Test ratio: $z = \frac{\Delta\bar{p} - 0}{\sigma_{\Delta p}} = \frac{.019 - 0}{.0093198} = 2.03$. Make a diagram showing a 'reject' region above 1.645. Since 2.03 is above this, reject $H_0$. Or:
Critical value: $\Delta p_{cv} = 0 + z_{.05}\,\sigma_{\Delta p} = 1.645(.0093198) = .01533$. Make a diagram showing a 'reject' region above .01533. Since .019 is above this, reject $H_0$. Or:
Confidence interval: $\Delta p \ge \Delta\bar{p} - z_{.05}\,s_{\Delta p}$ (probably not worth doing). In all cases, reject $H_0$.

c) p-value $= P(\Delta\bar{p} \ge .019) = P(z \ge 2.03) = .5 - .4788 = .0212$.

d) Let $\Delta\bar{p} = \bar{p}_1 - \bar{p}_2 = .036 - .040 = -.004$ and
$s_{\Delta p} = \sqrt{\frac{\bar{p}_1\bar{q}_1}{n_1} + \frac{\bar{p}_2\bar{q}_2}{n_2}} = \sqrt{\frac{.036(.964)}{1000} + \frac{.040(.960)}{1000}} = \sqrt{.0000731} = .00855$.
Then $\Delta p = \Delta\bar{p} \pm z_{.025}\,s_{\Delta p} = -.004 \pm 1.960(.00855) = -.004 \pm .017$, or -.021 to .013.

© 2001, R. E. Bove
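As a final cross-check, parts a) and b) of Problem 7 in Python, a sketch using scipy's chi-squared test and a hand-rolled two-proportion z (names are mine):

```python
from math import sqrt
from scipy import stats

# a) Chi-squared test of homogeneity across the three shifts.
defects = [36, 40, 55]
good = [1000 - d for d in defects]
chi2, p, dof, expected = stats.chi2_contingency([defects, good])
print(f"chi2 = {chi2:.4f}, DF = {dof}, p = {p:.4f}")  # about 4.80 on 2 DF

# b) One-sided z test that shift 3's defect rate exceeds shift 1's.
p1, p3, n = 36 / 1000, 55 / 1000, 1000
p0 = (36 + 55) / 2000                          # pooled proportion, .0455
z = (p3 - p1) / sqrt(p0 * (1 - p0) * (2 / n))  # about 2.04
print(f"z = {z:.2f}")                          # reject H0 above 1.645
```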