252y0641 12/15/05
ECO252 QBA2 Final Exam, December 2006, Version 1
Name and Class hour: ____KEY____

I. (18+ points) Do all the following. Note that answers without reasons and citation of appropriate statistical tests receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value. If you haven't done it lately, take a fast look at ECO 252 - Things That You Should Never Do on a Statistics Exam (or Anywhere Else). Since I will use this exam as an example of approaches to model building, I will include general rules in my solutions.

Regression A seeks to explain the selling price of a home in terms of a group of variables explained on the output sheet. Note that Regressions 1 and 7 are identical. Look at the definitions of the variables carefully and, in particular, notice which are interaction variables.

a) The homes in this regression are in three different areas. There are dummy variables to indicate that the homes are in Area 1 or Area 2. Why isn't there a dummy variable for Area 3? (1)

A model with a full set of dummy variables $X_{d.1}, X_{d.2}, \ldots, X_{d.k-1}, X_{d.k}$ is overdetermined, because $X_{d.k} = 1 - X_{d.1} - X_{d.2} - \cdots - X_{d.k-1}$. In practice this means that if we have, say, 5 categories, we must let one be a 'control group' and not identify it with a dummy variable.

b) In Regression 1, what coefficients are significant at the 5% level? (2)

A coefficient $\beta_i$ is insignificant if the null hypothesis $H_0\!: \beta_i = 0$ cannot be shown to be false at a reasonable significance level (usually $\alpha = .05$ or $\alpha = .01$). In practice, the coefficient is significant when the t-ratio $t = \frac{b_i}{s_{b_i}}$ is not between $\pm t_{\alpha/2}$, that is, when the p-value, $2P(t > t_{computed})$ or, if the t-ratio is negative, $2P(t < t_{computed})$, is below a reasonable significance level.

c) What independent variables did I remove from the problem to get to Regression 2 from Regression 1? Why? (2)

The best subsets approach, according to Berenson et al., involves: (i) choosing a large set of candidate independent variables; (ii) running a regression with all the candidate variables and using the VIF option, which tests for collinearity; (iii) eliminating variables with a VIF over 5; (iv) continuing to run regressions and eliminate candidate variables until there are no variables with a VIF over 5; (v) performing a best subsets regression on the model without high VIFs and computing $C_p$; (vi) shortlisting the models with a $C_p$ less than or close to $k + 1$, where $k$ is the number of independent variables in that regression; (vii) choosing from the shortlist on the basis of things like significance of coefficients and R-squared; (viii) using residual analysis and influence analysis to further refine the model by adding nonlinear terms, transforming variables and eliminating suspicious observations. Note that terms like a squared term are largely exempt from the VIF rules, even though they are correlated with the untransformed variable.

d) Following the same process, I went on to remove one or more variables each time until I got to Regression 5. When I got to Regression 5, I ran the best subsets regression and concluded that it was time to quit removing variables. Between the best subsets regression and the characteristics of the coefficients in Regression 5, I felt that I had gone as far as was reasonable in removing independent variables. What are the three things that led me to think that Regression 5 was the best that I could do? (3)
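As a small illustration of the 'control group' rule in part a): with three areas, statistical software can build the two dummies directly, dropping one level as the base. This is a minimal sketch in Python; the column names are hypothetical, not from the exam's data set.

```python
# Sketch of dummy coding with a control group (hypothetical column names).
import pandas as pd

homes = pd.DataFrame({"area": ["Area1", "Area2", "Area3", "Area1", "Area3"]})
dummies = pd.get_dummies(homes["area"], drop_first=True)  # Area1 becomes the control group
print(dummies)
# Keeping all three dummies would make them sum to 1 in every row, duplicating
# the intercept column and making the model overdetermined.
```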
e) Using Regression 5 and assuming that all homes have two baths, Regression 5 effectively becomes 3 regressions relating price to living area. Take the coefficient of bath, multiply it by two and add it to the constant to get the effective intercept for homes with two baths. Using $L$ or any other symbol that you find convenient for living area, what are the equations relating living area to price in (3 points) Area 1? Area 2? Area 3? [11]

f) Continuing with Regression 5 and assuming that a home has 2 (thousand) square feet of living area and 2 baths, what would it sell for in Area 1? Area 2? Area 3? What is the percent difference between the lowest and highest price? (2)

g) We have not yet dealt with the question of whether the coefficients in Regression 5 are reasonable. In order to do this, look at two homes in Area 1 that have two baths. If one has 2 (thousand) square feet of living area and the other 3, how would their prices differ? Does that seem reasonable? Try the same for a home in Area 3. (3) [16]

h) As I warned you, I now repeated Regression 1 as Regression 7, without using the VIFs. Much to my surprise, I ended up dropping the same variables as I did after Regression 1. Why? (1)

This is a rather outdated technique of eliminating insignificant coefficients and worrying about R-squared and adjusted R-squared.

i) Continuing in the same way, I worked myself to Regression 9. Looking at the things I usually check, this looked pretty good. Then I tried to check the coefficients in the same way that I did in g). Why was I very unhappy? What is there in Regression 8 that could explain these results? (4)

j) Regression 11 is a stepwise regression. The printout, which continues on page 7, presents four different possible regressions in column form. In each case a coefficient has a t-value under it and a p-value for a significance test. After the fourth try, the computer refused to add any more independent variables. The only regression here that I thought was worth looking at was the one with four independent variables. What can you tell me about its acceptability? (3)

Stepwise regression adds independent variables in sequence, starting with the one with the highest correlation with the dependent variable. It then adds or removes variables using an F test for significant improvement, picking, at first, independent variables that provide the lowest p-values for the test. When it can no longer find new variables that provide an F-ratio with a p-value below a pre-specified significance level, it quits.

k) Do an F test to compare Regressions 2 and 3 and to find out if Lot1 and Lot2 have any explanatory power. (3)

II. Hand in your fourth computer problem. (2 to 7 points)

III. Do at least 4 of the following 7 problems (at least 12 points each), or do sections adding to at least 50 points. (Anything extra you do helps, and grades wrap around.) You must do parts a) and b) of Problem 1. Show your work! State $H_0$ and $H_1$ where applicable. Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests - that is, explain your hypotheses and what values from what table were used to test them. Clearly label what section of each problem you are doing! The entire test has about 151 points, but 70 is considered a perfect score. Don't waste our time by telling me that two means, proportions, variances or medians don't look the same to you. You need statistical tests!
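For part k) above, the F test comparing two nested regressions can be sketched in code. This is a hedged example: the SSE values and degrees of freedom below are placeholders, since the exam's printouts for Regressions 2 and 3 are not reproduced here.

```python
# Partial F test for whether extra regressors (here, Lot1 and Lot2) add
# explanatory power. Numbers are placeholders, not the exam's printout values.
from scipy import stats

def partial_f(sse_reduced, sse_full, df_full, n_extra):
    """F = [(SSE_reduced - SSE_full)/r] / [SSE_full/df_full]."""
    f = ((sse_reduced - sse_full) / n_extra) / (sse_full / df_full)
    p = stats.f.sf(f, n_extra, df_full)
    return f, p

f, p = partial_f(sse_reduced=120.0, sse_full=100.0, df_full=50, n_extra=2)
print(f"F = {f:.2f}, p = {p:.4f}")  # reject 'no added power' if p < alpha
```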
There are two blank pages below.

1. a) If I want to test to see if the mean of $x_2$ is larger than the mean of $x_1$, my null hypotheses are: (Note: $D = \mu_1 - \mu_2$.) The eight choices i) through viii) each pair one of the relations $\le,\ \ge,\ <,\ >,\ =,\ \ne$ between $\mu_1$ and $\mu_2$ with the corresponding statement about $D$, for example $\mu_1 \le \mu_2$ with $D \le 0$. (2)

Solution: Since this question will appear on future exams, it is left to the student.

The first two columns below represent times for 25 workers on an industrial task. The third column is the difference between them, $d = x_1 - x_2$.

Row   x1     x2     d
1     6.11   4.81   1.30
2     5.13   4.19   0.94
3     6.42   5.17   1.25
4     4.65   4.07   0.58
5     5.82   4.58   1.24
6     4.08   2.97   1.11
7     4.01   3.39   0.62
8     5.26   4.14   1.12
9     5.25   4.31   0.94
10    7.66   6.68   0.98
11    6.29   5.37   0.92
12    5.41   3.95   1.46
13    6.17   4.93   1.24
14    5.50   4.04   1.46
15    4.06   2.40   1.66
16    6.19   4.71   1.48
17    6.71   5.93   0.78
18    4.41   2.93   1.48
19    5.25   4.25   1.00
20    4.85   4.41   0.44
21    6.50   4.68   1.82
22    5.24   3.50   1.74
23    7.29   6.09   1.20
24    4.99   2.87   2.12
25    4.26   3.06   1.20

Assume that $\alpha = .05$. Minitab gives us the following summary (edited).

Descriptive Statistics: x1, x2, d
Variable  N   N*  Mean  SE Mean  StDev  Minimum  Q1      Median  Q3      Maximum
x1        25  0   5.50  0.200    1.00   4.010    4.750   5.260   6.240   7.660
x2        25  0   4.30  0.212    1.06   2.400    3.445   4.250   4.870   6.680
d         25  0   1.20  .......  .....  0.4400   0.9400  1.200   1.4700  2.120

In the d column, the column sum is 30.08 and the sum of the squares of the first 24 numbers is 38.585. Do not recompute things that have been done for you if you ever want to get much done on this exam. Clearly label parts b, c, d etc. The null hypothesis is the same for parts c, d and e, so state it clearly.

b) Find the sample variance for the d column. (2)

Solution: $\sum d^2 = 38.585 + 1.20^2 = 40.025$, so
$s_d^2 = \frac{\sum d^2 - n\bar d^2}{n-1} = \frac{40.025 - 25(1.20)^2}{24} = \frac{4.025}{24} = 0.16771$ and $s_d = 0.409$.
The last line of the output above should therefore read: d 25 0 1.20 0.082 0.409 0.4400 etc.

The formula table has the following for the difference between two means.

D1. Difference between two means, $\sigma$ known. Interval: $D = \bar d \pm z_{\alpha/2}\sigma_{\bar d}$, where $\sigma_{\bar d} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, $\bar d = \bar x_1 - \bar x_2$ and $D = \mu_1 - \mu_2$. Hypotheses: $H_0\!: D = D_0$, $H_1\!: D \ne D_0$. Test ratio: $z = \frac{\bar d - D_0}{\sigma_{\bar d}}$. Critical value: $\bar d_{cv} = D_0 \pm z_{\alpha/2}\sigma_{\bar d}$.

D2. Difference between two means, $\sigma$ unknown, variances assumed equal. Interval: $D = \bar d \pm t_{\alpha/2}s_{\bar d}$, where $s_{\bar d} = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, $\hat s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $DF = n_1 + n_2 - 2$. Test ratio: $t = \frac{\bar d - D_0}{s_{\bar d}}$. Critical value: $\bar d_{cv} = D_0 \pm t_{\alpha/2}s_{\bar d}$.

D3. Difference between two means, $\sigma$ unknown, variances assumed unequal. $s_{\bar d} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $DF = \dfrac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$. Test ratio and critical value as in D2.

D4. Difference between two means, paired data. $s_{\bar d} = \frac{s_d}{\sqrt n}$ with $df = n - 1$, where $n = n_1 = n_2$. Test ratio and critical value as in D2.

We have the following facts: $n = n_1 = n_2 = 25$, $\bar x_1 = 5.50$, $s_1 = 1.00$, $s_{\bar x_1} = 0.200$, $\bar x_2 = 4.30$, $s_2 = 1.06$, $s_{\bar x_2} = 0.212$, $\bar d = 5.50 - 4.30 = 1.20$, $s_d = 0.409$ and, for paired data, $s_{\bar d} = \sqrt{\frac{s_d^2}{n}} = \sqrt{\frac{0.16771}{25}} = 0.08190$.

c) On the assumption that the underlying distributions are Normal and that the first two columns represent independent samples from populations that represent plants 1 and 2 and come from populations with similar variances, can we conclude that average workers in plant 2 complete the task faster than those in plant 1? (4)

Solution: Our hypotheses are $H_0\!: \mu_1 \le \mu_2$ and $H_1\!: \mu_1 > \mu_2$, or $H_0\!: D \le 0$ and $H_1\!: D > 0$ if $D = \mu_1 - \mu_2$. This is a right-side test: if we compute a critical value for $\bar d$, we need a $\bar d_{cv}$ above zero, and if we use a test ratio, the rejection region is on the right side of the t distribution, above zero. $df = n_1 + n_2 - 2 = 25 + 25 - 2 = 48$. Because we have assumed equal variances, we use method D2.

$\hat s_p^2 = \frac{24(1.00)^2 + 24(1.06)^2}{48} = \frac{1.0000 + 1.1236}{2} = 1.0618$ and $s_{\bar d} = \sqrt{1.0618\left(\frac{1}{25} + \frac{1}{25}\right)} = \sqrt{0.084944} = 0.2915$.

If we use a test ratio, we compute $t = \frac{\bar d - D_0}{s_{\bar d}} = \frac{1.20 - 0}{0.2915} = 4.1166$. $t_{.05}^{48} = 1.677$ and $t_{.025}^{48} = 2.011$. Because this is a one-sided test, we reject the null hypothesis if the t ratio is above 1.677. It is, so we reject $H_0$ and conclude that workers in plant 2 are faster. If we use a critical value for $\bar d$, we use $\bar d_{cv} = D_0 + t_{.05}s_{\bar d} = 0 + 1.677(0.2915) = 0.489$. Since $\bar d = 1.20$ is above 0.489, we reject the null hypothesis.

d) (Extra credit) Repeat part c) after dropping the assumption that the variances are similar. (5)

Solution: If we drop the equal-variance assumption, we use method D3. We have $s_{\bar x_1} = 0.200$ and $s_{\bar x_2} = 0.212$, so $\frac{s_1^2}{n_1} = 0.200^2 = 0.04000$, $\frac{s_2^2}{n_2} = 0.212^2 = 0.04494$ and $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = 0.08494$. So

$DF = \frac{(0.08494)^2}{\frac{(0.04000)^2}{24} + \frac{(0.04494)^2}{24}} = \frac{0.0072148(24)}{0.0016000 + 0.0020196} = \frac{0.1731552}{0.0036196} = 47.838$,

which we round down to 47. $s_{\bar d} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{0.08494} = 0.2915$. If we use a test ratio, we compute $t = \frac{1.20 - 0}{0.2915} = 4.1166$. $t_{.05}^{47} = 1.678$ and $t_{.025}^{47} = 2.012$. Because this is a one-sided test, we reject the null hypothesis if the t ratio is above 1.678. It is, so we reject $H_0$ and conclude that workers in plant 2 are faster. If we use a critical value for $\bar d$, we want $\bar d_{cv} = 0 + 1.678(0.2915) = 0.489$. Since $\bar d = 1.20$ is above 0.489, we reject the null hypothesis.

e) Actually, these data supposedly represent the performance of a single sample of 25 workers on two administrations of a standard test of manual dexterity. The question was 'Did the time for the test improve between the first and second administration?' (3)

These are now paired data and we use method D4. $\bar d = 1.20$, $s_d = 0.409$ and $s_{\bar d} = \sqrt{\frac{s_d^2}{n}} = \sqrt{\frac{0.16771}{25}} = 0.08190$. $t_{.05}^{24} = 1.711$ and $t_{.025}^{24} = 2.064$. If we use a test ratio, we compute $t = \frac{\bar d - D_0}{s_{\bar d}} = \frac{1.20 - 0}{0.08190} = 14.6520$. Because this is a one-sided test, we reject the null hypothesis if the t ratio is above 1.711. It is, so we reject $H_0$ and conclude that times did improve between the first and second administrations. If we use a critical value for $\bar d$, we want $\bar d_{cv} = 0 + 1.711(0.08190) = 0.140$. Since $\bar d = 1.20$ is above 0.140, we reject the null hypothesis. [11]

f) Assume that the means above come from independent samples, but that the data represent samples from populations with known population variances of 1.00 and 1.06. Test the null hypothesis that you used in part c) and find an exact p-value. (3)

We can now say that $\sigma_{\bar d} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{0.08494} = 0.2915$ and that $z = \frac{\bar d - D_0}{\sigma_{\bar d}} = \frac{1.20 - 0}{0.2915} = 4.1166$. The p-value to four places is $P(z > 4.11) = .5 - .5000 = .0000$. [14]

g) Using the value of $\sigma_{\bar d}$ that you used in f), make a confidence interval with a confidence level of 92%. You must find the value of z needed to do this first. Of course, it is not on the t-table. (2) [16]

Solution: The significance level is $\alpha = 1 - .92 = .08$, so $z_{\alpha/2} = z_{.04}$. To find $z_{.04}$, make a diagram. Draw a Normal curve with a mean at 0. $z_{.04}$ is the value of z with 4% of the distribution above it. Since 100 - 4 = 96, it is also the 96th percentile. Since 50% of the standardized Normal distribution is below zero, your diagram should show that the probability between zero and $z_{.04}$ is 96% - 50% = 46%, that is, $P(0 \le z \le z_{.04}) = .4600$.
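Before reading the required z value from the table, here is a quick software check of parts c) through e). This is a sketch using scipy and the 25 paired observations from the table above; the t statistics should come out near 4.12, 4.12 and 14.65, matching the hand computations up to rounding.

```python
# Check of parts c)-e): pooled, Welch and paired one-sided t tests.
from scipy import stats

x1 = [6.11, 5.13, 6.42, 4.65, 5.82, 4.08, 4.01, 5.26, 5.25, 7.66, 6.29, 5.41,
      6.17, 5.50, 4.06, 6.19, 6.71, 4.41, 5.25, 4.85, 6.50, 5.24, 7.29, 4.99, 4.26]
x2 = [4.81, 4.19, 5.17, 4.07, 4.58, 2.97, 3.39, 4.14, 4.31, 6.68, 5.37, 3.95,
      4.93, 4.04, 2.40, 4.71, 5.93, 2.93, 4.25, 4.41, 4.68, 3.50, 6.09, 2.87, 3.06]

# c) pooled-variance test of H1: mu1 > mu2
print(stats.ttest_ind(x1, x2, equal_var=True, alternative='greater'))
# d) Welch test (variances not assumed equal)
print(stats.ttest_ind(x1, x2, equal_var=False, alternative='greater'))
# e) paired test on d = x1 - x2
print(stats.ttest_rel(x1, x2, alternative='greater'))
```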
The relevant part of the Normal table appears below.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633
1.8   0.4641  0.4649  0.4656  0.4664  0.4671  0.4678  0.4686  0.4693  0.4699  0.4706
1.9   0.4713  0.4719  0.4726  0.4732  0.4738  0.4744  0.4750  0.4756  0.4761  0.4767

The closest we can come to .4600 is $P(0 \le z \le 1.75) = .4599$ (1.76 is also acceptable here), so $z_{.04} = 1.75$. A two-sided confidence interval for the difference between the means is $D = \bar d \pm z_{\alpha/2}\sigma_{\bar d} = 1.20 \pm 1.75(0.2915) = 1.20 \pm 0.51$, or 0.69 to 1.71.

2. Let us expand the problem of question 1 by adding another column. The full data set, with lots done for you, looks like this. The first three columns represent the given data. In the next three columns I have taken the first three columns and squared them. I have added the first three columns across to get the seventh column, a row sum. I have computed row means in the eighth column and their squares in the ninth. The tenth column is a row sum of squares. In the 11th to 13th columns the numbers in the first three columns are ranked from 1 to 75. Sums are provided for all 13 columns.

Row  x1    x2    x3    x1sq     x2sq     x3sq     rsum   rmean  rmsq     rssq     rank1  rank2  rank3
1    6.11  4.81  5.46  37.3321  23.1361  29.8116  16.38  5.46   29.8116   90.280  63.0   40.0   54.0
2    5.13  4.19  4.66  26.3169  17.5561  21.7156  13.98  4.66   21.7156   65.589  44.0   21.0   32.0
3    6.42  5.17  5.80  41.2164  26.7289  33.6400  17.39  5.80   33.6013  101.585  68.0   45.0   58.0
4    4.65  4.07  4.36  21.6225  16.5649  19.0096  13.08  4.36   19.0096   57.197  31.0   18.0   25.0
5    5.82  4.58  5.20  33.8724  20.9764  27.0400  15.60  5.20   27.0400   81.889  59.0   29.0   46.0
6    4.08  2.97  3.53  16.6464   8.8209  12.4609  10.58  3.53   12.4374   37.928  19.0    4.0    9.0
7    4.01  3.39  3.70  16.0801  11.4921  13.6900  11.10  3.70   13.6900   41.262  15.0    7.0   12.0
8    5.26  4.14  4.70  27.6676  17.1396  22.0900  14.10  4.70   22.0900   66.897  50.0   20.0   35.0
9    5.25  4.31  4.78  27.5625  18.5761  22.8484  14.34  4.78   22.8484   68.987  48.5   24.0   39.0
10   7.66  6.68  7.17  58.6756  44.6224  51.4089  21.51  7.17   51.4089  154.707  75.0   70.0   73.0
11   6.29  5.37  5.83  39.5641  28.8369  33.9889  17.49  5.83   33.9889  102.390  66.0   51.0   60.0
12   5.41  3.95  4.68  29.2681  15.6025  21.9024  14.04  4.68   21.9024   66.773  52.0   14.0   33.5
13   6.17  4.93  5.55  38.0689  24.3049  30.8025  16.65  5.55   30.8025   93.176  64.0   42.0   56.0
14   5.50  4.04  4.77  30.2500  16.3216  22.7529  14.31  4.77   22.7529   69.325  55.0   16.0   38.0
15   4.06  2.40  3.23  16.4836   5.7600  10.4329   9.69  3.23   10.4329   32.676  17.0    1.0    6.0
16   6.19  4.71  5.45  38.3161  22.1841  29.7025  16.35  5.45   29.7025   90.203  65.0   36.0   53.0
17   6.71  5.93  6.32  45.0241  35.1649  39.9424  18.96  6.32   39.9424  120.131  72.0   61.0   67.0
18   4.41  2.93  3.67  19.4481   8.5849  13.4689  11.01  3.67   13.4689   41.502  27.5    3.0   11.0
19   5.25  4.25  4.75  27.5625  18.0625  22.5625  14.25  4.75   22.5625   68.188  48.5   22.0   37.0
20   4.85  4.41  4.63  23.5225  19.4481  21.4369  13.89  4.63   21.4369   64.408  41.0   27.5   30.0
21   6.50  4.68  5.59  42.2500  21.9024  31.2481  16.77  5.59   31.2481   95.401  69.0   33.5   57.0
22   5.24  3.50  4.37  27.4576  12.2500  19.0969  13.11  4.37   19.0969   58.805  47.0    8.0   26.0
23   7.29  6.09  6.69  53.1441  37.0881  44.7561  20.07  6.69   44.7561  134.988  74.0   62.0   71.0
24   4.99  2.87  3.93  24.9001   8.2369  15.4449  11.79  3.93   15.4449   48.582  43.0    2.0   13.0
25   4.26  3.06  3.66  18.1476   9.3636  13.3956  10.98  3.66   13.3956   40.907  23.0    5.0   10.0
The sums of the columns will not fit in the table, so they are printed here: (1) Sum of x1 = 137.51; (2) Sum of x2 = 107.43; (3) Sum of x3 = 122.48; (4) Sum of x1sq = 780.400; (5) Sum of x2sq = 488.725; (6) Sum of x3sq = 624.649; (7) Sum of rsum = 367.42; (8) Sum of rmean = 122.473; (9) Sum of rmsq = 624.587; (10) Sum of rssq = 1893.77; (11) Sum of rank1 = 1236.5; (12) Sum of rank2 = 662; (13) Sum of rank3 = 951.5. You are left to find column means and the grand mean. Please avoid recomputing stuff that I have done for you. Life is not that long. You will need to get column and overall means. Almost everything else is done for you.

a) Consider the first three columns to be three independent random samples from Normal distributions with similar variances. Compare the means using an appropriate statistical test or tests. (6)

Solution: Scary looking, isn't it? But almost all the real work has been done for you. In 252ANOVAex3 we have the following display, which we can bend to our will.

Tableau presented in 252ANOVAex3:

Row    Display 1  Display 2  Display 3    Sum   n_i    xbar_i    xbar_i^2
1          45         45         50       140    3    46.6667   2177.7778
2          35         30         45       110    3    36.6667   1344.4444
3          20         25         30        75    3    25.0000    625.0000
4          40         35         45       120    3    40.0000   1600.0000
5          40         40         55       135    3    45.0000   2025.0000
Sum       180        175        225       580   15   (38.6667)  7772.2222
n_j         5          5          5        15
xbar_j     36         35         45      (38.6667)
Sum x^2  6850       6375      10475      23700
xbar_j^2 1296       1225       2025       4546

Version for this problem (items from the last page have been added; the new calculations are the column and grand means):

Row     Col 1    Col 2    Col 3     Sum       n_i   xbar_i     Sum x^2   xbar_i^2
1       6.11     4.81     5.46      16.38      3    5.46        90.280   29.8116
2       5.13     4.19     4.66      13.98      3    4.66        65.589   21.7156
...     ...      ...      ...       ...       ...   ...         ...      ...
24      4.99     2.87     3.93      11.79      3    3.93        48.582   15.4449
25      4.26     3.06     3.66      10.98      3    3.66        40.907   13.3956
Sum     137.51   107.43   122.48    367.42    75   (122.473)  1893.77   624.587
n_j     25       25       25        75
xbar_j  5.5004   4.2972   4.8992   (4.8989)
Sum x^2 780.400  488.725  624.649  1893.774
xbar_j^2 30.2544 18.4659  24.0022   72.7225

Note that $\bar x = \frac{367.42}{75} = 4.8989$ and $n\bar x^2 = 75(4.8989)^2 = 1799.9416$.

This is a one-way ANOVA.
$SST = \sum x_{ijk}^2 - n\bar x^2 = 1893.774 - 1799.9416 = 93.8324$
$SSB = \sum n_j\bar x_{.j}^2 - n\bar x^2 = 25(72.7225) - 1799.9416 = 18.1209$
$SSW = SST - SSB = 93.8324 - 18.1209 = 75.7115$

Source    SS        DF   MS       F       F.05              H0
Between   18.1209    2   9.0604   8.62s   F(2,72) = 3.13    Column means equal
Within    75.7115   72   1.0516
Total     93.8324   74

The difference between the column means is significant because the computed F exceeds the table F. Minitab agrees up to rounding:

One-way ANOVA: x1, x2, x3
Source  DF  SS     MS    F     P
Factor   2  18.10  9.05  8.60  0.000
Error   72  75.72  1.05
Total   74  93.81
S = 1.025   R-Sq = 19.29%   R-Sq(adj) = 17.05%
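Part a) is easy to verify in software. A minimal sketch with scipy, reusing the three columns from the table above; the F statistic should come out near 8.6, as in the Minitab output.

```python
# Check of the one-way ANOVA in part a).
from scipy import stats

x1 = [6.11, 5.13, 6.42, 4.65, 5.82, 4.08, 4.01, 5.26, 5.25, 7.66, 6.29, 5.41,
      6.17, 5.50, 4.06, 6.19, 6.71, 4.41, 5.25, 4.85, 6.50, 5.24, 7.29, 4.99, 4.26]
x2 = [4.81, 4.19, 5.17, 4.07, 4.58, 2.97, 3.39, 4.14, 4.31, 6.68, 5.37, 3.95,
      4.93, 4.04, 2.40, 4.71, 5.93, 2.93, 4.25, 4.41, 4.68, 3.50, 6.09, 2.87, 3.06]
x3 = [5.46, 4.66, 5.80, 4.36, 5.20, 3.53, 3.70, 4.70, 4.78, 7.17, 5.83, 4.68,
      5.55, 4.77, 3.23, 5.45, 6.32, 3.67, 4.75, 4.63, 5.59, 4.37, 6.69, 3.93, 3.66]

print(stats.f_oneway(x1, x2, x3))  # F about 8.6, as in the Minitab output above
```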
b) Actually, as in 1e), these data represent three tests of a single random sample of 25 workers. Consider the data blocked by worker and compare means. (4)

Solution: If the data are cross-classified, we need a two-way ANOVA.
$SST = \sum x_{ij}^2 - n\bar x^2 = 1893.774 - 1799.9416 = 93.8324$ (the same as in the one-way ANOVA)
$SSC = \sum n_j\bar x_{.j}^2 - n\bar x^2 = 25(72.7225) - 1799.9416 = 18.1209$ (the same as SSB in the one-way ANOVA)
$SSR = C\sum\bar x_{i.}^2 - n\bar x^2 = 3(624.587) - 1799.9416 = 73.8194$
$SSW = SST - SSR - SSC = 93.8324 - 73.8194 - 18.1209 = 1.8921$

Source    SS        DF   MS       F         F.05               H0
Rows      73.8194   24   3.0755    78.0s    F(24,48) = 1.75    Row means equal
Columns   18.1209    2   9.0604   229.9s    F(2,48)  = 3.19    Column means equal
Within     1.8921   48   0.0394
Total     93.8324   74

This time, as indicated by the 's's in the F column, both the difference between the column means and the difference between the row means are significant. Minitab agrees up to rounding:

Two-way ANOVA: C11 versus C13, C12
Source  DF  SS       MS       F       P
C13     24  73.7956  3.07482   77.01  0.000
C12      2  18.0961  9.04807  226.63  0.000
Error   48   1.9164  0.03993
Total   74  93.8081
S = 0.1998   R-Sq = 97.96%   R-Sq(adj) = 96.85%

c) Consider the first three columns to be three independent random samples from a distribution that is not Normal. Compare the medians using an appropriate statistical test or tests. (5) [31]

Solution: If the columns are independent random samples and the distribution is not Normal, we have a Kruskal-Wallis test. The null hypothesis is that we have equal medians. The table on the previous page has (11) Sum of rank1 = 1236.5, (12) Sum of rank2 = 662 and (13) Sum of rank3 = 951.5. As in the previous sections, the number of items in all the columns is $n = 75$ and the number of items in each column is $n_i = 25$; we have 3 columns, but too many items in a column to use the K-W table, so we will consider the K-W statistic to have a chi-squared distribution with 3 - 1 = 2 degrees of freedom. We must compute the Kruskal-Wallis statistic

$H = \frac{12}{n(n+1)}\sum_i\frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{75(76)}\left[\frac{1236.5^2}{25} + \frac{662^2}{25} + \frac{951.5^2}{25}\right] - 3(76) = \frac{12}{5700}\cdot\frac{2872528.5}{25} - 228 = 241.8971 - 228 = 13.8971$.

Since this is larger than $\chi^2_{.05(2)} = 5.9915$, we reject our null hypothesis.

3. A sales manager wishes to predict newspaper circulation on Sunday ($y$) on the basis of weekday morning circulation ($x_1$), weekday evening circulation ($x_2$) and time ($x_3$). The data are below (use $\alpha = .01$). All circulation data are in millions sold. Note: the product $x_1y$ in the last column was computed for you.

Row   y = S   x1 = AM   x2 = PM   x3 = T   x1*y
1     54      27        34        1        1458
2     55      29        33        2        1595
3     55      30        31        3        1650
4     56      33        29        4        1848
5     56      34        29        5        1904
6     57      35        28        6        1995
7     58      36        26        7        2088
8     59      37        25        8        2183
9     60      39        24        9        2340
Sum   510     300       259       45       17061

The quantities below are given: $\sum y = 510$, $\sum x_1 = 300$, $n = 9$, $\sum x_2^2 = 7549$, $\sum x_2 = 259$, $\sum x_2y = 14623$, $\sum y^2 = 28932$, $\sum x_1^2 = 10126$ and $\sum x_1x_2 = 8525$. $\sum x_1y$ was not given, and you had to compute it as part of a). You do not need all of these.

Spare parts computation: $\sum x_1y = 17061$, $\bar Y = \frac{510}{9} = 56.6667$, $\bar X_1 = \frac{300}{9} = 33.3333$, $\bar X_2 = \frac{259}{9} = 28.7778$.

$SSX_1 = \sum X_1^2 - n\bar X_1^2 = 10126 - 9(33.3333)^2 = 126.0000$*
$SSX_2 = \sum X_2^2 - n\bar X_2^2 = 7549 - 9(28.7778)^2 = 95.5556$* (+)
$SST = SSY = \sum Y^2 - n\bar Y^2 = 28932 - 9(56.6667)^2 = 32.0000$*
$SX_1Y = \sum X_1Y - n\bar X_1\bar Y = 17061 - 9(33.3333)(56.6667) = 61.0000$
$SX_2Y = \sum X_2Y - n\bar X_2\bar Y = 14623 - 9(28.7778)(56.6667) = -53.6667$ (+)
$SX_1X_2 = \sum X_1X_2 - n\bar X_1\bar X_2 = 8525 - 9(33.3333)(28.7778) = -108.333$ (+)

*Must be positive; the rest may well be negative. (+) Needed only in the next problem.

a) Compute a simple regression of Sunday circulation against morning circulation. (8)

Solution: The coefficients are $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar X\bar Y}{\sum X^2 - n\bar X^2} = \frac{61.000}{126.000} = 0.4841$ and $b_0 = \bar Y - b_1\bar X = 56.6667 - 0.4841(33.3333) = 40.530$. So $\hat Y = 40.530 + 0.4841X$.
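A quick numeric check of part a)'s slope and intercept, as a sketch with numpy:

```python
# Simple regression of S on AM from the 'spare parts'.
import numpy as np

am = np.array([27, 29, 30, 33, 34, 35, 36, 37, 39], dtype=float)
s  = np.array([54, 55, 55, 56, 56, 57, 58, 59, 60], dtype=float)

sxy = np.sum(am * s) - len(s) * am.mean() * s.mean()   # 61.0
ssx = np.sum(am**2) - len(s) * am.mean()**2            # 126.0
b1 = sxy / ssx                                         # about 0.4841
b0 = s.mean() - b1 * am.mean()                         # about 40.53
print(b1, b0)
```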
b) Compute $R^2$. (4)

Solution: $SSR = b_1S_{xy} = 0.4841(61) = 29.5301$, so $R^2 = \frac{SSR}{SST} = \frac{b_1S_{xy}}{SS_y} = \frac{29.5301}{32.0000} = 0.9228$, or $R^2 = \frac{S_{xy}^2}{SS_xSS_y} = \frac{61^2}{126(32)} = 0.9228$.

c) Compute $s_e$. (3)

Solution: $s_e^2 = \frac{SSE}{n-2} = \frac{SST - SSR}{n-2} = \frac{SS_y - b_1S_{xy}}{n-2} = \frac{32 - 29.5301}{7} = 0.35284$, so $s_e = \sqrt{0.35284} = 0.5940$.

d) Compute $s_{b_1}$ (the standard deviation of the slope) and do a confidence interval for $\beta_1$. (3)

Solution: $s_{b_0}^2 = s_e^2\left[\frac{1}{n} + \frac{\bar X^2}{SS_x}\right]$ and $s_{b_1}^2 = \frac{s_e^2}{SS_x}$. We, of course, want the second formula: $s_{b_1}^2 = \frac{0.35284}{126} = 0.0028003$, so $s_{b_1} = 0.0529$. To test for significance, $H_0\!: \beta_1 = 0$, we compute $t = \frac{b_1 - 0}{s_{b_1}} = \frac{0.4841}{0.0529} = 9.1512$. Since $\alpha = .01$, $t_{.005}^{7} = 3.499$ and the 'do not reject' zone is between $\pm 3.499$. Since our computed t does not lie between these values, reject the null hypothesis and declare $b_1$ significant. The corresponding confidence interval is $\beta_1 = b_1 \pm t_{.005}^{7}s_{b_1} = 0.4841 \pm 3.499(0.0529) = 0.48 \pm 0.19$, which does not include zero.

e) Do a prediction interval for Sunday circulation when morning circulation rises to 45 million. (3) Why is this interval likely to be larger than other prediction intervals we might compute for morning circulation we have actually observed? (1) [53]

Solution: The prediction interval is $Y_0 = \hat Y_0 \pm ts_Y$, where $s_Y^2 = s_e^2\left[\frac{(X_0 - \bar X)^2}{SS_x} + \frac{1}{n} + 1\right]$ and $\hat Y_0 = b_0 + b_1X_0$ with $X_0 = 45$. We have already found $\bar X = 33.3333$, $s_e^2 = 0.35284$ and $SS_x = 126$. So $\hat Y_0 = 40.530 + 0.4841(45) = 62.3145$,
$s_Y^2 = 0.35284\left[\frac{(45 - 33.3333)^2}{126} + \frac{1}{9} + 1\right] = 0.35284(1.08025 + 0.1111 + 1) = 0.7732$ and $s_Y = 0.8793$. With $t_{.005}^{7} = 3.499$, $Y_0 = \hat Y_0 \pm ts_Y = 62.3145 \pm 3.499(0.8793) = 62.3 \pm 3.1$. Because 45 is far above the mean of the observed values of X, it yields a relatively large prediction interval.

4. The data and given sums from problem 3 are repeated unchanged (use $\alpha = .01$).

a) Do a multiple regression of Sunday circulation against morning and evening circulation. (12)

Solution: We have already computed the spare parts: $\bar Y = 56.6667$, $\bar X_1 = 33.3333$, $\bar X_2 = 28.7778$, $SSX_1 = 126.0000$*, $SSX_2 = 95.5556$*, $SST = SSY = 32.0000$*, $SX_1Y = 61.0000$, $SX_2Y = -53.6667$ and $SX_1X_2 = -108.333$. (*These spare parts must be positive; the rest may well be negative.)

We substitute these numbers into the simplified normal equations
$SX_1Y = b_1\,SSX_1 + b_2\,SX_1X_2$
$SX_2Y = b_1\,SX_1X_2 + b_2\,SSX_2$,
which are
$61.0000 = 126.000\,b_1 - 108.333\,b_2$
$-53.6667 = -108.333\,b_1 + 95.5556\,b_2$,
and solve them as two equations in two unknowns for $b_1$ and $b_2$. These are a fairly tough pair of equations to solve until we notice that, if we multiply the second equation by $\frac{108.333}{95.5556} = 1.133719$, the coefficient of $b_2$ becomes 108.333. The equations become
$61.0000 = 126.000\,b_1 - 108.333\,b_2$
$-60.8430 = -122.819\,b_1 + 108.333\,b_2$.
If we add these together, the $b_2$ terms cancel and we get $0.1570 = 3.181\,b_1$, so $b_1 = \frac{0.1570}{3.181} = 0.04936$. Now remember that $61 = 126\,b_1 - 108.333\,b_2$. This can be rewritten as $108.333\,b_2 = 126\,b_1 - 61 = 126(0.04936) - 61 = -54.781$, so $b_2 = \frac{-54.781}{108.333} = -0.5057$. (It's worth checking your work by substituting your values of $b_1$ and $b_2$ back into the normal equations.)
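The two-equation solve is easy to verify numerically; a minimal sketch with numpy:

```python
# Check of the normal equations for b1 and b2.
import numpy as np

A = np.array([[126.000, -108.333],
              [-108.333,  95.5556]])
c = np.array([61.0000, -53.6667])
b1, b2 = np.linalg.solve(A, c)
print(b1, b2)  # roughly 0.0494 and -0.5057
```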
Finally, we get $b_0$ by using $\bar Y = 56.6667$, $\bar X_1 = 33.3333$ and $\bar X_2 = 28.7778$ in $b_0 = \bar Y - b_1\bar X_1 - b_2\bar X_2 = 56.6667 - 0.04936(33.3333) - (-0.5057)(28.7778) = 56.6667 - 1.6453 + 14.5529 = 69.574$. Thus our equation is $\hat Y = b_0 + b_1X_1 + b_2X_2 = 69.574 + 0.0494X_1 - 0.5057X_2$.

Note: My flabber is ghasted. Minitab said: The regression equation is S = 69.6 + 0.049 AM - 0.506 PM. This is very close for a first try.

b) Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. (6)

Solution: From the first regression we have $SST = SS_y = 32.000$ and $SSR = b_1S_{x_1y} = 0.4841(61) = 29.5301$, so $R^2 = R^2_{Y.1} = 0.9228$*. From the second regression, using $SX_1Y = 61.0000$ and $SX_2Y = -53.6667$,
$SSR = b_1S_{x_1y} + b_2S_{x_2y} = 0.04936(61) + (-0.5057)(-53.6667) = 3.0110 + 27.1393 = 30.1503$,
so $R^2_{Y.12} = \frac{SSR}{SST} = \frac{30.1503}{32.0000} = 0.9422$*. (*These numbers must be positive; the rest may well be negative.)

If we use $\bar R^2$, which is $R^2$ adjusted for degrees of freedom, the number of independent variables in the first regression was $k = 1$ and $\bar R^2 = \frac{(n-1)R^2 - k}{n - k - 1} = \frac{8(0.9228) - 1}{7} = 0.9117$; for the second regression $k = 2$ and $\bar R^2 = \frac{8(0.9422) - 2}{6} = 0.9229$. R-squared adjusted is supposed to rise if our new variable has any explanatory power, and here it rises only slightly.

There are two ways to do the F test. We can use the second regression to give us $SSE = SST - SSR_2 = 32.000 - 30.1503 = 1.8497$. In the second regression, the explained sum of squares rises by $30.1503 - 29.5301 = 0.6202$. We can make an ANOVA table for looking at a new variable as follows. Assume that we have $SSR_1$ for the first regression on $k$ independent variables and add $r$ new independent variables to get a new $SSR_2$:

Source            SS            DF         MS     F_calc     compare with
First regression  SSR1          k          MSR1   MSR1/MSE   F(k, n-k-r-1)
2nd regression    SSR2 - SSR1   r          MSR2   MSR2/MSE   F(r, n-k-r-1)
Error             SSE           n-k-r-1    MSE
Total             SST           n-1

Source            SS        DF   MS        F_calc    F.01
First regression  29.5301    1   29.5301   95.8s     F(1,6) = 13.75
2nd regression     0.6202    1    0.6202    2.01ns   F(1,6) = 13.75
Error              1.8497    6    0.30828
Total             32         8

We can get the same results using $R^2$. Remember $R^2_{Y.12} = 0.9422$ and $R^2_{Y.1} = 0.9228$:

Source            SS                              DF   MS         F_calc
First regression  R^2(Y.1) = .9228                 1   .9228      95.8s
2nd regression    R^2(Y.12) - R^2(Y.1) = .0194     1   .0194       2.01ns
Error             1 - R^2(Y.12) = .0578            6   .0096333
Total             1                                8

Since the F for the added variable is far below $F_{.01}(1,6) = 13.75$, evening circulation does not add significant explanatory power.

c) Compute the regression sum of squares and use it in an F test to test the usefulness of this regression. (5)

Solution: Remember $SSR = 30.1503$ and $SSE = SST - SSR_2 = 32.000 - 30.1503 = 1.8497$.

Source                     SS        DF   MS        F_calc    F.01
First and 2nd regression   30.1503    2   15.0752   48.90s    F(2,6) = 10.92
Error                       1.8497    6    0.30828
Total                      32         8

The null hypothesis is no connection between Y and the X's. It is rejected.

d) Use your regression to predict the Sunday circulation when AM circulation is 40 and PM circulation is 23. (2)

Solution: $\hat Y = 69.574 + 0.0494(40) - 0.5057(23) = 69.574 + 1.976 - 11.631 = 59.92$.

e) Use the directions in the outline to make this estimate into a confidence interval and a prediction interval. (4)

Solution: The error mean square is 0.30828 and has 6 degrees of freedom, so use $t_{.005}^{6} = 3.707$ and $s_e = \sqrt{0.30828} = 0.5552$.
The outline says that an approximate confidence interval is $Y_0 = \hat Y_0 \pm t\frac{s_e}{\sqrt n} = 59.92 \pm 3.707\frac{0.5552}{\sqrt 9} = 59.92 \pm 0.69$ and an approximate prediction interval is $Y_0 = \hat Y_0 \pm ts_e = 59.92 \pm 3.707(0.5552) = 59.92 \pm 2.06$. [82]

For the record, here is what the computer got; the hand computations above agree with it up to rounding.

Regression Analysis: S versus AM
The regression equation is S = 40.5 + 0.484 AM
Predictor  Coef     SE Coef  T      P
Constant   40.529   1.774    22.84  0.000
AM         0.48413  0.05290   9.15  0.000
S = 0.593808   R-Sq = 92.3%   R-Sq(adj) = 91.2%
Analysis of Variance
Source          DF  SS      MS      F      P
Regression       1  29.532  29.532  83.75  0.000
Residual Error   7   2.468   0.353
Total            8  32.000

Regression Analysis: S versus AM, PM
The regression equation is S = 69.6 + 0.049 AM - 0.506 PM
Predictor  Coef     SE Coef  T      P
Constant   69.57    20.61     3.38  0.015
AM         0.0494   0.3115    0.16  0.879
PM        -0.5057   0.3577   -1.41  0.207
S = 0.555511   R-Sq = 94.2%   R-Sq(adj) = 92.3%
Analysis of Variance
Source          DF  SS      MS      F      P
Regression       2  30.148  15.074  48.85  0.000
Residual Error   6   1.852   0.309
Total            8  32.000
Source  DF  Seq SS
AM       1  29.532
PM       1   0.617

5. The data from problem 3 are repeated (use $\alpha = .01$). The time variable is now added, with the following results.

MTB > Regress c1 3 c2 c3 c10;
SUBC>   VIF;
SUBC>   DW.

Regression Analysis: S versus AM, PM, T
The regression equation is S = 62.2 - 0.253 AM - 0.071 PM + 0.991 T
Predictor  Coef     SE Coef  T      P      VIF
Constant   62.19    17.23     3.61  0.015
AM        -0.2533   0.2960   -0.86  0.431  53.6
PM        -0.0707   0.3642   -0.19  0.854  61.6
T          0.9913   0.4957    2.00  0.102  71.7
S = 0.453612   R-Sq = 96.8%   R-Sq(adj) = 94.9%
Analysis of Variance
Source          DF  SS      MS      F      P
Regression       3  30.971  10.324  50.17  0.000
Residual Error   5   1.029   0.206
Total            8  32.000
Durbin-Watson statistic = 2.13279

a) What do the significance tests on the coefficients reveal? Give reasons. (2)

Solution: All of the p-values are above $\alpha = .01$, indicating that the null hypotheses that these coefficients are actually zero cannot be rejected. None of the coefficients is significant, in spite of a highly significant ANOVA result.

b) Can you explain why the coefficients of AM and PM seem unreasonable? What is the apparent reason for this? (2)

Solution: Though it may make sense to have a negative coefficient for evening readership, which is obviously declining and being replaced by morning readership, one would expect that, if Sunday readership reflects interest in the news, there would be a positive coefficient relating AM readership to Sunday readership. The problem seems to be the high VIFs, indicating a substantial degree of collinearity. In the presence of collinearity, coefficients are not reliable.

c) Do a 2% two-sided Durbin-Watson test on the result as suggested in class. What is the hypothesis tested and what is the result? (3)

Solution: The line presented in the notes is below.

 0          d_L          d_U            2           4-d_U         4-d_L          4
 |  reject   |      ?      |      do not reject       |      ?      |   reject   |

Unfortunately, the 1% values given in the text start at $n = 15$, but we can guess that the 1% values for $k = 3$ would be somewhere near $d_L = 0.5$ and $d_U = 1.5$. The null hypothesis is that there is no autocorrelation, and we cannot reject the null hypothesis if the Durbin-Watson statistic is between $d_U$ and $4 - d_U$. The statistic cited above is certainly close to 2 and apparently between $d_U$ and $4 - d_U$, so it is unlikely that we can reject the null hypothesis.
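The VIFs and the Durbin-Watson statistic in the printout can be reproduced in software. A sketch with statsmodels, under the assumption that the data are loaded as plain lists:

```python
# VIFs and Durbin-Watson statistic for the S ~ AM + PM + T regression.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

am = [27, 29, 30, 33, 34, 35, 36, 37, 39]
pm = [34, 33, 31, 29, 29, 28, 26, 25, 24]
t  = list(range(1, 10))
s  = [54, 55, 55, 56, 56, 57, 58, 59, 60]

X = sm.add_constant(np.column_stack([am, pm, t]).astype(float))
fit = sm.OLS(np.array(s, dtype=float), X).fit()
print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])  # about 53.6, 61.6, 71.7
print(durbin_watson(fit.resid))  # about 2.13
```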
d) Reuse your spare parts from the previous regression, if possible, to compute the correlation between AM and PM circulation and test it for significance. (4)

Solution: The formula for correlation is $r = \frac{\sum XY - n\bar X\bar Y}{\sqrt{(\sum X^2 - n\bar X^2)(\sum Y^2 - n\bar Y^2)}}$. We have $SSX_1 = 126.0000$*, $SSX_2 = 95.5556$* and $SX_1X_2 = -108.333$. Because of the sign of $SX_1X_2$, the correlation will be negative:

$r = \frac{SX_1X_2}{\sqrt{SSX_1\,SSX_2}} = \frac{-108.333}{\sqrt{126(95.5556)}} = -0.9872$, so $r^2 = 0.9747$.

The test of the null hypothesis $H_0\!: \rho = 0$ is $t = \frac{r}{s_r} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{-.9872}{\sqrt{\frac{1 - 0.9747}{7}}} = \frac{-.9872}{0.0601189} = -16.421$. We would use $t_{.005}^{7} = 3.499$; the 'do not reject' zone is between $\pm 3.499$. Since -16.421 does not lie in this interval, we reject the null hypothesis: the correlation is significant.

e) Compute a rank correlation between AM and PM circulation and test it for significance. Can you explain why it is larger in magnitude than the correlation in d)? (4)

Solution:

Row  S    AM   PM   T   r1 (AM rank)  r2 (PM rank)  d = r2 - r1   d^2
1    54   27   34   1        1             9            8          64
2    55   29   33   2        2             8            6          36
3    55   30   31   3        3             7            4          16
4    56   33   29   4        4             5.5          1.5         2.25
5    56   34   29   5        5             5.5          0.5         0.25
6    57   35   28   6        6             4           -2           4
7    58   36   26   7        7             3           -4          16
8    59   37   25   8        8             2           -6          36
9    60   39   24   9        9             1           -8          64
Sum                                                     0         238.5

A correlation coefficient between $r_x$ and $r_y$ can be computed as in d) above, but it is easier to compute $d = r_2 - r_1$ and then $r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(238.5)}{9(81 - 1)} = 1 - \frac{1431}{720} = -0.9875$. This can be given a t test for $H_0\!: \rho = 0$ as in d) above, but for $n$ between 4 and 30 a special table should be used. Part of the table from the Supplement is repeated below. We do not reject the null hypothesis of no rank correlation if $r_s$ is between $\pm .8167$. Since our value of the rank correlation is below -.8167, reject the null hypothesis.

n   .100    .050    .025    .010    .005    .001
4   .8000   .8000
5   .7000   .8000   .9000   .9000
6   .6000   .7714   .8286   .8857   .9429
7   .5357   .6786   .7450   .8571   .8929   .9643
8   .5000   .6190   .7143   .8095   .8571   .9286
9   .4667   .5833   .6833   .7667   .8167   .9000

f) Test the hypothesis that the correlation that you computed in d) is -.99. (4) [101]

Note: since almost no one did this problem, the correct values of n, r and $\rho_0$ were never substituted in these equations. The solution below tests $H_0\!: \rho_{xy} = 0.8$ against $H_1\!: \rho_{xy} \ne 0.8$ with $n = 10$, $r = .704$ (so $r^2 = .496$) and $\alpha = .05$.

This time compute Fisher's z-transformation (because $\rho_0$ is not zero):
$\tilde z = \frac12\ln\!\left(\frac{1 + r}{1 - r}\right) = \frac12\ln\!\left(\frac{1.704}{0.296}\right) = \frac12\ln(5.75676) = \frac12(1.75037) = 0.87519$
$\zeta_0 = \frac12\ln\!\left(\frac{1 + \rho_0}{1 - \rho_0}\right) = \frac12\ln\!\left(\frac{1.8}{0.2}\right) = \frac12\ln(9.0000) = \frac12(2.19722) = 1.09861$
$s_z = \sqrt{\frac{1}{n - 3}} = \sqrt{\frac{1}{7}} = 0.37796$.
Finally $t = \frac{\tilde z - \zeta_0}{s_z} = \frac{0.87519 - 1.09861}{0.37796} = -0.591$. Compare this with $t_{.025}^{n-2} = t_{.025}^{8} = 2.306$. Since -0.591 lies between $\pm 2.306$, do not reject the null hypothesis.
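Part f)'s Fisher z test is easy to script. A sketch using the note's example numbers (n = 10, r = .704, rho0 = .8):

```python
# Fisher z test of H0: rho = 0.8.
import math
from scipy import stats

n, r, rho0 = 10, 0.704, 0.8
z_r   = 0.5 * math.log((1 + r) / (1 - r))        # 0.87519
z_rho = 0.5 * math.log((1 + rho0) / (1 - rho0))  # 1.09861
s_z = 1 / math.sqrt(n - 3)                       # 0.37796
t = (z_r - z_rho) / s_z                          # about -0.59
print(t, stats.t.ppf(0.975, n - 2))              # compare with +/- 2.306
```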
g) (Extra credit) If AM, PM and T are $x_1$, $x_2$ and $x_3$, find the partial correlation coefficient (the square root of the coefficient of partial determination) $r_{Y3.12}$. (2)

Solution: We had the results in the problem 5 printout above: the coefficient of T has $t_3 = 2.00$, and the error degrees of freedom are 5. The outline says

$r^2_{Y3.12} = \frac{t_3^2}{t_3^2 + df} = \frac{2.00^2}{2.00^2 + 5} = 0.4444$,

so $r_{Y3.12} = \sqrt{0.4444} = 0.667$, taking the positive sign because the t-ratio is positive.

6. The following times were recorded for 6 skiers on 3 slopes. In order to assess their difficulty, we look at the median time for each slope. We do not assume a Normal distribution. Do not compute the median or mean time for any slope.

Skier  Slope 1  Slope 2  Slope 3
1      4.9      6.1      5.2
2      4.5      6.0      5.1
3      4.1      5.4      4.9
4      4.4      4.7      5.1
5      4.5      4.9      4.5
6      3.3      3.8      3.9

a) Test the hypothesis that the median time on slope 1 is 4 minutes. (3 or 2 points depending on method)

Solution:

Row  x1    d = x1 - 4   |d|   rank r   r*
1    4.9     0.9        0.9     6      6+
2    4.5     0.5        0.5     3.5    3.5+
3    4.1     0.1        0.1     1      1+
4    4.4     0.4        0.4     2      2+
5    4.5     0.5        0.5     3.5    3.5+
6    3.3    -0.7        0.7     5      5-

We use a Wilcoxon signed rank test with the null hypothesis that the median of $x_1$ is 4 (or that the median of $d = x_1 - 4$ is zero). If we add together the numbers in r* with a + sign, we get $T^+ = 16$. If we do the same for numbers with a - sign, we get $T^- = 5$. To check this, note that these two numbers must add to the sum of the first n numbers, $\frac{n(n+1)}{2} = \frac{6(7)}{2} = 21$, and that $T^+ + T^- = 16 + 5 = 21$. We check 5, the smaller of the two rank sums, against the numbers in Table 7. For a two-sided 5% test, we use the .025 column. For $n = 6$, the critical value is 1, and we reject the null hypothesis only if our test statistic is below this critical value. Since our test statistic is 5, we do not reject the null hypothesis.

We could also use a sign test on this. We have 5 positive outcomes and 1 negative outcome. This is a two-sided test, so p-value $= 2P(x \le 1 \mid p = .5, n = 6) = 2(.10938) = .21876$. Since this is above $\alpha = .05$, we cannot reject the null hypothesis.

b) Test the hypothesis that slope 1 and slope 2 have the same median times. (4)

Solution:

Row  x1    x2    d = x2 - x1   |d|   rank r   r*
1    4.9   6.1     -1.2        1.2     4      4-
2    4.5   6.0     -1.5        1.5     6      6-
3    4.1   5.4     -1.3        1.3     5      5-
4    4.4   4.7     -0.3        0.3     1      1-
5    4.5   4.9     -0.4        0.4     2      2-
6    3.3   3.8     -0.5        0.5     3      3-

We use a Wilcoxon signed rank test with the null hypothesis that the medians of $x_1$ and $x_2$ are the same (or that the median of $d = x_2 - x_1$ is zero). If we add together the numbers in r* with a + sign, we get $T^+ = 0$. If we do the same for numbers with a - sign, we get $T^- = 21$. To check this, note that these two numbers must add to $\frac{n(n+1)}{2} = 21$, and that $T^+ + T^- = 0 + 21 = 21$. We check 0, the smaller of the two rank sums, against the numbers in Table 7. For a two-sided 5% test, we use the .025 column. For $n = 6$, the critical value is 1, and we reject the null hypothesis only if our test statistic is below this critical value. Since our test statistic is 0, we reject the null hypothesis.

We could also use a sign test on this. We have 0 positive outcomes and 6 negative outcomes. This is a two-sided test, so p-value $= 2P(x \ge 6 \mid p = .5, n = 6) = 2P(x = 0 \mid p = .5, n = 6) = 2(.01563) = .03126$. Since this is below $\alpha = .05$, we reject the null hypothesis.
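Parts a) and b) can be checked with scipy; a minimal sketch. scipy.stats.wilcoxon works on the differences, and binomtest gives the two-sided sign-test p-value.

```python
# Wilcoxon signed rank and sign tests for parts a) and b).
from scipy import stats

slope1 = [4.9, 4.5, 4.1, 4.4, 4.5, 3.3]
slope2 = [6.1, 6.0, 5.4, 4.7, 4.9, 3.8]

d1 = [x - 4 for x in slope1]            # a) is the median of slope 1 equal to 4?
print(stats.wilcoxon(d1))
print(stats.binomtest(5, 6, 0.5))       # sign test: 5 of 6 differences positive

print(stats.wilcoxon(slope1, slope2))   # b) equal medians on slopes 1 and 2?
print(stats.binomtest(0, 6, 0.5))       # sign test: 0 of 6 differences positive
```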
c) Test the hypothesis that the slopes all have the same median time. (4)

Solution: This is a Friedman test. $H_0\!: \eta_1 = \eta_2 = \eta_3$ (the three slopes have the same median time); $H_1\!:$ at least one of the medians differs. First we rank the data within rows. The data appear below in columns marked x1 to x3 and the ranks in columns marked r1 to r3.

Row  x1 (Slope 1)  r1    x2 (Slope 2)  r2   x3 (Slope 3)  r3
1    4.9           1     6.1           3    5.2           2
2    4.5           1     6.0           3    5.1           2
3    4.1           1     5.4           3    4.9           2
4    4.4           1     4.7           2    5.1           3
5    4.5           1.5   4.9           3    4.5           1.5
6    3.3           1     3.8           2    3.9           3
SRi                6.5                 16                 13.5

To check the ranking, note that the sum of the three rank sums is 6.5 + 16 + 13.5 = 36, and that the sum of the rank sums should be $\frac{rc(c+1)}{2} = \frac{6(3)(4)}{2} = 36$. Now compute the Friedman statistic

$\chi_F^2 = \frac{12}{rc(c+1)}\sum_i SR_i^2 - 3r(c+1) = \frac{12}{6(3)(4)}\left[6.5^2 + 16^2 + 13.5^2\right] - 3(6)(4) = \frac{480.5}{6} - 72 = 80.0833 - 72 = 8.0833$.

The relevant part of the Friedman table appears below. Since our value falls between the listed values 7.000 and 8.333, the p-value is between .012 and .029. Our null hypothesis is equal column medians, and since the p-value is below .05, we reject it.

Friedman table, c = 3, r = 6:
chi_F^2:  0.000  0.333  1.000  1.333  2.333  3.000  4.000  4.333  5.333  6.333  7.000  8.333  9.000  9.333  10.333  12.000
p-value:  1.000  .956   .740   .570   .430   .252   .184   .142   .072   .052   .029   .012   .008   .006   .002    .000

d) Explain what methods you would use in b) and c) if the columns were independent random samples. (1)

Solution: The Wilcoxon signed rank test for paired data corresponds to the Mann-Whitney-Wilcoxon test for two independent samples. The Friedman test for cross-classified data corresponds to the Kruskal-Wallis test for independent samples.

e) Rank the skiers' times on each slope from 1 (fastest) to 6. Use these as rankings of the skiers and test to see if the ranks agree between slopes. (4)

Solution: This calls for Kendall's coefficient of concordance. Take k columns with n items in each and rank each column from 1 to n. The null hypothesis is $H_0\!:$ disagreement, against $H_1\!:$ agreement. Compute a sum of ranks $SR_i$ for each row. Then $S = \sum SR_i^2 - n(\overline{SR})^2$, where $\overline{SR} = \frac{(n+1)k}{2}$ is the mean of the $SR_i$'s. If $H_0$ is disagreement, S can be checked against a table for this test: if S is above the table value $S_\alpha$, reject $H_0$. For n too large for the table, use $\chi^2_{(n-1)} = k(n-1)W$, where $W = \frac{12S}{k^2n(n^2-1)}$ is the Kendall coefficient of concordance and must be between 0 and 1. Here $k = 3$ and $n = 6$.

Row  x1 (Slope 1)  r1    x2 (Slope 2)  r2   x3 (Slope 3)  r3    SR     SR^2
1    4.9           6     6.1           6    5.2           6     18     324
2    4.5           4.5   6.0           5    5.1           4.5   14     196
3    4.1           2     5.4           4    4.9           3      9      81
4    4.4           3     4.7           2    5.1           4.5    9.5    90.25
5    4.5           4.5   4.9           3    4.5           2      9.5    90.25
6    3.3           1     3.8           1    3.9           1      3       9
Sum                                                             63     790.50

$\overline{SR} = \frac{(n+1)k}{2} = \frac{7(3)}{2} = 10.5$. Note that if we had complete disagreement, every skier would have a rank sum of 10.5. $S = \sum SR_i^2 - n(\overline{SR})^2 = 790.5 - 6(10.5)^2 = 129$. The Kendall coefficient of concordance says that the degree of agreement on a zero-to-one scale is $W = \frac{12S}{k^2n(n^2-1)} = \frac{12(129)}{9(6)(35)} = 0.819048$. To test the null hypothesis of disagreement at $\alpha = .05$, look up S in the table of critical values of Kendall's S, part of which is reproduced below. Our computed value of S is larger than $S_{.05} = 103.9$, so we reject the null hypothesis of disagreement.

TABLE 12: Critical values of Kendall's S (alpha = .05)
m      n=3     n=4     n=5     n=6     n=7
3       -       -      64.4    103.9   157.3
4       -      49.5    88.4    143.3   217.0
5       -      62.6    112.3   182.4   276.2
6       -      75.7    136.1   221.4   335.2
8      48.1    101.7   183.7   299.0   453.1
10     60.0    127.8   231.2   376.7   571.0
15     89.8    192.9   349.8   570.5   864.9
20    119.7    258.0   468.5   764.4  1158.7
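Parts c) and e) in code, as a sketch: scipy has the Friedman test built in, and Kendall's W can be computed directly from its definition. Note that scipy applies a tie correction to the Friedman statistic, so its value may differ slightly from the hand value 8.0833.

```python
# Friedman test and Kendall's W for the skier data.
import numpy as np
from scipy import stats

times = np.array([[4.9, 6.1, 5.2],
                  [4.5, 6.0, 5.1],
                  [4.1, 5.4, 4.9],
                  [4.4, 4.7, 5.1],
                  [4.5, 4.9, 4.5],
                  [3.3, 3.8, 3.9]])

# c) Friedman test across the three slope columns
print(stats.friedmanchisquare(times[:, 0], times[:, 1], times[:, 2]))

# e) Kendall's W: rank each slope's column, sum the ranks by skier
k, n = 3, 6
ranks = np.apply_along_axis(stats.rankdata, 0, times)  # ranks within columns
sr = ranks.sum(axis=1)
s = ((sr - sr.mean()) ** 2).sum()                      # 129.0
w = 12 * s / (k**2 * n * (n**2 - 1))                   # about 0.819
print(s, w)
```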
7. Clarence Sales is a marketing major and knows that national soft drink market shares are as below.

Classic Coke  15.6%
Pepsi         13.2%
Diet Coke      5.1%
Diet Pepsi     3.5%
Other brands  62.6%

He gets in a bit of trouble here and is sentenced to 20 hours of public service. After he finishes his public service, he takes off for Maine, gets caught littering and is sentenced to another 20 hours of public service. During his public service, he picks up 100 cans in each state. The cans are as below.

Brand         PA   ME
Classic Coke  22   17
Pepsi         15   11
Diet Coke     13   10
Diet Pepsi     6    5
Other brands  44   57

Use a 1% significance level throughout this problem. Don't waste our time by just computing percents and saying that they are different. Each part requires a statistical test or the equivalent. State your null and alternative hypotheses in each part.

a) Regard the cans picked up as a random sample of sales in the two states. Can we say that the proportions of soft drink cans discarded in Pennsylvania are the same as the national market shares? (5)

Solution: A Kolmogorov-Smirnov test might work here, but chi-squared is the more likely choice. The PA data are our O. I got E by multiplying the market-share proportions by 100. The hypotheses are $H_0\!:$ the national proportions apply and $H_1\!:$ they do not.

Row   p      E      O     O - E    (O-E)^2/E   O^2/E
1     .156   15.6   22     6.4      2.6256      31.0256
2     .132   13.2   15     1.8      0.2455      17.0455
3     .051    5.1   13     7.9     12.2373      33.1373
4     .035    3.5    6     2.5      1.7857      10.2857
5     .626   62.6   44   -18.6      5.5265      30.9265
Sum  1.000  100.0  100     0.0     22.4206     122.4206

You can use $\chi^2 = \sum\frac{(O-E)^2}{E} = 22.4206$ or $\chi^2 = \sum\frac{O^2}{E} - n = 122.4206 - 100 = 22.4206$, but don't use both. Both formulas are shown above. Actually, the longer calculation is better in this case because we have a value of E that is below 5, and we can see that its contribution of 1.7857 is not what pushes us into rejection. $DF = r - 1 = 4$. The relevant part of the chi-squared table is shown below. The 1% value for 4 degrees of freedom is 13.2767. Since the computed chi-squared exceeds the table value, we reject the null hypothesis.

Degrees of
Freedom   .005      .010      .025      .050      .100      .900      .950      .975      .990      .995
1         7.87946   6.63491   5.02389   3.84146   2.70554   0.01579   0.00393   0.00098   0.00016   0.00004
2         10.5966    9.2103    7.3778    5.9915    4.6052    0.2107    0.1026    0.0506    0.0201    0.0100
3         12.8382   11.3449    9.3484    7.8147    6.2514    0.5844    0.3518    0.2158    0.1148    0.0717
4         14.8603   13.2767   11.1433    9.4877    7.7794    1.0636    0.7107    0.4844    0.2971    0.2070
5         16.7496   15.0863   12.8325   11.0705    9.2364    1.6103    1.1455    0.8312    0.5543    0.4117
6         18.5476   16.8119   14.4494   12.5916   10.6446    2.2041    1.6354    1.2373    0.8721    0.6757
7         20.2778   18.4753   16.0128   14.0671   12.0170    2.8331    2.1674    1.6899    1.2390    0.9893
8         21.9550   20.0902   17.5346   15.5073   13.3616    3.4895    2.7326    2.1797    1.6465    1.3440
9         23.5893   21.6660   19.0228   16.9190   14.6837    4.1682    3.3251    2.7004    2.0879    1.7349

Since the rest of this question involves tests and confidence intervals for one or two proportions, here is the relevant section of the formula table.

Proportion. Interval: $p = \bar p \pm z_{\alpha/2}s_{\bar p}$, where $s_{\bar p} = \sqrt{\frac{\bar p\bar q}{n}}$ and $\bar q = 1 - \bar p$. Hypotheses: $H_0\!: p = p_0$, $H_1\!: p \ne p_0$. Test ratio: $z = \frac{\bar p - p_0}{\sigma_{\bar p}}$, where $\sigma_{\bar p} = \sqrt{\frac{p_0q_0}{n}}$ and $q_0 = 1 - p_0$. Critical value: $\bar p_{cv} = p_0 \pm z_{\alpha/2}\sigma_{\bar p}$.

Difference between proportions, $\Delta p = p_1 - p_2$ and $\Delta\bar p = \bar p_1 - \bar p_2$. Interval: $\Delta p = \Delta\bar p \pm z_{\alpha/2}s_{\Delta\bar p}$, where $s_{\Delta\bar p} = \sqrt{\frac{\bar p_1\bar q_1}{n_1} + \frac{\bar p_2\bar q_2}{n_2}}$. Hypotheses: $H_0\!: \Delta p = \Delta p_0$, $H_1\!: \Delta p \ne \Delta p_0$. If $\Delta p_0 = 0$, use $\sigma_{\Delta\bar p} = \sqrt{\bar p_0\bar q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ with $\bar p_0 = \frac{n_1\bar p_1 + n_2\bar p_2}{n_1 + n_2}$; otherwise use $\sigma_{\Delta\bar p} = \sqrt{\frac{p_{01}q_{01}}{n_1} + \frac{p_{02}q_{02}}{n_2}}$. Test ratio: $z = \frac{\Delta\bar p - \Delta p_0}{\sigma_{\Delta\bar p}}$. Critical value: $\Delta\bar p_{cv} = \Delta p_0 \pm z_{\alpha/2}\sigma_{\Delta\bar p}$.
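Before moving on to the proportion questions, part a)'s chi-squared computation can be checked in software; a minimal sketch:

```python
# Goodness-of-fit test of the PA counts against the national shares.
from scipy import stats

observed = [22, 15, 13, 6, 44]
expected = [15.6, 13.2, 5.1, 3.5, 62.6]   # 100 cans x national shares
print(stats.chisquare(observed, f_exp=expected))
# statistic about 22.42 on 4 df; p-value well below .01, so reject
```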
b) Clarence knows that Maine is Moxie country, so he believes that the proportion of other brands sold is higher in Maine than in Pennsylvania. Is this true? (4)

Solution: Clarence's assertion that $p_1 < p_2$ (with 1 = PA and 2 = ME) is an alternative hypothesis, so we have a left-sided test: $H_0\!: p_1 \ge p_2$ and $H_1\!: p_1 < p_2$, or, if $\Delta p = p_1 - p_2$, $H_0\!: \Delta p \ge 0$ and $H_1\!: \Delta p < 0$.

$\bar p_1 = \frac{44}{100} = .44$, $\bar p_2 = \frac{57}{100} = .57$, $\Delta\bar p = .44 - .57 = -0.13$ and $\bar p_0 = \frac{44 + 57}{100 + 100} = \frac{101}{200} = .505$.

You should use $\sigma_{\Delta\bar p} = \sqrt{\bar p_0\bar q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{.505(.495)(.02)} = \sqrt{.0050} = .0707$ for hypothesis tests and $s_{\Delta\bar p} = \sqrt{\frac{\bar p_1\bar q_1}{n_1} + \frac{\bar p_2\bar q_2}{n_2}} = \sqrt{\frac{.44(.56)}{100} + \frac{.57(.43)}{100}} = \sqrt{.002464 + .002451} = \sqrt{.004915} = .0701$ for confidence intervals, but at this point I won't be fussy.

For a test ratio, $z = \frac{\Delta\bar p - 0}{\sigma_{\Delta\bar p}} = \frac{-0.13}{.0707} = -1.838$. Since this is a left-sided test, we reject the null hypothesis if our computed z is below $-z_{.01} = -2.327$. Though we would reject the null hypothesis at the 5% significance level, we cannot reject it at the 1% level. Alternately, p-value $= P(z \le -1.84) = .5 - .4671 = .0329$; since this is not below 1%, we cannot reject the null hypothesis. If we use a critical value, we want $\Delta\bar p_{cv} = 0 - 2.327(.0707) = -.1645$. Since $\Delta\bar p = -0.13$ is not below this value, we do not reject the null hypothesis.

c) Create a 2% two-sided confidence interval for the difference between the proportions of other brands sold in the two states. Using your Normal table, make this into a 0.1% two-sided interval. (3)

Solution: Recall that $s_{\Delta\bar p} = .0701$ and that a two-sided confidence interval has the form $\Delta p = \Delta\bar p \pm z_{\alpha/2}s_{\Delta\bar p}$, where $\Delta\bar p = -0.13$. For the 2% interval we already know that $z_{\alpha/2} = z_{.01} = 2.327$. For the 0.1% interval we need $\frac{\alpha}{2} = \frac{.001}{2} = .0005$. Make a diagram. Draw a Normal curve with a mean at 0. $z_{.0005}$ is the value of z with 0.05% of the distribution above it. Since 100 - 0.05 = 99.95, it is also the 99.95th percentile. Since 50% of the standardized Normal distribution is below zero, your diagram should show that the probability between zero and $z_{.0005}$ is 99.95% - 50% = 49.95%, or $P(0 \le z \le z_{.0005}) = .4995$. The relevant part of the Normal table appears below. To get .4995 we need values of z between 3.27 and 3.32. Any of these would be acceptable, but a good guess is $z_{.0005} = 3.295$.

If z0 is between   P(0 <= z <= z0) is
3.08 and 3.10      .4990
3.11 and 3.13      .4991
3.14 and 3.17      .4992
3.18 and 3.21      .4993
3.22 and 3.26      .4994
3.27 and 3.32      .4995
3.33 and 3.38      .4996
3.39 and 3.48      .4997
3.49 and 3.61      .4998
3.62 and 3.89      .4999
3.90 and up        .5000

Thus the 2% confidence interval is $\Delta p = -0.13 \pm 2.327(.0701) = -.13 \pm .16$, or -0.29 to 0.03, and the 0.1% confidence interval is $\Delta p = -0.13 \pm 3.295(.0701) = -.13 \pm .23$, or -0.36 to 0.10.
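Parts b) and c) numerically, as a sketch: statsmodels has a two-proportion z test that uses the same pooled standard error as the hand computation, and the unpooled interval is easy to build directly.

```python
# b) left-sided two-proportion z test; c) the 2% interval built by hand.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([44, 57])   # 'other brands' in PA, ME
nobs = np.array([100, 100])
stat, pval = proportions_ztest(count, nobs, alternative='smaller')
print(stat, pval)            # about -1.84, p about .033: reject at 5%, not at 1%

p1, p2 = 0.44, 0.57
se = (p1*(1-p1)/100 + p2*(1-p2)/100) ** 0.5   # about .0701
z = 2.327                                     # upper .01 point of the Normal
print((p1 - p2) - z*se, (p1 - p2) + z*se)     # about -0.29 to 0.03
```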
d) Actually, Clarence's mother owns the Coke franchise for Maine, and last year her sales of Classic Coke and Diet Coke together accounted for 25% of the soft drink market in Maine. She tells Clarence that her sales are now above 25%. On the basis of Clarence's Maine sample, is that true? (2) [131]

Solution: In the Maine sample (Classic Coke 17, Pepsi 11, Diet Coke 10, Diet Pepsi 5, Other brands 57), Classic Coke and Diet Coke together were 17% + 10% = 27% of the sample, so $\bar p = .27$. His mother's assertion that $p > .25$ is an alternative hypothesis, so we have a right-sided test: $H_0\!: p \le .25$ and $H_1\!: p > .25$. $\sigma_{\bar p} = \sqrt{\frac{p_0q_0}{n}} = \sqrt{\frac{.25(.75)}{100}} = \sqrt{.001875} = .04330$ would be used for hypothesis tests, while $s_{\bar p} = \sqrt{\frac{\bar p\bar q}{n}} = \sqrt{\frac{.27(.73)}{100}} = \sqrt{.001971} = .04440$ would be used for confidence intervals.

If we use the test ratio, $z = \frac{\bar p - p_0}{\sigma_{\bar p}} = \frac{.27 - .25}{.04330} = 0.4619$. We already know that $z_{.01} = 2.327$, and we reject the null hypothesis if the computed z-ratio exceeds 2.327. It does not, so we do not reject the null hypothesis. If we want to use a critical value for $\bar p$, we want $\bar p_{cv} = .25 + 2.327(.04330) = .3508$. I'm surprised that it is so large, but since the observed proportion does not exceed 35.08%, we cannot reject the null hypothesis.