4/26/02 252y0232

ECO252 QBA2
THIRD HOUR EXAM
April 18, 2002

Name: KEY
Hour of Class Registered (Circle): MWF TR 10 12 12:30 2:00

I. (10+ points) Do all the following.

1. Hand in your computer printouts for problems 2 and 3. (5 points; 3 point penalty for not handing in.) Remember that the ANOVA printout must be completed, using a 5% significance level, for full credit. I should be able to tell what is tested and what the conclusions are.

2. a. In particular, is the interaction between car and driver significant? Which numbers made you think that? (2)
b. Create two confidence intervals for the difference between the means for drivers 2 and 3, one that is valid alone, and one that is valid simultaneously with other similar intervals. Do these intervals show a significant difference between these two means? Why? (4)

Solution: The only parts of the solution to computer problem 2 that you need are:

Tabulated Statistics

ROWS: car    COLUMNS: driver

              1         2         3       ALL
  1      42.000    25.000    12.667    26.556
  2      32.000    28.000    29.333    29.778
  3      30.667    45.000    28.333    34.667
  4      31.333    24.667    54.667    36.889
ALL      34.000    30.667    31.250    31.972

CELL CONTENTS -- mpg:MEAN

MTB > twoway 'mpg' 'car' 'driver'

Two-way Analysis of Variance

Analysis of Variance for mpg
Source         DF        SS        MS
car             3     590.3     196.8
driver          2      76.1      38.0
Interaction     6    3227.9     538.0
Error          24     336.7      14.0
Total          35    4231.0

To complete the printout, divide each entry in the MS column by MSW = 14.0 and place the results in the F column. Then look up the corresponding 5% critical values in the F table.

Source         DF        SS        MS          F     F.05              H0
car             3     590.3     196.8    14.057 s    F(3,24) = 3.01    Car means identical
driver          2      76.1      38.0     2.714 ns   F(2,24) = 3.40    Driver means identical
Interaction     6    3227.9     538.0    38.428 s    F(6,24) = 2.51    No interaction
Error          24     336.7      14.0
Total          35    4231.0

The first and the third null hypotheses are rejected.

a) Since 38.428 is larger than 2.51, we reject the hypothesis that there is no interaction and say that there is significant interaction.
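The step of completing the printout can be sketched in a few lines of Python. This is only an illustration: the sums of squares and degrees of freedom are copied from the Minitab output above, the 5% critical values are the table values quoted in the solution rather than computed, and the names `anova`, `f_crit` and `complete_anova` are made up for this sketch. Because it divides the unrounded SS by DF, the F ratios differ slightly from the hand calculation, which used the rounded MS column.

```python
# Sums of squares and degrees of freedom copied from the Minitab two-way output.
anova = {
    "car": (590.3, 3),
    "driver": (76.1, 2),
    "interaction": (3227.9, 6),
}
sse, df_error = 336.7, 24

# 5% critical values as quoted from the F table in the solution (not computed here).
f_crit = {"car": 3.01, "driver": 3.40, "interaction": 2.51}

msw = sse / df_error  # mean square within (error)


def complete_anova(table, msw, crit):
    """Return {source: (F ratio, reject H0?)} for each line of the ANOVA table."""
    out = {}
    for source, (ss, df) in table.items():
        f_ratio = (ss / df) / msw
        out[source] = (f_ratio, f_ratio > crit[source])
    return out


results = complete_anova(anova, msw, f_crit)
for source, (f_ratio, reject) in results.items():
    print(f"{source:12s} F = {f_ratio:7.3f}   reject H0: {reject}")
```

The decisions agree with the completed table: the car and interaction hypotheses are rejected and the driver hypothesis is not.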
b) Drivers 2 and 3 are in the columns. There are $R = 4$ rows, $C = 3$ columns and $P = 3$ measurements per cell. Of course $RC(P-1) = 4(3)(2) = 24$, the number of degrees of freedom for 'within' or 'error.' From the outline, we have for Bonferroni confidence intervals for column means

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm t_{\alpha/(2m)}^{(RC(P-1))} \sqrt{\frac{2MSW}{PR}}$.

This becomes, for $m = 1$,

$\mu_2 - \mu_3 = (\bar{x}_{\cdot 2} - \bar{x}_{\cdot 3}) \pm t_{.025}^{(24)} \sqrt{\frac{2MSW}{PR}} = (30.667 - 31.250) \pm 2.064\sqrt{\frac{2(14.0)}{12}} = -0.583 \pm 2.064\sqrt{2.333} = -0.58 \pm 3.15$.

The interval includes zero, so this indicates no significant difference.

For Scheffe intervals for column means use

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{(C-1)F_{\alpha}^{(C-1,\,RC(P-1))}} \sqrt{\frac{2MSW}{PR}}$.

So

$\mu_2 - \mu_3 = (30.667 - 31.250) \pm \sqrt{2F_{.05}^{(2,24)}} \sqrt{\frac{2(14)}{12}} = -0.583 \pm \sqrt{2(3.40)}\sqrt{2.333} = -0.583 \pm 3.98$.

This interval also includes zero, so it indicates no significant difference.

c. In your income and education regression,
(i) Explain what coefficients are significant and why. (2)
(ii) What income would you predict for someone with 3 years of education? (1)
(iii) Make a confidence interval for the income of someone with 3 years of education using some of the information generated by Minitab below. (2)

Descriptive Statistics

Variable      N      Mean    Median    TrMean     StDev    SEMean
Educ         32    12.000    12.000    12.071     4.363     0.771

Variable       Min       Max        Q1        Q3
Educ         4.000    20.000     8.000    16.000

Column Sum of Squares
Sum of squares (uncorrected) of Educ = 5198.0

Solution: The relevant output is:

Regression Analysis

The regression equation is
Income = 5078 + 732 Educ

Predictor      Coef     Stdev   t-ratio        p
Constant       5078      1498      3.39    0.002
Educ          732.4     117.5      6.23    0.000

s = 2855    R-sq = 56.4%    R-sq(adj) = 55.0%

i) Since the p-values are both below .05, both coefficients are significant at the 5% level.

ii) The regression can be written as $\widehat{Income} = 5078 + 732\,Educ$ or $\widehat{Income} = 5078 + 732.4\,Educ$. So $\widehat{Income} = 5078 + 732(3) = 7274$ or $\widehat{Income} = 5078 + 732.4(3) = 7275.2$.

iii) From the outline, the confidence interval is $\mu_{Y_0} = \hat{Y}_0 \pm t\, s_{\hat{Y}}$, where

$s_{\hat{Y}}^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum X^2 - n\bar{X}^2}\right] = 2855^2\left[\frac{1}{32} + \frac{(3 - 12)^2}{5198 - 32(12)^2}\right] = 8151025\left[\frac{1}{32} + \frac{81}{590}\right] = 1373758.6$

and $s_{\hat{Y}} = \sqrt{1373758.6} = 1172.07$. If we use $t_{\alpha/2}^{(n-2)} = t_{.025}^{(30)} = 2.042$, we get $\mu_{Y_0} = 7274 \pm 2.042(1172.07) = 7274 \pm 2393$.
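The interval in iii) can be reproduced from the summary figures alone. The following is a minimal sketch, assuming the rounded coefficients 5078 and 732 and the table value t = 2.042 used in the solution; the function name `mean_response_interval` is made up for this illustration.

```python
import math


def mean_response_interval(b0, b1, s_e, n, x_bar, sum_x2, x0, t_crit):
    """Point forecast, its standard error, and the half-width of the
    confidence interval for the mean of Y at x0 in a simple regression."""
    y_hat = b0 + b1 * x0
    ss_x = sum_x2 - n * x_bar ** 2  # corrected sum of squares of X
    s_yhat = s_e * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat, s_yhat, t_crit * s_yhat


# Figures taken from the Minitab output: s = 2855, n = 32, mean of Educ = 12,
# uncorrected sum of squares of Educ = 5198. t_crit is the table value for 30 df.
y_hat, s_yhat, half_width = mean_response_interval(
    b0=5078, b1=732, s_e=2855, n=32, x_bar=12.0, sum_x2=5198.0, x0=3, t_crit=2.042
)
print(f"forecast {y_hat:.0f}, std error {s_yhat:.2f}, interval +/- {half_width:.0f}")
```

Note how far 3 years of education is from the sample mean of 12: the term (x0 - mean)^2 / SSx dominates 1/n, which is why the interval is so wide.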
Please note the following from the 252 home page:

The rule on p-value: If the p-value is less than the significance level (alpha), reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.

Significance: This is a topic that was covered under hypothesis tests. Probably the first reference I made to this was even earlier when I said that a parameter is significant if it is not zero. I later said that a null hypothesis often says that a parameter or a difference between parameters is insignificant. If a result is significant we reject the null hypothesis.

To put this more generally, a result is (statistically) significant if it is larger or smaller than would be expected by chance alone. Thus in the case of a regression coefficient the measure of significance could be the p-value, which tells us the probability of getting our actual result or something more extreme if we assume that the population value of the coefficient is zero. If the p-value is small (below our significance level), then it is unlikely that our assumption about the coefficient is correct and we say that the coefficient is significant (or significantly different from zero). Of course, the various hypothesis tests that we have discussed here are also often ways of proving significance.

II. Do at least 4 of the following 5 Problems (at least 10 each) (or do sections adding to at least 40 points). Anything extra you do helps, and grades wrap around. Show your work! State $H_0$ and $H_1$ where applicable. Never say 'yes' or 'no' without a statistical test.

1. On the following pages there are printouts from two computer problems.

a. The One-way ANOVA Problem (Albright, Winston, Zappe - abbreviated): An automobile parts producer has instituted an employee empowerment program in five plants. Random samples of employees in each plant are asked to rate the success of the program on a 1 to 10 scale,
10 being the highest rating. They want to know if the program is being implemented with equal success at each plant and are thus looking to see if there is a significant difference between mean ratings at each plant. They are assuming that the results are distributed according to Normal distributions with similar variances.
(i) Indicate what hypothesis was tested, what the p-value was and whether, using the p-value, you would reject the null if ($\alpha$) the significance level was 5% and ($\beta$) the significance level was 1%. Explain why. Does this mean that the success was equal in all plants? (3)
(ii) Do a 'normal' and a Scheffe confidence interval ($\alpha = .05$) for the difference between the means in the two plants that were the least successful. Do these intervals indicate a difference in the success of the program between these two plants? Why? (4.5)
(iii) The printout gives 95% confidence intervals for the means for each plant. Find the numbers for the confidence interval for 'Midwest.' Why is this interval smaller than the others? (2.5)
(iv) I would question whether ANOVA was appropriate for this problem because there is no evidence that the underlying populations are Normally distributed. What method would I prefer for this problem? (1)

One-way ANOVA problem

Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-1.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-1.MTW
Worksheet was saved on 4/ 9/2002
MTB > print c1-c5

Data Display

Row   south   midwest   n-east   s-west   west
  1       7         7        7        6      6
  2       1         6        5        4      6
  3       8        10        5        7      6
  4       7         3        5       10      6
  5       2         9        4        7      3
  6       9        10        3        6      4
  7       3         8        4        6      8
  8       8         4        5        7      6
  9       5         3        5        4      2
 10       7         2        3        3      4
 11       4         7        3        7      5
 12                 7        3        8      6
 13                 5        5        9      4
 14                10        5       10      7
 15                10                 4      4
 16                 6                10      3
 17                 3                 4      5
 18                 5                 6      4
 19                 2                        7
 20                 6                        6
 21                 4                        4
 22                 5
 23                 2
 24                 7
 25                 8
 26                 7

MTB > AOVOneway c1 c2 c3 c4 c5.
One-Way Analysis of Variance

Analysis of Variance
Source     DF        SS        MS        F        p
Factor      4     46.24     11.56     2.50    0.049
Error      85    393.55      4.63
Total      89    439.79
                                     Individual 95% CIs For Mean
                                     Based on Pooled StDev
Level        N     Mean    StDev     ---+---------+---------+---------+--
south       11    5.545    2.697             (----------*----------)
midwest     26    6.000    2.623                     (------*------)
n-east      14    4.429    1.158     (---------*--------)
s-west      18    6.556    2.229                        (--------*-------)
west        21    5.048    1.532            (-------*-------)
                                     ---+---------+---------+---------+--
Pooled StDev =    2.152                3.6       4.8       6.0       7.2

Solution: a) (i) All one-way ANOVAs test for equality of the means of the populations represented by the columns, so $H_0$ is $\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$. The p-value is 4.9%, so we reject the null hypothesis at the 5% significance level, but not at the 1% level. If we reject the null hypothesis we say that the success level was not the same at all the plants.

(ii) The Northeast and the West plants were the least successful. From the outline, if we desire a single interval for the difference between the means of column 1 and column 2,

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm t_{\alpha/2}^{(n-m)}\, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s = \sqrt{MSW} = \sqrt{4.63} = 2.152$.

This becomes

$\mu_3 - \mu_5 = (\bar{x}_{\cdot 3} - \bar{x}_{\cdot 5}) \pm t_{.025}^{(85)}\, s \sqrt{\frac{1}{14} + \frac{1}{21}} = (4.429 - 5.048) \pm 1.988\sqrt{4.63}\sqrt{0.11905} = -0.619 \pm 1.988(0.742) = -0.619 \pm 1.475$.

If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between column means, use

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{(m-1)F_{\alpha}^{(m-1,\,n-m)}}\; s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$,

which becomes

$\mu_3 - \mu_5 = (\bar{x}_{\cdot 3} - \bar{x}_{\cdot 5}) \pm \sqrt{(5-1)F_{.05}^{(4,85)}}\sqrt{4.63}\sqrt{\frac{1}{14} + \frac{1}{21}} = (4.429 - 5.048) \pm \sqrt{4(2.48)}\sqrt{4.63}\sqrt{0.11905} = -0.619 \pm 2.338$.

Since both these intervals include zero, there is no significant difference.

(iii) If we use the 'normal' formula for a single mean, we get

$\mu_2 = \bar{x}_{\cdot 2} \pm t_{\alpha/2}^{(n-m)}\, s \sqrt{\frac{1}{n_2}} = 6.000 \pm 1.988\sqrt{4.63}\sqrt{\frac{1}{26}} = 6.000 \pm 0.839$.

It is the smallest interval because we divide the pooled standard deviation by the square root of $n_2$, which is the largest of all the sample sizes.

b. The Regression Problem: This relates the number of shares in thousands to the age of board members of a corporation.
(i) Looking at significance tests and the value of R-squared, how successful is this regression? Why? Why shouldn't this surprise you? (3)
(ii) Note that c1 contains 'shares' and that c4 contains predicted values of 'shares.' Add a regression line to the graph. (1)
(iii) What equation relates the number of shares owned to the age of the board member? How many shares does it say that we should expect an 83-year old board member to own? Would you take this seriously? Why? (2)

Regression Problem

Worksheet size: 100000 cells
MTB > RETR 'C:\MINITAB\2X0232-5.MTW'.
Retrieving worksheet from file: C:\MINITAB\2X0232-5.MTW
Worksheet was saved on 4/12/2002
MTB > echo
MTB > Execute 'C:\MINITAB\252SOLS3.MTB' 1.
Executing from file: C:\MINITAB\252SOLS3.MTB
MTB > #252sols3
MTB > print c1 c2

Data Display

Row   shares   age
  1      7.9    53
  2     66.4    60
  3     29.7    69
  4     60.5    49
  5     10.4    67
  6     28.7    68
  7     86.9    46
  8    121.1    62
  9     35.3    63
 10      2.8    55
 11     74.4    57
 12     13.1    71
 13      9.1    66
 14     19.1    70
 15     18.8    66
 16      3.1    57
 17     96.5    54
 18     47.0    64
 19     31.1    56

MTB > plot c1*c2
(plot omitted)
MTB > regress c1 on 1 c2 c3 c4

Regression Analysis

The regression equation is
shares = 153 - 1.86 age

Predictor      Coef     Stdev   t-ratio        p
Constant     152.95     64.82      2.36    0.031
age          -1.860     1.061     -1.75    0.098

s = 33.01    R-sq = 15.3%    R-sq(adj) = 10.3%

Analysis of Variance
SOURCE        DF       SS       MS       F       p
Regression     1     3348     3348    3.07   0.098
Error         17    18522     1090
Total         18    21870

Unusual Observations
Obs.    age   shares     Fit   Stdev.Fit   Residual   St.Resid
  8    62.0   121.10   37.65        7.70      83.45      2.60R

R denotes an obs. with a large st. resid.

MTB > plot c4*c2
(plot omitted)
MTB > plot c4*c2 c1*c2;
SUBC> symbol;
SUBC> type 3 1;
SUBC> color 8 9;
SUBC> overlay.
MTB > end

(Plot: C4 and shares against age; vertical axis from 0 to 100, horizontal axis age from 50 to 70.)

Solution: b) (i) This is a very unsuccessful regression - surely the author could have found a better predictor of the number of shares owned than age! $R^2$ is very small on a zero to one scale and the p-value for the slope is above 5%.
The regression seems to say that the number of shares owned declines as the board member gets older. I see no reason why this should be true.
(ii) To add a regression line, just connect the x's.
(iii) The regression equation says shares = 153 - 1.86 age. If a board member is 83, shares = 153 - 1.86(83) = -1.38. Of course, you can't own negative shares, and the fact that the oldest board member is 71 might lead us to feel that we have exceeded our competence. Basically the low $R^2$ leaves us unsure whether we should take any of its results seriously.

2. A researcher believes that the data below has a Normal distribution with a mean of 80 and a standard deviation of 5. For your convenience the values of $z = \frac{x - 80}{5}$ are computed for you.
a. Use a chi-squared test to find out if the distribution is correct. (9)
b. Is there a better way to do this problem than chi-squared? Why? Do it. (5)
c. Assume that, instead of using population means given above, we actually checked the data and found that $\bar{x} = 80$ and $s = 5$. How would this change what we did in a)? (1)
d. Assume that, instead of using population means given above, we actually checked the data and found that $\bar{x} = 80$ and $s = 5$. How would this change what we did in b)? (1)

x interval    z interval       Observed Frequency
below 74      below -1.2        23
74-78         -1.2 to -0.4      53
78-82         -0.4 to 0.4       52
82-86         0.4 to 1.2        46
86-90         1.2 to 2.0        24
above 90      above 2.0          2
                               200

Solution: $H_0: N(80, 5)$

a) We find the cumulative distribution of $z$, $F_e$, and use it to find the frequency $f_e$. We then find $E = f_e n$, where $n = 200$. $F_e$ is the cumulative probability. In the $F_e$ column, $F_e(-0.4) = P(z \le -0.4) = .5 - P(-0.4 \le z \le 0) = .5 - .1554 = .3446$ and $F_e(1.2) = P(z \le 1.20) = .5 + P(0 \le z \le 1.2) = .5 + .3849 = .8849$. $f_e$ is the difference between successive values of $F_e$.
For example, $P(-1.2 \le z \le -0.4) = P(z \le -0.4) - P(z \le -1.2) = .3446 - .1151 = .2295$.

x interval    z interval         O       Fe       fe        E
below 74      below -1.2        23    .1151    .1151    23.02
74-78         -1.2 to -0.4      53    .3446    .2295    45.90
78-82         -0.4 to 0.4       52    .6554    .3108    62.16
82-86         0.4 to 1.2        46    .8849    .2295    45.90
86-90         1.2 to 2.0        24    .9772    .0923    18.46
above 90      above 2.0          2   1.0000    .0228     4.56
                               200            1.0000   200.00

We use the O and E to do a conventional chi-squared analysis. In the right column is the short-cut method.

Row      O        E      E - O   (E - O)^2   (E - O)^2/E      O^2/E
 1      23    23.02     0.0200       0.000       0.00002    22.9800
 2      53    45.90    -7.1000      50.410       1.09826    61.1983
 3      52    62.16    10.1600     103.226       1.66064    43.5006
 4      46    45.90    -0.1000       0.010       0.00022    46.1002
 5      24    18.46    -5.5400      30.692       1.66260    31.2026
 6       2     4.56     2.5600       6.554       1.43719     0.8772
       200   200.00     0.0000                   5.85893   205.8589

For the chi-squared method, we might have had to merge two cells, because the last E (4.56) is below 5. However, that row contributes only 1.437 to a total that is far below the critical value, so merging would not change the conclusion and there was no need to do this. We thus have 6 - 1 = 5 degrees of freedom. The value of chi-squared that we computed is 5.85893 or, by the short-cut method, 205.8589 - 200 = 5.8589. From the chi-squared table $\chi^{2(5)}_{.05} = 11.0705$. This is more than our computed $\chi^2$, so do not reject $H_0$.

b) For most problems where the population mean and standard deviation are given, the best method is Kolmogorov-Smirnov. $F_e$ is copied from part a) and O is made into a cumulative distribution $F_o$ by dividing through by $n = 200$ and adding down the column. D is the difference (absolute value) between the two cumulative distributions.

Row      O      O/n       Fo        Fe         D
 1      23    0.115    0.115    0.1151    0.0001
 2      53    0.265    0.380    0.3446    0.0354
 3      52    0.260    0.640    0.6554    0.0154
 4      46    0.230    0.870    0.8849    0.0149
 5      24    0.120    0.990    0.9772    0.0128
 6       2    0.010    1.000    1.0000    0.0000

For the Kolmogorov-Smirnov method the 5% critical value is $\frac{1.36}{\sqrt{200}} = 0.096$. This is greater than the maximum value of D, which is .0354, so do not reject $H_0$.
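Both goodness-of-fit calculations for the grouped data can be checked with a short script. This is a sketch under stated assumptions: the Normal cumulative probabilities come from `math.erf` rather than the printed Normal table, so the expected frequencies and the chi-squared total differ from the hand calculation in the third decimal place; the large-sample 5% Kolmogorov-Smirnov critical value 1.36/sqrt(n) is the one quoted in the solution; the helper name `phi` is made up for this illustration.

```python
import math


def phi(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Upper z limit of each class except the last (open-ended) one, and the counts.
upper_z = [-1.2, -0.4, 0.4, 1.2, 2.0]
observed = [23, 53, 52, 46, 24, 2]
n = sum(observed)

# Cumulative expected probabilities Fe, class probabilities fe, expected counts E.
f_e_cum = [phi(z) for z in upper_z] + [1.0]
f_e = [f_e_cum[0]] + [b - a for a, b in zip(f_e_cum, f_e_cum[1:])]
expected = [p * n for p in f_e]

# Chi-squared statistic, conventional form and short-cut form.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
chi2_shortcut = sum(o * o / e for o, e in zip(observed, expected)) - n

# Kolmogorov-Smirnov: largest gap between observed and expected cumulative
# distributions, evaluated at the upper limit of each class.
running = 0
f_o_cum = []
for o in observed:
    running += o
    f_o_cum.append(running / n)
d_max = max(abs(fo - fe) for fo, fe in zip(f_o_cum, f_e_cum))
ks_crit = 1.36 / math.sqrt(n)

print(f"chi-squared = {chi2:.4f} (short-cut {chi2_shortcut:.4f})")
print(f"max D = {d_max:.4f}, 5% critical value = {ks_crit:.4f}")
print("K-S: reject H0" if d_max > ks_crit else "K-S: do not reject H0")
```

The Kolmogorov-Smirnov rule is to reject only when the maximum D exceeds the critical value, which is the comparison made in the last line.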
c) If the sample mean and standard deviation have been computed from the data, we would lose 2 degrees of freedom. We would go ahead exactly as before until the time came to look up chi-squared, which would now have 3 degrees of freedom.

d) We would go ahead exactly as in b) until the time came to use a table. We would find our critical value on the Lilliefors table.

3. (Weirs) A maker of stain removers is testing the effectiveness of four different formulations of a new product. Columns represent formulations 1-4 of the product and the 6 rows represent different stains (creosote, crayon, motor oil, grape juice, ink, coffee). Each formulation is rated on a 1-10 scale for its effectiveness.

Stain             Form 1   Form 2   Form 3   Form 4    sum   count   Sum of squares
1                      1        7        2        5     15       4               79
2                      9       10        7        5     31       4              255
3                      4        6        1        4     15       4               69
4                      9        7        4        5     25       4              171
5                      6        8        4        4     22       4              132
6                      9        4        2        6     21       4              137
Sum                   38       42       20       29    129      24              843
Count                  6        6        6        6     24
Sum of Squares       296      314       90

a. Assume that the parent distribution is Normal and compare the mean ratings for the four formulations, noting the fact that it is cross-classified. Use $\alpha = .10$. (14)
Note: If you wish to ignore the fact that the data is classified by stain type, indicate this now and compare the column means assuming that the data is four independent random samples from a Normal distribution. (10) ($\alpha = .10$)
b. Using the same significance level, assume that Formulation 1 is the current formula and use Scheffe intervals to see which formulations have mean ratings that differ significantly from the current formulation. (4)
c. Using a significance level of 15%, repeat the analysis in b) using Bonferroni intervals. (4)

Solution: If the parent distribution is Normal use ANOVA; if it's not Normal, use Friedman or Kruskal-Wallis. If the samples are independent random samples use 1-way ANOVA or Kruskal-Wallis. If they are cross-classified, use Friedman or 2-way ANOVA.
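For the cross-classified Normal case, the sums of squares of a two-way ANOVA with one observation per cell can be checked with a short script. This is a minimal sketch using the 6 x 4 table of ratings from the problem statement; the function name `two_way_no_replication` and the dictionary keys are made up for this illustration, and the critical values still have to come from an F table.

```python
# Ratings from the problem statement: 6 stains (rows) by 4 formulations (columns).
ratings = [
    [1, 7, 2, 5],
    [9, 10, 7, 5],
    [4, 6, 1, 4],
    [9, 7, 4, 5],
    [6, 8, 4, 4],
    [9, 4, 2, 6],
]


def two_way_no_replication(table):
    """Sums of squares and F ratios for a two-way ANOVA with one observation
    per cell (rows are blocks, columns are treatments)."""
    r, c = len(table), len(table[0])
    n = r * c
    grand_mean = sum(sum(row) for row in table) / n
    row_means = [sum(row) / c for row in table]
    col_means = [sum(row[j] for row in table) / r for j in range(c)]

    correction = n * grand_mean ** 2
    sst = sum(x * x for row in table for x in row) - correction
    ssr = c * sum(m * m for m in row_means) - correction
    ssc = r * sum(m * m for m in col_means) - correction
    ssw = sst - ssr - ssc

    df_rows, df_cols = r - 1, c - 1
    msw = ssw / (df_rows * df_cols)
    return {
        "SST": sst, "SSR": ssr, "SSC": ssc, "SSW": ssw, "MSW": msw,
        "F_rows": (ssr / df_rows) / msw,
        "F_cols": (ssc / df_cols) / msw,
    }


result = two_way_no_replication(ratings)
for name, value in result.items():
    print(f"{name:7s} {value:10.4f}")
```

Dropping the row term (treating SSR as part of the error) gives the one-way analysis that ignores the stain classification.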
a) 2-way ANOVA (Blocked by stain). 's' indicates that the null hypothesis is rejected.

Stain           Form 1    Form 2    Form 3    Form 4        sum   count    mean   Sum of squares   (mean)^2
                    x1        x2        x3        x4                n_i
1                    1         7         2         5         15       4   3.750               79     14.063
2                    9        10         7         5         31       4   7.750              255     60.062
3                    4         6         1         4         15       4   3.750               69     14.063
4                    9         7         4         5         25       4   6.250              171     39.063
5                    6         8         4         4         22       4   5.500              132     30.250
6                    9         4         2         6         21       4   5.250              137     27.562
Sum                 38   +    42   +    20   +    29   =    129      24   5.375              843    185.062
n_j                  6   +     6   +     6   +     6   =     24
Column mean     6.3333       7.0    3.3333    4.8333      5.375
SS                 296   +   314   +    90   +   143   =    843
(Column mean)^2 40.1111  +  49.0  + 11.1111 + 23.3611  = 123.5833

From the above $\sum x = 129$, $n = 24$, $\bar{x} = \frac{\sum x}{n} = \frac{129}{24} = 5.375$, $\sum x_{ij}^2 = 843.0$, $\sum \bar{x}_{i\cdot}^2 = 185.062$ and $\sum \bar{x}_{\cdot j}^2 = 123.5833$.

$SST = \sum x_{ij}^2 - n\bar{x}^2 = 843.0 - 24(5.375)^2 = 843.0 - 693.375 = 149.625$.
$SSC = \sum n_j \bar{x}_{\cdot j}^2 - n\bar{x}^2 = 6(123.5833) - 24(5.375)^2 = 48.125$. This is SSB in a one-way ANOVA.
$SSR = \sum n_i \bar{x}_{i\cdot}^2 - n\bar{x}^2 = 4(185.062) - 24(5.375)^2 = 46.875$.
($SSW = SST - SSC - SSR = 54.625$)

Source                     SS    DF       MS          F     F.10              H0
Rows (Stains)          46.875     5    9.375    2.574 s     F(5,15) = 2.27    Row means equal
Columns (Formulas)     48.125     3   16.042    4.405 s     F(3,15) = 2.49    Column means equal
Within (Error)         54.625    15    3.642
Total                 149.625    23

So the formulations (column means) are significantly different.

One-way ANOVA (Not blocked by stain) ($SSW = SST - SSB = 101.5$)

Source                     SS    DF       MS          F     F.10              H0
Columns (Formulas)     48.125     3   16.042    3.161 s     F(3,20) = 2.38    Column means equal
Within (Error)        101.500    20    5.075
Total                 149.625    23

Once again, the formulations (column means) are significantly different.

b) This resembles problem F2. The formulas are given in the outline. $R = 6$ is the number of rows, $C = 4$ is the number of columns and $P = 1$ is the number of observations per cell. Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1) = (6-1)(4-1) = 15$, which is the error degrees of freedom above.
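The Scheffe formula for column means is

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{(C-1)F_{\alpha}^{(C-1,\,RC(P-1))}} \sqrt{\frac{2MSW}{PR}}$,

which, with $P = 1$, becomes

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{(C-1)F_{\alpha}^{(C-1,\,(R-1)(C-1))}} \sqrt{\frac{2MSW}{R}} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{(4-1)F_{.10}^{(3,15)}} \sqrt{\frac{2(3.642)}{6}} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{3(2.49)(1.214)} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm \sqrt{9.068} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm 3.01$.

Since the formula works regardless of the column number, we get the following 3 contrasts.

$\mu_1 - \mu_2 = 6.333 - 7.000 \pm 3.01 = -0.67 \pm 3.01$
$\mu_1 - \mu_3 = 6.333 - 3.333 \pm 3.01 = 3.00 \pm 3.01$
$\mu_1 - \mu_4 = 6.333 - 4.833 \pm 3.01 = 1.50 \pm 3.01$

Since the error part of the formula (3.01) is larger than the difference between sample means in all three cases, none of the formulations differs significantly from Formulation 1 by this criterion, although the contrast with Formulation 3 (3.00 against 3.01) falls just short of significance.

c) The Bonferroni formula for column means is

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm t_{\alpha/(2m)}^{(RC(P-1))} \sqrt{\frac{2MSW}{PR}}$.

Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1) = (6-1)(4-1) = 15$, so that the formula reads

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm t_{\alpha/(2m)}^{((R-1)(C-1))} \sqrt{\frac{2MSW}{R}}$.

If $\alpha = .15$ and we are doing $m = 3$ intervals, $\frac{\alpha}{2m} = \frac{.15}{2(3)} = .025$. The Bonferroni formula becomes

$\mu_1 - \mu_2 = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm t_{.025}^{(15)} \sqrt{\frac{2MSW}{R}} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm 2.131\sqrt{\frac{2(3.642)}{6}} = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm 2.131(1.10) = (\bar{x}_{\cdot 1} - \bar{x}_{\cdot 2}) \pm 2.35$.

Once again, substitute the sample means.

$\mu_1 - \mu_2 = 6.333 - 7.000 \pm 2.35 = -0.67 \pm 2.35$
$\mu_1 - \mu_3 = 6.333 - 3.333 \pm 2.35 = 3.00 \pm 2.35$
$\mu_1 - \mu_4 = 6.333 - 4.833 \pm 2.35 = 1.50 \pm 2.35$

According to these smaller, and probably more appropriate, intervals, Formulation 3 is significantly worse than Formulation 1, while Formulations 2 and 4 do not differ significantly from it. We will stick with the old formula.

3 (ctd.). d. Actually, when Weirs presented the data in the previous problem, repeated below, he assumed that the underlying distribution was not Normal. So compare the median ratings using a 10% significance level. (6)

Stain             Form 1   Form 2   Form 3   Form 4    sum   count   Sum of squares
1                      1        7        2        5     15       4               79
2                      9       10        7        5     31       4              255
3                      4        6        1        4     15       4               69
4                      9        7        4        5     25       4              171
5                      6        8        4        4     22       4              132
6                      9        4        2        6     21       4              137
Sum                   38       42       20       29    129      24              843
Count                  6        6        6        6     24
Sum of Squares       296      314       90

Solution: d) This becomes a Friedman test. We rank the data within rows.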
$H_0$: Columns from same distribution

          Original data            Ranks within row
Row     x1    x2    x3    x4      r1     r2     r3     r4
 1       1     7     2     5       1      4      2      3
 2       9    10     7     5       3      4      2      1
 3       4     6     1     4     2.5      4      1    2.5
 4       9     7     4     5       4      3      1      2
 5       6     8     4     4       3      4    1.5    1.5
 6       9     4     2     6       4      2      1      3
Sum                             17.5     21    8.5   13.0

There are $r = 6$ rows and $c = 4$ columns. Check: The rank sums must add to $\frac{r\,c(c+1)}{2} = \frac{6(4)(5)}{2} = 60$. Since 17.5 + 21 + 8.5 + 13.0 = 60, we are all right. The Friedman statistic is

$\chi_F^2 = \frac{12}{r\,c(c+1)} \sum SR^2 - 3r(c+1) = \frac{12}{6(4)(5)}\left(17.5^2 + 21^2 + 8.5^2 + 13^2\right) - 3(6)(5) = \frac{1}{10}(988.5) - 90 = 8.85$.

The Friedman table has no values for $c = 4$ and $r = 6$, so we use a chi-squared table with $c - 1 = 3$ degrees of freedom. Since $\alpha = .10$, the table gives us a critical value of 6.2514. Since our computed chi-squared is larger than the table value, we reject $H_0$. Note - if you were told to use a significance level of .01, you would have gotten a critical value of 11.3449 and would not have rejected the null hypothesis.

4. Use methods appropriate to testing goodness of fit.
a. Test the hypothesis that the numbers below came from a Normal distribution. Use a 10% significance level. (6) Note that Minitab says the following: mean 303.000, stdev 64.0878, n 9.00000.
b. Test the hypothesis that the numbers below came from a Normal distribution with a mean of 240 and a standard deviation of 50. (6)

238 222 272 280 292 301 333 357 432

Solution: a) $H_0: N(?, ?)$, $H_1$: Not Normal. Because the mean and standard deviation are unknown, this is a Lilliefors problem. Note that data must be in order for the Lilliefors or K-S method to work. From the data we found that $\bar{x} = 303.00$ and $s = 64.0878$, and we compute $t = \frac{x - \bar{x}}{s}$. $F_e(t)$ actually is computed from the Normal table. For example, $F_e(222) = P(x \le 222) = P(z \le -1.26) = P(z \le 0) - P(-1.26 \le z \le 0) = .5 - .3962 = .1038$. D is the difference (absolute value) between the two cumulative distributions.
Row      x        t        Fe     O        O/n         Fo          D
 1     222    -1.26    0.1038     1   0.111111    0.11111   0.007311
 2     238    -1.01    0.1562     1   0.111111    0.22222   0.066022
 3     272    -0.48    0.3156     1   0.111111    0.33333   0.017733
 4     280    -0.36    0.3594     1   0.111111    0.44444   0.085044
 5     292    -0.17    0.4325     1   0.111111    0.55556   0.123056
 6     301    -0.03    0.4880     1   0.111111    0.66667   0.178667
 7     333     0.47    0.6808     1   0.111111    0.77778   0.096978
 8     357     0.84    0.7995     1   0.111111    0.88889   0.089389
 9     432     2.01    0.9778     1   0.111111    1.00000   0.022200

The maximum deviation is 0.17867. The Lilliefors table for $\alpha = .10$ and $n = 9$ gives a critical value of 0.249. Since our maximum deviation does not exceed the critical value, we do not reject $H_0$.

b) $H_0: N(240, 50)$, $H_1$: Not $N(240, 50)$. Because the population mean and standard deviation are known, this is a Kolmogorov-Smirnov problem, with $z = \frac{x - \mu}{\sigma}$.

Row      x        z        Fe     O        O/n         Fo          D
 1     222    -0.36    0.3594     1   0.111111    0.11111   0.248289
 2     238    -0.04    0.4840     1   0.111111    0.22222   0.261778
 3     272     0.64    0.7389     1   0.111111    0.33333   0.405567
 4     280     0.80    0.7881     1   0.111111    0.44444   0.343656
 5     292     1.04    0.8508     1   0.111111    0.55556   0.295244
 6     301     1.22    0.8888     1   0.111111    0.66667   0.222133
 7     333     1.86    0.9686     1   0.111111    0.77778   0.190822
 8     357     2.34    0.9925     1   0.111111    0.88889   0.103611
 9     432     3.84    0.9999     1   0.111111    1.00000   0.000100

The maximum deviation is 0.405567. The Kolmogorov-Smirnov table for $\alpha = .10$ and $n = 9$ gives a critical value of 0.387. Since our maximum deviation exceeds the critical value, reject $H_0$.

5. (Weirs) The following data gives years of membership and numbers of shares (in thousands) owned for 8 board members of our corporation. Numbers are the dependent variable and years is the independent variable.

Data Display

Row      share   years   years squared   shares squared
 1         300       6              36            90000
 2         408      12             144           166464
 3         560      14             196           313600
 4         252       6              36            63504
 5         288       9              81            82944
 6         650      13             169           422500
 7         600      15             225           360000
 8         522       9              81           272484
Total     3580      84             968          1771496

Note that $n = 8$ and that you will have to compute $\sum xy$.

a.
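Compute the regression equation $\hat{Y} = b_0 + b_1 x$ to predict thousands of shares owned on the basis of years on the board. (6)
b. On the basis of your regression, how many thousands of shares do you expect to be owned by someone who has been on the board for 3 years? (1)
c. Compute $R^2$. (4)
d. Compute $s_e$. (3)
e. Compute $s_{b_0}$ and do a significance test on $b_0$. (4)
f. Do an interval that shows the average number of shares that would be owned by someone who has been on the board for 3 years. (3)
g. Using your SST etc., put together the ANOVA table. (6)

Solution: $\sum x = 84$, $\sum y = 3580$, $\sum x^2 = 968$ and $\sum y^2 = 1771496$. After all this time, trying to get $\sum x^2$ by squaring $\sum x$ or to get $\sum xy$ by multiplying $\sum x$ by $\sum y$ is inexcusable. We compute $\sum xy = 40788$ (see the Appendix).

Spare Parts Computation:

$\bar{x} = \frac{\sum x}{n} = \frac{84}{8} = 10.5$
$\bar{y} = \frac{\sum y}{n} = \frac{3580}{8} = 447.5$
$SS_x = \sum x^2 - n\bar{x}^2 = 968 - 8(10.5)^2 = 86.0$
$S_{xy} = \sum xy - n\bar{x}\bar{y} = 40788 - 8(10.5)(447.5) = 3198.0$
$SS_y = \sum y^2 - n\bar{y}^2 = 1771496 - 8(447.5)^2 = 169446.0 = SST$

a) $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} = \frac{3198.0}{86.0} = 37.1860$

$b_0 = \bar{y} - b_1\bar{x} = 447.5 - 37.1860(10.5) = 57.047$

b) $\hat{Y} = b_0 + b_1 x$ becomes $\hat{Y} = 57.047 + 37.1860x$. So if $x_0 = 3$, then $\hat{Y}_0 = 57.047 + 37.1860x_0 = 57.047 + 37.1860(3) = 168.605$ is the number of shares (in thousands) that we forecast for someone who has been on the board for 3 years.

c) $SSR = b_1 S_{xy} = b_1\left(\sum xy - n\bar{x}\bar{y}\right) = 37.1860(3198.0) = 118920.8$

$R^2 = \frac{SSR}{SST} = \frac{118920.8}{169446.0} = .7018$ or $R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{\left(\sum xy - n\bar{x}\bar{y}\right)^2}{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)} = \frac{(3198.0)^2}{86.0(169446.0)} = .7018$

($0 \le R^2 \le 1$ always!)

d) $SSE = SST - SSR = 169446.0 - 118920.8 = 50525.2$

$s_e^2 = \frac{SSE}{n-2} = \frac{50525.2}{6} = 8420.867$

or $s_e^2 = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{\sum y^2 - n\bar{y}^2 - b_1\left(\sum xy - n\bar{x}\bar{y}\right)}{n-2} = \frac{169446.0 - 37.1860(3198.0)}{6} = 8420.862$

or $s_e^2 = \frac{(1 - R^2)SST}{n-2} = \frac{(1 - R^2)\left(\sum y^2 - n\bar{y}^2\right)}{n-2} = \frac{(1 - .7018)(169446.0)}{6} = 8421.466$

So $s_e = \sqrt{8420.867} = 91.7653$. ($s_e^2$ is always positive!)

e) $H_0: \beta_0 = 0$, $H_1: \beta_0 \ne 0$

$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SS_x}\right) = 8420.867\left(\frac{1}{8} + \frac{(10.5)^2}{86.0}\right) = 11847.9$, so $s_{b_0} = \sqrt{11847.9} = 108.8$.

$t = \frac{b_0 - \beta_{00}}{s_{b_0}} = \frac{b_0 - 0}{s_{b_0}} = \frac{57.047}{108.8} = 0.524$

Assume that $\alpha = .05$ and make a diagram. Show an almost Normal curve and that the 'reject' region is below $-t_{\alpha/2}^{(n-2)} = -t_{.025}^{(6)} = -2.447$ or above $t_{\alpha/2}^{(n-2)} = t_{.025}^{(6)} = 2.447$.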
Since 0.524 is between these values, do not reject $H_0$. Conclude that $\beta_0$ is insignificant.

f) We found in b) that if $x_0 = 3$, $\hat{Y}_0 = 168.605$.

$s_{\hat{y}_0}^2 = s_e^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum x^2 - n\bar{x}^2}\right) = s_e^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_x}\right) = 8420.867\left(\frac{1}{8} + \frac{(3 - 10.5)^2}{86.0}\right) = 6560.44$, so $s_{\hat{y}_0} = \sqrt{6560.44} = 80.9966$.

So $\mu_{Y_0} = \hat{Y}_0 \pm t_{\alpha/2}^{(n-2)} s_{\hat{y}_0} = 168.605 \pm 2.447(80.9966) = 168 \pm 198$.

g) From the previous page or above, $SSR = 118920.8$, $SST = 169446.0$ and $SSE = 50525.2$. $H_0$ is that there is no relation between Y and X.

Source                   SS    DF         MS          F     F.05
Regression         118920.8     1   118920.8   14.122 s     F(1,6) = 5.99
Error (Within)      50525.2     6    8420.87
Total              169446.0     7

Since the computed F is larger than the table F, reject $H_0$. This agrees with the p-value of 0.009 in the computer output below.

Appendix: Computation of column sums.

Row       y       x       xy        y^2     x^2
 i    share   years       C3         C4      C5
 1      300       6     1800      90000      36
 2      408      12     4896     166464     144
 3      560      14     7840     313600     196
 4      252       6     1512      63504      36
 5      288       9     2592      82944      81
 6      650      13     8450     422500     169
 7      600      15     9000     360000     225
 8      522       9     4698     272484      81
Sum    3580      84    40788    1771496     968

It's worthwhile looking at the computer output for this exercise.

MTB > RETR 'C:\MINITAB\2X0232-4.MTW'.       (Retrieves previously stored data)
Retrieving worksheet from file: C:\MINITAB\2X0232-4.MTW
Worksheet was saved on 4/12/2002
MTB > Execute 'C:\MINITAB\252SOLS.MTB' 1.   (Executes previously stored commands)
Executing from file: C:\MINITAB\252SOLS.MTB

Regression Analysis

The regression equation is
shares = 57 + 37.2 years

Predictor      Coef     Stdev   t-ratio        p
Constant       57.0     108.8      0.52    0.619
years        37.186     9.895      3.76    0.009

s = 91.77    R-sq = 70.2%    R-sq(adj) = 65.2%

Analysis of Variance
SOURCE        DF       SS       MS       F       p
Regression     1   118921   118921   14.12   0.009
Error          6    50525     8421
Total          7   169446
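The hand computation for problem 5 can also be checked against the Minitab output with a few lines of Python. This is a minimal sketch that follows the same 'spare parts' route (SSx, Sxy, SSy) from the eight data pairs; the function name `simple_regression` and the dictionary keys are made up for this illustration. It carries full precision, so SSR and SSE differ from the hand values in the first decimal place.

```python
import math

# Data from the problem: years on the board (x) and thousands of shares (y).
years = [6, 12, 14, 6, 9, 13, 15, 9]
shares = [300, 408, 560, 252, 288, 650, 600, 522]


def simple_regression(x, y):
    """Least-squares fit of y on x via the corrected sums SSx, Sxy and SSy."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum(v * v for v in x) - n * x_bar ** 2
    ss_y = sum(v * v for v in y) - n * y_bar ** 2          # this is SST
    s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    b1 = s_xy / ss_x
    b0 = y_bar - b1 * x_bar
    ssr = b1 * s_xy
    sse = ss_y - ssr
    s_e2 = sse / (n - 2)
    s_b0 = math.sqrt(s_e2 * (1.0 / n + x_bar ** 2 / ss_x))
    return {
        "b0": b0, "b1": b1, "SST": ss_y, "SSR": ssr, "SSE": sse,
        "R2": ssr / ss_y, "s_e": math.sqrt(s_e2),
        "t_b0": b0 / s_b0, "F": ssr / s_e2,
    }


fit = simple_regression(years, shares)
for name, value in fit.items():
    print(f"{name:5s} {value:12.4f}")
```

The slope, intercept, R-squared, s and F printed here can be compared line by line with the Minitab regression output above.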