252solnD4 10/19/06 (Open this document in 'Page Layout' view!) Re-edited so that the difference between the two parameters is written D.

D. COMPARISON OF TWO SAMPLES

1. Two Means, Two Independent Samples, Large Samples.
   Text 10.1-10.3, 10.7 [10.1 – 10.3, 10.5] (10.1 – 10.3, 10.5)

2. Two Means, Two Independent Samples, Populations Normally Distributed, Population Variances Assumed Equal.
   Text 10.4, 10.13a, 10.20a, b, e [10.4, 10.15a, 10.13a,b,e] (10.4, 10.14, 10.12a,b,e).
   For the last problem: $\bar{x}_1 = 17.5571$, $s_1 = 1.9333$, $\bar{x}_2 = 19.8905$, $s_2 = 4.5767$.

3. Two Means, Two Independent Samples, Populations Normally Distributed, Population Variances not Assumed Equal.
   Optional. Text 10.20 [10.13c,d] (10.12c,d). See data above. D3, D4

4. Two Means, Paired Samples (If samples are small, populations should be normally distributed).
   Text 10.26, 10.29 [10.36, 10.37], D1, D2 (10.32* (in 252hwkadd.), [10.34] (different numbers), 10.25 [10.35], D1, D2)

5. Rank Tests.
   a. The Wilcoxon-Mann-Whitney Test for Two Independent Samples. Text 12.65 [10.48] (10.46)
   b. Wilcoxon Signed Rank Test for Paired Samples. Text 12.74-12.76 [10.57-59] (10.80-82 on CD), Downing & Clark 18-15, 18-9 (in chapter 17 in D&C 3rd edition), D5

6. Proportions.
   Text 10.32, 10.38, 10.39, 12.32** [12.2, 12.7*, 12.8*] (12.2)

7. Variances.
   Text 10.40, 10.43-10.48 [10.16, 10.19 - 10.24, 10.25] (10.15, 10.18 - 10.23, 10.24), D6a (below), D6, D7 (A summary problem), D8 (A summary problem)

Graded assignment 3 will be posted. This document is a solution to summary problems.

-------------------------------------------------------------------------------------------------------------------------

Problem D7: This is a summary problem and could be an exam in itself. Almost everything you need to know about comparing two sample means or variances is in here. In a study of the amount of sleep obtained with a sleeping pill and with a placebo, the results were as below (Keller, Warren, Bartel, 2nd ed. p. 354). Test for a difference in means or medians as appropriate.

      x1         x2           d
     Pill     Placebo    difference
     7.3        6.8          .5
     8.5        7.9          .6
     6.4        6.0          .4
     9.0        8.4          .6
     6.9        6.5          .4

$\bar{x}_1 = 7.620$, $s_1^2 = 1.197$; $\bar{x}_2 = 7.120$, $s_2^2 = 0.997$; $\bar{d} = 0.500$, $s_d^2 = 0.010$.

a. Assume that these are independent samples from populations with a normal distribution and that $\sigma_1^2 = \sigma_2^2$ (Test if $\sigma_1^2 = \sigma_2^2$).
b. Assume that these are independent samples and that $\sigma_1^2 \ne \sigma_2^2$.
c. Assume these are paired samples.
In each case do (i) a 99% confidence interval for $\mu_1 - \mu_2$, (ii) test if $\mu_1 = \mu_2$. (iii) In case a test if $\sigma_1^2 = \sigma_2^2$.
d. Redo part a(ii) assuming that the parent population is not normal.
e. Redo part c(ii) assuming that the parent distribution is not normal.

Solution: Assume $\alpha = .01$.

a) Assume that these are independent samples from a normal distribution and that $\sigma_1^2 = \sigma_2^2$ (Test if $\sigma_1^2 = \sigma_2^2$). From the Syllabus supplement:

   Interval for:         Difference Between Two Means ($\sigma$ unknown, variances assumed equal)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $s_{\bar{d}} = \hat{s}_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ and $DF = n_1 + n_2 - 2$
   Hypotheses:           $H_0: D = D_0$, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$
   Pooled variance:      $\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

(i) Confidence interval: In the case of equal variances we use a pooled variance,
$\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{4(1.197) + 4(0.997)}{8} = 1.097$.
This is used to compute
$s_{\bar{d}} = \hat{s}_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = \sqrt{1.097\left(\frac{1}{5} + \frac{1}{5}\right)} = \sqrt{0.439} = 0.662$.
Since we can say that $\bar{d} = \bar{x}_1 - \bar{x}_2$ and $D = \mu_1 - \mu_2$, the equation $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $t_{\alpha/2}^{(n_1 + n_2 - 2)} = t_{.005}^{(8)} = 3.355$, becomes $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_{\bar{d}}$ or
$\mu_1 - \mu_2 = 0.500 \pm 3.355(0.662) = 0.500 \pm 2.221$.

(ii) $H_0: D = 0$, $H_1: D \ne 0$ or $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$. If we use a test ratio,
$t = \frac{\bar{d} - D_0}{s_{\bar{d}}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{s_{\bar{d}}} = \frac{0.500 - 0}{0.662} = 0.755$.
Since this is between $\pm t_{.005}^{(8)} = \pm 3.355$, we accept $H_0$. If we use a critical value instead, $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}} = 0 \pm 3.355(0.662) = \pm 2.221$. Since $\bar{d} = 0.500$ is between these critical values, we accept $H_0$.

(iii) We are testing $H_0: \sigma_1^2 = \sigma_2^2$ against $H_1: \sigma_1^2 \ne \sigma_2^2$. According to the syllabus supplement, we test $F^{(DF_1, DF_2)} = \frac{s_1^2}{s_2^2}$ and $F^{(DF_2, DF_1)} = \frac{s_2^2}{s_1^2}$, where $DF_1 = n_1 - 1$ and $DF_2 = n_2 - 1$.
$\frac{s_1^2}{s_2^2} = \frac{1.197}{0.997} = 1.201$ or $\frac{s_2^2}{s_1^2} = \frac{0.997}{1.197} = 0.833$. The larger ratio, 1.201, is less than $F_{.025}^{(4,4)} = 9.60$, which in turn is smaller than $F_{.01}^{(4,4)}$, so we accept $H_0$. (Actually we should be checking against $F_{.005}^{(4,4)}$, but it is not available on the table. A check of the table shows that $F_{.005}^{(4,4)}$ must be larger than $F_{.01}^{(4,4)}$. So if 1.201 is less than $F_{.01}^{(4,4)}$, it also must be less than $F_{.005}^{(4,4)}$.)

b) Assume that these are independent samples and that $\sigma_1^2 \ne \sigma_2^2$. From the Syllabus supplement:

   Interval for:         Difference Between Two Means ($\sigma$ unknown, variances assumed unequal)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $s_{\bar{d}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
   Degrees of freedom:   $DF = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$
   Hypotheses:           $H_0: D = D_0$, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

(i) (Optional) Confidence interval: In the case of unequal variances we use the Satterthwaite method.
$\frac{s_1^2}{n_1} = \frac{1.197}{5} = 0.2394$ and $\frac{s_2^2}{n_2} = \frac{0.997}{5} = 0.1994$, so $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = 0.2394 + 0.1994 = 0.4388$.
If we use this in the degrees of freedom formula, we find
$DF = \frac{(0.4388)^2}{\frac{(0.2394)^2}{4} + \frac{(0.1994)^2}{4}} = 7.9341$.
We round this down to get 7 degrees of freedom. This is used with $s_{\bar{d}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{0.4388} = 0.6624$. Since we can say that $\bar{d} = \bar{x}_1 - \bar{x}_2$ and $D = \mu_1 - \mu_2$, the equation $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $t_{\alpha/2}^{(7)} = t_{.005}^{(7)} = 3.499$, becomes $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_{\bar{d}}$ or
$\mu_1 - \mu_2 = 0.500 \pm 3.499(0.662) = 0.500 \pm 2.318$.

(ii) $H_0: D = 0$, $H_1: D \ne 0$ or $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$. If we use a test ratio,
$t = \frac{\bar{d} - D_0}{s_{\bar{d}}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{s_{\bar{d}}} = \frac{0.500 - 0}{0.6624} = 0.755$.
Since this is between $\pm t_{.005}^{(7)} = \pm 3.499$, we accept $H_0$. If we use a critical value instead, $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}} = 0 \pm 3.499(0.6624) = \pm 2.318$. Since $\bar{d} = 0.500$ is between these critical values, we accept $H_0$.

c) Assume these are paired samples. If the paired data problem were on the formula table, it would appear as below.

   Interval for:         Difference between Two Means (paired data)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $d = x_1 - x_2$ and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$
   Hypotheses:           $H_0: D = D_0$, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

(i) Confidence interval: In the case of paired data, we act as if we have only $n = n_1 = n_2 = 5$ pairs. $DF = n - 1 = 4$, $t_{\alpha/2}^{(4)} = t_{.005}^{(4)} = 4.604$ and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}} = \sqrt{\frac{0.010}{5}} = \sqrt{.002} = 0.0447$. Since $\bar{d} = \bar{x}_1 - \bar{x}_2$ and $D = \mu_1 - \mu_2$, the equation $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$ becomes $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\, s_{\bar{d}}$ or
$\mu_1 - \mu_2 = 0.500 \pm 4.604(0.0447) = 0.500 \pm 0.206$.

(ii) $H_0: D = 0$, $H_1: D \ne 0$ or $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$. If we use a test ratio,
$t = \frac{\bar{d} - D_0}{s_{\bar{d}}} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{s_{\bar{d}}} = \frac{0.500 - 0}{0.0447} = 11.18$.
Since this is not between $\pm t_{.005}^{(4)} = \pm 4.604$, we reject $H_0$. If we use a critical value instead, $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}} = 0 \pm 4.604(0.0447) = \pm 0.206$. Since $\bar{d} = 0.500$ is not between these critical values, we reject $H_0$.

d) Redo part a(ii) assuming that the parent population is not normal.
Since the parent population is not normal and the data represent two independent samples, we do a Wilcoxon rank sum test. To do this we rank the ten numbers from 1 to 10, starting at the extreme end of the smaller sample. Since the samples are of the same size we arbitrarily pick $x_1$ as the smaller sample and note that 9.0 is the largest number in both samples, so that is where we start our ranking. Since we are working with non-normal items, our hypotheses are stated in terms of medians as $H_0: \eta_1 = \eta_2$, $H_1: \eta_1 \ne \eta_2$.

      x1      r1       x2      r2        d
     7.3       5      6.8       7       .5
     8.5       2      7.9       4       .6
     6.4       9      6.0      10       .4
     9.0       1      8.4       3       .6
     6.9       6      6.5       8       .4
     Sum      23               32

From the above $n_1 = n_2 = 5$, and the sums of the ranks are $SR_1 = 23$ and $SR_2 = 32$. $W$ is the smaller of the two rank sums and is 23. To check our rank sums note that $n_1 + n_2 = n = 10$ and that if the rank sums are correct, $SR_1 + SR_2 = \frac{n(n+1)}{2}$. In this case $23 + 32 = \frac{10(11)}{2} = 55$, so the ranking seems correct. If we go to Table 5 in the syllabus supplement, we find that the p-value for $W = 23$ is .210. Since this is a 2-sided test it should be doubled to .420. In any case, it is above $\alpha = .01$, so accept $H_0$. For a 5% test Table 6 could be used.
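The arithmetic in the pooled-variance, Satterthwaite, paired and rank-sum calculations above can be re-checked with a few lines of code. The following is only a minimal sketch using the Python standard library; the variable names are illustrative and not part of the original solution, and the critical values 3.355, 3.499 and 4.604 are simply copied from the t table values quoted in the text rather than computed.

```python
from math import sqrt
from statistics import mean, variance

# Data from Problem D7 (hours of sleep).
pill = [7.3, 8.5, 6.4, 9.0, 6.9]
placebo = [6.8, 7.9, 6.0, 8.4, 6.5]
n1, n2 = len(pill), len(placebo)
d_bar = mean(pill) - mean(placebo)            # about 0.500
v1, v2 = variance(pill), variance(placebo)    # sample variances, about 1.197 and 0.997

# Equal variances assumed: pooled variance and standard error.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled = sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled = d_bar / se_pooled
half_width_pooled = 3.355 * se_pooled         # t(.005, 8 df) taken from the text

# Unequal variances: Satterthwaite degrees of freedom.
a, b = v1 / n1, v2 / n2
se_unequal = sqrt(a + b)
df_unequal = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
half_width_unequal = 3.499 * se_unequal       # t(.005, 7 df) taken from the text

# Paired samples: work with the differences only.
diffs = [x - y for x, y in zip(pill, placebo)]
se_paired = sqrt(variance(diffs) / len(diffs))
t_paired = mean(diffs) / se_paired
half_width_paired = 4.604 * se_paired         # t(.005, 4 df) taken from the text

# Rank sums, ranking from the largest value (rank 1) as in the text.
# The ten observations are all distinct, so no tie handling is needed here.
ordered = sorted(pill + placebo, reverse=True)
rank = {value: i + 1 for i, value in enumerate(ordered)}
sr1 = sum(rank[v] for v in pill)
sr2 = sum(rank[v] for v in placebo)

print(f"pooled:  t = {t_pooled:.3f}, interval = {d_bar:.3f} +/- {half_width_pooled:.3f}")
print(f"unequal: DF = {df_unequal:.4f}, interval = {d_bar:.3f} +/- {half_width_unequal:.3f}")
print(f"paired:  t = {t_paired:.2f}, interval = {d_bar:.3f} +/- {half_width_paired:.3f}")
print(f"rank sums: SR1 = {sr1}, SR2 = {sr2}, total = {sr1 + sr2}")
```

Because the code carries full precision instead of the rounded standard errors used by hand, the last digit of a half-width can differ slightly from the figures in the text.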

e) Redo part c(ii) assuming that the parent distribution is not normal.
Since the parent population is not normal and the data represent paired samples, we would prefer to do a Wilcoxon signed rank test of the hypotheses $H_0: \eta_1 = \eta_2$, $H_1: \eta_1 \ne \eta_2$. To do this we take the values of $d = x_1 - x_2$ and replace them with their absolute values $|d|$. We rank the $n$ values from 1 to $n$. To compute corrected ranks we add + or - according to the sign of $d$ and replace all ties with average ranks.

      x1      x2       d      |d|    rank    corrected rank
     7.3     6.8      .5      .5       3         +3.0
     8.5     7.9      .6      .6       4         +4.5
     6.4     6.0      .4      .4       2         +1.5
     9.0     8.4      .6      .6       5         +4.5
     6.9     6.5      .4      .4       1         +1.5

For example, ranks 4 and 5 are both replaced with 4.5, their average, because they correspond to identical values (.6) of $|d|$. We next compute $T^+$ and $T^-$, the sums of the positive and negative ranks. In this case $T^+ = 3.0 + 4.5 + 1.5 + 4.5 + 1.5 = 15$, while $T^- = 0$. Our check on the ranking is that the sum of the numbers from 1 to $n$ is $\frac{n(n+1)}{2}$. In this case, since $n = 5$, $\frac{n(n+1)}{2} = \frac{5(6)}{2} = 15$, which, as it should be, is the sum of $T^+$ and $T^-$. We call the smaller of $T^+$ and $T^-$, in this case 0, $T_L$, and look it up on Table 7 in the syllabus supplement. Unfortunately for $n = 5$, there are no appropriate values, so we cannot reject $H_0$.

A second choice test here would be a sign test. We use a binomial table to find the probability of getting 5 (or more) positive differences in 5 tries, assuming that the probability of a positive difference is .5. From the binomial table this probability is .0313, but to make this into a p-value for a 2-sided test, we must double it to .0626. Since $\alpha = .01$ is less than the p-value, we must accept $H_0$, though if we were working with a higher significance level we could reject it.

Problem D8: (2001 Graded Assignment 3) In your outline there are 6 methods to compare means or medians, methods D1, D2, D3, D4, D5a and D5b.
Method D6 compares proportions and method D7 compares variances or standard deviations. In the following cases, identify $H_0$ and $H_1$ and identify which method to use. If the hypotheses involve a mean, state the hypotheses in terms of both $\mu$ and $D = \mu_1 - \mu_2$. If the hypotheses involve a proportion, state them in terms of both $p$ and $\Delta p = p_1 - p_2$. If the hypotheses involve standard deviations or variances, state them in terms of both $\sigma^2$ and $\frac{\sigma_1^2}{\sigma_2^2}$ or $\frac{\sigma_2^2}{\sigma_1^2}$. All the questions involve means, medians, proportions or variances.
Note: Look at 252thngs on the syllabus supplement part of the website before you start (and before you take exams).

a. You have data on income in two villages ($x_1$ in village 1, $x_2$ in village 2). You want to test the hypothesis that village 1 has higher earnings than village 2. You know that income has an extremely skewed distribution, and you have to decide whether to use the mean or the median income.
b. You have a sample of earned incomes for 25 couples, both of whom are teachers. ($x_1$ is the women's incomes in a column, $x_2$ is the men's. Each line represents one couple.) Test to see if the women make more than the men.
c. You have interviewed a sample of 80 small businesses in the Northeast and 75 small businesses in the Southeast. Each business has indicated whether they sell in foreign markets. You want to show that businesses in the Northeast are more likely to export. ($x_1$ is the total number of firms that export in the Northeast sample, $x_2$ in the Southeast.)
d. You have profit rates, $x_1$, for a sample of 20 pharmaceutical firms in Europe and profit rates, $x_2$, for a sample of 17 pharmaceutical firms in the US. You believe that they are normally distributed and you wish to see whether the European firms were more profitable than the American firms.
e. In order to see which garage to use under contract for automobile repairs, 10 cars are towed first to garage 1 and then to garage 2. You end up with two data sets: the first data column, $x_1$, is estimates from the first garage and the second data column, $x_2$, is estimates from the second garage. Each of the 10 lines of data refers to one car. You believe that the estimates are approximately normally distributed. Compare the estimates in garage 1 and 2.
f. You are having a part produced in two different machines. $x_1$ is 200 randomly selected data points that represent the length of parts from machine one, $x_2$ is 200 randomly selected data points that represent the length of parts from machine two. You want to test your suspicion that parts from machine 2 are longer than parts from machine 1. In a problem of this type you would assume that the lengths are normally distributed.
g. You also suspect that parts from machine two are more variable in length than parts from machine one. Test this suspicion.

Solution with problem statements inserted. It may help to use the following table taken from the outline.

                                                            Paired Samples    Independent Samples
   Location - Normal distribution. Compare means.           Method D4         Methods D1-D3
   Location - Distribution not Normal. Compare medians.     Method D5b        Method D5a
   Proportions                                              Method D6b        Method D6a
   Variability - Normal distribution. Compare variances.                      Method D7

a. You have data on income in two villages ($x_1$ in village 1, $x_2$ in village 2). You want to test the hypothesis that village 1 has higher earnings than village 2. You know that income has an extremely skewed distribution, and you have to decide whether to use the mean or the median income.
Solution: Because of the skewed distribution, the median is the preferred statistic. If $\eta$ is the median, $H_0: \eta_1 \le \eta_2$, $H_1: \eta_1 > \eta_2$. Since we are comparing medians and the data are not paired, use Method D5a.

b. You have a sample of earned incomes for 25 couples, both of whom are teachers. ($x_1$ is the women's incomes in a column, $x_2$ is the men's. Each line represents one couple.) Test to see if the women make more than the men.
Solution: If $\eta$ is the median, $H_0: \eta_1 \le \eta_2$, $H_1: \eta_1 > \eta_2$. Since we are comparing medians and the data are paired, use Method D5b.

c. You have interviewed a sample of 80 small businesses in the Northeast and 75 small businesses in the Southeast. Each business has indicated whether they sell in foreign markets. You want to show that businesses in the Northeast are more likely to export. ($x_1$ is the total number of firms that export in the Northeast sample, $x_2$ in the Southeast.)
Solution: If $\bar{p}_1 = \frac{x_1}{n_1}$ and $\bar{p}_2 = \frac{x_2}{n_2}$ are the sample proportions, then $H_0: p_1 \le p_2$, $H_1: p_1 > p_2$ or $H_0: p_1 - p_2 \le 0$, $H_1: p_1 - p_2 > 0$. If $\Delta p = p_1 - p_2$, then $H_0: \Delta p \le 0$, $H_1: \Delta p > 0$. Since we are comparing proportions, use Method D6.

d. You have profit rates, $x_1$, for a sample of 20 pharmaceutical firms in Europe and profit rates, $x_2$, for a sample of 17 pharmaceutical firms in the US. You believe that they are normally distributed and you wish to see whether the European firms were more profitable than the American firms.
Solution: $H_0: \mu_1 \le \mu_2$, $H_1: \mu_1 > \mu_2$ or $H_0: \mu_1 - \mu_2 \le 0$, $H_1: \mu_1 - \mu_2 > 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \le 0$, $H_1: D > 0$. Because you believe that the Normal distribution applies, you use a method that compares means. The total sample size is too small to use Method D1, which means that D2 or D3 should work. You could test the variances for equality and use D2, or not bother and use D3.

e. In order to see which garage to use under contract for automobile repairs, 10 cars are towed first to garage 1 and then to garage 2. You end up with two data sets: the first data column, $x_1$, is estimates from the first garage and the second data column, $x_2$, is estimates from the second garage. Each of the 10 lines of data refers to one car. You believe that the estimates are approximately normally distributed. Compare the estimates in garage 1 and 2.
Solution: There is no reason to assume that one garage is cheaper than the other, so $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$ or $H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$. If $D = \mu_1 - \mu_2$, then $H_0: D = 0$, $H_1: D \ne 0$. Again, you compare means because you are, presumably, interested in the total amount that you will pay for the repairs, which means that you want the lowest average cost. The important thing to notice here is that the data are in pairs, so you use Method D4.

f. You are having a part produced in two different machines. $x_1$ is 200 randomly selected data points that represent the length of parts from machine one, $x_2$ is 200 randomly selected data points that represent the length of parts from machine two. You want to test your suspicion that parts from machine 2 are longer than parts from machine 1. In a problem of this type you would assume that the lengths are normally distributed.
Solution: $H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$ or $H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$. If $D = \mu_1 - \mu_2$, then $H_0: D \ge 0$, $H_1: D < 0$. You could use Method D2 (if you tested the variances for equality) or D3 here, but, since you have two large samples, it would be far easier to use Method D1.

g. You also suspect that parts from machine two are more variable in length than parts from machine one. Test this suspicion.
Solution: $H_0: \sigma_1 \ge \sigma_2$, $H_1: \sigma_1 < \sigma_2$ or $H_0: \sigma_1^2 \ge \sigma_2^2$, $H_1: \sigma_1^2 < \sigma_2^2$. In terms of the variance ratio $\frac{\sigma_1^2}{\sigma_2^2}$ or $\frac{\sigma_2^2}{\sigma_1^2}$, the alternate hypothesis rules, so $H_0: \frac{\sigma_2^2}{\sigma_1^2} \le 1$ and $H_1: \frac{\sigma_2^2}{\sigma_1^2} > 1$. Since you are comparing variances, use Method D7.

----------------------------------------------------------------------------------------------------------------------------

This is just an excerpt from an old solution to grass3, but it may make it easier to do grass3 and take the exams. Remember the following: You have not done a hypothesis test unless you have stated your hypotheses, run the numbers and stated your conclusion.
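The rule on p-value says if the p-value is less than the significance level (alpha = $\alpha$) reject the null hypothesis; if the p-value is greater than or equal to the significance level, do not reject the null hypothesis. A table follows.

From the Formula Table (with Method D4 added):

Difference between Two Means ($\sigma$ known) (Method D1)
   Confidence Interval:  $D = \bar{d} \pm z_{\alpha/2}\, \sigma_{\bar{d}}$, where $\sigma_{\bar{d}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ and $\bar{d} = \bar{x}_1 - \bar{x}_2$
   Hypotheses:           $H_0: D = D_0$ *, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $z = \frac{\bar{d} - D_0}{\sigma_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm z_{\alpha/2}\, \sigma_{\bar{d}}$

Difference between Two Means ($\sigma$ unknown, variances assumed equal) (Method D2)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $s_{\bar{d}} = \hat{s}_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ and $DF = n_1 + n_2 - 2$
   Hypotheses:           $H_0: D = D_0$ *, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$, with $\hat{s}_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

Difference between Two Means ($\sigma$ unknown, variances assumed unequal) (Method D3)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $s_{\bar{d}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $DF = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$
   Hypotheses:           $H_0: D = D_0$ *, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

Difference between Two Means (paired data) (Method D4)
   Confidence Interval:  $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$, where $d = x_1 - x_2$ and $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$
   Hypotheses:           $H_0: D = D_0$ *, $H_1: D \ne D_0$, where $D = \mu_1 - \mu_2$
   Test Ratio:           $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
   Critical Value:       $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

Ratio of Variances (Method D7)
   Confidence Interval:  $\frac{s_2^2}{s_1^2} F_{1-\alpha/2}^{(DF_1, DF_2)} \le \frac{\sigma_2^2}{\sigma_1^2} \le \frac{s_2^2}{s_1^2} F_{\alpha/2}^{(DF_1, DF_2)}$, where $F_{1-\alpha/2}^{(DF_1, DF_2)} = \frac{1}{F_{\alpha/2}^{(DF_2, DF_1)}}$, $DF_1 = n_1 - 1$ and $DF_2 = n_2 - 1$
   Hypotheses:           $H_0: \sigma_1^2 = \sigma_2^2$, $H_1: \sigma_1^2 \ne \sigma_2^2$
   Test Ratio:           $F^{(DF_1, DF_2)} = \frac{s_1^2}{s_2^2}$ and $F^{(DF_2, DF_1)} = \frac{s_2^2}{s_1^2}$

* Same as $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$ if $D_0 = 0$.
Note that the symbol for the difference has been changed to D.

© 2002 Roger Even Bove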
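The p-value rule stated above can be illustrated with the two nonparametric results from Problem D7. The following is a hedged sketch using only the Python standard library: `decide` is a hypothetical helper name, the sign test p-value is computed from the binomial distribution with probability .5, and the rank-sum p-value is obtained by brute-force enumeration of all ways to give 5 of the 10 ranks to the first sample, which is an assumption about how table values of that kind can be reproduced, not a description of how the syllabus tables were built.

```python
from itertools import combinations
from math import comb


def decide(p_value, alpha):
    """Apply the p-value rule: reject H0 only if p-value < alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"


alpha = 0.01

# Sign test for the paired data: 5 positive differences in 5 tries, p = .5.
n = 5
p_sign_one_sided = sum(comb(n, k) for k in range(5, n + 1)) / 2 ** n
p_sign_two_sided = 2 * p_sign_one_sided

# Wilcoxon rank sum test: under H0 every choice of 5 ranks out of 1..10
# is equally likely for sample 1, so count the choices with sum <= W = 23.
rank_sums = [sum(c) for c in combinations(range(1, 11), 5)]
p_rank_one_sided = sum(s <= 23 for s in rank_sums) / len(rank_sums)
p_rank_two_sided = 2 * p_rank_one_sided

print("sign test:", round(p_sign_two_sided, 4), decide(p_sign_two_sided, alpha))
print("rank sum: ", round(p_rank_two_sided, 4), decide(p_rank_two_sided, alpha))
```

The exact two-sided sign test p-value is .0625; the .0626 in the solution comes from doubling the rounded table value .0313. Note also that a p-value exactly equal to the significance level does not lead to rejection under this rule.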