252solngr3-072 10/23/07 (Open this document in 'Page Layout' view!)

Name:
Class days and time:
Please include this on what you hand in!

Graded Assignment 3

In your outline there are 6 methods to compare means or medians: Methods D1, D2, D3, D4, D5a and D5b. Method D6 compares proportions and Method D7 compares variances or standard deviations. In the following cases, identify $H_0$ and $H_1$ and identify which method to use. If the hypotheses involve a mean, state the hypotheses in terms of both $\mu$ and $D = \mu_1 - \mu_2$. If the hypotheses involve a proportion, state them in terms of both $p$ and $\Delta p = p_1 - p_2$. If the hypotheses involve standard deviations or variances, state them in terms of both $\sigma^2$ and $\sigma_1^2/\sigma_2^2$ or $\sigma_2^2/\sigma_1^2$. All the questions involve means, medians, proportions or variances. One of these problems is a chi-squared test.

Note: Look at 252thngs on the syllabus supplement part of the website before you start (and before you take exams), especially the new rules.

----------------------------------------------------------------------
Example: This may seem long, but it appears on last year's graded assignment 3. A group of supervisors are given exams on management skills before and after taking a course in management. Scores are as follows.

Supervisor   Before   After
     1         63       78
     2         93       92
     3         84       91
     4         72       80
     5         65       69
     6         72       85
     7         91       99
     8         84       82
     9         71       81
    10         80       87
    11         68       93

If we assume that the distribution of results is Normal, what method should we use to answer the question "Has the course improved the scores of the managers?"

Solution: You are comparing means before and after the course. You can get away with using means because the parent distributions are Normal. If $\mu_2$ is the mean of the second (after) set of scores, you are hoping that $\mu_2 > \mu_1$, which, because it contains no equality, is an alternate hypothesis. So your hypotheses are
$H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$
or
$H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$.
If $D = \mu_1 - \mu_2$, then
$H_0: D \ge 0$, $H_1: D < 0$.
The important thing to notice here is that the data are in before and after pairs, so you use Method D4.
----------------------------------------------------------------------
General considerations.

1) All methods in section D are methods that can only be used for comparison of 2 samples. This is because, if $\theta$ (theta) is a parameter like $\mu$ or $p$, $\theta_1 - \theta_2$ is easy to define and will be zero if $\theta_1$ and $\theta_2$ are equal. If we go to more than two samples, say 3, we need something like $(\theta_1 - \theta_0)^2 + (\theta_2 - \theta_0)^2 + (\theta_3 - \theta_0)^2$, where $\theta_0$ is some sort of average of the parameters of the samples. This will equal zero if all the parameters are equal and will not allow positive discrepancies in one sample to cancel out negative discrepancies in another. This is what takes us to chi-squared and ANOVA methods. Saying $(\theta_1 - \theta_0)^2 + (\theta_2 - \theta_0)^2 + (\theta_3 - \theta_0)^2 = 0$ is not the same as saying $D_3 = \theta_1 - \theta_2 - \theta_3 = 0$, because $D_3$ would be negative if $\theta_1 = \theta_2 = \theta_3 > 0$, but saying $(\theta_1 - \theta_0)^2 + (\theta_2 - \theta_0)^2 = 0$ is the same as saying $D_2 = \theta_1 - \theta_2 = 0$. (Try proving this; it's simple algebra.)

2) You can always substitute a method for the median for a method for the mean, but not vice versa. However, if a Normal distribution applies, a method involving means will be more efficient and powerful.

3) The computer will use Method D3 when it is not told what method to use. This is quite general because if the sample variances are similar, it gives results like D2, and if the sample sizes are large, it gives results like D1. However, if variances are equal, D2 is easier to use, and if the samples are large, D1 is easier to use.

4) The K-S and Lilliefors methods only exist because chi-square performs so poorly for small samples. K-S needs $\mu$, $\sigma$ or other parameters. Lilliefors uses $\bar x$ or $s$ and only works to test for a Normal distribution.

5) 'Significant' in statistics means that we have rejected a hypothesis like $H_0: \theta = 0$, and 'significantly different' means that we have rejected a hypothesis like $H_0: \theta_1 = \theta_2$.
Of course, if two parameters are significantly different, their difference is significant.

6) Be careful of inequalities. If $\mu_1 < \mu_2$ or $\mu_2 > \mu_1$ and $D = \mu_1 - \mu_2$, then $D < 0$. Please remember: a hypothesis containing $<$, $>$ or $\ne$ is an alternative hypothesis.

7) In most problems you are better off trying to figure out what the alternative hypothesis is before you try to state the null hypothesis.

8) Do not lose sight of the fact that the purpose of samples is to compare populations. We may look at numbers in Method D6b and chi-squared tests, but our purpose is to deal with proportions of a population.

Part 1.

1. You have data on income in two villages ($x_1$ in village 1, $x_2$ in village 2). You want to test the hypothesis that village 2 has higher earnings than village 1. You know that income has an extremely skewed distribution and you have to decide whether to use the mean or the median income.

Solution: If $\eta$ is the median, $H_0: \eta_1 \ge \eta_2$, $H_1: \eta_1 < \eta_2$. Since we are comparing medians and the data are not paired, use Method D5a.

Question for a Later Exam: What if we want to compare three or more villages?
Solution: Kruskal-Wallis.

2. You have a sample of earned incomes for 25 couples, both of whom are teachers. ($x_1$ is the women's incomes in a column, $x_2$ is the men's. Each line represents one couple.) Test to see if the men make more than the women.

Solution: If $\eta$ is the median, $H_0: \eta_1 \ge \eta_2$, $H_1: \eta_1 < \eta_2$. Since we are comparing medians and the data are paired, use Method D5b.

Question for a Later Exam: What if we want to compare the incomes of 25 members each of three different ethnic groups? Each of the 25 lines of our table has three incomes, one for an individual of each group, but the individuals on each line have been matched for education, experience and personality.
Solution: Friedman.

3. You have interviewed a sample of 80 small businesses in the Northeast and 75 small businesses in the Southeast. Each business has indicated whether they sell in foreign markets.
60 firms in the Northeast and 50 in the Southeast export. You want to show that businesses in the Northeast are more likely to export. ($x_1$ is the total number of firms that export in the Northeast sample, $x_2$ in the Southeast.)

Solution: With sample proportions $\bar p_1 = x_1/n_1$ and $\bar p_2 = x_2/n_2$, the hypotheses are
$H_0: p_1 \le p_2$, $H_1: p_1 > p_2$
or
$H_0: p_1 - p_2 \le 0$, $H_1: p_1 - p_2 > 0$.
If $\Delta p = p_1 - p_2$, then
$H_0: \Delta p \le 0$, $H_1: \Delta p > 0$.
Since we are comparing proportions, use Method D6a.

4. You interview a sample of 57 Pennsylvania businesses in 2002 and reinterview the same sample in 2007. You ask them whether they export. Your data consists of two items for each firm: whether they exported in 2002 and whether they exported in 2007. You want to show that the proportion exporting has increased. Of the 57 firms, 30 exported in both years and 10 did not export in the first year, but did so in the second. 4 firms discontinued exports after 2002.

Solution: This can be called a paired comparison of proportions and the method is D6b. Let $p_1$ represent the proportion of the population that exported in 2002 and $p_2$ represent the proportion of the population that exported in 2007. We want to test for $p_2 > p_1$ or $p_1 < p_2$. Whichever way we write it, it's an alternative hypothesis because it contains no equality. Let $x_{11}$ be those who exported in 2002 and 2007, $x_{12}$ those who exported in 2002 but not in 2007, $x_{21}$ the number that exported in 2007 but not 2002 and $x_{22}$ those who never exported. Our hypotheses are given along with the table to be analyzed.
$H_0: p_1 \ge p_2$, $H_1: p_1 < p_2$
or, if $\Delta p = p_1 - p_2$,
$H_0: \Delta p \ge 0$, $H_1: \Delta p < 0$.

                  question 2
question 1      yes       no
   yes          x11       x12
   no           x21       x22

5. You expand the sample in 3 by adding 60 small businesses in the Midwest ($x_3$ is the number of these that export). You test the hypothesis that the same fraction of businesses export in each region.

Solution: With sample proportions $\bar p_1 = x_1/n_1$, $\bar p_2 = x_2/n_2$ and $\bar p_3 = x_3/n_3$, the hypotheses are
$H_0: p_1 = p_2 = p_3$, $H_1$: not all $p$s equal.
This is a chi-squared test of homogeneity. Since we are comparing multiple proportions, use a chi-squared test.

6. You have profit rates, $x_1$, for a sample of 20 pharmaceutical firms in Europe and profit rates, $x_2$, for a sample of 17 pharmaceutical firms in the US. You believe that they are normally distributed and you wish to see whether the European firms were more profitable than the American firms.

Solution:
$H_0: \mu_1 \le \mu_2$, $H_1: \mu_1 > \mu_2$
or
$H_0: \mu_1 - \mu_2 \le 0$, $H_1: \mu_1 - \mu_2 > 0$.
If $D = \mu_1 - \mu_2$, then
$H_0: D \le 0$, $H_1: D > 0$.
Because you believe that the Normal distribution applies, you use a method that compares means. The total sample size is too small to use Method D1, which means that D2 or D3 should work. You could test the variances for equality and use D2, or not bother and use D3.

Question for a Later Exam: What if we want to compare three or more groups of firms?
Solution: One-way ANOVA.

7. In order to see which garage to use under contract for automobile repairs, 35 cars are towed first to garage 1 and then to garage 2. You end up with two data sets: the first data column, $x_1$, is estimates from the first garage and the second data column, $x_2$, is estimates from the second garage. Each of the 35 lines of data refers to one car. You believe that the estimates are approximately normally distributed. Compare the estimates in garage 1 and 2.

Solution: There is no reason to assume that one garage is cheaper than the other, so
$H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$
or
$H_0: \mu_1 - \mu_2 = 0$, $H_1: \mu_1 - \mu_2 \ne 0$.
If $D = \mu_1 - \mu_2$, then
$H_0: D = 0$, $H_1: D \ne 0$.
Again, you compare means because you are, presumably, interested in the total amount that you will pay for the repairs, which means that you want the lowest average cost. The important thing to notice here is that the data are in pairs, so you use Method D4.

Question for a Later Exam: What if we want to check 3 garages?
Solution: 2-way ANOVA, with one measurement per cell.

8. You are having a part produced in two different machines.
$x_1$ is 200 randomly selected data points that represent the length of parts from machine one, $x_2$ is 200 randomly selected data points that represent the length of parts from machine two. You want to test your suspicion that parts from machine 2 are longer than parts from machine 1. In a problem of this type you would assume that the lengths are normally distributed.

Solution: You could use Method D2 (if you tested the variances for equality) or D3 here, but, since you have two large samples, it would be far easier to use Method D1.
$H_0: \mu_1 \ge \mu_2$, $H_1: \mu_1 < \mu_2$
or
$H_0: \mu_1 - \mu_2 \ge 0$, $H_1: \mu_1 - \mu_2 < 0$.
If $D = \mu_1 - \mu_2$, then
$H_0: D \ge 0$, $H_1: D < 0$.

9. You also suspect that parts from machine two are more variable in length than parts from machine one. (This is the same as saying that machine 2 is less reliable than machine 1.) Test this suspicion.

Solution:
$H_0: \sigma_1^2 \ge \sigma_2^2$, $H_1: \sigma_1^2 < \sigma_2^2$
or
$H_0: \sigma_1 \ge \sigma_2$, $H_1: \sigma_1 < \sigma_2$.
In terms of the variance ratio $\sigma_1^2/\sigma_2^2$ or $\sigma_2^2/\sigma_1^2$, the alternate hypothesis rules, so $H_0: \sigma_2^2/\sigma_1^2 \le 1$ and $H_1: \sigma_2^2/\sigma_1^2 > 1$. Since you are comparing variances, use Method D7.

Question for a Later Exam: What if you doubt that the Normal distribution applies?
Solution: Levene Test.

Question for a Later Exam: What if there are 3 machines?
Solution: Levene or Bartlett Test.

10. You are going to do the exercise in 8) again, but this time you have done a test like that in Exercise 9) and not rejected your null hypothesis. However, you have only 30 lengths from each machine.

Solution: Hypotheses are the same as for 8. Because this is a small sample and we have found that the variances are equal, we can use D2.

Summary of methods:

                      Location, Normal distribution.     Location, distribution not Normal.
                      Compare means.                     Compare medians.
Paired samples        Method D4                          Method D5b
Independent samples   Methods D1-D3                      Method D5a
                      (D1 large samples,
                       D2 equal variances,
                       D3 more general)
Proportions           Method D6
Variability, Normal distribution. Compare variances.     Method D7

Part 2: Do problems 3 and 4, using a 95% confidence level.
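Find p-values.

3. You have interviewed a sample of 80 small businesses in the Northeast and 75 small businesses in the Southeast. Each business has indicated whether they sell in foreign markets. 60 firms in the Northeast and 50 in the Southeast export. You want to show that businesses in the Northeast are more likely to export. ($x_1$ is the total number of firms that export in the Northeast sample, $x_2$ in the Southeast.)

Solution: With sample proportions $\bar p_1 = x_1/n_1$ and $\bar p_2 = x_2/n_2$, the hypotheses are
$H_0: p_1 \le p_2$, $H_1: p_1 > p_2$
or
$H_0: p_1 - p_2 \le 0$, $H_1: p_1 - p_2 > 0$.
If $\Delta p = p_1 - p_2$, then
$H_0: \Delta p \le 0$, $H_1: \Delta p > 0$.
Since we are comparing proportions, use Method D6.

The formula table for a difference between proportions, $\Delta p = p_1 - p_2$ (with $q = 1 - p$), reads:

Confidence interval: $\Delta p = \Delta\bar p \pm z_{\alpha/2}\, s_{\Delta p}$, where $\Delta\bar p = \bar p_1 - \bar p_2$ and $s_{\Delta p} = \sqrt{\frac{\bar p_1 \bar q_1}{n_1} + \frac{\bar p_2 \bar q_2}{n_2}}$.
Hypotheses: $H_0: \Delta p = \Delta p_0$, $H_1: \Delta p \ne \Delta p_0$, where $\Delta p_0 = p_{01} - p_{02}$ or $\Delta p_0 = 0$.
Test ratio: $z = \frac{\Delta\bar p - \Delta p_0}{\sigma_{\Delta p}}$.
   If $\Delta p_0 = 0$: $\sigma_{\Delta p} = \sqrt{\bar p_0 \bar q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, with $\bar p_0 = \frac{n_1 \bar p_1 + n_2 \bar p_2}{n_1 + n_2}$.
   If $\Delta p_0 \ne 0$: $\sigma_{\Delta p} = \sqrt{\frac{p_{01} q_{01}}{n_1} + \frac{p_{02} q_{02}}{n_2}}$, or use $s_{\Delta p}$.
Critical value: $\Delta\bar p_{cv} = \Delta p_0 \pm z_{\alpha/2}\, \sigma_{\Delta p}$.

So $\alpha = .05$, $z_\alpha = z_{.05} = 1.645$, $x_1 = 60$, $n_1 = 80$, $x_2 = 50$ and $n_2 = 75$. This means
$\bar p_1 = 60/80 = .7500$, $\bar q_1 = .2500$,
$\bar p_2 = 50/75 = .6667$, $\bar q_2 = .3333$,
$\Delta\bar p = .7500 - .6667 = .0833$,
$\bar p_0 = \frac{60 + 50}{80 + 75} = \frac{110}{155} = .7097$, $\bar q_0 = .2903$,
$\sigma_{\Delta p} = \sqrt{.7097(.2903)\left(\frac{1}{80} + \frac{1}{75}\right)} = \sqrt{.7097(.2903)(0.02583)} = \sqrt{0.005322} = .072954$,
$s_{\Delta p} = \sqrt{\frac{.7500(.2500)}{80} + \frac{.6667(.3333)}{75}} = \sqrt{.002344 + .002964} = \sqrt{.005307} = .072856$.

Test ratio: $z = \frac{\Delta\bar p - \Delta p_0}{\sigma_{\Delta p}} = \frac{.0833 - 0}{.072954} = 1.142$. Make a diagram of a Normal curve with 0 in the middle and a 'reject' region above 1.645. Since 1.142 is not in the 'reject' region, do not reject the null hypothesis. Since the alternative hypothesis is $H_1: \Delta p > 0$, the p-value is the probability that $\Delta\bar p$ is greater than or equal to .0833: p-value $= P(z \ge 1.142) = .5 - .3729 = .1271$. Because this is above .05, we cannot reject the null hypothesis.

Critical value: Since the alternative hypothesis is $H_1: \Delta p > 0$, the critical value must be above zero. $\Delta\bar p_{cv} = 0 + 1.645(.072954) = .1200$. Make a diagram of a Normal curve with 0 in the middle and a 'reject' region above .1200. Since $\Delta\bar p = .0833$ is not in the 'reject' region, do not reject the null hypothesis.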
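The test-ratio and critical-value arithmetic above can be checked with a short script. This is only a sketch under stated assumptions: the function names `normal_cdf` and `two_proportion_test` are made up for illustration, and the Normal probability is computed from the error function rather than read from a Normal table, so the p-value comes out close to, but not exactly equal to, the table-based .1271.

```python
from math import sqrt, erf


def normal_cdf(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def two_proportion_test(x1, n1, x2, n2, z_alpha=1.645):
    """Right-tailed test of H0: p1 <= p2 against H1: p1 > p2 (delta_p0 = 0)."""
    p1_bar = x1 / n1
    p2_bar = x2 / n2
    delta_p_bar = p1_bar - p2_bar
    # Pooled proportion, used because the null hypothesis says delta_p0 = 0.
    p0_bar = (x1 + x2) / (n1 + n2)
    sigma = sqrt(p0_bar * (1.0 - p0_bar) * (1.0 / n1 + 1.0 / n2))
    z = (delta_p_bar - 0.0) / sigma          # test ratio
    p_value = 1.0 - normal_cdf(z)            # upper-tail probability
    critical_value = 0.0 + z_alpha * sigma   # reject if delta_p_bar exceeds this
    return {
        "delta_p_bar": delta_p_bar,
        "sigma": sigma,
        "z": z,
        "p_value": p_value,
        "critical_value": critical_value,
        "reject": z > z_alpha,
    }


# Problem 3 data: 60 of 80 Northeast firms and 50 of 75 Southeast firms export.
result = two_proportion_test(60, 80, 50, 75)
for name, value in result.items():
    print(name, value)
```

Both decision rules in the function (comparing z with 1.645, or comparing the observed difference with the critical value) are the same test written two ways, so they always agree.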
Confidence interval: $\Delta p \ge \Delta\bar p - z_\alpha\, s_{\Delta p} = .0833 - 1.645(.072856) = -.0365$. Make a diagram of a Normal curve with .0833 in the middle and represent the confidence interval by shading the entire region above -.0365. Since $\Delta p_0 = 0$ is in the confidence interval, do not reject the null hypothesis. Even better, represent the null hypothesis $H_0: \Delta p \le 0$ by shading the area below zero and note that this overlaps the confidence interval.

4. You interview a sample of 57 Pennsylvania businesses in 2002 and reinterview the same sample in 2007. You ask them whether they export. Your data consists of two items for each firm: whether they exported in 2002 and whether they exported in 2007. You want to show that the proportion exporting has increased. Of the 57 firms, 30 exported in both years and 10 did not export in the first year, but did so in the second. 4 firms discontinued exports after 2002.

Solution: This can be called a paired comparison of proportions and the method is D6b, the McNemar Test. Let $p_1$ represent the proportion of the population that exported in 2002 and $p_2$ represent the proportion of the population that exported in 2007. We want to test for $p_2 > p_1$ or $p_1 < p_2$. Whichever way we write it, it's an alternative hypothesis because it contains no equality. Let $x_{11} = 30$ be those who exported in 2002 and 2007, $x_{12} = 4$ those who exported in 2002 but not in 2007, $x_{21} = 10$ the number that exported in 2007 but not 2002 and $x_{22} = 57 - 30 - 10 - 4 = 13$ those who never exported. Our hypotheses are given along with the table to be analyzed.
$H_0: p_1 \ge p_2$, $H_1: p_1 < p_2$
or, if $\Delta p = p_1 - p_2$,
$H_0: \Delta p \ge 0$, $H_1: \Delta p < 0$.

The general design of the table is

                  question 2
question 1      yes       no
   yes          x11       x12
   no           x21       x22

and in this case it is

                  2007
2002            yes       no
   yes           30        4
   no            10       13

We will compare
$z = \frac{x_{12} - x_{21}}{\sqrt{x_{12} + x_{21}}} = \frac{4 - 10}{\sqrt{4 + 10}} = \frac{-6}{\sqrt{14}} = -\sqrt{2.5714} = -1.604$
against $-z_\alpha = -1.645$. Make a diagram of a Normal curve with 0 in the middle and a 'reject' region below -1.645. Since -1.604 is not in the 'reject' region, do not reject the null hypothesis.
Since the alternative hypothesis is $H_1: \Delta p < 0$, the p-value is the probability that $\Delta\bar p$ is less than or equal to the value it actually takes: p-value $= P(z \le -1.604) = .5 - .4452 = .0548$. Because this is above .05, we cannot reject the null hypothesis.
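
As a check on the McNemar arithmetic in problem 4, the following sketch recomputes the test ratio and the left-tail p-value. The function names `normal_cdf` and `mcnemar_z` are invented for this illustration, and the cell counts are taken from the problem statement (4 firms exported in 2002 only, 10 exported in 2007 only). The probability is computed from the error function, so it is near the table-based .0548 without matching it digit for digit.

```python
from math import sqrt, erf


def normal_cdf(z):
    """Standard Normal cumulative probability P(Z <= z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def mcnemar_z(x12, x21):
    """Test ratio for paired proportions; only the off-diagonal cells are used."""
    return (x12 - x21) / sqrt(x12 + x21)


# Off-diagonal cells of the 2002-by-2007 table.
x12 = 4    # exported in 2002 but not in 2007
x21 = 10   # exported in 2007 but not in 2002

z = mcnemar_z(x12, x21)
p_value = normal_cdf(z)    # left tail, because H1 is delta_p < 0
reject = z < -1.645        # 'reject' region lies below -z_.05

print("z =", z)
print("p-value =", p_value)
print("reject H0:", reject)
```

Note that the 30 firms that exported in both years and the 13 that never exported do not enter the calculation; only the firms that changed status carry information about a change in the proportion.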