Chapter 12 Analysis of Categorical Data LEARNING OBJECTIVES This chapter presents several nonparametric statistics that can be used to analyze data enabling you to: 1. 2. Understand the chi-square goodness-of-fit test and how to use it. Analyze data using the chi-square test of independence. CHAPTER OUTLINE 12.1 Chi-Square Goodness-of-Fit Test Testing a Population Proportion by Using the Chi-square Goodness-of-Fit Test as an Alternative Technique to the z Test 12.2 Contingency Analysis: Chi-Square Test of Independence KEY WORDS categorical data chi-square distribution chi-square goodness-of-fit test chi-square test of independence contingency analysis contingency table STUDY QUESTIONS 1. Statistical techniques based on assumptions about the population from which the sample data are selected are called _______________ statistics. 2. Statistical techniques based on fewer assumptions about the population and the parameters are called _______________ statistics. 3. A chi-square goodness-of-fit test is being used to determine if the observed frequencies from seven categories are significantly different from the expected frequencies from the seven categories. The degrees of freedom for this test are _______________. 4. A value of alpha = .05 is used to conduct the test described in question 3. The critical table chi-square value is _______________. 5. A variable contains five categories. It is expected that data are uniformly distributed across these five categories. To test this, a sample of observed data is gathered on this variable resulting in frequencies of 27, 30, 29, 21, 24. A value of .01 is specified for alpha. The degrees of freedom for this test are _______________. 6. The critical table chi-square value of the problem presented in question 5 is _______________. 221 222 Solutions Manual and Study Guide 7. The observed chi-square value for the problem presented in question five is _______________. Based on this value and the critical chi-square value, a researcher would decide to _______________ the null hypothesis. 8. A researcher believes that a variable is Poisson distributed across six categories. To test this, a random sample of observations is made for the variable resulting in the following data: Number of arrivals 0 1 2 3 4 5 Observed 47 56 38 23 15 12 Suppose alpha is .10, the critical table chi-square value used to conduct this chi-square goodness-offit test is _______________. 9. The value of the observed chi-square for the data presented in question 8 is _______________. Based on this value and the critical value determined in question 8, the decision of the researcher is to _______________ the null hypothesis. 10. The degrees of freedom used in conducting a chi-square goodness-of-fit test to determine if a distribution is normally distributed are _______________. 11. In using the chi-square goodness-of-fit test, a statistician needs to make certain that none of the expected values are less than _______________. 12. Suppose we want to test the following hypotheses using a chi-square goodness-of-fit test. H0: p = .20 and Ha: p .20 A sample of 150 data values is taken resulting in 37 items that possess the characteristic of interest. Let = .05. The degrees of freedom for this test are ____________. The critical chi-square value is ____________. 13. The calculated value of chi-square for question 12 is ________________. The decision is to _____________________________________. 14. The chi-square ____________________ is used to analyze frequencies of two variables with multiple categories. 15. A two-way frequency table is sometimes referred to as a _______________ table. Chapter 12: Analysis of Categorical Data 223 16. Suppose a researcher wants to use the data below and the chi-square test of independence to determine if variable one is independent of variable two. Variable One A B C Variable D 25 40 60 Two E 10 15 20 The expected value for the cell of D and B is _______________. 17. The degrees of freedom for the problem presented in question 16 are _______________. 18. If alpha is .05, the critical chi-square value for the problem presented in question 16 is _______________. 19. The observed value of chi-square for the problem presented in question 16 is _______________. Based on this observed value of chi-square and the critical chi-square value determined in question 18, the researcher should decide to _______________ the null hypothesis that the two variables are independent. 20. A researcher wants to statistically determine if variable three is independent of variable four using the observed data given below: Variable Four C D Variable Three A B 92 70 112 145 If alpha is .01, the critical chi-square table value for this problem is _______________. 21. The observed chi-square value for the problem presented in question 20 is _______________. Based on this value and the critical value determined in question 20, the researcher should decide to _______________ the null hypothesis. 224 Solutions Manual and Study Guide ANSWERS TO STUDY QUESTIONS 1. Parametric Statistics 2. Nonparametric Statistics 3. 6 4. 12.5916 5. 4 6. 13.2767 7. 2.091, Fail to Reject 8. 7.77944 9. 14.8, Reject 10. k – 3 11. 5 12. 1, 3.8416 13. 2.041, Fail to Reject 14. Test of Independence 15. Contingency 16. 40.44 17. 2 18. 5.99147 19. .19, Fail to Reject 20. 6.6349 21. 6.945, Reject Chapter 12: Analysis of Categorical Data SOLUTIONS TO ODD-NUMBERED PROBLEMS IN CHAPTER 12 12.1 f0 ( f0 fe )2 f0 fe 53 37 32 28 18 15 68 42 33 22 10 8 3.309 0.595 0.030 1.636 6.400 6.125 Ho: The observed distribution is the same as the expected distribution. Ha: The observed distribution is not the same as the expected distribution. ( f0 fe )2 Observed = 18.095 fe 2 df = k – 1 = 6 – 1 = 5, = .05 2.05,5 = 11.07 Since the observed 2 = 18.095 > 2.05,5 = 11.07, the decision is to reject the null hypothesis. The observed frequencies are not distributed the same as the expected frequencies. 12.3 Number f0 (Number)(f0) 0 1 2 3 28 17 11 5 0 17 22 15 54 Ho: The frequency distribution is Poisson. Ha: The frequency distribution is not Poisson. = 54 =0.9 61 Number 0 1 2 >3 Expected Probability .4066 .3659 .1647 .0628 Expected Frequency 24.803 22.312 10.047 3.831 225 226 Solutions Manual and Study Guide Since fe for > 3 is less than 5, collapse categories 2 and >3: Number fo fe ( f0 fe )2 f0 0 1 >2 28 17 16 61 24.803 22.312 13.878 60.993 0.412 1.265 0.324 2.001 df = k – 2 = 3 – 2 = 1, = .05 2.05,1 = 3.84146 ( f0 fe )2 Observed = 2.001 fe 2 Since the observed 2 = 2.001 < 2.05,1 = 3.84146, the decision is to fail to reject the null hypothesis. There is insufficient evidence to reject the distribution as Poisson distributed. The conclusion is that the distribution is Poisson distributed. 12.5 Definition fo Exp.Prop. fe Happiness Sales/Profit Helping Others Achievement/ Challenge 42 95 27 .39 .12 .18 227(.39)= 88.53 227(.12)= 27.24 40.86 63 227 .31 70.34 ( f0 fe )2 f0 24.46 168.55 4.70 0.77 198.48 Ho: The observed frequencies are distributed the same as the expected frequencies. Ha: The observed frequencies are not distributed the same as the expected frequencies. Observed 2 = 198.48 df = k – 1 = 4 – 1 = 3, = .05 2.05,3 = 7.81473 Since the observed 2 = 198.48 > 2.05,3 = 7.81473, the decision is to reject the null hypothesis. The observed frequencies for men are not distributed the same as the expected frequencies which are based on the responses of women. Chapter 12: Analysis of Categorical Data 12.7 Age 10-20 20-30 30-40 40-50 50-60 60-70 x fo 16 44 61 56 35 19 231 fM n fM s = m 15 25 35 45 55 65 fm 240 1,100 2,135 2,520 1,925 1,235 fm = 9,155 2 ( fM ) 2 n 1 n (9,155) 2 405,375 231 = 13.6 230 For Category 10-20 Prob 10 39.63 = –2.18 13.6 20 39.63 z = = –1.44 13.6 z = .4854 –.4251 Expected prob. .0603 For Category 20-30 Prob for x = 20, z = –1.44 .4251 30 39.63 = –0.71 13.6 Expected prob. For Category 30-40 for x = 30, z = –0.71 z = 3,600 27,500 74,725 113,400 105,875 80,275 fm2 = 405,375 9,155 = 39.63 231 Ho: The observed frequencies are normally distributed. Ha: The observed frequencies are not normally distributed. z = fm2 40 39.63 = 0.03 13.6 –.2611 .1640 Prob .2611 +.0120 Expected prob. .2731 227 228 Solutions Manual and Study Guide For Category 40-50 z = Prob 50 39.63 = 0.76 13.6 .2764 –.0120 for x = 40, z = 0.03 Expected prob. For Category 50-60 z = Prob 60 39.63 = 1.50 13.6 .4332 –.2764 for x = 50, z = 0.76 Expected prob. For Category 60-70 z = .2644 .1568 Prob 70 39.63 = 2.23 13.6 .4871 –.4332 for x = 60, z = 1.50 Expected prob. .0539 For < 10: Probability between 10 and the mean = .0603 + .1640 + .2611 = .4854 Probability < 10 = .5000 – .4854 = .0146 For > 70: Probability between 70 and the mean = .0120 + .2644 + .1568 + .0539 = .4871 Probability > 70 = .5000 – .4871 = .0129 Age < 10 10-20 20-30 30-40 40-50 50-60 60-70 > 70 Probability .0146 .0603 .1640 .2731 .2644 .1568 .0539 .0129 fe (.0146)(231) = 3.37 (.0603)(231) = 13.93 37.88 63.09 61.08 36.22 12.45 2.98 Categories < 10 and > 70 are less than 5. Collapse the < 10 into 10-20 and > 70 into 60-70. Chapter 12: Analysis of Categorical Data ( f0 fe )2 f0 Age fo fe 10-20 20-30 30-40 40-50 50-60 60-70 16 44 61 56 35 19 17.30 37.88 63.09 61.08 36.22 15.43 = .05 df = k – 3 = 6 – 3 = 3, 0.10 0.99 0.07 0.42 0.04 0.83 2.45 2.05,3 = 7.81473 Observed 2 = 2.45 Since the observed 2 < 2.05,3 = 7.81473, the decision is to fail to reject the null hypothesis. There is no reason to reject that the observed frequencies are normally distributed. 12.9 H0: p = .28 Ha: p .28 n = 270 x = 62 fo Spend More Don't Spend More Total fe ( f0 fe )2 f0 62 270(.28) = 75.6 2.44656 208 270(.72) = 194.4 0.95144 270 270.0 3.39800 The observed value of 2 is 3.398 = .05 and /2 = .025 df = k – 1 = 2 – 1 = 1 2.025,1 = 5.02389 Since the observed 2 = 3.398 < 2.025,1 = 5.02389, the decision is to fail to reject the null hypothesis. 229 230 Solutions Manual and Study Guide 12.11 Variable One Variable Two 24 59 13 43 20 35 57 137 83 56 55 194 Ho: Variable one is independent of Variable Two. Ha: Variable one is not independent of Variable Two. e11 = (83)(57) = 24.39 194 e21 = (56)(57) = 16.45 194 e31 = (55)(57) = 16.16 194 e12 = (83)(13) = 58.61 194 e22 = e32 = (55)(137) = 38.84 194 Variable Two (24.39) Variable 24 One (16.45) 13 (16.16) 20 57 2 = (56)(137) = 39.55 194 (58.61) 59 (39.55) 43 (38.84) 35 137 83 56 55 194 (24 24.39) 2 (59 58.61) 2 (13 16.45) 2 (43 39.55) 2 + + + + 16.45 24.39 58.61 39.55 (20 16.16) 2 (35 38.84) 2 + = .01 + .00 + .72 + .30 + .91 + .38 = 2.32 16.16 38.84 = .05, df = (c – 1)(r – 1) = (2 – 1)(3 – 1) = 2 2.05,2 = 5.99147 Since the observed 2 = 2.32 < 2.05,2 = 5.99147, the decision is to fail to reject the null hypothesis. Variable One is independent of Variable Two. Chapter 12: Analysis of Categorical Data 12.13 Number of Children 0 1 2 or 3 >3 Social Class Lower Middle Upper 7 18 6 9 38 23 34 97 58 47 31 30 97 184 117 31 70 189 108 398 Ho: Social Class is independent of Number of Children. Ha: Social Class is not independent of Number of Children. e11 = (31)(97) = 7.56 398 e12 = (31)(184) = 14.3 398 e32 = (189)(184) = 87.38 398 e13 = (31)(117) = 9.11 398 e33 = (189)(117) = 55.56 398 e21 = (70)(97) = 17.06 398 e41 = (108)(97) = 26.32 398 e22 = (70)(184) = 32.36 398 e42 = (108)(184) = 49.93 398 e23 = (70)(117) = 20.58 398 e43 = (108)(117) = 31.75 398 e31 = (189)(97) = 46.06 398 231 232 Solutions Manual and Study Guide 0 Number of Children 1 2 or 3 >3 2 = Social Class Lower Middle Upper (7.56) (14.33) (9.11) 7 18 6 (17.06) (32.36) (20.58) 9 38 23 (46.06) (87.38) (55.56) 34 97 58 (26.32) (49.93) (31.75) 47 31 30 97 184 117 31 70 189 108 398 (7 7.56) 2 (18 14.33) 2 (6 9.11) 2 (9 17.06) 2 + + + + 14.33 17.06 7.56 9.11 (38 32.36) 2 (23 20.58) 2 (34 46.06) 2 (97 87.38) 2 + + + + 46.06 32.36 20.58 87.38 (58 55.56) 2 (47 26.32) 2 (31 49.93) 2 (30 31.75) 2 + + + = 26.32 55.56 49.93 31.75 .04 + .94 + 1.06 + 3.81 + .98 + .28 + 3.16 + 1.06 + .11 + 16.25 + 7.18 + .10 = 34.97 = .05, df = (c – 1)(r – 1) = (3 – 1)(4 – 1) = 6 2.05,6 = 12.5916 Since the observed 2 = 34.97 > 2.05,6 = 12.5916, the decision is to reject the null hypothesis. Number of children is not independent of social class. Chapter 12: Analysis of Categorical Data 12.15 Industry Transportation Mode Air Train Publishing 32 12 Comp.Hard. 5 6 37 18 Truck 41 24 65 85 35 120 H0: Transportation Mode is independent of Industry. Ha: Transportation Mode is not independent of Industry. e11 = (85)(37) = 26.21 120 e21 = (35)(37) = 10.79 120 e12 = (85)(18) = 12.75 120 e22 = (35)(18) = 5.25 120 e13 = (85)(65) = 46.04 120 e23 = (35)(65) = 18.96 120 Industry 2 = Transportation Mode Air Train Publishing (26.21) (12.75) 32 12 Comp.Hard. (10.79) (5.25) 5 6 37 18 Truck (46.04) 41 (18.96) 24 65 85 35 120 (32 26.21) 2 (12 12.75) 2 (41 46.04) 2 + + + 12.75 26.21 46.04 (5 10.79) 2 (6 5.25) 2 (24 18.96) 2 + + = 10.79 5.25 18.96 1.28 + .04 + .55 + 3.11 + .11 + 1.34 = 6.43 = .05, df = (c – 1)(r – 1) = (3 – 1)(2 – 1) = 2 2.05,2 = 5.99147 Since the observed 2 = 6.431 > 2.05,2 = 5.99147, the decision is to reject the null hypothesis. Transportation mode is not independent of industry. 233 234 Solutions Manual and Study Guide 12.17 Type of Store Mexican Citizens Yes No Dept. 24 17 Disc. 20 15 Hard. 11 19 Shoe 32 28 87 79 41 35 30 60 166 Ho: Citizenship is independent of store type Ha: Citizenship is not independent of store type e11 = (41)(87) = 21.49 166 e31 = (30)(87) = 15.72 166 e12 = (41)(79) = 19.51 166 e32 = (30)(79) = 14.28 166 e21 = (35)(87) = 18.34 166 e41 = (60)(87) = 31.45 166 e22 = (35)(79) = 16.66 166 e42 = (60)(79) = 28.55 166 Chapter 12: Analysis of Categorical Data Type of Store Mexican Citizens Yes No Dept. (21.49) (19.51) 24 17 Disc. (18.34) (16.66) 20 15 Hard. (15.72) (14.28) 11 19 Shoe (31.45) (28.55) 32 28 87 79 41 35 30 60 166 (24 21.49) 2 (17 19.51) 2 (20 18.34) 2 (15 16.66) 2 = + + + 21.49 19.51 16.66 18.34 2 + (11 15.72) 2 (19 14.28) 2 (32 31.45) 2 (28 28.55) 2 + + + = 15.72 14.28 28.55 31.45 .29 + .32 + .15 + .17 + 1.42 + 1.56 + .01 + .01 = 3.93 = .05, df = (c – 1)(r – 1) = (2 – 1)(4 – 1) = 3 2.05,3 = 7.81473 Since the observed 2 = 3.93 < 2.05,3 = 7.81473, the decision is to fail to reject the null hypothesis. Citizenship is independent of type of store. 235 236 Solutions Manual and Study Guide 12.19 12 8 7 27 Variable 1 Variable 2 23 17 11 51 e11 = 11.00 e12 = 20.85 e13 = 24.12 e21 = 8.87 e22 = 16.75 e23 = 19.38 e31 = 7.09 e32 = 13.40 e33 = 15.50 21 20 18 59 56 45 36 137 (12 11.04)2 (23 20.85) 2 (21 24.12) 2 (8 8.87) 2 = + + + + 20.85 11.04 24.12 8.87 2 (17 16.75) 2 (20 19.38) 2 (7 7.09) 2 (11 13.40) 2 + + + + 13.40 16.75 19.38 7.09 (18 15.50) 2 = 15.50 .084 + .222 + .403 + .085 + .004 + .020 + .001 + .430 + .402 = 1.652 df = (c – 1)(r – 1) = (2)(2) = 4 = .05 2.05,4 = 9.48773 Since the observed value of 2 = 1.652 < 2.05,4 = 9.48773, the decision is to fail to reject the null hypothesis. 12.21 Cookie Type Chocolate Chip Peanut Butter Cheese Cracker Lemon Flavored Chocolate Mint Vanilla Filled Ho: Ha: fo 189 168 155 161 216 165 fo = 1,054 Cookie Sales is uniformly distributed across kind of cookie. Cookie Sales is not uniformly distributed across kind of cookie. If cookie sales are uniformly distributed, then fe = f 0 no.kinds 1,054 = 6 175.67 Chapter 12: Analysis of Categorical Data fo fe 189 168 155 161 216 165 237 ( f0 fe )2 f0 175.67 175.67 175.67 175.67 175.67 175.67 1.01 0.33 2.43 1.23 9.26 0.65 14.91 The observed 2 = 14.91 = .05 df = k – 1 = 6 – 1 = 5 2.05,5 = 11.0705 Since the observed 2 = 14.91 > 2.05,5 = 11.0705, the decision is to reject the null hypothesis. Cookie Sales is not uniformly distributed by kind of cookie. 12.23 Arrivals 0 1 2 3 4 5 6 = fo 26 40 57 32 17 12 8 fo = 192 ( f )(arrivals ) 426 192 f 0 (fo)(Arrivals) 0 40 114 96 68 60 48 (fo)(arrivals) = 426 = 2.2 0 Ho: The observed frequencies are Poisson distributed. Ha: The observed frequencies are not Poisson distributed. Arrivals 0 1 2 3 4 5 6 Probability .1108 .2438 .2681 .1966 .1082 .0476 .0249 fe (.1108)(192) = 21.27 (.2438)(192) = 46.81 51.48 37.75 20.77 9.14 4.78 238 Solutions Manual and Study Guide fo fe 26 40 57 32 17 12 8 ( f0 fe )2 f0 21.27 46.81 51.48 37.75 20.77 9.14 4.78 1.05 2.18 0.59 0.88 0.68 0.89 2.17 8.44 Observed 2 = 8.44 = .05 df = k – 2 = 7 – 2 = 5 2.05,5 = 11.0705 Since the observed 2 = 8.44 < 2.05,5 = 11.0705, the decision is to fail to reject the null hypothesis. There is not enough evidence to reject the claim that the observed frequency of arrivals is Poisson distributed. 12.25 Position Years 0-3 4-8 >8 Manager 6 28 47 81 Programmer 37 16 10 63 Operator 11 23 12 46 Systems Analyst 13 24 19 56 e11 = (67)(81) = 22.06 246 e23 = (91)( 46) = 17.02 246 e12 = (67)(63) = 17.16 246 e24 = (91)(56) = 20.72 246 e13 = (67)( 46) = 12.53 246 e31 = e14 = (67)(56) = 15.25 246 e32 = (88)(63) = 22.54 246 e21 = (91)(81) = 29.96 246 e33 = (88)( 46) = 16.46 246 e22 = (91)( 63) = 23.30 246 e34 = (88)(56) = 20.03 246 (88)(81) = 28.98 246 67 91 88 246 Chapter 12: Analysis of Categorical Data 239 Position 0-3 Years 4-8 >8 Manager (22.06) 6 (29.96) 28 (28.98) 47 81 Programmer (17.16) 37 (23.30) 16 (22.54) 10 63 Operator (12.53) 11 (17.02) 23 (16.46) 12 46 Systems Analyst (15.25) 13 (20.72) 24 (20.03) 19 56 67 91 88 246 (6 22.06) 2 (37 17.16) 2 (11 12.53) 2 (13 15.25) 2 = + + + + 22.06 12.53 15.25 17.16 2 (28 29.96) 2 (16 23.30) 2 (23 17.02) 2 (24 20.72) 2 + + + + 29.96 20.72 23.30 17.02 (47 28.98) 2 (10 22.54) 2 (12 16.46)2 (19 20.03)2 + + + = 28.98 22.54 16.46 20.03 11.69 + 22.94 + .19 + .33 + .13 + 2.29 + 2.1 + .52 + 11.2 + 6.98 + 1.21 + .05 = 59.63 = .01 df = (c – 1)(r – 1) = (4 – 1)(3 – 1) = 6 2.01,6 = 16.8119 Since the observed 2 = 59.63 > 2.01,6 = 16.8119, the decision is to reject the null hypothesis. Position is not independent of number of years of experience. 240 Solutions Manual and Study Guide 12.27 Number of Children 0 1 2 >3 Type of College or University Community Large Small College University College 25 178 31 49 141 12 31 54 8 22 14 6 127 387 57 234 202 93 42 571 Ho: Number of Children is independent of Type of College or University. Ha: Number of Children is not independent of Type of College or University. e11 = (234)(127) = 52.05 571 e31 = (93)(127) = 20.68 571 e12 = (234)(387) = 158.60 571 e32 = (193)(387 ) = 63.03 571 e13 = (234)(57) = 23.36 571 e33 = (93)(57) = 9.28 571 e21 = (202)(127) = 44.93 571 e41 = (42)(127) = 9.34 571 e22 = (202)(387) = 136.91 571 e42 = (42)(387) = 28.47 571 e23 = (202)(57) = 20.16 571 e43 = (42)(57) = 4.19 571 Chapter 12: Analysis of Categorical Data Number of Children 0 1 2 >3 Type of College or University Community Large Small College University College (52.05) (158.60) (23.36) 25 178 31 (44.93) (136.91) (20.16) 49 141 12 (20.68) (63.03) (9.28) 31 54 8 (9.34) (28.47) (4.19) 22 14 6 127 387 57 241 234 202 93 42 571 (25 52.05) 2 (178 158.6) 2 (31 23.36) 2 (49 44.93) 2 = + + + + 52.05 158.6 44.93 23.36 2 (141 136.91) 2 (12 20.16) 2 (31 20.68) 2 (54 63.03) 2 + + + + 136.91 20.16 20.68 63.03 (8 9.28) 2 (22 9.34) 2 (14 28.47) 2 (6 4.19) 2 + + + = 9.34 9.28 28.47 4.19 14.06 + 2.37 + 2.50 + 0.37 + 0.12 + 3.30 + 5.15 + 1.29 + 0.18 + 17.16 + 7.35 + 0.78 = 54.63 = .05, df= (c – 1)(r – 1) = (3 – 1)(4 – 1) = 6 2.05,6 = 12.5916 Since the observed 2 = 54.63 > 2.05,6 = 12.5916, the decision is to reject the null hypothesis. Number of children is not independent of type of College or University. 12.29 The observed chi-square value for this test of independence is 5.366. The associated pvalue of .252 indicates failure to reject the null hypothesis. There is not enough evidence here to say that color choice is dependent upon gender. Automobile marketing people do not have to worry about which colors especially appeal to men or to women because car color is independent of gender. In addition, design and production people can determine car color quotas based on other variables. 242 Solutions Manual and Study Guide