1 Analysis of Categorical Data

The techniques presented in previous sections cover the analysis of numerical data and variables. So far we have not addressed how to analyze categorical variables and data.

1.1 Two-way tables

How should data for categorical variables be presented? Recall that a single categorical variable is described with a relative frequency table. When we have data on two categorical variables and want to illustrate how the two variables depend on each other, we use two-way tables.

Two-Way Tables: Data resulting from observations made on two categorical variables can be easily summarized in a two-way table.

Example: Suppose we are interested in the rate of sprouted seeds for three different kinds of water (rainwater, muddy water, tap water). The two categorical variables are sprouted (yes or no) and water type (rainwater, muddy water, or tap water). Suppose each type of water was used to water 100 seeds, and afterwards it was recorded how many of those seeds sprouted. The result can be presented in the following table:

sprouted    rainwater    muddy water    tap water    Total
yes             64            74            60        198
no              36            26            40        102
total          100           100           100        300

A natural question to ask at this point is whether the choice of water has an effect on the probability that a seed sprouts.

Multiple Comparison

The question in the example could be answered by comparing each type of water with each of the others, i.e. one would have to make the following comparisons:

muddy vs. rain
muddy vs. tap
rain vs. tap

But it is better to first find out whether there is any difference at all between the probabilities. For this purpose we introduce the χ² test for homogeneity.

1.2 χ² Test of Homogeneity and Independence in a Two-Way Table

In this section two different situations and questions concerning two categorical variables are covered. Both will lead to the same test.
So, let us first introduce the two different types of questions and then introduce the test routine to answer these questions.

1. Comparing the distribution of a categorical variable in two or more populations.

We look at two or more samples from different populations and test whether the distributions in all populations are the same. The null hypothesis to be tested in this kind of problem is that the distributions are homogeneous, i.e. they are equal for all populations. For the seed example:

H0 : The probabilities for sprouting and not sprouting are the same for all water types
Ha : The probabilities for sprouting and not sprouting depend on the water type

2. Testing two categorical variables for independence

When you study data that involve two variables, one important consideration is the relationship between the two variables. Does the proportion of measurements in the various categories of factor 1 depend on which category of factor 2 is being observed?

Example: A survey was conducted to evaluate the effectiveness of a new flu vaccine that had been administered in a small community. The vaccine is given as a two-shot sequence over two weeks. A survey of 1000 residents the following spring provided the following information:

              Flu    No Flu    Total
No vaccine     24      289       313
One Shot        9      100       109
Two Shots      13      565       578
Total          46      954      1000

The question to be answered is whether the flu shot had an impact on the incidence of the flu. Equivalently, we could ask whether the incidence of the flu is independent of the vaccine status. In order to answer this question, a χ² test can be conducted, testing the following hypotheses:

H0 : There is no relationship between vaccine status and incidence of flu
Ha : The incidence of flu depends on the vaccine status

The two questions lead to the exact same χ² test.

The χ² distribution

The χ² distribution is neither a normal nor a t-distribution: it takes only nonnegative values, is skewed to the right, and its shape depends on the degrees of freedom (df). Table D gives upper tail areas for different degrees of freedom.
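As a rough cross-check of table lookups, the upper tail area of a χ² distribution can also be computed directly. The following is a minimal sketch in plain Python, not part of the original notes; the function name chi2_upper_tail is hypothetical. It assumes integer degrees of freedom and uses the standard recurrence Q(a+1, h) = Q(a, h) + h^a e^(-h) / Γ(a+1) for the regularized upper incomplete gamma function, starting from Q(1, h) = e^(-h) for even df and Q(1/2, h) = erfc(√h) for odd df.

```python
import math

def chi2_upper_tail(x, df):
    """Upper tail area P(chi2 > x) for a chi-square distribution with
    integer df >= 1 (hypothetical helper, standard library only)."""
    if not isinstance(df, int) or df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a = 1.0                       # start from Q(1, h) = exp(-h)
        q = math.exp(-h)
    else:
        a = 0.5                       # start from Q(1/2, h) = erfc(sqrt(h))
        q = math.erfc(math.sqrt(h))
    # Step a up by 1 until it reaches df / 2, applying the recurrence each time.
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Critical values commonly tabulated for an upper tail area of 0.05
for value, dof in [(3.84, 1), (5.99, 2), (9.49, 4)]:
    print(dof, value, round(chi2_upper_tail(value, dof), 4))
```

Each of the three tabulated critical values should give an upper tail area close to 0.05, which is a quick way to confirm that a table is being read in the right row and column.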
The χ² Test for Homogeneity and Independence

Given are one categorical variable with R categories and one categorical variable with C categories.

Hypotheses: H0 : The two categorical variables are independent versus Ha : H0 is not true.

Assumption: The sample size is large. The sample size is considered large enough as long as every expected count is at least 1, and not more than 20% of the expected counts are less than 5.

Test statistic: Compute for every cell of the two-way table the expected frequency

$$E = \text{expected frequency} = \frac{(\text{row total})(\text{column total})}{\text{sample size}}$$

and then

$$\chi^2_0 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

where O is the observed count.

P-value: P(χ² > $\chi^2_0$), the upper tail area for a χ² distribution with (R − 1)(C − 1) df, found in Appendix Table D.

Decision: As usual.

Context: As usual.

Continue Flu Example:

1. Hypotheses: H0 : The flu is independent of the vaccine status versus Ha : H0 is not true. Let's perform this test at a significance level of α = 0.05.

2. Assumptions: These are met, since all expected counts (computed in step 3) are greater than 5.

3. Test Statistic: The two-way table gives the observed frequencies O. Next calculate the expected cell count E for each cell:

Flu/no vaccine:       (46·313)/1000 = 14.40
Flu/one shot:         (46·109)/1000 = 5.01
Flu/two shots:        (46·578)/1000 = 26.59
No Flu/no vaccine:    (954·313)/1000 = 298.60
No Flu/one shot:      (954·109)/1000 = 103.99
No Flu/two shots:     (954·578)/1000 = 551.41

Put the expected cell counts into the table (in parentheses next to the observed counts):

              Flu             No Flu            Total
No vaccine    24 (14.40)      289 (298.60)        313
One Shot       9 (5.01)       100 (103.99)        109
Two Shots     13 (26.59)      565 (551.41)        578
Total         46              954                1000

Now find (O − E)²/E for each cell and add these fractions:

χ² = 6.404 + 3.169 + 6.944 + 0.309 + 0.153 + 0.335 ≈ 17.313 and df = (3 − 1)(2 − 1) = 2

4. P-value: For df = 2, Table D gives P-value < 0.0005.

5. Decision: Since the P-value is less than α, reject the null hypothesis.

6. Context: At a significance level of 5% the data provide sufficient evidence that the probability of getting the flu is not the same for all three vaccination groups.
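The arithmetic of the flu example can be reproduced with a few lines of code. The sketch below is an illustration only, written in plain Python; the function name chi_square_two_way and the layout of the table as a list of rows are assumptions made here, not part of the notes. The P-value line uses the fact that for df = 2 the upper tail area of the χ² distribution equals exp(−x/2); for other df a table or statistical software is needed.

```python
import math

def chi_square_two_way(observed):
    """Return (statistic, df, expected) for a two-way table given as a
    list of rows of observed counts (hypothetical helper)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count of a cell = (row total)(column total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum (O - E)^2 / E over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Rows: no vaccine, one shot, two shots; columns: flu, no flu
flu_table = [[24, 289], [9, 100], [13, 565]]
stat, df, expected = chi_square_two_way(flu_table)

# Closed form valid only for df = 2: P(chi2 > x) = exp(-x / 2)
p_value = math.exp(-stat / 2) if df == 2 else None

print("chi-square:", round(stat, 3), "df:", df)
print("expected counts:", [[round(e, 2) for e in row] for row in expected])
print("P-value:", p_value)
```

The same function applies unchanged to a homogeneity question such as the seed example, because both questions lead to the same statistic.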
The nature of the relationship still has to be explored. For example, estimate the following probabilities:

P(flu = yes | vaccine = 0)    estimate 24/313 = 0.0767
P(flu = yes | vaccine = 1)    estimate 9/109 = 0.0826
P(flu = yes | vaccine = 2)    estimate 13/578 = 0.0225

From this we find that the estimated probability of getting the flu given that a person had two shots is much smaller than the estimated probabilities of getting the flu given that a person had no shot or only one shot.

Example: Some time ago there was a report in the news that an AIDS vaccine tested in Thailand didn't show any effect. The data quoted in the news are presented in the two-way table below (expected cell counts in parentheses):

            HIV+             HIV-               Total
Placebo     105 (105.5)      1168 (1167.5)       1273
Vaccine     106 (105.5)      1167 (1167.5)       1273
Total       211              2335                2546

χ² = 0.00237 + 0.00237 + 0.000214 + 0.000214 = 0.005168 and df = (2 − 1)(2 − 1) = 1

Conduct a test of homogeneity:

1. Hypotheses: H0 : The probability of becoming HIV positive is the same for the vaccinated and the placebo group versus Ha : H0 is not true. α = 0.05

2. Assumptions: The expected counts are all large enough.

3. Test Statistic: See calculations above: χ² = 0.0052, with 1 df.

4. P-value: From Table D we find P-value > 0.25.

5. Decision: Do not reject H0, since the P-value is greater than α.

6. Context: At significance level α = 0.05 the data do not provide sufficient evidence that the HIV infection rate was affected by the vaccine.

1.3 χ² Goodness-of-Fit Test

In this section the χ² test for comparing the relative frequency distribution from a sample with a given probability distribution is introduced.

Example: A company filling grass seed bags wants to evaluate its filling machine. The following distribution is advertised on the bags, where K1-K5 are different kinds of grass seeds:

kind of seeds    K1     K2      K3      K4      K5
proportion       0.5    0.25    0.15    0.05    0.05

The company wants to check if the seed distribution in the bags fits the advertised distribution.
They take a sample of size 1000 and find the following summarized data:

kind of seeds    K1     K2     K3     K4    K5
count            480    233    160    63    64

In order to check if the label is a truthful description of the contents of the seed bags, we want to compare the claimed distribution with the sample data.

Notation: for a given categorical random variable

k = number of categories
$p_1$ = true probability of falling in category 1
$p_2$ = true probability of falling in category 2
...
$p_k$ = true probability of falling in category k

In order to compare the observed frequencies with the hypothesized distribution we study the χ² goodness-of-fit statistic.

$O_1$ = observed cell count for category 1
$O_2$ = observed cell count for category 2
...
$O_k$ = observed cell count for category k

These observed counts are compared with the counts expected if the claim is true (the label is true). If we examine an event that occurs with probability p, then in a sample of size n we would expect to see this event about n · p times. The χ² statistic is based on this fact.

For a given hypothesized population distribution $p_{01}, \dots, p_{0k}$ (in the example, this is the label) define:

$E_1$ = expected cell count for category 1 = $n \cdot p_{01}$
$E_2$ = expected cell count for category 2 = $n \cdot p_{02}$
...
$E_k$ = expected cell count for category k = $n \cdot p_{0k}$

The χ² statistic is then:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

If $p_{01}, \dots, p_{0k}$ is the true distribution and the sample size is large, then the χ² statistic is approximately χ²-distributed with df = k − 1 degrees of freedom.

Continue Example: With $p_{01} = 0.5$, $p_{02} = 0.25$, $p_{03} = 0.15$, $p_{04} = 0.05$, $p_{05} = 0.05$ and n = 1000 we get:

kind of seeds    p_0i    O_i    E_i = n·p_0i    (O_i − E_i)²/E_i
K1               0.5     480    500             (480 − 500)²/500 = 0.8
K2               0.25    233    250             (233 − 250)²/250 = 1.156
K3               0.15    160    150             (160 − 150)²/150 = 0.667
K4               0.05     63     50             (63 − 50)²/50 = 3.38
K5               0.05     64     50             (64 − 50)²/50 = 3.92

And $\chi^2_0$ = 0.8 + 1.156 + 0.667 + 3.38 + 3.92 = 9.923. This number now has to be evaluated as coming from a χ² distribution with 5 − 1 = 4 df.

The χ² Goodness-of-Fit Test

Given a distribution $p_{01}, \dots, p_{0k}$.

Hypotheses: H0 : $p_1 = p_{01}, \dots, p_k = p_{0k}$ versus Ha : H0 is not true.

Assumption: The sample size is large. The sample size is large enough for the χ² test to be appropriate as long as every expected count is at least 5.

Test statistic:

$$\chi^2_0 = \sum_{\text{all categories}} \frac{(O_i - E_i)^2}{E_i}$$

with df = k − 1.

P-value: P(χ² > $\chi^2_0$), the upper tail area for a χ² distribution with k − 1 df, found in Appendix Table D.

Decision: If the P-value ≤ α, then reject H0. If the P-value > α, then do not reject H0.

Continue Example:

1. Hypotheses: H0 : $p_1 = 0.5$, $p_2 = 0.25$, $p_3 = 0.15$, $p_4 = 0.05$, $p_5 = 0.05$ versus Ha : H0 is not true. We use α = 0.05.

2. Assumptions: The expected counts are all large enough (the smallest is 50).

3. Test Statistic: The test statistic is $\chi^2_0$ = 9.923 with df = 4 (see calculations above).

4. P-value: From Table D for df = 4 we find that 9.923 falls between 9.49 and 11.14, with upper tail probabilities 0.050 and 0.025. So we conclude 0.025 < P-value < 0.05.

5. Decision: Since the P-value ≤ 0.05 = α, we reject H0 at a significance level of α = 0.05.

6. Interpretation: At a significance level of 5% we find sufficient evidence in the sample that the label is not giving a true description of the contents of the seed bag.
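The goodness-of-fit calculation for the seed example can be checked with a short script. This is a minimal sketch in plain Python under stated assumptions: the function name goodness_of_fit is hypothetical, and the P-value line relies on the closed form P(χ² > x) = exp(−x/2)(1 + x/2), which holds only for df = 4. For other df, use a table or statistical software.

```python
import math

def goodness_of_fit(observed, probabilities):
    """Return (statistic, df, expected) for a chi-square goodness-of-fit
    test of observed counts against hypothesized probabilities
    (hypothetical helper)."""
    if len(observed) != len(probabilities):
        raise ValueError("need one probability per category")
    n = sum(observed)
    # Expected count for category i is n * p_0i
    expected = [n * p for p in probabilities]
    # Sum (O_i - E_i)^2 / E_i over all categories
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1, expected

counts = [480, 233, 160, 63, 64]          # K1..K5 from the sample of 1000
label = [0.5, 0.25, 0.15, 0.05, 0.05]     # advertised proportions

stat, df, expected = goodness_of_fit(counts, label)

# Closed form valid only for df = 4
p_value = math.exp(-stat / 2) * (1 + stat / 2) if df == 4 else None

print("chi-square:", round(stat, 3), "df:", df)
print("P-value:", p_value)
```

The computed P-value should fall between 0.025 and 0.05, in line with the bounds read from Table D in step 4.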