STAT 305: Chapter 6 – Methods for Analyzing a Single Categorical Variable Spring 2014

INTRODUCTION TO CHI-SQUARE TESTS

Up to this point, we have considered problems involving a single categorical variable with only two levels, and we have used the binomial distribution to find exact p-values. In this section, we will discuss another distribution, called the chi-square distribution, which can be used to approximate these p-values for two-tailed tests.

Example 6.11, Revisited: Suppose the researcher was interested in determining whether the heart rate of rats is different when they are in a cage with other rats versus when they are in a cage by themselves. The following table shows the data collected from the study.

Questions:
1. Set up the null and alternative hypotheses for investigating this research question.
2. If the null hypothesis is true, in how many pairs do we expect the heart rate to be higher when the rats are together?
3. In how many pairs did we observe that the heart rate was higher when the rats were together?
4. What if there had been only 2 rats with higher heart rates when the rats were together? Would we consider this value to be just as "extreme" as the value actually observed? Explain.

We can record the expected and observed counts as follows:

                  # of pairs where HR higher    # of pairs where HR higher    Total #
                  when rats together            when rats alone               of pairs
Observed Count
Expected Count

The chi-square test compares these expected and observed values using the following statistic:

$$\text{Chi-square Statistic} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where the sum is taken over the categories in the table (not including the total).

Questions:
1. What is the smallest value the chi-square statistic can assume, and when does this happen?
2. What does it mean when the chi-square statistic is very large?
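The two questions above can be explored numerically. The following is a minimal Python sketch of the chi-square statistic. The function name `chi_square_statistic` and the counts (10 pairs, with 5 expected in each category) are made up for illustration and are NOT the data from the rat study.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for illustration only (not the study's data):
# 10 pairs, so 5 are expected in each category under the null hypothesis.
expected = [5, 5]

# Observed counts equal to the expected counts give the smallest possible value.
perfect_match = chi_square_statistic([5, 5], expected)

# Observed counts far from the expected counts give a larger value.
far_from_expected = chi_square_statistic([8, 2], expected)

# Swapping the two categories gives the same statistic, because each
# difference is squared. The statistic ignores the direction of the difference.
mirror_image = chi_square_statistic([2, 8], expected)

print(perfect_match, far_from_expected, mirror_image)
```

Because the differences are squared, the statistic can never be negative, and results that are equally far from the expected counts in either direction are treated as equally extreme. This is why the test corresponds to a two-tailed test.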
The p-value is found using the chi-square distribution, which is indexed by its degrees of freedom. To find the degrees of freedom for problems involving a single categorical variable, count the number of columns (c) in a table of the observed values and then calculate df = (c - 1). For this example, the table of the observed values is quite simple.

                  # of pairs where HR higher    # of pairs where HR higher
                  when rats together            when rats alone
Observed Count

c =
df =

The p-value is obtained by locating the chi-square statistic on the horizontal axis of the distribution and then finding the area under the curve to the right of (above) the chi-square statistic. For example, the following graphic shows the chi-square distribution with df = 1.

You can type the following command into an empty cell in Excel to find this area:

=CHIDIST(chi-square statistic, df)

Questions:
1. Find the p-value for addressing the research question.
2. Do we have evidence that rats' heart rates are different when they are in a cage with other rats versus when they are in a cage by themselves? Explain.

Carrying out the Chi-square test in JMP

Enter the data as follows:

Once again, be sure to right-click on the count column and select Preselect Role > Frequency. Then, select Analyze > Distribution. Place the variable of interest in the Y, Columns box:

Click OK, and then choose Test Probabilities from the red drop-down arrow next to the variable name. Enter the expected proportions (instead of the expected counts):

Click Done, and JMP returns the chi-square statistic and the p-value:

GOODNESS OF FIT TESTS

In the previous example, we carried out what is known as a goodness of fit test.
The basic idea behind this test is that we want to see how well a statistical model fits a set of data. The goodness of fit test can also be used for a single categorical variable with more than two levels.

Example 6.12: Mendelian theory states that the number of a certain type of peas falling into the classifications round and yellow, wrinkled and yellow, round and green, and wrinkled and green should be in the ratio 9:3:3:1. Suppose the data in the following table were obtained from 100 such peas. Are these data consistent with the model?

Round     Wrinkled   Round    Wrinkled
Yellow    Yellow     Green    Green      Total
56        19         17       8          100

Questions:
1. Set up the null and alternative hypotheses for this research question.
   H0: The hypothesized model does fit the data.
   Ha: The hypothesized model does NOT fit the data.
2. Calculate the chi-square statistic from the data and find the p-value.

                   Round     Wrinkled   Round    Wrinkled
                   Yellow    Yellow     Green    Green      Total
Observed Counts    56        19         17       8          100
Expected Counts                                             100

$$\text{Test Statistic} = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

Carrying out the test in JMP:

Select Analyze > Distribution.

Select Test Probabilities from the red drop-down arrow next to the variable name and type in the hypothesized probabilities.

Questions:
3. Write a conclusion to address the research question.

Warning: Make sure the Chi-square test is valid!

Recall that the chi-square distribution is used to approximate p-values. This approximation may not be very good with small sample sizes. One rule of thumb suggests that most of the expected cell frequencies in the table should be 5 or more; otherwise, the chi-square approximation may not be reliable. Also, in general it should not be used when analyzing a single categorical (nominal or ordinal) variable that has only 2 levels, e.g., the rat heart rate study, where the binomial distribution gives an exact p-value.
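As a check on the hand calculation for Example 6.12, the goodness of fit test can be sketched in Python using only the standard library. This is an illustrative sketch, not the course's method (the handout uses Excel and JMP). The helper names `chi_square_statistic` and `chi_square_upper_tail` are assumptions introduced here, and the upper-tail area is computed with a standard power series for the incomplete gamma function, which is adequate for small to moderate statistics but not tuned for very small p-values.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_upper_tail(x, df):
    """Area under the chi-square curve with df degrees of freedom, to the right of x.

    Computes the regularized lower incomplete gamma function P(df/2, x/2)
    by its power series and returns 1 - P.
    """
    if x <= 0:
        return 1.0
    a = df / 2.0
    half_x = x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return 1.0 - lower


# Data from Example 6.12: observed counts and the hypothesized 9:3:3:1 ratio.
observed = [56, 19, 17, 8]
ratio = [9, 3, 3, 1]

n = sum(observed)
expected = [n * r / sum(ratio) for r in ratio]  # expected counts under H0

statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1  # number of categories minus 1
p_value = chi_square_upper_tail(statistic, df)

# Stricter version of the rule of thumb: every expected count is 5 or more.
all_expected_at_least_5 = min(expected) >= 5

print(expected, round(statistic, 3), df, round(p_value, 3), all_expected_at_least_5)
```

With these data the expected counts are 56.25, 18.75, 18.75, and 6.25, the statistic is about 0.66 on 3 degrees of freedom, and the p-value is about 0.88. The `chi_square_upper_tail` value plays the same role as the area returned by Excel's CHIDIST for the same statistic and degrees of freedom.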