part 4 - Winona State University

STAT 305: Chapter 6 – Methods for Analyzing a Single Categorical Variable
Spring 2014
INTRODUCTION TO CHI-SQUARE TESTS
Up to this point, we have considered problems involving a single categorical variable with only
two levels, and we have used the binomial distribution to find exact p-values. In this section,
we will discuss another distribution, called the chi-square distribution, which can be used to
approximate these p-values for two-tailed tests.
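As a rough illustration of the approximation described above, the sketch below compares an exact two-tailed binomial p-value with the chi-square approximation. The counts (15 successes in 20 trials) are hypothetical and are not taken from any example in these notes; the function names are also made up for illustration. It assumes the null hypothesis p = 0.5.

```python
from math import comb, erfc, sqrt

def exact_two_tailed_p(successes: int, n: int) -> float:
    """Exact two-tailed binomial p-value, assuming H0: p = 0.5."""
    distance = abs(successes - n / 2)
    # Count every outcome at least as far from n/2 as the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2 ** n

def chi_square_p_df1(successes: int, n: int) -> float:
    """Chi-square approximation (df = 1) for the same two-category problem."""
    expected = n / 2
    failures = n - successes
    stat = (successes - expected) ** 2 / expected + (failures - expected) ** 2 / expected
    # For df = 1, the area to the right of stat equals erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2))

# Hypothetical data: 15 "successes" out of 20 trials.
exact_p = exact_two_tailed_p(15, 20)
approx_p = chi_square_p_df1(15, 20)
print(f"exact p-value:      {exact_p:.4f}")
print(f"chi-square p-value: {approx_p:.4f}")
```

With these hypothetical counts the two p-values are close but not identical, which is what "approximate" means here.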
Example 6.11, Revisited: Suppose the researcher was interested in determining whether the
heart rate of rats is different when they are in a cage with other rats versus when they are in a
cage by themselves. The following table shows the data collected from the study.
Questions:
1. Set up the null and alternative hypotheses for investigating this research question.
2. If the null hypothesis is true, in how many pairs do we expect the heart rate to be higher
when the rats are together?
3. In how many pairs did we observe that the heart rate was higher when the rats were
together?
4. What if there had been only 2 rats with higher heart rates when the rats were together?
Would we consider this value to be just as “extreme” as the actual observed? Explain.
We can record the expected and observed counts as follows:
                  # of pairs where HR          # of pairs where HR          Total # of pairs
                  higher when rats together    higher when rats alone
Observed Count
Expected Count
The chi-square test compares these expected and observed values using the following statistic:
Chi-square Statistic = $\sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$, where the sum is taken over all categories.
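The statistic can be sketched in a few lines of Python. The counts below are hypothetical placeholders (they are not the rat heart rate data), and `chi_square_statistic` is a name chosen for this illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for a two-category variable (not from the rat study).
observed = [12, 8]
expected = [10, 10]

stat = chi_square_statistic(observed, expected)
print(f"chi-square statistic = {stat:.2f}")  # (4/10) + (4/10) = 0.80
```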
Questions:
1. What is the smallest value the chi-square statistic can assume, and when does this
happen?
2. What does it mean when the chi-square statistic is very large?
The p-value is found using the chi-square distribution, which is indexed by its degrees of freedom.
To find the degrees of freedom for problems involving a single categorical variable, count the
number of columns (c) in a table of the observed values and then calculate df = (c-1).
For this example, the table of the observed values is quite simple.
                  # of pairs where HR          # of pairs where HR
                  higher when rats together    higher when rats alone
Observed Count

c =
df =
The p-value is obtained by locating the chi-square statistic on the distribution and then finding
the area under the curve to the right of the chi-square statistic. For example, the following graphic
shows the chi-square distribution with df = 1. You can type the following command into an
empty cell in Excel to find this area: “=CHIDIST(chi-square statistic, df)”
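For readers working outside Excel, the same right-tail area can be sketched in Python for the df = 1 case. This relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable; the statistic value 3.84 is a hypothetical input chosen only for illustration.

```python
from math import erfc, sqrt

def chi_square_p_value_df1(statistic: float) -> float:
    """Area under the chi-square (df = 1) curve to the right of `statistic`."""
    return erfc(sqrt(statistic / 2))

# Hypothetical statistic; this mirrors =CHIDIST(3.84, 1) in Excel.
p_value = chi_square_p_value_df1(3.84)
print(f"p-value = {p_value:.3f}")  # about 0.050
```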
Questions:
1. Find the p-value for addressing the research question.
2. Do we have evidence that rats’ heart rates are different when they are in a cage with
other rats versus when they are in a cage by themselves? Explain.
Carrying out the Chi-square test in JMP
Enter the data as follows:
Once again, be sure to right-click on the count column and select Preselect Role > Frequency.
Then, select Analyze > Distribution.
Place the variable of interest in the Y, Columns box:
Click OK, and then choose Test Probabilities from the red drop-down arrow next to the
variable name. Enter the expected proportions (instead of the expected counts):
Click Done, and JMP returns the chi-square statistic and the p-value:
GOODNESS OF FIT TESTS
In the previous example, we carried out what is known as a goodness of fit test. The basic idea
behind this is that we want to see how well a statistical model fits a set of data. This goodness
of fit test can be used for a single categorical variable with more than two levels, as well.
Example 6.12: Mendelian theory states that the number of a certain type of peas falling into the
classifications round and yellow, wrinkled and yellow, round and green, and wrinkled and
green should be in the ratio 9:3:3:1. Suppose the data in the following table
were obtained from 100 such peas. Are these data consistent with the model?
Round Yellow    Wrinkled Yellow    Round Green    Wrinkled Green    Total
56              19                 17             8                 100
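The expected counts under the 9:3:3:1 model follow from multiplying the total number of peas by each hypothesized proportion. A minimal sketch, using only the ratio and the total of 100 given in the example (the variable names are illustrative):

```python
# Hypothesized Mendelian ratio and the sample size from Example 6.12.
ratio = {"Round Yellow": 9, "Wrinkled Yellow": 3, "Round Green": 3, "Wrinkled Green": 1}
total_peas = 100

ratio_sum = sum(ratio.values())  # 9 + 3 + 3 + 1 = 16
expected = {label: total_peas * parts / ratio_sum for label, parts in ratio.items()}

for label, count in expected.items():
    print(f"{label:16s} expected = {count:.2f}")
```

Expected counts need not be whole numbers; they are long-run averages under the model, not counts anyone could actually observe.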
Questions:
1. Set up the null and alternative hypotheses for this research question.
H0: The hypothesized model does fit the data.
Ha: The hypothesized model does NOT fit the data.
2. Calculate the chi-square statistic from the data and find the p-value.
                   Round Yellow    Wrinkled Yellow    Round Green    Wrinkled Green    Total
Observed Counts    56              19                 17             8                 100
Expected Counts                                                                        100
Test Statistic = $\sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$, where the sum is taken over all four categories.
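As a check on hand or JMP calculations, the test statistic and p-value for the pea data can be sketched in Python using only the standard library. The helper `chi_square_right_tail` is an illustrative function written for these notes (not a library routine); it assumes the degrees of freedom are a positive whole number and uses df = c - 1 = 3 as described earlier.

```python
from math import erfc, exp, gamma, sqrt

def chi_square_right_tail(statistic: float, df: int) -> float:
    """Area to the right of `statistic` under a chi-square curve (integer df >= 1)."""
    z = statistic / 2
    # Starting values: df = 2 gives exp(-z); df = 1 gives erfc(sqrt(z)).
    if df % 2 == 0:
        a, area = 1.0, exp(-z)
    else:
        a, area = 0.5, erfc(sqrt(z))
    # Each step raises the degrees of freedom by 2 and adds one more term.
    while a < df / 2:
        area += z ** a * exp(-z) / gamma(a + 1)
        a += 1
    return area

observed = [56, 19, 17, 8]               # counts from Example 6.12
expected = [56.25, 18.75, 18.75, 6.25]   # 100 peas split 9:3:3:1

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi_square_right_tail(stat, df)
print(f"chi-square = {stat:.4f}, df = {df}, p-value = {p_value:.3f}")
```

Under these assumptions the statistic comes out near 0.66 with a p-value of roughly 0.88, which should agree with the JMP output up to rounding.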
Carrying out the test in JMP:
Select Analyze > Distribution.
Select Test Probabilities from the red drop-down arrow next to the variable name and type in
the hypothesized probabilities.
Questions:
3. Write a conclusion to address the research question.
Warning: Make sure the Chi-square test is valid!
Recall that the chi-square distribution is used to approximate p-values. This approximation
may not be very good with small sample sizes. One rule of thumb suggests that most of the
expected cell frequencies in the table should be 5 or more; otherwise, the chi-square
approximation may not be reliable. Also, in general it should not be used when analyzing a
single categorical (nominal or ordinal) variable that has only 2 levels, e.g., the rat heart rate study.
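The rule of thumb above can be checked before running the test. The sketch below flags expected counts below 5 for the pea example and, for contrast, for a hypothetical smaller sample of 40 peas (the sample size 40 and the function names are assumptions made for illustration only).

```python
def expected_counts(total, ratio):
    """Expected count in each category for a hypothesized ratio."""
    return [total * parts / sum(ratio) for parts in ratio]

def cells_below_threshold(expected, threshold=5):
    """Return the expected counts that fall below the rule-of-thumb threshold."""
    return [e for e in expected if e < threshold]

ratio = [9, 3, 3, 1]

# Example 6.12: 100 peas, so every expected count is at least 6.25.
flagged_100 = cells_below_threshold(expected_counts(100, ratio))

# Hypothetical smaller study: 40 peas gives an expected count of 2.5 in the last cell.
flagged_40 = cells_below_threshold(expected_counts(40, ratio))

print("n = 100, expected counts below 5:", flagged_100)
print("n = 40,  expected counts below 5:", flagged_40)
```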