The Chi-Square Test for Goodness of Fit
Learning Objectives
After completion of this module, the student will be able
to
1. develop a statistical test for goodness of fit based on a
mathematical model that is appropriate for the data
2. calculate the chi-square statistic
3. determine whether or not to reject the null hypothesis
Knowledge and Skills
1. Concepts: chi-square test, goodness of fit
Prerequisites
1. mean and variance
2. binomial distribution, uniform distribution, normal
distribution
Citation: Neuhauser, C. The Chi-square Test for Goodness of Fit.
Created: January 3, 2008 Revisions: December 5, 2009
Copyright: © 2009 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution
Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows
others to translate, make remixes, and produce new stories based on this work, provided the original author and source are
credited and the new work will carry the same license.
Funding: This work was partially supported by a HHMI Professors grant from the Howard Hughes Medical Institute.
Applications
We begin with a number of applications that illustrate situations in which we need to find a statistical
test that can assess goodness of fit.
High Blood Pressure
A study by Suh et al. (1987) examined the familial
aggregation of blood pressure in 196 four-member families. The
following table lists the number of families stratified
according to the number of members with high blood
pressure. What can you say about familial aggregation of
blood pressure?
No. of members with high B.P.    No. of families
in the family                    observed
0                                108
1                                66
2                                19
3                                2
4                                1
Total                            196
Approach: Formulate a null hypothesis and describe the distribution under the null hypothesis.
How random is random?
Write down 100 integers between 1 and 5 at random and count how many times each value occurs.
Then simulate 100 such integers and check whether your numbers are “random.”
Approach: Formulate a null hypothesis and describe the distribution under the null hypothesis.
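The simulation step can be sketched as follows. This is a minimal illustration, assuming that "random" means each of the values 1 to 5 is equally likely; the seed and the variable names are our own choices, and the statistic and degrees of freedom are the ones described later in this module.

```python
import random
from collections import Counter

random.seed(2008)  # arbitrary seed so the run is repeatable (our assumption)

n = 100
values = range(1, 6)

# Simulate 100 integers drawn uniformly from 1..5 and count each value.
draws = [random.randint(1, 5) for _ in range(n)]
counts = Counter(draws)
observed = [counts[v] for v in values]

# Under the null hypothesis every value is equally likely: 100 / 5 = 20 per bin.
expected = [n / len(values)] * len(values)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k = 5 bins and no estimated parameters give 5 - 1 - 0 = 4 degrees of freedom.
# 9.488 is the tabulated 5% critical value for 4 degrees of freedom.
CRITICAL_5_PERCENT_DF4 = 9.488
reject = chi_square > CRITICAL_5_PERCENT_DF4

print(observed, round(chi_square, 3), reject)
```

To check hand-written numbers instead, replace `draws` with the list of 100 integers you wrote down.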
Plant Distribution
The tree species Pinus mugo is the most important invader of abandoned subalpine grasslands in the
Northern Calcareous Alps, Austria. The data below, from Dullinger et al. (2003), show the area of each
grassland type and the number of Pinus mugo trees observed in it. A question of interest is whether the
different grasslands are invaded by Pinus mugo indiscriminately or whether the species shows a preference.
Grassland Type                                Area (m²)    Observed
Carex firma grassland                         1018         61
Carex sempervirens grassland                  6188         264
Leontodon hispidus-Crepis aurea grassland     583          20
Agrostis alpina-Festuca pumila grassland      1243         29
Calamagrostis varia grassland                 408          7
Carex ferruginea grassland                    846          9
Nardus stricta pasture                        320          2
Deschampsia cespitosa pasture                 2013         11
Helictotrichon parlatorei grassland           1433         2
Tall herb community                           420          0
Approach: Formulate a null hypothesis and describe the
distribution under the null hypothesis.
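One way to describe the distribution under the null hypothesis is sketched below. It assumes (our reading, not stated in the module) that "indiscriminate" invasion means the expected number of trees in a grassland is proportional to its area. The areas and counts are those of the table; the variable names are illustrative.

```python
# (grassland type, area in m^2, observed number of trees), taken from the table.
grasslands = [
    ("Carex firma grassland", 1018, 61),
    ("Carex sempervirens grassland", 6188, 264),
    ("Leontodon hispidus-Crepis aurea grassland", 583, 20),
    ("Agrostis alpina-Festuca pumila grassland", 1243, 29),
    ("Calamagrostis varia grassland", 408, 7),
    ("Carex ferruginea grassland", 846, 9),
    ("Nardus stricta pasture", 320, 2),
    ("Deschampsia cespitosa pasture", 2013, 11),
    ("Helictotrichon parlatorei grassland", 1433, 2),
    ("Tall herb community", 420, 0),
]

total_area = sum(area for _, area, _ in grasslands)
total_trees = sum(obs for _, _, obs in grasslands)

# Assumed null hypothesis: trees are spread in proportion to area.
expected = [total_trees * area / total_area for _, area, _ in grasslands]
observed = [obs for _, _, obs in grasslands]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Ten bins, no parameters estimated from the counts: 10 - 1 - 0 = 9 degrees of freedom.
degrees_of_freedom = len(grasslands) - 1

for (name, _, obs), exp in zip(grasslands, expected):
    print(f"{name:45s} observed {obs:4d}   expected {exp:7.2f}")
print("chi-square:", round(chi_square, 2), "df:", degrees_of_freedom)
```

The resulting statistic can be compared with a tabulated critical value for 9 degrees of freedom (16.919 at the 5% level).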
A Brief Description of the Chi-Square Test
The chi-square test for goodness of fit is designed to test whether observed frequencies differ
significantly from expected frequencies. Here are the assumptions:
1. The data need to be grouped or binned. Some data are already grouped into classes, such as the
data on High Blood Pressure or Plant Distribution. The data you would generate in the example
How Random is Random? can be naturally grouped into five groups according to the values
1, 2, …, 5. The choice of bins depends on the data. As a general rule, the expected frequency in
each bin should be at least five; bins with smaller expected frequencies can be pooled with
neighboring bins.
2. The data must come from a univariate distribution whose cumulative distribution function must be
known. The null hypothesis states that the data follow a specific distribution and the alternative is
that the data do not.
We define a quantity that summarizes our data. This quantity is the chi-square statistic. This statistic
requires that the data is grouped into “bins.”
$$\chi^2 = \sum_{j=1}^{k} \frac{(\mathrm{Obs}_j - \mathrm{Exp}_j)^2}{\mathrm{Exp}_j} \qquad (1)$$
where
$\mathrm{Obs}_j$ = observed frequency for bin $j$
$\mathrm{Exp}_j$ = expected frequency for bin $j$
The test statistic in Equation (1) is then approximately chi-square distributed with $k - 1 - m$ degrees of
freedom, where $m$ is the number of population parameters that need to be estimated. The
value of the statistic depends on how the data are binned. The larger the sample size, the better the
approximation. The null hypothesis is rejected if the statistic in Equation (1) exceeds a critical value
determined by the significance level. In Excel, the function
CHIDIST(x, degrees_freedom)
returns the right-tail probability (the p-value) of the chi-square distribution at x. Critical values are
tabulated (see, for instance, the NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm).
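For readers working outside Excel, the statistic in Equation (1) and the right-tail probability can be sketched in plain Python. The function names are our own, and the tail probability uses the standard closed-form finite sums for integer degrees of freedom; this is an illustration, not a replacement for a statistics library.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (Obs_j - Exp_j)^2 / Exp_j over all bins, as in Equation (1)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_right_tail(x, df):
    """P(X > x) for a chi-square variable X with integer df >= 1 (like CHIDIST)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail term plus a finite sum over odd powers of sqrt(x),
    # each divided by 1*3*5*...*(2r - 1).
    term = math.sqrt(x)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total


# Sanity check against a tabulated value: 3.841 is the 5% critical value for 1 df.
print(round(chi_square_right_tail(3.841, 1), 3))
```

The null hypothesis is rejected at significance level α when `chi_square_right_tail(statistic, df)` is smaller than α, which is equivalent to the statistic exceeding the tabulated critical value.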
Example
Gregor Mendel (1822-1884) was an Austrian monk and scientist who conducted careful experiments on
pea plants to gain a better understanding of how traits are inherited. The pea plants he worked on
exhibited two flower colors, red and white. If he crossed true-breeding red-flowered plants with true-breeding white-flowered plants, all offspring were red-flowered. When he then crossed plants in this
offspring generation, he observed that among 929 plants, 705 had red flowers and 224 had white
flowers. Mendel hypothesized that the fraction of red flowers is 3/4. We can thus formulate the null
hypothesis and the alternative. If we denote by q the fraction of red flowers, then
$H_0: q = 0.75$
$H_1: q \neq 0.75$
We find
                 Observed    Expected
Red Flowers      705         697
White Flowers    224         232
Hence, we obtain for the test statistic
$$\chi^2 = \frac{(705 - 697)^2}{697} + \frac{(224 - 232)^2}{232} = 0.0918 + 0.2759 = 0.3677$$
To calculate the degrees of freedom, observe that we have two groups, i.e., $k = 2$, and that we did not
need to estimate any population parameters, i.e., $m = 0$. Hence, the number of degrees of freedom is $2 - 1 - 0 = 1$.
Using Excel, we find
CHIDIST(0.3677, 1) = 0.5443
This is not close to being statistically significant and we cannot reject the null hypothesis. Note that this
does not mean that we accept the null hypothesis.
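The arithmetic of this example can be checked with a short script. The sketch below uses the rounded expected counts from the table (697 and 232) to reproduce the values in the text, and also shows the unrounded expected counts 929 × 0.75 and 929 × 0.25 for comparison. The helper names are our own; for one degree of freedom the right-tail probability equals erfc(√(x/2)).

```python
import math

observed = [705, 224]            # red, white (from the text)
n = sum(observed)                # 929 plants
null_fractions = [0.75, 0.25]    # H0: q = 0.75

rounded_expected = [697, 232]                       # values used in the table
exact_expected = [n * f for f in null_fractions]    # 696.75 and 232.25


def chi_square_statistic(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def p_value_df1(x):
    # For 1 degree of freedom, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2))


stat_rounded = chi_square_statistic(observed, rounded_expected)
stat_exact = chi_square_statistic(observed, exact_expected)

print(round(stat_rounded, 4), round(p_value_df1(stat_rounded), 4))
print(round(stat_exact, 4), round(p_value_df1(stat_exact), 4))
```

With the unrounded expected counts the statistic is slightly larger (about 0.39), but the conclusion is unchanged: the result is far from significant.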
Why does it work?
The following is more advanced material and can be skipped if there is no need to provide a theoretical
foundation for the chi-square test.
We will illustrate how the chi-square test works in the case of two categories (success and failure) when
we test a hypothesis of the form $H: p = p_0$, where $p$ denotes the probability of success. Assume that
we count the number of successes in $n$ independent trials where each trial has probability $p$ of success.
We denote the number of successes by $S_n$. This quantity is binomially distributed. We know from the
properties of the binomial distribution that the expected value of $S_n$ is $np$ and its variance is $np(1 - p)$.
Using the central limit theorem, we find that
$$Z = \frac{S_n - np}{\sqrt{np(1 - p)}}$$
is approximately normally distributed with mean 0 and variance 1.
We will need the following result: If $Z$ is standard normally distributed (i.e., $Z$ has mean 0 and variance
1), then $Z^2$ has a chi-square distribution with one degree of freedom, denoted by $\chi^2_1$.
We thus conclude that
$$Z^2 = \frac{(S_n - np)^2}{np(1 - p)} \qquad (2)$$
is approximately chi-square distributed with one degree of freedom.
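The approximation can be checked empirically. The sketch below simulates many binomial counts, forms the squared standardized count from Equation (2), and records how often it exceeds 3.841, the tabulated 5% critical value for one degree of freedom. The values of n, p, the number of replications, and the seed are illustrative assumptions, not part of the module.

```python
import math
import random

random.seed(1)  # arbitrary seed for repeatability

n, p = 200, 0.3                   # illustrative values
replications = 5000
CRITICAL_5_PERCENT_DF1 = 3.841    # tabulated 5% critical value, 1 degree of freedom

exceed = 0
for _ in range(replications):
    # Number of successes in n independent trials with success probability p.
    successes = sum(1 for _ in range(n) if random.random() < p)
    z = (successes - n * p) / math.sqrt(n * p * (1 - p))
    if z ** 2 > CRITICAL_5_PERCENT_DF1:
        exceed += 1

fraction = exceed / replications
print(fraction)  # should be close to 0.05 if the approximation is good
```

Because the binomial distribution is discrete, the fraction will not equal 0.05 exactly; it moves closer as n grows.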
To transform Equation (2) into a formula that is useful for determining the goodness of fit, we will
relabel our variables. Denote the number of successes in $n$ trials by $X_1$ and the number of failures by $X_2$.
Denote the success probability by $p_1$ and the failure probability by $p_2$. Then
$$X_1 + X_2 = n \quad\text{and}\quad p_1 + p_2 = 1$$
Using the fact that
$$(X_1 - np_1)^2 = (n - X_2 - n(1 - p_2))^2 = (-X_2 + np_2)^2 = (X_2 - np_2)^2,$$
it follows that
$$\frac{(X_1 - np_1)^2}{np_1(1 - p_1)} = \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_2 - np_2)^2}{np_2}$$
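The intermediate step, which relies only on $p_1 + p_2 = 1$, can be spelled out as follows (our own expansion of the argument):

```latex
\frac{1}{n p_1 (1 - p_1)} = \frac{1}{n p_1 p_2} = \frac{p_1 + p_2}{n p_1 p_2}
  = \frac{1}{n p_1} + \frac{1}{n p_2},
\qquad\text{so}\qquad
\frac{(X_1 - n p_1)^2}{n p_1 (1 - p_1)}
  = \frac{(X_1 - n p_1)^2}{n p_1} + \frac{(X_1 - n p_1)^2}{n p_2}
  = \frac{(X_1 - n p_1)^2}{n p_1} + \frac{(X_2 - n p_2)^2}{n p_2}.
```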
This allows us to rewrite the statistic in Equation (2) as
$$Z^2 = \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_2 - np_2)^2}{np_2} \qquad (3)$$
which is approximately chi-square distributed with one degree of freedom. The statistic in Equation (3)
is of the form
$$\frac{(\mathrm{Obs}_1 - \mathrm{Exp}_1)^2}{\mathrm{Exp}_1} + \frac{(\mathrm{Obs}_2 - \mathrm{Exp}_2)^2}{\mathrm{Exp}_2}$$
where $\mathrm{Obs}_j$ is the observed frequency in the jth category and $\mathrm{Exp}_j$ is the
expected frequency in the jth category ($j = 1, 2$). To generalize this to $k$ categories
$$\sum_{j=1}^{k} \frac{(\mathrm{Obs}_j - \mathrm{Exp}_j)^2}{\mathrm{Exp}_j} \qquad (4)$$
we need the fact that a sum of independent chi-square distributed random variables is again chi-square
distributed. To determine the degrees of freedom for the statistic in Equation (4), note that because of
the constraint that the observations add up to the sample size $n$, the number of degrees of
freedom is reduced by 1. If, in addition, we need to estimate $m$ population parameters to find the
expected frequencies, the number of degrees of freedom is $k - 1 - m$.
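A case with $m = 1$ can be sketched using the High Blood Pressure data. The sketch assumes (our modeling choice, not stated in the module) that under the null hypothesis each of the four family members has high blood pressure independently with the same probability p, so the counts follow a binomial distribution with p estimated from the data. It also pools the categories 2, 3 and 4 so that every bin has an expected frequency of at least five.

```python
import math

# Number of members with high B.P. -> number of families (from the table).
observed = {0: 108, 1: 66, 2: 19, 3: 2, 4: 1}
family_size = 4
n_families = sum(observed.values())                     # 196

# Estimate p from the data: affected members divided by all members.
affected = sum(j * count for j, count in observed.items())
p_hat = affected / (family_size * n_families)

# Expected number of families with j affected members under the binomial model.
expected = {
    j: n_families * math.comb(family_size, j) * p_hat**j * (1 - p_hat) ** (family_size - j)
    for j in range(family_size + 1)
}

# Pool categories 2, 3 and 4 so that each bin has an expected count of at least 5.
obs_bins = [observed[0], observed[1], observed[2] + observed[3] + observed[4]]
exp_bins = [expected[0], expected[1], expected[2] + expected[3] + expected[4]]

chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_bins, exp_bins))

k = len(obs_bins)   # 3 bins after pooling
m = 1               # one estimated parameter (p)
df = k - 1 - m

# With 1 degree of freedom the right-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print([round(e, 2) for e in exp_bins], round(chi_square, 3), df, round(p_value, 3))
```

Other ways of pooling the sparse categories are possible and would change both the statistic and the degrees of freedom.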
References
Dullinger, T. Dirnbock, and G. Grabherr. 2003. Patterns of Shrub Invasion into High Mountain Grasslands
of the Northern Calcareous Alps, Austria. Arctic, Antarctic, and Alpine Research 35: 434-441.
Il Suh, Il Soon Kim and Young Moon Chae. 1987. Familial Aggregation of Blood Pressure. Yonsei Medical
Journal 28: 199-208.
Oberhauser, K., I. Gebhard, C. Cameron, and S. Oberhauser. 2007. Parasitism of Monarch Butterflies
(Danaus plexippus) by Lespesia archippivora (Diptera: Tachinidae). Am. Midl. Nat. 157: 312-328.
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
Resources
A handbook of statistics is available at NIST:
http://www.itl.nist.gov/div898/handbook/index.htm
The chi-squared test for goodness of fit is explained in
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
Homework
1. In another experiment, Mendel hypothesized that a certain breeding experiment should result in a
3:1 ratio of green versus yellow pods. He conducted the experiment and found that among 580
plants there were 428 plants with green pods and 152 with yellow pods. Test Mendel’s hypothesis.
2. Carry out the hypothesis testing for the three applications in the beginning of this module.