The Chi-square test when the expected
frequencies are less than 5
Wai Wan Tsang and Kai Ho Cheng
Department of Computer Science, The University of Hong Kong, Pokfulam Road,
Hong Kong
{tsang, khcheng3}@cs.hku.hk
Summary. In the chi-square test, it is required that the expected frequency of each
cell is at least 5. This condition ensures that the CDF of the test statistic (χ2) can
be closely approximated by the chi-square distribution. This paper describes two
methods to compute the CDF of χ2 directly. The first method computes the exact
probabilities for all attainable values of χ2 . It is effective when both the number of
samples and the number of cells are small. The second method approximates the
CDF with an empirical distribution function that has three digits of accuracy. The
second method complements the first one when the number of cells is large. A C
program that uses these two methods to compute the CDF of χ2 is implemented.
With this program, one can carry out the chi-square test even when some or all
expected frequencies are less than 5.
Key words: goodness-of-fit test, chi-square test
1 Introduction
The chi-square goodness-of-fit test is used to check whether a set of samples fits a
purported discrete distribution. The null hypothesis is that the samples follow the
distribution. Suppose that the possible outcomes of an experiment are 1, 2, . . ., k,
with probabilities p1 , p2 , . . ., pk , respectively. The experiment is carried out n times
independently. Let o1 , o2 , . . ., ok be the numbers of 1, 2, . . ., k respectively in the n
outcomes. Note that
oi = n and
pi = 1. The chi-square statistic is defined as
P
X (o − np )
k
χ2 =
i
i=1
i
npi
P
2
(1)
oi is called the observed frequency of cell i and npi is the expected frequency.
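For concreteness, the statistic in (1) can be computed with a few lines of C. The sketch below is our own illustration, not code from the paper's program; the function and argument names are ours.

```c
/* Chi-square statistic (1): o[0..k-1] are the observed counts, p[0..k-1]
 * the hypothesized cell probabilities, and n the total number of samples. */
double chi_square(const int *o, const double *p, int k, int n)
{
    double chisq = 0.0;
    for (int i = 0; i < k; i++) {
        double e = n * p[i];               /* expected frequency np_i */
        double d = o[i] - e;
        chisq += d * d / e;
    }
    return chisq;
}
```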
When the null hypothesis is true and all expected frequencies are at least 5, the
CDF of χ2 is closely approximated by the chi-square distribution of k − 1 degrees of
freedom, denoted as Chisq(x, k − 1). Let p-value = Chisq(χ2 , k − 1). If the p-value
is greater than a pre-set threshold, say, 0.95, the null hypothesis is
rejected. Otherwise, it is accepted.
χ2 in fact takes discrete values, whereas the chi-square distribution is continuous. Figure
1a shows the true CDF of χ2 when all npi ’s are 5 (the staircases) and Chisq(x, 5)
(the smooth curve). They are close to each other. Figure 1b shows the staircases and
the curve again when all npi ’s are 2. In this graph, the curve deviates noticeably
from the staircases. To ensure that the CDF can be closely approximated by the
chi-square distribution, the chi-square test requires that all expected frequencies be at
least 5.
Fig. 1. The CDFs of χ2 and their approximation, Chisq(x, k − 1). (a) k = 6, n = 30, all pi = 1/6. (b) k = 6, n = 12, all pi = 1/6.
The chi-square test was suggested by Karl Pearson in 1900 [PK00]. The approximation of the CDF of χ2 with the chi-square distribution was crucial before the
computer era. With today’s computing technology, we can actually compute the
CDF of χ2 on the fly, at least when n and k are small. By doing so, we can relax the
at-least-5 requirement on the expected frequencies. This relaxation is important in
applications where samples are scarce or very expensive, e.g., in medical
or genomic research.
This paper describes two methods for computing the CDF of χ2 , one analytical
and one empirical. The first method computes the exact CDF but is inefficient when
k or n is large. The second method computes an empirical distribution function
(EDF) of χ2 using 11 million trials. The resulting probabilities have at least three
digits of accuracy. A C program that uses these two methods to compute the CDF
of χ2 is implemented. With this program, one can carry out the chi-square test even
when some or all expected frequencies are less than 5.
2 The analytical method
It is easy to see that when k = 2, a test instance, specified by [o1 , o2 ], follows the
binomial distribution. That is, the probability that there are o1 1’s and o2 2’s is
$$\frac{n!}{o_1!\,o_2!}\, p_1^{o_1} p_2^{o_2} \qquad (2)$$
When k ≥ 2, [o1 , o2 , . . . , ok ] follows the multinomial distribution, a generalization
of the binomial distribution. The probability, p, that [o1 , o2 , . . . , ok ] occurs is
$$p = \frac{n!}{o_1!\,o_2!\cdots o_k!}\, p_1^{o_1} p_2^{o_2} \cdots p_k^{o_k} \qquad (3)$$
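As a quick check of (3) with made-up numbers (our own example, not from the paper): take k = 3, n = 4 and p1 = p2 = p3 = 1/3. The probability of observing [2, 1, 1] is
$$\frac{4!}{2!\,1!\,1!}\left(\tfrac{1}{3}\right)^2\left(\tfrac{1}{3}\right)^1\left(\tfrac{1}{3}\right)^1 = 12\cdot\frac{1}{81} \approx 0.148.$$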
The following sketches a straightforward way to compute the CDF of χ2 using
the above formula.
1. For each instance, [o1 , o2 , . . . , ok ], compute the χ2 value and p.
2. Sort the pairs [χ2, p] in ascending order of the χ2 values.
3. Combine the pairs that have identical χ2 values. The p in the new pair is the sum
of the p’s in the pairs being combined. For example, [0.65, 0.01] and [0.65, 0.02]
are combined into [0.65, 0.03]. The resulting list gives the density distribution
of χ2 .
4. Accumulate the p’s in the density distribution to form the CDF.
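The following is a rough, self-contained C sketch of these four steps. It is our own illustration, not the authors' program: the pre-computation of factorials and powers described below is replaced by lgamma calls, and the fixed array sizes are assumptions for small k and n.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAXPAIRS 2000000          /* assumed bound, enough for small k and n */
#define MAXK     32

typedef struct { double chisq, prob; } Pair;

static Pair pairs[MAXPAIRS];
static int  npairs = 0;
static int  o[MAXK];              /* current instance [o_1, ..., o_k] */

static double lfact(int m) { return lgamma(m + 1.0); }   /* log m! */

/* Step 1: enumerate every [o_1, ..., o_k] with sum n by recursion and
 * record its chi-square value (1) and multinomial probability (3).   */
static void enumerate(int cell, int left, int k, int n, const double *p)
{
    if (cell == k - 1) {
        o[cell] = left;                       /* last cell is forced  */
        double chisq = 0.0, logp = lfact(n);
        for (int i = 0; i < k; i++) {
            double e = n * p[i];              /* expected frequency   */
            chisq += (o[i] - e) * (o[i] - e) / e;
            logp  += o[i] * log(p[i]) - lfact(o[i]);
        }
        if (npairs < MAXPAIRS) {
            pairs[npairs].chisq = chisq;
            pairs[npairs].prob  = exp(logp);
            npairs++;
        }
        return;
    }
    for (int j = 0; j <= left; j++) {
        o[cell] = j;
        enumerate(cell + 1, left - j, k, n, p);
    }
}

/* Step 2: sort the pairs by chi-square value. */
static int cmp(const void *a, const void *b)
{
    double d = ((const Pair *)a)->chisq - ((const Pair *)b)->chisq;
    return (d > 0) - (d < 0);
}

/* Steps 3-4: probabilities of instances with equal chi-square values
 * simply add up while accumulating, giving the exact CDF at x.       */
static double exact_cdf(double x, int k, int n, const double *p)
{
    npairs = 0;
    enumerate(0, n, k, n, p);
    qsort(pairs, npairs, sizeof(Pair), cmp);
    double cdf = 0.0;
    for (int i = 0; i < npairs && pairs[i].chisq <= x + 1e-12; i++)
        cdf += pairs[i].prob;
    return cdf;
}

int main(void)
{
    /* The setting of Figure 1b: k = 6 cells, n = 12, all p_i = 1/6. */
    double p[6] = { 1/6., 1/6., 1/6., 1/6., 1/6., 1/6. };
    printf("P(chi^2 <= 5.0) = %.4f\n", exact_cdf(5.0, 6, 12, p));
    return 0;
}
```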
A C program that computes the CDF using this method has been implemented.
The test instances, [o1 , o2 , . . . , ok ]’s, are enumerated using recursion. For efficiency,
the powers of pi ’s and factorials in the formula of the multinomial distribution are
pre-computed. To verify the correctness of our program, we plot the computed
CDFs together with the corresponding chi-square distributions in Figure 2. As
expected, they are very close to each other.
To demonstrate its effectiveness, we used the program to compute the CDFs for
the chi-square test with k = 2, 3, . . ., 10 cells. For each k, we found the largest n such
that the computation could end within 1 minute on a PC with a 2.26 GHz Pentium 4
processor. Table 1 shows the n's recorded.
k           2     3     4    5    6   7   8   9   10
Largest n   1.97  6560  500  150  75  50  35  28  23

Table 1. The largest n's found for various k's s.t. the program ends in 1 minute.
3 The empirical method
The analytical method is inefficient when k is large. For such cases, we can estimate
the EDF of χ2 using simulation. This approach was suggested by Professor G. Marsaglia in 2005 [MAR05]. We have implemented a C program for the task.

Fig. 2. The CDFs of χ2 and their corresponding chi-square distributions. (a) k = 4, n = 40, all pi = 1/4. (b) k = 8, n = 40, all pi = 1/8.
In our program, random numbers are generated using a combination of the
multiply-with-carry generator [MZ91] and the 3-shift generator [MAR03]. Discrete
variates are obtained using the method suggested in [MTW04].
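As a rough illustration of this simulation approach (our own sketch, not the authors' program), the C code below combines a multiply-with-carry and a 3-shift generator using Marsaglia's widely circulated constants, which the paper does not actually list, draws the discrete variates by simple inversion rather than the table method of [MTW04], and estimates the EDF value at an observed χ2 as the fraction of simulated statistics not exceeding it.

```c
#include <stdio.h>
#include <stdint.h>

/* Combined generator: multiply-with-carry plus 3-shift (xorshift).
 * Constants are Marsaglia's classic ones (an assumption; the paper
 * does not state the exact parameters of its generator).           */
static uint32_t z = 362436069u, w = 521288629u, jsr = 123456789u;

static uint32_t rnd32(void)
{
    z = 36969u * (z & 65535u) + (z >> 16);
    w = 18000u * (w & 65535u) + (w >> 16);
    jsr ^= jsr << 17;  jsr ^= jsr >> 13;  jsr ^= jsr << 5;
    return ((z << 16) + w) + jsr;
}

static double uniform01(void) { return rnd32() * 2.3283064365e-10; }  /* in [0,1) */

/* Estimate P(chi^2 <= observed) by simulation: m trials, each drawing
 * n outcomes from p[0..k-1] by inversion and computing statistic (1). */
static double edf_pvalue(double observed, int k, int n,
                         const double *p, long m)
{
    long hits = 0;
    int  o[100];                               /* k <= 100 assumed */
    for (long t = 0; t < m; t++) {
        for (int i = 0; i < k; i++) o[i] = 0;
        for (int s = 0; s < n; s++) {          /* draw one sample  */
            double u = uniform01(), acc = 0.0;
            int i = 0;
            while (i < k - 1 && u >= (acc += p[i])) i++;
            o[i]++;
        }
        double chisq = 0.0;
        for (int i = 0; i < k; i++) {
            double e = n * p[i];
            chisq += (o[i] - e) * (o[i] - e) / e;
        }
        if (chisq <= observed) hits++;
    }
    return (double)hits / m;
}

int main(void)
{
    /* Same setting as Figure 1b: k = 6, n = 12, uniform p_i. */
    double p[6] = { 1/6., 1/6., 1/6., 1/6., 1/6., 1/6. };
    printf("EDF estimate of P(chi^2 <= 5.0) = %.4f\n",
           edf_pvalue(5.0, 6, 12, p, 1000000L));
    return 0;
}
```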
The maximum absolute error (MAE) in an EDF has the same distribution as the
Kolmogorov statistic [TW04]. Suppose that an EDF is obtained using m trials.
Using the asymptotic distribution of the Kolmogorov statistic given in [KOL33],
the mean and standard deviation of the MAE are 0.87/√m and 0.26/√m, respectively.
In our program, m = 11,000,000. The mean plus three standard deviations is
0.0004975. Therefore, it is very safe to claim that the EDF is accurate up to the
third digit.
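Spelling out the arithmetic behind this claim:
$$\frac{0.87}{\sqrt{11{,}000{,}000}} + 3\times\frac{0.26}{\sqrt{11{,}000{,}000}} \approx 0.0002623 + 0.0002352 \approx 0.0004975 < 0.0005,$$
so errors beyond the third decimal place are very unlikely.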
To verify the correctness of our program, we plot the estimated EDFs together
with the true CDFs computed using the analytical method in Figure 3. The EDFs
coincide with the CDFs in the graphs.
We used our program to estimate the EDFs of χ2 for different n's and different
hypothesized distributions with 5 values (k = 5). The execution times are shown
in Table 2. As expected, the execution time is proportional to n but is insensitive
to the distribution.
Distribution                                             n=20  n=30  n=40  n=50
p1 = 1/5,  p2 = 1/5,  p3 = 1/5,  p4 = 1/5,  p5 = 1/5     18 s  24 s  31 s  38 s
p1 = 1/15, p2 = 2/15, p3 = 3/15, p4 = 4/15, p5 = 5/15    19 s  25 s  32 s  38 s
p1 = 1/25, p2 = 2/25, p3 = 4/25, p4 = 7/25, p5 = 11/25   20 s  25 s  32 s  38 s

Table 2. Execution times for computing the EDFs for various n's and distributions.
Fig. 3. The CDFs of χ2 and the EDFs obtained using our program. (a) k = 6, n = 12, all pi = 1/6. (b) k = 8, n = 40, all pi = 1/8.
Table 3 shows the execution times of computing the EDFs for different k’s when
n = 200. In the experiment, the hypothesized distributions are uniform,
i.e., all pi ’s are equal. The results show that the execution time is insensitive to k.
          n = 200
k = 20    137 s
k = 40    146 s
k = 60    147 s
k = 80    168 s
k = 100   172 s

Table 3. Execution times for computing the EDF for n = 200 and k = 20, 40, 60, 80 and 100.
4 Discussion
A C program that evaluates the CDF of χ2 (p-value) in the chi-square test has been
developed. If all expected frequencies are at least 5, the p-value is computed from
the chi-square distribution of k − 1 degrees of freedom as usual. Otherwise, if k ≤ 10
and n is less than or equal to the corresponding value in Table 1, the p-value is
computed using the analytical method; else the empirical method is used. If the
empirical method is used, the estimated execution time will be printed on the console.
The program can be downloaded from http://www.cs.hku.hk/~tsang/chisq.c.
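For illustration only (hypothetical names, not the routines in the downloadable program), the check that decides whether the classical approximation may be used is simply:

```c
#include <stdio.h>

/* Return 1 if every expected frequency np_i is at least 5, i.e. the
 * classical chi-square approximation may be used; 0 otherwise.      */
static int all_expected_at_least_5(int k, int n, const double *p)
{
    for (int i = 0; i < k; i++)
        if (n * p[i] < 5.0)
            return 0;
    return 1;
}

int main(void)
{
    double p[6] = { 1/6., 1/6., 1/6., 1/6., 1/6., 1/6. };
    /* k = 6, uniform cells: n = 30 passes the check, n = 12 does not. */
    printf("n = 30: %s\n", all_expected_at_least_5(6, 30, p)
                               ? "use Chisq(x, k - 1)"
                               : "compute the CDF of chi^2 directly");
    printf("n = 12: %s\n", all_expected_at_least_5(6, 12, p)
                               ? "use Chisq(x, k - 1)"
                               : "compute the CDF of chi^2 directly");
    return 0;
}
```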
We are still tuning the program for efficiency. A dynamic programming approach
is being considered for computing the true CDF of χ2 . For the empirical method,
certain random number generators that are faster than the combined generator
currently used are being tested for suitability.
Fig. 4. Two CDFs with large quantum jumps. (a) k = 2, n = 10, p1 = p2 = 1/2. (b) k = 2, n = 6, p1 = 1/4, p2 = 3/4.
χ2 is a discrete variable but is treated as a continuous variable in the chi-square
test. The appropriateness of this treatment depends on the sizes of k and n. When k is very small, the
quantum jumps in the CDF of χ2 are obvious even when all expected frequencies
are at least 5. Figure 4a shows an extreme case where k = 2, n = 10 and p1 = p2 =
1/2. The quantum jumps are bigger when the at-least-5 requirement is not satisfied
or the pi ’s are not equal, or both, as shown in Figure 4b where k = 2, n = 6, p1 =
1/4 and p2 = 3/4. The effects of the discreteness on Type I error, Type II error and
the power of the chi-square test are worth further investigation.
References
[KOL33] Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91 (1933)
[MAR03] Marsaglia, G.: Xorshift RNGs. Journal of Statistical Software, 8, Issue 14 (2003)
[MAR05] Marsaglia, G.: Monkeying with the Goodness-of-Fit Test. Journal of Statistical Software, 14, Issue 13 (2005)
[MTW04] Marsaglia, G., Tsang, W.W. and Wang, J.: Fast Generation of Discrete Random Variables. Journal of Statistical Software, 11, Issue 3 (2004)
[MZ91] Marsaglia, G. and Zaman, A.: A new class of random number generators. The Annals of Applied Probability, 1, 462–480 (1991)
[PK00] Pearson, K.: On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling. Philosophical Magazine, Series 5, 50, 157–175 (1900)
[TW04] Tsang, W.W. and Wang, J.: Evaluating the CDF of the Kolmogorov statistic for normality testing. Proceedings of COMPSTAT 2004, 16th Symposium of IASC, Prague, 1893–1900, August 23-27 (2004)