The Chi-square test when the expected frequencies are less than 5

Wai Wan Tsang and Kai Ho Cheng

Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
{tsang, khcheng3}@cs.hku.hk

Summary. The chi-square test requires that the expected frequency of each cell be at least 5. This condition ensures that the CDF of the test statistic ($\chi^2$) is closely approximated by the chi-square distribution. This paper describes two methods for computing the CDF of $\chi^2$ directly. The first method computes the exact probabilities of all attainable values of $\chi^2$. It is effective when both the number of samples and the number of cells are small. The second method approximates the CDF with an empirical distribution function that has three digits of accuracy. It complements the first method when the number of cells is large. A C program that uses these two methods to compute the CDF of $\chi^2$ has been implemented. With this program, one can carry out the chi-square test even when some or all expected frequencies are less than 5.

Key words: goodness-of-fit test, chi-square test

1 Introduction

The chi-square goodness-of-fit test is used to check whether a set of samples fits a purported discrete distribution. The null hypothesis is that the samples follow the distribution. Suppose that the possible outcomes of an experiment are $1, 2, \ldots, k$, with probabilities $p_1, p_2, \ldots, p_k$, respectively. The experiment is carried out $n$ times independently. Let $o_1, o_2, \ldots, o_k$ be the numbers of occurrences of $1, 2, \ldots, k$, respectively, in the $n$ outcomes. Note that $\sum_{i=1}^{k} o_i = n$ and $\sum_{i=1}^{k} p_i = 1$. The chi-square statistic is defined as

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - np_i)^2}{np_i} \qquad (1)$$

Here $o_i$ is called the observed frequency of cell $i$ and $np_i$ is its expected frequency. When the null hypothesis is true and all expected frequencies are at least 5, the CDF of $\chi^2$ is closely approximated by the chi-square distribution with $k - 1$ degrees of freedom, denoted as Chisq($x$, $k - 1$).
Let p-value = Chisq($\chi^2$, $k - 1$). Note that the p-value here is a value of the CDF, so a large p-value indicates a poor fit. If the p-value is greater than a pre-set threshold, say, 0.95, the null hypothesis is rejected. Otherwise, it is accepted.

$\chi^2$ takes only discrete values, whereas the chi-square distribution is continuous. Figure 1a shows the true CDF of $\chi^2$ when all $np_i$'s are 5 (the staircase) and Chisq($x$, 5) (the smooth curve). They are close to each other. Figure 1b shows the staircase and the curve again when all $np_i$'s are 2. In this graph, the curve deviates noticeably from the staircase. To ensure that the CDF can be closely approximated by the chi-square distribution, the chi-square test requires all expected frequencies to be at least 5.

Fig. 1. The CDFs of $\chi^2$ and their approximation, Chisq($x$, $k - 1$). (a) $k = 6$, $n = 30$ and all $p_i = 1/6$. (b) $k = 6$, $n = 12$ and all $p_i = 1/6$.

The chi-square test was proposed by Karl Pearson in 1900 [PK00]. Approximating the CDF of $\chi^2$ with the chi-square distribution was crucial before the computer era. With today's computing technology, we can compute the CDF of $\chi^2$ on the fly, at least when $n$ and $k$ are small. In doing so, we can relax the at-least-5 requirement on the expected frequencies. The relaxation is important in applications where test samples are scarce or very expensive, e.g., in medical or genomic research.

This paper describes two methods for computing the CDF of $\chi^2$, one analytical and one empirical. The first method computes the exact CDF but is inefficient when $k$ or $n$ is large. The second method computes an empirical distribution function (EDF) of $\chi^2$ using 11 million trials. The resulting probabilities have at least three digits of accuracy. A C program that uses these two methods to compute the CDF of $\chi^2$ has been implemented. With this program, one can carry out the chi-square test even when some or all expected frequencies are less than 5.
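The classical procedure described in this section can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C program; the function names `chi_square_stat`, `chisq_cdf` and `expected_ok` are hypothetical. The chi-square CDF is evaluated here with the standard power series for the regularized lower incomplete gamma function, which is an assumption of this sketch rather than something the paper specifies.

```python
from math import exp, lgamma, log

def chi_square_stat(observed, probs):
    """Chi-square statistic of equation (1): sum of (o_i - n p_i)^2 / (n p_i)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def chisq_cdf(x, df):
    """Chisq(x, df): CDF of the chi-square distribution with df degrees of freedom.

    Uses the series gamma(a, h) = e^(-h) h^a * sum_{j>=0} h^j / (a (a+1) ... (a+j))
    with a = df / 2 and h = x / 2, divided by Gamma(a).
    """
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a          # j = 0 term of the series
    total = term
    j = 1
    while term > 1e-16 * total:
        term *= h / (a + j)  # next term of the series
        total += term
        j += 1
    return total * exp(-h + a * log(h) - lgamma(a))

def expected_ok(n, probs, minimum=5):
    """True if every expected frequency n * p_i is at least `minimum`."""
    return all(n * p >= minimum for p in probs)
```

For example, with `observed = [3, 1, 2, 2, 3, 1]` and `probs = [1/6] * 6` (the setting of Figure 1b), `chi_square_stat(observed, probs)` gives the statistic and `chisq_cdf(stat, len(probs) - 1)` gives the classical p-value, while `expected_ok(12, probs)` reports that the at-least-5 requirement is not met.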
2 The analytical method

It is easy to see that when $k = 2$, a test instance, specified by $[o_1, o_2]$, follows the binomial distribution. That is, the probability that there are $o_1$ 1's and $o_2$ 2's is

$$\frac{n!}{o_1!\,o_2!}\, p_1^{o_1} p_2^{o_2} \qquad (2)$$

For general $k \ge 2$, $[o_1, o_2, \ldots, o_k]$ follows the multinomial distribution, a generalization of the binomial distribution. The probability, $p$, that $[o_1, o_2, \ldots, o_k]$ occurs is

$$p = \frac{n!}{o_1!\,o_2! \cdots o_k!}\, p_1^{o_1} p_2^{o_2} \cdots p_k^{o_k} \qquad (3)$$

The following sketches a straightforward way to compute the CDF of $\chi^2$ using the above formula.

1. For each instance, $[o_1, o_2, \ldots, o_k]$, compute the $\chi^2$ value and $p$.
2. Sort the pairs $[\chi^2, p]$ in ascending order of the $\chi^2$ values.
3. Combine the pairs that have identical $\chi^2$ values. The $p$ in the new pair is the sum of the $p$'s in the pairs being combined. For example, [0.65, 0.01] and [0.65, 0.02] are combined into [0.65, 0.03]. The resulting list gives the probability distribution of $\chi^2$.
4. Accumulate the $p$'s in this distribution to form the CDF.

A C program that computes the CDF using this method has been implemented. The test instances, $[o_1, o_2, \ldots, o_k]$, are enumerated using recursion. For efficiency, the powers of the $p_i$'s and the factorials in the formula of the multinomial distribution are pre-computed.

To verify the correctness of our program, we plot the computed CDFs together with the corresponding chi-square distributions in Figure 2. As expected, they are very close to each other.

To demonstrate the effectiveness of the method, we used the program to compute the CDFs for the chi-square test with $k = 2, 3, \ldots, 10$ cells. For each $k$, we found the largest $n$ such that the computation finished within 1 minute on a PC with a 2.26 GHz Pentium 4 processor. Table 1 shows the $n$'s recorded.

k          2     3     4    5    6   7   8   9   10
Largest n  1.97  6560  500  150  75  50  35  28  23

Table 1. The largest $n$ for each $k$ such that the program finishes within 1 minute.
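The four steps of the analytical method can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' C program: the names `compositions` and `exact_cdf` are hypothetical, and the cell probabilities are passed as `Fraction` objects so that identical $\chi^2$ values compare exactly equal in step 3 (with floating-point arithmetic, a tolerance would be needed instead). No pre-computation of powers or factorials is attempted.

```python
from fractions import Fraction
from math import factorial

def compositions(n, k):
    """Yield every tuple (o_1, ..., o_k) of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def exact_cdf(n, probs):
    """Return the exact CDF of chi^2 as a sorted list of (value, P(chi^2 <= value)).

    `probs` must be a list of positive Fractions summing to 1.
    """
    k = len(probs)
    density = {}
    for obs in compositions(n, k):
        # Step 1: chi-square value (equation (1)) and multinomial probability
        # (equation (3)) of this instance, both as exact rationals.
        chi2 = sum((o - n * p) ** 2 / (n * p) for o, p in zip(obs, probs))
        denom = 1
        for o in obs:
            denom *= factorial(o)
        prob = Fraction(factorial(n), denom)
        for o, p in zip(obs, probs):
            prob *= p ** o
        # Step 3: merge instances that share the same chi-square value.
        density[chi2] = density.get(chi2, 0) + prob
    # Steps 2 and 4: sort by chi-square value and accumulate.
    cdf = []
    running = Fraction(0)
    for chi2 in sorted(density):
        running += density[chi2]
        cdf.append((chi2, running))
    return cdf
```

Calling `exact_cdf(10, [Fraction(1, 2)] * 2)` enumerates the 11 instances of the $k = 2$, $n = 10$ case and merges them into six distinct $\chi^2$ values. The number of instances grows rapidly with $n$ and $k$, which is consistent with the limits reported in Table 1.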
Fig. 2. The CDFs of $\chi^2$ and their corresponding chi-square distributions. (a) $k = 4$, $n = 40$ and all $p_i = 1/4$. (b) $k = 8$, $n = 40$ and all $p_i = 1/8$.

3 The empirical method

The analytical method is inefficient when $k$ is large. For such cases, we can estimate the CDF of $\chi^2$ by an EDF obtained from simulation. This approach was suggested by Professor G. Marsaglia in 2005 [MAR05]. We have implemented a C program for the task. In our program, random numbers are generated using a combination of the multiply-with-carry generator [MZ91] and the 3-shift generator [MAR03]. Discrete variates are obtained using the method suggested in [MTW04].

The maximum absolute error (MAE) in an EDF has the same distribution as the Kolmogorov statistic [TW04]. Suppose that an EDF is obtained using $m$ trials. Using the asymptotic distribution of the Kolmogorov statistic given in [KOL33], the mean and standard deviation of the MAE are $0.87/\sqrt{m}$ and $0.26/\sqrt{m}$, respectively. In our program, $m$ = 11,000,000. The mean plus three standard deviations is $(0.87 + 3 \times 0.26)/\sqrt{m} = 0.0004975$, which is below 0.0005. Therefore, it is very safe to claim that the EDF is accurate to the third decimal digit.

To verify the correctness of our program, we plot the estimated EDFs together with the true CDFs computed using the analytical method in Figure 3. The EDFs coincide with the CDFs in the graphs.

We used our program to estimate the EDFs of $\chi^2$ for different $n$'s and different hypothetical distributions having 5 values ($k = 5$). The execution times are shown in Table 2. As expected, the execution time grows roughly linearly with $n$ but is insensitive to the distribution.

Distribution $(p_1, p_2, p_3, p_4, p_5)$    n=20   n=30   n=40   n=50
(1/5, 1/5, 1/5, 1/5, 1/5)                   18 s   24 s   31 s   38 s
(1/15, 2/15, 3/15, 4/15, 5/15)              19 s   25 s   32 s   38 s
(1/25, 2/25, 4/25, 7/25, 11/25)             20 s   25 s   32 s   38 s

Table 2. Execution times for computing the EDFs for various $n$'s and distributions.
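The simulation approach and its error bound can be sketched in a few lines. This is a minimal Python illustration, not the authors' C program: it uses Python's built-in generator instead of the combined generator of [MZ91] and [MAR03], draws discrete variates with `random.choices` instead of the method of [MTW04], and the names `simulate_chi2`, `edf` and `mae_bound` as well as the default seed are hypothetical. A far smaller number of trials than 11 million is practical in pure Python, at the cost of a correspondingly larger error.

```python
import random
from bisect import bisect_right
from math import sqrt

def simulate_chi2(n, probs, m, seed=12345):
    """Return a sorted list of m simulated chi-square statistics.

    Each trial draws n outcomes from the hypothesized distribution `probs`
    and evaluates equation (1) on the resulting cell counts.
    """
    rng = random.Random(seed)
    k = len(probs)
    cells = list(range(k))
    stats = []
    for _ in range(m):
        counts = [0] * k
        for outcome in rng.choices(cells, weights=probs, k=n):
            counts[outcome] += 1
        stats.append(sum((o - n * p) ** 2 / (n * p)
                         for o, p in zip(counts, probs)))
    stats.sort()
    return stats

def edf(stats, x):
    """Empirical distribution function: fraction of simulated values <= x."""
    return bisect_right(stats, x) / len(stats)

def mae_bound(m):
    """Mean plus three standard deviations of the MAE for m trials,
    using the constants 0.87 and 0.26 quoted in the text."""
    return (0.87 + 3 * 0.26) / sqrt(m)
```

With `stats = simulate_chi2(10, [0.5, 0.5], 20000)`, the value `edf(stats, x)` estimates the CDF of $\chi^2$ at `x`, and `mae_bound(20000)` indicates how much accuracy to expect from that many trials. Evaluating `mae_bound(11_000_000)` reproduces the figure of 0.0004975 used in the accuracy argument.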
Fig. 3. The CDFs of $\chi^2$ and the EDFs obtained using our program. (a) $k = 6$, $n = 12$ and all $p_i = 1/6$. (b) $k = 8$, $n = 40$ and all $p_i = 1/8$.

Table 3 shows the execution times for computing the EDFs for different $k$'s when $n = 200$. In this experiment, the hypothetical distributions are uniform, i.e., all $p_i$'s are equal. The results show that the execution time depends only weakly on $k$: it rises from 137 s to 172 s, by about a quarter, as $k$ increases fivefold.

          k = 20   k = 40   k = 60   k = 80   k = 100
n = 200   137 s    146 s    147 s    168 s    172 s

Table 3. Execution times for computing the EDF for $n = 200$ and $k$ = 20, 40, 60, 80 and 100.

4 Discussion

A C program that evaluates the CDF of $\chi^2$ (the p-value) in the chi-square test has been developed. If all expected frequencies are at least 5, the p-value is computed from the chi-square distribution with $k - 1$ degrees of freedom as usual. Otherwise, if $k \le 10$ and $n$ does not exceed the value shown in Table 1 for that $k$, the p-value is computed using the analytical method. In all remaining cases, the empirical method is used, and the estimated execution time is printed on the console. The program can be downloaded from http://www.cs.hku.hk/~tsang/chisq.c.

We are still tuning the program for efficiency. A dynamic programming approach is being considered for computing the true CDF of $\chi^2$. For the empirical method, certain random number generators that are faster than the combined generator currently used are being tested for suitability.

Fig. 4. Two CDFs with large quantum jumps. (a) $k = 2$, $n = 10$ and $p_1 = p_2 = 1/2$. (b) $k = 2$, $n = 6$, $p_1 = 1/4$ and $p_2 = 3/4$.

$\chi^2$ is a discrete variable but is treated as a continuous variable in the chi-square test. The appropriateness of this treatment depends on the sizes of $k$ and $n$. When $k$ is very small, the quantum jumps in the CDF of $\chi^2$ are obvious even when all expected frequencies are at least 5. Figure 4a shows an extreme case where $k = 2$, $n = 10$ and $p_1 = p_2 = 1/2$.
The quantum jumps are bigger when the at-least-5 requirement is not satisfied or the $p_i$'s are not equal, or both, as shown in Figure 4b where $k = 2$, $n = 6$, $p_1 = 1/4$ and $p_2 = 3/4$. The effects of the discreteness on the Type I error, the Type II error and the power of the chi-square test are worth further investigation.

References

[KOL33] Kolmogorov, A.: Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91 (1933)
[MAR03] Marsaglia, G.: Xorshift RNGs. Journal of Statistical Software, 8, Issue 14 (2003)
[MAR05] Marsaglia, G.: Monkeying with the Goodness-of-Fit Test. Journal of Statistical Software, 14, Issue 13 (2005)
[MTW04] Marsaglia, G., Tsang, W.W. and Wang, J.: Fast Generation of Discrete Random Variables. Journal of Statistical Software, 11, Issue 3 (2004)
[MZ91] Marsaglia, G. and Zaman, A.: A new class of random number generators. The Annals of Applied Probability, 1, 462–480 (1991)
[PK00] Pearson, K.: On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling. Philosophical Magazine, 50, Issue 5, 157–175 (1900)
[TW04] Tsang, W.W. and Wang, J.: Evaluating the CDF of the Kolmogorov statistic for normality testing. Proceedings of COMPSTAT 2004, 16th Symposium of IASC, Prague, 1893–1900, August 23–27 (2004)