MEC1345.fm Page 2361 Friday, September 21, 2001 3:36 PM Molecular Ecology (2001) 10, 2361–2373 Statistical power when testing for genetic differentiation Blackwell Science, Ltd N . RY M A N * and P. E . J O R D E † *Division of Population Genetics, Stockholm University, S-106 91 Stockholm, Sweden, †Division of Zoology, Department of Biology, University of Oslo, PO Box 1050 Blindern, N-0316 Oslo, Norway Abstract A variety of statistical procedures are commonly employed when testing for genetic differentiation. In a typical situation two or more samples of individuals have been genotyped at several gene loci by molecular or biochemical means, and in a first step a statistical test for allele frequency homogeneity is performed at each locus separately, using, e.g. the contingency chi-square test, Fisher’s exact test, or some modification thereof. In a second step the results from the separate tests are combined for evaluation of the joint null hypothesis that there is no allele frequency difference at any locus, corresponding to the important case where the samples would be regarded as drawn from the same statistical and, hence, biological population. Presently, there are two conceptually different strategies in use for testing the joint null hypothesis of no difference at any locus. One approach is based on the summation of chi-square statistics over loci. Another method is employed by investigators applying the Bonferroni technique (adjusting the P-value required for rejection to account for the elevated alpha errors when performing multiple tests simultaneously) to test if the heterogeneity observed at any particular locus can be regarded significant when considered separately. Under this approach the joint null hypothesis is rejected if one or more of the component single locus tests is considered significant under the Bonferroni criterion. We used computer simulations to evaluate the statistical power and realized alpha errors of these strategies when evaluating the joint hypothesis after scoring multiple loci. We find that the ‘extended’ Bonferroni approach generally is associated with low statistical power and should not be applied in the current setting. Further, and contrary to what might be expected, we find that ‘exact’ tests typically behave poorly when combined in existing procedures for joint hypothesis testing. Thus, while exact tests are generally to be preferred over approximate ones when testing each particular locus, approximate tests such as the traditional chi-square seem preferable when addressing the joint hypothesis. Keywords: allele frequency, Bonferroni, chi-square, contingency table, Fisher’s exact test, statistical analysis Received 8 January 2001; revision received 9 April 2001; accepted 20 April 2001 Introduction An increasingly common question in conservation and evolutionary biology is whether a set of samples are likely to represent the same gene pool. Several statistical techniques are being applied when addressing this type of problem, but there has been little discussion about their relative merits for detection of genetic heterogeneity. This lack is particularly obvious for methods used to combine the information from multiple loci. Correspondence: Nils Ryman. Fax: +46 8 154041; E-mail: Nils.Ryman@popgen.su.se © 2001 Blackwell Science Ltd In a typical situation an investigator has collected tissue samples from two or more groups of individuals that are separated in space or time. Application of some biochemical or molecular techniques provides genotypes of the sampled individuals at one or more nuclear loci or at the mitochondrial genome, and each sample is described in terms of its size and allele (or haplotype) frequencies. The specific scientific questions may vary from study to study, but a very basic one, which frequently determines how to proceed with the analysis is the following one: Are the allele frequency differences observed among samples large enough to suggest that all the samples are not drawn from the same population (gene pool)? It appears that in most MEC1345.fm Page 2362 Friday, September 21, 2001 3:36 PM 2362 N . RY M A N and P. E . J O R D E cases the underlying evolutionary model is one of ‘selective neutrality — isolation — genetic drift’, which implies that all polymorphic loci examined are potentially informative with respect to the question of overall genetic heterogeneity. The general statistical approach most frequently used — and the one dealt with in this paper — is first to conduct a contingency test for allele frequency homogeneity for each locus separately, and in a second step to evaluate the simultaneous, or joint, information from all loci examined. The test procedure applied to each individual locus (contingency table) implies assessment of the probability of obtaining — if the null hypothesis (H0 ) of equal allele frequencies is true — an outcome that is as likely as, or less likely than, the observed one. This probability (P-value) can either be calculated exactly (Fisher 1950; Mehta & Patel 1983), iterated or simulated to a desired degree of precision (Roff & Bentzen 1989; Raymond & Rousset 1995a,b), or approximated by means of some test statistic expected to follow a chi-square distribution (Fisher 1950; Everitt 1977; Sokal & Rohlf 1981). In the second step the results from the separate tests are combined for evaluation of the joint null hypothesis (H0,J ) that there is no allele frequency difference at any locus (i.e. H0,J is true when all the separate H0s are true). Presently there appears to be two conceptually different strategies in use for testing H0,J. One technique is based on the summation of chi-square statistics and utilizes the fact that the sum of a series of chi-square distributed variables also follows a chi-square distribution (e.g. Everitt 1977; Sokal & Rohlf 1981). Another approach is used by investigators applying the Bonferroni technique to test if the heterogeneity observed at any particular locus can be regarded significant when considered separately. The general idea behind the Bonferroni method is to account for the increased probability of obtaining, by pure chance when the null hypothesis is true, a significant result at one or more loci when performing multiple tests (e.g. Rice 1989). Under this approach the joint null hypothesis (H0,J ) is rejected if one or more of the component contingency tests is considered significant, i.e. at least one ‘Bonferroni significance’ is required for rejecting the joint null hypothesis of equal allele frequencies at all loci. This strategy for testing H0,J is conceptually adequate in the present context, although it has been noted that it may be quite conservative resulting in too few rejections (Legendre & Legendre 1998). We are aware of no study, however, that evaluates the two approaches with respect to their ability to detect true heterogeneity. This paper compares the power of the above statistical methods — summation of chi-square vs. application of the Bonferroni method to determine if any one of the separate locus tests can be considered significant — for detecting genetic heterogeneity when multiple loci have been scored. The results show that the efficiency may differ dramatically between the two approaches and, contrary to what might be expected, this difference may become enhanced as the number of loci increases. Exemplifying the problem As an example of the statistical test options, Table 1 presents sample allele frequency data for 12 codominant and di-allelic allozyme loci from two consecutive yearclasses of brown trout (Salmo trutta) collected from Lake Blanktjärnen in central Sweden (see Jorde & Ryman 1996 for details). Are the observed allele frequency differences large enough to suggest that there are true genetic differences between year-classes? The allele frequency difference at each individual locus was tested using Fisher’s exact test, the conventional chisquare 2 × 2 contingency statistic ( X2, degrees of freedom, d.f. = 1), and the chi-square statistic with Yates’ continuity correction (X C2, d.f. = 1; chi-square test statistics are denoted by X2 to distinguish them from values of a theoretical χ2 distribution). Both chi-square approximations provide Pvalues that are reasonably similar to the exact ones, but those from the conventional X2 are generally smaller (less conservative) whereas X C2 tends to produce larger ones. All methods yield significant results (P < 0.05) for the same two loci (sAAT-4 and bGALA-2). When testing the joint null hypothesis (H0,J) that there are no allele frequency differences at any locus, we note that the sum of a set of χ2 distributed variables also follows a χ2 distribution with d.f. equal to the sum of d.f. of the contributing variables. This summation is straightforward for X2 and X C2, which are both expected to follow approximately a χ2 distribution under the null hypothesis. With respect to the P-values obtained in the exact tests, Fisher (1950) has shown that when the null hypothesis is true the quantity –2ln(P) is expected to follow asymptotically a χ2 distribution with d.f. = 2. Thus, summing the negative of twice the natural logarithm of the 12 P-values results in a X2 statistic that is to be evaluated against a χ2 distribution with d.f. = 24 (Table 1; this technique of summing –2ln(P) is sometimes referred to as ‘Fisher’s method’, but it should not be confused with Fisher’s exact test). Under each of the three approaches the summed chi-square statistic is significant (P < 0.05) and results in rejection of the joint null hypothesis (H0,J ) although the level of significance differs among them (X2 and X C2 yielding the smallest and largest summation P-value, respectively). In contrast to the above ‘summation’ approaches, application of the Bonferroni method results in a different conclusion. The Bonferroni logic implies that no individual test in a series of k-tests is to be judged significant unless the Pvalue is smaller than α/k, where α is the preassigned significance level for rejecting the null hypothesis (Rice 1989 and references therein; see below). In the present case with k = 12 (12 loci) and α = 0.05 a P-value less than 0.05/12 = © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2363 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2363 Table 1 Allele frequencies (the 100 allele) and 2 × 2 contingency test statistics for 12 di-allelic allozyme loci in two cohorts (1992 and 1993) of brown trout from Lake Blanktjärnen, Sweden. The number of fish is 43 and 27 in 1992 and 1993, respectively Allele frequency (100 allele) Cohort Fisher’s exact test Chi-square test X2 Locus 1992 1993 P sAAT-4 DIA bGALA-2 bGLUA G3PDH-2 sIDHP-1 LDH-5 aMAN sMDH-2 ME MPI PEPLT 0.6395 0.7791 0.7791 0.8256 0.7558 0.4302 0.7209 0.9884 0.9767 0.8953 0.1047 0.6395 0.4074 0.8148 0.9444 0.7407 0.6296 0.5741 0.6852 0.9259 0.9815 0.8704 0.2222 0.6667 0.009 0.673 0.009 0.285 0.129 0.119 0.705 0.073 1.000 0.786 0.087 0.856 Sum Sum of d.f. P 0.0042 is thus required to be regarded significant. No individual (single locus) P-value in Table 1 is that small, and consequently no locus can be considered displaying significant heterogeneity when applying the Bonferroni technique. The joint null hypothesis (H0,J ) is therefore also accepted, regardless of whether the basic contingency P-values were obtained by means of chi-square approximation or exact calculation. Thus, in this example the ‘summation’ method indicates that the joint null hypothesis should be rejected for each of the tests statistics applied, whereas application of the Bonferroni technique consistently suggests acceptance when using the same set of individual P-values. The difference is crucial, but on the basis of the data available an investigator cannot determine what decision is most appropriate. It appears that many scientific journals would accept either approach without requiring additional data analysis. In the remainder of this paper we present results of computer simulations aimed at evaluating the probability of detecting a true genetic difference (statistical power) and the probability of falsely rejecting a true null hypothesis (α or type I error) when addressing the joint H0,J (Bonferroni vs. ‘summation’). We recognize that the present problem can also be addressed by means of exact binomial or multinomial calculations, but we chose the simulation approach for practical reasons. Simulations Random sampling of 2n genes (n diploid individuals) from populations with known allele frequencies was simulated © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 –2ln(P) 9.421 0.793 9.489 2.512 4.097 4.261 0.700 5.240 0.000 0.482 4.881 0.311 7.222 0.258 6.849 1.454 2.550 2.748 0.205 3.756 0.036 0.205 3.596 0.107 42.19 24 0.012 28.98 12 0.004 Chi-square test Yates’ correction P X C2 P 0.007 0.611 0.009 0.228 0.110 0.097 0.651 0.053 0.851 0.651 0.058 0.743 6.331 0.087 6.298 0.969 1.954 2.207 0.068 2.099 0.165 0.032 2.662 0.021 0.012 0.768 0.012 0.325 0.162 0.137 0.794 0.147 0.685 0.858 0.103 0.885 22.89 12 0.029 by means of pseudo random number generation. At each locus the hypothesis of equal allele frequencies was tested by various r × c contingency tests (r = number of samples and c = number of alleles) using both Fisher’s exact method and chi-square tests with d.f. = (r – 1) (c – 1). The number of replicates (runs) of each simulation was typically in the interval 5000–10 000, and the frequency of replicates resulting in rejection and acceptance of a null hypothesis provided estimates of the statistical power (probability of rejecting a false hypothesis) or the α (type I) error (probability of rejecting a true hypothesis), respectively. The intended α level was consistently kept at 0.05, rejecting the null hypothesis for P < 0.05. A replicate was discarded if the random sampling of genes resulted in less than c alleles being observed in the combined material from the r samples, and a new set of r samples was drawn in those cases. Population allele frequencies were generally chosen to result in a ‘realistic’ proportion of simulated contingency tables where at least one cell had a small expected value (expectancy less than 5 or 1) at low or moderate sample sizes. This was done in order not to provide an overly optimistic view of the fit of the various chi-square approximations to the expected χ2 distribution. When simulating situations where multiple loci (k > 1) are scored, all loci within a population were assumed to segregate at identical allele frequencies (for example, for di-allelic loci the allele under consideration may occur at the frequency 0.10 at all loci in population 1 and at 0.15 at all loci in population 2). Of course, in the real world MEC1345.fm Page 2364 Friday, September 21, 2001 3:36 PM 2364 N . RY M A N and P. E . J O R D E it is not very likely to encounter a situation where all the loci examined in a population segregate at exactly the same frequency (although the true allele frequency distribution is unknown in most cases). This model is appropriate for the present purpose of examining differences between test procedures, however. When dealing with relative frequencies both the power and the α error are dependent on the population frequencies, and varying those frequencies might obscure the comparison of test procedures. A ‘regular’ chi-square test statistic (X2) was calculated for each simulated contingency table. In addition, Yates’ continuity correction was applied to provide a corrected chi-square (X C2) for all 2 × 2 tables, and for larger tables the G-statistic with and without Williams’ correction was used (Sokal & Rohlf 1981; 737 – 38). Fisher’s exact test for 2 × 2 tables was performed as described by Sokal & Rohlf (1981; 740 – 42), and the algorithm of Mehta & Patel (1983) was applied for larger tables. Computational results from the statistical routines developed for the simulation programs were checked with outputs from softwares such as biom (Rohlf 1987 ), statistica (StatSoft Inc. 1998), StatXact-Turbo (CYTEL Software Corporation 1992), and genepop (Raymond & Rousset 1995b), and the simulated power estimates were checked against exact calculations and those obtained using the ‘standard’ normal approximation for power assessment (e.g. Zar 1984; 398). When combining the information from multiple loci for evaluation of the joint null hypothesis (H0,J ) of no difference at any locus the approaches of Bonferroni and summation of chi-square statistics were applied as exemplified in the preceding section. A considerable number of simulations have been conducted during the course of this study using different combinations of allele frequencies, sample sizes, number of alleles, loci, and populations. In order not to burden the presentation unnecessarily, however, we have tried to choose as simple combinations as possible for illustration of general trends and principles. Thus, most of the paper focuses on situations like that in Table 1 where the basic statistical tests refer to 2 × 2 tables (2 samples and 2 alleles per locus). 2 × 2 contingency tables Four examples of typical simulation results for 2 × 2 contingency tests are depicted in Fig. 1(a). In the most basic case we consider a single locus (k = 1) with two alleles (A and A′), and a random sample of 20 diploid individuals (40 genes) is drawn from each of two populations (1 and 2) where the A allele occurs in the true frequency Q1 and Q2, respectively. In the case of multiple loci (k > 1) all of them have the same allele frequency (Q1) in population 1, and within population 2 all frequencies are Q2. Every plate in Fig. 1(a) represents a specific combination of Q1 and Q2, and for each test procedure the proportion of replicates resulting in rejection of the joint null hypothesis (y-axis) is indicated for different number of loci examined (x-axis; k = 1, 2 … 5, 10, 20 … 50). As in the introductory example (Table 1) the joint null hypothesis (H0,J ) was rejected for P < 0.05 (summation of chi-squares), and when using the Bonferroni method rejection of H0,J required at least one single locus P-value smaller than 0.05 /k. For k = 1 the ‘summation’ is over a single value only, and the Bonferroni rejection criterion coincides with that of the basic test. When Q1 is different from Q2 (H0 is false; upper plates) the proportion of H0,J rejections estimates the power of the test, and when Q1 = Q2 (H0 is true; lower plates) it estimates the realized α error. In a first step we focus on the situations where H0 is false (Q1 ≠ Q2; Fig. 1a, upper plates). Considering a single locus only (k = 1) the power estimates for the Q1/ Q2 combination of 0.10/0.15 are 0.107, 0.041, and 0.052 when using X2, X C2, and Fisher’s exact test, respectively, and for the 0.50/0.60 combination the corresponding estimates are 0.150, 0.103, and 0.103. The difference between these simulated power estimates and the values expected theoretically is noticeable in the third decimal place only, indicating that the simulations provide reasonably accurate results. Also, Fisher ’s exact and the X C2 tests are both expected to be more conservative than the X2 test (e.g. Everitt 1977 ), and they yield accordingly lower power estimates within each combination of Q1/Q2. In contrast to the power observed for a single locus, the results for tests involving multiple loci are more surprising. Here, one might expect a reliable test to detect the true divergence between populations more frequently as additional loci are examined. It is only the ΣX2 approach, however, that behaves consistently in this way at both combinations of Q1 and Q2. In the case of Q1/ Q2 = 0.10/ 0.15 the probability of rejecting H0 actually tends to decrease when the number of loci included in the test increases for all methods except the ΣX2. When Q1 / Q2 = 0.50/ 0.60 all the procedures provide at least some increase of power when more loci are considered, but the ΣX2 approach is consistently the most powerful one. It is an important observation that the remarkably better power obtained through summation of X2 is not associated with an unduly large α error when H0,J is true. Rather, this method is the only one that consistently appears to provide an α error that is reasonably close to the intended one of α = 0.05 (Fig. 1a, lower panels). Simulation results for the considerably larger sample sizes of 500 individuals (1000 genes) are shown in Fig. 1(b). Here, the population allele frequencies are the same as those in Fig. 1(a) when estimating the α error (lower plates), but for the sake of illustration the assessment of power (upper plates) is based on smaller differences © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2365 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2365 Fig. 1 Proportion of statistical significances when testing the joint null hypothesis (H0,J) of no difference at any locus after simulated drawing of a sample of size (n) from each of two populations where the true allele frequency at di-allelic loci are Q1 and Q2, respectively. The number of loci examined is 1– 50. A test for allele frequency homogeneity is conducted for each locus separately using X2, XC2, and Fisher’s exact test, and the resulting Pvalues are combined into a joint test by the ‘summation’ and Bonferroni approaches. The intended α is 0.05, and each data point is based on 10 000 (Fig. 1a) and 5000 (Fig. 1b) replicates, respectively. See text for details. Figure 1(a): n = 20 individuals (40 genes). Figure 1(b): n = 500 individuals (1000 genes). between Q1 and Q2 to account for generally higher power associated with larger samples. (When sample sizes as large as 500 individuals the power of all methods is close to unity for the combinations of Q1 and Q2 used in Fig. 1a). At sample sizes of 1000 genes all methods are reasonably successful in keeping the α error close to the intended one of 0.05, except for those representing summation of X C2 and –2ln(P). This latter observation most likely reflects a somewhat slower approach to the limiting χ2 distribution for these two test statistics as compared to X2 (see below). When H0,J is false (upper panels, Fig. 1b) summation of X2 is the technique that consistently provides the highest probability of rejecting H0,J. For all methods, however, the power increases as the number of loci grows, although the ‘summation’ approach in all cases appears more powerful regardless of the statistical procedure applied in the basic © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 contingency tests. This is true also when summing X C2 and –2ln(P), in spite of the lower realized α observed for these statistics. Reasons for low power For both the Bonferroni and the ‘summation’ method the low power is associated with the approximations involved when assessing P-values for the basic 2 × 2 tables. Both methods are expected to work satisfactorily in theory when samples are large and allele frequencies intermediate. In practice, however, the approximate nature of the test statistics and the P-values produced by the primary contingency tests may provide a realized α in the combined test that is far below the intended one, and the reduced α typically results in a small power. MEC1345.fm Page 2366 Friday, September 21, 2001 3:36 PM 2366 N . RY M A N and P. E . J O R D E Fig. 1 Continued Bonferroni. As noted above, the objective behind the Bonferroni technique is to avoid excessive numbers of false significances when performing multiple tests through adjusting the P-value at which a particular component test is to be judged significant. The probability of observing false significances increases rapidly as the number of tests grows. When performing k-tests of a true null hypothesis at α = 0.05 the expected probability of obtaining one or more significances by pure chance is 1 – (1 – 0.05) k; for k = 10, for example, this probability exceeds 40%. In order to maintain the intended α level (here 0.05) of a particular test when a series of k-tests have been performed, the idea behind the Bonferroni method is to adjust the P-value for rejection such that the probability of observing one or more significances among the k-tests remains at α. This goal is met approximately if rejecting any particular H0 at P < α/k (rather than at P < α). The rationale for this criterion is based on the relationship 1 – (1 – α/k) k ≈ α. The above arguments form the basis for the ‘extended’ application of the Bonferroni correction in the context of testing the joint null hypothesis H0,J that all the components H0s are true when multiple tests have been performed. That is, if all the component (single locus) H0s are true (i.e. H0,J is true), the probability is α to observe one or more ‘Bonferroni significances’ (at the α/k level). Thus, H0,J is rejected when obtaining a single locus P-value less than α/k, otherwise it is accepted. © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2367 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2367 If the Bonferroni technique is to work in practice, however, two basic conditions must be met which are not satisfied in many realistic situations. First, it is necessary that each single locus test can produce a P-value that is smaller than α/k. However, this is not possible for many combinations of sample size (n) and population allele frequencies (Q1/Q2) because of the restricted number of potential outcomes of the sampling process. If basing the evaluation of H0,J on a suite of such contingency tests there is a substantial risk that the Bonferroni method yields a realized α that is considerably smaller than the intended one, and thereby a correspondingly reduced power. To clarify the point we may consider the extreme example of sampling two individuals (four genes) from each of two populations that are fixed for different alleles. The counts of the two types of alleles will be 4; 0 and 0; 4 for the two samples, respectively, which represents the most extreme outcome possible under the null hypothesis of equal allele frequencies. Fisher’s exact test yields a (twosided) P-value of 0.029, which is significant at the 5% level, and this is the smallest P-value that can be obtained with the present sample sizes. Analysis of, say, five alternately fixed loci would result in five P-values of 0.029, and intuition would justifiably make the investigator suspect that the populations are genetically divergent. A Bonferroni evaluation would dismiss such an interpretation, however, because no P-value is smaller than 0.01 (0.05/5), a P-value which is impossible to obtain with the present sample sizes. Thus, in this situation the Bonferroni correction results in a realized α level of zero, and the power is thereby also reduced to zero. In the present simplified example it is easy to see how the Bonferroni method, as an effect of a discrete number of possible experimental outcomes, may result in a reduction of the realized α far below the intended one, and thereby in a corresponding decrease of power. The phenomenon is a general one, however, and the magnitude of the effect depends on the sample sizes, the number of loci (tests), and the true allele frequency differences (Fig. 1). The other requirement for the Bonferroni method to work satisfactorily is that the realized α error of each of the basic, single locus, contingency tests is reasonably close to the intended one. In other words, when the component H0 is true the probability of obtaining a P-value of 0.05 or less should be close to 0.05, that of obtaining P ≤ 0.01 should be close to 0.01, etc. It appears that this characteristic of the sampling distribution is frequently taken for granted, in spite of the fact that substantial deviations are quite common. To exemplify the difference between intended and realized α errors in basic contingency tests we may consider the occurrence of various P-values when drawing two independent samples of 20 individuals (40 genes) from a population with the allele frequency Q = 0.10. For illustration, all the 2 × 2 tables possible when drawing two samples of © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 40 items were created, P-values based on regular chisquare (X2) and Fisher’s test were computed, the exact frequency of occurrence of each table and its associated P-values was derived binomially, and the cumulative frequency of occurrence of P ≤ 0.05 was depicted graphically (Fig. 2, upper plate). As seen in Fig. 2 (upper plate) the realized α may be considerably smaller than the intended one, and particularly so for Fisher’s exact test. With Fisher’s exact test, for example, values of P ≤ 0.05 only occur in a frequency of 0.017 (less than half the ‘ideal’ rate), and the discrepancy is even more pronounced for smaller P-values. The corresponding cumulative distributions of P-values when sampling 50 and 500 individuals (100 and 1000 genes) are depicted in the central and lower plates of Fig. 2, respectively. For X2 the correspondence between intended and realized α is fairly good at sample sizes of 50 individuals or more. In contrast, at the present population allele frequency samples in the order of hundreds of individuals are required for this to occur with Fisher’s exact test. When realized α of the separate (single locus) contingency tests are smaller than intended, the Bonferroni approach to testing the joint H0,J of no difference at any locus may result in an overall realized α that is far below the anticipated one. To exemplify, Table 2 gives exact realized values of α for 1–50 loci when applying Fisher’s exact test to two independent samples of n = 10 (or n = 20) diploids from a population with the true allele frequency Q = 0.10. For n = 10 and k = 10 loci (tests), for instance, the Bonferroni method implies that H0,J should be rejected when observing at least one contingency P ≤ 0.005 (0.05/ 10). With two samples of n1 = n2 = 10 diploids the largest Fisher P-value meeting the criterion P ≤ 0.005 is 0.00385 (due to the restricted number of possible 2 × 2 tables at these sample sizes), and P-values this small or smaller occur at a frequency of 0.000107 (exact cumulative binomial probability). Thus, the realized α of the Bonferroni approach to testing H0,J corresponds to the probability of observing one or more single locus P-values of P ≤ 0.005, which is α = 1 – (1 – 0.000107)10 = 0.0011. Clearly, this is dramatically smaller than the intended α of 0.05, and with such a small realized α the chance of detecting anything but very large allele frequency differences is minor. Chi-square summation. As noted above, the logic of the summation approach is based on the fact that the sum of two or more χ2 distributed variables will also follow a χ2 distribution with d.f. equal to the summed d.f. of the component variables. Under the null hypothesis, the test statistic computed for each particular locus (contingency table) is expected to be asymptotically χ2 distributed with d.f. = (r – 1)(c – 1) where r is the number of samples (rows) and c is the number of alleles (columns). The fit may be quite poor, however, particularly for small sample sizes and skewed allele frequencies, and the sum of several such MEC1345.fm Page 2368 Friday, September 21, 2001 3:36 PM 2368 N . RY M A N and P. E . J O R D E Fig. 2 Exact cumulative frequency of occurrence of possible P-values when drawing two independent samples of n = 20, 50, or 100 diploids (40, 100 or 1000 genes) from a population where the true allele frequency at a di-allelic locus is 0.1. The null hypothesis of allele frequency homogeneity is tested using regular chi-square (X2) and Fisher’s exact test. Only the left-most part (P ≤ 0.05) of the distribution is shown. Table 2 Exact realized α error when applying the Bonferroni method to a series of k 2 × 2 tables. Each table represents two independent samples of n diploid individuals (2n genes) from a population where the true gene frequency at di-allelic loci is 0.10, and where the H0 of no gene frequency difference has been tested by Fisher’s exact method. Intended α = 0.05. Frequency of occurrence represents the cumulative binomial probability of obtaining the realized P-value or a smaller one. See text for details n = 10 n = 20 k = no. of tested loci Intended P-value for rejection Realized P-value for rejecting a separate 2 × 2 table Frequency of occurrence 1 2 5 10 20 50 0.05 0.025 0.01 0.005 0.0025 0.001 0.04837 0.02484 0.00953 0.00385 0.00220 0.00077 0.012002 0.003010 0.000622 0.000107 0.000015 0.000002 Realized α Realized P-value for rejecting a separate 2 × 2 table Frequency of occurrence Realized α 0.012002 0.006010 0.003107 0.001065 0.000304 0.000090 0.04817 0.02470 0.00931 0.00483 0.00240 0.00096 0.016770 0.005559 0.002022 0.000678 0.000514 0.000056 0.016770 0.011087 0.010067 0.006761 0.010225 0.002802 © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2369 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2369 ‘poorly fitted’ variables may deviate dramatically from the expected χ2 distribution. The mean and variance of a χ2 distribution is d.f. and 2d.f, respectively, and as an example of the fit to χ2 Table 3 gives the observed mean and variance of the test statistics obtained when simulating the drawing of two independent samples from a population with an allele frequency of 0.1 (Q1 = Q2 = 0.1; n = 3 – 500). Both of X2 and X C2 have d.f. = 1, and if the fit is perfect we expect these statistics to yield means and variances of 1 and 2, respectively. With respect to the P-value from Fisher’s exact test (exact P) the quantity –2ln(exact P) should be asymptotically χ2 distributed with d.f. = 2 (see above), and with a perfect fit we expect the mean and variance to be 2 and 4, respectively. With respect to the traditional chi-square statistic (X2), the fit of the observed sampling distribution to the theoretical χ2 is quite good, except for a reduced variance and a slightly inflated mean at the smallest sample sizes. In contrast, the approach to the limiting χ2 distribution is markedly slower for X C2 and –2ln(exact P). The observed sampling distributions tend to be located to the left of the Mean expected one, and this shift in location produces too few false significances (realized α < 0.05), and thereby a low power. The mean and variance of X2 are fairly close to their expected values when n ≥ 50 individuals, but X C2 and –2ln(exact P) both yield markedly smaller means and variances even at n = 500. The shift of location relative to χ2 may not seem alarming when plotting the test statistic sample distribution, but a variable representing the sum of several observations from such a distribution may deviate dramatically from the one expected for the sum. To exemplify, the simulated and expected distributions of the 2 × 2 contingency X C2 (d.f. = 1) for n = 20 diploids and Q1 = Q2 = 0.1 are shown in Fig. 3(a). The deviation from the χ2 distribution with d.f. = 1 may not appear overly large, but when summing 10 observations and comparing with χ2 with d.f. = 10 the difference is dramatic (Fig. 3b). Clearly, in a situation like this when X C2 for 10 loci is being summed, the realized α error is far below the expected one, and the probability of rejecting H0,J is reduced correspondingly. In the case of 2 × 2 contingency tables we have focused on the X2, X C2, and –2ln(exact P) test statistics, and it is Variance Test statistic n Observed Expected Observed Expected X2 X2 X2 X2 X2 X2 X C2 X C2 X C2 X C2 X C2 X C2 –2ln(exact P) –2ln(exact P) –2ln(exact P) –2ln(exact P) –2ln(exact P) –2ln(exact P) G G G G G G Williams’ G Williams’ G Williams’ G Williams’ G Williams’ G Williams’ G 3 10 20 50 100 500 3 10 20 50 100 500 3 10 20 50 100 500 3 10 20 50 100 500 3 10 20 50 100 500 1.09 1.04 1.01 1.00 1.02 1.00 0.25 0.44 0.54 0.67 0.78 0.88 0.34 0.84 1.13 1.41 1.59 1.80 1.43 1.24 1.10 1.04 1.03 1.01 1.05 1.07 1.02 1.02 1.01 1.00 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 0.75 1.52 1.77 1.95 2.04 1.98 0.15 0.49 0.83 1.28 1.53 1.75 0.59 1.74 2.47 3.15 3.49 3.70 1.30 2.58 2.44 2.20 2.14 1.95 0.86 2.01 2.12 2.09 2.08 1.94 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 Table 3 Mean and variance of test statistics from 2 × 2 contingency tables when simulating the drawing of two independent samples of equal size (n = 3 – 500 diploids) from a population where the true allele frequency at a di-allelic locus is 0.1. The number of replicates (number of 2 × 2 tables) is 10 000 at each sample size. Expected values are those for a χ2 distribution with d.f. = 1 [or d.f. = 2 for –2ln(exactP) ]. See text for details MEC1345.fm Page 2370 Friday, September 21, 2001 3:36 PM 2370 N . RY M A N and P. E . J O R D E clear that the traditional chi-square test appears to provide the statistic (X2) that is to be preferred when combining information from multiple loci by summation (Fig. 1, Table 3). It is especially interesting to note that the fit of –2ln(exact P) is markedly poor when n is small, i.e. when the use of an exact test is typically regarded most warranted for comparisons of allele frequencies at each particular locus considered separately. As a comparison, Table 3 also gives simulated means and variances for G and Williams’ G for the case of Q1 = Q2 = 0.10. Here, Williams’ G performs as well as X2, and sometimes even better. In contrast to the other statistics, however, the sampling distribution of G seems to be shifted to the right of χ2 for many sample sizes (larger mean and variance than expected). The G-test therefore appears to produce an excess of false significances in the basic contingency tests (realized α > 0.05), and this tendency is expected to grow progressively stronger when testing H0,J through summation of G-values from multiple loci. two rows or columns (Everitt 1977), and when testing H0,J the realized α is therefore expected to be in better agreement with the intended one for both the Bonferroni and the ‘summation’ methods. As an example, Table 4 gives the results from simulated drawings of three samples (r = 3) from populations segregating for two and five alleles, respectively (c = 2 or 5), for the diploid sample sizes (n) of 10, 20, and 50. Under the conditions simulated the difference between the contingency test statistic sampling distribution and the expected χ2 is in most cases minimal for the largest sample size (n = 50). As a result, the ‘summation’ method for addressing H0,J generally provides a realized error rate that is reasonably close to the intended α at n = 50, although that of –2ln(exact P) is still a bit low in the 3 × 2 tests (Table 4). At all sample sizes the Bonferroni approach usually results in a realized α that is smaller than that obtained by summation. At the smaller sample sizes (10 and 20) the –2ln(exact P) statistic yields an α error that appears unduly small in the 3 × 2 tests, but in the 3 × 5 tables the error is close to the intended one. Most strikingly, however, summation of the G-statistic tends to produce unacceptably high rates of false significances (10 – 37%) at the smallest sample sizes, an effect of a markedly poor fit to the expected χ2 distribution. The traditional chi-square (X2) generally seems to behave in fairly good agreement with expectation, and the ‘summation’ method appears to perform better than Bonferroni over a wide range of sample sizes. Although the differences between test statistics and methods for evaluating H0,J are less pronounced than for the 2 × 2 tables, the tendencies are similar. Combining exact P-values by means of ‘Fisher’s method’ tends to result in an unduly small α error more frequently than when summing X2, particularly at small or moderate sample sizes, and it seems that ‘summation’ should be preferred before Bonferroni. Finally, it should be noted that the present results indicating generally better ‘summation properties’ of X2 relative to the other test statistics is not caused by choosing combinations of sample sizes (n) and allele frequencies (Q) that produce unduly few tables with low expectancy cells. Considering 2 × 2 contingency tables with n = 20 and Q = 0.1 (Table 3), for example, over 70% of the simulated tables have two (out of four) cells with an expectation less than five. Similarly, at n = 20 nearly 80% of the simulated 3 × 2 contingency tables (Table 4) have three cells with an expected value of less than five; almost all the 3 × 5 tables have 6–12 such cells and over 10% have 3 – 6 cells with an expectation of less than unity. General r × c contingency tables Concluding remarks and recommendations The fit of the test statistic sample distributions to χ2 is generally improved for contingency tables with more than It is obvious from the above that the choice of statistical method may be crucial for the probability of drawing Fig. 3 Simulated sampling distribution of the X2C test statistic (2 × 2 contingency chi-square with Yates’ correction) when drawing two independent samples of 20 diploids (40 genes) from a population where the true allele frequency at di-allelic loci is 0.1. The corresponding χ2 distributions are those expected asymptotically under large sample theory. (a) Examining a single locus; expected χ2 has d.f. = 1. (b) Testing 10 loci separately and summing the test statistic values; expected χ2 has d.f. = 10. © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2371 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2371 Table 4 Mean and variance of test statistics from 3 × c contingency tables when simulating the drawing of three samples of equal size (n = 10 – 50 diploids) from the same population. The number of alleles (c) is 2 and 5 occurring in the frequencies 0.1 and 0.9 (c = 2) and 0.7, 0.1, 0.1, 0.05, and 0.05 (c = 5). The number of replicates (number of 3 × c tables) at each sample size is 10 000 and 1000 for the 3 × 2 and 5 × 2 tables, respectively. Expected values are those for a chi-square distribution with d.f. = 2(c – 1) [d.f. = 2 for –2ln(exact P) ]. Realized α refers to a situation where the information for 10 loci is combined by means of the summation or the Bonferroni approach and the intended α is 0.05. See text for details Mean Realized α (10 loci) Variance c Test statistic d.f. n Obs. Exp. Obs. Exp. Summation Bonferroni 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 X2 X2 X2 –2ln(exact P) –2ln(exact P) –2ln(exact P) G G G Williams’ G Williams’ G Williams’ G X2 X2 X2 –2ln(exact P) –2ln(exact P) –2ln(exact P) G G G Williams’ G Williams’ G Williams’ G 2 2 2 2 2 2 2 2 2 2 2 2 8 8 8 2 2 2 8 8 8 8 8 8 10 20 50 10 20 50 10 20 50 10 20 50 10 20 50 10 20 50 10 20 50 10 20 50 2.01 2.02 2.00 1.26 1.62 1.83 2.38 2.17 2.04 2.09 2.05 1.99 8.19 8.14 7.78 1.96 2.04 1.86 9.82 9.21 8.07 8.11 8.38 7.79 2 2 2 2 2 2 2 2 2 2 2 2 8 8 8 2 2 2 8 8 8 8 8 8 3.25 3.71 3.94 2.70 3.60 3.79 4.77 4.86 4.26 3.82 4.30 4.07 10.96 14.33 13.90 3.65 4.21 3.26 15.06 19.86 15.66 10.65 16.53 14.58 4 4 4 4 4 4 4 4 4 4 4 4 16 16 16 4 4 4 16 16 16 16 16 16 0.043 0.049 0.052 0.001 0.019 0.029 0.150 0.100 0.068 0.072 0.066 0.057 0.030 0.060 0.040 0.040 0.070 0.050 0.370 0.240 0.060 0.030 0.070 0.060 0.027 0.035 0.047 0.013 0.025 0.037 0.052 0.079 0.064 0.018 0.061 0.056 0.000 0.049 0.020 0.020 0.068 0.020 0.049 0.086 0.020 0.000 0.058 0.020 the right conclusion when testing for genetic differentiation. To date, it appears that the primary statistical interest has been devoted to avoidance of false significances (α errors), whereas considerably less attention has been paid to the prospect of detecting true differences (power). Because the two quantities are interrelated, excessive focus on one of them may have undesirable effects on the other. It appears that the current trend in many fields of evolutionary biology is to score a steadily growing number of loci in a quite restricted number of individuals, frequently in the range, say, 5 – 30. The question of how to combine the information from multiple loci is therefore becoming increasingly significant. Our results indicate strongly that summation of chi-square (X2) tends to perform better than any of the alternatives examined, even at fairly small sample sizes. This approach should typically be the method of choice when testing the joint null hypothesis of no difference at any locus. The technique of summing twice the negative logarithm of P-values from Fisher’s exact test [Σ – 2ln(exact P) ] © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 appears to be used increasingly often. It should be stressed, though, that this approach (‘Fisher’s method’ applied to Fisher’s exact P) may be associated with a strikingly small power even when sample sizes are quite large, and particularly so for 2 × 2 tables. It may seem superficially appealing to base the joint test on probabilities that are computed ‘exactly’ for each contributing contingency table. Nevertheless, the poor fit to the asymptotically expected χ2 distribution frequently makes this method profusely conservative. Therefore, we suggest that results from summation of –2ln(exact P) should generally be accompanied by the results from chi-square summation. It should be stressed that we by no means suggest that exact contingency tests should be abandoned. Whenever possible, exact calculation of P is to be preferred before any approximation relying on large sample theory, and particularly so when sample sizes are modest or small. Although necessarily conservative, there is an obvious advantage with exact tests in that the investigator is guaranteed that the realized α will never exceed the MEC1345.fm Page 2372 Friday, September 21, 2001 3:36 PM 2372 N . RY M A N and P. E . J O R D E intended one. The problem we address arises when combining the information from several exact tests by means of an approximation such as ‘Fisher’s method’. There is no contingency test that is universally ‘best’ from every perspective and under all circumstances, and there are many situations where an investigator may have valid concern regarding the appropriateness of the chisquare statistic, which necessarily represents an approximation. With small expected values in one or more cells the risk of excessive rates of false significances cannot be ignored, although several reports suggest (as do the present simulation results) that the severity of this continuity problem may be overrated (e.g. Cochran 1954; Lewontin & Felsenstein 1965; Everitt 1977). Exact tests, on the other hand, may be overly conservative and fail to detect true differences more often than anticipated. When evaluating a single contingency table, however, exact calculation of the P-value should be the primary method of choice. It must be noted, though, that the observation of a nonsignificant P-value (P > 0.05) may be quite uninformative in the absence of minimal information on realized α and power of the test at the sample sizes at hand. When combining the information from several contingency tables, however, we recommend that the decision on overall genetic divergence is made on the basis of summation of chi-square (X2). The Bonferroni correction is also being applied more commonly as a tool for evaluating the joint null hypothesis of no difference at any locus, and its frequently poor performance in the present context may be perceived a bit surprising. It should be noted, though, that the Bonferroni correction was primarily designed to reduce the probability of obtaining false significances when performing several independent tests. It was not aimed at combining the information from multiple tests that address the same null hypothesis. The Bonferroni method focuses exclusively on the occurrence of (very) small P-values and largely ignores, for example, tendencies of weak significances to be overrepresented. When testing for genetic heterogeneity under a ‘selective neutrality — genetic drift’ model, however, any indication of allele frequency differences at any polymorphic locus (regardless of direction) should ideally contribute information that makes the joint null hypothesis less likely. The ‘summation’ method is directly aimed at picking up such tendencies, and the difference between the two approaches becomes particularly obvious when the underlying test statistic distributions are characterized by marked discontinuities. As with exact tests, we do not suggest that the Bonferroni method should be avoided in general. The Bonferroni (or the sequential Bonferroni) approach represents a most valuable tool for controlling the α error when an investigator, after conducting multiple tests, focuses on a particular null hypothesis (Rice 1989). Rather, because of the markedly low power we recommend against its ‘extended’ use when testing the joint null hypothesis (H0,J) that all the component H0s are true. This paper is focused on contingency tests for allele frequency heterogeneity, but problems with statistical power similar to those discussed here may also occur in other testing situations. The Bonferroni approach, for example, is often applied for evaluation of multiple P-values obtained when testing for Hardy–Weinberg proportions or linkage equilibrium between pairs of loci. For the Bonferroni method to work properly in such cases, it is also necessary that the contributing tests can produce P-values as small as α/k and that those P-values occur at frequency reasonably close to α/k when the null hypothesis is true. For instance, if it is impossible in practice to obtain P < α/k in many or most of the contributing tests, then the power of the joint test (the Bonferroni evaluation) may be very close to zero. In such a situation, any attempt to interpret an observed lack of significance in biological terms would typically be meaningless and potentially erroneous. Acknowledgements We thank Linda Laikre, Ole Christian Lingjærde, Stefan Palm, Associate Editor Laurent Excoffier, and two anonymous reviewers for comments on earlier versions of this paper. The study was supported by grants to N.R. from the Swedish Natural Science Research Council and from the Swedish research program on Sustainable Costal Zone Management, SUCOZOMA, funded by the Foundation for Strategic Environmental Research, MISTRA. P.E.J. was supported by a grant from the National Research Council of Norway. References Cochran WG (1954) Some methods for strengthening the common χ2 test. Biometrics, 10, 417–451. CYTEL Software Corporation (1992) StatXact-Turbo; statistical software for exact nonparametric inference. CYTEL Software Corporation, Cambridge, MA. Everitt BS (1977) The Analysis of Contingency Tables. Chapman & Hall, London. Fisher RA (1950) Statistical Methods for Research Workers. 11th edn. Oliver and Boy, London. Jorde PE, Ryman N (1996) Demographic genetics of brown trout (Salmo trutta) and estimation of effective population size from temporal change of allele frequencies. Genetics, 143, 1369–1381. Legendre P, Legendre L (1998) Numerical Ecology. 2nd edn. Elsevier, Amsterdam. Lewontin RC, Felsenstein J (1965) The robustness of homogeneity tests in 2xN tables. Biometrika, 36, 117 – 129. Mehta CR, Patel NR (1983) A network algorithm for performing Fisher’s exact test in r × c contingency tables. Journal of the American Statistical Association, 78, 427–434. Raymond M, Rousset F (1995a) An exact test for population differentiation. Evolution, 49, 1280–1283. © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361– 2373 MEC1345.fm Page 2373 Friday, September 21, 2001 3:36 PM D E T E C T I N G G E N E T I C D I F F E R E N T I AT I O N 2373 Raymond M, Rousset F (1995b) genepop (version 1.2): a population genetics software for exact tests and ecumenicism. Journal of Heredity, 86, 248 – 249. Rice WR (1989) Analyzing tables of statistical tests. Evolution, 43, 223 – 225. Roff DA, Bentzen P (1989) The statistical analysis of mitochondrial DNA polymorphisms: X2 and the problem of small samples. Molecular Biology and Evolution, 6, 539–545. Rohlf FJ (1987) BIOM. A Package of Statistical Programs to Accompany the Text of Biometry. Applied Biostatistics, Inc., New York. Sokal RR, Rohlf FJ (1981) Biometry. 2nd edn. W.H. Freeman, San Francisco, CA. StatSoft Inc. (1998) STATISTICA for Windows. StatSoft, Inc., Tulsa, OK. © 2001 Blackwell Science Ltd, Molecular Ecology, 10, 2361–2373 Zar JH (1984) Biostatistical analysis. 2nd edn. Prentice Hall, Inc., Englewood Cliffs, New Jersey. Nils Ryman is a professor of genetics and heads the Division of Population Genetics at the Stockholm University. His research has focused primarily on the genetic structure of natural populations, the genetic effects of human exploitation of such populations, and related conservation genetics issues. Per Erik Jorde graduated in genetics in Stockholm and is presently a postdoctoral fellow at the University of Oslo, working on microsatellite DNA analyses of fishes.