Statistical Science 2003, Vol. 18, No. 2, 241–255 © Institute of Mathematical Statistics, 2003 Bootstrapping Phylogenetic Trees: Theory and Methods Susan Holmes Abstract. This is a survey of the use of the bootstrap in the area of systematic and evolutionary biology. I present the current usage by biologists of the bootstrap as a tool both for making inferences and for evaluating robustness, and propose a framework for thinking about these problems in terms of mathematical statistics. Key words and phrases: Bootstrap, phylogenetic trees, confidence regions, nonpositive curvature. are interested in the origins of strains of HIV, tuberculosis, influenza and other fast evolving bacteria and viruses. The data in Table 1 came from a public database of HIV sequences available at LANL (2002). 1. AN INTRODUCTION TO SYSTEMATICS The objects of study in systematics are binary rooted semilabeled trees that link species or families by their coancestral relationships. For example, Figure 1 shows a tree with seven strains of HIV. Two leaf vertices (taxa) that share a parent in the tree are supposed to be descended from the same ancestor. The ancestors, and indeed the whole tree, have to be inferred in the absence of relevant fossil data. Today, the data used to build the trees are aligned DNA or protein sequences. These are usually represented as matrices of letters, where the rows are labeled by species and the columns represent positions in the genomic sequence; many of the letters in a column are the same (see Table 1). Trees have been used in court cases and environmental surveillance. Some examples of current applications include: Phylogenetic trees are the main object of publication in many biology journals such as Systematic Biology, Molecular Biology and Evolution, Molecular Phylogenetics and Evolution, Trends in Research in Ecology and Evolution, Journal of Molecular Evolution, Evolution, Molecular Ecology and American Journal of Botany. Although several methods for tree estimation (or inferring trees) are currently available, very little inferential theory is available for quantifying uncertainty for these trees. The most widely used tool for inference is a version of the bootstrap introduced by Felsenstein (1983). Figure 1 shows bootstrap values along the edges of the tree. These are generally required for publication of a tree estimate in the same way clinical trials require publication of p values. The confidence statements made about such trees are my main focus. Biologists have also begun to adopt Bayesian methods based on Markov chain Monte Carlo computations that use parametric evolutionary models (Li, Pearl and Doss, 2000; Mau, Newton and Larget, 1999; Yang and Rannala, 1997), but I will not discuss these methods in depth here. A recent issue of Systematic Biology (Volume 51, Number 5, October 2002) had many interesting articles on parametric Bayesian approaches. The tree in Figure 1 shows bootstrap values at the inner nodes; for example, 93 means that the species CONS-CPZ and CONS O4 were siblings in 93% of • The whale watch team builds phylogenetic trees from their own databases to classify whale meat they sample from Japanese fish markets (Baker and Palumbi, 1994; Baker, Lento, Cipriano and Palumbi, 2000). Over the last eight years they have found, among others species, blue whale, humpback, minke whales, beaked whales and dolphins. They report regularly to the International Whaling Commission (Lento et al., 1998). • Immunologists, microbiologists and epidemiologists Susan Holmes is Associate Professor, Department of Statistics, Stanford University, Stanford, California 94305, detached from the Biometry Department of INRA, Montpellier, France (e-mail: susan@stat. stanford.edu). 241 242 S. HOLMES +----CONS-CPZ +---93 +--100 +----CONS O4 ! ! +---49 +---------CONS N2 ! ! +--100 +--------------CONS B34 ! ! ! ! +----CONS A10 ! +-------------55 ! +----CONS A23 ! +------------------------CONS C16 F IG . 1. Tree with bootstrap values. the bootstrap replications; 49 means that the sequences CONS B34, CONS N2, CONS-CPZ and CONS O4 were grouped together in what is called a monophyletic (a group containing the most common ancestor of a given set of taxa and all the descendents of that most recent common ancestor) clade in 49% of the bootstrap replications. The method of bootstrapping is the multinomial nonparametric bootstrap as applied in the binomial setting. For each bootstrap simulation step, a new data matrix is accumulated by choosing columns from the original data matrix at random with replacement, and this is repeated until there are as many columns in the new matrix as were in the original data. In Section 2, I explain the data and the estimation problem from a statistical viewpoint. I analyze some of the ways in which biologists make use of bootstrap values in Section 3. Section 4 shows how some statistical concepts, such as sufficiency and correction terms, can be used in this problem. Section 5 rephrases the problem in geometrical terms and Section 6 gives some constructive suggestions on how bootstrapping can be used for trees. It also summarizes some of the caveats presented in this article. Some of the challenges involved in giving a statistical justification for the basic bootstrapping procedure are reviewed first. TABLE 1 Partial HIV amino acid sequences from LANL (2002) 1 2 3 4 5 6 7 CONS A10 CONS A23 CONS B34 CONS C16 CONS N2 CONS O4 CONS-CPZ KKEEEEALLT RREEEEALLT KKEEEEALLT KKEEEEALLT RREEEEALLT CCEEEEVLLT ??EEEEALLT GADTVVVLEE GADTVVVLEE GADTVVVLEE GADTVVVLEE GADTVVVIEE GADTVVVLNN GADTVVVIDD INLGGKKKPK INLGGKKKPK MNLGGRKKPK INLGGKKKPK ?QLGGKKKPK IQLGGKTTPK IQLGG?RRPK 2. FROM MOLECULAR DATA TO PHYLOGENETIC TREE Suppose an observed data set made on s species and n characters is given: the characters are DNA or amino acids. The sequences, one for each species, are aligned before this stage of analysis. For simplicity’s sake, the aligned sequences are considered to be of equal length, thus providing a matrix block, for which each column is often called a character. Table 1 is a little example taken from LANL (2002). The Markovian evolutionary model (Li, 1997) is a low-dimensional (one to six parameters) model for such data. The metric evolutionary tree (binary rooted tree with edge lengths) can also be considered a parameter in this model. If the metric tree is supposed known and the mutation rates are known, data can be generated by choosing a character from the stationary distribution of the transition matrix (suppose A were drawn). The ancestral column at the root would then be all A’s, but each row would have a different path to follow through the tree and changes would occur with probabilities proportional to the edge lengths. This simulation procedure is available through the SEQGEN program (Rambaut and Grassly, 1997). From the observed data, the maximum likelihood method attempts to estimate the tree by choosing the tree with the highest probability of occurring, given the data, either using prior estimates of the relevant mutation rates or using the data to estimate these parameters. Maximum parsimony is a nonparametric method that uses a different criterion: it ignores all evolutionary models and searches for the tree with the least number of mutations along its branches needed to explain the data. Distance-based methods are semiparametric methods that use the evolutionary model to estimate distances between sequences and then use methodology akin to hierarchical clustering to build the tree. I will not go into the details of how a tree is estimated from these data: book-long treatments exist (Felsenstein, 2003; Page and Holmes, 2000) and a survey for statisticians was presented by Holmes (1999). The parameter estimated from the data is a binary rooted tree with the labels (taxa: species or populations) at the leaves, with edge lengths (this is called the metric tree). If the method does not provide edge lengths, by default we set all the edge lengths to 1. Even ignoring edge lengths and considering only the branching order of the tree, the size of the space is huge. There are an exponentially large [there are (2n − 3) × (2n − 5) × · · · × 3 × 1 = (2n − 3)!! (Schröder, BOOTSTRAPPING PHYLOGENETIC TREES 243 1870) different ones] number of combinatorial tree forms. I denote the metric tree estimate by τ̂ , the true tree by τ and the space of metric trees T , sometimes with an index n to denote the number of leaves. After deciding which estimator to use, a natural followup question is, “How variable is the estimate?” Making a confidence statement about the parameter itself poses numerous quandaries where the classical paradigms, both Bayesian and frequentist, suffer from a lack of theory about these general non-Euclidean parameters. Possible questions are the following: aligned set of DNA or protein sequences is the assumption of independent, identically distributed columns. It is well known that these assumptions are violated; Fitch (1971a, b), for example, showed the existence of regions that covaried. Some regions are highly conserved (most of the columns are identical all the way down) and some columns are highly variable, so the columns are not identically distributed either. These departures from the i.i.d. situation have been modeled in many ways. Various suggestions for more believable models have been included in different procedures: • What is a sampling distribution in T ? • What is the natural notion of variability in T ? • Hidden Markov model for rate variation along the sequences as discussed by Yang (1994) and Felsenstein and Churchill (1996) (see Durbin, Eddy, Krogh and Mitchison, 1998, for a review). • A Gamma distribution model of rate variation along sequences was suggested by Yang (1994). • Fitch and Markowitz (1970), Lockhart et al. (1996) and Tuffley and Steel (1998) have built covarion models to deal with sites that vary together. • Changes in rate variation can be detected and modeled using hot spot or change point detection as, for instance, in Tang and Lewontin (1999). These questions would be solved naturally if it were known how to represent a probability distribution P and distance d in T . If the sampling distribution Pn of τ̂ were known, the bias and variance could be written EPn d(τ̂ , τ ), EPn d 2 (τ̂ , τ ). At this stage, a Bayesian could look at samples from a posterior distribution on trees. In a frequentist approach, the bootstrap usually comes in and provides an estimate for bias as EPn d(τ̂ ∗ , τ̂ ) and an estimate of variance as EPn d(τ̂ ∗ , τ̂ )2 . Because this theory is lacking, biologists have simplified their questions: the simplest one concerns the presence/absence of a certain monophyletic group c, called a clade. For instance, in Figure 1 the hypothesis to test is whether the clade c =(CONS A10, CONS A23) exists in the true tree τ . Felsenstein (1983) first introduced the use of the nonparametric bootstrap to assess what biologists call repeatability: the probability that another such sample shares the clade with the original sample. In statistical terms, this is denoted by Pn (c ∈ τ̂ |c ∈ τ̂0 ). Other biologists hoped that the bootstrap would provide estimates of what they called accuracy (Hillis and Bull, 1993), that is, P(c ∈ τ |c ∈ τ̂0 ). The two quantities are linked since the tighter the sampling distribution around τ , the more probable it is that the same clade appears again in a second sample, and the more probable it is that the clade from the estimated tree is a true clade. However, as in all statistical hypothesis testing, only P (data |H0 ) is available; an estimate of P (H0 | data) requires further information. 2.1 Nonidentically Distributed, Nonindependent Columns A first statistical reservation to the vanilla multinomial nonparametric bootstrap of the columns of an The block bootstrap (Künsch, 1989) as explained clearly by Efron and Tibshirani (1993, pages 98–102) can provide a nonparametric equivalent for the multinomial under dependence of the data. Parametric bootstrapping using models such as those from Felsenstein and Churchill (1996) is also available in SEQGEN (Rambaut and Grassly, 1997). In practice there seems to be a strong preference for nonparametric resampling. This may be due to the misguided intuition that using a model that injects more variability gives safer, more conservative answers. 2.2 Consequences of Dependency Unfortunately, dependency assumptions destroy the classical validity of the bootstrap. There are no clear theoretical bounds on the difference in bootstrap results between the i.i.d. and dependent cases in this context. Generally speaking, when the columns are dependent, there are fewer effectively independent components (using the block bootstrap makes this clearer). A smaller sample means that the estimates are less accurate, and the bootstrap is no exception. In classical statistical problems such as estimating the mean, the nonparametric bootstrap itself has errors on the order of 1/n. If the effective n is much smaller than the number of columns, the actual error may be very large, and the estimation error also may be exceed- 244 S. HOLMES ingly large as pointed out, for example, in Nei, Kumar and Takahashi (1998). Thus the asymptotic consistency results are even less relevant here than in a simple univariate problem. The parameter is of very high dimension and in the realm of Freedman and Peters (1984a, b), where the size of the parameter space is large compared to the amount of data. 3. WHAT IS THE BOOTSTRAP SUPPOSED TO TELL US? If a statistician is presented with a problem where the columns are observations to be resampled, the question that arises immediately is which distribution are the columns being sampled from? Certainly no simple random sampling is in effect, as was clearly pointed out by Sanderson (1995), so the classical paradigm by which the bootstrap is justified as proposing an approximation to the sampling distribution of the estimate is not in order here. However, since this method is so widely used by biologists, it is worth asking which questions it is actually answering, to better understand why they consider it an essential feature of phylogenetic methodology. 3.1 Stability or Reliability Here are some quotes from the systematics literature: Boostrapping measures how consistently the data support given taxon bipartitions (Hedges, 1992). This is not a test of how accurate your tree is; it only gives information about the stability of the tree topology (the branching order), and it helps assess whether the sequence data is adequate to validate the topology (Berry and Gascuel, 1996). F IG . 2. A partition of tree space. The bootstrap test which is a crude way of testing interior branches, is applicable to all tree-building methods and is easy to use. Although this test has been shown to be conservative under certain theoretical frameworks, a conservative test is preferable in real data analysis, because the evolution of actual DNA (or protein) sequences never follows any mathematical model available (Nei, Kumar and Takahashi, 1998). In fact, if a more attentive look at the actual way the bootstrap is used is taken, it is seen much more as a measure of robustness of the estimator with regard to small changes in the data. Could a small plausible perturbation of the data give a different result? The best way to think about this question is graphically: partition the parameter space of all possible binary trees into regions, each region corresponding to a different tree topology (I abbreviate “different tree topologies” to “different trees”). In Section 5 it is demonstrated that this can be justified mathematically. For the time being, Figures 2 and 3 are a sufficient schematization. The partitioning Bootstrap values are a measure of support of a given edge like the measure introduced by Bremer (1988) that asks how many more steps a parsimony tree must be for a given edge to disappear. High bootstrap values (close to 100%) mean uniform support, i.e., if the bootstrap value for a certain clade is close to 100%, nearly all of the characters informative for this group agree that it is a group (Berry and Gascuel, 1996). F IG . 3. Data (the black rectangle) being projected on three possible trees. 245 BOOTSTRAPPING PHYLOGENETIC TREES 4.60 Taxon 1 Taxon 2 Taxon 3 Taxon 4 AATAATCACACAAGTATATTGTTCTTTAAACCTTGCAAAGAACCCAATATCTACTTCTGA GGCAATTATGTAAGTATATTGTTATTTAAGCACTGCAGTGAACCCCGTCTCTACAGCTGA GGTGGCCTCGTAAGTACCTTGTTCTGTAGACATTGCAGATAACCCCGTATGTACATCTCA AACGATCACGTAAGTGTACTGTTCTTCGAACATAGAGGAGAAGACCGTATCTCCATCGGG F IG . 4. represented by Figure 2 differs from Efron, Halloran and Holmes (1996, Figure 3) in that we are not considering this to be a partition of the data space, but of the parameter space onto which the data are considered to be projected. If the data are quite far from being treelike, the data may be hesitating between several different trees. Figure 3 shows such a situation schematically. Making a small perturbation of the data could give very different trees if the estimation function near that data is not continuous, as pictured in the example shown in Figure 3. Suppose that the data are at the black rectangle, and the estimator τ̂ projects the data onto the closest point on one of the flaps. Then the estimate could oscillate between the tree flaps if the data are only slightly perturbed, thus giving a discontinuous estimator. (This can only be rigorously corrected for by making the estimate a mixture of trees; for instance, in this case ( 13 , 13 , 13 ) for the weights associated to the three components.) Here is an example of such a data set. The original DNA data are shown in Figure 4. The output from PHYLIP (see the Appendix) with the parsimony criterion gives the tree described in Figure 5. Bootstrap of the Example 2 data gives, with the DNAPARS parsimony program, the results presented in Figure 6. Sets included in the consensus tree (∗ means on one side, · on the other) Set (species in order) 4321 · ∗ ∗∗ · ∗ ∗· How many times (out of 1000) 1000.0 572.66 Sets NOT included in consensus tree: Sets How many times (species in order) (out of 1000) 4321 · · ∗∗ 215.16 · ∗ ·∗ 212.16 F IG . 6. One most parsimonious tree found: +--------Taxon ! --3 +--Taxon ! +--2 +--1 +--Taxon ! +-----Taxon 4 3 2 1 requires a total of 43.000 F IG . 5. If we make one change of nucleotide on the last letter of the second column from an A to a G, then the PHYLIP output becomes as shown in Figure 7. This second data set is an obvious case of the mixture of three trees, as described above, and the bootstrap detects it. Bootstrap of the second data set (with the changed nucleotide) yields the results presented in Figure 8. The bootstrap does detect the mixture, not quite with the right proportions, but indicates the presence of alternative choices which get lost when just one tree is chosen. This discontinuous behavior is also present when maximum likelihood estimation is used, since the log likelihoods can be very close, even though the trees are CONSENSUS TREE: The numbers at the forks indicate the number of times the group consisting of the species which are to the right of that fork occurred among the trees, out of 1000 trees +--------------Taxon ! ! +---------Taxon +1000.0 ! +----Taxon +572.7 +----Taxon 4 1 2 3 246 S. HOLMES 3 tree in all found: +--------Taxon --3 ! +-----Taxon +--1 ! +--Taxon +--2 +--Taxon 4 +--------Taxon ! --3 +--Taxon ! +--2 +--1 +--Taxon ! +-----Taxon 2 3 1 requires a total of 43 4 +--------Taxon --3 ! +-----Taxon +--2 ! +--Taxon +--1 +--Taxon 3 2 1 requires a total of 43 4 3 2 1 requires a total of 43 F IG . 7. not close in tree space. Consequently, the likelihood contours are also discontinuous. This is responsible for the problem of islands such as those cited in Maddison (1991). The quality that biologists call stability or reliability is what statisticians call robustness. The question addressed here is, “Do the perturbed data all project into the same region?” In more precise terms, it would also be comforting to know which columns produce these discontinuities. PHYLIP (see the Appendix) has such a tool in DNACOMP that uses a compatibility estimation to give an indication as to whether each column agrees with the final tree or not. This is at least a first possible level of analysis of residuals or a search for points with high leverage as in regression. The output in Table 2 shows that only the third and fourth columns disagree with the final tree. As can be seen in Figure 2, some of the boundaries between tree regions are curved. In addition, points such as point 3 border more than two regions. For both of these reasons the simple bootstrap can be applied as a perturbation tool to assess the stability (in the sense Sets included in the consensus tree (∗ means on one side, · on the other) Set (species in order) 4321 · ∗ ∗∗ · ∗ ∗· How many times (out of 1000) 1000.0 336.83 Sets NOT included in consensus tree: Set How many times (species in order) (out of 1000) 4321 · · ∗∗ 336.33 · ∗ ·∗ 326.83 F IG . 8. TABLE 2 Output from DNACOMP for the above data 0123456789012345678901234567890123456789 ∗---------0 ! YYNNYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY 40 !YYYYYYYYYYYYYYYYYYYYY of continuity, a small perturbation in the data that produces only a small perturbation in the estimate) of the estimator. We can consider that we are exploring the neighborhood of the estimator by a simulation experiment as suggested by Efron and Tibshirani (1998). We can also try to describe the regions and their neighborhoods mathematically; this requires some extra work, as we will see in Section 5. To quantify robustness, a notion of distance in parameter space is necessary; this will also be provided in Section 5. The more precise study of influence functions (Huber, 1996; Hampel, Ronchetti, Rousseeuw and Stahel, 1986) is not available here because a notion of derivative in T is unavailable. CONSENSUS TREE: The numbers at the forks indicate the number of times the group consisting of the species which are to the right of that fork occurred among the trees, out of 1000 trees +--------------Taxon ! ! +----Taxon ! +336.8 +1000.0 +----Taxon ! +---------Taxon 4 1 2 3 BOOTSTRAPPING PHYLOGENETIC TREES 3.2 Statistical Insight for the Simplest Case: One Clade As pointed out by Efron, Halloran and Holmes (1996), the bootstrap compares the bootstrapped values to the original estimate, not to the truth. The essential bootstrap identity is that the bootstrapped values are to the original estimate what the original estimate is to the true parameter. Some parameter has to be chosen to evaluate the sampling distribution; often in the case of real-valued parameters this may be the variance of the estimate. If this variance is small, then we can conclude that the parameter has a high probability of being close to the original estimator, so a notion of variability would be very useful. From a statistical point of view the simplest possible context in which to study the bootrapping for trees is in the statistical test of the presence of a certain monophyletic group c. Suppose that we have decided ahead of time that we want to approximate the probability that an observed clade c is in fact in the true tree τ . A binary parameter θc can be defined that takes on the value 1 if the clade c is present in the true tree and 0 if not. Then the question posed by Hillis and Bull (1993) about the accuracy of the method [P (θc = 1|θ̂c = 1)] can be answered through the parametric bootstrap by generating data with many different known trees τ and using the SEQGEN program of Rambaut and Grassly (1997) to generate data sets with τ as the generating process. Then compute for each the bootstrap samples P (θ̂c = 1|θ̂c∗ = 1). We can thus compare the accuracies and the estimated bootstrap estimates: what happens depends on τ , its inner and pendant edge lengths (pendant edges are the edges of degree 1), its topology and the estimation method used. Even though we are only interested in whether or not a clade is present, we have to consider the number of possible alternative trees, with and without the clade, because this influences the bootstrap’s validity (Zharkikh and Li, 1995). In fact, it is as if we were trying to estimate only one component pj of a multinomial parameter (p1 , p2 , . . . , p2(n−3)!! ); we still have to know how many pi ’s are big enough to compete with pj . Rodrigo (1993), Efron, Halloran and Holmes (1996) and Zharkikh and Li (1995) suggested postprocessing the data to recalibrate by taking into account the number of neighbors and the curvature of the boundaries. Li and Zharkikh (1995) coined the phrase “number of effective neighbors” and called this number K. This accounts both for the number of neighboring regions and 247 for the probability assigned to the regions. There are some trees (in particular balanced trees when estimating with parsimony) that are combinatorially neighbors, but that have such light densities they can be neglected. Zharkikh and Li (1995) proposed to compute this number by simulation in a procedure called the complete and partial bootstrap. Efron, Halloran and Holmes (1996) “looked out” on the regions neighboring the boundary of the estimated tree by first searching for the boundaries using the bootstrap samples that have trees different from the estimated one and then using a binary search to find data that are as close as possible to the boundary. The borderline data are then used to generate more bootstrap resamples that will be as different as possible and provide an empirical profile of the neighborhood. This is then used to recalibrate the original bootstrap values. Hillis and Bull (1993) and Zharkikh and Li (1995) reported that the bootstrap estimates of repeatability were biased. Efron, Halloran and Holmes (1996), Zharkikh and Li (1995) and Newton (1996) clarified some of the reasons for this apparent bias, due to several facts that we will revisit: • Directly comparing the truth to the bootstrap samples. • Not accounting for the other edge lengths in the tree that change the number of possible neighbors and thus the baseline. • Not accounting for the curvature of the boundary between the regions that define different trees. 4. HOW CAN WE EFFECTIVELY SUMMARIZE BOOTSTRAP DATA? Most biologists agree that the simplest possible probability distribution on tree space—the uniform distribution—is not relevant; a slightly more realistic one is the Yule process (see Aldous, 2001). Building a probability distribution on trees is a complex procedure. Further, choosing optimal trees in a model cannot, in general, be decomposed into simpler problems. This is the essence of what constitutes computationally intractable problems. The estimation of both the maximum likelihood tree and the parsimony tree have been proven to be intractable. Now let us come back to actually trying to summarize a bootstrap resampling distribution on trees— a Bayesian posterior distribution or a distribution that could be used to build a frequentist confidence region. Classically, sufficient statistics arise from a model. We can also go the other way, following the statistical 248 S. HOLMES mechanics paradigm of deciding which features are relevant for a tree, call these S1 (τ ), S2 (τ ), . . . , Sk (τ ), and then forming the exponential family Z −1 · exp( θi Si (τ )) based on these, where Z is the partition constant that makes this a probability measure. See Lauritzen (1988) for background. For this to give an effective model, the distribution must be well characterized by a few summaries, the sufficient statistics. One example is the exponential family model P (τ ) = L exp(−λd(τ, τ0 )) defined by Billera, Holmes and Vogtmann (2001) in analogy to Mallows’ model. The sufficient statistic is k d(τi , τ0 ) = Sk i=1 for a collection of k trees and a central tree τ0 . So the estimation depends only on the distances between trees d. This reduces the data to one number Sk . Of course, without the assumption of the symmetrical distribution, sufficient statistics are much more complex. Note that the bootstrap distribution is often summarized by just the frequencies of edges or clades as in Figure 1. These numbers do not constitute sufficient statistics for the complete bootstrap distribution. This has been implicitly understood by several authors who work on trees. For instance, Penny and Hendy (personal communication) have proposed a nearest neighbor (NN) bootstrap which counts how many times an edge or a neighboring split occurs (for an example of its use, see Cooper and Penny, 1997). We see later that a suitable geometrical enhancement of the mathematical picture of tree space makes such a NN bootstrap natural. 4.1 Multiple Testing Showing all the bootstrap values on the tree simultaneously as in Figure 1 makes for an easy misinterpretation. Users may believe that these values can be used together. If they are considered as ersatz p values, this carries the flavor of multiple testing without correction and so should be avoided. When more than one edge is of interest, a multidimensional approach involving a certain amount of nonstandard geometry is preferable, as we see in Section 5. 4.2 Clade Frequencies as a First Level Approximation Considering just the clade frequencies as a first order approximation can be justified by considering a decomposition of a set of trees X by a Fourier type analysis in tree space. (This Fourier analysis is not the same as that proposed by Hendy and Penny, 1993 and Hendy, Penny and Steel, 1994.) Bootstrap clade frequencies just count the binomial counts of presence/absence of a given clade in a set B of trees obtained, for instance, by bootstrap simulation. The set of these trees can be considered as a function from the set of all trees into the integers. To each tree we associate its frequency of appearance in the set. Diaconis and Holmes (1998, 2002) showed that the space of all combinatorial trees on n leaves Tn can be represented as the quotient of the symmetric group on 2(n − 1) by the subgroup B2(n−1) that leaves the pairs {(1, 2), (3, 4), (5, 6), . . . , (2n − 3, 2n − 2)} invariant. This is called the matching representation of trees, where the tree is replaced by all its sibling pairs, including the inner nodes. Suppose we are trying to describe a set of 1000 bootstrap trees. This is a function from tree space to R, where each tree is associated to the number of times it occurs among the 1000 trees (we allow a fractional number of trees to be counted, because sometimes the output from the bootstrap functions can be fractional). The decomposition of functions on tree space was given by Diaconis and Holmes (1998). It is a direct sum decomposition of all functions on tree space, L(Tn ) = S 2λ , λ(n−1) where the sum is over all partitions λ of n − 1 (i.e., all vectors of integers that sum to n − 1), 2λ = (2λ1 , 2λ2 , . . . , 2λk ) and S 2λ is the associated irreducible representation of the symmetric group S2(n−1) . The first few terms in the decomposition can be interpreted as follows: • For λ = (n − 1), S 2λ counts the number of trees in the data set. • For λ = (n − 2, 1), S 2λ counts the number of times each particular sibling occurs. • For λ = (n − 3, 1, 1), S 2λ counts the number of times the sibling pair (i, j ) occurs at the same time as the pairs (k, l). • For λ = (n − 3, 2), S 2λ counts the number of times the sibling pair (i, j, k, l) occurs as a clade. It is in this sense that the sibling pair frequencies are the first order approximation to the complete distribution on trees. It would be useful to be able to say what proportion of the information contained in the data set can be reconstructed just by the sibling pair counts. This Fourier type decomposition closely follows the 249 BOOTSTRAPPING PHYLOGENETIC TREES analysis of ranking data provided by Diaconis (1989). To the other extreme is the idea that we could keep all the data and make useful multidimensional summaries of it. 5. GEOMETRICAL REPRESENTATION Trees are high-dimensional parameters, even if they do not lie naturally in a Euclidean space, so the best frequentist confidence statements that can be made about them rely on the notion of a confidence region Rα defined by statements of the form P(τ ∈ Rα ) = 1 − α (Tukey, 1975). Green (1981) suggested the use of successively peeled convex hulls as ersatz confidence regions. Take, for instance, the convex hull of all the parameter estimates in a high-dimensional space. Then omit the points on the boundary and construct the next hull, which may contain 90% of the data, for instance, so that it is a nonparametric 90% convex envelope. This envelope can then be used to test whether a certain tree is in the envelope or not. To give a more rigorous discussion of these continuity–sensitivity issues, it would be necessary to define both relevant probability measures and satisfactory metrics on Tn . Distances can also be useful for summarizing the bootstrap sampling distribution. In this section, we fill in the set of combinatorial trees, which is a discrete set, making the space of trees a continuous space with a meaningful distance that is always defined. This work was presented in its mathematical technicality by Billera, Holmes and Vogtmann (2001). We start by explaining intuitively what we would like our geometrical representation to provide. Our goal is to give a rigorous definition of the boundaries depicted in Figure 2. Every tree exists as a point in its own region. The boundaries between regions represent an area of uncertainty about the exact branching order. In biological terminology, this is called an unresolved tree. Two neighboring regions represent neighboring trees. The notion of neighboring is nearest neighbor interchange, the rotation distance in Tn . This defines two trees as neighbors if you can get from one to the other by contracting an edge to have length zero and then reexpanding this vertex of degree 4 so that it has degree 3 again. In Figure 9, we see one such change. This seems to be the most widely accepted neighborhood relationship in biology, although other distances could also be used to define a metric tree space in the same way. The natural way to vary F IG . 9. tree. Two tree regions separated by a boundary—a degenerate closeness to the boundary or unresolved tree is to make the edge lengths e decrease linearly in the direction of the boundary. There are three rooted binary semilabeled trees on three leaves. We arrange them along three half lines that meet at the origin which represents the star tree. For this article, to make our geometrical space slightly simpler, we restrict ourselves to trees with finite branch lengths. By standardizing the combinational trees to all have edge lengths of 1, we can build the space of trees with n leaves as a cube complex Tn , where the cubes are all of dimension n − 2. For rooted binary trees with four leaves, we have a set of squares pasted together by two edges each. Each square in Figure 10 corresponds F IG . 10. Three neighboring quadrants. 250 S. HOLMES to a different branching order and the position within the square is determined by the coordinates, each representing one of the two inner edge lengths. Note that the boundary is shared by two other trees. The pendant branch lengths do not appear in this geometrical representation. (To obtain a complete coordinate system of binary semilabeled trees, one would have to take the product of Tn with Rn .) All quadrants have to have the star tree as one of their corners, so that particular point will have 15 neighboring quadrants. This generalizes and explains why, at the star tree, the origin of our space, there are exponentially many cubes attached. On the other hand, a degenerate tree with only one nonzero edge is represented as a point on the segment boundary to three quadrants; thus its neighborhood contains three “flaps.” In fact, if we have a four leafed tree, but are sure what the outgroup is, the relevant space is the space of rooted trees on three leaves. This embedding, shown geometrically in Figure 11, is important if we consider the problem of finding all the trees with a given edge as in Section 2. It is the cube complex [0, 1] × Tn−1 embedded in Tn . This is important to consider when we talk about the boundary region between the trees that have the edge c and those that do not. As we saw in Section 3, Zharkikh and Li (1995) did a simulation study to find how many trees neighbor a given tree. This has consequences for the quality of the bootstrap estimate as also was pointed out in Efron, Halloran and Holmes (1996). The geometrical picture allows us to just count the number of neighbors. We can see that for a tree on four leaves, there can be either no neighbors except trees with the same F IG . 12. One tree in the neighborhood. branching pattern as in Figure 12 or three neighboring combinatorially different trees as in Figure 13 or 15 neighboring trees (if all the edges are small and we are close to the star tree). Of course, for a tree with two inner edges, this is the only possible way to have these two edges small. This same notion of neighborhood containing 15 different branching orders applies to all trees on as many leaves as necessary, but which have two contiguous “small edges” and all the other inner edges significantly larger than 0. This picture of tree space frees us from having to use simulations to find out how many different trees are in a neighborhood of a given radius r around a given tree. All we have to do is check how many contiguous edges in the tree are smaller than r. Say there is just one set of small contiguous edges of size nr . Then the neighborhood contains (2nr − 3)!! = (2nr − 3) × (2nr − 5) × · · · × 3 × 1 different types of trees. Thus a point very close to the F IG . 11. Embedding of T3 in T4 . F IG . 13. A tree with two neighbors. BOOTSTRAPPING PHYLOGENETIC TREES 251 6. SUMMARY: HOW SHOULD ONE BOOTSTRAP? F IG . 14. Cone path and geodesic path in T4 . star tree at the origin will have an exponential number of neighbors. Billera, Holmes and Vogtmann (2001) have shown that the geodesic distance induced by this filling of tree space always exists. There is always a path going through the star tree to go from one tree to another. Sometimes this is the shortest path, thus the distance; sometimes there is a shorter path as can be seen in Figure 14. The last element necessary to make a rigorous picture of tree space is the probability measure. We can define such a measure in the parametric mutation model of maximum likelihood estimation of trees. Figure 15 presents a picture of the likelihood contours in the four leaf case for the Markovian evolution model with two parameters. Many other uses of tree space were described by Billera, Holmes and Vogtmann (2001). F IG . 15. Likelihood contours. First the simulation methods involved in resampling should follow the biological knowledge as closely as possible. Block bootstrapping (Künsch, 1989) with blocks that have size on the same order as the size of dependent blocks as estimated from real data can be used. Alternatively, simulations can follow covarion models (Fitch and Markowitz, 1970) as in Lockhart et al. (1996) and Tuffley and Steel (1998). These lean toward models where the observations—columns of the sequence matrix—are not identically distributed, but depend on a covariable. In their studies, the covariate was binary, that is, could be either on or off. This is also modeled by “hot spots” along the sequences (see Tang and Lewontin, 1999) and Gamma rate variations for different sites as in Yang (1994). Any of these models can be bootstrapped using a parametric bootstrap. It is more coherent to use the same model at the resampling stage as at the estimation stage. Thus the parametric bootstrap as implemented in SEQGEN by Rambaut and Grassly (1997) is coherent when doing a maximum likelihood estimation, and even allows inclusion of rate heterogeneity models such as Yang (1994) and Felsenstein and Churchill (1996). A justification for using the multinomial bootstrap may be that the mutation model itself is being tested. This leads to confusing conclusions because the alternative is not explicitly defined. Can we switch paradigms as we deem fit? Starting with a parametric model for evolution, such as the one parameter Jukes–Cantor model, does it make sense, after a tree has been estimated, to switch to a different paradigm at the validation stage and use a nonparametric bootstrap to compute confidence levels on the tree? This is an open problem. Some statisticians often switch paradigms in the middle of their studies from data analytic to parametric to nonparametric. Usually what actually can be proved is that if the parametric model is correct, there is a loss in power (sensu statisticae strictu) when switching to a nonparametric procedure. As for the mixture between Bayesian and frequentist, empirical Bayes (Robbins, 1980, 1983, 1985) is a typical example of the loose boundaries that exist when choosing different paradigms at different stages of an analysis. Having settled on the process for generating the new data, one can think about the statistic to bootstrap. The simplest case where only one clade is of interest can be handled by recalibrating Felsenstein’s (1983) repeatability index as was done by Efron, Halloran 252 S. HOLMES far from being treelike. Another question that could be answered is, “Is the convex hull elongated along a certain direction?” 7. FUTURE RESEARCH It would be meaningful to consider bootstrapping the whole procedure, alignment and treebuilding included as, for instance, in Gong (1986). In the statistical literature, the theorems that justify the use of the bootstrap usually state that the distribution of the distance between the true parameter and the estimate can be well approximated by the distribution between the estimate and the bootstrap resampled estimate—something that can be summarized as Distribution(d(τ, τ̂ )) ≈ Distribution(d(τ̂ , τ̂ ∗ )). F IG . 16. Bootstrap histogram of d(τ̂ ∗ , τ̂ ). and Holmes (1996) (for which a PYTHON / R / PHYLIP program is available from this author) or Zharkikh and Li (1995). However, this recalibration does not apply to more extensive use of the bootstrap. In particular, if several edges are of interest, multiple testing problems loom and no correction methods are readily available. The geometric perspective outlined in Section 5 does provide two alternatives. We can use the distance d and look at the distribution of distances, as for instance in the histogram of Figure 16. This gives the average distance d̄(τ̂ ∗ , τ̂ ) = 0.167 and a 95th percentile of 0.023. Such numbers allow the use of the bootstrap to test various hypotheses. In the maximum likelihood context, it is possible to construct likelihood contours (Ramsay, 1978) such as those in Figure 15 for the bootstrap resampling distribution following Hall (1987). Another method now available through the geometric method is to use convex hulls. These exist in Tn because the space has nonpositive curvature as proved in Billera, Holmes and Vogtmann (2001). Thus methodology based on hull peeling as in Liu, Parelius and Singh (1999) could provide a 95% convex hull and one could answer questions such as, “Does the star tree belong to the 95% convex hull?” If it does, the data are probably very However, most of the theoretical work involves an assumption of independent, identically distributed variables and some strong assumptions on the properties of the distance. No actual theory in the phylogenetic context exists at present, although referring to theoretical arguments in other cases does provide useful insight into the most sensible simulations to undertake. Considering the resampling procedure as a means for making a small plausible perturbation of the original data and looking at how much the tree changes, or more precisely whether or not a clade disappears, is in fact a way to understand the estimator’s continuity (in the mathematical sense) by simulation and provides quite a bit of information. There has also been phylogenetic work on ideas very similar to Breiman (1996) in statistical learning theory. Berry and Gascuel (1996) used this idea to find more robust trees by taking consensii of bootstrapped data. The method for making consensii is quite crude (majority rule consensus chooses all the edges that have more than 50% support). A more refined method could use a geometric consensus using the distance defined in Section 5 and minimizing either d(τi , τ ) 2 or d (τi , τ ). The biggest change in statistics as practiced by statisticians over the last 30 years has been a decrease in the use of p values. Tukey and his co-workers in exploratory data analysis have shown the importance of keeping as much of the data in mind as possible. A picture is worth a thousand words, and geometry has much to offer in the complex multidimensional 253 BOOTSTRAPPING PHYLOGENETIC TREES analysis of DNA sequences through trees, graphs and their assorted confidence regions. Some open problems for mathematically inclined colleagues include the following: • Can we give quantitative bounds on how much the new assumptions involved in more realistic models of sequence distribution change the bootstrap distributions of the distances? • How can we represent confidence regions and neighborhoods of trees graphically? • Can we extend some of the work on bootstrapping trees to networks? This would be useful for analyzing regulatory networks, but also would enable us to test whether data are actually treelike. APPENDIX Some noncommercial programs for constructing, evaluating and visualizing phylogenetic trees. A very complete list, including the commercial packages not mentioned here, can be found at http://evolution. genetics.washington.edu/phylip/software.html. (Author: Joe Felsenstein.) Available on almost all machines as source and executables, its different programs allow computations of parsimony, distance matrix methods, maximum likelihood and other methods on a variety of types of data, including DNA, RNA and protein sequences. It implements the nonparametric bootstrap and consensus programs used in the examples here. Website: http://evolution.genetics.washington.edu. TREE - PUZZLE (Authors: Heiko A. Schmidt, Korbinian Strimmer, Martin Vingron and Arndt von Haeseler.) Maximum likelihood phylogenetic analysis using quartets and parallel computing. Website: http:// www.tree-puzzle.de/. MR . BAYES (Authors: John Huelsenbeck and Fredrik Ronquist.) This program implements a parametric Bayesian method for finding trees by Monte Carlo Markov chains that provide both a Bayesian estimate and the associated posterior distribution. Website: http:// morphbank.ebc.uu.se/mrbayes/. LVB (Author: Daniel Barker.) Contruction of phylogenies using simulated annealing. Website: http://sapc34.rdg.ac.uk/lvb/. SEQGEN (Authors: Andrew Rambaut and Nick Grassly.) Parametric bootstrap for generating nucleotide sequences from a given phylogeny or mixture of phylogenies. As well as the more classical parametric PHYLIP models, they allow rate heterogeneity among sites. Website: http: // evolve.zoo.ox.ac.uk / software / SeqGen/Seq-Gen.html. TREEVIEW (Author: Rod Page.) For visualizing trees on a PC or Mac, or with Linux/Unix with TREEVIEW X. Website: http:// taxonomy.zoology.gla.ac.uk/rod/treeview.html. APE (Analyses of phylogenetics and evolution) (Authors: Emmanuel Paradis, Korbinian Strimmer, Julien Claude, Yvonnick Noel and Ben Bolker.) APE is an R package that provides functions for reading, writing, plotting and manipulating phylogenetic trees, but not estimating them. It allows analyses of comparative data in a phylogenetic framework. Website: http://cran.r-project.org/src/contrib/ PACKAGES.html#ape. PAL (Authors: Alexei Drummond, Ed Buckler and Korbinian Strimmer.) A collection of Java classes for use in molecular phylogenetics that enables maximum likelihood, neighbor joining and least squares analysis. Website: http://www.cebl.auckland.ac.nz/pal-project/. NONA (Author: Pablo Goloboff.) Computes maximum parsimony and consensus trees on Windows machines only. Website: http://www. cladistics.com/about_nona.htm. ACKNOWLEDGMENTS I thank George Casella for giving me the opportunity to write this survey. Marc Coram and Persi Diaconis patiently read several versions of this text. Brad Efron and Warren Ewens had several helpful discussions on the bootstrap. My 2002 summer research students, Martina Koeva, Aaron Staple and Henry Towsner, all helped with computational aspects of this project. This research was partly funded by Grant NSF DMS-00-72569. REFERENCES A LDOUS , D. (2001). Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statist. Sci. 16 23–34. BAKER , C., L ENTO , G., C IPRIANO , F. and PALUMBI , S. (2000). Predicted decline of protected whales based on molecular genetic monitoring of Japanese and Korean markets. Proc. Roy. Soc. London Ser. B 267 1191–1199. BAKER , C. and PALUMBI , S. (1994). Which whales are hunted? A molecular genetic approach to monitoring whaling. Science 265 1538–1539. B ERRY, V. and G ASCUEL , O. (1996). On the interpretation of bootstrap trees: Appropriate threshold of clade selection and induced gain. Molecular Biology and Evolution 13 999–1011. 254 S. HOLMES B ILLERA , L., H OLMES , S. and VOGTMANN , K. (2001). Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27 733–767. B REIMAN , L. (1996). Bagging predictors. Machine Learning 24 123–140. B REMER , K. (1988). The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42 795–803. C OOPER , A. and P ENNY, D. (1997). Mass survival of birds across the cretaceous–tertiary boundary: Molecular evidence. Science 275 1109–1113. D IACONIS , P. (1989). A generalization of spectral analysis with application to ranked data. Ann. Statist. 17 949–979. D IACONIS , P. and H OLMES , S. (1998). Matchings and phylogenetic trees. Proc. Nat. Acad. Sci. U.S.A. 95 14,600–14,602. D IACONIS , P. and H OLMES , S. (2002). Random walks on trees and matchings. Electronic Journal of Probability 7 17 pages. D URBIN , R., E DDY, S., K ROGH , A. and M ITCHISON , G. (1998). Biological Sequence Analysis. Cambridge Univ. Press. E FRON , B. and T IBSHIRANI , R. (1993). An Introduction to the Bootstrap. Chapman and Hall, London. E FRON , B. and T IBSHIRANI , R. (1998). The problem of regions. Ann. Statist. 26 1687–1718. E FRON , B., H ALLORAN , E. and H OLMES , S. (1996). Bootstrap confidence levels for phylogenetic trees. Proc. Nat. Acad. Sci. U.S.A. 93 13,429–13,434. F ELSENSTEIN , J. (1983). Statistical inference of phylogenies (with discussion). J. Roy. Statist. Soc. Ser. A 146 246–272. F ELSENSTEIN , J. (2003). Inferring Phylogenies. Sinauer, Boston. F ELSENSTEIN , J. and C HURCHILL , G. A. (1996). A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13 93–104. F ITCH , W. (1971a). The nonidentity of invariable positions in the cytochromes c of different species. Biochemical Genetics 5 231–241. F ITCH , W. (1971b). Rate of change of concomitantly variable codons. Journal of Molecular Evolution 1 84–96. F ITCH , W. M. and M ARKOWITZ , E. (1970). An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochemical Genetics 4 579–593. F REEDMAN , D. A. and P ETERS , S. C. (1984a). Bootstrapping a regression equation: Some empirical results. J. Amer. Statist. Assoc. 79 97–106. F REEDMAN , D. A. and P ETERS , S. C. (1984b). Bootstrapping an econometric model: Some empirical results. J. Bus. Econom. Statist. 2 150–158. G ONG , G. (1986). Cross-validation, the jackknife, and the bootstrap: Excess error estimation in forward logistic regression. J. Amer. Statist. Assoc. 81 108–113. G REEN , P. J. (1981). Peeling bivariate data. In Interpreting Multivariate Data (V. Barnett, ed.) 3–19. Wiley, New York. H ALL , P. (1987). On the bootstrap and likelihood-based confidence regions. Biometrika 74 481–493. H AMPEL , F. R., RONCHETTI , E. M., ROUSSEEUW, P. J. and S TAHEL , W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York. H EDGES , S. B. (1992). The number of replications needed for accurate estimation of the bootstrap p-value in phylogenetic studies. Molecular Biology and Evolution 9 366–369. H ENDY, M. D. and P ENNY, D. (1993). Spectral analysis of phylogenetic data. J. Classification 10 5–23. H ENDY, M. D., P ENNY, D. and S TEEL , M. A. (1994). A discrete Fourier analysis for evolutionary trees. Proc. Nat. Acad. Sci. U.S.A. 91 3339–3343. H ILLIS , D. M. and B ULL , J. J. (1993). An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic Biology 42 182–192. H OLMES , S. (1999). Phylogenies: An overview. In Statistics in Genetics (M. E. Halloran and S. Geisser, eds.) 81–119. Springer, New York. H UBER , P. J. (1996). Robust Statistical Procedures, 2nd ed. SIAM, Philadelphia. K ÜNSCH , H. R. (1989). The jackknife and the bootstrap for general stationary observations. Ann. Statist. 17 1217–1241. LANL (2002). HIV database. Available at URL: http://hivweb.lanl.gov/content/hiv-db/mainpage.html. L AURITZEN , S. L. (1988). Extremal Families and Systems of Sufficient Statistics. Lecture Notes in Statist. 49. Springer, Berlin. L ENTO , G. M., C IPRIANO , F., PATENAUDE , N. J., PALUMBI , S. R. and BAKER , C. S. (1998). Taking stock of minke whale in the North Pacific: The origins of products for sale in Japan and Korea. Technical Report SC/50/RMP15, Scientific Committee, International Whaling Commission. L I , S., P EARL , D. K. and D OSS , H. (2000). Phylogenetic tree construction using Markov chain Monte Carlo. J. Amer. Statist. Assoc. 95 493–508. L I , W. H. (1997). Molecular Evolution. Sinauer, Boston. L I , W. H. and Z HARKIKH , A. (1995). Statistical tests of DNA phylogenies. Systematic Biology 44 49–63. L IU , R. Y., PARELIUS , J. M. and S INGH , K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion). Ann. Statist. 77 783–858. L OCKHART, P. J., L ARKUM , A. W. D., S TEEL , M. A., WADDELL , P. J. and P ENNY, D. (1996). Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Nat. Acad. Sci. U.S.A. 93 1930– 1934. M ADDISON , D. (1991). The discovery and importance of multiple islands of most parsimonious trees. Systematic Zoology 40 315–328. M AU , B., N EWTON , M. A. and L ARGET, B. (1999). Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55 1–12. N EI , M., K UMAR , S. and TAKAHASHI , K. (1998). The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Nat. Acad. Sci. U.S.A. 95 12,390–12,397. N EWTON , M. A. (1996). Bootstrapping phylogenies: Large deviations and dispersion effects. Biometrika 83 315–328. PAGE , R. and H OLMES , E. (2000). Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford, UK. R AMBAUT, A. and G RASSLY, N. C. (1997). Seq-gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computational Applied Bioscience 13 235–238. R AMSAY, J. O. (1978). Confidence regions for multidimensional scaling analysis. Psychometrika 43 145–160. BOOTSTRAPPING PHYLOGENETIC TREES ROBBINS , H. (1980). An empirical Bayes estimation problem. Proc. Nat. Acad. Sci. U.S.A. 77 6,988–6,989. ROBBINS , H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist. 11 713–723. ROBBINS , H. (1985). Linear empirical Bayes estimation of means and variances. Proc. Nat. Acad. Sci. U.S.A. 82 1571–1574. RODRIGO , A. G. (1993). Calibrating the bootstrap test of monophyly. International Journal for Parasitology 23 507–514. S ANDERSON , M. J. (1995). Objections to bootstrapping phylogenies: A critique. Systematic Biology 44 299–320. S CHRÖDER , E. (1870). Vier combinatorische Probleme. Zeitschrift für Mathematik und Physik 15 361–376. TANG , H. and L EWONTIN , R. (1999). Locating regions of differential variability in DNA and protein sequences. Genetics 153 485–495. 255 T UFFLEY, C. and S TEEL , M. (1998). Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147 63–91. T UKEY, J. (1975). Mathematics and the picturing of data. In Proc. International Congress of Mathematicians (R. D. James, ed.) 523–531. Vancouver. YANG , Z. (1994). Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Molecular Evolution 39 306–314. YANG , Z. and R ANNALA , B. (1997). Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Molecular Biology and Evolution 14 717–724. Z HARKIKH , A. and L I , W.-H. (1995). Estimation of confidence in phylogeny: The complete-and-partial bootstrap technique. Molecular Phylogenetics and Evolution 4 44–63.