ACCELERATED LIKELIHOOD SURFACE EXPLORATION: THE ‘LIKELIHOOD RATCHET’
R.A. VOS
Department of Biological Sciences,
Simon Fraser University,
8888 University Drive, Burnaby, B.C., Canada, V5A 1S6;
(rvosa@sfu.ca)
Abstract: I propose an algorithm that is expected to significantly increase the probability of finding the maximum
likelihood tree for large numbers of taxa. Until now, fairly simplistic heuristic algorithms have been used to explore
complex tree landscapes. My approach extends the concept of the ‘parsimony ratchet’ (Nixon, 1999) to the likelihood
framework. Tests of concept have been encouraging.
Introduction: The computationally intensive nature of tree search, a well-known NP-complete problem (number of possible
solutions = (2n-3)!! for rooted bifurcating trees, where n = number of taxa), is compounded by the fact that calculating
any single tree’s score in the likelihood framework can take considerable time under complex models of sequence
evolution. Therefore, heuristic search algorithms are necessary. Typically, such searches combine global and local
optimization routines. For instance, one may start by approximating the globally optimal solution through stepwise
addition, and then employ a rearrangement (branch swapping) algorithm to improve on this locally. Rearrangements
are accepted when the fit is improved. Although some novel search algorithms under maximum likelihood allow for
non-significant decreases in tree score (Salter and Pearl, 2001), the usual modus operandi is that only increases in
tree score are allowed: the only way is up. If none of the possible rearrangements from a given tree improves upon
the result, the search terminates.
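To give a sense of the scale behind the (2n-3)!! figure, the following minimal Python sketch evaluates the double factorial for a given number of taxa. The function name num_rooted_trees is a hypothetical choice made for this illustration, and the count assumes rooted, bifurcating, labeled trees.

```python
def num_rooted_trees(n_taxa: int) -> int:
    """Return (2n-3)!!, the number of rooted bifurcating trees on n_taxa labeled taxa."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # (2n-3)!! = 1 * 3 * 5 * ... * (2n-3); the loop is empty when n_taxa == 2.
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    # Report only the order of magnitude for a few data set sizes.
    for n in (10, 50, 246):
        print(n, "taxa:", len(str(num_rooted_trees(n))), "digits")
```

Even at modest n the count is far too large for exhaustive enumeration, which is why the heuristics discussed here are needed.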
Motivation: During my Master’s, I constructed the largest supertree for the Primates (n=246) to date (Vos, in
prep.). Where pure heuristic searches failed, the ‘parsimony ratchet’ (Nixon, 1999) succeeded. This led to an
investigation of the characteristics of tree landscapes for very large trees (Vos & Mooers, in prep.) and the present
proposal.
Rearrangement algorithms work under the assumption that tree scores are distributed in clusters over tree space
when tree space is represented as a network with closely related tree shapes in each other’s vicinity (Hendy et al.,
1988). This is both the strength and the weakness of rearrangement algorithms: for hill-climbing strategies to be
guaranteed to find the global optimum, the optimality landscape must be unimodal, such that any local optimum is
also the global one. This condition is often not met (Maddison, 1991). The resulting local optima (“tree
islands”: sets of locally optimal interconnected trees of equal length that differ in one rearrangement) may lead to
misleading results during heuristic searches (e.g. see Sumrall et al., 2001). Efforts have been made to avoid getting
stuck on them, usually by employing novel tree searching strategies (Moilanen, 2001; Nixon, 1999; Quicke et al.,
2001; Charleston, 2001; Goloboff, 1999; Ota and Li, 2000; Ota and Li, 2001).
Perturbing the tree landscape: Some novel tree searching strategies rely on iterative perturbations of the tree
landscape in order to escape from local optima (Nixon, 1999; Quicke et al., 2001). For example, after reweighting a
random sample drawn from the data set, a tree island may no longer be locally optimal and the search may continue
uphill. After reaching a new optimum, the algorithm reverts to the initial weighting scheme and the search continues,
hopefully out of the reach of the original local optimum. The advantage of this is that reweighted runs preserve some
of the original phylogenetic signal that is contained in the data rather than losing it entirely, as is the case when
random starting trees are used. Such reweighting strategies have been implemented for phylogenetic inference under
Maximum Parsimony (Nixon, 1999; Quicke et al., 2001). Here, I propose the first method that extends the concept
to the likelihood framework.
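As an illustration of what reweighting means in a likelihood setting, the minimal Python sketch below draws a random subset of sites, upweights them, and combines per-site log-likelihoods using those weights. Weighting a site by w is treated as equivalent to observing it w times, so its log-likelihood is multiplied by w. The function names, the 20% sampling fraction and the weight of 2 are assumptions made for this example and are not taken from the cited methods.

```python
import random


def perturb_weights(n_sites, fraction=0.2, upweight=2, rng=None):
    """Return one weight per site, with a random subset upweighted (illustrative defaults)."""
    rng = rng or random.Random()
    weights = [1] * n_sites
    # Sample at least one site, but never more than exist.
    k = min(n_sites, max(1, round(fraction * n_sites)))
    for site in rng.sample(range(n_sites), k):
        weights[site] = upweight
    return weights


def weighted_log_likelihood(site_log_likelihoods, weights):
    """Sum per-site log-likelihoods, counting each site as many times as its weight."""
    return sum(w * lnl for w, lnl in zip(weights, site_log_likelihoods))


if __name__ == "__main__":
    site_lnl = [-3.2, -1.1, -4.5, -2.0, -0.7]  # made-up per-site values for one tree
    w = perturb_weights(len(site_lnl), rng=random.Random(42))
    print(w, weighted_log_likelihood(site_lnl, w))
```

Restoring the original landscape amounts to resetting every weight to 1.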
PROPOSED RESEARCH
I propose to develop a tree searching algorithm that extends the concept of the ‘Parsimony Ratchet’ (Nixon,
1999) to phylogenetic inference under Maximum Likelihood: the ‘Likelihood Ratchet’. The steps of the algorithm
are outlined in Fig 1.
1. In the first step, an algorithmic tree (e.g. neighbor-joining) is constructed.
2. Using this tree, the model of sequence evolution that maximizes the likelihood of the data under
study is estimated.
3. The tree from step 1 is used as a starting tree in a standard heuristic branch-swapping routine (e.g.
SPR or TBR). The search continues until it converges on an optimum. Alternatively, a time limit or
a maximum number of rearrangements can be specified after which point the search terminates and
the best solution is stored.
4. A random sample is drawn from the data set. This sample can be reweighted or, alternatively,
jackknifed. This step changes the tree landscape and allows the search to move away from the
optimum of step 3. In addition, a simpler, faster model of sequence evolution is chosen (e.g. JC69).
The rationale here is that ML models of sequence evolution are known to be robust and that this
step in the algorithm merely tries to escape from an optimum without losing too much phylogenetic
information.
5. A heuristic search on the reweighted data set is started. This search continues until it converges on
an optimum or until a set time or rearrangement limit is reached.
6. Using the best solution from step 5, the search reverts to the original weighting scheme and best fit
model as estimated in step 2. From this point, the search starts at step 3 again. Steps 3 through 6 are
repeated until a predefined number of iterations is reached. After each iteration the best solution is
stored.
7. When the predefined number of iterations is reached, the optimal solution(s) from among all
iterations are selected.
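The loop in steps 3 through 7 can be sketched in Python on a toy landscape. This is only an illustration under stated assumptions, not the proposed implementation: integer positions 0 to 7 stand in for tree topologies, adjacent positions stand in for single rearrangements, the SITE_SCORES table is invented, and the upweighting factor of 4 is arbitrary. Steps 1 and 2 (starting tree and model estimation) and the model switch in step 4 are not modeled.

```python
import random

# Invented per-site scores for each position (higher is better).
# Site 0 favors position 1; sites 1 and 2 carry a gradient toward position 6.
SITE_SCORES = [
    [0, 5, 1, 0, 0, 0, 1, 0],
    [0, 1, 2, 3, 4, 5, 6, 0],
    [0, 1, 2, 3, 4, 5, 6, 0],
]


def make_score(weights):
    """Build a scoring function that sums site scores multiplied by their weights."""
    def score(x):
        return sum(w * site[x] for w, site in zip(weights, SITE_SCORES))
    return score


def neighbors(x):
    """Positions reachable in one 'rearrangement' (one step left or right)."""
    return [y for y in (x - 1, x + 1) if 0 <= y < len(SITE_SCORES[0])]


def hill_climb(start, score, neighbors):
    """Move to the best neighbor until no neighbor strictly improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current
        current = best


def likelihood_ratchet(start, n_iterations, upweight=4, rng=None):
    rng = rng or random.Random()
    n_sites = len(SITE_SCORES)
    original = make_score([1] * n_sites)
    current = hill_climb(start, original, neighbors)       # step 3
    best = current
    for _ in range(n_iterations):
        weights = [1] * n_sites                            # step 4: perturb
        weights[rng.randrange(n_sites)] = upweight
        escaped = hill_climb(current, make_score(weights), neighbors)  # step 5
        current = hill_climb(escaped, original, neighbors)  # step 6, then step 3
        if original(current) > original(best):             # keep the best so far
            best = current
    return best                                            # step 7


if __name__ == "__main__":
    print("plain hill climb:", hill_climb(0, make_score([1, 1, 1]), neighbors))
    print("ratchet:", likelihood_ratchet(0, 20, rng=random.Random(0)))
```

On this landscape a plain hill climb from position 0 stops at the local optimum at position 1. Upweighting either gradient site makes position 2 look better than position 1, so the perturbed search can walk away from it, and the final pass under the original weights confirms whether the new position is a genuine improvement.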
A preliminary test on a 50-taxon mitochondrial data set returned an ML tree using this method in a much
shorter time span than it takes to arrive at the same, globally optimal, solution using a branch-and-bound search. In
addition, the result was significantly better than the result of heuristic searches using the random starting tree
approach when given the same amount of time. These results suggest that the consistency and robustness of maximum
likelihood can be applied to larger data sets in reasonable time using this approach. Extensive additional tests need
to be done in order to optimize and characterize this strategy and to contrast it with recent advances in the Bayesian
framework (SSB annual meeting, 2001).
SIGNIFICANCE
The Likelihood Ratchet algorithm can be readily applied using standard computer packages for phylogenetic
inference such as PAUP* (Swofford, 2001). The algorithm can significantly increase the speed with which likelihood
landscapes are explored. Hence, this algorithm can potentially be applied to larger data sets than have been feasible
up to this point.
SCHEDULE AND BUDGET JUSTIFICATION
REFERENCES
CHARLESTON, M. A. 2001. Hitch-hiking: a parallel heuristic search strategy, applied to the phylogeny problem. J.
Comput. Biol. 8:79-91.
FELSENSTEIN, J. 1981. Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach. J. Mol. Evol.
17:368-376.
GOLOBOFF, P. A. 1999. Analyzing large data sets in reasonable times: Solutions for composite optima. Cladistics
15:415-428.
HENDY, M. P., M. A. STEEL, D. PENNY, and I. M. HENDERSON. 1988. Families of trees and consensus. Pages 355-362
in Classification and related methods of data analysis (H. H. Bock, ed.) Elsevier, New York.
HILLIS, D. M., and C. MORITZ. 1990. Phylogeny Reconstruction. Sinauer Associates, Sunderland.
MADDISON, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Syst. Zool.
40:315-328.
MOILANEN, A. 2001. Simulated evolutionary optimization and local search: Introduction and application to tree
search. Cladistics 17:S12-S25.
NIXON, K. 1999. The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis. Cladistics 15:407-414.
OTA, S., and W. H. LI. 2000. NJML: A hybrid algorithm for the neighbor-joining and maximum-likelihood methods.
Mol. Biol. Evol. 17:1401-1409.
OTA, S., and W. H. LI. 2001. NJML+: An extension of the NJML method to handle protein sequence data and
computer software implementation. Mol. Biol. Evol. 18:1983-1992.
PAGE, R. D. M. 1993. On islands of trees and the efficacy of different methods of branch-swapping in finding most
parsimonious trees. Syst. Biol. 42.
QUICKE, D. L. J., J. TAYLOR, and A. PURVIS. 2001. Changing the landscape: A new strategy for estimating large
phylogenies. Syst. Biol. 50:60-66.
SALTER, L. A., and D. K. PEARL. 2001. Stochastic search strategy for estimation of maximum likelihood
phylogenetic trees. Syst. Biol. 50:7-17.
SUMRALL, C. D., C. A. BROCHU, and J. W. MERCK. 2001. Global lability, regional resolution, and majority-rule
consensus bias. Paleobiology 27:254-261.
SWOFFORD, D. L. 2001. PAUP*: phylogenetic analysis using parsimony, version 4.0b8.
TAKAHASHI, K., and M. NEI. 2000. Efficiencies of fast algorithms of phylogenetic inference under the criteria of
maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are
used. Mol. Biol. Evol. 17:1251-1258.
Fig 1. Flowchart diagram for the ‘Likelihood Ratchet’. See the text for details.
1. Construct a neighbor-joining tree
2. Estimate best fit model on NJ tree
3. Heuristically rearrange NJ tree shape until convergence
4. Resample the data and choose a different model
5. Heuristically rearrange tree shape until convergence
6. Restore the data and revert to best fit model
7. Store and select ML trees