ACCELERATED LIKELIHOOD SURFACE EXPLORATION: THE ‘LIKELIHOOD RATCHET’

R.A. VOS
Department of Biological Sciences, Simon Fraser University, 8888 University Drive, Burnaby, B.C., Canada, V5A 1S6; (rvosa@sfu.ca)

Abstract: I propose an algorithm expected to significantly increase the probability of finding the maximum likelihood tree for large numbers of taxa. Until now, fairly simplistic heuristic algorithms have been used to explore complex tree landscapes. My approach extends the concept of the ‘parsimony ratchet’ (Nixon, 1999) to the likelihood framework. Tests of concept have been encouraging.

Introduction: The computationally intensive nature of tree search, a well-known NP-complete problem (number of possible solutions = (2n-3)!! rooted bifurcating trees, where n = number of taxa), is compounded by the fact that calculating any single tree’s score in the likelihood framework can take considerable time under complex models of sequence evolution. Therefore, heuristic search algorithms are necessary. Typically, such searches consist of a mixture of global and local optimization routines. For instance, one may start by approximating the globally optimal solution through stepwise addition, and then employ a rearrangement (branch swapping) algorithm to improve on this locally. Rearrangements are accepted when the fit is improved. Although some novel search algorithms under maximum likelihood allow for non-significant decreases in tree score (Salter and Pearl, 2001), the usual modus operandi is that only increases in tree score are allowed: the only way is up. If none of the possible rearrangements of a given tree improves upon the result, the search terminates.

Motivation: During my Master’s, I constructed the largest supertree for the Primates (n=246) to date (Vos, in prep.). Where pure heuristic searches failed, the ‘parsimony ratchet’ (Nixon, 1999) succeeded. This led to an investigation of the characteristics of tree landscapes for very large trees (Vos & Mooers, in prep.)
and the present proposal.

Rearrangement algorithms work under the assumption that tree scores are distributed in clusters over tree space when tree space is represented as a network with closely related tree shapes in each other’s vicinity (Hendy et al., 1988). This is both the strength and the weakness of rearrangement algorithms: for hill-climbing strategies to be guaranteed to find the global optimum, the optimality landscape must be unimodal, such that any local optimum is also the global one. This condition is often not met (Maddison, 1991). The resulting local optima (“tree islands”: sets of locally optimal, interconnected trees of equal length that differ by one rearrangement) may lead to misleading results during heuristic searches (e.g. see Sumrall et al., 2001). Efforts have been made to avoid getting stuck on them, usually by employing novel tree searching strategies (Moilanen, 2001; Nixon, 1999; Quicke et al., 2001; Charleston, 2001; Goloboff, 1999; Ota and Li, 2000; Ota and Li, 2001).

Perturbing the tree landscape: Some novel tree searching strategies rely on iterative perturbations of the tree landscape in order to escape from local optima (Nixon, 1999; Quicke et al., 2001). For example, after a random sample drawn from the data set is reweighted, a tree island may no longer be locally optimal and the search may continue uphill. After reaching a new optimum, the algorithm reverts to the initial weighting scheme and the search continues, hopefully out of reach of the original local optimum. The advantage of this is that reweighted runs preserve some of the original phylogenetic signal contained in the data, rather than losing it entirely as is the case when random starting trees are used. Such reweighting strategies have been implemented for phylogenetic inference under Maximum Parsimony (Nixon, 1999; Quicke et al., 2001). Here, I propose the first method that extends the concept to the likelihood framework.
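The reweighting perturbation described above can be illustrated with a minimal Python sketch. It assumes that character weights enter the likelihood as multipliers on the per-site log-likelihoods, so that a weight of 2 counts a site twice and a weight of 0 removes it (a jackknife). The function names, the default sampling fraction of 0.25 and the weight values are hypothetical choices made for illustration only; they are not taken from Nixon (1999) or from any software package.

```python
import random


def perturb_weights(n_sites, fraction=0.25, upweight=2, seed=None):
    """Give a random sample of sites the weight `upweight`; all others keep weight 1.

    `fraction` and `upweight` are illustrative assumptions, not published settings.
    """
    rng = random.Random(seed)
    weights = [1] * n_sites
    sample_size = round(fraction * n_sites)
    for site in rng.sample(range(n_sites), sample_size):
        weights[site] = upweight
    return weights


def jackknife_weights(n_sites, fraction=0.25, seed=None):
    """Jackknife variant: the sampled sites are dropped (weight 0) instead of upweighted."""
    return perturb_weights(n_sites, fraction=fraction, upweight=0, seed=seed)


def weighted_log_likelihood(site_log_likelihoods, weights):
    """Tree score under a weighting scheme: the weighted sum of per-site log-likelihoods."""
    return sum(w * lnl for w, lnl in zip(weights, site_log_likelihoods))
```

Reverting to the initial weighting scheme then simply means scoring trees with all weights set back to 1.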
PROPOSED RESEARCH

I propose to develop a tree searching algorithm that extends the concept of the ‘Parsimony Ratchet’ (Nixon, 1999) to phylogenetic inference under Maximum Likelihood: the ‘Likelihood Ratchet’. The steps of the algorithm are outlined in Figure 1.

1. In the first step, an algorithmic tree (e.g. neighbor-joining) is constructed.
2. Using this tree, the model of sequence evolution that maximizes the likelihood of the data under study is estimated.
3. The tree from step 1 is used as a starting tree in a standard heuristic branch-swapping routine (e.g. SPR or TBR). The search continues until it converges on an optimum. Alternatively, a time limit or a maximum number of rearrangements can be specified, after which the search terminates and the best solution is stored.
4. A random sample is drawn from the data set. This sample can be reweighted or, alternatively, jackknifed. This step changes the tree landscape and allows the search to move away from the optimum of step 3. In addition, a simpler, faster model of sequence evolution is chosen (e.g. JC69). The rationale here is that ML models of sequence evolution are known to be robust, and that this step of the algorithm merely tries to escape from an optimum without losing too much phylogenetic information.
5. A heuristic search on the reweighted data set is started. This search continues until it converges on an optimum or until a set time or rearrangement limit is reached.
6. Using the best solution from step 5, the search reverts to the original weighting scheme and the best-fit model as estimated in step 2. From this point, the search starts at step 3 again. Steps 3 through 6 are repeated until a predefined number of iterations is reached. After each iteration the best solution is stored.
7. When the predefined number of iterations is reached, the optimal solution(s) from among all iterations is selected.
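The control flow of steps 3 through 7 can be sketched in Python as follows. This is a toy illustration under stated assumptions, not an implementation of the proposed method: the "trees" are integers on a line, the neighbors of a state stand in for branch-swapping rearrangements, and the score is a sum of made-up per-site terms. The switch to a simpler substitution model in step 4 is not modelled. All names (hill_climb, likelihood_ratchet, make_perturbed_score, demo) are hypothetical.

```python
import random


def hill_climb(state, neighbors, score):
    """Greedy search: move to the best neighbor until no neighbor improves the score."""
    current, current_score = state, score(state)
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


def likelihood_ratchet(start, neighbors, score, make_perturbed_score, iterations, seed=None):
    """Alternate searches on the original and on a perturbed landscape (steps 3 to 7)."""
    rng = random.Random(seed)
    best, best_score = hill_climb(start, neighbors, score)            # step 3
    current = best
    for _ in range(iterations):
        perturbed = make_perturbed_score(rng)                          # step 4
        escaped, _ = hill_climb(current, neighbors, perturbed)         # step 5
        current, current_score = hill_climb(escaped, neighbors, score) # step 6, back to 3
        if current_score > best_score:                                 # store the best solution
            best, best_score = current, current_score
    return best, best_score                                            # step 7


def demo(seed=1):
    """Toy landscape: 30 integer 'trees', 12 'sites' with random log-likelihood terms."""
    rng = random.Random(seed)
    n_states, n_sites = 30, 12
    site_lnl = [[-rng.random() for _ in range(n_states)] for _ in range(n_sites)]

    def neighbors(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < n_states]

    def score(i):
        return sum(site[i] for site in site_lnl)

    def make_perturbed_score(r):
        # Upweight a random sample of sites; the probability 0.25 is an arbitrary choice.
        weights = [2 if r.random() < 0.25 else 1 for _ in range(n_sites)]
        return lambda i: sum(w * site[i] for w, site in zip(weights, site_lnl))

    plain = hill_climb(0, neighbors, score)
    ratchet = likelihood_ratchet(0, neighbors, score, make_perturbed_score,
                                 iterations=20, seed=seed)
    return plain, ratchet
```

Because the ratchet starts with the same hill climb as the plain search and only ever replaces its stored solution with a better one, its final score can never be worse than that of a single hill climb from the same starting point. Whether it is actually better depends on the landscape and on the perturbation.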

A preliminary test on a 50-taxon mitochondrial data set returned an ML tree using this method in a much shorter time than it takes to arrive at the same, globally optimal, solution using a branch-and-bound search. In addition, the result was significantly better than the result of heuristic searches using the random starting tree approach when given the same amount of time. These results suggest that, with this approach, the consistency and robustness of maximum likelihood can be brought to bear on larger data sets in reasonable time. Extensive additional tests need to be done in order to optimize and characterize this strategy and to contrast it with recent advances in the Bayesian framework (SSB annual meeting, 2001).

SIGNIFICANCE

The Likelihood Ratchet algorithm can be readily applied using standard computer packages for phylogenetic inference such as PAUP* (Swofford, 2001). The algorithm can significantly increase the speed with which likelihood landscapes are explored. Hence, this algorithm can potentially be applied to larger data sets than were feasible up to this point.

SCHEDULE AND BUDGET JUSTIFICATION

REFERENCES

CHARLESTON, M. A. 2001. Hitch-hiking: a parallel heuristic search strategy, applied to the phylogeny problem. J. Comput. Biol. 8:79-91.
FELSENSTEIN, J. 1981. Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach. J. Mol. Evol. 17:368-376.
GOLOBOFF, P. A. 1999. Analyzing large data sets in reasonable times: Solutions for composite optima. Cladistics 15:415-428.
HENDY, M. P., M. A. STEEL, D. PENNY, and I. M. HENDERSON. 1988. Families of trees and consensus. Pages 355-362 in Classification and related methods of data analysis (H. H. Bock, ed.) Elsevier, New York.
HILLIS, D. M., and C. MORITZ. 1990. Phylogeny Reconstruction. Sinauer Associates, Sunderland.
MADDISON, D. R. 1991. The discovery and importance of multiple islands of most-parsimonious trees. Syst. Zool. 40:315-328.
MOILANEN, A. 2001.
Simulated evolutionary optimization and local search: Introduction and application to tree search. Cladistics 17:S12-S25.
NIXON, K. 1999. The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis. Cladistics 15:407-414.
OTA, S., and W. H. LI. 2000. NJML: A hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Mol. Biol. Evol. 17:1401-1409.
OTA, S., and W. H. LI. 2001. NJML+: An extension of the NJML method to handle protein sequence data and computer software implementation. Mol. Biol. Evol. 18:1983-1992.
PAGE, R. D. M. 1993. On islands of trees and the efficacy of different methods of branch-swapping in finding most parsimonious trees. Syst. Biol. 42.
QUICKE, D. L. J., J. TAYLOR, and A. PURVIS. 2001. Changing the landscape: A new strategy for estimating large phylogenies. Syst. Biol. 50:60-66.
SALTER, L. A., and D. K. PEARL. 2001. Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst. Biol. 50:7-17.
SUMRALL, C. D., C. A. BROCHU, and J. W. MERCK. 2001. Global lability, regional resolution, and majority-rule consensus bias. Paleobiology 27:254-261.
SWOFFORD, D. L. 2001. PAUP*: phylogenetic analysis using parsimony, version 4.0b8.
TAKAHASHI, K., and M. NEI. 2000. Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol. Biol. Evol. 17:1251-1258.

Fig. 1. Flowchart diagram for the ‘Likelihood Ratchet’. See the text for details. Boxes in the flowchart:
1. Construct a Neighbor-joining tree
2. Estimate best fit model on NJ tree
3. Heuristically rearrange NJ tree shape until convergence
4. Resample the data and choose a different model
5. Heuristically rearrange tree shape until convergence
6. Restore the data and revert to best fit model
7. Store and select ML trees