Algorithms overview for Berkeley ITR meeting

CIPRES:
Enabling Tree of Life Projects
Tandy Warnow
The University of Texas at Austin
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species
The “Tree of Life” is not
really a tree:
reticulate evolution
Cyber Infrastructure for Phylogenetic Research
Purpose: to create a national infrastructure of hardware,
open source software, database technology, etc.,
necessary to infer the Tree of Life.
Group: 40 biologists, computer scientists, and
mathematicians from 13 institutions.
Funding: $11.6 M (large ITR grant from NSF).
URL: http://www.phylo.org
CIPRES Members
University of New Mexico
Bernard Moret
David Bader
UCSD/SDSC
Fran Berman
Alex Borchers
Phil Bourne
John Huelsenbeck
Terri Liebowitz
Mark Miller
UT Austin
Tandy Warnow
David M. Hillis
Warren Hunt
Robert Jansen
Randy Linder
Lauren Meyers
Daniel Miranker
University of Arizona
David R. Maddison
University of Connecticut
Paul O Lewis
University of British Columbia
Wayne Maddison
University of Pennsylvania
Junhyong Kim
Susan Davidson
Sampath Kannan
Val Tannen
North Carolina State University
Spencer Muse
Texas A&M
Tiffani Williams
American Museum of Natural
History
Ward C. Wheeler
NJIT
Usman Roshan
UC Berkeley
Satish Rao
Steve Evans
Richard M Karp
Brent Mishler
Elchanan Mossel
Eugene W. Myers
Christos M. Papadimitriou
Stuart J. Russell
Rice
Luay Nakhleh
SUNY Buffalo
William Piel
Florida State University
David L. Swofford
Mark Holder
Yale
Michael Donoghue
Paul Turner
CIPRES activity
• Databases - e.g. TreeBase II (Bill Piel and others)
• Simulations of large-scale complex genome-scale
evolution (Junhyong Kim)
• Outreach (Michael Donoghue and Brent Mishler)
• Algorithms (Tandy Warnow)
• Open source software (Wayne Maddison, Dave
Swofford, Mark Holder, and Bernard Moret)
• Computer cluster at SDSC (Fran Berman and
Mark Miller) - available to ATOL projects and
other groups with datasets above 1000 taxa
DNA Sequence Evolution
[Figure: a tree of DNA sequences evolving by substitution from the ancestral sequence AAGACTT (-3 mil yrs), through intermediate sequences at -2 mil yrs and -1 mil yrs, to the sequences observed today.]
Phylogeny Problem
Input sequences:
U = AGGGCAT
V = TAGCCCA
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: a tree on the leaves U, V, W, X, Y.]
Complex Evolutionary Processes
• Gap events
• “Heterotachy” (violations of the rates-across-sites assumption)
• New types of data (e.g., whole genomes)
• Reticulate evolution (e.g., hybrid speciation
and horizontal gene transfer)
Challenges in reconstructing large and/or
complex evolutionary histories
• Previous simulation studies don’t necessarily
help us understand phylogenetic reconstruction
on large or complex datasets
• We need new statistical models, new theory,
and probably new methods.
• Reticulate evolution and whole genome
evolution in particular present many
interesting challenges for reconstruction.
CIPRES research in algorithms
• Multiple sequence alignment
• Genomic alignment
• Heuristics for Maximum Parsimony and Maximum
Likelihood
• Bayesian MCMC methods
• Supertree methods
• Whole genome phylogeny reconstruction
• Reticulate evolution detection and reconstruction
• Data mining on sets of trees, and compact representations
of these sets
Phylogenetic reconstruction methods
1. Heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) - hard to solve on large datasets
[Figure: cost over the space of phylogenetic trees, showing a local optimum and the global optimum.]
2. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. - poor accuracy on datasets with large evolutionary distances
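To make the Maximum Parsimony criterion concrete, the sketch below scores one fixed tree with Fitch's algorithm, which counts the minimum number of substitutions the tree requires. The hard part of MP is the search over all trees, which this sketch does not attempt. The function name, the nested-tuple tree encoding, and the example tree shape are illustrative assumptions, not taken from the slides; the sequences are the five from the Phylogeny Problem slide.

```python
def fitch_site(node, sequences, site):
    """Return (candidate states, substitutions) for one site below `node`."""
    if isinstance(node, str):  # a leaf is given by its name
        return {sequences[node][site]}, 0
    left_set, left_cost = fitch_site(node[0], sequences, site)
    right_set, right_cost = fitch_site(node[1], sequences, site)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed on this pair of edges.
    return left_set | right_set, left_cost + right_cost + 1


def fitch_score(tree, sequences):
    """Parsimony length of a rooted binary tree given as nested pairs."""
    n_sites = len(next(iter(sequences.values())))
    return sum(fitch_site(tree, sequences, s)[1] for s in range(n_sites))


# Sequences from the Phylogeny Problem slide; the tree shape is hypothetical.
seqs = {"U": "AGGGCAT", "V": "TAGCCCA", "W": "TAGACTT",
        "X": "TGCACAA", "Y": "TGCGCTT"}
example_tree = (("U", "V"), ("W", ("X", "Y")))
print(fitch_score(example_tree, seqs))
```

An MP heuristic would call a scoring routine like this on many candidate trees and keep the lowest-scoring one, which is where local optima become a problem.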
DCMs: Divide-and-conquer for
improving phylogeny reconstruction
“Boosting” phylogeny
reconstruction methods
• DCMs “boost” the performance of
phylogeny reconstruction methods.
Base method M -> DCM -> DCM-M
DCMs (Disk-Covering Methods)
• DCMs for polynomial time methods
improve topological accuracy (empirical
observation), and have provable theoretical
guarantees under Markov models of
evolution
• DCMs for hard optimization problems
reduce running time needed to achieve good
levels of accuracy (empirical observation)
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ.]
• DCM1-boosting makes distance-based methods more accurate
• Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences
Major challenge: MP and ML
• Maximum Parsimony (MP) and Maximum
Likelihood (ML) remain the methods of
choice for most systematists
• The main challenge here is to make it
possible to obtain good solutions to MP or
ML in reasonable time periods on large
datasets
Solving NP-hard problems
exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is $(2n-5)!!$
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, an exhaustive search for the best tree would take more than $10^{2846}$ millennia

#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         $2.2 \times 10^{20}$
100        $1.7 \times 10^{182}$
1000       $1.9 \times 10^{2860}$
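The double factorial $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$ can be evaluated exactly with Python's arbitrary-precision integers. The sketch below is only a way to check the growth rate by hand; the function names are illustrative.

```python
def num_unrooted_binary_trees(n_leaves):
    """Exact value of (2n-5)!! for n >= 3 leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (an empty product for n = 3).
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


def order_of_magnitude(value):
    """floor(log10(value)) for a positive integer, via its digit count."""
    return len(str(value)) - 1


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
for n in (20, 100, 1000):
    print(n, "about 10 to the power", order_of_magnitude(num_unrooted_binary_trees(n)))
```

The count is computed as a plain integer because the value for 1000 leaves is far beyond the range of a floating-point number.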
How good an MP analysis do we
need?
• Our research shows that we need to get
within 0.01% of optimal (or even better, on
large datasets) to return reasonable
estimates of the true tree’s “topology”
Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real
dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using
any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]
Observations
• The best MP heuristics cannot get
acceptably good solutions within 24 hours
on most of these large datasets.
• Datasets of these sizes may need months (or
years) of further analysis to reach
reasonable solutions.
• Apparent convergence can be misleading.
Our objective: speed up the best
MP heuristics
[Figure (fake study, for illustration): MP score of best trees versus time, contrasting the performance of a hill-climbing heuristic with the desired performance.]
DCM3 decomposition
Input: Set S of sequences, and guide-tree T
1. Compute short subtree graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T) and form subproblems
DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively
(4) can be applied recursively
Iterative-DCM3
T -> DCM3 -> Base method -> T'
New DCMs
• DCM3
1. Compute subproblems using DCM3 decomposition
2. Apply base method to each subproblem to yield subtrees
3. Merge subtrees using the Strict Consensus Merger technique
4. Randomly refine to make it binary
• Recursive-DCM3
• Iterative DCM3
1. Compute a DCM3 tree
2. Perform local search and go to step 1
• Recursive-Iterative DCM3
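The control flow of the iterative variant can be sketched as a loop in which each pass builds a DCM3 tree from the current guide tree and then applies local search. This is a minimal skeleton, not the CIPRES implementation: every callable passed in (`decompose`, `base_method`, `merge`, `refine`, `local_search`, `score`) is a hypothetical placeholder, and keeping the best tree seen so far is an assumption added for illustration.

```python
def dcm3_tree(guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass: decompose, solve subproblems, merge, refine to binary."""
    subproblems = decompose(guide_tree)                # DCM3 decomposition
    subtrees = [base_method(s) for s in subproblems]   # base method on each subproblem
    merged = merge(subtrees)                           # e.g. Strict Consensus Merger
    return refine(merged)                              # random refinement to a binary tree


def iterative_dcm3(start_tree, decompose, base_method, merge, refine,
                   local_search, score, iterations=10):
    """Repeat (DCM3 tree, local search), feeding each result back as the guide tree."""
    current = start_tree
    best = start_tree
    for _ in range(iterations):
        current = local_search(
            dcm3_tree(current, decompose, base_method, merge, refine))
        if score(current) < score(best):  # lower score is better, as with MP
            best = current
    return best
```

A recursive variant would apply the same decomposition again inside `base_method` whenever a subproblem is still too large for the base method.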
Rec-I-DCM3 significantly improves performance
[Figure: comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2). Curves: current best techniques; DCM boosted version of best techniques.]
Datasets
Obtained from various researchers and online databases
• 1322 lsu rRNA of all organisms
• 2000 Eukaryotic rRNA
• 2594 rbcL DNA
• 4583 Actinobacteria 16s rRNA
• 6590 ssu rRNA of all Eukaryotes
• 7180 three-domain rRNA
• 7322 Firmicutes bacteria 16s rRNA
• 8506 three-domain+2org rRNA
• 11361 ssu rRNA of all Bacteria
• 13921 Proteobacteria 16s rRNA
Rec-I-DCM3(TNT) vs. TNT
(Comparison of scores at 24 hours)
[Figure: TNT versus Rec-I-DCM3 on datasets 1 to 10. y-axis: average MP score above optimal at 24 hours, shown as a percentage of the optimal (0 to 0.1).]
Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by returning trees which are at most 0.01%
above optimal on most datasets.
Observations
• Rec-I-DCM3 improves upon the best
performing heuristics for MP.
• The improvement increases with the
difficulty of the dataset.
DCMs
• DCM for NJ and other distance methods produces
absolute fast converging (afc) methods
• DCMs for MP heuristics
• DCMs for use with the GRAPPA software for
whole genome phylogenetic analysis; these have
been shown to let GRAPPA scale from its
maximum of about 15-20 genomes to 1000
genomes.
• Current projects: DCM development for
maximum likelihood and multiple sequence
alignment.
Part II: Whole-Genome
Phylogenetics
[Figure: tree diagram with leaves labeled A to F and internal nodes W, X, Y, Z.]
Genomes Evolve by Rearrangements
Original gene order: 1 2 3 4 5 6 7 8 9 10
• Inversion (Reversal): 1 2 3 –8 –7 –6 –5 –4 9 10
• Transposition: 1 2 3 9 4 5 6 7 8 10
• Inverted Transposition: 1 2 3 9 –8 –7 –6 –5 –4 10
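The three rearrangement events can be written as operations on a signed gene order. The sketch below treats a genome as a Python list of signed integers and, as a simplifying assumption, as linear rather than circular; the function names and index conventions are illustrative.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just before position k (i < j <= k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Move genome[i:j] to just before position k and invert it on the way."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))                    # 1 2 3 4 5 6 7 8 9 10
print(inversion(genome, 3, 8))                 # segment 4..8 inverted in place
print(transposition(genome, 3, 8, 9))          # segment 4..8 moved after gene 9
print(inverted_transposition(genome, 3, 8, 9)) # moved after gene 9 and inverted
```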
Genome Rearrangement Has
A Huge State Space
• DNA sequences: 4 states per site
• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
• Circular genomes (1 site)
– with 37 genes: $2.56 \times 10^{52}$ states
– with 120 genes: $3.70 \times 10^{232}$ states
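The state counts for 37 and 120 genes follow directly from the formula $2^{n-1}(n-1)!$ for signed circular genomes. A short check, with an illustrative function name:

```python
from math import factorial


def signed_circular_genomes(n_genes):
    """Number of signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n_genes - 1) * factorial(n_genes - 1)


for n in (37, 120):
    # Format the exact integer in scientific notation with two decimals.
    print(n, f"{signed_circular_genomes(n):.2e}")
```

For comparison, a single DNA site has only 4 states, which is the contrast the slide draws.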
Why use gene orders?
• “Rare genomic changes”: huge state space
and relative infrequency of events
(compared to site substitutions) could make
the inference of deep evolution easier, or
more accurate.
• Our research shows this is true, but accurate
analysis of gene order data is
computationally very intensive!
The Generalized Nadeau-Taylor
model
Wang and Warnow, 2001
• Three types of events: inversions,
transpositions, and inverted transpositions
• Each event of each type is equiprobable
• The relative probabilities of the three events
are parameters that the user can specify
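A rough simulation of one step under this model is sketched below: the event type is drawn from user-specified weights, then one event of that type is drawn uniformly by choosing cut points. The sketch makes several assumptions that are not in the slides: genomes are linear lists of signed integers, an inverted transposition only moves a segment to the right, and the function name and default 2:1:1 weights are illustrative.

```python
import random

EVENT_TYPES = ["inversion", "transposition", "inverted transposition"]


def random_event(genome, weights=(2, 1, 1), rng=random):
    """Apply one random rearrangement to a linear signed genome (a list)."""
    kind = rng.choices(EVENT_TYPES, weights=weights)[0]
    n = len(genome)
    if kind == "inversion":
        # Two distinct cut points pick a non-empty segment uniformly.
        i, j = sorted(rng.sample(range(n + 1), 2))
        return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]
    # Three distinct cut points: segment genome[i:j] moves to position k.
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    segment = genome[i:j]
    if kind == "inverted transposition":
        segment = [-g for g in reversed(segment)]
    return genome[:i] + genome[j:k] + segment + genome[k:]


rng = random.Random(0)          # fixed seed so runs are repeatable
genome = list(range(1, 121))    # 120 genes
for _ in range(20):
    genome = random_event(genome, rng=rng)
print(genome[:10])
```

Changing `weights` to `(1, 0, 0)` gives an inversion-only model.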
Phylogeny reconstruction in 1998
• Distance-based
– Breakpoint (BP) distances [Blanchette, Kunisawa,
Sankoff 1998]
• Minimum length trees (NP-hard, even for three taxa)
– BPAnalysis: [Sankoff & Blanchette 1998]: exhaustive
search through treespace to find the minimum
breakpoint length (the number of breakpoints on the
tree)
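The breakpoint distance counts adjacent gene pairs in one genome that are not adjacent, in either reading direction, in the other. The sketch below assumes linear signed genomes on the same gene set, framed by artificial end markers 0 and n+1; the function names are illustrative.

```python
def adjacencies(genome):
    """All adjacent pairs of a framed linear genome, in both reading directions."""
    framed = [0] + list(genome) + [len(genome) + 1]
    pairs = set()
    for left, right in zip(framed, framed[1:]):
        pairs.add((left, right))
        pairs.add((-right, -left))  # the same adjacency read on the other strand
    return pairs


def breakpoint_distance(genome_a, genome_b):
    """Number of adjacencies of genome_a that do not occur in genome_b."""
    present_in_b = adjacencies(genome_b)
    framed_a = [0] + list(genome_a) + [len(genome_a) + 1]
    return sum((left, right) not in present_in_b
               for left, right in zip(framed_a, framed_a[1:]))


identity = list(range(1, 11))
inverted = [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(breakpoint_distance(inverted, identity))
```

The distance takes time linear in the number of genes, which is why NJ on breakpoint distances runs in seconds while the tree-length problem does not.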
[Figure: error in inferred tree versus amount of evolution for NJ(BP). 40 taxa, 120 genes, Inv.:Transp.:InvTransp. = 2:1:1; birth-death trees, expected deviation from ultrametricity = 2.]
NJ(BP): seconds
BPAnalysis: will not finish (would take 200 years for
a 13-genome dataset)
Progress
• Statistically-defined distance estimators:
EDE and IEBP, highly robust to model
violations
• FastME(EDE) yields very accurate trees,
except when the datasets are close to
“saturated”
[Figure: error in inferred tree versus amount of evolution. 40 taxa, 120 genes, Inv.:Transp.:InvTransp. = 2:1:1; birth-death trees, expected deviation from ultrametricity = 2.]
BP=breakpoint distance
INV=inversion distance
EDE: statistically-based estimator [Wang et al. ‘01] - highly robust.
All these methods are polynomial time.
Minimum length trees (“parsimony”)
• Breakpoint length and inversion length: both
NP-hard to solve even on three-leaf trees.
Exact solutions exponential in both number of
taxa and number of genes.
• Inversion-phylogeny has better topological
accuracy than breakpoint phylogeny, but is
harder to solve. Highly robust to model
violations.
“Solving” the inversion and
breakpoint phylogeny problems
• Usual issue of getting stuck in local optima, since the
optimization problems are NP-hard
• Additional problem: finding the best trees is enormously
hard, since even the “point estimation” problem is hard
(worse than estimating branch lengths in ML).
[Figure: MP score over the space of phylogenetic trees, showing a local optimum and the global optimum.]
Minimum length trees (“parsimony”)
• Breakpoint phylogeny
– BPAnalysis: [Sankoff & Blanchette 1998]
– GRAPPA [Moret et al. 2001]
– MPME [Wang et al. PSB 2002]: represents gene orders as multi-state strings, and solves parsimony on this modified dataset. This
problem is exponential in the number of taxa, but polynomial in the
number of genes. Because it relies on MP software, it cannot handle large
datasets.
– DCM4-MPME: uses a divide-and-conquer strategy (similar to
DCM3) to decompose a large dataset into smaller datasets, on the
basis of a guide tree. It can handle larger datasets than any of the
other methods.
• Inversion phylogeny:
– GRAPPA: highly accurate, robust to model violations, but cannot
analyze trees with large edge lengths in reasonable time periods.
Analyzing Large Datasets
[Figure: topological error versus problem size / divergence.]
• NJ(EDE): poor accuracy for highly diverged datasets
• GRAPPA: cannot handle datasets of moderate size, or trees with long branches
• MPME: cannot handle datasets of large size
• NJ(EDE)+MPME in a divide-and-conquer approach
[Figure: NJ(EDE) compared with DCM4-MPME. 120 genes, 200 taxa, Inversion/Transposition/Inverted Transposition = 2:1:1; birth-death trees with deviation from ultrametricity.]
DCM4-MPME: Guide tree = NJ(EDE)
GRAPPA & MPME won’t finish (long branch lengths; too many taxa)
Summary
• True evolutionary distance estimators
improve accuracy of NJ
• Sequence-based heuristic (MPME)
• Divide-and-conquer, integrated approach
for large-scale data
Limitations and ongoing research
• Current methods are mostly limited to single
chromosomes with equal gene content (or very
small amounts of deletions and duplications).
• Moret et al. have made some progress on
developing a reliable distance-based method for
chromosomes with unequal gene content (tests on
real and simulated data show high accuracy)
• Handling the multiple chromosome case is harder
GRAPPA (Genome Rearrangement
Analysis under Parsimony and other
Phylogenetic Algorithms)
http://www.cs.unm.edu/~moret/GRAPPA/
• Heuristics for NP-hard optimization problems
• Fast polynomial time distance-based methods
• Contributors: U. New Mexico, U. Texas at
Austin, Università di Bologna, Italy
• Freely available in source code at this site.
• Project leader: Bernard Moret (UNM)
(moret@cs.unm.edu)
CIPRES software distributions
Software group leaders: Wayne Maddison and Dave Swofford
The first distribution (in the next months) will focus
on Rec-I-DCM3(PAUP*): fast heuristic searches
for maximum parsimony on large datasets for
PAUP* users
All software will be open source
Community contributions to software will be
enabled
Acknowledgements
• NSF
• The David and Lucile Packard Foundation
• The Program in Evolutionary Dynamics at Harvard
• The Institute for Cellular and Molecular Biology at UT-Austin
See http://www.phylo.org and
http://www.cs.utexas.edu/~tandy and
http://www.cs.unm.edu/~moret/GRAPPA