CIPRES:
Enabling Tree of Life Projects
Tandy Warnow
The University of Texas at Austin
Reconstructing the “Tree” of Life
Handling large datasets:
millions of species
The “Tree of Life” is not
really a tree:
reticulate evolution
Cyber Infrastructure for Phylogenetic Research
Purpose: to create a national infrastructure of hardware,
open source software, database technology, etc.,
necessary to infer the Tree of Life.
Group: 40 biologists, computer scientists, and
mathematicians from 13 institutions.
Funding: $11.6 M (large ITR grant from NSF).
URL: http://www.phylo.org
CIPRes Members
University of New Mexico
Bernard Moret
David Bader
UCSD/SDSC
Fran Berman
Alex Borchers
Phil Bourne
John Huelsenbeck
Terri Liebowitz
Mark Miller
UT Austin
Tandy Warnow
David M. Hillis
Warren Hunt
Robert Jansen
Randy Linder
Lauren Meyers
Daniel Miranker
University of Arizona
David R. Maddison
University of Connecticut
Paul O Lewis
University of British Columbia
Wayne Maddison
University of Pennsylvania
Junhyong Kim
Susan Davidson
Sampath Kannan
Val Tannen
North Carolina State University
Spencer Muse
Texas A&M
Tiffani Williams
American Museum of Natural
History
Ward C. Wheeler
NJIT
Usman Roshan
UC Berkeley
Satish Rao
Steve Evans
Richard M Karp
Brent Mishler
Elchanan Mossel
Eugene W. Myers
Christos M. Papadimitriou
Stuart J. Russell
Rice
Luay Nakhleh
SUNY Buffalo
William Piel
Florida State University
David L. Swofford
Mark Holder
Yale
Michael Donoghue
Paul Turner
CIPRES activity
• Databases - e.g. TreeBase II (Bill Piel and others)
• Simulations of large-scale complex genome-scale
evolution (Junhyong Kim)
• Outreach (Michael Donoghue and Brent Mishler)
• Algorithms (Tandy Warnow)
• Open source software (Wayne Maddison, Dave
Swofford, Mark Holder, and Bernard Moret)
• Computer cluster at SDSC (Fran Berman and
Mark Miller) - available to ATOL projects and
other groups with datasets above 1000 taxa
CIPRES research in algorithms
• Multiple sequence alignment
• Genomic alignment
• Heuristics for Maximum Parsimony and Maximum Likelihood
• Bayesian MCMC methods
• Supertree methods
• Whole genome phylogeny reconstruction
• Reticulate evolution detection and reconstruction
• Data mining on sets of trees, and compact representations of these sets
Software distributions
The first distribution (in the next months) will focus
on Rec-I-DCM3(PAUP*): fast heuristic searches
for maximum parsimony on large datasets for
PAUP* users
All software will be open source
Community contributions to software will be
enabled
Phylogenetic reconstruction methods
1. Heuristics for hard optimization criteria (Maximum
Parsimony and Maximum Likelihood) - hard to solve on
large datasets
[Figure: cost over the space of phylogenetic trees, with a local optimum and the global optimum marked]
2. Polynomial time distance-based methods: Neighbor
Joining, FastME, Weighbor, etc. - poor accuracy on
datasets with large evolutionary distances
DCMs: Divide-and-conquer for
improving phylogeny reconstruction
“Boosting” phylogeny
reconstruction methods
• DCMs “boost” the performance of
phylogeny reconstruction methods.
[Diagram: Base method M → DCM → DCM-M]
DCMs (Disk-Covering Methods)
• DCMs for polynomial time methods
improve topological accuracy (empirical
observation), and have provable theoretical
guarantees under Markov models of
evolution
• DCMs for hard optimization problems
reduce running time needed to achieve good
levels of accuracy (empirical observation)
DCM1-boosting distance-based methods
[Nakhleh et al. ISMB 2001]
[Figure: error rate (0 to 0.8) against number of taxa (0 to 1600) for NJ and DCM1-NJ]
• DCM1-boosting makes distance-based methods more accurate
• Theoretical guarantees that DCM1-NJ converges to the
true tree from polynomial length sequences
Major challenge: MP and ML
• Maximum Parsimony (MP) and Maximum
Likelihood (ML) remain the methods of
choice for most systematists
• The main challenge here is to make it
possible to obtain good solutions to MP or
ML in reasonable time periods on large
datasets
Solving NP-hard problems
exactly is … unlikely
• Number of
(unrooted) binary
trees on n leaves is
(2n-5)!!
• If each tree on
1000 taxa could be
analyzed in 0.001
seconds, we would
find the best tree in
2890 millennia
#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        $2.2 \times 10^{20}$
100       $4.5 \times 10^{190}$
1000      $2.7 \times 10^{2900}$
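The counts in the table follow directly from the (2n-5)!! formula. A minimal Python sketch that reproduces the small entries is given here; the function name is only an illustrative choice, not part of any CIPRES software.

```python
def count_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("the formula applies for n >= 3 leaves")
    count = 1
    # Double factorial: product of the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, count_unrooted_binary_trees(n))
```

Python integers are arbitrary precision, so the same function also handles n = 100 or n = 1000 without overflow.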
How good an MP analysis do we
need?
• Our research shows that we need to get
within 0.01% of optimal (or even closer on
large datasets) to return reasonable
estimates of the true tree’s “topology”
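To make the 0.01% criterion concrete, here is a small Python sketch of how the gap to the best known score could be computed. The function names and the example scores are hypothetical; only the 0.01% threshold comes from the slide.

```python
ACCEPTABLE_PERCENT = 0.01  # threshold quoted on the slide


def percent_above_optimal(score, best_known):
    """Gap between an MP score and the best known score, as a percentage."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known):
    """True if the score is within the acceptable gap of the best known score."""
    return percent_above_optimal(score, best_known) < ACCEPTABLE_PERCENT


# Hypothetical scores, for illustration only.
best_known = 100_000
for score in (100_005, 100_200):
    print(score, percent_above_optimal(score, best_known), is_acceptable(score, best_known))
```

With a best known score of 100,000 steps, a tree only 10 steps longer is already at the 0.01% boundary, which shows how demanding the criterion is.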
Problems with current techniques for MP
Shown here is the performance of a heuristic maximum parsimony analysis on a real
dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using
any method for any amount of time.) Acceptable error is below 0.01%.
[Figure: Performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]
Observations
• The best MP heuristics cannot get
acceptably good solutions within 24 hours
on most of these large datasets.
• Datasets of these sizes may need months (or
years) of further analysis to reach
reasonable solutions.
• Apparent convergence can be misleading.
Our objective: speed up the best
MP heuristics
[Figure (fake study): MP score of best trees against time, contrasting the performance of a hill-climbing heuristic with the desired performance]
DCM3 decomposition
Input: Set S of sequences, and guide-tree T
1. Compute short subtree graph G(S,T), based upon T
2. Find clique separator in the graph G(S,T) and form subproblems
DCM3 decompositions
(1) can be obtained in O(n) time
(2) yield small subproblems
(3) can be used iteratively
(4) can be applied recursively
Iterative-DCM3
[Diagram: T → DCM3 → Base method → T’]
New DCMs
• DCM3
  1. Compute subproblems using DCM3 decomposition
  2. Apply base method to each subproblem to yield subtrees
  3. Merge subtrees using the Strict Consensus Merger technique
  4. Randomly refine to make it binary
• Recursive-DCM3
• Iterative DCM3
  1. Compute a DCM3 tree
  2. Perform local search and go to step 1
• Recursive-Iterative DCM3
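The control flow of the DCM3 and Iterative DCM3 steps listed on this slide can be sketched in Python as follows. This is only a skeleton under stated assumptions: `decompose`, `base_method`, `merge`, `refine`, and `local_search` are hypothetical callables that the caller supplies, and the fixed iteration count is an assumption, since the slide gives no stopping rule. It is not the Rec-I-DCM3 implementation.

```python
def dcm3_tree(sequences, guide_tree, decompose, base_method, merge, refine):
    """One DCM3 pass; every callable is a hypothetical, caller-supplied step."""
    subproblems = decompose(sequences, guide_tree)               # DCM3 decomposition
    subtrees = [base_method(subset) for subset in subproblems]   # base method per subproblem
    merged = merge(subtrees)                                     # e.g. Strict Consensus Merger
    return refine(merged)                                        # random refinement to binary


def iterative_dcm3(sequences, guide_tree, iterations, local_search, **steps):
    """Repeat: compute a DCM3 tree, run local search, reuse the result as guide tree."""
    tree = guide_tree
    for _ in range(iterations):  # assumed stopping rule: fixed number of rounds
        tree = dcm3_tree(sequences, tree, **steps)
        tree = local_search(tree)
    return tree
```

Passing the steps as callables keeps the skeleton independent of any particular base method, which mirrors the idea that a DCM boosts whatever base method it is given.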
Rec-I-DCM3 significantly improves performance
[Figure: Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2). Curves: current best techniques; DCM boosted version of best techniques.]
Datasets
Obtained from various researchers and online databases
• 1322 lsu rRNA of all organisms
• 2000 Eukaryotic rRNA
• 2594 rbcL DNA
• 4583 Actinobacteria 16s rRNA
• 6590 ssu rRNA of all Eukaryotes
• 7180 three-domain rRNA
• 7322 Firmicutes bacteria 16s rRNA
• 8506 three-domain+2org rRNA
• 11361 ssu rRNA of all Bacteria
• 13921 Proteobacteria 16s rRNA
Rec-I-DCM3(TNT) vs. TNT
(Comparison of scores at 24 hours)
[Figure: TNT and Rec-I-DCM3 compared on datasets 1 to 10. y-axis: average MP score above optimal at 24 hours, shown as a percentage of the optimal (0 to 0.1).]
Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3
significantly improves upon the unboosted TNT by returning trees which are at most 0.01%
above optimal on most datasets.
Observations
• Rec-I-DCM3 improves upon the best
performing heuristics for MP.
• The improvement increases with the
difficulty of the dataset.
DCMs
• DCM for NJ and other distance methods produces
absolute fast converging (afc) methods
• DCMs for MP heuristics
• DCMs for use with the GRAPPA software for
whole genome phylogenetic analysis; these have
been shown to let GRAPPA scale from its
maximum of about 15-20 genomes to 1000
genomes.
• Current projects: DCM development for
maximum likelihood and multiple sequence
alignment.
Part II: Whole-Genome
Phylogenetics
[Figure: trees on leaves A to F, with internal nodes W, X, Y, Z]
Genomes Evolve by Rearrangements
[Figure: the gene order 1 2 3 4 5 6 7 8 9 10 rearranged by each of the following operations]
• Inversion (Reversal)
• Transposition
• Inverted Transposition
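The three rearrangement operations can be illustrated on a signed gene order stored as a Python list. This is a minimal sketch: the function names, the index conventions, and the choice of the segment 4..8 are assumptions made for illustration, and a linear list stands in for the chromosome.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows the block genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Transpose the segment genome[i:j] past genome[j:k], inverting it as well."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]


genome = list(range(1, 11))  # gene order 1 2 ... 10

print(inversion(genome, 3, 8))                  # genes 4..8 reversed and negated
print(transposition(genome, 3, 8, 9))           # genes 4..8 moved after gene 9
print(inverted_transposition(genome, 3, 8, 9))  # both at once
```

Each function returns a new list, so the original gene order is left unchanged and the three results can be compared side by side.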
Genome Rearrangement Has
A Huge State Space
• DNA sequences: 4 states per site
• Signed circular genomes with n genes:
$2^{n-1}(n-1)!$ states, 1 site
• Circular genomes (1 site)
– with 37 genes: $2.56 \times 10^{52}$ states
– with 120 genes: $3.70 \times 10^{232}$ states
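The two state counts quoted on this slide can be checked against the formula of 2 to the power n-1, times (n-1) factorial, for signed circular genomes with n genes. A short Python sketch (the function name is an illustrative choice):

```python
from math import factorial


def signed_circular_genome_states(n):
    """Number of signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    return 2 ** (n - 1) * factorial(n - 1)


for n in (37, 120):
    # Format the exact integer in scientific notation with three significant digits.
    print(n, f"{signed_circular_genome_states(n):.2e}")
```

By comparison, a single DNA site has only 4 states, which is the contrast the slide draws.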
Why use gene orders?
• “Rare genomic changes”: huge state space
and relative infrequency of events
(compared to site substitutions) could make
the inference of deep evolution easier, or
more accurate.
• Our research shows this is true, but accurate
analysis of gene order data is
computationally very intensive!
Maximum Parsimony on Rearranged
Genomes (MPRG)
• The leaves are rearranged genomes.
• Find the tree that minimizes the total number of rearrangement events (NP-hard)
[Figure: example tree over genomes A to F with edge lengths 3, 6, 2, 3, 4; total length = 18]
“Solving” the inversion phylogeny
• Usual issue of getting stuck in local optima, since the
optimization problems are NP-hard
• Additional problem: finding the best trees is enormously
hard, since even the “point estimation” problem is hard
(worse than estimating branch lengths in ML).
[Figure: MP score over the space of phylogenetic trees, with a local optimum and the global optimum marked]
Benchmark gene order dataset:
Campanulaceae
• 12 genomes + 1 outgroup (Tobacco), 105 gene segments
• NP-hard optimization problems: breakpoint and inversion phylogenies
(techniques score every tree)
Joint work with Bob Jansen, Linda Raubeson, Jijun Tang, and Li-San Wang
1997: BPAnalysis (Blanchette and Sankoff): 200 years
(est.)
2000: Using GRAPPA v1.1 on the 512-processor Los
Lobos Supercluster machine: 2 minutes (200,000-fold
speedup per processor)
2003: Using latest version of GRAPPA: 2 minutes on a
single processor (1-billion-fold speedup per processor)
GRAPPA (Genome Rearrangement
Analysis under Parsimony and other
Phylogenetic Algorithms)
http://www.cs.unm.edu/~moret/GRAPPA/
• Heuristics for NP-hard optimization problems
• Fast polynomial time distance-based methods
• Contributors: U. New Mexico, U. Texas at
Austin, Università di Bologna, Italy
• Freely available in source code at this site.
• Project leader: Bernard Moret (UNM)
(moret@cs.unm.edu)
Limitations and ongoing research
• Current methods are mostly limited to single
chromosomes with equal gene content (or very
small amounts of deletions and duplications).
• We have made some progress on developing a
reliable distance-based method for chromosomes
with unequal gene content (tests on real and
simulated data show high accuracy)
• Handling the multiple chromosome case is harder
Acknowledgements
• NSF
• The David and Lucile Packard Foundation
• The Program in Evolutionary Dynamics at Harvard
• The Institute for Cellular and Molecular Biology at UT-Austin
See http://www.phylo.org and
http://www.cs.utexas.edu/~tandy for more info