Phylogenomic supertrees: the end of the road or the light at the end of the tunnel?
Olaf R. P. Bininda-Emonds
Friedrich-Schiller-Universität Jena
Outline
• what are supertrees?
• “traditional” supertrees
• the threat from phylogenomics
• supertrees in the future
• a paradigm shift
• deconstructing divide-and-conquer
• challenges for the future
What is a supertree?
• results from the combination of many smaller, overlapping trees to form a single larger one
• allows inferences of relationships that cannot be made from any single source tree
• as old as systematics itself?
• “vertical” (taxonomic) substitution
• still in use, e.g., Tree of Life, larger supertrees
[Figure: overlapping source trees (taxon sets A B C K L; C D E H I K; E F G H J K L) combined into a single supertree for taxa A to L]
• Agreement: consensus-like techniques
• Optimization: coding technique + optimization criterion
“Traditional” supertrees
A supertree of extant mammals
Monotremata
Marsupialia
Afrotheria
Xenarthra
Laurasiatheria
Euarchontoglires
• 4510 of the 4554 species listed in Wilson and Reeder (1993)
You are here
• from Bininda-Emonds et al. (2007)
A supertree of extant birds
• 5985 extant species (Davis and Page, semipubl. data)
• phylogeny from Johnson (2001)
Criticisms of supertrees
• one step removed from the real data
• loss of information reduces accuracy
• prevents “signal enhancement”
• potential for data duplication
• can produce unsupported clades
• invalid as phylogenetic hypotheses
• summary statement (i.e., consensus)
• cannot interpret supertree biologically
• not necessary due to the molecular revolution (stop-gap method)
• “Not many people build them [supertrees], and my sense is that their lifetime is limited: as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value.”
• Anonymous review of proposed supertree book (2001)
MRP supertree of extant Carnivora
all 271 extant species
274 source trees from 177 literature sources
13 nested supertrees
• from Bininda-Emonds et al. (1999)
Carnivora sequences in GenBank
[Figure: number of sequences (log scale, 1 to 10 000 000) plotted against year (1990 to 2005)]
• as of January 1, 1996: 677 sequences, 48 species, 12 new species / yr
Carnivora sequences in GenBank
[Figure: number of sequences (log scale, 1 to 10 000 000) plotted against year (1990 to 2005)]
• as of March 12, 2004: 1 984 623 sequences, 197 species, 13.1 new species / yr
• from Bininda-Emonds (2005)
Distribution of GenBank data
• 1 984 623 sequences
• 1 976 358 are for domestic dog (99.6%)
• 4365 are for domestic cat (0.2%)
• 3900 for remaining 195 species (or 20.0 sequences / species)
• but: 191 of the 219 Martes americana sequences are cyt b
• 225 of the 302 Phoca vitulina sequences are tRNA-Pro
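The percentages on this slide follow directly from the three counts. A minimal sketch of the arithmetic, using only the numbers quoted above (the variable and function names are illustrative, not from the talk):

```python
# Sequence counts quoted on the slide (Carnivora in GenBank, March 12, 2004).
counts = {
    "domestic dog": 1_976_358,
    "domestic cat": 4_365,
    "remaining 195 species": 3_900,
}

# The three categories should add up to the quoted total of 1 984 623.
total = sum(counts.values())


def share(name):
    """Percentage of all Carnivora sequences contributed by one category."""
    return 100.0 * counts[name] / total


# Mean number of sequences per species outside dog and cat.
per_species_rest = counts["remaining 195 species"] / 195

if __name__ == "__main__":
    print(f"total: {total}")
    for name in counts:
        print(f"{name}: {share(name):.1f}%")
    print(f"sequences per remaining species: {per_species_rest:.1f}")
```

Even the mean of 20.0 sequences per species overstates coverage, since (as the last two bullets note) a species' sequences are often dominated by a single marker.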
The molecular revolution
• molecular databases are currently highly incomplete and data are not randomly distributed
• 33+ genome projects for mammals
• ESTs: lots of bps, but comparatively few species
• “data availability matrix” for green plants (from Sanderson and Driskell, 2003)
A paradigm shift
• traditional, literature-based supertree construction probably ultimately endangered
• but more so for some groups than for others
• any future role in phylogenetics likely as an analytical tool
• traditional mixed data analyses
• divide-and-conquer homogeneous data analyses
Partitioned analyses
• utility of pure sequence-based analyses for large, taxonomically broad studies increasingly questioned
• alignment problems, loss of data
• saturation / signal dropout
• increasing trend for mixed analyses using data that require different models and assumptions:
• e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, …
• mixed-data analyses might benefit from a “traditional” supertree approach
• i.e., supertree represents end result of analysis
[Figure: conventional analysis of each data type, followed by supertree construction]
Analyzing DNA supermatrices
• partitioned approach incorporating supertrees needed around turn of century
• less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and “tricks”)
• ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses
Archimedean phylogenetics
“Give me a cluster large enough and a data set on which to work, and I shall derive the phylogeny.”
[Figure: subtree optimization (conventional analysis), then supertree construction, then global optimization (conventional analysis)]
• adapted from Roshan et al. (2004)
[Table: Stage against Speed and Accuracy; the only surviving cell entry is “n/a”]
• Divide
• Subtree optimization
• Supertree construction: BUILD or MR / O
• Global optimization
• simulate data (K2P, ti:tv = 2.0, = 0.5, = 0.1, 2000 bp)
• subsample data (4, 8, 16, …, 1024, 2048 taxa)
• phylogenetic analysis (NJ, weighted MP, ML, or ML-DCM3)
• compare to pruned model tree
Sampling schemes
• “clade sampling”
• “random sampling”
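The two sampling schemes can be contrasted in a few lines. This is only a sketch under loud assumptions: the model tree is taken to be perfectly balanced, with leaves numbered so that every aligned block of k = 2^i consecutive leaves forms a clade, and “clade sampling” is read as drawing all taxa of one such clade, while “random sampling” draws taxa from anywhere in the tree. The function names are hypothetical.

```python
import random


def random_sample(taxa, k, rng):
    """Random sampling: k taxa drawn without regard to the tree."""
    return sorted(rng.sample(taxa, k))


def clade_sample(taxa, k, rng):
    """Clade sampling on an assumed perfectly balanced tree.

    Leaves are assumed to be listed in tree order, so each aligned block
    of k consecutive leaves (k a power of two) is one clade of size k.
    """
    n = len(taxa)
    # Both sizes must be powers of two for aligned blocks to be clades.
    assert n & (n - 1) == 0 and k & (k - 1) == 0 and 0 < k <= n
    start = rng.randrange(n // k) * k
    return taxa[start:start + k]


if __name__ == "__main__":
    rng = random.Random(1)
    taxa = list(range(2048))  # stand-in labels for 2048 simulated taxa
    print("clade :", clade_sample(taxa, 16, rng))
    print("random:", random_sample(taxa, 16, rng))
```

A clade sample keeps all close relatives together, whereas a random sample of the same size spans the whole tree and therefore tends to contain longer branches.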
Divide step
• investigated chiefly by Daniel Huson, Tandy Warnow, Usman Roshan and colleagues
• developed disk-covering methods (DCMs)
• fastest current implementation is Recursive-Iterative-DCM3 (Rec-I-DCM3)
• sampling strategy for divide step crucial
• Roshan et al. (2004) noted that performance gain dependent on quality of initial decomposition
• due to effects on analysis times of subtree optimization step
Scaling of accuracy
[Figure: accuracy (y-axis, 0.750 to 1.000) against size of subsampled tree (x-axis, log scale, 1 to 10 000); series: MP, NJ, ML and ML-DCM3, each under random and clade sampling]
• from Bininda-Emonds and Stamatakis (2006)
Accuracy and sampling strategy
[Figure: y-axis 0.95 to 1.15 against size of subsampled tree (x-axis, log scale, 1 to 10 000); series: MP, NJ, ML, ML-DCM3]
• from Bininda-Emonds and Stamatakis (2006)
Scaling of analysis time
[Figure: analysis time (y-axis, log scale, 0.01 to 100 000) against size of subsampled tree (x-axis, log scale, 1 to 10 000); series: MP, NJ, ML and ML-DCM3, each under random and clade sampling]
• from Bininda-Emonds and Stamatakis (2006)
Analysis time and sampling strategy
[Figure: y-axis 0.0 to 1.5 against size of subsampled tree (x-axis, log scale, 1 to 10 000); series: MP, NJ, ML, ML-DCM3]
• from Bininda-Emonds and Stamatakis (2006)
Supertree step
• two main alternative strategies: BUILD-based vs. matrix representation / optimization (MR / O) based
• problem:
• BUILD is fast, but shows poor accuracy
• MR / O shows good accuracy, but is deadly slow
• can we devise a supertree method that combines speed and accuracy?
• BUILD shows more promise; MR / O will always be slow
• NB: accuracy ≠ resolution!
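To make the “matrix representation” half of MR / O concrete, here is a minimal sketch of standard Baum/Ragan-style coding: each internal clade of each source tree becomes one binary character, with “?” for taxa missing from that tree. Representing trees as nested tuples and the function names are assumptions made for illustration; the optimization half (e.g., a parsimony search on the resulting matrix) is not shown.

```python
def leaves(tree):
    """Set of taxon names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node below the root."""
    if isinstance(tree, str):
        return
    for child in tree:
        if not isinstance(child, str):
            yield leaves(child)
            yield from clades(child)


def mrp_matrix(source_trees):
    """One character per source-tree clade: 1 = in clade, 0 = in tree
    but outside the clade, ? = taxon absent from that source tree."""
    all_taxa = sorted(set().union(*(leaves(t) for t in source_trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Two small overlapping source trees: ((A,B),C) and ((B,C),D).
example = mrp_matrix([(("A", "B"), "C"), (("B", "C"), "D")])

if __name__ == "__main__":
    for taxon, row in example.items():
        print(taxon, row)
```

The slowness noted above comes from the optimization step on this matrix, not from the coding itself, which is linear in the total number of source-tree clades times the number of taxa.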
Problems with BUILD
• many BUILD-derived algorithms:
• BUILD, MinCutSupertree, BUILD-with-Distances, AncestralBUILD, MultiLevelSupertree, PhySIC, …
• MinCut the most widely known and basis for many other methods
• tends to approximate Adams consensus (at least empirically)
• tends to favour larger source trees (= size bias)
• tends to spit out single conflicting taxa at each step, yielding very unbalanced, comb-like trees
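For reference, the core BUILD recursion (Aho et al.) can be sketched on rooted triplets: connect the two "close" taxa of every triplet, split the taxon set into connected components, and recurse. The encoding of a triplet as `((a, b), c)` and the nested-tuple output are assumptions of this sketch. Plain BUILD simply fails when the graph stays connected, and MinCutSupertree replaces that failure with a graph cut.

```python
def build(taxa, triplets):
    """BUILD on rooted triplets.

    triplets: list of ((a, b), c), meaning a and b are more closely
    related to each other than either is to c.
    Returns a nested-tuple tree, or None if the triplets conflict.
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    if len(taxa) == 2:
        return tuple(taxa)
    taxon_set = set(taxa)
    # Keep only triplets whose three taxa all lie in the current set.
    inside = [t for t in triplets if {t[0][0], t[0][1], t[1]} <= taxon_set]
    # Connected components of the graph with an edge a-b per triplet.
    component = {x: {x} for x in taxa}
    for (a, b), _ in inside:
        if component[a] is not component[b]:
            merged = component[a] | component[b]
            for x in merged:
                component[x] = merged
    parts = []
    for x in taxa:
        if component[x] not in parts:
            parts.append(component[x])
    if len(parts) == 1:
        return None  # graph is connected: the input is incompatible
    subtrees = [build(part, inside) for part in parts]
    if any(subtree is None for subtree in subtrees):
        return None
    return tuple(subtrees)


if __name__ == "__main__":
    ok = [(("A", "B"), "C"), (("C", "D"), "A")]
    print(build({"A", "B", "C", "D"}, ok))
    clash = [(("A", "B"), "C"), (("B", "C"), "A")]
    print(build({"A", "B", "C"}, clash))
```

The recursion only ever partitions the taxa, which is why it is fast, and also why a single conflicting triplet has such a large effect on the result.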
Does divide-and-conquer work?
• it should / could:
• tremendous speed gain to analyzing many, smaller problems: $n \cdot \mathrm{time}_{x} \ll \mathrm{time}_{n \cdot x}$
• accuracy ~flat with respect to problem size
• e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously
[Figure: log-scale y-axis (1 to 10 000 000) against size of subsampled tree (x-axis, log scale, 1 to 10 000); series: MP, NJ and ML, each under random and clade sampling; reference marker = 4096 taxa]
• from Bininda-Emonds and Stamatakis (2006)
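The example on this slide can be checked with simple arithmetic. The sketch below takes the 69 392 s reported elsewhere in the talk for MP on the full 4096-taxon data set as the reference time. That pairing is an assumption, since the slide does not say which timing the ~250 000 figure was derived from.

```python
# Figures quoted on the slide.
n_small_analyses = 250_000   # approximate number of small MP analyses
taxa_per_analysis = 16       # clade-sampled taxa in each small analysis
full_problem_taxa = 4096     # taxa analysed simultaneously

# Assumed reference: MP time on the full data set, from the results table.
mp_full_seconds = 69_392

# Total taxa processed across all the small analyses.
total_small_taxa = n_small_analyses * taxa_per_analysis

# Implied average time budget for one 16-taxon analysis.
seconds_per_small_analysis = mp_full_seconds / n_small_analyses

# How many times more taxa pass through the small analyses than
# through the single simultaneous analysis in the same time.
throughput_ratio = total_small_taxa / full_problem_taxa

if __name__ == "__main__":
    print(total_small_taxa)
    print(round(seconds_per_small_analysis, 3))
    print(round(throughput_ratio, 1))
```

Under these assumptions each small analysis has a budget of well under a second, which is the sense in which $n \cdot \mathrm{time}_{x}$ is much smaller than $\mathrm{time}_{n \cdot x}$.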
• but these potential savings aren’t realized in full empirically …
Analyses of full 4096-taxon data set
Method                           Accuracy (1 – d_S)   Time taken (seconds)
NJ                               0.857                193
MP                               0.917                69 392
ML-DCM3                          0.921                195 371
ML (“standard hill climbing”)    0.923                303 450
• ML (“standard hill climbing”) took 1.55x as long as ML-DCM3
• from Bininda-Emonds and Stamatakis (2006)
Analyses of full data set
Method                           Accuracy (1 – d_S)   Time taken (seconds)
NJ                               0.857                193
MP                               0.917                69 392
ML (“fast hill climbing”)        0.912                38 737
ML-DCM3                          0.921                195 371
ML (“standard hill climbing”)    0.923                303 450
• ML-DCM3 took 5.04x as long as ML (“fast hill climbing”)
• from Bininda-Emonds and Stamatakis (2006)
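The 1.55x and 5.04x figures are ratios of the run times in the two tables. A short sketch reproducing them from the tabulated values (the dictionary layout and function name are illustrative only):

```python
# (accuracy as 1 - d_S, time in seconds) for the full 4096-taxon data set.
results = {
    "NJ": (0.857, 193),
    "MP": (0.917, 69_392),
    "ML fast": (0.912, 38_737),        # ML, "fast hill climbing"
    "ML-DCM3": (0.921, 195_371),
    "ML standard": (0.923, 303_450),   # ML, "standard hill climbing"
}


def time_ratio(slower, faster):
    """How many times longer `slower` ran than `faster`."""
    return results[slower][1] / results[faster][1]


def accuracy_gain(better, worse):
    """Difference in accuracy (1 - d_S) between two methods."""
    return results[better][0] - results[worse][0]


if __name__ == "__main__":
    print(round(time_ratio("ML standard", "ML-DCM3"), 2))
    print(round(time_ratio("ML-DCM3", "ML fast"), 2))
    print(round(accuracy_gain("ML-DCM3", "ML fast"), 3))
```

Read together, the two ratios show the divide-and-conquer method beating the standard search on time but losing to the fast hill-climbing search, for an accuracy difference in the third decimal place.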
What’s the problem?
• bottleneck remains terminal global optimization step
• any excessive branch swapping will slow it down
• but branch swapping crucial for accuracy
• therefore, key is to provide as accurate a starting tree as possible
• DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML)
Possible solutions: input
• improve accuracy of supertree by any or all of:
• increasing coverage by analyzing more subtrees with more overlap
• including several larger backbone trees
• deriving support values for subtrees (e.g., fast bootstrapping) to enable weighted supertree analysis
• time is available for these steps
• also lend themselves to parallelization
Possible solutions: analysis
• optimize global optimization step using constraints
• minimize amount of intensive branch swapping and tree surfing
• idea in DCM-based methods (“refinement” of SCM supertree)
• supertree serves as starting tree and constraint tree
• crucial that supertree is accurate (NB accuracy ≠ resolution!)
• can also judge node support empirically and constrain only well supported nodes
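The last point, constraining only well supported nodes, amounts to filtering supertree clades by a support value before handing them to the global search. A minimal sketch, in which the 0.7 cut-off, the clade-to-support mapping, and the function name are all hypothetical choices for illustration:

```python
def select_constraints(clade_support, threshold=0.7):
    """Return the clades whose support meets the threshold.

    clade_support: mapping from frozenset of taxon names to a support
    value in [0, 1] (e.g., a bootstrap proportion).
    The result is a list of sorted tuples, smallest clades first.
    """
    kept = [clade for clade, support in clade_support.items()
            if support >= threshold]
    return sorted((tuple(sorted(clade)) for clade in kept),
                  key=lambda clade: (len(clade), clade))


# Hypothetical support values for three supertree clades.
support = {
    frozenset({"A", "B"}): 0.95,
    frozenset({"A", "B", "C"}): 0.55,
    frozenset({"D", "E"}): 0.80,
}
constraints = select_constraints(support)

if __name__ == "__main__":
    print(constraints)
```

Clades that fall below the threshold are left unconstrained, so the global search remains free to correct them, at the cost of more branch swapping.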
What’s the answer?
• increasing technological sophistication will keep increasing range of conventional analyses
• analyses of ≤ 10 000 taxa now feasible and usually without parallelization
• but, does a divide-and-conquer + supertree framework have a role beyond this?
• theoretically yes, but only by solving a number of challenges
Challenges for the future
• divide (+ subtree optimization) steps
• find subtree size(s) or combinations thereof that maximize speed and especially accuracy
• find optimal sampling scheme: clade and backbone vs clade-like sampling
• do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?
• alternatives to disk-covering methods?
• supertree step
• can we find a method that is fast like BUILD and accurate like MR / O methods? PhySIC???
• global-optimization step
• have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints
Bicliques
[Figure: bipartite graph linking taxa A to G with genes 1 to 8]
Taxa   Genes
       1 2 3 4 5 6 7 8
E      + – – – – – – –
F      + + – – – – – –
G      – + + + + – – –
C      – + + + + + – –
D      – + + + + – – –
A      – – – – – + + –
B      – – – – – – + +
• maximal biclique = $K_{4,3}$ (4 genes × 3 taxa)
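The maximal biclique in this small matrix can be found by brute force. The sketch below assumes the seven matrix rows correspond to taxa E, F, G, C, D, A, B in that order (the order in which the labels appear on the slide), and it scores a biclique by its number of cells. Enumerating all gene subsets only works for a toy matrix like this one, and the function name is illustrative.

```python
from itertools import combinations

# Genes sampled for each taxon, read off the +/- matrix
# (row-to-taxon assignment is an assumption, see the text above).
presence = {
    "E": {1},
    "F": {1, 2},
    "G": {2, 3, 4, 5},
    "C": {2, 3, 4, 5, 6},
    "D": {2, 3, 4, 5},
    "A": {6, 7},
    "B": {7, 8},
}


def max_biclique(presence, min_taxa=2):
    """Exhaustively find the taxa x genes block with no missing cells
    that contains the most cells. Returns (cells, taxa, genes)."""
    genes = sorted(set().union(*presence.values()))
    best = (0, set(), set())
    for size in range(1, len(genes) + 1):
        for gene_subset in combinations(genes, size):
            wanted = set(gene_subset)
            # Taxa that have every gene in the subset.
            taxa = {t for t, have in presence.items() if wanted <= have}
            cells = len(taxa) * len(wanted)
            if len(taxa) >= min_taxa and cells > best[0]:
                best = (cells, taxa, wanted)
    return best


cells, taxa, genes = max_biclique(presence)

if __name__ == "__main__":
    print(cells, sorted(taxa), sorted(genes))
```

With this row assignment the search recovers a block of 4 genes by 3 taxa, matching the $K_{4,3}$ stated on the slide.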
Extending bicliques
• quasi-bicliques
• allow a certain proportion of missing edges
• as input for a supertree analysis
• essentially build bicliques of bicliques: bicliques that overlap for at least two taxa, but no sequences
[Figure: bipartite graph of taxa A to G and genes 1 to 8]
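A quasi-biclique relaxes the requirement that every taxon-gene cell be filled. A minimal sketch of the test, reusing the taxon-gene matrix from the Bicliques slide (with the same assumed row order E, F, G, C, D, A, B) and a hypothetical tolerance parameter `max_missing` for the allowed proportion of missing edges:

```python
# Genes sampled for each taxon (assumed row order, as on the Bicliques slide).
presence = {
    "E": {1},
    "F": {1, 2},
    "G": {2, 3, 4, 5},
    "C": {2, 3, 4, 5, 6},
    "D": {2, 3, 4, 5},
    "A": {6, 7},
    "B": {7, 8},
}


def missing_fraction(presence, taxa, genes):
    """Proportion of taxon-gene cells in the block that are empty."""
    cells = len(taxa) * len(genes)
    missing = sum(1 for t in taxa for g in genes if g not in presence[t])
    return missing / cells


def is_quasi_biclique(presence, taxa, genes, max_missing=0.2):
    """True if at most `max_missing` of the block's cells are empty.
    The default tolerance is an arbitrary illustrative value."""
    return missing_fraction(presence, taxa, genes) <= max_missing


if __name__ == "__main__":
    block_taxa = ["F", "G", "C", "D"]
    block_genes = [2, 3, 4, 5]
    print(missing_fraction(presence, block_taxa, block_genes))
    print(is_quasi_biclique(presence, block_taxa, block_genes))
```

Adding taxon F to the exact biclique enlarges the block from 12 to 16 cells at the cost of three empty ones, which is the kind of trade-off the tolerance controls.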