Phylogenomic supertrees: the end of the road Olaf R. P. Bininda-Emonds

advertisement

Phylogenomic supertrees: the end of the road or the light at the end of the tunnel?

Olaf R. P. Bininda-Emonds

Friedrich-Schiller-Universität Jena

Outline

• what are supertrees?

• “traditional” supertrees

• the threat from phylogenomics

• supertrees in the future

• a paradigm shift

• deconstructing divideand-conquer

• challenges for the future

What is a supertree?

• results from the combination of many smaller, overlapping trees to form a single larger one

• allows inferences of relationships that cannot be made from any single source tree

• as old as systematics itself?

• “vertical” (taxonomic) substitution

• still in use e.g., Tree of Life, larger supertrees

E

Formal supertree construction

F G H J K L

Agreement

A B C D E F G H I J K L consensus-like techniques

A B C K L

C D E H I K coding technique

Optimization optimization criterion

“Traditional” supertrees

A supertree of extant mammals

Monotremata

Marsupialia

Afrotheria

Xenarthra

Laurasiatheria

Euarchontoglires

4510 of the 4554 species listed in

Wilson and

Reeder (1993)

You are here

• from Bininda-Emonds et al . (2007)

A supertree of extant birds

QuickTime™ and a TI FF (Uncompressed) decompressor are needed to see this pict ure.

QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture.

QuickTime™ and a TI FF (Uncompressed) decompressor are needed to see this pict ure.

QuickTime™ and a TI FF (Uncompressed) decompressor are needed to see this pict ure.

QuickT ime ™ an d a TIFF (Un compr ess ed) d ecomp res sor a re ne eded to se e th is p ic tu re.

• 5985 extant species

(Davis and

Page, semipubl. data)

• phylogeny from

Johnson (2001)

Criticisms of supertrees

• one step removed from the real data

• loss of information reduces accuracy

• prevents “signal enhancement”

• potential for data duplication

• can produce unsupported clades

• invalid as phylogenetic hypotheses

• summary statement (i.e., consensus)

• cannot interpret supertree biologically

• not necessary due to the molecular revolution ( stop-gap method )

• “Not many people build them [supertrees], and my sense is that their lifetime is limited : as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value.”

• Anonymous review of proposed supertree book (2001)

MRP supertree of extant Carnivora

 all 271 extant species

274 source trees from 177 literature sources

13 nested supertrees

• from Bininda-Emonds et al . (1999)

Carnivora sequences in GenBank

10 000 000

1 000 000

100 000

10 000

1000

100

10

1

1990 1995

677 sequences

48 species

12 new species / yr

Year

2000 2005

• as of January 1, 1996

Carnivora sequences in GenBank

10 000 000

1 000 000

1 984 623 sequences

100 000

10 000

1000

100

10

1

1990

• from Bininda-Emonds (2005)

1995

Year

2000

197 species

13.1 new species / yr

2005

• as of March 12, 2004

Distribution of GenBank data

1 984 623 sequences

1 976 358

4365 are for domestic dog (99.6%) are for domestic cat (0.2%)

3900 for remaining 195 species

(or 20.0 sequences / species)

• but: 191 of the 219 Martes americana sequences are cyt b

225 of the 302 Phoca vitulina sequences are tRNA-Pro

The molecular revolution

Species

• molecular databases are currently highly incomplete and data are not randomly distributed

• 33+ genome projects for mammals

• ESTs: lots of bps, but comparatively few species

• “data availability matrix” for green plants

(from Sanderson and Driskell, 2003)

A paradigm shift

• traditional, literature-based supertree construction probably ultimately endangered

• but more so for some groups than for others

• any future role in phylogenetics likely as an analytical tool

• traditional  mixed data analyses

• divide-and-conquer  homogeneous data analyses

Partitioned analyses

• utility of pure sequence-based analyses for large, taxonomically broad studies questioned increasingly

• alignment problems  loss of data

• saturation / signal dropout conventional

• increasing trend for mixed analyses using analysis data that require different models and assumptions:

• e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, …

• mixed-data analyses might benefit from a

“traditional” supertree approach

• i.e., supertree represents end result of analysis supertree construction

Analyzing DNA supermatrices

• partitioned approach incorporating supertrees needed around turn of century

• less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and “tricks”) conventional analysis conventional analysis supertree construction

• ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses

Archimedean phylogenetics

“Give me a cluster large enough and a data set on which to work on, and I shall derive the phylogeny.”

subtree optimization

(conventional analysis) supertree construction global optimization

(conventional analysis)

• adapted from Roshan et al . (2004)

Stage

Divide

Subtree optimization

Supertree construction

BUILD

MR / O

Global optimization

Speed Accuracy n/a

compare to pruned model tree simulate data

(K2P, ti:tv = 2.0,

 = 0.5,  = 0.1,

2000 bp) subsample data

(4, 8, 16, …,

1024, 2048 taxa) phylogenetic analysis

(NJ, weighted MP,

ML, or ML-DCM3)

Sampling schemes

• “clade sampling” • “random sampling”

Stage

Divide

Subtree optimization

Supertree construction

BUILD

MR / O

Global optimization

Speed Accuracy n/a

Divide step

• investigated chiefly by Daniel Huson, Tandy Warnow,

Usman Roshan and colleagues

• developed disk-covering methods (DCMs)

• fastest current implementation is Recursive-Iterative-DCM3

(Rec-I-DCM3)

• sampling strategy for divide step crucial

• Roshan et al . (2004) noted that performance gain dependent on quality of initial decomposition

• due to effects on analysis times of subtree optimization step

1.000

0.950

0.900

0.850

0.800

0.750

1

Scaling of accuracy

MP (random)

MP (clade)

NJ (random)

NJ (clade)

ML (random)

ML (clade)

ML-DCM3 (random)

ML-DCM3 (clade)

10 100

Size of subsampled tree

1000 10000

• from Bininda-Emonds and Stamatakis (2006)

Accuracy and sampling strategy

1.15

1.10

1.05

1.00

0.95

1

MP

NJ

ML

ML-DCM3

10 100

Size of subsampled tree

1000 10000

• from Bininda-Emonds and Stamatakis (2006)

100000

10000

1000

100

0.1

0.01

10

1

1

Scaling of analysis time

MP (random)

MP (clade)

NJ (random)

NJ (clade)

ML (random)

ML (clade)

ML-DCM3 (random)

ML-DCM3 (clade)

10 100

Size of subsampled tree

1000 10000

• from Bininda-Emonds and Stamatakis (2006)

Analysis time and sampling strategy

1.5

1.0

0.5

MP

NJ

ML

ML-DCM3

0.0

1 10 100

Size of subsampled tree

1000 10000

• from Bininda-Emonds and Stamatakis (2006)

Stage

Divide

Subtree optimization

Supertree construction

BUILD

MR / O

Global optimization

Speed Accuracy n/a

Supertree step

• two main alternative strategies: BUILD-based vs. matrix representation / optimization based

• problem:

• BUILD is fast , but shows poor accuracy

• MR / O shows good accuracy , but is deadly slow

• can we devise a supertree method that combines speed and accuracy ?

• BUILD shows more promise  MR / O will always be slow

• NB: accuracy ≠ resolution !

Problems with BUILD

• lot of BUILD-derived algorithms:

• BUILD, MinCutSupertree, BUILD-with-Distances ,

AncestralBUILD, MultiLevelSupertree, PhySIC , …

• MinCut the most widely known and basis for many other methods

• tends to approximate Adams consensus (at least empirically)

• tends to favour larger source trees (= size bias )

• tends to spit out single conflicting taxa at each step yielding very unbalanced, comb-like trees

Does divide-and-conquer work?

• it should / could:

• tremendous speed gain to analyzing many, smaller problems: n time  x << time n x

1

• accuracy ~flat with respect to problem size

10000000

1000000

100000

10000

1000

100

10

1

1

• e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously

MP (random)

MP (clade)

NJ (random)

NJ (clade)

ML (random)

ML (clade)

= 4096 taxa

10 100

Size of subsampled tree

1000 10000

• from Bininda-Emonds and Stamatakis (2006)

Does divide-and-conquer work?

• it should / could:

• tremendous speed gain to analyzing many, smaller problems: n time  x << time n x

1

• accuracy ~flat with respect to problem size

• but these potential savings aren’t realized in full empirically …

Analyses of full 4096-taxon data set

NJ

MP

Method

ML-DCM3

ML (“standard hill climbing”)

Accuracy

(1 – d

S

)

0.857

Time taken

(seconds)

193

0.917

0.921

0.923

69 392

195 371

303 450

1.55x

• from Bininda-Emonds and Stamatakis (2006)

Analyses of full data set

Method

Accuracy

(1 – d

S

)

Time taken

(seconds)

0.857

193 NJ

MP

ML (“fast hill climbing”)

ML-DCM3

0.917

0.912

0.921

69 392

38 737

195 371

5.04x

ML (“standard hill climbing”) 0.923

303 450

• from Bininda-Emonds and Stamatakis (2006)

What’s the problem?

• bottleneck remains terminal global optimization step

• any excessive branch swapping will slow it down

• but branching swapping crucial for accuracy

• therefore, key is to provide as accurate of a starting tree as possible

• DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML)

Possible solutions: input

• improve accuracy of supertree by any of all of:

• increasing coverage by analyzing more subtrees with more overlap

• including several larger backbone trees

• deriving support values for subtrees

(e.g., fast bootstrapping) to enable weighted supertree analysis

• time is available for these steps

• also lend themsevles to parallelization

Possible solutions: analysis

• optimize global optimization step using constraints

• minimize amount of intensive branchswapping and tree surfing

• idea in DCM-based methods (“refinement” of

SCM supertree)

• supertree serves as starting tree and constraint tree

• crucial that supertree is accurate (NB accuracy ≠ resolution!)

• can also judge node support empirically and constraint only well supported nodes

What’s the answer?

• increasing technological sophistication will keep increasing range of conventional analyses

• analyses of ≤10000 taxa now feasible and usually without parallelization

• but, does a divide-and-conquer + supertree framework have a role beyond this?

• theoretically yes, but only by solving a number of challenges

Challenges for the future

• divide (+ subtree optimization) steps

• find subtree size(s) or combinations thereof that maximize speed and especially accuracy

• find optimal sampling scheme : clade and backbone vs cladelike sampling

• do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?

• alternatives to disk-covering methods?

• supertree step

• can we find a method that is fast like BUILD and accurate like

MR / O methods?  PhySIC???

• global-optimization step

• have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints

Bicliques

Taxa A B C D E F G

Genes 1 2 3 4 5 6 7 8

E

F

G

C

D

A

B

Genes

1 2 3 4 5 6 7 8

+ – – – – – – –

+ + – – – – – –

– + + + + – – –

– + + + + + – –

– + + + + – – –

– – – – – + + –

– – – – – – + + maximal biclique = K

4,3

Extending bicliques

• quasi-bicliques

• allow a certain proportion of missing edges

• as input for a supertree analysis

• essentially build bicliques of bicliques  bicliques that overlap for at least two taxa, but no sequences

1

A B C

Taxa

D E F G

2 3 4 5

Genes

6 7 8

Challenges for the future

• divide (+ subtree optimization) steps

• find subtree size(s) or combinations thereof that maximize speed and especially accuracy

• find optimal sampling scheme : clade and backbone vs cladelike sampling

• do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?

• alternatives to disk-covering methods?

• supertree step

• can we find a method that is fast like BUILD and accurate like

MR / O methods?  PhySIC???

• global-optimization step

• have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints

Download