Constructing the Tree of Life: Divide-and-Conquer!
Tandy Warnow
University of Illinois at Urbana-Champaign

Phylogeny (evolutionary tree)
[Figure: tree on Orangutan, Gorilla, Chimpanzee, Human. From the Tree of Life Website, University of Arizona]

Phylogenies and Applications
Basic Biology: How did life evolve?
Applications of phylogenies to:
• protein structure and function
• population genetics
• human migrations
• metagenomics
Figure from https://en.wikipedia.org/wiki/Common_descent

Phylogenomics
[Figure: sequence alignments for gene 1, gene 2, gene 999, gene 1000, each on Orangutan, Chimpanzee, Gorilla, Human]
“Gene” here refers to a portion of the genome (not a functional gene).
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome.

DNA Sequence Evolution
[Figure: DNA sequences (AAGACTT, AAGGCCT, TGGACTT, ...) changing along a tree; time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today]

Phylogenetic Tree Estimation
[Figure: sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT at leaves U, V, W, X, Y, and a tree on U, V, W, X, Y]

However…
[Figure: the same leaves with sequences of unequal length: AGGGCATGA, AGAT, TAGACTT, TGCACAA, TGCGCTT]

Indels (insertions and deletions)
Deletion mutation:
…ACGGTGCAGTTACCA…
…ACCAGTCACCA…

Deletion, substitution, and insertion:
…ACGGTGCAGTTACCA…
…ACCAGTCACCTA…

…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges of the true tree

Phylogenetic Tree Estimation
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment
Aligned sequences S1 to S4, one per row:
-AGGCTATCACCTGACCTCCA
TAG-CTATCAC--GACCGC--
TAG-CT-------GACCGC--
-------TCAC--GACCGACA

Phase 2: Construct tree
[Figure: tree with leaves S1, S2, S3, S4 constructed from the alignment]

Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.
RAxML: heuristic for large-scale ML optimization

Quantifying Error
• FN: false negative (missing edge)
• FP: false positive (incorrect edge)
[Figure: example trees with FN and FP edges marked; 50% error rate]

[Figure: 1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)]

Multiple Sequence Alignment (MSA): a scientific grand challenge [1]
Unaligned input:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Aligned output:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy
• NP-hard problems and large datasets
• Current methods do not provide good accuracy
• Few methods can analyze even moderately large datasets
• Many important applications besides phylogenetic estimation
[1] Frontiers in Massive Data Analysis, National Academies Press, 2013

1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta), J. Leebens-Mack (U Georgia), N.
Wickett (Northwestern), N. Matasci (iPlant), T. Warnow (UT-Austin), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin)
First publication: Wickett, Mirarab, et al., PNAS, 2014
• Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees
• Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree
Upcoming Challenge:
• Multiple sequence alignment and gene tree estimation on 100,000 sequences

Computational Phylogenetics (2005)
Current methods can use months to estimate trees on 1000 DNA sequences
Our objective: More accurate trees and alignments on 500,000 sequences in under a week
Courtesy of the Tree of Life web project, tolweb.org

Computational Phylogenetics (2015)
• 1997-2001: Distance-based phylogenetic tree estimation from polynomial length sequences
• 2012: Computing accurate trees (almost) without multiple sequence alignments
• 2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks
• 2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity
Courtesy of the Tree of Life web project, tolweb.org

Key technique: Divide-and-conquer!
• In general, small datasets with not too much “heterogeneity” are easy to analyze with good accuracy.

Divide-and-Conquer
• Divide-and-conquer is a basic algorithmic trick for solving problems!
• Three steps:
– divide a dataset into two or more sets,
– solve the problem on each set, and
– combine solutions.
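The three steps can be sketched in Python using merge sort, a standard divide-and-conquer sorting algorithm. This is a minimal illustration only; the function names `merge` and `merge_sort` are chosen here and are not from the talk.

```python
def merge(x, y):
    """Combine step: merge two already-sorted lists into one sorted list."""
    result = []
    i = j = 0
    # Repeatedly move the smaller front element of x or y to the result.
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            result.append(x[i])
            i += 1
        else:
            result.append(y[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    result.extend(x[i:])
    result.extend(y[j:])
    return result


def merge_sort(items):
    """Sort a list by divide-and-conquer."""
    if len(items) <= 1:
        return list(items)              # base case: nothing to divide
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # divide, then solve each half
    right = merge_sort(items[mid:])
    return merge(left, right)           # combine the two solutions


if __name__ == "__main__":
    print(merge_sort([10, 3, 54, 23, 75, 5, 1, 25]))
    # prints [1, 3, 5, 10, 23, 25, 54, 75]
```

Each half is small enough to be solved by the same procedure, and all the real work happens in the combine step.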
Sorting
10 3 54 23 75 5 1 25
Objective: sort this list of integers from smallest to largest.
10, 3, 54, 23, 75, 5, 1, 25 should become 1, 3, 5, 10, 23, 25, 54, 75

MergeSort
10 3 54 23 75 5 1 25
Step 1: Divide into two sublists
Step 2: Recursively sort each sublist
Step 3: Merge the two sorted sublists

Step 1: break into two lists
X: 10 3 54 23
Y: 75 5 1 25

Step 2: sort the two lists
X: 3 10 23 54
Y: 1 5 25 75

Step 3: merge the sorted lists
X: 3 10 23 54    Y: 1 5 25 75    Result: (empty)
Merging (cont.)
X: 3 10 23 54    Y: 5 25 75      Result: 1
X: 10 23 54      Y: 5 25 75      Result: 1 3
X: 10 23 54      Y: 25 75        Result: 1 3 5
X: 23 54         Y: 25 75        Result: 1 3 5 10
X: 54            Y: 25 75        Result: 1 3 5 10 23
X: 54            Y: 75           Result: 1 3 5 10 23 25
X: (empty)       Y: 75           Result: 1 3 5 10 23 25 54
X: (empty)       Y: (empty)      Result: 1 3 5 10 23 25 54 75

SATé and PASTA
Input: set of unaligned sequences
Output: multiple sequence alignment and phylogenetic tree
• SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences)
• PASTA: Mirarab et al., J.
Comp Biol 2015 (up to 1,000,000 sequences)

Re-aligning on a tree
[Figure: cycle starting from a tree with leaf subsets A, B, C, D]
• Decompose dataset (into subsets A, B, C, D)
• Align subproblems
• Merge subalignments (into ABCD)
• Estimate ML tree on merged alignment

SATé and PASTA Algorithms
• Obtain initial alignment and estimated ML tree
• Use tree to compute new alignment
• Estimate ML tree on new alignment
• Repeat until termination condition, and return the alignment/tree pair with the best ML score

SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences
[Figure: 1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)]
24-hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)

SATé-2: even more accurate! (Liu et al., Syst Biol 61(1):90-106, 2012)

PASTA: even more accurate, and can scale to 1,000,000 sequences
[Figure: RNASim. Tree error (FN rate) by number of sequences (10000, 50000, 100000, 200000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]
• Simulated RNASim datasets from 10K to 200K taxa
• Limited to 24 hours using 12 CPUs
• Not all methods could run (missing bars could not finish)
• PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)

Avian Phylogenomics Project
E Jarvis (HHMI), MTP Gilbert (Copenhagen), G Zhang (BGI), T. Warnow (UT-Austin), S. Mirarab (UT-Austin), Md. S. Bayzid (UT-Austin), plus many many other people…
First analysis (Jarvis, Mirarab, et al., Science 2014):
• Approx.
50 species, 14,000 loci
• Used SATé for gene sequence alignment and tree estimation
Next analysis will have more species, and will use PASTA

1KP: Thousand Transcriptome Project
First analysis (Wickett, Mirarab, et al., PNAS, 2014)
• About 100 species and 800 loci
• Used SATé
Next analysis will be much larger and more difficult:
• Multiple sequence alignment and gene tree estimation on 100,000 sequences, many datasets highly fragmentary
• Will use PASTA and UPP (Nguyen et al., Genome Biology 2015)

“Boosters”, or “Meta-Methods”
• Meta-methods use divide-and-conquer and iteration (or other techniques) to “boost” the performance of base methods (phylogeny reconstruction, alignment estimation, etc.)
[Diagram: Base method M → Meta-method → M*]

Main Points
• Innovative algorithm design can improve accuracy as well as reduce running time.
• Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!

Acknowledgments
Funding: Guggenheim Foundation, Packard Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, Grainger Foundation, and TACC (Texas Advanced Computing Center)

Avian Phylogenomics Project
E Jarvis (HHMI), MTP Gilbert (Copenhagen), G Zhang (BGI), T. Warnow (UT-Austin), S. Mirarab (UT-Austin), Md. S.
Bayzid (UT-Austin), plus many many other people…
• Jarvis, Mirarab, et al., Science 2014
Major challenge:
• Massive gene tree heterogeneity consistent with incomplete lineage sorting
• Very poor resolution in the 14,000 gene trees
• Standard coalescent-based species tree estimation methods had poor accuracy
Solution:
• New technique to improve coalescent-based species tree estimation (statistical binning, Mirarab et al., Science 2014)
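
Accuracy in these slides is reported as tree error: FN (missing edges) and FP (incorrect edges). A minimal Python sketch of that computation follows. It assumes each tree is already given as a set of bipartitions of the leaf set (one per internal edge), and that each rate is the error count divided by the number of internal edges in the corresponding tree. The helper names `bipartition` and `error_rates` and the five-leaf example are hypothetical, chosen only for illustration.

```python
def bipartition(side, leaves):
    """Canonical form of the split induced by an internal edge.

    An edge splits the leaves into two sides; we always store the side
    that does NOT contain the smallest leaf name, so both descriptions
    of the same edge compare equal.
    """
    side = frozenset(side)
    leaves = frozenset(leaves)
    anchor = min(leaves)
    return leaves - side if anchor in side else side


def error_rates(true_edges, estimated_edges):
    """Return (FN rate, FP rate) for an estimated tree against the true tree."""
    true_edges = set(true_edges)
    estimated_edges = set(estimated_edges)
    fn = len(true_edges - estimated_edges)   # true edges missing from the estimate
    fp = len(estimated_edges - true_edges)   # estimated edges not in the true tree
    fn_rate = fn / len(true_edges) if true_edges else 0.0
    fp_rate = fp / len(estimated_edges) if estimated_edges else 0.0
    return fn_rate, fp_rate


if __name__ == "__main__":
    leaves = {"U", "V", "W", "X", "Y"}
    # Hypothetical true tree ((U,V),W,(X,Y)) and estimate ((U,W),V,(X,Y)).
    true_tree = {bipartition({"U", "V"}, leaves), bipartition({"X", "Y"}, leaves)}
    estimate = {bipartition({"U", "W"}, leaves), bipartition({"X", "Y"}, leaves)}
    print(error_rates(true_tree, estimate))  # prints (0.5, 0.5)
```

For two fully resolved trees on the same leaves the FN and FP counts are equal, so the two rates coincide, as in this example.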