PPT - Tandy Warnow - University of Illinois at Urbana

Using Divide-and-Conquer to
Construct the Tree of Life
Tandy Warnow
University of Illinois at Urbana-Champaign
Phylogeny (evolutionary tree)
From the Tree of Life Website,
University of Arizona
phylogenomics
Species: Orangutan, Gorilla, Chimpanzee, Human
gene 1
ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG
gene 2
CTGAGCATCG
CTGAGC-TCG
ATGAGC-TC-
CTGA-CAC-G
…
gene 999
AGCAGCATCGTG
AGCAGC-TCGTG
AGCAGC-TC-TG
C-TA-CACGGTG
gene 1000
CAGGCACGCACGAA
AGC-CACGC-CATA
ATGGCACGC-C-TA
AGCTAC-CACGGAT
Two dimensions: number of genes and number of species
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
“gene” here refers to a portion of the genome (not a functional gene)
Phylogenomic pipeline
• Select taxon set and markers
• Gather and screen sequence data, possibly identify orthologs
• Compute multiple sequence alignments for each locus, and construct gene trees
• Compute species tree or network:
– Combine the estimated gene trees, OR
– Estimate a tree from a concatenation of the multiple sequence alignments
• Get statistical support on each branch (e.g., bootstrapping)
• Estimate dates on the nodes of the phylogeny
• Use species tree with branch support and dates to understand biology
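The concatenation option in the pipeline can be illustrated with a minimal sketch. It assumes each locus alignment is a dict mapping a taxon name to an aligned string. The function name `concatenate_alignments` and the choice to pad missing taxa with gaps are assumptions made for this illustration, not part of any tool named in these slides.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-locus alignments into one supermatrix.

    gene_alignments: list of dicts {taxon: aligned sequence}. Within one
    dict all rows must have the same length. Taxa missing from a locus
    are padded with gaps (an assumption made for this sketch).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        widths = {len(row) for row in aln.values()}
        if len(widths) != 1:
            raise ValueError("rows of one alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}


# Toy data (hypothetical): two loci, the second one lacks Gorilla.
gene1 = {"Human": "ACTGC", "Chimpanzee": "ACTGC", "Gorilla": "AATGC"}
gene2 = {"Human": "CTG-A", "Chimpanzee": "CTGGA"}
matrix = concatenate_alignments([gene1, gene2])
for taxon, row in matrix.items():
    print(taxon, row)
```

Every row of the result has the same length, so the supermatrix can be handed to a single tree estimation step; the alternative route in the pipeline keeps the loci separate and combines the estimated gene trees instead.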
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong (U Alberta)
J. Leebens-Mack (U Georgia)
N. Wickett (Northwestern)
N. Matasci (iPlant)
T. Warnow (UT-Austin)
S. Mirarab (UT-Austin)
N. Nguyen (UT-Austin)
First publication: Wickett, Mirarab, et al., PNAS, 2014
Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple
sequence alignments and trees
Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree
Upcoming Challenge:
• Multiple sequence alignment and gene tree estimation on 100,000
sequences.
• Many sequences are highly fragmentary.
Multiple Sequence Alignment (MSA): a scientific grand challenge¹
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
Aligned sequences:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
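A small sketch may help make the definition of an alignment concrete: every row has the same length, and deleting the gap characters from a row gives back the original sequence. The check below uses the example sequences from this slide; the function name `is_valid_alignment` is a hypothetical helper introduced here, and the check says nothing about alignment accuracy.

```python
def is_valid_alignment(unaligned, aligned, gap="-"):
    """Return True if `aligned` is a well-formed alignment of `unaligned`.

    Both arguments are dicts {name: sequence}. The check requires the same
    names, equal row lengths, and that removing gaps recovers each input.
    """
    if set(unaligned) != set(aligned):
        return False
    if len({len(row) for row in aligned.values()}) > 1:
        return False
    return all(aligned[name].replace(gap, "") == seq
               for name, seq in unaligned.items())


unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "Sn": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "Sn": "-------TCAC--GACCGACA",
}
print(is_valid_alignment(unaligned, aligned))
```

Many different gap placements pass this check; the hard part of MSA, and the reason it is listed as a grand challenge, is choosing the placement that reflects the true evolutionary history.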
Divide-and-Conquer
• Divide-and-conquer is a basic algorithmic trick for
solving problems!
• Three steps:
– divide a dataset into two or more sets,
– solve the problem on each set, and
– combine solutions.
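The three steps can be sketched with merge sort, used here only as a familiar stand-in for the pattern (it is not a phylogenetic method, and the function name is chosen for this illustration).

```python
def merge_sort(items):
    """Sort a list using divide-and-conquer."""
    if len(items) <= 1:
        # Base case: a list of zero or one items is already sorted.
        return list(items)

    # Step 1: divide the dataset into two sets.
    middle = len(items) // 2
    # Step 2: solve the problem on each set.
    left = merge_sort(items[:middle])
    right = merge_sort(items[middle:])

    # Step 3: combine the two sorted solutions.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

In the alignment setting described later in the talk, the "divide" step uses a tree to pick the subsets, and the "combine" step merges subset alignments rather than sorted lists.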
Computational Phylogenetics (2005)
Current methods can take months to
estimate trees on 1000 DNA sequences
Our objective:
More accurate trees and alignments on
500,000 sequences in under a week
Courtesy of the Tree of Life web project, tolweb.org
Computational Phylogenetics (2015)
2012: Computing accurate trees (almost)
without multiple sequence alignments
2009-2015: Co-estimation of multiple
sequence alignments and gene trees, now
on 1,000,000 sequences in under two
weeks
2014-2015: Species tree estimation from
whole genomes in the presence of
massive gene tree heterogeneity
Deletion, substitution, and insertion events:
…ACGGTGCAGTTACCA…
…ACCAGTCACCTA…
Aligned:
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment
– Reflects historical substitution, insertion, and deletion events
– Defined using transitive closure of pairwise alignments computed on edges
of the true tree
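As a rough illustration of how a pairwise alignment records these events, the sketch below labels each column of the two aligned rows from this slide. It assumes the first row is the ancestor and the second the descendant, which the slide text does not state explicitly; the function name `classify_columns` is hypothetical.

```python
from collections import Counter


def classify_columns(ancestor_row, descendant_row, gap="-"):
    """Label each column of a pairwise alignment with the event it implies."""
    if len(ancestor_row) != len(descendant_row):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for a, d in zip(ancestor_row, descendant_row):
        if a == gap and d == gap:
            labels.append("gap-only")
        elif d == gap:
            labels.append("deletion")      # present in ancestor, lost later
        elif a == gap:
            labels.append("insertion")     # absent in ancestor, gained later
        elif a == d:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


# Assumption: top row is the ancestor, bottom row the descendant.
top = "ACGGTGCAGTTACC-A"
bottom = "AC----CAGTCACCTA"
counts = Counter(classify_columns(top, bottom))
print(counts)
```

Swapping the roles of the two rows turns every deletion into an insertion and vice versa, which is why the true alignment is defined relative to the true tree.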
Phylogenetic Tree Estimation
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Phase 2: Construct tree
(Figure: estimated tree on the leaves S1, S2, S3, S4)
Quantifying Error
• FN: false negative (missing edge)
• FP: false positive (incorrect edge)
(Figure: true tree and estimated tree with one FN edge and one FP edge marked; 50% error rate)
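The FN and FP rates can be computed by comparing the internal edges (bipartitions) of the two trees. The sketch below assumes trees are written as nested tuples of leaf names and compared as unrooted trees; the helper names and the two toy trees are hypothetical and are not the trees from the slide's figure.

```python
def leaf_names(tree):
    """Set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_names(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without the anchor leaf."""
    all_leaves = leaf_names(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only internal edges: both sides need at least two leaves.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def fn_fp_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    if not true_splits or not est_splits:
        raise ValueError("both trees need at least one internal edge")
    fn = len(true_splits - est_splits) / len(true_splits)
    fp = len(est_splits - true_splits) / len(est_splits)
    return fn, fp


# Toy trees (hypothetical): they share the edge separating S4 and S5.
true_tree = (("S1", "S2"), "S3", ("S4", "S5"))
estimated_tree = (("S1", "S3"), "S2", ("S4", "S5"))
print(fn_fp_rates(true_tree, estimated_tree))
```

When both trees are fully resolved they have the same number of internal edges, so the FN and FP rates coincide, as they do for this toy pair.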
Evaluation of MSA methods (Science 2009)
Alignment methods
• Clustal
• MAFFT
• Muscle
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
Datasets:
1000-taxon simulated datasets under varying rates of evolution
Biological datasets with structural alignments
Phylogeny estimation:
• Maximum likelihood using RAxML
Liu et al., Science 2009
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
Observations
• Large datasets can be easy to align with high
accuracy if there is not too much heterogeneity.
• Poor alignments produce poor trees.
Observations
• Highly accurate alignments are easy if the dataset is
not too heterogeneous.
• We can use phylogenies to decompose datasets into
smaller, less heterogeneous datasets.
Re-aligning on a tree
(Figure: one cycle of re-aligning on a tree)
• Decompose dataset into subsets A, B, C, D, using the tree
• Align subproblems A, B, C, D
• Merge subalignments into ABCD
• Estimate ML tree on merged alignment
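One possible way to carry out the "decompose" step is sketched below: repeatedly remove the edge that splits the leaves most evenly until every subset is small enough. The even-split rule, the `max_size` parameter, the function names, and the toy tree are all assumptions made for this illustration; the slides do not specify the decomposition rule.

```python
def build_adjacency(edges):
    """Adjacency sets for an unrooted tree given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def component(adj, start, blocked):
    """Nodes on the `start` side once the edge (start, blocked) is removed."""
    seen = {start, blocked}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    seen.discard(blocked)
    return seen


def decompose(adj, leaves, max_size):
    """Split the leaf set into subsets of at most max_size (max_size >= 1)."""
    here = sorted(node for node in adj if node in leaves)
    if len(here) <= max_size:
        return [here]
    best = None
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each edge once
                side = component(adj, u, v)
                count = sum(1 for node in side if node in leaves)
                imbalance = abs(2 * count - len(here))
                if best is None or imbalance < best[0]:
                    best = (imbalance, side)
    side_a = best[1]
    side_b = set(adj) - side_a
    subsets = []
    for part in (side_a, side_b):
        sub_adj = {node: adj[node] & part for node in part}
        subsets.extend(decompose(sub_adj, leaves, max_size))
    return subsets


# Toy tree (hypothetical): eight leaves A..H, lowercase names are internal nodes.
edges = [("A", "ab"), ("B", "ab"), ("C", "cd"), ("D", "cd"),
         ("E", "ef"), ("F", "ef"), ("G", "gh"), ("H", "gh"),
         ("ab", "left"), ("cd", "left"), ("ef", "right"), ("gh", "right"),
         ("left", "right")]
subsets = decompose(build_adjacency(edges), set("ABCDEFGH"), max_size=2)
print(sorted(subsets))
```

Because the subsets follow the tree, sequences that are close in the tree end up together, which matches the observation that closely related sequences are easier to align accurately. This sketch recomputes components for every edge and is quadratic in the tree size, so it is meant for reading, not for large datasets.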
SATé and PASTA
Input: set of unaligned sequences
Output: multiple sequence alignment and tree
• SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences)
• PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)
SATé and PASTA Algorithms
• Obtain initial alignment and estimated ML tree
• Use tree to compute new alignment
• Estimate ML tree on new alignment
• Repeat until termination condition, and return the alignment/tree pair with the best ML score
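The loop structure described above can be sketched as follows. This is only a skeleton under stated assumptions: the four callables are supplied by the caller, a fixed iteration count stands in for the termination condition, and the toy stand-ins at the bottom exist solely so the loop can run. None of this is the actual SATé or PASTA code.

```python
def coestimate(sequences, initial_alignment, estimate_tree, realign,
               ml_score, max_iterations=3):
    """Iterate between alignment and tree; return the best-scoring pair.

    All four callables are hypothetical hooks; max_iterations is an
    assumed termination condition.
    """
    alignment = initial_alignment(sequences)
    tree = estimate_tree(alignment)
    best = (ml_score(alignment, tree), alignment, tree)
    for _ in range(max_iterations):
        alignment = realign(sequences, tree)   # use tree to compute new alignment
        tree = estimate_tree(alignment)        # estimate tree on new alignment
        score = ml_score(alignment, tree)
        if score > best[0]:
            best = (score, alignment, tree)
    return best


# Toy stand-ins below: NOT real alignment, tree, or likelihood code.
def toy_initial_alignment(seqs):
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def toy_estimate_tree(alignment):
    return tuple(sorted(alignment))


def toy_realign(seqs, tree):
    width = max(len(s) for s in seqs.values())
    return {name: seqs[name].rjust(width, "-") for name in tree}


def toy_score(alignment, tree):
    # Counts identical columns; a placeholder for an ML score.
    return sum(len(set(column)) == 1 for column in zip(*alignment.values()))


seqs = {"S1": "ACGT", "S2": "CGT"}
score, alignment, tree = coestimate(seqs, toy_initial_alignment,
                                    toy_estimate_tree, toy_realign, toy_score)
print(score, alignment, tree)
```

Keeping the best pair seen so far, rather than the last one, reflects the final bullet: later iterations are not guaranteed to improve the score.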
SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
24-hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-2: even more accurate!
(Liu et al., Syst Biol 61(1):90-106, 2012)
PASTA: even more accurate, and can scale to
1,000,000 sequences
(Figure: RNASim. Tree error (FN rate, axis from 0.00 to 0.20) at 10,000, 50,000, 100,000, and 200,000 sequences for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment.)
• Simulated RNASim datasets from 10K to 200K taxa
• Limited to 24 hours using 12 CPUs
• Not all methods could run (missing bars could not finish)
• PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)
Main Points
• Innovative algorithm design can improve accuracy
as well as reduce running time.
• Divide-and-conquer is a key algorithmic technique
that has dramatically changed the toolkit for
biologists!
Acknowledgments
Funding: HHMI (to Siavash Mirarab), Guggenheim Foundation