PPTX - Tandy Warnow - University of Illinois at Urbana

Constructing the Tree of Life:
Divide-and-Conquer!
Tandy Warnow
University of Illinois at Urbana-Champaign
Phylogeny (evolutionary tree)
[Figure: evolutionary tree of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]
Phylogenies and Applications
Basic Biology:
How did life evolve?
Applications of phylogenies to:
protein structure and function
population genetics
human migrations
metagenomics
Figure from https://en.wikipedia.org/wiki/Common_descent
phylogenomics
gene 1:
ACTGCACACCG
ACTGC-CCCCG
AATGC-CCCCG
-CTGCACACGG
gene 2:
CTGAGCATCG
CTGAGC-TCG
ATGAGC-TC-
CTGA-CAC-G
gene 999:
AGCAGCATCGTG
AGCAGC-TCGTG
AGCAGC-TC-TG
C-TA-CACGGTG
gene 1000:
CAGGCACGCACGAA
AGC-CACGC-CATA
ATGGCACGC-C-TA
AGCTAC-CACGGAT
Species: Orangutan, Chimpanzee, Gorilla, Human
“gene” here refers to a portion of the genome (not a functional gene)
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
DNA Sequence Evolution
[Figure: sequences evolving along a timeline marked -3 mil yrs, -2 mil yrs, -1 mil yrs, and today. Sequences shown: AAGACTT, AAGGCCT, AGGGCAT, AGGGCAT, TAGCCCT, TAGCCCA, TGGACTT, TAGACTT, AGCACTT, AGCACAA, AGCGCTT]
Phylogenetic Tree Estimation
U = AGGGCAT
V = TAGCCCA
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: tree with leaves U, V, W, X, Y]
However…
U = AGGGCATGA
V = AGAT
W = TAGACTT
X = TGCACAA
Y = TGCGCTT
[Figure: tree with leaves U, V, W, X, Y]
Indels (insertions and deletions)
Events: deletion, substitution (mutation), insertion
…ACGGTGCAGTTACCA…
…ACCAGTCACCA… (after deletion and substitution)
…ACCAGTCACCTA… (after insertion)
Alignment:
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
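The indel example can be replayed in a few lines of Python. This is only an illustrative sketch: the helper names (`delete`, `substitute`, `insert`, `column_events`) and the event positions are assumptions chosen so that the result matches the sequences on the slide, and the ellipses around the sequences are dropped.

```python
def delete(seq, start, length):
    """Remove `length` characters beginning at index `start`."""
    return seq[:start] + seq[start + length:]

def substitute(seq, pos, base):
    """Replace the character at index `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]

def insert(seq, pos, bases):
    """Insert `bases` before index `pos`."""
    return seq[:pos] + bases + seq[pos:]

ancestor = "ACGGTGCAGTTACCA"
step = delete(ancestor, 2, 4)        # drop GGTG
step = substitute(step, 6, "C")      # T becomes C
descendant = insert(step, 10, "T")   # add a T near the end

# The pairwise alignment from the slide, one row per sequence.
row_ancestor = "ACGGTGCAGTTACC-A"
row_descendant = "AC----CAGTCACCTA"

def column_events(row_a, row_b):
    """Count what each alignment column records about the history."""
    counts = {"match": 0, "substitution": 0, "deletion": 0, "insertion": 0}
    for a, b in zip(row_a, row_b):
        if b == "-":
            counts["deletion"] += 1      # present in ancestor only
        elif a == "-":
            counts["insertion"] += 1     # present in descendant only
        elif a == b:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts

events = column_events(row_ancestor, row_descendant)
print(descendant, events)
```

Removing the gap characters from each alignment row gives back the two unaligned sequences, which is the basic consistency requirement for any alignment.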
The true multiple alignment
– Reflects historical substitution, insertion, and deletion
events
– Defined using transitive closure of pairwise alignments
computed on edges of the true tree
Phylogenetic Tree Estimation
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
Input: unaligned sequences
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree with leaves S1, S2, S3, S4 estimated from the alignment]
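As a rough illustration of Phase 2, the sketch below takes the four aligned rows and builds a tree with UPGMA on simple p-distances. The slides do not say which tree method produced the figure, so UPGMA is swapped in here only because it is short; the function names `p_distance` and `upgma` are hypothetical, and the alignment rows are assumed to be padded with trailing gaps to a common length of 21 columns.

```python
from itertools import combinations

# Aligned rows (trailing gaps are an assumption so all rows have equal length).
alignment = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}

def p_distance(a, b):
    """Fraction of mismatches over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(names, dist):
    """Build a rooted tree as nested tuples by repeatedly joining the closest clusters.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Size-weighted average of the distances to the two joined clusters.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

names = list(alignment)
dist = {
    frozenset((a, b)): p_distance(alignment[a], alignment[b])
    for a, b in combinations(names, 2)
}
tree = upgma(names, dist)
print(tree)
```

The methods listed on the following slides (maximum likelihood, Bayesian MCMC, and so on) use far richer models; the point here is only that the tree is computed from the alignment, so alignment error can propagate into the tree.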
Two-phase estimation
Alignment methods
• Clustal
• POY (and POY*)
• Probcons (and Probtree)
• Probalign
• MAFFT
• Muscle
• Di-align
• T-Coffee
• Prank (PNAS 2005, Science 2008)
• Opal (ISMB and Bioinf. 2007)
• FSA (PLoS Comp. Bio. 2009)
• Infernal (Bioinf. 2009)
• Etc.
Phylogeny methods
• Bayesian MCMC
• Maximum parsimony
• Maximum likelihood
• Neighbor joining
• FastME
• UPGMA
• Quartet puzzling
• Etc.
RAxML: heuristic for large-scale ML optimization
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: trees with an FN edge and an FP edge marked]
50% error rate
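One common way to compute these rates is to compare the internal edges of the two trees as leaf bipartitions. The sketch below assumes trees are written as nested tuples of leaf names and are compared as unrooted trees; the function names and the five-leaf example are illustrative and are not taken from the slide's figure.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial leaf bipartitions, one per internal edge of the unrooted tree."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(all_leaves) - 1:
            # Store the side that does not contain the anchor leaf, so each
            # edge has exactly one representation.
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits

def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree."""
    true_edges = bipartitions(true_tree)
    est_edges = bipartitions(estimated_tree)
    fn = len(true_edges - est_edges) / len(true_edges) if true_edges else 0.0
    fp = len(est_edges - true_edges) / len(est_edges) if est_edges else 0.0
    return fn, fp

true_tree = (("A", "B"), ("C", ("D", "E")))
estimated_tree = (("A", "B"), ("D", ("C", "E")))
print(error_rates(true_tree, estimated_tree))
```

In this toy example each tree has two internal edges and they share one, so one true edge is missing and one estimated edge is incorrect.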
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
Multiple Sequence Alignment (MSA):
a scientific grand challenge¹
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy
NP-hard problems and large datasets
Current methods do not provide good accuracy
Few methods can analyze even moderately large datasets
Many important applications besides phylogenetic estimation
¹ Frontiers in Massive Data Analysis, National Academies Press, 2013
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong, U Alberta
J. Leebens-Mack, U Georgia
N. Wickett, Northwestern
N. Matasci, iPlant
T. Warnow, UT-Austin
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
First publication: Wickett, Mirarab, et al., PNAS, 2014
Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute
multiple sequence alignments and trees
Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the
species tree
Upcoming Challenge:
• Multiple sequence alignment and gene tree estimation on 100,000 sequences
Computational Phylogenetics (2005)
Current methods can use months to
estimate trees on 1000 DNA sequences
Our objective:
More accurate trees and alignments
on 500,000 sequences in under a week
Courtesy of the Tree of Life web project, tolweb.org
Computational Phylogenetics (2015)
1997-2001: Distance-based phylogenetic
tree estimation from polynomial length
sequences
2012: Computing accurate trees (almost)
without multiple sequence alignments
2009-2015: Co-estimation of multiple
sequence alignments and gene trees, now
on 1,000,000 sequences in under two
weeks
2014-2015: Species tree estimation from
whole genomes in the presence of massive
gene tree heterogeneity
Courtesy of the Tree of Life web project, tolweb.org
Key technique: Divide-and-conquer!
• In general, small datasets with not too much
“heterogeneity” are easy to analyze with good
accuracy.
Divide-and-Conquer
• Divide-and-conquer is a basic algorithmic trick
for solving problems!
• Three steps:
– divide a dataset into two or more sets,
– solve the problem on each set, and
– combine solutions.
Sorting
10, 3, 54, 23, 75, 5, 1, 25
Objective: sort this list of integers from smallest to largest.
10, 3, 54, 23, 75, 5, 1, 25 should become
1, 3, 5, 10, 23, 25, 54, 75
MergeSort
10, 3, 54, 23, 75, 5, 1, 25
Step 1: Divide into two sublists
Step 2: Recursively sort each sublist
Step 3: Merge the two sorted sublists
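The three steps map directly onto a short recursive function. This is a minimal sketch: the name `merge_sort` is just a label, and the merge in Step 3 is delegated to the standard library's `heapq.merge` to keep the example short.

```python
from heapq import merge

def merge_sort(values):
    """Sort a list by dividing, sorting each half recursively, and merging."""
    if len(values) <= 1:
        return list(values)            # nothing left to divide
    mid = len(values) // 2
    left = merge_sort(values[:mid])    # Steps 1 and 2 on the first half
    right = merge_sort(values[mid:])   # Steps 1 and 2 on the second half
    return list(merge(left, right))    # Step 3: merge two sorted lists

print(merge_sort([10, 3, 54, 23, 75, 5, 1, 25]))
```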
Step 1: break into two lists
X: 10, 3, 54, 23
Y: 75, 5, 1, 25
Step 2: sort the two lists
X: 3, 10, 23, 54
Y: 1, 5, 25, 75
Step 3: merge the sorted lists
X: 3, 10, 23, 54   Y: 1, 5, 25, 75   Result: (empty)
X: 3, 10, 23, 54   Y: 5, 25, 75      Result: 1
X: 10, 23, 54      Y: 5, 25, 75      Result: 1, 3
X: 10, 23, 54      Y: 25, 75         Result: 1, 3, 5
X: 23, 54          Y: 25, 75         Result: 1, 3, 5, 10
X: 54              Y: 25, 75         Result: 1, 3, 5, 10, 23
X: 54              Y: 75             Result: 1, 3, 5, 10, 23, 25
X: (empty)         Y: 75             Result: 1, 3, 5, 10, 23, 25, 54
X: (empty)         Y: (empty)        Result: 1, 3, 5, 10, 23, 25, 54, 75
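The merging walk-through can be reproduced with an explicit two-list merge that records every intermediate state. This is a sketch for illustration only: `merge_with_trace` is a hypothetical name, and `pop(0)` is used for readability even though it is not the most efficient way to consume a Python list.

```python
def merge_with_trace(x, y):
    """Merge two sorted lists, recording (X, Y, Result) after every move."""
    x, y = list(x), list(y)
    result = []
    trace = [(tuple(x), tuple(y), tuple(result))]
    while x or y:
        # Take the smaller front element; if one list is empty, take from the other.
        if not y or (x and x[0] <= y[0]):
            result.append(x.pop(0))
        else:
            result.append(y.pop(0))
        trace.append((tuple(x), tuple(y), tuple(result)))
    return result, trace

merged, trace = merge_with_trace([3, 10, 23, 54], [1, 5, 25, 75])
for state_x, state_y, state_result in trace:
    print("X:", state_x, "Y:", state_y, "Result:", state_result)
```

Each iteration moves exactly one element, so merging two lists with eight elements in total takes eight moves.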
SATé and PASTA
Input: set of unaligned sequences
Output: multiple sequence alignment and
phylogenetic tree
• SATé: Liu et al., Science 2009 (up to 10,000
sequences) and Systematic Biology 2012 (up
to 50,000 sequences)
• PASTA: Mirarab et al., J. Comp Biol 2015 (up to
1,000,000 sequences)
Re-aligning on a tree
[Figure: a cycle of steps on a tree whose leaves are divided into subsets A, B, C, D]
– Decompose dataset (into subsets A, B, C, D)
– Align subproblems
– Merge subalignments (ABCD)
– Estimate ML tree on merged alignment
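The decomposition step can be sketched as follows. The slide does not say how the tree is cut, so this example assumes one plausible rule: repeatedly remove the edge that splits the leaves most evenly until every subset is small enough. The tree, the names `side_of_cut` and `decompose`, and the `max_size` parameter are all illustrative assumptions.

```python
from collections import defaultdict

def side_of_cut(edges, cut):
    """Nodes reachable from cut[0] once the edge `cut` is removed."""
    adj = defaultdict(set)
    for u, v in edges:
        if (u, v) != cut:
            adj[u].add(v)
            adj[v].add(u)
    seen, stack = {cut[0]}, [cut[0]]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def decompose(edges, leaf_names, max_size):
    """Split the leaves into subsets of at most max_size (max_size >= 1)."""
    if len(leaf_names) <= max_size or not edges:
        return [sorted(leaf_names)]
    best = None
    for cut in edges:
        side = side_of_cut(edges, cut)
        # Imbalance is 0 when the cut puts half of the leaves on each side.
        imbalance = abs(len(leaf_names) - 2 * len(leaf_names & side))
        if best is None or imbalance < best[0]:
            best = (imbalance, cut, side)
    _, cut, side = best
    near = [e for e in edges if e != cut and e[0] in side]
    far = [e for e in edges if e != cut and e[0] not in side]
    return (decompose(near, leaf_names & side, max_size)
            + decompose(far, leaf_names - side, max_size))

# A hypothetical unrooted tree on leaves a..h with internal nodes n1..n6.
edges = [("n1", "a"), ("n1", "b"), ("n2", "c"), ("n2", "d"),
         ("n3", "e"), ("n3", "f"), ("n4", "g"), ("n4", "h"),
         ("n5", "n1"), ("n5", "n2"), ("n6", "n3"), ("n6", "n4"),
         ("n5", "n6")]
subsets = decompose(edges, set("abcdefgh"), max_size=2)
print(subsets)
```

Because the cuts follow the tree, each subset contains closely related sequences, which is the kind of small, low-heterogeneity dataset that base alignment methods handle well.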
SATé and PASTA Algorithms
– Obtain initial alignment and estimated ML tree
– Use tree to compute new alignment
– Estimate ML tree on new alignment
Repeat until termination condition, and return the alignment/tree pair with the best ML score
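The control flow of this loop can be written down independently of any particular aligner or tree estimator. The sketch below is not the SATé or PASTA code: the function `iterate_alignment_and_tree`, its callable parameters, and the fixed iteration count used as the termination condition are all assumptions, and the demonstration uses toy stand-ins instead of real tools.

```python
def iterate_alignment_and_tree(sequences, initial, realign, estimate_tree,
                               score, max_iterations=3):
    """Alternate between re-aligning on the tree and re-estimating the tree.

    All four callables are placeholders supplied by the caller:
      initial(sequences)        -> (alignment, tree)
      realign(sequences, tree)  -> alignment
      estimate_tree(alignment)  -> tree
      score(alignment, tree)    -> number, higher is better (e.g. ML score)
    """
    alignment, tree = initial(sequences)
    best = (score(alignment, tree), alignment, tree)
    for _ in range(max_iterations):          # assumed termination condition
        alignment = realign(sequences, tree)
        tree = estimate_tree(alignment)
        current = score(alignment, tree)
        if current > best[0]:
            best = (current, alignment, tree)
    return best[1], best[2], best[0]

# Toy stand-ins so the loop can run: labels instead of real alignments/trees.
toy_scores = {"tree0": -120.0, "tree1": -100.0, "tree2": -105.0, "tree3": -101.0}
best_alignment, best_tree, best_score = iterate_alignment_and_tree(
    sequences=["S1", "S2", "S3", "S4"],
    initial=lambda seqs: ("aln0", "tree0"),
    realign=lambda seqs, tree: "aln" + str(int(tree[4:]) + 1),
    estimate_tree=lambda aln: "tree" + aln[3:],
    score=lambda aln, tree: toy_scores[tree],
)
print(best_alignment, best_tree, best_score)
```

Note that the best-scoring pair need not come from the last iteration, which is why the loop keeps track of the best pair seen so far.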
SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences
1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)
24-hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)
SATé-2: even more accurate!
(Liu et al., Syst Biol 61(1):90-106, 2012)
PASTA: even more accurate, and can scale
to 1,000,000 sequences
[Figure: RNASim. Tree Error (FN Rate), axis from 0.00 to 0.20, for datasets of 10000, 50000, 100000, and 200000 sequences. Methods: Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, Reference Alignment]
• Simulated RNASim datasets from 10K to 200K taxa
• Limited to 24 hours using 12 CPUs
• Not all methods could run (missing bars could not finish)
• PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)
Avian Phylogenomics Project
E Jarvis, HHMI
MTP Gilbert, Copenhagen
G Zhang, BGI
T. Warnow, UT-Austin
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
Plus many many other people…
First analysis (Jarvis, Mirarab, et al., Science 2014):
• Approx. 50 species, 14,000 loci
• Used SATé for gene sequence alignment and tree estimation
Next analysis will have more species, and will use PASTA
1KP: Thousand Transcriptome Project
G. Ka-Shu Wong, U Alberta
J. Leebens-Mack, U Georgia
N. Wickett, Northwestern
N. Matasci, iPlant
T. Warnow, UT-Austin
S. Mirarab, UT-Austin
N. Nguyen, UT-Austin
First analysis (Wickett, Mirarab, et al., PNAS, 2014)
• About 100 species and 800 loci
• Used SATé
Next analysis will be much larger and more difficult:
• Multiple sequence alignment and gene tree estimation on 100,000 sequences, many datasets highly fragmentary
• Will use PASTA and UPP (Nguyen et al., Genome Biology 2015)
“Boosters”, or “Meta-Methods”
• Meta-methods use divide-and-conquer
and iteration (or other techniques) to
“boost” the performance of base methods
(phylogeny reconstruction, alignment
estimation, etc)
Base method M → Meta-method → M*
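The idea of turning a base method M into a boosted method M* can be expressed as a higher-order function. This is a sketch of the divide-and-conquer part only (the iteration mentioned above is left out), and every name in it (`boost`, `base_sort`, `halve`, `small_enough`) is hypothetical. Sorting stands in for the base problem to keep the example small.

```python
from heapq import merge

def boost(base_method, divide, combine, small_enough):
    """Wrap a base method M in divide-and-conquer, returning M*."""
    def boosted(problem):
        if small_enough(problem):
            return base_method(problem)          # easy case: use M directly
        parts = divide(problem)
        return combine([boosted(part) for part in parts])
    return boosted

def base_sort(items):
    """A base method that only copes with small inputs."""
    if len(items) > 4:
        raise ValueError("base method only handles up to 4 items")
    return sorted(items)

def halve(items):
    mid = len(items) // 2
    return [items[:mid], items[mid:]]

m_star = boost(
    base_sort,
    divide=halve,
    combine=lambda parts: list(merge(*parts)),
    small_enough=lambda items: len(items) <= 4,
)
print(m_star([10, 3, 54, 23, 75, 5, 1, 25]))
```

The base method is never called on an input larger than it can handle, while M* accepts inputs of any size.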
Main Points
• Innovative algorithm design can improve
accuracy as well as reduce running time.
• Divide-and-conquer is a key algorithmic
technique that has dramatically changed the
toolkit for biologists!
Acknowledgments
Funding:
Guggenheim Foundation, Packard Foundation, NSF,
Microsoft Research New England, David Bruton Jr.
Centennial Professorship, Grainger Foundation, and
TACC (Texas Advanced Computing Center)
Avian Phylogenomics Project
E Jarvis, HHMI
MTP Gilbert, Copenhagen
G Zhang, BGI
T. Warnow, UT-Austin
S. Mirarab, UT-Austin
Md. S. Bayzid, UT-Austin
Plus many many other people…
• Jarvis, Mirarab, et al., Science 2014
Major challenge:
• Massive gene tree heterogeneity consistent with incomplete lineage sorting
• Very poor resolution in the 14,000 gene trees
• Standard coalescent-based species tree estimation methods had poor
accuracy
Solution:
• New technique to improve coalescent-based species tree estimation (statistical binning,
Mirarab et al., Science 2014)