PPTX - Tandy Warnow

CS 598 AGB
Supertrees
Tandy Warnow
Today’s Material
• Supertree construction: given set of trees on
subsets of S (the full set of taxa), construct
tree on the full set S of taxa.
• Textbook material: Chapter 5 (Aho, Sagiv,
Szymanski, and Ullman) and Chapter 8.1-8.3.
Computing a tree from a set of rooted
triplet trees
• Constructing a rooted tree from a set of
compatible rooted triplet trees. Equivalently,
test compatibility of a set of rooted triplet
trees.
• Recursive algorithm by Aho, Sagiv, Szymanski,
and Ullman
• Chapter 5.1
ASSU algorithm
Given a set X of k triplet trees on n species:
• If n=1, return the tree consisting of that single leaf.
• If n>1, construct a graph with one vertex per species,
and an edge (a,b) for each triplet ab|c.
• If the graph has a single component, reject (the
set is not compatible); else recurse on each
component (using the triplets whose leaves all lie in
that component), and return the tree formed by making
the rooted trees on the components each a
subtree off the root of the returned tree.
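The recursion above can be sketched in Python. This is a minimal illustration, not the textbook's code: the function name `assu`, the encoding of a triplet ab|c as the tuple `(a, b, c)`, and the nested-tuple output format are all assumptions made for this example. Connected components are found with a small union-find.

```python
from collections import defaultdict

def assu(species, triplets):
    """Return a rooted tree (nested tuples) on `species` that displays every
    triplet, or None if the triplets are incompatible.

    Each triplet (a, b, c) stands for ab|c. Assumed encoding, for illustration.
    """
    species = sorted(species)
    triplets = list(triplets)
    if len(species) == 1:
        return species[0]

    # Union-find over the species: one edge (a, b) per triplet ab|c.
    parent = {s: s for s in species}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _c in triplets:
        parent[find(a)] = find(b)

    components = defaultdict(list)
    for s in species:
        components[find(s)].append(s)

    # A single component means no valid split at the root: reject.
    if len(components) == 1:
        return None

    subtrees = []
    for comp in components.values():
        members = set(comp)
        # Keep only triplets whose three leaves all lie in this component.
        inside = [t for t in triplets if set(t) <= members]
        child = assu(comp, inside)
        if child is None:
            return None
        subtrees.append(child)
    return tuple(subtrees)
```

For example, `assu("abcd", [("a", "b", "c"), ("c", "d", "a")])` builds a graph with components {a,b} and {c,d}, while the pair ab|c and ac|b connects all of a, b, c and is rejected.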
Why does it work?
If the set X of triplet trees is compatible,
• Then there is a rooted tree T with at least two subtrees off the root,
T1 and T2.
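• Any two leaves a,b in different subtrees off the root cannot be in a
triplet ab|c, since their least common ancestor is the root.
• Hence the graph formed for the set of triplet trees has no edge between
a leaf of T1 and a leaf of T2, and so cannot be connected.
• Therefore the graph formed for the set of triplet trees must have at
least two components.
• This argument applies recursively to each component, using the triplet
trees in X whose leaves all lie in that component.
• Hence the algorithm returns a tree on which all the triplet trees
agree.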
If the set X of triplet trees is not compatible, it is not hard to show that
the algorithm will detect this (proof by induction on the number of
taxa).
Compatibility of rooted trees
• Suppose the input is a set X of rooted trees
(not necessarily triplet trees).
• Can we use ASSU to determine if X is
compatible, and to compute a compatibility
supertree for X?
• Solution: YES, just encode each rooted tree in
X by its set of rooted triplet trees (or some
subset of these that suffices to define each
tree in X), and then run ASSU.
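A small sketch of the encoding step, assuming rooted trees are written as nested tuples with strings as leaves (a representation chosen for this example only). The hypothetical helper `rooted_triplets` lists every resolved triplet ab|c of a tree as `(a, b, c)`. The union of these sets over all trees in X is what would be handed to an ASSU implementation.

```python
from itertools import combinations

def leaves(tree):
    """Leaf labels of a nested-tuple tree, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return [x for child in tree for x in leaves(child)]

def rooted_triplets(tree):
    """Set of (a, b, c), meaning ab|c, for every resolved triplet in `tree`.

    At each internal node, two leaves a, b below the same child and a leaf c
    below a different child form the triplet ab|c. Three leaves below three
    different children of a polytomy are unresolved and produce nothing.
    """
    result = set()
    if not isinstance(tree, tuple):
        return result
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        result |= rooted_triplets(child)
        outside = [x for j, ls in enumerate(child_leaves) if j != i for x in ls]
        for a, b in combinations(sorted(child_leaves[i]), 2):
            for c in outside:
                result.add((a, b, c))
    return result
```

This emits all triplets of each tree, which is more than needed; the slide notes that a subset defining the tree suffices.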
Summary so far
• Testing compatibility of an arbitrary set of rooted trees (and
constructing compatibility supertree): polynomial time,
using ASSU
• Testing compatibility of an arbitrary set of unrooted trees
(and constructing compatibility supertree): NP-complete!
• Special cases for testing compatibility of unrooted trees:
– Input has a tree on every four taxa. (Solution: Use All Quartets
Method to test for compatibility)
– Input trees all have a common species, A. (Solution: root all the
input trees using leaf A, and then run ASSU.)
– Input has all the “short quartets” of a tree. (Solution: Use Dyadic
Closure to test for compatibility, see Chapter 13)
Supertree Methods
• Most of the time, the input is a set of
unrooted source trees that is incompatible.
• All the methods described so far only return
compatibility supertrees.
• How can we construct supertrees from
incompatible source trees?
Supertree estimation
Challenges:
• Compatibility of unrooted trees is NP-complete (therefore,
even if the source trees are correct, supertree estimation
is hard)
• Estimated source trees have error
Advantages:
• Estimating individual gene trees can be
computationally feasible (compared to the
combined analysis of many genes)
• Can use different types of data for each source
tree
Many Supertree Methods
• MRP: Matrix Representation with Parsimony
(most commonly used and most accurate)
• weighted MRP
• MRF
• MRD
• Robinson-Foulds Supertrees
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• QMC
• Q-imputation
• SDM
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...
Supertree Optimization Problems
• MRP (Matrix Representation with Parsimony)
• MRL (Matrix Representation with Likelihood)
• RFS (Robinson-Foulds Supertree)
• MQDS (Minimum Quartet Distance Supertree)
All of these problems are NP-hard. Some of the methods have
good heuristics.
It is easy to see that if the input source trees are
compatible, then MRP, RFS, and MQDS return a
compatibility tree.
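The slides do not spell out the matrix representation used by MRP and MRL, so the following is a sketch of the commonly described encoding, stated here as an assumption: each internal edge of each source tree becomes one column, with 1 for taxa on one side of the edge, 0 for taxa on the other side, and ? for taxa absent from that source tree. The input format (a leaf set plus a list of "one side" sets per tree) and the name `mrp_matrix` are hypothetical.

```python
def mrp_matrix(source_trees):
    """Build a 0/1/? character matrix from source trees.

    source_trees: list of (leafset, splits) pairs, where each split is the set
    of leaves on one side of an internal edge (assumed input format).
    Returns a dict mapping each taxon to its row, as a string.
    """
    taxa = sorted(set().union(*(leafset for leafset, _ in source_trees)))
    rows = {t: [] for t in taxa}
    for leafset, splits in source_trees:
        for side in splits:
            for t in taxa:
                if t not in leafset:
                    rows[t].append("?")   # taxon missing from this source tree
                elif t in side:
                    rows[t].append("1")
                else:
                    rows[t].append("0")
    return {t: "".join(chars) for t, chars in rows.items()}
```

The resulting matrix would then be passed to a parsimony heuristic (for MRP) or a likelihood heuristic (for MRL); that search step is the NP-hard part and is not shown.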
[Figure: FN rate of MRP vs. combined analysis; x-axis: Scaffold Density (%)]
Comparison of Supertree methods and Concatenation
From Swenson et al., Algorithms for Molecular Biology 2010
http://almob.biomedcentral.com/articles/10.1186/1748-7188-5-8
Comparison of Supertree Methods
Swenson et al., Algorithms for Molecular Biology 2011
http://almob.biomedcentral.com/articles/10.1186/1748-7188-6-7
SuperFine
• SuperFine is a technique for improving the speed
and accuracy of supertree methods.
• The first step computes a “strict consensus
merger” (SCM) of the input trees, and the second
step refines the SCM using the supertree method.
• The SCM calculation is very fast. The refinement
step is applied to each polytomy (node with
degree greater than 3) independently, and is fast
when the degree is small.
SuperFine-boosting: improves accuracy of MRP
[Figure: x-axis Scaffold Density (%); from Swenson et al., Syst. Biol. 2012]
SuperFine
• First, construct a supertree with low false
positives (the Strict Consensus Merger)
• Then, refine the tree to reduce false
negatives by resolving each polytomy
using a “base” supertree method (e.g.,
MRP or Quartet Max Cut)
Theoretical results for SCM
• SCM can be computed in polynomial time
• For certain types of inputs, the SCM
method solves the NP-hard “Tree
Compatibility” problem
• All splits in the SCM “appear” in at least
one source tree (and are not contradicted
by any source tree)
Comparing Supertree Methods on 1000-taxon datasets
Figure 1 from Nguyen, Mirarab, and Warnow, Algorithms for Molecular Biology 2012
http://almob.biomedcentral.com/articles/10.1186/1748-7188-7-3
Obtaining a supertree with low FP
The Strict Consensus Merger (SCM)
SCM of two trees:
• Computes the strict consensus on the common
leaf set
• Then superimposes the two trees, contracting
more edges in the presence of “collisions”
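The first of these two steps can be sketched as follows. This covers only the strict consensus on the common leaf set; the superimposing step and the handling of collisions are not shown. The representation of an unrooted tree as a leaf set plus a list of "one side" sets for its internal edges, and the function names, are assumptions made for this illustration.

```python
def restrict(leafset, splits, keep):
    """Restrict a tree's bipartitions to the leaves in `keep`.

    Each result is a frozenset of two frozensets (an unordered bipartition).
    Bipartitions that become trivial (a side with fewer than 2 leaves) are dropped.
    """
    keep = frozenset(keep)
    remaining = frozenset(leafset) & keep
    out = set()
    for side in splits:
        a = frozenset(side) & keep
        b = remaining - a
        if len(a) >= 2 and len(b) >= 2:
            out.add(frozenset([a, b]))
    return out

def strict_consensus_on_common_leaves(tree1, tree2):
    """Return (common leaves, bipartitions present in both restricted trees)."""
    (leaves1, splits1), (leaves2, splits2) = tree1, tree2
    common = frozenset(leaves1) & frozenset(leaves2)
    shared = restrict(leaves1, splits1, common) & restrict(leaves2, splits2, common)
    return common, shared
```

Keeping only the bipartitions shared by both restricted trees is what makes the result conservative: an edge survives only if neither tree contradicts it on the common leaves.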
Strict Consensus Merger (SCM)
[Figure: example of the SCM on trees with leaves a through j]
Performance of SCM
• Low false positive (FP) rate
(Estimated supertree has few false edges)
• High false negative (FN) rate
(Estimated supertree is missing many true
edges)
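As a concrete reading of these two rates, here is a minimal sketch that compares the internal edges (bipartitions) of an estimated tree against those of the true tree. The normalization is an assumption for this example: FN is divided by the number of internal edges in the true tree, and FP by the number in the estimated tree. Edges can be any hashable labels.

```python
def error_rates(true_edges, estimated_edges):
    """Return (fp_rate, fn_rate) for two collections of internal edges.

    fp_rate: fraction of estimated edges that are not in the true tree.
    fn_rate: fraction of true edges that are missing from the estimate.
    The choice of denominators is an assumption of this sketch.
    """
    true_edges, estimated_edges = set(true_edges), set(estimated_edges)
    false_pos = estimated_edges - true_edges
    false_neg = true_edges - estimated_edges
    fp_rate = len(false_pos) / len(estimated_edges) if estimated_edges else 0.0
    fn_rate = len(false_neg) / len(true_edges) if true_edges else 0.0
    return fp_rate, fn_rate
```

A highly unresolved tree such as the SCM has few edges, so it can have an FP rate near zero while its FN rate is high, which is the pattern described above.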
Part II of SuperFine
• Refine the tree to reduce false
negatives by resolving each
polytomy using a base supertree
method (e.g., MRP)
Part 1 of SuperFine
[Figure: SCM example on trees with leaves a through j]
Part 2 of SuperFine
[Figure: trees with leaves a through j annotated with the labels 1 through 6]
Step 2: Apply MRP to the collection of
reduced source trees
[Figure: reduced source trees on leaves 1 through 6, and the tree returned by MRP]
Replace polytomy using tree from MRP
[Figure: supertree on leaves a through j after the polytomy is replaced by the MRP tree]
Resolving a single polytomy, v,
using MRP
• Step 1: Reduce each source tree to a tree
on leaf set {1,2,...,d}, where d=degree(v)
• Step 2: Apply MRP to the collection of
reduced source trees, to produce a tree t
on {1,2,...,d}
• Step 3: Replace the star tree at v by tree t
SuperFine-boosting: improves accuracy of MRP
[Figure: x-axis Scaffold Density (%); from Swenson et al., Syst. Biol. 2012]
SuperFine is also much faster
• MRP: 8-12 sec.
• SuperFine: 2-3 sec.
[Figure: three panels, each with x-axis Scaffold Density (%)]
Summary (so far)
• Supertree methods are useful for constructing very
large species trees from a set of source trees.
• The most well known supertree method is MRP, but
there are more accurate methods (e.g., MRL, and
perhaps quartet-based methods that try to solve
Minimum Quartet Distance Supertree).
• SuperFine is a technique for improving the speed and
accuracy of supertree methods.
• CA-ML (concatenation using maximum likelihood) is
often more accurate than current supertree methods,
but is more computationally intensive.
Limitations of Supertree
Methods
• Traditional supertree methods assume that the
true gene trees match the true species tree.
• This is known to be unrealistic in some
situations, due to processes such as
• Deep coalescence (“incomplete lineage sorting”)
• Gene duplication and loss
• Horizontal gene transfer
[Figure: red gene tree ≠ species tree (green gene tree okay)]
Coming up
• Supertree methods based on quartets are also
good for species tree estimation in the
presence of ILS and/or HGT!
• Supertree methods are useful for divide-and-conquer methods (e.g., DACTAL).