CS 598 AGB
Supertrees
Tandy Warnow

Today’s Material
• Supertree construction: given a set of trees on subsets of S (the full set of taxa), construct a tree on the full set S of taxa.
• Textbook material: Chapter 5 (Aho, Sagiv, Szymanski, and Ullman) and Chapter 8.1-8.3.

Computing a tree from a set of rooted triplet trees
• Constructing a rooted tree from a set of compatible rooted triplet trees. Equivalently, test compatibility of a set of rooted triplet trees.
• Recursive algorithm by Aho, Sagiv, Szymanski, and Ullman (ASSU)
• Chapter 5.1

ASSU algorithm
Given set X of k triplet trees on n species:
• If n=1, return the tree consisting of that single leaf.
• If n>1, then construct a graph with one vertex per species, and an edge (a,b) for each triplet ab|c in X.
• If the graph has a single component, reject (the set is not compatible); else recurse on each component, using only the triplets whose three leaves all lie in that component, and return the tree formed by making the rooted trees on the components each a subtree off the root of the returned tree.

Why does it work?
If the set X of triplet trees is compatible,
• Then there is a rooted tree T, compatible with X, with at least two subtrees off the root, T1 and T2.
• Two leaves a,b in different subtrees off the root cannot appear together in a triplet ab|c, because their least common ancestor is the root.
• Hence every edge of the graph joins two leaves in the same subtree off the root, and the graph formed for the set of triplet trees cannot be connected.
• Therefore the graph formed for the set of triplet trees must have at least two components.
• This argument applies recursively, since every subset of X is also compatible.
• Hence the algorithm returns a tree on which all the triplet trees agree.
If the set X of triplet trees is not compatible, it is not hard to show that the algorithm will detect this (proof by induction on the number of taxa).

Compatibility of rooted trees
• Suppose the input is a set X of rooted trees (not necessarily triplet trees).
• Can we use ASSU to determine if X is compatible, and to compute a compatibility supertree for X?
• Solution: YES, just encode each rooted tree in X by its set of rooted triplet trees (or some subset of these that suffices to define each tree in X), and then run ASSU.
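The ASSU recursion and the triplet encoding of rooted trees described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the textbook's implementation: rooted trees are written as nested tuples of leaf names, a triplet ab|c is written ((a, b), c), and the function names `leaves`, `triplets_of`, and `assu` are hypothetical.

```python
from itertools import combinations

def leaves(tree):
    """Leaf names of a rooted tree written as nested tuples (assumed format)."""
    if not isinstance(tree, tuple):
        return [tree]
    return [x for child in tree for x in leaves(child)]

def triplets_of(tree):
    """All rooted triplets ab|c displayed by the tree, each as ((a, b), c)."""
    if not isinstance(tree, tuple):
        return []
    out = []
    child_leaves = [leaves(c) for c in tree]
    for i, inside in enumerate(child_leaves):
        outside = [x for j, ls in enumerate(child_leaves) if j != i for x in ls]
        # a and b share a child of this node, c hangs off another child: ab|c
        for a, b in combinations(inside, 2):
            out.extend(((a, b), c) for c in outside)
    for child in tree:
        out.extend(triplets_of(child))
    return out

def assu(taxa, triplets):
    """Return a rooted tree agreeing with every triplet, or None if incompatible."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # Keep only the triplets whose three leaves all lie in the current taxon set.
    relevant = [t for t in triplets if {t[0][0], t[0][1], t[1]} <= taxa]
    # Union-find: one vertex per species, one edge (a, b) per triplet ab|c.
    parent = {x: x for x in taxa}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (a, b), _c in relevant:
        parent[find(a)] = find(b)
    components = {}
    for x in taxa:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # connected graph: the triplet set is not compatible
    subtrees = []
    for comp in sorted(components.values(), key=min):  # sorted for stable output
        sub = assu(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)

# Two rooted source trees, encoded by their triplets and merged with ASSU.
source_trees = [((("a", "b"), "c"), "d"), (("c", "e"), "d")]
all_triplets = [t for tree in source_trees for t in triplets_of(tree)]
supertree = assu({"a", "b", "c", "d", "e"}, all_triplets)
```

In this example the first call finds the components {a,b,c,e} and {d}, and the recursion on {a,b,c,e} finds {a,b}, {c}, and {e}, so `supertree` is ((("a", "b"), "c", "e"), "d"). The union-find structure is a design choice of this sketch for finding components; any connected-components routine would do.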

Summary so far
• Testing compatibility of an arbitrary set of rooted trees (and constructing compatibility supertree): polynomial time, using ASSU
• Testing compatibility of an arbitrary set of unrooted trees (and constructing compatibility supertree): NP-complete!
• Special cases for testing compatibility of unrooted trees:
  – Input has a tree on every four taxa. (Solution: Use All Quartets Method to test for compatibility)
  – Input trees all have a common species, A. (Solution: root all the input trees using leaf A, and then run ASSU.)
  – Input has all the “short quartets” of a tree. (Solution: Use Dyadic Closure to test for compatibility, see Chapter 13)

Supertree Methods
• Most of the time, the input is a set of unrooted source trees that is incompatible.
• All the methods described so far only return compatibility supertrees.
• How can we construct supertrees from incompatible source trees?

Supertree estimation
Challenges:
• Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard)
• Estimated subtrees have error
Advantages:
• Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes)
• Can use different types of data for each source tree

Many Supertree Methods
• MRP: Matrix Representation with Parsimony (most commonly used and most accurate)
• weighted MRP
• MRF
• MRD
• Robinson-Foulds Supertrees
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• QMC
• Q-imputation
• SDM
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...

Supertree Optimization Problems
• MRP (Matrix Representation with Parsimony)
• MRL (Matrix Representation with Likelihood)
• RFS (Robinson-Foulds Supertree)
• MQDS (Minimum Quartet Distance Supertree)
Everything is NP-hard. Some of the methods have good heuristics.
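As a rough illustration of the first problem in the list above, here is a minimal sketch of the matrix encoding behind MRP. It assumes the standard encoding in which each internal edge of each source tree becomes one column: taxa on one side of the edge get 1, the other taxa of that source tree get 0, and taxa missing from that source tree get ?. The nested-tuple tree format and the names `clades` and `mrp_matrix` are assumptions made for this sketch, and the parsimony search that is then run on the matrix is not shown.

```python
def clades(tree):
    """Return (leaf set, leaf sets of all internal nodes) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leafset, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leafset |= child_leaves
        found.extend(child_clades)
    found.append(leafset)
    return leafset, found

def mrp_matrix(source_trees):
    """Map each taxon to its row of the MRP matrix, as a string over 0, 1, ?."""
    taxa = sorted(set().union(*(clades(t)[0] for t in source_trees)))
    rows = {x: [] for x in taxa}
    for tree in source_trees:
        tree_leaves, tree_clades = clades(tree)
        for clade in tree_clades:
            # Skip the full leaf set and degenerate clades: no internal edge there.
            if len(clade) < 2 or clade == tree_leaves:
                continue
            for x in taxa:
                if x in clade:
                    rows[x].append("1")
                elif x in tree_leaves:
                    rows[x].append("0")
                else:
                    rows[x].append("?")  # taxon absent from this source tree
    return {x: "".join(r) for x, r in rows.items()}

# Two unrooted quartet trees, ab|cd and bc|de, each written with a
# trifurcation at the top level so that every internal edge is listed once.
matrix = mrp_matrix([(("a", "b"), "c", "d"), (("b", "c"), "d", "e")])
```

Here `matrix` has one row per taxon and one column per internal edge: a is "1?", b is "11", c is "01", d is "00", and e is "?0".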
It is easy to see that if the input source trees are compatible, then MRP, RFS, and MQDS return a compatibility tree.

[Figure: FN rate of MRP vs. combined analysis; x-axis: Scaffold Density (%)]
Comparison of Supertree methods and Concatenation
From Swenson et al., Algorithms for Molecular Biology 2010
http://almob.biomedcentral.com/articles/10.1186/1748-7188-5-8

Comparison of Supertree Methods
Swenson et al., Algorithms for Molecular Biology 2011
http://almob.biomedcentral.com/articles/10.1186/1748-7188-6-7

SuperFine
• SuperFine is a technique for improving the speed and accuracy of supertree methods.
• The first step computes a “strict consensus merger” (SCM) of the input trees, and the second step refines the SCM using the supertree method.
• The SCM calculation is very fast. The refinement step is applied to each polytomy (node with degree greater than 3) independently, and is fast when the degree is small.

[Figure: SuperFine-boosting: improves accuracy of MRP; x-axis: Scaffold Density (%) (Swenson et al., Syst. Biol. 2012)]

SuperFine
• First, construct a supertree with low false positives (the Strict Consensus Merger)
• Then, refine the tree to reduce false negatives by resolving each polytomy using a “base” supertree method (e.g., MRP or Quartet Max Cut)

Theoretical results for SCM
• SCM can be computed in polynomial time
• For certain types of inputs, the SCM method solves the NP-hard “Tree Compatibility” problem
• All splits in the SCM “appear” in at least one source tree (and are not contradicted by any source tree)

Comparing Supertree Methods on 1000-taxon datasets
Figure 1 from Nguyen, Mirarab, and Warnow, Algorithms for Molecular Biology 2012
http://almob.biomedcentral.com/articles/10.1186/1748-7188-7-3

Obtaining a supertree with low FP
The Strict Consensus Merger (SCM)

SCM of two trees
• Computes the strict consensus on the common leaf set
• Then superimposes the two trees, contracting more edges in the presence of “collisions”

[Figure: Strict Consensus Merger (SCM) of source trees on leaves a through j]

Performance of SCM
• Low false positive (FP) rate (Estimated supertree has few false edges)
• High false negative (FN) rate (Estimated supertree is missing many true edges)

Part II of SuperFine
• Refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP)

[Figure: Part 1 of SuperFine: SCM of the source trees on leaves a through j]

[Figure: Part 2 of SuperFine: the source trees reduced to trees on labels 1 through 6 around a polytomy; Step 2: Apply MRP to the collection of reduced source trees; Replace polytomy using tree from MRP]

Resolving a single polytomy, v, using MRP
• Step 1: Reduce each source tree to a tree on leafset {1,2,...,d}, where d=degree(v)
• Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d}
• Step 3: Replace the star tree at v by tree t

[Figure: SuperFine is also much faster: MRP 8-12 sec., SuperFine 2-3 sec.; x-axis: Scaffold Density (%)]

Summary (so far)
• Supertree methods are useful for constructing very large species trees from a set of source trees.
• The most well known supertree method is MRP, but there are more accurate methods (e.g., MRL, and perhaps quartet-based methods that try to solve Minimum Quartet Distance Supertree).
• SuperFine is a technique for improving the speed and accuracy of supertree methods.
• CA-ML (concatenation using maximum likelihood) is often more accurate than current supertree methods, but is more computationally intensive.

Limitations of Supertree Methods
• Traditional supertree methods assume that the true gene trees match the true species tree.
• This is known to be unrealistic in some situations, due to processes such as
  – Deep coalescence (“incomplete lineage sorting”, ILS)
  – Gene duplication and loss
  – Horizontal gene transfer (HGT)

[Figure: Red gene tree ≠ species tree (green gene tree okay)]

Coming up
• Supertree methods based on quartets are also good for species tree estimation in the presence of ILS and/or HGT!
• Supertree methods are useful for divide-and-conquer methods (e.g., DACTAL).
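
Quartet-based supertree methods, like the MQDS criterion listed among the optimization problems earlier, compare trees through the quartet trees they induce. The following is a minimal sketch of the quartet distance between two trees on their common leaves, which is the quantity MQDS sums over the source trees. The nested-tuple tree format (an unrooted tree written with an arbitrary rooting) and the names `clades`, `quartet_topology`, and `quartet_distance` are assumptions of this sketch. It enumerates all quartets by brute force, so it is only meant for small examples.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, leaf sets of all internal nodes) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leafset, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leafset |= child_leaves
        found.extend(child_clades)
    found.append(leafset)
    return leafset, found

def quartet_topology(tree_clades, quartet):
    """Induced quartet tree as a set of two pairs, or None if unresolved."""
    for clade in tree_clades:
        inside = clade & quartet
        # An edge with exactly two of the four leaves below it splits the quartet.
        if len(inside) == 2:
            return frozenset([inside, quartet - inside])
    return None

def quartet_distance(tree1, tree2):
    """Number of quartets of common leaves on which the two trees differ."""
    leaves1, clades1 = clades(tree1)
    leaves2, clades2 = clades(tree2)
    common = sorted(leaves1 & leaves2)
    differing = 0
    for q in combinations(common, 4):
        quartet = frozenset(q)
        if quartet_topology(clades1, quartet) != quartet_topology(clades2, quartet):
            differing += 1
    return differing

t1 = ((("a", "b"), "c"), "d", "e")
t2 = ((("a", "c"), "b"), "d", "e")
distance = quartet_distance(t1, t2)
```

In this example the two trees disagree on the quartets {a,b,c,d} and {a,b,c,e} and agree on the other three, so `distance` is 2.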