Phylogenetic supertrees: seeing the data for the trees
Olaf R. P. Bininda-Emonds
Technische Universität München

Outline
• the fundamental issue: characters versus trees
• open questions: are trees data?
  • loss of contact with primary character data
  • loss of information
  • “novel” solutions
  • data duplication
  • the nature of supertrees
  • analytical issues
• conclusions
  • are supertrees a valid phylogenetic technique?

The fundamental issues

The basic distinction
Conventional studies
• source data: measurable attribute of an organism
• basic unit: character
Supertrees
• source data: phylogenies
• basic unit: membership criterion / statement of relationship
  • can be viewed as a putative statement of relationship
  • at best, can be viewed as a proxy for a shared derived character

The fundamental issue
• supertrees combine trees, not “real data”
  • has led to many criticisms of supertree construction
  • but also lends advantages to the approach

Supertree construction
[Figure: source trees are combined into a supertree either directly (consensus-like techniques) or indirectly (coding technique, then optimization criterion).]

Supertree methods
Direct
• strict consensus supertrees
• MinCutSupertree (and variants)
• semi-strict supertrees
  • Lanyon (1993)
  • Goloboff and Pol (2002)
Indirect
• most matrix representation (MR) supertrees
  • parsimony (MRP and variants)
  • compatibility (MRC)
  • minimum flip supertrees (MRF)
  • average consensus (MRD)
• gene tree parsimony

Are trees data?

Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or phylogenetic hypothesis?
• analytical issues

Loss of information
• a tree is a graphical representation of the “primary signal” in a character-based data set
• strength of primary signal can be measured (e.g., bootstrap frequencies)
• but information regarding nature of any conflicting “subsignals” is lost

A 0000100101010000001
B 0111110000100010111
C 1011111110101101010
D 1111001111111111100
[Figure: character matrix for taxa A to D and the tree inferred from it.]

Potential problems
• all trees and clades on them have equal support a priori
• prevents “signal enhancement” (sensu de Queiroz et al., 1995) in combined data sets
  • coherent subsignals in different data partitions, when combined, outweigh conflicting primary signals
• “throwing away of information” should cause a supertree analysis to be less accurate than a total evidence one, where primary data are combined

No loss of accuracy
• simulation studies indicate loss of information is not detrimental
  • MRP (and variants) (Bininda-Emonds and Sanderson, 2001)
  • average consensus (Lapointe and Levasseur, 2001)
• both methods perform about on a par with total evidence analyses of primary character data
  • and show similar behaviour to total evidence analyses

Maximizing contact
• weighting according to evidential support in source trees
  • possible for all MR methods, average consensus, and MinCutSupertree (and gene tree parsimony?)
  • causes MRP to outperform total evidence analyses of primary character data in simulation (Bininda-Emonds and Sanderson, 2001)
• bootstrapping of primary character data
  • both non-parametric (Moore et al., in prep) and parametric versions (Huelsenbeck et al., in prep)

Non-parametric bootstrapped supertrees
[Figure: workflow from original data to bootstrapped source trees, to bootstrapped supertrees, to a consensus of supertrees.]

Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or phylogenetic hypothesis?
• analytical issues

Novel clades
• all supertree methods have the potential to yield novel statements
  • relationships between taxa that do not co-exist on any single source tree (sensu Sanderson et al., 1998)
  • defining characteristic of method
[Figure: source trees on taxa A B C D and C D E are combined into a supertree on A B C D E.]

Unsupported clades
• some supertree methods have the potential to make statements that are not only novel, but also contradicted (unsupported) by every source tree
• violation of a weaker form of co-Pareto property
  • co-Pareto = relationship of a given kind in the consensus is present in at least one input tree
[Figure: source trees with leaf orders A B C D E F and C D E A B F, their matrix representation, and the resulting supertree on A B C D E F; from Goloboff and Pol (2002).]

A B C D E F
0 1 1 1 1 1
0 0 1 1 1 1
0 0 0 1 1 1
0 0 0 0 1 1
1 1 1 1 1 0
1 0 1 1 1 0
0 0 1 1 1 0
0 0 1 1 0 0

Comparing supertree methods
• indirect, optimization-based methods seem more prone to producing unsupported clades
• strict consensus supertrees
• semi-strict supertrees
• MRC
• MRP (and variants)
• MRF?
• average consensus?
• MinCutSupertree (and variants)?
• gene tree parsimony?

Questions: unsupported clades
• how should they be treated?
• how common are they?
[Figure: the Goloboff and Pol (2002) example repeated.]

Appropriateness
Conventional studies
• unsupported clades (at level of resulting trees) arise via signal enhancement
• have direct character support in the combined matrix
Supertrees
• subsignals are invisible
• unsupported clades lack any support among source trees
  • should be regarded as spurious (Pisani and Wilkinson, 2002)
• not equivalent to signal enhancement
[Figure: the Goloboff and Pol (2002) example repeated.]

Incidence of unsupported clades
• circumstantial evidence hints that they are rare
• only a few reported in the literature
  • theoretical: Goloboff and Pol (2002); Wilkinson et al. (2001)
  • empirical: Bininda-Emonds and Bryant (1998); Wilkinson et al. (2001)
• estimated that 8 of the 198 clades in the carnivore MRP supertree (~ 4%) had no support among the source trees (Bininda-Emonds et al., 1999)
• dinosaur MRP supertree (Pisani et al., 2002) has no unsupported clades

Unsupported clades are very rare
• simulation results (MRP only)
  • occur most often with source trees that are:
    • few in number (n ≤ 5)
    • large in size (up to 50 taxa)
    • identical in their taxon sets (“consensus setting”)
  • “most often” means < 0.21% of all simulated clades
  • overall incidence was 131 of 282 137 clades (< 0.05%)
• empirical results
  • both the carnivore and lagomorph MRP supertrees have no unsupported clades whatsoever

Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or phylogenetic hypothesis?
• analytical issues

Data duplication
• character data are often recycled between phylogenetic analyses
  • e.g., total evidence analyses, molecular studies of the same gene
• the same character data may contribute to more than one source tree
  • overrepresented in a supertree analysis (data duplication)
  • also violates assumption of data independence
• data duplication among cetartiodactyl source trees in the Liu et al. (2001) mammal order MRP supertree
  • from Gatesy et al. (2002)

Minimizing duplication
• data duplication a potential problem for all supertree methods
  • use of trees does not directly reveal the source of the underlying data set
• but can be minimized / avoided with careful data collection protocols
  • e.g., supertrees of Daubin et al. (2001) and Kennedy and Page (2002) lack data duplication

Is data duplication unavoidable?
• no phylogenies are independent given a single Tree of Life
  • all characters and data sources have been subject to the same evolutionary processes and history
• want to combine phylogenetic hypotheses that can reasonably be viewed as being independent

Is the problem overrated?
• supertrees combine phylogenetic hypotheses
• emergent property composed of more than their raw character data
  • manipulation of data (weighting, alignment, recoding)
  • method and assumptions of analysis
• for example:
  • strongly conflicting molecular phylogenies for whales can be explained largely by the choice of outgroup (Messenger and McGuire, 1998)
  • alignment and weighting of primary data also important

Is data duplication overrated?
• data duplication is often only partial
  • most combined data sets represent unique combinations of individual data sets
  • easy to deal with data sets that are supersets of others
• signal enhancement means that each unique combination could justifiably be viewed as an independent hypothesis
  • also independent from constituent data sets

Are supertrees unfairly singled out?
• data duplication also exists in conventional studies (but less obviously so and to a lesser known extent):
  • morphological: single features often described by multiple characters
  • molecular: secondary structure (e.g., stems in tRNA, protein folding) and codon structure mean primary mutations may require secondary compensatory ones
  • total evidence: mixing of phenotypic and genotypic data must represent data duplication at some level

Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or phylogenetic hypothesis?
• analytical issues

The nature of supertrees
• is the supertree itself a legitimate phylogenetic hypothesis?
• many would say “no”, arguing instead that they are a:
  • form of consensus
  • historical summary of systematic effort
• therefore, supertrees should not be used to answer biological questions

Supertrees as consensus
• association derives from:
  • similar methodology (combining trees rather than data)
  • both containing polytomies
  • resulting topologies may be suboptimal given underlying data
• why are consensus trees not valid phylogenetic hypotheses?
  • especially if polytomies viewed as soft rather than hard

Dealing with incongruence
• all supertree methods must somehow deal with incongruence among source trees
  • ignore it: strict consensus, semi-strict, MinCutSupertree, MRC
  • “fix” it: MRF
  • explain it biologically: gene tree parsimony
  • optimize it: average consensus and MRP

Incongruence as homoplasy
• a repeated criticism of MRP is that inferred homoplasy on supertree has no biological meaning
  • convergence and reversals meaningless with respect to a membership criterion
• but why is MRP singled out?
  • similar arguments should apply at least to average consensus

Parsimony and parsimony
Principle of parsimony
• a criterion for deciding among scientific theories or explanations
• “Plurality should not be posited without necessity”
• choose the simplest explanation of a phenomenon
Cladistic parsimony
• specific application of principle of parsimony
• prefer the tree with the fewest number of evolutionary steps (i.e., character state changes)
• additional changes over minimum number represent homoplasy

Homoplasy and supertrees
• notions of homoplasy, convergence, and reversals have nothing to do with parsimony per se
  • or really even with cladistic parsimony
  • post hoc biological interpretation of incongruence
• incongruence on an MRP supertree is simply incongruence
• idea of homoplasy in this context is epistemologically, not biologically meaningless

Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or phylogenetic hypothesis?
• analytical issues

Limitations of total evidence
• analytical limitations of combined primary data sets also result in a loss of information
• data must be compatible
  • use of a single optimization criterion
    • usually MP, but ML now also possible
    • some data still not analyzable under either framework (e.g., DNA-DNA hybridization, morphometric data)
• use of simplistic models of evolution
  • MP: differential weighting (including ti:tv ratio)
  • ML: same model for every partition
• alignment problems

Advantages to supertrees
• no loss of information: all phylogenetic hypotheses can be combined
  • even those that aren’t based on any data
• process amounts to partitioned analyses
  • each partition can be analyzed according to the most appropriate model of evolution and optimization criterion
  • can be done in parallel
• results then combined with little loss of accuracy
  • or hopefully less than the loss of information that a total evidence analysis entails

A phylogeny of mammals

The “superteam”
• have complete supertrees for:
  • Carnivora
  • Chiroptera
  • Insectivora
  • Lagomorpha
  • Marsupialia
  • Primates
• total of 1923 species (41.5%)

Molecular data
• Murphy et al. (2001a)
  • 9779 bp from 18 genes for 64 species
• Madsen et al. (2001)
  • 8655 bp from 4 genes for 82 species
• Murphy et al. (2001b)
  • 16 397 bp from 22 genes for 44 species (< 1%)

Summary

Whither supertrees?
• criticisms of supertree construction have been launched at two levels
  • at the supertree approach as a whole
  • at individual supertree methods

Of approaches …
• supertree problem inherently difficult because of missing data
  • results in the lack of a single right answer
[Figure: source trees with leaf labels “D C B A B + D A”; their combination is marked “?”.]
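The role of missing data can be made concrete with matrix representation coding, the step that precedes MRP and the other MR methods. The sketch below follows the standard coding scheme (one binary character per source-tree clade, "1" for members, "0" for other taxa on that tree, "?" for taxa absent from that tree). The nested-tuple tree format, the function names, and the two example topologies are assumptions made for illustration, and the all-zero outgroup usually added for MRP is omitted.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """List the leaf sets of all internal nodes except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """Code each source-tree clade as one character: 1, 0, or ? (missing)."""
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")  # taxon not sampled by this tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


# Hypothetical source trees with partially overlapping taxon sets.
source_trees = [((("A", "B"), "C"), "D"), (("C", "D"), "E")]
for taxon, states in mrp_matrix(source_trees).items():
    print(taxon, states)
# A 11?
# B 11?
# C 101
# D 001
# E ??0
```

Taxa A, B, and E each carry "?" entries, so any placement of E relative to A and B is unconstrained by the coded data. This is one way to see why missing data can leave more than one equally defensible supertree.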
Of approaches …
• trees are data
  • potential loss of information not detrimental
• key is to think in terms of phylogenetic hypotheses
• still awaiting a response from the cladistic community …

… and methods
• every method will go astray if its assumptions are violated
  • e.g., parsimony and long-branch attraction, likelihood and wrong model, regression and data non-independence
• for supertrees, key is to try and establish:
  • what each method’s boundary conditions are
  • how robust each method is to violations of its assumptions
  • what the properties of each method are (in relation to our desired objective)
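One property of a method that can be checked mechanically is whether a supertree contains clades with no support among the source trees. The sketch below uses a simple working definition that is an assumption of this example, not necessarily the one used in the studies cited earlier: a supertree clade counts as supported if, once restricted to the taxa of some source tree, it contains at least two of those taxa, excludes at least one, and matches a clade of that source tree. The tree format, function names, and example trees are likewise hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clade_sets(tree, is_root=True):
    """List the leaf sets of all internal nodes except the root."""
    if isinstance(tree, str):
        return []
    found = [] if is_root else [leaf_set(tree)]
    for child in tree:
        found.extend(clade_sets(child, False))
    return found


def unsupported_clades(supertree, source_trees):
    """Return supertree clades matched by no source tree (working definition)."""
    sources = [(leaf_set(t), set(clade_sets(t))) for t in source_trees]
    unsupported = []
    for clade in clade_sets(supertree):
        supported = False
        for taxa, source_clades in sources:
            restricted = clade & taxa
            informative = len(restricted) >= 2 and restricted != taxa
            if informative and restricted in source_clades:
                supported = True
                break
        if not supported:
            unsupported.append(clade)
    return unsupported


# Hypothetical source trees and two hypothetical candidate supertrees.
sources = [((("A", "B"), "C"), "D"), (("C", "D"), "E")]
candidate_1 = (((("A", "B"), "C"), "D"), "E")
candidate_2 = (((("A", "D"), "B"), "C"), "E")

print(unsupported_clades(candidate_1, sources))       # no unsupported clades
print(len(unsupported_clades(candidate_2, sources)))  # clades {A,D} and {A,B,D}
```

Under this definition the first candidate has every clade matched by a source tree, while the second contains two clades that no source tree displays. Stricter or looser definitions of support would change the counts.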