Phylogenetic Supertrees: Seeing the Data for the Trees

Olaf R. P. Bininda-Emonds
Technische Universität München
Outline
• the fundamental issue: characters versus
trees
• open questions: are trees data?
• loss of contact with primary character data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees
• analytical issues
• conclusions
• are supertrees a valid phylogenetic technique?
The fundamental issues
The basic distinction
Conventional studies
• source data: measurable attribute of an organism
• basic unit: character
• can be viewed as a putative statement of relationship
Supertrees
• source data: phylogenies
• basic unit: membership criterion / statement of relationship
• at best, can be viewed as a proxy for a shared derived character
The fundamental issue
• supertrees combine trees, not “real data”
• has led to many criticisms of supertree
construction
• but also lends advantages to the approach
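Supertree construction
• direct: consensus-like techniques
• indirect: coding technique, then optimization criterion
[Figure: source trees with partially overlapping taxon sets (A B C K L; C D E H I K; E F G H J K L) combined into a supertree of taxa A to L]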
Supertree methods
Direct
• strict consensus supertrees
• MinCutSupertree (and variants)
• semi-strict supertrees
• Lanyon (1993)
• Goloboff and Pol (2002)
Indirect
• most matrix representation (MR) supertrees
• parsimony (MRP and variants)
• compatibility (MRC)
• minimum flip supertrees (MRF)
• average consensus (MRD)
• gene tree parsimony
Are trees data?
Open questions
• loss of contact with raw (character) data
• loss of information
• “novel” solutions
• data duplication
• the nature of supertrees: consensus or
phylogenetic hypothesis?
• analytical issues
Loss of information
• a tree is a graphical
representation of the
“primary signal” in a
character-based data set
• strength of primary signal
can be measured (e.g.,
bootstrap frequencies)
• but information regarding
nature of any conflicting
“subsignals” lost
[Figure: character matrix for taxa A to D and the tree (A B C D) derived from it]
A 0000100101010000001
B 0111110000100010111
C 1011111110101101010
D 1111001111111111100
Potential problems
• all trees and clades on them have equal support a
priori
• prevents “signal enhancement” (sensu de Queiroz
et al., 1995) in combined data sets
• coherent subsignals in different data partitions, when
combined, outweigh conflicting primary signals
• “throwing away of information” should cause a
supertree analysis to be less accurate than a total
evidence one, where primary data are combined
No loss of accuracy
• simulation studies indicate loss of information
is not detrimental
• MRP (and variants) (Bininda-Emonds and
Sanderson, 2001)
• average consensus (Lapointe and Levasseur, 2001)
• both methods perform about on a par with total
evidence analyses of primary character data
• and show similar behaviour to total evidence
analyses
Maximizing contact
• weighting according to evidential support in
source trees
• possible for all MR methods, average consensus,
and MinCutSupertree (and gene tree parsimony?)
• causes MRP to outperform total evidence analyses
of primary character data in simulation (Bininda-Emonds and Sanderson, 2001)
• bootstrapping of primary character data
• both non-parametric (Moore et al., in prep) and
parametric versions (Huelsenbeck et al., in prep)
Non-parametric bootstrapped supertrees
[Figure: original data → bootstrapped source trees → bootstrapped supertree → consensus of supertrees]
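The resampling step of this pipeline can be sketched as follows. This is a minimal illustration, not the procedure of Moore et al.: the function names, the dictionary layout of a character matrix (taxon name mapped to a string of states), and the toy data are all assumptions made for the example. Tree inference, supertree construction, and the final consensus are deliberately left out.

```python
import random


def bootstrap_matrix(matrix, rng):
    """Resample character columns with replacement (non-parametric bootstrap).

    `matrix` maps each taxon name to a string of character states; this
    layout is an assumption made for the sketch.
    """
    n_chars = len(next(iter(matrix.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(row[c] for c in cols) for taxon, row in matrix.items()}


def bootstrap_replicates(partitions, n_reps, seed=0):
    """Yield, per replicate, one pseudoreplicate matrix for each source data set."""
    rng = random.Random(seed)
    for _ in range(n_reps):
        yield [bootstrap_matrix(m, rng) for m in partitions]


# Toy source data set (hypothetical values, for illustration only).
toy = {"A": "0001", "B": "0111", "C": "1011", "D": "1110"}
replicates = list(bootstrap_replicates([toy], n_reps=3, seed=1))
```

Each replicate would then be analyzed to give one set of bootstrapped source trees, those trees combined into a bootstrapped supertree, and the supertrees from all replicates summarized by a consensus.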
Novel clades
• all supertree methods have
the potential to yield
novel statements
• relationships between taxa
that do not co-exist on any
single source tree (sensu
Sanderson et al., 1998)
• defining characteristic of
method
[Figure: source trees (A B C D) + (C D E) combined into a supertree of taxa A to E]
Unsupported clades
• some supertree methods have the potential
to make statements that are not only novel,
but also contradicted (unsupported) by
every source tree
• violation of a weaker form of co-Pareto
property
• co-Pareto = relationship of a given kind in the
consensus is present in at least one input tree
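[Figure: two source trees, (A B C D E F) + (C D E A B F), their matrix representation (columns 1 to 4 from the first tree, 5 to 8 from the second), and the resulting supertree of taxa A to F]
A 0000 1100
B 1000 1000
C 1100 1111
D 1110 1111
E 1111 1110
F 1111 0000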
• from Goloboff and Pol (2002)
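The matrix representation coding behind this example can be sketched as below: every non-trivial clade of a source tree becomes one binary character (1 = member, 0 = non-member, ? = taxon absent from that tree). The sketch assumes the two source trees are the fully pectinate trees implied by the 0/1 columns on the slide; the nested-tuple tree format and the function names are choices made for illustration, not part of any published implementation.

```python
def clades(tree):
    """Return (leaf set, list of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def mrp_matrix(trees, taxa):
    """Code each non-trivial clade of each source tree as one binary character."""
    rows = {t: "" for t in taxa}
    for tree in trees:
        leaves, found = clades(tree)
        # Drop the root clade (all leaves); list larger clades first for readability.
        informative = sorted((c for c in found if len(c) < len(leaves)),
                             key=len, reverse=True)
        for clade in informative:
            for t in taxa:
                rows[t] += "1" if t in clade else ("0" if t in leaves else "?")
    return rows


# Assumed topologies of the two source trees (inferred from the matrix columns).
tree1 = ("A", ("B", ("C", ("D", ("E", "F")))))
tree2 = ((((("C", "D"), "E"), "A"), "B"), "F")

matrix = mrp_matrix([tree1, tree2], "ABCDEF")
for taxon, row in matrix.items():
    print(taxon, row)
```

Under these assumptions the rows match the 0/1 entries shown on the slide. A full MRP analysis would also add an all-zero outgroup row and then search for the most parsimonious tree, which this sketch does not attempt.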
Comparing supertree methods
• indirect, optimization-based methods seem more prone to producing unsupported clades

• strict consensus supertrees
• semi-strict supertrees
• MRC

• MRP (and variants)
• MRF?
• average consensus?
• MinCutSupertree (and variants)?
• gene tree parsimony?
Questions: unsupported clades
• how should they be treated?
• how common are they?
[Figure: source trees (A B C D E F) + (C D E A B F) and the resulting supertree of taxa A to F]
Appropriateness
Conventional studies
• unsupported clades (at level of resulting trees) arise via signal enhancement
• have direct character support in the combined matrix
Supertrees
• subsignals are invisible
• unsupported clades lack any support among source trees → should be regarded as spurious (Pisani and Wilkinson, 2002)
• not equivalent to signal enhancement
[Figure: the two source trees, matrix representation, and supertree of the Goloboff and Pol (2002) example, repeated]
Incidence of unsupported clades
• circumstantial evidence hints that they are rare
• only a few reported in the literature
• theoretical: Goloboff and Pol (2002); Wilkinson et al. (2001)
• empirical: Bininda-Emonds and Bryant (1998); Wilkinson et
al. (2001)
• estimated that 8 of the 198 clades in the carnivore MRP
supertree (~ 4%) had no support among the source trees
(Bininda-Emonds et al., 1999)
• dinosaur MRP supertree (Pisani et al., 2002) has no
unsupported clades
Unsupported clades are very rare
• simulation results (MRP only)
• occur most often with source trees that are:
• few in number (n ≤ 5)
• large in size (up to 50 taxa)
• possess identical taxon sets (“consensus setting”)
• “most often” means < 0.21% of all simulated clades
• overall incidence was 131 of 282 137 clades (< 0.05%)
• empirical results
• both the carnivore and lagomorph MRP supertrees
have no unsupported clades whatsoever
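Counting unsupported clades requires an operational test. One possible reading, sketched here as an assumption rather than as the criterion used in the studies cited, is that a supertree clade is unsupported when, after pruning it to the taxa of each source tree, it conflicts with at least one clade of every source tree. The tree format (nested tuples), the function names, and the candidate supertree are hypothetical.

```python
def clades(tree):
    """Return (leaf set, list of clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def conflicts(clade, tree):
    """True if the clade, pruned to the tree's taxa, overlaps a tree clade without nesting."""
    leaves, tree_clades = clades(tree)
    pruned = clade & leaves
    return any(bool(pruned & c) and bool(pruned - c) and bool(c - pruned)
               for c in tree_clades)


def unsupported_clades(supertree, source_trees):
    """List supertree clades that are contradicted by every source tree."""
    _, candidates = clades(supertree)
    return [c for c in candidates if all(conflicts(c, t) for t in source_trees)]


# Assumed source trees (two pectinate trees on taxa A to F) and a
# hypothetical candidate supertree, used only to exercise the check.
tree1 = ("A", ("B", ("C", ("D", ("E", "F")))))
tree2 = ((((("C", "D"), "E"), "A"), "B"), "F")
candidate = (("A", "B"), ("C", ("D", ("E", "F"))))

flagged = unsupported_clades(candidate, [tree1, tree2])
```

In this toy case only the clade (A, B) is flagged, because both source trees contain a clade that splits A from B; the remaining clades of the candidate all occur on the first source tree.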
Data duplication
• character data are often recycled between
phylogenetic analyses
• e.g., total evidence analyses, molecular studies of the same gene
• the same character data may contribute to more than one source tree
• overrepresented in a supertree analysis → data duplication
• also violates assumption of data independence
• data duplication among cetartiodactyl source trees in the Liu et al. (2001) mammal order MRP supertree
• from Gatesy et al. (2002)
Minimizing duplication
• data duplication a potential problem for all
supertree methods
• use of trees does not reveal directly source of
underlying data set
• but can be minimized / avoided with careful
data collection protocols
• e.g., supertrees of Daubin et al. (2001) and
Kennedy and Page (2002) lack data duplication
Is data duplication unavoidable?
• no phylogenies are independent given a
single Tree of Life
• all characters and data sources have been
subject to the same evolutionary processes and
history
• want to combine phylogenetic hypotheses
that can reasonably be viewed as being
independent
Is the problem overrated?
• supertrees combine phylogenetic hypotheses
• emergent property → composed of more than their raw
character data
• manipulation of data (weighting, alignment, recoding)
• method and assumptions of analysis
• for example:
• strongly conflicting molecular phylogenies for whales
can be explained largely by the choice of outgroup
(Messenger and McGuire, 1998)
• alignment and weighting of primary data also important
Is data duplication overrated?
• data duplication is often only partial
• most combined data sets represent unique
combinations of individual data sets
• easy to deal with data sets that are supersets of others
• signal enhancement means that each unique
combination could justifiably be viewed as an
independent hypothesis
• also independent from constituent data sets
Are supertrees unfairly singled out?
• data duplication also exists in conventional studies
(but less obviously so and to a lesser known extent):
• morphological → single features often described by
multiple characters
• molecular → secondary structure (e.g., stems in tRNA,
protein folding) and codon structure mean primary
mutations may require secondary compensatory ones
• total evidence → mixing of phenotypic and genotypic
data must represent data duplication at some level
The nature of supertrees
• is the supertree itself a legitimate phylogenetic
hypothesis?
• many would say “no”, arguing instead that they
are a:
• form of consensus
• historical summary of systematic effort
• therefore, supertrees should not be used to
answer biological questions
Supertrees as consensus
• association derives from:
• similar methodology (combining trees rather than data)
• both containing polytomies
• resulting topologies may be suboptimal given underlying
data
• why are consensus trees not valid phylogenetic
hypotheses?
• especially if polytomies viewed as soft rather than hard
Dealing with incongruence
• all supertree methods must somehow deal
with incongruence among source trees
• ignore it: strict consensus, semi-strict,
MinCutSupertree, MRC
• “fix” it: MRF
• explain it biologically: gene tree parsimony
• optimize it: average consensus and MRP
Incongruence as homoplasy
• a repeated criticism of MRP is that inferred
homoplasy on supertree has no biological
meaning
• convergence and reversals meaningless with
respect to a membership criterion
• but why is MRP singled out?
• similar arguments should apply at least to
average consensus
Parsimony and parsimony
Principle of parsimony
• a criterion for deciding among scientific theories or explanations
• “Plurality should not be posited without necessity” → choose the simplest explanation of a phenomenon
Cladistic parsimony
• specific application of principle of parsimony
• prefer the tree with the fewest number of evolutionary steps (i.e., character state changes)
• additional changes over minimum number represent homoplasy
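Step counting under cladistic parsimony can be illustrated with Fitch's algorithm for a single character on a rooted binary tree. The sketch below scores a matrix representation of two pectinate source trees on taxa A to F (both the matrix literal and the tree topologies are assumptions chosen for illustration) and reports the steps beyond the minimum of one per character, which is the quantity labelled homoplasy above. It omits the all-zero outgroup that an MRP analysis normally adds, and all names are illustrative.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            s = states[node]
            return {"0", "1"} if s == "?" else {s}  # "?" = missing data
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # disjoint state sets imply one change at this node
        return left | right

    visit(tree)
    return steps


def tree_length(tree, matrix):
    """Sum Fitch steps over all characters (columns) of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch_steps(tree, {t: row[i] for t, row in matrix.items()})
               for i in range(n_chars))


# Assumed matrix: characters 1 to 4 code the clades of tree1, 5 to 8 those of tree2.
matrix = {"A": "00001100", "B": "10001000", "C": "11001111",
          "D": "11101111", "E": "11111110", "F": "11110000"}
tree1 = ("A", ("B", ("C", ("D", ("E", "F")))))
tree2 = ((((("C", "D"), "E"), "A"), "B"), "F")

minimum = 8  # one step for each of the eight variable binary characters
extra1 = tree_length(tree1, matrix) - minimum  # 3 extra steps on tree1
extra2 = tree_length(tree2, matrix) - minimum  # 4 extra steps on tree2
```

Neither source tree fits all eight characters without extra steps; in a supertree setting those extra steps simply record disagreement between the source trees.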
Homoplasy and supertrees
• notions of homoplasy, convergence, and
reversals have nothing to do with parsimony
per se
• or really even with cladistic parsimony
• post hoc biological interpretation of incongruence
• incongruence on an MRP supertree is simply
incongruence
• idea of homoplasy in this context is
epistemologically, not biologically meaningless
Limitations of total evidence
• analytical limitations of combined primary data
sets also result in a loss of information
• data must be compatible
• use of a single optimization criterion → usually MP, but ML
now also possible
• some data still not analyzable under either framework (e.g.,
DNA-DNA hybridization, morphometric data)
• use of simplistic models of evolution
• MP: differential weighting (including ti:tv ratio)
• ML: same model for every partition
• alignment problems
Advantages to supertrees
• no loss of information: all phylogenetic hypotheses
can be combined
• even those that aren’t based on any data
• process amounts to partitioned analyses
• each partition can be analyzed according to most
appropriate model of evolution, and optimization criterion
• can be done in parallel
• results then combined with little loss of accuracy
• or hopefully less than loss of information for a total evidence
analysis entails
A phylogeny of mammals
The “superteam”
• have complete supertrees for:
• Carnivora
• Chiroptera
• Insectivora
• Lagomorpha
• Marsupialia
• Primates
• total of 1923 species (41.5%)
Molecular data
• Murphy et al. (2001a)
• 9779 bp from 18 genes for 64 species
• Madsen et al. (2001)
• 8655 bp from 4 genes for 82 species
• Murphy et al. (2001b)
• 16 397 bp from 22 genes for 44 species (< 1%)
Summary
Whither supertrees?
• criticisms of supertree
construction have been
launched at two levels
• at the supertree
approach as a whole
• at individual supertree
methods
Of approaches …
• supertree problem inherently difficult because
of missing data
• results in the lack of a single right answer
[Figure: source trees on taxa A to D combined (+), with an undetermined (?) result]
Of approaches …
• trees are data
• potential loss of information not detrimental
• key is to think in terms of phylogenetic
hypotheses
• still awaiting a response from the cladistic
community …
… and methods
• every method will go astray if its assumptions are
violated
• e.g., parsimony and long-branch attraction, likelihood and
wrong model, regression and data non-independence
• for supertrees, key is to try to establish:
• what each method’s boundary conditions are
• how robust each method is to violations of its assumptions
• what the properties of each method are (in relation to our
desired objective)