
Phylogeny
Wayne Maddison 25 March 2003
• History + basic ideas
• Methods for reconstructing phylogeny
• Applications of phylogeny
Systematics & Taxonomy
Description of species & arrangement into groups (classification)
Beginning 1750’s:
What are the species and how are they related?
By ca. 1950:
• hundreds of thousands of species described
• general idea of phylogeny, at least for multicellular creatures
Systematics of mid 20th century:
Phylogeny:
Phylogeny inferred usually without explicit data or analyses (i.e. seat of the pants)
Classification:
(1) Focus was not so much classification & phylogeny; focus was distinguishing species.
(2) Phylogeny & classification were permitted to differ
Paraphyletic group:
an ancestor and some but not all of its descendants
Must be contiguous piece of tree (e.g. Archaeopteryx + ostrich + crow isn’t paraphyletic)
[Figure: tree of turtles, lizards, crocodiles, Birds and Mammals, with turtles, lizards and crocodiles grouped as Reptiles]
Monophyletic group:
an ancestor and all of its descendants
or a group consisting of species more closely related to
each other than to any other species not in the group
or a group whose most recent common ancestor is
more recent than any common ancestor shared with
species outside the group
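The first definition above (an ancestor and all of its descendants) can be checked mechanically once a tree is written down. The sketch below is only an illustration: the nested-tuple tree encoding, the function names and the particular amniote topology are assumptions made for the example, not part of the lecture.

```python
def tips(tree):
    """Return the set of tip names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= tips(child)
    return result


def clades(tree):
    """Yield the tip set of every node (every clade) in the tree."""
    yield frozenset(tips(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group of tips is monophyletic if it is exactly the tip set of some node."""
    return frozenset(group) in set(clades(tree))


# Assumed topology, for illustration only.
AMNIOTES = ("mammals", ("turtles", ("lizards", ("crocodiles", "birds"))))

print(is_monophyletic(AMNIOTES, {"crocodiles", "birds"}))
print(is_monophyletic(AMNIOTES, {"turtles", "lizards", "crocodiles"}))
```

On this assumed tree, crocodiles + birds form a clade, while turtles + lizards + crocodiles do not, because their most recent common ancestor also gave rise to birds.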
[Figure: tree with tips turtles, lizards, crocodiles, birds, mammals]
Polyphyletic group:
neither monophyletic nor paraphyletic
i.e. discontiguous pieces of the phylogenetic tree
Before Hennig:
If a group of species is closely related, they should
share traits
Hennig:
If a group of species is monophyletic, they should
share traits derived within the containing group.
Other types of groups (paraphyletic, polyphyletic) are
not expected to possess derived traits uniquely
Therefore sharing of derived traits is the indicator
of monophyly.
1950’s & 60’s
formalization of methods begins
Hennig 1950 (German), 1966 (English)
(1) Formal logic for reconstructing
phylogeny
(2) Classification should match phylogeny
(all groups monophyletic)
apomorphy: derived trait
synapomorphy: shared derived trait
plesiomorphy: ancestral trait
Synapomorphy indicates monophyly
1960’s: quantification
Formal coding of data into matrix of characters and character states
Character with states 0, 1 (e.g. character A)

Data matrix
            A  B  C  D  E  F  G
Papilio     0  2  1  0  0  1  0
Nymphalis   0  1  0  1  1  0  1
Pieris      1  0  0  0  0  0  2
Danaus      0  1  0  1  1  0  1
Battus      0  2  1  0  0  1  0
Heliconius  0  0  0  0  1  1  0
Colias      1  0  0  0  0  0  2

Optimality methods
Examine trees, choose tree that optimizes some criterion
[Figure: the data matrix compared against candidate trees (Tree 1, Tree 2, Tree 3, Tree 4, Tree 5), each marked "?"]
Distance methods
Data matrix → (some loss of information) → Distance matrix → Tree
[Figure: distance matrix of pairwise distances among Papilio, Nymphalis, Pieris, Danaus, Battus, Heliconius and Colias, with 0 on the diagonal and off-diagonal values from 0.2 to 0.9; cell positions not recoverable]
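A minimal sketch of the data matrix → distance matrix step, assuming the simplest possible distance: the fraction of characters in which two taxa differ. That choice, and the function names, are assumptions for illustration; the distances it produces are not the values shown on the slide.

```python
from itertools import combinations

# Character states for characters A..G, copied from the data matrix.
MATRIX = {
    "Papilio":    [0, 2, 1, 0, 0, 1, 0],
    "Nymphalis":  [0, 1, 0, 1, 1, 0, 1],
    "Pieris":     [1, 0, 0, 0, 0, 0, 2],
    "Danaus":     [0, 1, 0, 1, 1, 0, 1],
    "Battus":     [0, 2, 1, 0, 0, 1, 0],
    "Heliconius": [0, 0, 0, 0, 1, 1, 0],
    "Colias":     [1, 0, 0, 0, 0, 0, 2],
}


def mismatch_distance(row_x, row_y):
    """Fraction of characters in which two taxa have different states."""
    differences = sum(a != b for a, b in zip(row_x, row_y))
    return differences / len(row_x)


def distance_matrix(matrix):
    """Map each unordered pair of taxa to its mismatch distance."""
    return {
        frozenset((s, t)): mismatch_distance(matrix[s], matrix[t])
        for s, t in combinations(matrix, 2)
    }


DISTANCES = distance_matrix(MATRIX)
for pair, d in sorted(DISTANCES.items(), key=lambda item: item[1]):
    print(sorted(pair), round(d, 3))
```

The loss of information is visible in the result: Nymphalis-Heliconius and Pieris-Heliconius receive the same distance (4/7) even though the two pairs differ in different characters, and the distance alone no longer says which characters those were.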
Distance methods
UPGMA
Neighbor Joining
Optimality methods
Parsimony
— seeks tree minimizing evolutionary change
Likelihood
— seeks tree maximizing probability
of observed data
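To make "minimizing evolutionary change" concrete, the sketch below counts the minimum number of state changes each character needs on a given tree, using Fitch's algorithm for unordered states on a rooted binary tree. The use of Fitch's algorithm, the nested-tuple tree encoding and the two candidate trees are assumptions chosen for the example; the lecture does not specify them.

```python
# Character states for characters A..G, copied from the data matrix.
MATRIX = {
    "Papilio":    [0, 2, 1, 0, 0, 1, 0],
    "Nymphalis":  [0, 1, 0, 1, 1, 0, 1],
    "Pieris":     [1, 0, 0, 0, 0, 0, 2],
    "Danaus":     [0, 1, 0, 1, 1, 0, 1],
    "Battus":     [0, 2, 1, 0, 0, 1, 0],
    "Heliconius": [0, 0, 0, 0, 1, 1, 0],
    "Colias":     [1, 0, 0, 0, 0, 0, 2],
}


def fitch(tree, states):
    """Return (possible states at this node, steps below it) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree  # binary trees only
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No state in common: one change is needed on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of steps over all characters (the parsimony score)."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        states = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch(tree, states)[1]
    return total


# Two hypothetical candidate trees.
TREE_1 = ((("Papilio", "Battus"), ("Pieris", "Colias")),
          (("Nymphalis", "Danaus"), "Heliconius"))
TREE_2 = ((("Papilio", "Pieris"), ("Battus", "Colias")),
          (("Nymphalis", "Heliconius"), "Danaus"))

print(tree_length(TREE_1, MATRIX), tree_length(TREE_2, MATRIX))
```

Under the parsimony criterion the tree with the smaller score is preferred; here the first candidate needs fewer steps than the second.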
Parsimony
- counting steps, examples
- seeks “simplest explanation” (minimizes ad hoc hypotheses against contradictory evidence)
Likelihood
simple example
Observed states: A = 1, B = 1, C = 0, D = 0
What is the probability of these states evolving if the probability of change on each branch is 0.1?
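The question above can be worked through by summing over the unknown states at the internal nodes. The sketch below does this for a two-state character with tip states A = 1, B = 1, C = 0, D = 0. The rooted tree shapes, the equal 0.5/0.5 prior on the root state and the function names are assumptions made for the example.

```python
def conditional(tree, tip_states, p_change):
    """Return [L0, L1]: probability of the tips below this node, given that
    the node itself has state 0 or state 1."""
    if isinstance(tree, str):
        return [1.0 if tip_states[tree] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child in tree:
        child_l = conditional(child, tip_states, p_change)
        for s in (0, 1):
            # Sum over the child's possible states t, weighting by the
            # probability of change (or no change) along the branch.
            result[s] *= sum(
                (p_change if s != t else 1.0 - p_change) * child_l[t]
                for t in (0, 1)
            )
    return result


def likelihood(tree, tip_states, p_change=0.1, root_prior=(0.5, 0.5)):
    """Probability of the observed tip states on this tree."""
    cond = conditional(tree, tip_states, p_change)
    return sum(root_prior[s] * cond[s] for s in (0, 1))


TIP_STATES = {"A": 1, "B": 1, "C": 0, "D": 0}
TREE_AB_CD = (("A", "B"), ("C", "D"))  # assumed shape
TREE_AC_BD = (("A", "C"), ("B", "D"))  # alternative for comparison

print(likelihood(TREE_AB_CD, TIP_STATES))
print(likelihood(TREE_AC_BD, TIP_STATES))
```

Under these assumptions the tree that groups A with B gives a probability of about 0.0657, against about 0.0081 for the tree that groups A with C, so the first tree explains these states better.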
Likelihood: the probability of the data observed given the hypothesis and assumptions, P(Data | Hypothesis)
Observed Data
A  ATTGTA
B  ATTGTA
C  ACCGCA
D  ACCGCA
Goal: to find that hypothesis that maximizes the
probability
[Figure: candidate tree with tips A, B, C, D, shown beside the observed data]
If this were the tree ... what would be the probability of these sequences evolving?
Probability of evolving sequences doesn't depend only on tree
Also depends on:
-the ancestral sequence, or the probabilities of various possible ancestral sequences
-the probabilities of mutations per unit time (rates of mutation per unit time)
-the times involved (branch lengths)
Model of sequence evolution:
-probabilities of bases at ancestor
-base rates of mutations per unit time
-site to site rate variation

Simplest: Jukes-Cantor 1969
A, C, G, T equally probable at ancestor
AND all substitutions equally likely (all rates equal)

Rates of change (rates of mutation per unit time)
     A   C   G   T
A    -   α   α   α
C    α   -   α   α
G    α   α   -   α
T    α   α   α   -

For each candidate tree, probability of sequences evolving will depend on branch lengths and the α parameter of the JC model
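When every substitution has the same rate α, as in the JC model described above, the probability of finding a given base at the end of a branch has a standard closed form. It is written out here as a reminder of how branch length and α enter the calculation; the symbol t for the duration of a branch and the notation P_ij(t) are introduced for this note and do not appear in the slides.

```latex
\[
P_{ii}(t) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha t},
\qquad
P_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha t} \quad (i \neq j)
\]
```

Only the product αt appears, which is why branch lengths and α have to be considered together when searching for the best combination.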
Example of likelihood analysis using JC 69
[Figure: candidate tree with tips A, B, C, D, and the JC 69 rate matrix (all off-diagonal rates α)]
Search: Try alternative trees, branch lengths and α's to find combination maximizing probability of observing the data
- simultaneous estimation
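A minimal sketch of one step of such an analysis: computing the log-likelihood of the four observed sequences on a fixed tree under JC 69, using the usual site-by-site summation over ancestral bases (Felsenstein's pruning approach). The branch lengths, the value of α, the tree encoding and the function names are all hypothetical choices for illustration; a real analysis would also optimize the branch lengths rather than fix them.

```python
import math

BASES = "ACGT"
SEQS = {"A": "ATTGTA", "B": "ATTGTA", "C": "ACCGCA", "D": "ACCGCA"}


def jc_prob(start, end, alpha, t):
    """JC 69 probability of seeing base `end` after time t, starting at `start`."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def conditional(node, site, seqs, alpha):
    """Probability of the tip bases below `node` at one site, for each base
    the node itself might have. A node is a tip name or a tuple of
    (child, branch_length) pairs."""
    if isinstance(node, str):
        return [1.0 if seqs[node][site] == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, length in node:
        child_l = conditional(child, site, seqs, alpha)
        for i, b in enumerate(BASES):
            result[i] *= sum(
                jc_prob(b, c, alpha, length) * child_l[j]
                for j, c in enumerate(BASES)
            )
    return result


def log_likelihood(tree, seqs, alpha):
    """Sum of per-site log-likelihoods, with equal base probabilities at the root."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(n_sites):
        cond = conditional(tree, site, seqs, alpha)
        total += math.log(sum(0.25 * x for x in cond))
    return total


T = 0.1      # hypothetical length for every branch
ALPHA = 1.0  # hypothetical JC rate
TREE_AB_CD = (((("A", T), ("B", T)), T), ((("C", T), ("D", T)), T))
TREE_AC_BD = (((("A", T), ("C", T)), T), ((("B", T), ("D", T)), T))

print(log_likelihood(TREE_AB_CD, SEQS, ALPHA))
print(log_likelihood(TREE_AC_BD, SEQS, ALPHA))
```

With these settings the tree that groups the identical sequences (A with B, C with D) has the higher log-likelihood. A full search would repeat the calculation while varying the tree, the branch lengths and α.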
More complex, realistic models

GTR (General time reversible)
all changes can differ in rate as long as symmetrical
equilibrium base frequencies might not be equal
     A     C     G     T
A    -     αfC   βfG   γfT
C    αfA   -     δfG   εfT
G    βfA   δfC   -     ρfT
T    γfA   εfC   ρfG   -

HKY 85
transitions and transversions can differ in rate
equilibrium base frequencies might not be equal
     A     C     G     T
A    -     αfC   βfG   αfT
C    αfA   -     αfG   βfT
G    βfA   αfC   -     αfT
T    αfA   βfC   αfG   -
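The pattern shared by these rate matrices (a symmetric rate for each pair of bases, multiplied by the equilibrium frequency of the target base) can be sketched as below. The function names, the example frequencies and the example rates are hypothetical, and the diagonal is filled with the negative row sum, which is the usual convention but is shown only as "-" on the slides.

```python
BASES = "ACGT"


def gtr_rate_matrix(exchange, freqs):
    """Build q[i][j] = (symmetric rate for the pair i, j) * freqs[j]."""
    q = {}
    for i in BASES:
        row = {j: exchange[frozenset((i, j))] * freqs[j] for j in BASES if j != i}
        row[i] = -sum(row.values())  # assumed convention: rows sum to zero
        q[i] = row
    return q


def hky_exchange(transition, transversion):
    """HKY 85 as a special case: one rate for transitions (A<->G, C<->T),
    another for all transversions."""
    transitions = {frozenset("AG"), frozenset("CT")}
    return {
        frozenset((i, j)): transition if frozenset((i, j)) in transitions else transversion
        for i in BASES for j in BASES if i < j
    }


def is_time_reversible(q, freqs, tol=1e-12):
    """Check freqs[i] * q[i][j] == freqs[j] * q[j][i] for every pair."""
    return all(
        abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) <= tol
        for i in BASES for j in BASES
    )


FREQS = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical frequencies
Q_HKY = gtr_rate_matrix(hky_exchange(transition=2.0, transversion=1.0), FREQS)
print(Q_HKY["A"])
print(is_time_reversible(Q_HKY, FREQS))
```

Because the pairwise rates are symmetric, the reversibility check passes for any choice of frequencies, which is what "as long as symmetrical" refers to.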
Another issue: not all sites evolve at the same rate
Protein coding: 2nd positions much slower, third
positions and introns fastest
Non-coding: e.g., ribosomal or tRNA: areas not
vital to secondary structure or interactions may
evolve quickly
Example of likelihood analysis using GTR + gamma
[Figure: candidate tree with tips A, B, C, D, evaluated under the GTR rate matrix + gamma rate variation]
Search: Try alternative trees, branch lengths, substitution rate parameters, equilibrium base frequencies and gamma shape parameters to find combination maximizing probability of observing the data
- simultaneous estimation of model & tree
One model of site-to-site rate variation:
The gamma distribution
- has one parameter, the "shape"
[Figure: gamma distribution curves for shape α = 0.5, α = 2, α = 50 and α = 200; horizontal axis: rate of evolution (slow to fast); vertical axis: frequency of sites with that rate]
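The effect of the shape parameter can be explored numerically. The sketch below evaluates a gamma density whose mean is fixed at 1, so that the shape is the only free parameter; fixing the mean this way is a common convention and is an assumption here, as is the function name. Note that this α is the gamma shape, not the α of the JC rate matrix.

```python
import math


def gamma_rate_density(rate, shape):
    """Density of a gamma distribution with the given shape and mean 1
    (scale = 1 / shape), evaluated at a relative rate of evolution."""
    if rate <= 0.0:
        return 0.0
    log_density = (
        shape * math.log(shape)
        - math.lgamma(shape)
        + (shape - 1.0) * math.log(rate)
        - shape * rate
    )
    return math.exp(log_density)


for shape in (0.5, 2.0, 50.0, 200.0):
    slow = gamma_rate_density(0.01, shape)
    average = gamma_rate_density(1.0, shape)
    print(shape, slow, average)
```

With a small shape such as 0.5 most of the density sits near zero (many slow sites and a few fast ones), while with a large shape such as 200 nearly all sites have rates close to the average.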
Better likelihoods with more complex model
fewer parameters
tree + JC (rate)
tree + HKY (transition & transversion rates + equil. freq.)
tree + GTR (6 rates + equil. freq.) + gamma rate variation
more parameters
but should use model only as complex as you need!
Can use likelihood ratio tests to test significance
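A minimal sketch of a likelihood ratio test between two nested models: twice the difference in log-likelihood is compared against a chi-square distribution whose degrees of freedom equal the number of extra parameters. The log-likelihood values below are invented for the example, the function names are hypothetical, and the chi-square approximation is assumed to hold.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction series.
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return (test statistic, p-value) for nested models."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf(statistic, extra_params)


# Invented log-likelihoods: JC versus HKY on the same tree.
# HKY adds 4 free parameters here (a transition/transversion ratio and
# three free base frequencies).
STATISTIC, P_VALUE = likelihood_ratio_test(-1500.0, -1490.0, extra_params=4)
print(STATISTIC, P_VALUE)
```

A small p-value suggests the extra parameters are justified by the data; a large one suggests the simpler model is as complex as is needed.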
Practical difficulties: computation
Searching among all possible
-trees
-branch lengths
-rate matrix parameters
-equilibrium base frequencies
-gamma shape parameters
to find the combination maximizing probability is not easy!
Parsimony
- originally justified as “simplicity of explanation”
Likelihood
- statistically best justified, but imposes assumptions of uniformity of process across characters & branches of tree
both make explicit predictions about character distributions, and assess disagreement between predicted & observed character distributions
Can’t search exhaustively among all trees
Heuristic search (“good guess”)
- Build initial tree by adding taxa
- Hill-climbing algorithm making adjustments to tree; at each step check optimality criterion to see what adjustment to make
Difficulties of searching for optimal trees
number of taxa*    number of possible trees
3                  3
4                  15
5                  105
6                  945
7                  10395
10                 3.4 e7
20                 8.2 e21
30                 4.95 e38
1,000,000          1 e5,866,723
*terminal taxa, OTU’s
The age of the universe is about 5 e29 picoseconds
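The small entries in the table match the count of rooted, bifurcating trees on n labelled taxa, which is the product 3 × 5 × ... × (2n - 3). The sketch below reproduces them; reading the table as rooted trees is an inference from the numbers, and the function names are hypothetical.

```python
import math


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees: 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


def log10_rooted_trees(n_taxa):
    """Base-10 logarithm of the same count, usable when n is very large.
    Uses (2n-3)!! = (2n-2)! / (2^(n-1) * (n-1)!)."""
    log_count = (
        math.lgamma(2 * n_taxa - 1)
        - (n_taxa - 1) * math.log(2.0)
        - math.lgamma(n_taxa)
    )
    return log_count / math.log(10.0)


for n in (3, 4, 5, 6, 7, 10, 20, 30):
    print(n, num_rooted_trees(n))
print(1_000_000, log10_rooted_trees(1_000_000))
```

The count grows faster than exponentially, which is why exhaustive search is ruled out for all but the smallest data sets.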
Distance methods (popular 60’s & 70’s; Neighbor
Joining brought back in 90’s)
Optimality methods
Parsimony (popular 80’s & 90’s; remains popular
with morphological data)
Likelihood (popular 90’s & 00’s)
Success?
Jumping spiders: different data agree
[Figure: tree from visible structures & color, beside trees from four separate gene regions]