Phylo_3

Phylogenetic Trees
Lecture 3
Based on: Durbin et al 7.4; Gusfield 17
Character-based methods
for constructing phylogenies
In this approach, trees are constructed by comparing
the characters of the corresponding species.
Characters may be morphological (teeth structures)
or molecular (homologous DNA sequences). One
common approach is Maximum Parsimony.
Assumptions:
Independence of characters (no interactions)
Best tree is one where minimal changes take place
1. Maximum Parsimony
Input: four nucleotide sequences: AAG, AAA, GGA, AGA
taken from four species.
Question: Which evolutionary tree best explains these
sequences?
One answer (the parsimony principle): pick a tree that
has the minimum total number of symbol substitutions
between each node and its ancestor in the phylogenetic
tree.
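[Figure: an example tree for the four sequences with ancestral nodes labeled AAA. The edges leading to AAG, GGA and AGA carry 1, 2 and 1 substitutions. Total #substitutions = 4]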
Example Continued
There are many trees possible. For example:
[Figure: two alternative trees for the same four sequences, with ancestral nodes labeled AAA and AGA. Left tree: Total #substitutions = 3. Right tree: Total #substitutions = 4.]
The left tree is preferred over the right tree.
The total number of changes is called the parsimony score.
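As a minimal sketch of how such a score is tallied, the snippet below sums the per-edge substitution counts of a fully labeled tree. The node names (`root`, `u`, `v`, `s1`..`s4`) and the particular ancestral labeling are assumptions chosen for illustration; they are one labeling that achieves a total of 3 for the four sequences above, not necessarily the one drawn on the slide.

```python
def parsimony_score(parent, label):
    """Sum, over all edges, the number of positions where child and parent differ."""
    total = 0
    for child, par in parent.items():
        total += sum(x != y for x, y in zip(label[child], label[par]))
    return total

# Hypothetical labeling: internal nodes u = AAA and v = AGA under a root AAA.
label = {
    "root": "AAA", "u": "AAA", "v": "AGA",
    "s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA",
}
# parent[x] is the node directly above x in the tree.
parent = {"u": "root", "v": "root", "s1": "u", "s2": "u", "s3": "v", "s4": "v"}

score = parsimony_score(parent, label)
print(score)
```

For this labeling the edges root-v, u-s1 and v-s3 each contribute one substitution and the remaining edges contribute none.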
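Simple Example
• Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position
• A minimal tree has one evolutionary change:
[Figure: a tree over the five species in which a single change between T and C on one edge separates the three ‘C’ species from the two ‘T’ species.]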
Extension to Many Letters
• What is the parsimony score of
A: Aardvark  CAGGTA
B: Bison     CAGACA
C: Chimp     CGGGTA
D: Dog       TGCACT
E: Elephant  TGCGTA
We do it character after character;
each score is computed independently
of the others.
Fitch’s Algorithm for Evaluating Trees
Step 1: Traverse the tree from leaves to root, determining the
set of possible states (e.g. nucleotides) for each
internal node
Step 2: Traverse the tree from root to leaves, picking
ancestral states for internal nodes
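Fitch’s Algorithm – Step 1
• # of changes = # union operations
[Figure: example tree with leaf states A, C, G, T and the candidate state sets computed at internal nodes, e.g. {A,G,T}, {C,T}, {G,T} and {T}.]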
Fitch’s Algorithm – Step 1
• Do a post-order (from leaves to root) traversal of the tree
• Determine the possible states R_i of internal node i with
children j and k:
\[ R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases} \]
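The bottom-up rule can be sketched in a few lines. This is an illustrative sketch for a single character, assuming a rooted binary tree written as nested 2-tuples of leaf names; the function name `fitch_bottom_up` and the leaf names are hypothetical. The example reuses the five-species case with three ‘C’ and two ‘T’.

```python
def fitch_bottom_up(tree, leaf_state):
    """Return (R, changes) for the subtree `tree`.

    tree is either a leaf name (str) or a 2-tuple (left, right);
    leaf_state maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left, right = tree
    r_j, changes_j = fitch_bottom_up(left, leaf_state)
    r_k, changes_k = fitch_bottom_up(right, leaf_state)
    common = r_j & r_k
    if common:
        # Non-empty intersection: no extra change is needed at this node.
        return common, changes_j + changes_k
    # Empty intersection: take the union and count one change.
    return r_j | r_k, changes_j + changes_k + 1

tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
leaf_state = {"s1": "C", "s2": "C", "s3": "C", "s4": "T", "s5": "T"}
root_set, n_changes = fitch_bottom_up(tree, leaf_state)
print(root_set, n_changes)
```

Only the root requires a union here, so a single change is counted.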
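Fitch’s Algorithm – Step 2
[Figure: the example tree with its candidate state sets ({A,G,T}, {C,T}, {G,T}, {T}), from which the top-down pass picks one state per internal node.]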
Fitch’s Algorithm – Step 2
Do a pre-order (from root to leaves) traversal of the tree
Select state r_j of internal node j with parent i:
\[ r_j = \begin{cases} r_i & \text{if } r_i \in R_j \\ \text{arbitrary state} \in R_j & \text{otherwise} \end{cases} \]
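Both passes can be put together as follows. This is a sketch under assumptions: the tree is a nested 2-tuple of leaf names, the "arbitrary" choice is made deterministic by taking the alphabetically smallest state, and all function names are illustrative.

```python
def candidate_sets(tree, leaf_state, sets):
    """Step 1 (post-order): fill sets[node] with the candidate states R."""
    if isinstance(tree, str):
        sets[tree] = {leaf_state[tree]}
    else:
        for child in tree:
            candidate_sets(child, leaf_state, sets)
        left, right = tree
        common = sets[left] & sets[right]
        sets[tree] = common if common else sets[left] | sets[right]
    return sets

def pick_states(tree, sets, parent_state=None, chosen=None):
    """Step 2 (pre-order): keep the parent's state when allowed, else pick one."""
    if chosen is None:
        chosen = {}
    options = sets[tree]
    # min() stands in for the "arbitrary state" of the rule (an assumption).
    chosen[tree] = parent_state if parent_state in options else min(options)
    if not isinstance(tree, str):
        for child in tree:
            pick_states(child, sets, chosen[tree], chosen)
    return chosen

def count_changes(tree, chosen):
    """Count the edges whose two endpoints were assigned different states."""
    if isinstance(tree, str):
        return 0
    return sum((chosen[c] != chosen[tree]) + count_changes(c, chosen) for c in tree)

tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
leaf_state = {"s1": "C", "s2": "C", "s3": "C", "s4": "T", "s5": "T"}
sets = candidate_sets(tree, leaf_state, {})
chosen = pick_states(tree, sets)
print(chosen[tree], count_changes(tree, chosen))
```

The number of changes implied by the chosen states equals the number of union operations performed in Step 1.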
Weighted Version of Fitch’s Algorithm
Instead of assuming all state changes are equally likely,
use different costs c(a, b) for different changes
a → b
The first step of the algorithm is to propagate costs up through the tree
Weighted Version of Fitch’s Algorithm
Want to determine the minimal cost S(i, a)
of assigning character a to node i
For leaves:
\[ S(i,a) = \begin{cases} 0 & \text{if } a \text{ is the character at leaf } i \\ \infty & \text{otherwise} \end{cases} \]
Weighted Version of Fitch’s Algorithm
Want to determine the minimal cost S(i, a)
of assigning character a to node i
For internal nodes i with children j and k:
\[ S(i,a) = \min_b \big( S(j,b) + c(a,b) \big) + \min_b \big( S(k,b) + c(a,b) \big) \]
[Figure: node i with state a above its children j and k, each with state b.]
Weighted Version of Fitch’s Algorithm – Step 2
Do a pre-order (from root to leaves) traversal of the tree
Select the minimal cost character for the root
For each internal node j, select the character that produced
the minimal cost at parent i
Weighted Parsimony Scores
Weighted Parsimony score:
Each change is weighted by a score c(a, b).
The weighted parsimony score reduces to the
parsimony score when c(a,a)=0 and c(a,b)=1
for all b ≠ a.
Evaluating Weighted Parsimony Scores
Each position is independent and computed by itself.
Use Dynamic Programming on a given tree.

If i is a node with children j and k, then
\[ S(i,a) = \min_x \big( S(j,x) + c(a,x) \big) + \min_y \big( S(k,y) + c(a,y) \big) \]
where S(i, a) is the minimum score of the subtree rooted
at i when i has character a.
[Figure: node i with score S(i,a) above its children j and k with scores S(j,x) and S(k,y).]
Evaluating Parsimony Scores
Dynamic programming on a given tree
Initialization:
• For each leaf i set S(i,a) = 0 if i is labeled by a, otherwise
S(i,a) = ∞
Iteration:
• If i is a node with children j and k, then
S(i,a) = min_x(S(j,x)+c(a,x)) + min_y(S(k,y)+c(a,y))
Termination:
• The cost of the tree is min_x S(r,x), where r is the root
Comment:
To reconstruct an optimal assignment, we need to keep in each
node i and for each character a the two characters x, y that
bring about the minimum when i has character a.
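The initialization, iteration and termination steps can be sketched for one character as follows. The tree encoding (nested 2-tuples of leaf names), the function names and the transition/transversion cost function are assumptions made for illustration; the cost values 1 and 2 are arbitrary examples, not values from the lecture.

```python
import math

STATES = "ACGT"

def sankoff(tree, leaf_state, cost):
    """Return {a: S(node, a)} for the root of `tree` (weighted parsimony DP)."""
    if isinstance(tree, str):
        # Initialization: 0 for the observed character, infinity otherwise.
        return {a: (0 if a == leaf_state[tree] else math.inf) for a in STATES}
    left, right = tree
    s_j = sankoff(left, leaf_state, cost)
    s_k = sankoff(right, leaf_state, cost)
    # Iteration: best child state x (resp. y) for every parent state a.
    return {
        a: min(s_j[x] + cost(a, x) for x in STATES)
        + min(s_k[y] + cost(a, y) for y in STATES)
        for a in STATES
    }

def unit_cost(a, b):
    """c(a,a)=0 and c(a,b)=1: reduces to the unweighted parsimony score."""
    return 0 if a == b else 1

def ts_tv_cost(a, b):
    """Hypothetical weights: transitions (A<->G, C<->T) cost 1, transversions 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2

tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
leaf_state = {"s1": "C", "s2": "C", "s3": "C", "s4": "T", "s5": "T"}
# Termination: the tree cost is the minimum over the root's states.
unweighted = min(sankoff(tree, leaf_state, unit_cost).values())
weighted = min(sankoff(("x", "y"), {"x": "A", "y": "C"}, ts_tv_cost).values())
print(unweighted, weighted)
```

Storing the minimizing x and y alongside each S(i,a) would allow the optimal assignment to be traced back, as the comment above describes; the sketch keeps only the costs.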
Cost of Evaluating Parsimony for binary trees
If there are n nodes, m characters, and k possible
values for each character, then the complexity is
$O(nmk^2)$.
Of course, we still need to search over ALL
possible trees and find the best one. One
usually resorts to heuristic search techniques.
Exploring the Space of Trees
We’ve considered how to find the minimum
number of changes for a given tree topology
We need some search procedure for exploring the
space of tree topologies
Given n sequences there are (2n-3)!! possible rooted trees, where
\[ (2n-3)!! = 3 \cdot 5 \cdots (2n-3) \]
Counting Trees
[Figure: for n=3 there is one unrooted tree; for n=4 there are 3 unrooted trees.]
A rooted tree with n leaves has (2n-1) nodes and (2n-2) edges,
discounting the edge to the root; hence an unrooted tree has (2n-3)
edges. A new leaf can be attached to any one of these edges, and each
additional leaf adds two edges. Therefore we have
1 • 3 • 5 • … • (2n-5) unrooted trees with n leaves.
Each such tree has (2n-3) edges, any of which can be chosen as the
position of the root of a rooted tree. Hence we have
1 • 3 • 5 • … • (2n-5) • (2n-3) rooted trees with n leaves
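The two products can be evaluated directly. The sketch below follows the counting argument above (each new leaf can be attached to any existing edge); the function names are illustrative.

```python
def num_unrooted_trees(n):
    """1 * 3 * 5 * ... * (2n-5): unrooted binary trees with n labeled leaves."""
    count = 1
    for leaves in range(4, n + 1):
        # The tree with (leaves - 1) leaves has 2*(leaves - 1) - 3 edges to attach to.
        count *= 2 * leaves - 5
    return count

def num_rooted_trees(n):
    """Each unrooted tree can be rooted on any of its (2n-3) edges."""
    return num_unrooted_trees(n) * (2 * n - 3)

for n in (4, 5, 6, 8, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```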
Exploring the Space of Trees
taxa (n)    # of rooted trees
4           15
5           105
6           945
8           135,135
10          34,459,425
Maximum Parsimony
Site        1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
Maximum Parsimony
How many possible unrooted trees?
[Figure: the alignment of Species 1-4 (sites 1-10) shown with the three possible unrooted trees for four species: (1,2 | 3,4), (1,3 | 2,4) and (1,4 | 2,3).]
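Maximum Parsimony
How many substitutions?
[Figure: two reconstructions of a single site with states A and G on a tree of the four species. One requires 1 change and the other 5 changes; the 1-change reconstruction is the maximum parsimony (MP) one.]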
Maximum Parsimony
[Figure: the alignment of the four species with the three candidate unrooted trees. Changes counted so far (site 1): Tree 1: 0; Tree 2: 0; Tree 3: 0.]
Maximum Parsimony
[Figure: the alignment of the four species with the three candidate unrooted trees. Changes counted so far (sites 1-2): Tree 1: 0 3; Tree 2: 0 3; Tree 3: 0 3.]
Maximum Parsimony
[Figure: site 2 (1-G, 2-C, 3-T, 4-A) mapped onto each of the three trees; each tree requires 3 changes.]
Maximum Parsimony
[Figure: the alignment of the four species with the three candidate unrooted trees. Changes counted so far (sites 1-3): Tree 1: 0 3 2; Tree 2: 0 3 2; Tree 3: 0 3 2.]
Maximum Parsimony
[Figure: the alignment of the four species with the three candidate unrooted trees. Changes counted so far (sites 1-4): Tree 1: 0 3 2 2; Tree 2: 0 3 2 2; Tree 3: 0 3 2 1.]
Maximum Parsimony
[Figure: site 4 (1-G, 2-A, 3-A, 4-G) mapped onto each of the three trees; the trees require 2, 2 and 1 changes respectively.]
Maximum Parsimony
Changes per site for each of the three trees:
Site      1 2 3 4 5 6 7 8 9 10   Total
Tree 1    0 3 2 2 0 1 1 1 1 3    14
Tree 2    0 3 2 2 0 1 2 1 2 3    16
Tree 3    0 3 2 1 0 1 2 1 2 3    15
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
4-A A T G T T G T C G
[Figure: the most parsimonious of the three trees.]
Changes per site: 0 3 2 2 0 1 1 1 1 3 (total 14)
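For four species, the per-site count on each of the three unrooted trees can be checked with a short sketch. It applies Fitch's rule to the two pairs on either side of the central branch; the function name, the species keys 1-4 and the `SPLITS` list are illustrative assumptions. The usage example scores sites 4 and 7 of the alignment.

```python
def quartet_changes(states, split):
    """Fitch count for one site on the unrooted tree (a,b | c,d).

    states maps species -> state; split is ((a, b), (c, d)).
    """
    changes = 0
    side_sets = []
    for x, y in split:
        if states[x] == states[y]:
            side_sets.append({states[x]})
        else:
            side_sets.append({states[x], states[y]})
            changes += 1  # one change inside this pair
    if not (side_sets[0] & side_sets[1]):
        changes += 1  # one more change across the central branch
    return changes

SPLITS = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]

site4 = {1: "G", 2: "A", 3: "A", 4: "G"}
site7 = {1: "A", 2: "A", 3: "G", 4: "G"}
site4_scores = [quartet_changes(site4, s) for s in SPLITS]
site7_scores = [quartet_changes(site7, s) for s in SPLITS]
print(site4_scores, site7_scores)
```

Summing such counts over all sites gives the length of each tree; the tree with the smallest total is the most parsimonious.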
Finding most parsimonious trees - exact solutions
• Exact solutions can only be used for small numbers of taxa.
• Exhaustive search examines all possible trees.
• Typically used for problems with fewer than 10 taxa.
Finding most parsimonious trees - exhaustive search
(1) Starting tree, any 3 taxa (A, B, C)
(2a), (2b), (2c) Add fourth taxon (D) in each of three possible positions: three trees
Add fifth taxon (E) in each of the five possible positions on each
of the three trees -> 15 trees, and so on
[Figure: the starting tree (1), the three four-taxon trees (2a, 2b, 2c), and the positions at which E can be added to each of them.]
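The stepwise construction can be sketched as an enumeration over edge lists. This is a sketch under assumptions: an unrooted tree is a list of (node, node) edges, internal nodes get hypothetical names `n0`, `n1`, ..., and the function names are illustrative.

```python
def add_taxon_everywhere(edges, new_leaf, new_internal):
    """Return one new tree per edge: split that edge and hang new_leaf on it."""
    trees = []
    for idx, (u, v) in enumerate(edges):
        rest = edges[:idx] + edges[idx + 1:]
        trees.append(rest + [(u, new_internal), (new_internal, v),
                             (new_internal, new_leaf)])
    return trees

def all_unrooted_trees(taxa):
    """Enumerate every unrooted binary tree on the given taxa (at least 3)."""
    # Starting tree: the first three taxa joined at one internal node.
    trees = [[(taxa[0], "n0"), (taxa[1], "n0"), (taxa[2], "n0")]]
    for i, taxon in enumerate(taxa[3:], start=1):
        trees = [t for tree in trees
                 for t in add_taxon_everywhere(tree, taxon, f"n{i}")]
    return trees

for k in (3, 4, 5, 6):
    print(k, len(all_unrooted_trees(list("ABCDEF")[:k])))
```

An exhaustive search would score every tree in this list, which is only practical for a small number of taxa.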
Finding most parsimonious trees - exact solutions
• Branch and bound saves time by discarding, during tree
construction, families of trees that cannot be smaller than
the smallest tree found so far. (Here “smaller” means more
parsimonious.) This works because adding taxa to a partial
tree can never decrease its length.
• Can be enhanced by specifying an initial upper
bound for tree length.
• Typically used only for problems with fewer than 20
taxa.
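A compact sketch of the idea, not a tuned implementation: partial trees are grown one taxon at a time, and a branch is abandoned as soon as its Fitch length reaches the best complete length found so far. The edge-list encoding, the function names and the internal node names are assumptions. The five sequences are those of the earlier "Extension to Many Letters" example, and the pairing of names with sequences is itself assumed.

```python
import math

def _fitch(node, parent, adj, seqs, site):
    """Return (state set, changes) for the part of the tree below `node`."""
    if node in seqs:
        return {seqs[node][site]}, 0
    result, changes = None, 0
    for child in adj[node]:
        if child == parent:
            continue
        child_set, child_changes = _fitch(child, node, adj, seqs, site)
        changes += child_changes
        if result is None:
            result = child_set
        elif result & child_set:
            result = result & child_set
        else:
            result = result | child_set
            changes += 1
    return result, changes

def tree_length(edges, seqs):
    """Parsimony length of an unrooted tree whose leaves are keys of seqs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    start = next(node for node in adj if node in seqs)  # traverse from a leaf
    total = 0
    for site in range(len(seqs[start])):
        below, changes = _fitch(adj[start][0], start, adj, seqs, site)
        total += changes + (seqs[start][site] not in below)
    return total

def search(seqs, use_bound=True):
    """Return (best length, number of partial or complete trees scored)."""
    taxa = list(seqs)
    best, scored = [math.inf], [0]

    def extend(edges, k):
        scored[0] += 1
        length = tree_length(edges, seqs)
        if use_bound and length >= best[0]:
            return  # no completion of this tree can beat the best one
        if k == len(taxa):
            best[0] = min(best[0], length)
            return
        for idx, (u, v) in enumerate(edges):
            w = f"n{k}"
            extend(edges[:idx] + edges[idx + 1:]
                   + [(u, w), (w, v), (w, taxa[k])], k + 1)

    extend([(taxa[0], "n"), (taxa[1], "n"), (taxa[2], "n")], 3)
    return best[0], scored[0]

seqs = {"Aardvark": "CAGGTA", "Bison": "CAGACA", "Chimp": "CGGGTA",
        "Dog": "TGCACT", "Elephant": "TGCGTA"}
bounded = search(seqs, use_bound=True)
exhaustive = search(seqs, use_bound=False)
print(bounded, exhaustive)
```

Both runs return the same minimum length; the bounded run never scores more trees than the exhaustive one.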
Finding most parsimonious trees:
branch and bound
[Figure: branch-and-bound search tree. Starting from the three-taxon tree (A, B, C), taxon D is added to give trees B1, B2 and B3; taxon E is then added to each of these, giving trees C1.1-C1.5, C2.1-C2.5 and C3.1-C3.5.]
Finding most parsimonious trees - heuristics
• The number of possible trees grows faster than
exponentially with the number of taxa, making
exhaustive searches impractical for many data
sets (finding a most parsimonious tree is NP-hard)
• Heuristic methods are used to search tree space
for most parsimonious trees
• The trees found are not guaranteed to be the
most parsimonious - they are best guesses
Finding most parsimonious trees - heuristics
Stepwise addition:
As is - taxa are added in the order in which they appear in the data matrix
Closest - starts with the shortest 3-taxon tree and adds taxa in
the order that produces the least increase in tree length
Simple - the first taxon in the matrix is taken as a
reference; taxa are added to it in order of their
decreasing similarity to the reference
Random - taxa are added in a random sequence; many
different sequences can be used
Recommended: random, with as many addition sequences
(e.g. 10-100) as practical
Finding most parsimonious trees - heuristics
Branch Swapping:
Nearest neighbor interchange (NNI)
Subtree pruning and regrafting (SPR)
Tree bisection and reconnection (TBR)
Finding most parsimonious trees - heuristics 1
Nearest neighbor interchange (NNI)
[Figure: a tree on taxa A-G and the two trees obtained from it by exchanging subtrees across one internal branch.]
Finding most parsimonious trees - heuristics 2
Subtree pruning and regrafting (SPR)
[Figure: a subtree is pruned from a tree on taxa A-G and regrafted onto a different branch of the remaining tree.]
Finding most parsimonious trees - heuristics 3
Tree bisection and reconnection (TBR)
[Figure: a tree on taxa A-G is bisected into two subtrees, which are then reconnected by joining a branch of one to a branch of the other.]
Finding most parsimonious trees - heuristics - summary
Branch Swapping:
Nearest neighbor interchange (NNI)
Subtree pruning and regrafting (SPR)
Tree bisection and reconnection (TBR)
• The nature of heuristic searches means we cannot
know which method will find the most parsimonious
trees or all such trees.
• However, TBR is the most extensive swapping
routine and its use with multiple random addition
sequences should work well.
Tree space may be populated by local minima
and islands of most parsimonious trees
[Figure: tree length plotted across tree space. Random addition sequence replicates start the search at different points; branch swapping from some starting points reaches the global minimum (success), while from others it ends in a local minimum (failure).]
Multiple most parsimonious trees
• Many parsimony analyses yield multiple equally optimal trees
• Multiple trees are due to either:
- Alternative equally parsimonious optimizations of
homoplastic characters
- Missing data
- Or both
• We can further select among these trees with additional
criteria, but most commonly the relationships common to all
the optimal trees are summarized with consensus trees
Consensus methods - 1
• A consensus tree is a summary of the agreement
among a set of fundamental trees
• There are many different consensus methods that
differ in:
1. the kind of agreement
2. the level of agreement
• Consensus methods can be used with any type of
tree - not just parsimony trees
Strict consensus methods - 1
• Strict consensus methods require agreement
across all the fundamental trees
• They show only those relationships that are
unambiguously supported by the parsimonious
interpretation of the data
• The commonest method (strict component
consensus) focuses on clades
• This method produces a consensus tree that
includes all and only those clades found in all the
fundamental trees
• Other relationships (those in which the
fundamental trees disagree) are shown as
unresolved polytomies
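The strict component consensus can be sketched as a set intersection over clades. The sketch assumes rooted trees written as nested tuples of taxon names; the function names and the two small example trees are made up for illustration and are not the trees shown on the slides.

```python
def clade_sets(tree, out=None):
    """Return (leaves of tree, set of clades), one clade per internal node."""
    if out is None:
        out = set()
    if isinstance(tree, str):
        return frozenset([tree]), out
    leaves = frozenset()
    for child in tree:
        child_leaves, _ = clade_sets(child, out)
        leaves |= child_leaves
    out.add(leaves)
    return leaves, out

def strict_consensus_clades(trees):
    """Keep all and only those clades that occur in every fundamental tree."""
    return set.intersection(*(clade_sets(t)[1] for t in trees))

# Two hypothetical fundamental trees that disagree about A, B and C.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
shared = strict_consensus_clades([t1, t2])
print(sorted("".join(sorted(c)) for c in shared))
```

Here the clades {A,B,C} and {D,E} survive, while the disagreement inside {A,B,C} would be drawn as an unresolved polytomy.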
Strict consensus methods - 2
[Figure: two fundamental trees on taxa A-G and their strict component consensus tree.]
Majority-rule consensus methods
• Majority-rule consensus methods require
agreement across a majority of the fundamental
trees
• May include relationships that are not supported
by the most parsimonious interpretation of the data
• The commonest method focuses on clades
• This method produces a consensus tree that
includes all and only those clades found in a
majority (>50%) of the fundamental trees
• Other relationships are shown as unresolved
polytomies
• Of particular use in bootstrapping
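The majority rule can be sketched by counting how often each clade occurs. As before, the nested-tuple tree encoding, the function names and the three example trees are illustrative assumptions.

```python
from collections import Counter

def clades_of(tree, out=None):
    """Return (leaves of tree, set of clades), one clade per internal node."""
    if out is None:
        out = set()
    if isinstance(tree, str):
        return frozenset([tree]), out
    leaves = frozenset()
    for child in tree:
        child_leaves, _ = clades_of(child, out)
        leaves |= child_leaves
    out.add(leaves)
    return leaves, out

def majority_rule_clades(trees):
    """Map each clade found in more than half of the trees to its frequency (%)."""
    counts = Counter()
    for tree in trees:
        counts.update(clades_of(tree)[1])
    return {clade: 100 * k // len(trees)
            for clade, k in counts.items() if 2 * k > len(trees)}

# Three hypothetical fundamental trees on four taxa.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    (("A", ("B", "C")), "D"),
]
support = majority_rule_clades(trees)
for clade, pct in sorted(support.items(), key=lambda item: sorted(item[0])):
    print("".join(sorted(clade)), pct)
```

Clades found in two of the three trees are reported with a frequency of 66, and the clade of all taxa with 100.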
Majority rule consensus
[Figure: three fundamental trees on taxa A-G and their majority-rule component consensus tree. Numbers indicate the frequency of clades in the fundamental trees (here 100 or 66).]
Reduced consensus methods - 1
• Focuses upon any cladistic relationships
(statements that some taxa are more closely related
to each other than to some other taxa)
• Reduced consensus methods occur in strict and
majority-rule varieties
• Other relationships are shown as unresolved
polytomies
• May be more sensitive than methods focusing only
on clades
Reduced consensus methods - 2
[Figure: two fundamental trees on taxa A-G. Their strict component consensus is completely unresolved, whereas the strict reduced cladistic consensus tree, from which taxon G is excluded, retains the relationships among A-F.]
Consensus methods - 2
[Figure: three fundamental trees for Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Tracheloraphis, Spirostomum, Euplotes and Gruberia, together with their strict (component), strict reduced cladistic (Euplotes excluded) and majority-rule consensus trees. Clade frequencies on the majority-rule tree are 100 or 66.]
Consensus methods - 3
• Use strict methods to identify those
relationships unambiguously supported by
parsimonious interpretation of the data
• Use reduced methods where consensus trees
are poorly resolved
• Use majority-rule methods in bootstrapping
• Avoid other methods which have ambiguous
interpretations
Parsimony - advantages
• is a simple method - easily understood operation
• does not seem to depend on an explicit model of
evolution
• gives both trees and associated hypotheses of
character evolution
• should give reliable results if the data are well
structured and homoplasy is either rare or randomly
distributed on the tree
Parsimony - disadvantages
• May give misleading results if homoplasy is common or
concentrated in particular parts of the tree, e.g.:
- thermophilic convergence
- base composition biases
- long branch attraction
• Underestimates branch lengths
• Model of evolution is implicit - behaviour of method not well
understood
• Parsimony is often justified on purely philosophical grounds - we
must prefer simplest hypotheses - particularly by
morphologists
• For most molecular systematists this is uncompelling
Parsimony can be inconsistent
• Felsenstein (1978) developed a simple model phylogeny including four
taxa and a mixture of short and long branches
• Under this model parsimony will give the wrong tree
[Figure: model tree on taxa A, B, C and D with two long branches (rate or branch length p) and three short branches (q), where p >> q, next to the (wrong) parsimony tree. Long branches are attracted but the similarity is homoplastic.]
• With more data the certainty that parsimony will give the wrong tree
increases - so that parsimony is statistically inconsistent.
• Advocates of parsimony initially responded by claiming that Felsenstein’s
result showed only that his model was unrealistic.
• It is now recognized that long-branch attraction (in the Felsenstein Zone)
is one of the most serious problems in phylogenetic inference.
2. Perfect Phylogeny
Data on species is given by a Character State Matrix.
Cell (p, i) has value j iff character i of object (species) p has state j.
Goal: construct an evolutionary tree for the species.

Object \ Character   c1  c2  c3  c4  c5
A                     1   1   2   0   0
B                     2   0   1   2   1
C                     3   2   3   3   1
D                     0   3   4   1   0
E                     1   1   0   0   1
Motivation: Evolution Tree
Internal nodes correspond to speciation events, where
some character (attribute) is acquired.
Assumptions:
1. No reversals (characters are not lost)
2. No convergences (a character is created only once)
Perfect Phylogeny for a 0-1 Matrix
A 0-1 matrix: Each character is either 0 (absent) or 1 (present).
A perfect phylogeny for the matrix is a tree T with no convergence and no
reversals, such that:
• Each of the n objects labels exactly one leaf of T
• Each of the m characters labels exactly one edge of T
• Object p has exactly the characters labeling the path from p to the
root.

Object   1  2  3  4  5
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  0  1
D        0  0  1  1  0
E        0  1  0  0  0

[Figure: a perfect phylogeny for this matrix, with edges labeled by characters 1-5 and leaves A, B, C, D, E.]
The (Binary) Perfect Phylogeny Problem
Problem: Given a 0-1 matrix M, determine if it has a
perfect phylogeny, and construct one if it does.
(Note: edges are labeled by characters: an edge labeled by i
represents changing character i’s state from 0 to 1.)
[Figure: the 0-1 matrix for objects A-E and characters 1-5, next to its perfect phylogeny with edges labeled 1-5.]
Solution to Perfect Phylogeny Problem
Definition: Given a 0-1 matrix M, O_k = {j : M_jk = 1}; i.e., O_k is the
set of objects that have character k.
Theorem: M has a perfect phylogenetic tree iff the sets {O_i} are
laminar, i.e., for all i, j, either O_i and O_j are disjoint, or one
includes the other.

Laminar:
Object   1  2  3  4  5
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  0  1
D        0  0  1  1  0
E        0  1  0  0  0

Not Laminar:
Object   1  2  3  4  5
A        1  1  0  0  0
B        0  0  1  0  1
C        1  1  0  0  1
D        0  0  1  1  0
E        0  1  0  0  1
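The laminar condition can be tested directly from its definition by comparing every pair of sets. This is a straightforward quadratic sketch, not the efficient method; the function names are illustrative, and the two matrices are the laminar and non-laminar examples above.

```python
from itertools import combinations

OBJECTS = ["A", "B", "C", "D", "E"]

def object_sets(matrix, objects):
    """Return [O_1, ..., O_m]: the objects having each character."""
    n_chars = len(matrix[0])
    return [frozenset(o for o, row in zip(objects, matrix) if row[k])
            for k in range(n_chars)]

def is_laminar(sets):
    """True iff every pair of sets is disjoint or nested."""
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(sets, 2))

laminar = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
not_laminar = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]
print(is_laminar(object_sets(laminar, OBJECTS)))
print(is_laminar(object_sets(not_laminar, OBJECTS)))
```

In the second matrix, O_3 = {B, D} and O_5 = {B, C, E} share B without either containing the other, so the test fails.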
Proof
(⇒): Assume M has a perfect phylogeny, and let i, j be given.
Consider the edges labeled i and j.
Case 1: There is a root to leaf path containing both. Then one of O_i, O_j
includes the other (characters 2 and 1 below).
Case 2: Not case 1. Then O_i and O_j are disjoint (characters 2 and 3 below).
[Figure: the perfect phylogeny with edges labeled 1-5 and leaves A, B, C, D, E.]
Proof (cont.)
(⇐): Assume for all i, j, either O_i and O_j are disjoint, or one includes the
other. We prove by induction on the number of characters that M has a
perfect phylogeny.
Basis: one character. Then there are at most two objects, one with and
one without this character.
[Figure: a one-character matrix (A = 1, B = 0) and its tree, in which the edge labeled 1 leads to A and B hangs from the root.]
Proof (cont.)
Induction step: Assume correctness for n-1 characters, and
consider a matrix with n characters (non-zero columns).
WLOG assume that O_1 is not contained in O_j for j > 1.
Let S1 be the set of objects that have character 1, and S2 be the
remaining objects. Then each character belongs to objects in S1 or S2,
but not both. By induction there are trees T1 and T2 for S1 and S2.
Combining them as below gives the desired tree.

Object   1  2  3  4  5
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  0  1
D        0  0  1  1  0
E        1  0  0  0  0

[Figure: the trees T1 and T2 joined at a common root, with the edge leading to T1 labeled 1.]
Efficient Implementation
1. Sort the columns by decreasing value when considered as binary
numbers. (Time complexity: O(mn), using radix sort.)
Claim: If the binary value of column i is larger than that of column j,
then O_i is not a proper subset of O_j.
Proof: If O_i were a proper subset of O_j, every 1 in column i would be
covered by a 1 in column j, and column j would have the larger value.

Original column order:
Object   1  2  3  4  5
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  0  1
D        0  0  1  1  0
E        0  1  0  0  0

Sorted column order:
Object   2  1  3  5  4
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  1  0
D        0  0  1  0  1
E        1  0  0  0  0
Efficient Implementation (2)
2. Make a backwards linked list of the 1’s in each row (the leftmost 1 in
each row points at itself). Time complexity: O(mn).
Claim: If the columns are sorted, then the set of columns is laminar iff
for each column i, all the links leaving column i point at the same
column. This can be checked in O(mn) time.

Object   2  1  3  5  4
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  1  0
D        0  0  1  0  1
E        1  0  0  0  0
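Steps 1 and 2 can be sketched together as follows. For brevity this sketch sorts the columns with Python's built-in `sorted` rather than a radix sort, so it does not achieve the O(mn) bound; the function name and the `link` dictionary are illustrative choices, and rows are assumed to be listed with the first object as the most significant bit.

```python
def is_laminar_by_links(matrix):
    """Check laminarity using sorted columns and backwards links."""
    n_cols = len(matrix[0])
    # Step 1: sort columns by decreasing binary value (first row = highest bit).
    order = sorted(range(n_cols),
                   key=lambda c: [row[c] for row in matrix], reverse=True)
    link = {}  # sorted position -> the position all of its links must point at
    # Step 2: in each row, every 1 points at the previous 1 (leftmost: itself).
    for row in matrix:
        prev = None
        for pos, col in enumerate(order):
            if row[col] == 1:
                target = pos if prev is None else prev
                if link.setdefault(pos, target) != target:
                    return False  # two links leaving this column disagree
                prev = pos
    return True

laminar = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
not_laminar = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]
print(is_laminar_by_links(laminar), is_laminar_by_links(not_laminar))
```

The matrices are given in the original column order 1-5; the function performs the sort itself.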
Examples

Not laminar:
Object   2  1  3  5  4
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  1  0
D        0  0  1  0  1
E        1  0  1  1  0

Laminar:
Object   2  1  3  5  4
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  1  0
D        0  0  1  0  1
E        1  0  0  0  0
Efficient Implementation (3)
3. When the matrix is laminar, the tree edges corresponding to
characters are defined by the backwards links in the matrix.
The remaining edges and leaves are determined by the characters of
each object. Needs O(mn) time.
[Figure: the resulting tree with edges labeled 1-5 and leaves A, B, C, D, E.]

Object   2  1  3  5  4
A        1  1  0  0  0
B        0  0  1  0  0
C        1  1  0  1  0
D        0  0  1  0  1
E        1  0  0  0  0
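One way to realize step 3 is to walk each object's characters in sorted column order and insert that path into a tree, creating an edge the first time a character is met. This is a sketch of that idea rather than the linked-list procedure itself. It assumes the matrix has already been found to be laminar, uses Python's built-in sort instead of a radix sort, and the node names (`root`, `v1`..`v5`) are hypothetical.

```python
def build_perfect_phylogeny(matrix, objects):
    """Return (children, leaves) for a laminar 0-1 matrix.

    children[node] maps a character (edge label) to the node below that edge;
    leaves[obj] is the node at which object obj is attached.
    """
    n_cols = len(matrix[0])
    order = sorted(range(n_cols),
                   key=lambda c: [row[c] for row in matrix], reverse=True)
    children = {"root": {}}
    leaves = {}
    for name, row in zip(objects, matrix):
        node = "root"
        for col in order:
            if row[col]:
                char = col + 1  # characters are numbered from 1
                if char not in children[node]:
                    # First object with this character: create its edge.
                    children[node][char] = f"v{char}"
                    children[f"v{char}"] = {}
                node = children[node][char]
        leaves[name] = node
    return children, leaves

matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
children, leaves = build_perfect_phylogeny(matrix, ["A", "B", "C", "D", "E"])
print(children)
print(leaves)
```

For this matrix the edge for character 2 leaves the root, with the edges for characters 1 and then 5 below it, and the edge for character 3 leaves the root with the edge for character 4 below it.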