Phylogeny
Presented By
Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
What is phylogenetics?
Phylogenetics is the study of evolutionary relationships
among and within species.
[Figure: birds, rodents, snakes, primates, crocodiles, marsupials, lizards]
What is phylogenetics?
[Figure: tree with tips crocodiles, birds, lizards, snakes, rodents, primates, marsupials]
This is an example of a phylogenetic tree.
Applications of phylogenetics
• Forensics:
Did a patient’s HIV infection result from an invasive dental
procedure performed by an HIV+ dentist?
• Conservation:
How much gene flow is there among local populations of island
foxes off the coast of California?
• Medicine:
What are the evolutionary relationships among the various
prion-related diseases?
To be continued…
Phylogenetic concepts:
Interpreting a Phylogeny
Sequence A
Sequence B
Sequence C
Sequence D
Sequence E
Time
Which sequence is most
closely related to B?
A, because B diverged
from A more recently
than from any other
sequence.
Physical position in tree is
not meaningful! Only
tree structure matters.
Phylogenetic concepts:
Rooted and Unrooted Trees
[Figure: an unrooted tree of A, B, C, D with candidate root positions marked "?", shown next to rooted versions of the same tree with root X; axis: Time]
Rooting and Tree Interpretation
[Figure: trees of chicken, human, fruit fly, oak, bacteria and archaebacteria (archaea) rooted in different places; depending on the root, the characters are read as losses (– bones, – cell nuclei) or gains (+ cell nuclei, + bones)]
Rooting Methods
Outgroup rooting of a network of relationships
Given an unrooted network of relationships among four species of Carnivora [left],
outgroup rooting uses an additional taxon (the outgroup) known from independent
evidence to be less closely related to each of the other species (the ingroup) than they
are to one another. The root is then placed on the branch between the outgroup and the
ingroup. In this case, Lynx is a feloid carnivore in a separate superfamily from the four
canoid carnivores. Inclusion of Lynx in the network analysis places it on the
internode. This method requires accurate information about ingroup / outgroup
relationships.
How Many Trees?
(assuming bifurcation only)
For 3, 4, 5, 6, 10, 30 and N sequences: how many pairwise distances are there, how many unrooted and rooted trees, and how many branches per tree?
How Many Trees?
              # pairwise   Unrooted trees               Rooted trees
# sequences   distances    # trees        # branches    # trees        # branches
                                          /tree                        /tree
3             3            1              3             3              4
4             6            3              5             15             6
5             10           15             7             105            8
6             15           105            9             945            10
10            45           2,027,025      17            34,459,425     18
30            435          8.69 × 10^36   57            4.95 × 10^38   58
For N sequences:
# pairwise distances: $\frac{N(N-1)}{2}$
# unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$, with $2N-3$ branches per tree
# rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$, with $2N-2$ branches per tree
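The closed-form counts for N sequences can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the slides.

```python
from math import factorial

def pairwise_distances(n):
    """Number of distinct sequence pairs: N(N-1)/2."""
    return n * (n - 1) // 2

def unrooted_trees(n):
    """Number of unrooted bifurcating trees: (2N-5)! / (2^(N-3) (N-3)!), N >= 3."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def rooted_trees(n):
    """Number of rooted bifurcating trees: (2N-3)! / (2^(N-2) (N-2)!), N >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 4, 5, 6, 10, 30):
    # columns: N, pairwise distances, unrooted trees, branches, rooted trees, branches
    print(n, pairwise_distances(n), unrooted_trees(n), 2 * n - 3,
          rooted_trees(n), 2 * n - 2)
```

Integer arithmetic is used throughout, so the large counts (for example at N = 30) are exact rather than rounded.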
Tree Properties
Ultrametricity
All tips are an equal distance from the root.
[Figure: rooted tree with tips X and Y and branches a, b, c, d, e, where a = b + c + d + e]
Additivity
Distance between any two tips equals the total branch length between them.
[Figure: rooted tree with tips X and Y connected through branches a, b, c, d, e, where XY = a + b + c + d + e]
In simple scenarios, evolutionary trees are ultrametric
and phylograms are additive.
Terminology
• External nodes: things under comparison; operational
taxonomic units (OTUs)
• Internal nodes: ancestral units; hypothetical; goal is to
group current day units
• Root: common ancestor of all OTUs under study. Path
from root to node defines evolutionary path
• Unrooted: specify relationship but not evolutionary path
– If we have an outgroup (an external reason to believe a certain OTU
branched off first), then we can root the tree
• Topology: branching pattern of a tree
• Branch length: amount of difference that occurred along
a branch
Phylogeny Applications
• Tree of Life: Analyzing changes that have
occurred in evolution of different organisms
http://tolweb.org/tree/phylogeny.html
• Phylogenetic relationships among genes can
help predict which ones might have similar
functions (e.g., ortholog detection)
• Follow changes occurring in rapidly changing
species (e.g., HIV virus)
Phylogeny Packages
• PHYLIP, Phylogenetic inference package
– evolution.genetics.washington.edu/phylip.html
– Felsenstein
– Free!
• PAUP, phylogenetic analysis using parsimony
– paup.csit.fsu.edu
– Swofford
Similarity vs. Homology
• Similar
– sequences resemble one another
• Homolog
– sequences derived from common ancestor
• Ortholog
– homologous sequences in different species, separated by a speciation event
• Paralog
– homologous sequences within a species, separated by a gene duplication event
Ortholog vs. Paralog
• Ortholog
– sequences diverge after a speciation event
– hence can be used for phylogeny of organisms
• Paralog
– sequences diverge after a gene duplication, which may predate speciation
– hence not suitable for phylogeny of organisms
Homoplasy
• Sequence similarity NOT due to common
ancestry
• May arise due to parallelism or convergent
evolution
• Parallelism or parallel evolution
– the development of a similar trait in related, but
distinct, species descending from the same
ancestor, but from different clades
• Convergent evolution
Parallel evolution
Parallel evolution occurs when two species
that have descended from the same
ancestor remain similar over long periods
of time because they independently
acquire the same evolutionary adaptations.
Parallel evolution occurs because
genetically related species adapt to similar
environmental changes in similar ways.
After many years, the organisms may still
resemble each other, even though they
speciated in the distant past.
Convergent evolution
When species from different ancestors colonize the same environment, they may
independently acquire the same adaptations. The evolution of species descended from
different ancestors to become superficially similar, because they are adapting to the same
environment, is called convergent evolution.
Divergent Evolution
Phylogeny of what?
• Organisms
– Whole genome phylogeny
– Ribosomal RNA (surrogate for whole genome)
• Strains (closely related microbes)
• Individual genes (or gene families)
• Repetitive DNA sequences
• Metabolic pathways
• Secondary Structures
• Any discrete character(s)
• Human languages
• Microbial communities
Why compute phylogenetic trees?
• Understand evolutionary history
• Map pathogen strain diversity for vaccines
• Assist in epidemiology
– Of infectious diseases
– Of genetic defects
• Aid in prediction of function of novel genes
• Biodiversity studies
• Understanding microbial ecologies
Tree Building Exercises
Computational Approaches to
Phylogenetic Tree Computation
• Distance Based Methods
– UPGMA
– Neighbor joining
• Character State Methods
– Maximum Parsimony Method
– Maximum Likelihood Methods
• Tree merging
– Consensus trees, super-trees
What data is used to build trees?
• Traditionally: morphological features (e.g.,
number of legs, beak shape, etc.)
• Today: Mostly molecular data (e.g., DNA and
protein sequences)
Data for Phylogeny
• Can be classified into two categories:
– Numerical data
• Distance between objects
– e.g., distance(man, mouse)=500,
– distance(man, chimp)=100
– Usually derived from sequence data
– Discrete characters
• Each character has finite number of states
– e.g., number of legs = 1, 2, 4
– DNA = {A, C, T, G}
UPGMA
UPGMA
2. Determine the evolutionary distances and build the
distance matrix
- A simple example
1. AGGCCATGAATTAAGAATAA
2. AGCCCATGGATAAAGAGTAA
3. AGGACATGAATTAAGAATAA
4. AAGCCAAGAATTACGAATAA
Distance Matrix
     1      2      3      4
1    -      0.2    0.05   0.15
2           -      0.25   0.35
3                  -      0.2
4                         -
In this example the evolutionary distance is expressed as the
fraction of nucleotide positions that differ between the two
sequences of each pair. For example, sequences 1 and 2 are
20 nucleotides in length and have four differences, corresponding to
an evolutionary distance of 4/20 = 0.2.
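The distance computation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-length, already aligned sequences; the names `p_distance` and `dist` are chosen here and do not come from the slides.

```python
from itertools import combinations

# The four example sequences, keyed by their number on the slide.
seqs = {
    1: "AGGCCATGAATTAAGAATAA",
    2: "AGCCCATGGATAAAGAGTAA",
    3: "AGGACATGAATTAAGAATAA",
    4: "AAGCCAAGAATTACGAATAA",
}

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Upper triangle of the distance matrix.
dist = {(i, j): p_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)}

for pair, d in dist.items():
    print(pair, round(d, 2))
```

Counting raw mismatches is the simplest possible distance; it ignores multiple substitutions at the same site, so it is only a rough estimate for divergent sequences.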
3. Phylogenetic Tree Construction example
(UPGMA algorithm)
UPGMA (Michener & Sokal 1957)
Dij       Bear    Raccoon   Weasel   Seal
Bear      -       0.26      0.34     0.29
Raccoon           -         0.42     0.44
Weasel                      -        0.44
Seal                                 -
[Figure: Bear and Raccoon joined, each on a branch of length 0.13]
1. Pick smallest entry Dij
2. Join the two intersecting species and assign branch lengths Dij/2
to each of the nodes
3. Phylogenetic Tree Construction example
(UPGMA algorithm)
Dij       Bear    Raccoon   Weasel   Seal
Bear      -       0.26      0.34     0.29
Raccoon           -         0.42     0.44
Weasel                      -        0.44
Seal                                 -
[Figure: Bear and Raccoon joined, each on a branch of length 0.13]
3. Compute new distances to the other species using
arithmetic means
$$D_{W(BR)} = \frac{D_{WB} + D_{WR}}{2} = \frac{0.34 + 0.42}{2} = 0.38$$
$$D_{S(BR)} = \frac{D_{SB} + D_{SR}}{2} = \frac{0.29 + 0.44}{2} = 0.365$$
3. Phylogenetic Tree Construction example
(UPGMA algorithm)
Dij       BR      Weasel    Seal
BR        -       0.38      0.365
Weasel            -         0.44
Seal                        -
[Figure: Bear and Raccoon joined at height 0.13; the BR cluster and Seal joined at height 0.1825]
1. Pick smallest entry Dij
2. Join the two intersecting species and assign branch
lengths Dij/2 to each of the nodes
3. Phylogenetic Tree Construction example
(UPGMA algorithm)
Dij       BR      Weasel    Seal
BR        -       0.38      0.365
Weasel            -         0.44
Seal                        -
[Figure: Bear and Raccoon joined at height 0.13; the BR cluster and Seal joined at height 0.1825]
3. Compute new distances to the other species using
arithmetic means
$$D_{W(BRS)} = \frac{D_{WB} + D_{WR} + D_{WS}}{3} = \frac{0.34 + 0.42 + 0.44}{3} = 0.4$$
3. Phylogenetic Tree Construction example
(UPGMA algorithm)
Dij       BRS     Weasel
BRS       -       0.4
Weasel            -
[Figure: final tree; Bear and Raccoon joined at height 0.13, Seal joined at height 0.1825, Weasel joined at height 0.2]
1. Pick smallest entry Dij.
2. Join the two intersecting species and assign branch lengths
Dij/2 to each of the nodes.
3. Done!
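The three repeated steps (pick the smallest entry, join at height Dij/2, average the distances) can be sketched as a short Python loop. This is an illustrative sketch rather than a reference implementation; the function name `upgma` and the cluster labels are assumptions made here. The size-weighted mean used below equals the arithmetic mean over all leaf pairs shown on the slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves by UPGMA.

    dist maps frozenset({a, b}) -> distance between leaves a and b.
    Returns a list of (cluster_a, cluster_b, height) merges in order.
    """
    clusters = {name: 1 for name in names}   # cluster label -> number of leaves
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # 1. Pick the smallest entry Dij.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        # 2. Join the pair; each side sits Dij/2 below the new node.
        height = d[frozenset((a, b))] / 2
        new = f"({a},{b})"
        na, nb = clusters[a], clusters[b]
        # 3. Distance to every other cluster: mean over all leaf pairs,
        #    i.e. the size-weighted mean of the two old distances.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        del clusters[a], clusters[b]
        clusters[new] = na + nb
        merges.append((a, b, height))
    return merges

pairs = {
    ("Bear", "Raccoon"): 0.26, ("Bear", "Weasel"): 0.34, ("Bear", "Seal"): 0.29,
    ("Raccoon", "Weasel"): 0.42, ("Raccoon", "Seal"): 0.44, ("Weasel", "Seal"): 0.44,
}
D = {frozenset(p): v for p, v in pairs.items()}
merges = upgma(["Bear", "Raccoon", "Weasel", "Seal"], D)
for a, b, h in merges:
    print(f"join {a} + {b} at height {h:.4f}")
```

On the Bear/Raccoon/Weasel/Seal matrix the merge heights come out as 0.13, 0.1825 and 0.2, matching the branch lengths in the worked example.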
Downside of UPGMA
• Assumes a molecular clock (i.e., that the evolutionary
rate is approximately constant)
• Generates only rooted trees
• Trees are ultrametric
• Doesn’t work in the following case:
Neighbor-joining method
• Developed in 1987 by Saitou and Nei
• Works in a similar fashion to UPGMA
• Still fast – works great for large datasets
• Doesn’t require the data to be ultrametric
• Great for largely varying evolutionary rates
How to construct a tree
with Neighbor-joining method?
Step 1: Calculate $S_x$: sum all distances from x and divide by (leaves – 2)
$$S_x = \frac{\sum_{y \neq x} D_{xy}}{\text{leaves} - 2}$$
Step 2: Calculate the pair with smallest M
$$M_{ij} = D_{ij} - S_i - S_j$$
Step 3: Create a node U that joins the pair with lowest $M_{ij}$
$$S_{iU} = \frac{D_{ij}}{2} + \frac{S_i - S_j}{2}$$
How to construct a tree
with Neighbor-joining method?
Step 4: Join i and j according to S and make all other taxa in
form of a star
Step 5: Recalculate new distance matrix of all other taxa to U
with:
$$D_{xU} = \frac{D_{ix} + D_{jx} - D_{ij}}{2}$$
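The five steps can be put together in one loop. The following Python sketch is illustrative only: the function name, the node labels U1, U2, ... and the small four-taxon matrix are assumptions made here (the matrix was built by hand from a tree with branch lengths A:1, B:2, C:3, D:4 and an internal branch of length 1, so the expected result is known). The update in step 5 uses the halved form (Dix + Djx - Dij) / 2.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the edges (node, node, branch_length) of an unrooted NJ tree.

    dist maps frozenset({x, y}) -> distance between taxa x and y.
    """
    nodes = list(names)
    D = dict(dist)
    edges = []
    joined = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: S_x = (sum of distances from x) / (leaves - 2)
        S = {x: sum(D[frozenset((x, y))] for y in nodes if y != x) / (n - 2)
             for x in nodes}
        # Step 2: pair with the smallest M_ij = D_ij - S_i - S_j
        i, j = min(combinations(nodes, 2),
                   key=lambda p: D[frozenset(p)] - S[p[0]] - S[p[1]])
        dij = D[frozenset((i, j))]
        # Steps 3 and 4: new node U joins i and j with known branch lengths
        joined += 1
        u = f"U{joined}"
        edges.append((i, u, dij / 2 + (S[i] - S[j]) / 2))
        edges.append((j, u, dij / 2 + (S[j] - S[i]) / 2))
        # Step 5: distances from every remaining taxon x to U
        rest = [x for x in nodes if x not in (i, j)]
        for x in rest:
            D[frozenset((x, u))] = (D[frozenset((i, x))]
                                    + D[frozenset((j, x))] - dij) / 2
        nodes = rest + [u]
    a, b = nodes                      # the last two nodes share one branch
    edges.append((a, b, D[frozenset((a, b))]))
    return edges

# Hypothetical additive distances for four taxa (not from the slides).
pairs = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
edges = neighbor_joining("ABCD", {frozenset(p): v for p, v in pairs.items()})
for edge in edges:
    print(edge)
```

Ties in M are broken by taking the first pair encountered, which is one of several reasonable conventions.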
Example of Neighbor-joining
     A    B    C    D    E
B    5
C    4    7
D    7    10   7
E    6    9    6    5
F    8    11   8    9    8
Step 1: S calculation : Sx = (sum all Dx) / (leaves - 2)
S(A) = (5 + 4 + 7 + 6 + 8) / 4 = 7.5
S(B) = (5 + 7 + 10 + 9 + 11) / 4 = 10.5
S(C) = (4 + 7 + 7 + 6 + 8) / 4 = 8
S(D) = (7 + 10 + 7 + 5 + 9) / 4 = 9.5
S(E) = (6 + 9 + 6 + 5 + 8) / 4 = 8.5
S(F) = (8 + 11 + 8 + 9 + 8) / 4 = 11
Example of Neighbor-joining cont 1
Step 2: Calculate pair with smallest M
Mij = Distance ij – Si – Sj
Smallest are
M(AB) = d(AB) – S(A) – S(B) = 5 – 7.5 – 10.5 = -13
M(DE) = 5 – 9.5 – 8.5 = -13
     A       B       C       D       E
B    -13
C    -11.5   -11.5
D    -10     -10     -10.5
E    -10     -10     -10.5   -13
F    -10.5   -10.5   -11     -11.5   -11.5
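Steps 1 and 2 of the worked example can be reproduced with a short Python sketch. The layout of the `lower` dictionary (one lower-triangle row per taxon) is just a convenient encoding chosen here.

```python
from itertools import combinations

names = "ABCDEF"
# Lower triangle of the example distance matrix, row by row.
lower = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {}
for row, values in lower.items():
    for col, value in zip(names, values):
        D[frozenset((row, col))] = value

# Step 1: S_x = (sum of all distances from x) / (leaves - 2)
S = {x: sum(D[frozenset((x, y))] for y in names if y != x) / (len(names) - 2)
     for x in names}

# Step 2: M_ij = D_ij - S_i - S_j for every pair
M = {(i, j): D[frozenset((i, j))] - S[i] - S[j]
     for i, j in combinations(names, 2)}

best = min(M.values())
candidates = [pair for pair, m in M.items() if m == best]
print(S)
print(best, candidates)
```

Two pairs tie for the smallest M here; the example proceeds by joining A and B first.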
Example of Neighbor-joining cont 2
Step 3: Create a node U
$S_{iU} = (D_{ij} / 2) + (S_i - S_j) / 2$
U1 joins A and B:
S(AU1) = d(AB) / 2 + (S(A) – S(B)) / 2
= 5 / 2 + (7.5 – 10.5) / 2 = 1
S(BU1) = d(AB) / 2 + (S(B) – S(A)) / 2
= 5 / 2 + (10.5 – 7.5) / 2 = 4
Example of Neighbor-joining cont 3
Step 4: Join A and B according to S, and make all other
taxa in form of a star. Branches in black are of unknown
length and branches in red are of known length.
Example of Neighbor-joining cont 4
Step 5: Calculate new distance matrix
Dxu = (Dix + Djx – Dij) / 2
d(CU) = (d(AC) + d(BC) - d(AB)) / 2
= (4 + 7 - 5) / 2 = 3
d(DU) = (d(AD) + d(BD) - d(AB)) / 2
= (7 + 10 - 5) / 2 = 6
EU and FU are computed in the same way
Then we get the new distance matrix
     U1   C    D    E
C    3
D    6    7
E    5    6    5
F    7    8    9    8
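The branch lengths to U1 and the reduced distance matrix can be recomputed with a few lines of Python. This sketch hard-codes the join of A and B from the example; the variable names are illustrative.

```python
names = "ABCDEF"
# Lower triangle of the original example distance matrix.
lower = {"B": [5], "C": [4, 7], "D": [7, 10, 7],
         "E": [6, 9, 6, 5], "F": [8, 11, 8, 9, 8]}
D = {}
for row, values in lower.items():
    for col, value in zip(names, values):
        D[frozenset((row, col))] = value

S = {x: sum(D[frozenset((x, y))] for y in names if y != x) / (len(names) - 2)
     for x in names}

i, j = "A", "B"                       # the pair joined by U1
dij = D[frozenset((i, j))]

# Step 3: branch lengths from A and B to the new node U1
branch_i = dij / 2 + (S[i] - S[j]) / 2
branch_j = dij / 2 + (S[j] - S[i]) / 2

# Step 5: distance from every other taxon x to U1
to_U1 = {x: (D[frozenset((i, x))] + D[frozenset((j, x))] - dij) / 2
         for x in names if x not in (i, j)}

print(branch_i, branch_j)
print(to_U1)
```

The distances among C, D, E and F are unchanged by the join, so only the U1 column of the new matrix has to be computed.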
Example of Neighbor-joining cont 5
• Repeat steps 1 to 5 until all branches are done
• In this example, we will get this at the end
[Figure: final neighbor-joining tree]
Downside of Neighbor-joining
• Generates only one possible tree
• Generates only unrooted trees
Maximum Parsimony Method
Parsimony-score:
Number of character-changes (mutations) along the evolutionary tree
(tree containing labels on internal vertices)
Example:
[Figure: two trees whose vertices are labeled with the sequences AAA, AAG, AGA and GGA; each edge is annotated with its number of changes (0, 1 or 2), giving Score = 3 for one tree and Score = 4 for the other]
Most parsimonious tree:
Tree with minimal parsimony score
Minimal Evolution Principle
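The parsimony score of a fully labeled tree is simply the number of differing positions summed over all edges. The Python sketch below illustrates this on a small hypothetical tree; the node names, topology and labelings are made up for illustration and are not taken from the figure.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(p != q for p, q in zip(x, y))

def parsimony_score(labels, edges):
    """Sum of character changes over all (parent, child) edges."""
    return sum(hamming(labels[parent], labels[child]) for parent, child in edges)

# Hypothetical rooted tree: root -> u, v; u -> a, b; v -> c, d
edges = [("root", "u"), ("root", "v"),
         ("u", "a"), ("u", "b"), ("v", "c"), ("v", "d")]
leaves = {"a": "AAG", "b": "AAA", "c": "AGA", "d": "GGA"}

# Two different assignments of sequences to the internal vertices.
labeling_1 = {**leaves, "root": "AAA", "u": "AAA", "v": "AGA"}
labeling_2 = {**leaves, "root": "AAA", "u": "AAA", "v": "AAA"}

score_1 = parsimony_score(labeling_1, edges)
score_2 = parsimony_score(labeling_2, edges)
print(score_1, score_2)
```

The same leaves give different scores depending on the internal labels, which is why the small parsimony problem asks for the best internal assignment.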
Small vs. Large Parsimony
We break the problem into two:
1. Small parsimony: Given the topology, find the best assignment to
internal nodes
2. Large parsimony: Find the topology which gives the best score
• Large parsimony is NP-hard
• We’ll show solutions to small parsimony (Fitch’s and Sankoff’s algorithms)
Input to small parsimony:
tree with character-state assignments to leaves
Example:
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
Fitch’s Algorithm
Execute independently for each character:
1. Bottom-up phase: Determine set of possible states for each
internal node
2. Top-down phase: Pick states for each internal node
Dynamic Programming framework
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and their sequences CAGGTA, CAGACA, CGGGTA, TGCACT, TGCGTA; the two phases are marked 1 and 2]
Fitch’s Algorithm
Bottom-up phase
Determine set of possible states for each internal node
• Initialization: $R_i = \{s_i\}$ for each leaf i
• Do a post-order (from leaves to root) traversal of tree
– Determine $R_i$ of internal node i with children j, k:
$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$
Parsimony-score = # union operations
[Figure: example tree with leaf states C, T, G, T, A, T and the sets computed at internal nodes (CT, GT, AGT, T); score = 3]
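The bottom-up recursion can be written directly as a recursive function. The following is a minimal Python sketch for a single character on a binary tree; the nested-tuple tree encoding and the example tree are assumptions made here, not the tree from the figure.

```python
def fitch_bottom_up(tree):
    """Return (state set R_i, number of union operations) for one character.

    A leaf is a one-letter string; an internal node is a pair (left, right).
    """
    if isinstance(tree, str):                 # initialization: R_i = {s_i}
        return {tree}, 0
    left, right = tree
    r_j, unions_j = fitch_bottom_up(left)
    r_k, unions_k = fitch_bottom_up(right)
    common = r_j & r_k
    if common:                                # non-empty intersection
        return common, unions_j + unions_k
    return r_j | r_k, unions_j + unions_k + 1  # union costs one change

# Hypothetical tree for one character: ((C, T), (G, (T, A)))
tree = (("C", "T"), ("G", ("T", "A")))
root_set, score = fitch_bottom_up(tree)
print(root_set, score)
```

Each union corresponds to one unavoidable state change, so the counter returned with the root set is the parsimony score for this character.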
Fitch’s Algorithm
Top-down phase
Pick states for each internal node
• Pick arbitrary state in $R_{root}$ for the root
• Do pre-order (from root to leaves) traversal of tree
– Determine $s_j$ of internal node j with parent i:
$$s_j = \begin{cases} s_i & \text{if } s_i \in R_j \\ \text{arbitrary state} \in R_j & \text{otherwise} \end{cases}$$
Complexity: O(mnk), with m = #characters, n = #taxa/nodes, k = #states
[Figure: example tree with leaf states C, T, G, T, A, T and the sets at internal nodes (CT, GT, AGT, T); score = 3]
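Both phases together can be sketched as follows in Python. This is an illustrative sketch for one character; the tree encoding and helper names are assumptions made here, and the "arbitrary" choice is made deterministic by taking the alphabetically smallest state.

```python
def bottom_up(tree):
    """Annotate every node as (state_set, children) in post-order."""
    if isinstance(tree, str):                       # leaf: R_i = {s_i}
        return ({tree}, ())
    kids = tuple(bottom_up(child) for child in tree)
    (r_j, _), (r_k, _) = kids                       # binary tree assumed
    return ((r_j & r_k) or (r_j | r_k), kids)       # intersection, else union

def top_down(node, parent_state=None):
    """Pick one state per node in pre-order; returns (state, children)."""
    states, kids = node
    if parent_state in states:
        state = parent_state                        # keep the parent's state
    else:
        state = min(states)                         # arbitrary state in R_j
    return (state, tuple(top_down(kid, state) for kid in kids))

def count_changes(assigned):
    """Number of edges whose two endpoints carry different states."""
    state, kids = assigned
    return sum((kid[0] != state) + count_changes(kid) for kid in kids)

# Hypothetical tree for one character: ((C, T), (G, (T, A)))
tree = (("C", "T"), ("G", ("T", "A")))
assigned = top_down(bottom_up(tree))
print(assigned[0], count_changes(assigned))
```

Counting the changes on the assigned tree gives the same value as counting union operations in the bottom-up phase.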
Weighted Parsimony
Sankoff’s algorithm
• Each mutation a↔b costs differently: S(a,b).
1. Bottom-up phase: Determine $R_i(s)$, the cost of the optimal state-assignment for the subtree of i when i is assigned state s.
2. Top-down phase: Pick optimal states for each internal node
Fitch’s algorithm as special case:
• $R_i$ – set of states which yield minimal-cost subtree of i
Same as algorithm for
optimal lifted tree alignment
(Tutorial #4)
Sankoff’s Algorithm
Bottom-up phase
Determine $R_i(s)$ for each internal node
• Initialization (leaf i with state $s_i$):
$$R_i(s) = \begin{cases} 0 & \text{if } s_i = s \\ \infty & \text{otherwise} \end{cases}$$
• Do a post-order (from leaves to root) traversal of tree
– Determine $R_i$ of internal node i with children j, k:
$$R_i(s) = \min_{s'} \left[ R_j(s') + S(s', s) \right] + \min_{s'} \left[ R_k(s') + S(s', s) \right]$$
Natural generalization for non-binary trees
Remember pointers s → s’
[Figure: example tree with leaf states C, T, G, T, A, T]
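The bottom-up table can be computed with a short recursive function. In this Python sketch the state alphabet, the tree encoding and the cost function (transitions cost 1, transversions cost 2) are assumptions chosen for illustration; the slides do not fix a particular cost matrix.

```python
import math

STATES = "ACGT"
PURINES = {"A", "G"}

def cost(a, b):
    """Hypothetical cost S(a, b): 0 if equal, 1 for transitions, 2 otherwise."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return 1 if same_class else 2

def sankoff(tree):
    """Return {s: R_i(s)} for the root of `tree`.

    A leaf is a one-letter string; an internal node is a tuple of children,
    so non-binary nodes are handled by the same sum over children.
    """
    if isinstance(tree, str):
        return {s: 0 if s == tree else math.inf for s in STATES}
    child_tables = [sankoff(child) for child in tree]
    return {
        s: sum(min(table[sp] + cost(sp, s) for sp in STATES)
               for table in child_tables)
        for s in STATES
    }

small = sankoff(("A", "G"))                      # two leaves under one node
table = sankoff((("C", "T"), ("G", ("T", "A"))))  # hypothetical larger tree
print(small)
print(table)
```

To recover the actual assignment in the top-down phase, one would also store, for each node and state s, the child states s' that achieved the minimum; that bookkeeping is omitted here.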
Sankoff’s Algorithm
Top-down phase
Pick states for each internal node
• Select minimal cost character for root (s minimizing $R_{root}(s)$)
• Do pre-order (from root to leaves) traversal of tree:
– For internal node j, with parent i, select the state that produced
minimal cost at i (use pointers kept in 1st stage)
$$R_i(s) = \min_{s'} \left[ R_j(s') + S(s', s) \right] + \min_{s'} \left[ R_k(s') + S(s', s) \right]$$
Complexity: O(mnk^2), with m = #characters, n = #taxa/nodes, k = #states
[Figure: example tree with leaf states C, T, G, T, A, T]
Fitch’s Algorithm
as special case of Sankoff’s algorithm
Unweighted parsimony:
$$S(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{otherwise} \end{cases}$$
Sankoff’s algorithm:
• $R_i(s)$ - cost of optimal subtree of i, when it is assigned state s
Fitch’s algorithm:
• Score(i) - cost of optimal state-assignment for subtree of i
• $R_i$ - set of optimal state-assignments for subtree of i
We need to show that:
1. Optimal tree assigns node i with state from $R_i$.
2. Fitch’s bottom-up recursive formula for $R_i$ is correct (check for yourselves):
$$R_i = \begin{cases} R_j \cap R_k & \text{if } R_j \cap R_k \neq \emptyset \\ R_j \cup R_k & \text{otherwise} \end{cases}$$
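The relationship between the two algorithms can be checked numerically on a small case. The Python sketch below runs both on one hypothetical tree (an assumption made here) with unit costs and compares Fitch's root set and score with the minimizing states and minimal cost of Sankoff's table; it is a sanity check on one example, not a proof.

```python
import math

STATES = "ACGT"

def fitch(tree):
    """Fitch bottom-up phase: (state set, number of unions)."""
    if isinstance(tree, str):
        return {tree}, 0
    (r_j, c_j), (r_k, c_k) = fitch(tree[0]), fitch(tree[1])
    if r_j & r_k:
        return r_j & r_k, c_j + c_k
    return r_j | r_k, c_j + c_k + 1

def sankoff_unit(tree):
    """Sankoff bottom-up phase with S(a, b) = 0 if a == b else 1."""
    if isinstance(tree, str):
        return {s: 0 if s == tree else math.inf for s in STATES}
    tables = [sankoff_unit(child) for child in tree]
    return {s: sum(min(t[sp] + (sp != s) for sp in STATES) for t in tables)
            for s in STATES}

# Hypothetical tree for one character: ((C, T), (G, (T, A)))
tree = (("C", "T"), ("G", ("T", "A")))
fitch_set, fitch_score = fitch(tree)
table = sankoff_unit(tree)
best = min(table.values())
sankoff_set = {s for s, c in table.items() if c == best}
print(fitch_set, fitch_score, sankoff_set, best)
```

On this tree both approaches agree on the root states and on the score.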
Fitch’s Algorithm
as special case of Sankoff’s algorithm
Unweighted parsimony:
$$S(a, b) = \begin{cases} 0 & \text{if } a = b \\ 1 & \text{otherwise} \end{cases}$$
• Score(i) - cost of optimal state-assignment for subtree of i
• $R_i$ - set of optimal state-assignments for subtree of i
We need to show that:
1. Optimal tree assigns node i with state from $R_i$.
• Trivially true for the root
• Assume (to the contrary) that in an optimal assignment, some
node j (with parent i) is assigned $s_j \notin R_j$
• Parsimony-score is integer, so $s_j \notin R_j \Rightarrow R_j(s_j) \geq \text{Score}(j) + 1$
• By switching from $s_j$ to some $s \in R_j$ we do not raise the parsimony-score:
the subtree of j becomes cheaper by at least 1, while the edge from i to j costs at most 1 more
Why is this not the case for the weighted version?
[Figure: path from the root through node i to its child j]
Maximum likelihood
• Originally developed for statistics by Ronald Fisher
between 1912 and 1922
• Therefore based on an explicit statistical model
• Uses all the data
• Tends to outperform parsimony or distance matrix
methods
How to construct a tree
with Maximum likelihood?
• Step 1: Make all possible trees depending on the
number of leaves
• Step 2: Calculate the likelihood of each tree, i.e., the
probability of the given data occurring under that tree
L(Tree) = probability of the data given the tree.
– optimizing branch length
– generating tree topology
• Step 3: Pick the tree that has the highest
likelihood.
Sounds really great?
Num of leaves    Num of possible (unrooted) trees
3                1
5                15
10               2027025
13               13749310575
20               221643095476699771875
Maximum likelihood is very expensive and
extremely slow to compute
Comparison of Methods
Distance
• Uses only pairwise distances
• Minimizes distance between nearest neighbors
• Very fast
• Easily trapped in local optima
• Good for generating tentative tree, or choosing among multiple trees
Maximum parsimony
• Uses only shared derived characters
• Minimizes total distance
• Slow
• Assumptions fail when evolution is rapid
• Best option when tractable (<30 taxa, homoplasy rare)
Maximum likelihood
• Uses all data
• Maximizes tree likelihood given specific parameter values
• Very slow
• Highly dependent on assumed evolution model
• Good for very small data sets and for testing trees built using other methods
Methods of evaluating trees
• Bootstrap: resample the initial data set with replacement
(each drawn datum is put back, so it can be drawn again)
to build replicate data sets of the original size
• Jackknife: resample the initial data set with
one datum left out and not replaced
• MCMC: complex, but generates random
samples from a desired probability
distribution with which to compare the model
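The two resampling schemes can be illustrated on an alignment, where each column is one datum. This Python sketch is an assumption-laden illustration: treating alignment columns as the resampled units and the function names are choices made here, and the four sequences are reused from the earlier distance example.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(seqs[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in seqs]

def jackknife_alignment(seqs, drop):
    """Remove the column at index `drop` from every sequence."""
    return [seq[:drop] + seq[drop + 1:] for seq in seqs]

alignment = [
    "AGGCCATGAATTAAGAATAA",
    "AGCCCATGGATAAAGAGTAA",
    "AGGACATGAATTAAGAATAA",
    "AAGCCAAGAATTACGAATAA",
]

rng = random.Random(0)               # fixed seed so the run is repeatable
replicate = bootstrap_alignment(alignment, rng)
reduced = jackknife_alignment(alignment, drop=0)
print(replicate)
print(reduced)
```

In practice a tree is rebuilt from each replicate, and the fraction of replicates that recover a given grouping is reported as its support.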
Phylogeny Flowchart
Difference in Methods
• Maximum-likelihood and parsimony methods have models
of evolution
• Distance methods do not necessarily
– Useful aspect in some circumstances
• E.g., trees built based on whole genomes, presence or absence of
genes
• Religious wars over which methods to use
– Most people now believe ML based methods are best: most
sensitive at large evolutionary distances – but also most time-consuming, and they depend on the specific model of evolution used
• Most commonly used packages contain software for all
three methods: may want to use more than 1 to have
confidence in built tree
Phylip
• URL: http://evolution.genetics.washington.edu/phylip.html
• Parsimony
– DNApenny or Protpars
• Distance
– Compute distance measure using DNAdist or Protdist
– Neighbor (can use NJ or UPGMA)
• ML
– DNAml
Visualising trees
• Treeview
• You can change the graphic presentation of a tree (cladogram,
rectangular cladogram, radial tree, phylogram), but not
change the structure of a tree
• http://homopan.wayne.edu/softwares/phoenix/index.html
Reference
• Mostly from Web