Phylogenetics I
Evolution
Evolution of new organisms is
driven by
• Mutations
– The DNA sequence can be
changed due to single base
changes, deletion/insertion of
DNA segments, etc.
• Selection bias
Theory of Evolution
• Basic idea
– speciation events lead to creation of different
species.
– Speciation caused by physical separation into
groups where different genetic variants
become dominant
• Any two species share a (possibly distant)
common ancestor
The Tree of Life
Primate evolution
A phylogeny is a tree that describes the sequence of speciation
events that led to the formation of a set of present-day species;
it is also called a phylogenetic tree.
Morphological vs. Molecular
• Classical phylogenetic analysis:
morphological features: number of legs,
lengths of legs, etc.
• Modern biological methods allow the use of
molecular features
– Gene sequences
– Protein sequences
Morphological topology
(Based on McKenna and Bell, 1997)
Bonobo
Chimpanzee
Man
Gorilla
Sumatran orangutan
Bornean orangutan
Common gibbon
Barbary ape
Baboon
White-fronted capuchin
Slow loris
Tree shrew
Japanese pipistrelle
Long-tailed bat
Jamaican fruit-eating bat
Horseshoe bat
Little red flying fox
Ryukyu flying fox
Mouse
Rat
Vole
Cane-rat
Guinea pig
Squirrel
Dormouse
Rabbit
Pika
Pig
Hippopotamus
Sheep
Cow
Alpaca
Blue whale
Fin whale
Sperm whale
Donkey
Horse
Indian rhino
White rhino
Elephant
Aardvark
Grey seal
Harbor seal
Dog
Cat
Asiatic shrew
Long-clawed shrew
Small Madagascar hedgehog
Hedgehog
Gymnure
Mole
Armadillo
Bandicoot
Wallaroo
Opossum
Platypus
Archonta
Glires
Ungulata
Carnivora
Insectivora
Xenarthra
From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of
sequences to use (e.g. Mitochondrial vs
Nuclear proteins).
Mitochondrial topology
(Based on Pupko et al.)
Donkey
Horse
Indian rhino
White rhino
Grey seal
Harbor seal
Dog
Cat
Blue whale
Fin whale
Sperm whale
Hippopotamus
Sheep
Cow
Alpaca
Pig
Little red flying fox
Ryukyu flying fox
Horseshoe bat
Japanese pipistrelle
Long-tailed bat
Jamaican fruit-eating bat
Asiatic shrew
Long-clawed shrew
Mole
Small Madagascar hedgehog
Aardvark
Elephant
Armadillo
Rabbit
Pika
Tree shrew
Bonobo
Chimpanzee
Man
Gorilla
Sumatran orangutan
Bornean orangutan
Common gibbon
Barbary ape
Baboon
White-fronted capuchin
Slow loris
Squirrel
Dormouse
Cane-rat
Guinea pig
Mouse
Rat
Vole
Hedgehog
Gymnure
Bandicoot
Wallaroo
Opossum
Platypus
Perissodactyla
Carnivora
Cetartiodactyla
Chiroptera
Moles+Shrews
Afrotheria
Xenarthra
Lagomorpha
+ Scandentia
Primates
Rodentia 1
Rodentia 2
Hedgehogs
Nuclear topology
(Based on Pupko et al. slide)
(tree by Madsenl)
Round Eared Bat
Flying Fox
Hedgehog
Mole
Pangolin
1
Cow
Cat
Dog
Horse
Rhino
Rat
3
Capybara
Rabbit
Flying Lemur
Tree Shrew
Human
Galago
Sloth
4
Eulipotyphla
Pholidota
Whale
Hippo
Pig
2
Chiroptera
Hyrax
Dugong
Elephant
Aardvark
Elephant Shrew
Opossum
Kangaroo
Cetartiodactyla
Carnivora
Perissodactyla
Glires
Scandentia+
Dermoptera
Primate
Xenarthra
Afrotheria
Phylogenetic trees
Aardvark Bison Chimp Dog
Elephant
• Leaves - current-day species (or taxa, the plural of taxon)
• Internal vertices - hypothetical common ancestors
• Edge lengths - “time” from one speciation to the next
Twists in molecular phylogenies
• Gene/protein sequences can be homologous for several
different reasons:
– Orthologs -- sequences diverged after a
speciation event
– Paralogs -- sequences diverged after a
duplication event
– Xenologs -- sequences diverged after a
horizontal transfer (e.g., by virus)
Paralogs
Consider the evolutionary tree of three taxa (1, 2, 3),
and assume that at some point in the past a gene duplication
event occurred.
Paralogs
The gene evolution is described by this tree (A, B
are the copies of the same gene).
(Figure: gene tree in which a gene duplication is followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B.)
Paralogs
If we happen to consider genes 1A, 2B, and 3A of
species 1,2,3, we get a wrong tree that does not
represent the phylogeny of the host species
(Figure: the same gene tree with the gene duplication and the speciation events (S) marked; leaves 1A, 2A, 3A, 3B, 2B, 1B.)
Types of Trees
A natural model to consider is that of rooted
trees
Common
Ancestor
Types of trees
An unrooted tree represents the same phylogeny
without the root node
Depending on the model, data from current day species does
not distinguish between different placements of the root.
Rooted versus unrooted trees
(Figure: an unrooted tree with three possible root positions a, b, c represents the three rooted trees Tree a, Tree b, Tree c.)
Total numbers of trees
• For n taxa,
– Rooted bifurcating trees:
• $(2n-3)!! = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
– Unrooted bifurcating trees:
• $(2n-5)!!$
– Tree shapes
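The counts above grow very quickly with the number of taxa. The following is a minimal sketch, assuming labeled taxa and the double-factorial formulas stated on this slide; the function names are illustrative and not part of the original notes.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * (k-4) * ... (defined as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 labeled taxa: (2n-3)!!."""
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 labeled taxa: (2n-5)!!."""
    return double_factorial(2 * n - 5)


# Print n, the rooted count and the unrooted count for a few sizes.
for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

Already for 10 taxa there are tens of millions of rooted trees, which is why exhaustive search over tree topologies is impractical.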
Positioning Roots in Unrooted
Trees
• We can estimate the position of the root by
introducing an outgroup:
– a set of species that are definitely distant from
all the species of interest
Proposed root
Falcon
Aardvark Bison Chimp Dog
Elephant
Type of Data
• Distance-based
– Input is a matrix of distances between species
– Can be the fraction of residues on which they disagree,
or the alignment score between them, or …
• Character-based
– Examine each character (e.g., residue)
separately
Two methods of tree
Construction
• Distance – a weighted tree that realizes the distances
between the objects.
• Parsimony – a tree with the minimum total number of
character changes between nodes.
We start with distance based methods, considering the
following question:
Given a set of species (leaves in a supposed tree), and
distances between them – construct a phylogeny which
best “fits” the distances.
Distance Matrix
• Given n species, we can compute the n x n
distance matrix Dij
• Dij may be defined as the edit distance
between a gene in species i and species j,
where the gene of interest is sequenced for
all n species.
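As an illustration of building Dij from edit distances, here is a small sketch that applies a standard Levenshtein dynamic program to the four example sequences shown earlier (Rat, Rabbit, Gorilla, Cat). Unit costs for substitutions, insertions and deletions are an assumption; the notes do not specify a cost scheme, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def distance_matrix(seqs: dict) -> dict:
    """Return D[(x, y)] = edit distance between the sequences of species x and y."""
    names = list(seqs)
    return {(x, y): edit_distance(seqs[x], seqs[y]) for x in names for y in names}


SEQS = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}
D = distance_matrix(SEQS)
for x in SEQS:
    print(x.ljust(8), [D[(x, y)] for y in SEQS])
```

On such short fragments several species are indistinguishable (Rat and Gorilla have distance 0), which is one reason real analyses use longer sequences and model-based distances.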
The distance between two
sequences
• Protein sequences:
– PAM
– BLOSUM
• DNA sequences
– Jukes-Cantor
– HKY
– Kimura 2-Parameter
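General Stationary Time-reversible Model
$$R = \begin{pmatrix}
\cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\
\pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\
\pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot
\end{pmatrix}$$
(Diagonal elements such that rows sum to zero)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$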
General Stationary Time-reversible Model
$P(t) = e^{Rt}$
Given rates, one can find transition
probabilities, and vice-versa.
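To make the step from rates to transition probabilities concrete, here is a rough pure-Python sketch that approximates the matrix exponential by scaling and squaring with a truncated Taylor series. The numerical scheme, the function names and the example rate matrix (off-diagonal rates u/3 with u = 1, the Jukes-Cantor form used in these notes) are choices made for illustration; in practice a library routine would normally be used.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=20, squarings=10):
    """Approximate e^M: Taylor series of M / 2^squarings, then repeated squaring."""
    n = len(M)
    scale = 2 ** squarings
    S = [[x / scale for x in row] for row in M]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]            # current Taylor term S^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):                   # undo the scaling
        result = mat_mul(result, result)
    return result


def transition_probabilities(R, t):
    """P(t) = e^{Rt} for a rate matrix R whose rows sum to zero."""
    return mat_exp([[r * t for r in row] for row in R])


u = 1.0                                          # illustrative total rate
R = [[-u if i == j else u / 3 for j in range(4)] for i in range(4)]
P = transition_probabilities(R, 0.5)
print([round(x, 6) for x in P[0]])
```

Each row of P(t) sums to 1, and for this particular R the diagonal entries agree with the closed form 1/4 + 3/4 e^{-4ut/3}.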
Jukes-Cantor
$$R = \begin{pmatrix}
\cdot & u/3 & u/3 & u/3 \\
u/3 & \cdot & u/3 & u/3 \\
u/3 & u/3 & \cdot & u/3 \\
u/3 & u/3 & u/3 & \cdot
\end{pmatrix}$$
Jukes-Cantor
• P(no mutation) = $e^{-\frac{4}{3}ut}$
• P(at least one mutation) = $1 - e^{-\frac{4}{3}ut}$
• $D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
• $D = ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$
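The last two formulas are inverses of each other: one maps the expected number of substitutions per site ut to the expected fraction of differing sites Ds, the other corrects an observed Ds back to a distance. A minimal sketch, with hypothetical function names:

```python
import math


def jc_expected_difference(ut: float) -> float:
    """Ds = 3/4 * (1 - exp(-4/3 * ut)): expected fraction of differing sites."""
    return 0.75 * (1.0 - math.exp(-4.0 * ut / 3.0))


def jc_distance(ds: float) -> float:
    """D = ut = -3/4 * ln(1 - 4/3 * Ds): Jukes-Cantor corrected distance."""
    if not 0.0 <= ds < 0.75:
        raise ValueError("Ds must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * ds / 3.0)


# The corrected distance always exceeds the raw fraction of differences,
# because multiple substitutions at the same site are hidden in Ds.
for ds in (0.05, 0.2, 0.5):
    print(ds, round(jc_distance(ds), 4))
```

Ds saturates at 3/4 for very distant sequences, so the correction is undefined at or above that value.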
Kimura 2-Parameter
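$$R = \begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & \cdot & b & a & b \\
C & b & \cdot & b & a \\
G & a & b & \cdot & b \\
T & b & a & b & \cdot
\end{array}$$
a/b = transition/transversion bias, parameterized by R = a/(2b)
a+2b = 1 per unit time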
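Kimura 2-Parameter
$a = \dfrac{R}{R+1}, \quad b = \dfrac{0.5}{R+1}$
$\Pr(\text{transition} \mid t) = \frac{1}{4} - \frac{1}{2}\exp\left(-\frac{2R+1}{R+1}\,t\right) + \frac{1}{4}\exp\left(-\frac{2}{R+1}\,t\right) = P$
$\Pr(\text{transversion} \mid t) = \frac{1}{2}\left[1 - \exp\left(-\frac{2}{R+1}\,t\right)\right] = Q$
$t = -\frac{1}{4}\ln\left[(1-2Q)(1-2P-Q)^{2}\right]$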
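A short sketch of the Kimura 2-parameter formulas, assuming the normalization a + 2b = 1 used on these slides, so that t is measured in expected substitutions per site. It computes P and Q from t and R, then recovers t with the estimator t = -1/4 ln[(1-2Q)(1-2P-Q)^2]. Function names are illustrative.

```python
import math


def k2p_probabilities(t: float, R: float):
    """Return (P, Q): probabilities of a transition / transversion difference after time t."""
    e_ts = math.exp(-(2 * R + 1) / (R + 1) * t)   # exp(-2(a+b)t)
    e_tv = math.exp(-2 / (R + 1) * t)             # exp(-4bt)
    p = 0.25 - 0.5 * e_ts + 0.25 * e_tv
    q = 0.5 * (1.0 - e_tv)
    return p, q


def k2p_distance(p: float, q: float) -> float:
    """Estimate t from observed fractions of transitions (p) and transversions (q)."""
    return -0.25 * math.log((1 - 2 * q) * (1 - 2 * p - q) ** 2)


p, q = k2p_probabilities(0.3, 2.0)
print(round(p, 4), round(q, 4), round(k2p_distance(p, q), 4))
```

The round trip works because 1-2Q = e^{-4bt} and 1-2P-Q = e^{-2(a+b)t}, so the product inside the logarithm equals e^{-4(a+2b)t} = e^{-4t}.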

HKY (Hasegawa, Kishino, Yano)
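$$R = \begin{pmatrix}
\cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot
\end{pmatrix}$$
(rows and columns in the order A, C, G, T)
$\kappa$ = transition / transversion rate ratio (it multiplies the transition entries A↔G and C↔T)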
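In the matrix above, the extra factor appears only on the transition entries (A↔G and C↔T), and each off-diagonal entry is proportional to the frequency of the target base. The sketch below builds such a matrix; the parameter names `pi`, `kappa`, `mu` and the example frequencies are assumptions chosen for illustration.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(pi: dict, kappa: float, mu: float = 1.0) -> dict:
    """Return R[(x, y)]: rate of change from base x to base y under an HKY-style model."""
    R = {}
    for x in BASES:
        for y in BASES:
            if x != y:
                rate = mu * pi[y]                 # proportional to target frequency
                if (x, y) in TRANSITIONS:
                    rate *= kappa                 # transitions get the extra factor
                R[(x, y)] = rate
        # Diagonal chosen so that each row sums to zero.
        R[(x, x)] = -sum(R[(x, y)] for y in BASES if y != x)
    return R


PI = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}     # illustrative base frequencies
R = hky_rate_matrix(PI, kappa=2.0)
for x in BASES:
    print(x, [round(R[(x, y)], 3) for y in BASES])
```

With equal base frequencies this reduces to the Kimura 2-parameter form, and with kappa = 1 as well it reduces to the Jukes-Cantor form, up to the choice of time scale.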
Distances in Trees
• Edges may have weights reflecting:
– Number of mutations on evolutionary path
from one species to another
– Time estimate for evolution of one species
into another
• In a tree T, we often compute
dij(T) - the length of a path between leaves i and j
Distance in Trees: an Example
j
i
d1,4 = 12 + 13 + 14 + 17 + 12 = 68
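A possible way to compute dij(T) is to walk the unique path between two leaves and add up the edge weights. The tree below is hypothetical: only the five weights on the path from leaf 1 to leaf 4 (12, 13, 14, 17, 12) come from the example; the internal node names and the side branches are invented for illustration.

```python
def add_edge(tree: dict, u: str, v: str, w: float) -> None:
    """Add an undirected weighted edge to an adjacency-dict tree."""
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w


def path_length(tree: dict, src: str, dst: str) -> float:
    """Sum of edge weights on the unique path from src to dst in a tree."""
    stack = [(src, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nbr, w in tree[node].items():
            if nbr != parent:                     # never walk back along an edge
                stack.append((nbr, node, dist + w))
    raise ValueError(f"no path from {src} to {dst}")


T = {}
# Path 1 - a - b - c - d - 4 uses the weights from the example.
for u, v, w in [("1", "a", 12), ("a", "b", 13), ("b", "c", 14),
                ("c", "d", 17), ("d", "4", 12),
                # Hypothetical side branches to other leaves.
                ("a", "2", 5), ("b", "3", 7), ("c", "5", 4), ("d", "6", 9)]:
    add_edge(T, u, v, w)

print(path_length(T, "1", "4"))
```

Because a tree has exactly one simple path between any two vertices, no shortest-path machinery is needed.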
Fitting Distance Matrix
• Given n species, we can compute the n x
n distance matrix Dij
• Evolution of these genes is described by a
tree that we don’t know.
• We need an algorithm to construct a tree
that best fits the distance matrix Dij
Reconstructing a 3 Leaved Tree
• Tree reconstruction for any 3x3 matrix is
straightforward
• We have 3 leaves i, j, k and a center
vertex c
Observe:
dic + djc = Dij
dic + dkc = Dik
djc + dkc = Djk
Reconstructing a 3 Leaved Tree
dic + djc = Dij
+ dic + dkc = Dik
2dic + djc + dkc = Dij + Dik
2dic + Djk = Dij + Dik
dic = (Dij + Dik – Djk)/2
Similarly,
djc = (Dij + Djk – Dik)/2
dkc = (Dki + Dkj – Dij)/2
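The three formulas translate directly into code. A minimal sketch with an invented example matrix (Dij = 5, Dik = 7, Djk = 8); the function name is hypothetical.

```python
def three_leaf_tree(D_ij: float, D_ik: float, D_jk: float):
    """Return the edge lengths (d_ic, d_jc, d_kc) from leaves i, j, k to the center c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_tree(5, 7, 8)
print(d_ic, d_jc, d_kc)
# The edge lengths reproduce the input distances:
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

All three edge lengths are non-negative exactly when the input distances satisfy the triangle inequality.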
Trees with > 3 Leaves
• An unrooted binary tree with n leaves has 2n-3 edges
• This means fitting a given tree to a
distance matrix D requires solving a
system of “n choose 2” equations with 2n-3 variables
• This is not always possible to solve for n > 3
Additive Distance Matrices
Matrix D is
ADDITIVE if there
exists a tree T with
dij(T) = Dij
NON-ADDITIVE
otherwise
Distance Based Phylogeny
Problem
• Goal: Reconstruct an evolutionary tree
from a distance matrix
• Input: n x n distance matrix Dij
• Output: weighted tree T with n leaves
fitting D
• If D is additive, this problem has a solution
and there is a simple algorithm to solve it
Using Neighboring Leaves to Construct the Tree
• Find neighboring leaves i and j with parent k
• Remove the rows and columns of i and j
• Add a new row and column corresponding to k,
where the distance from k to any other leaf m
can be computed as:
Dkm = (Dim + Djm – Dij)/2
Compress i and j into
k, iterate algorithm for
rest of tree
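The compression step can be sketched as follows. The example matrix is derived from a hypothetical four-leaf tree (A and B attached to x with edges 3 and 10, C and D attached to y with edges 1 and 10, and an x-y edge of length 1); the helper names are illustrative.

```python
def symmetric(pairs: dict) -> dict:
    """Build a symmetric dict-of-dicts distance matrix from {(a, b): distance}."""
    D = {}
    for (a, b), w in pairs.items():
        D.setdefault(a, {})[b] = w
        D.setdefault(b, {})[a] = w
    for a in D:
        D[a][a] = 0
    return D


def merge_neighbors(D: dict, i: str, j: str, k: str) -> dict:
    """Replace neighboring leaves i and j by their parent k and return the smaller matrix."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2   # Dkm = (Dim + Djm - Dij) / 2
        new[k][m] = d_km
        new[m][k] = d_km
    return new


D = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 14,
               ("B", "C"): 12, ("B", "D"): 21, ("C", "D"): 11})
D2 = merge_neighbors(D, "A", "B", "x")
print(D2["x"]["C"], D2["x"]["D"], D2["C"]["D"])
```

The new distances agree with the underlying tree: x is at distance 1 + 1 from C and 1 + 10 from D.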
Finding Neighboring Leaves
• Tempting idea: to find neighboring leaves, simply select
a pair of closest leaves.
This is WRONG.
Finding Neighboring Leaves
• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (dij = 13) > (djk =
12)
• Finding a pair of neighboring leaves is
a nontrivial problem!
Neighbor Joining Algorithm
• In 1987 Naruya Saitou and Masatoshi Nei
developed a neighbor joining algorithm for
phylogenetic tree reconstruction
• Finds a pair of leaves that are close to each
other but far from other leaves: implicitly finds
a pair of neighboring leaves
• Advantages: works well for additive and also for
non-additive matrices; it does not rely on the
flawed molecular clock assumption
Constructing additive trees:
The neighbor joining algorithm
Let i, j be neighboring leaves in a tree, let k be their parent, and let
m be any other vertex.
The formula $d(k,m) = \frac{1}{2}\left[d(i,m) + d(j,m) - d(i,j)\right]$
shows that we can compute the distances of k to all other leaves.
This suggests the following method to construct a tree from a
distance matrix:
1. Find neighboring leaves i,j in the tree,
2. Replace i,j by their parent k and recursively construct a tree T
for the smaller set.
3. Add i,j as children of k in T.
Neighbor Finding
How can we find from distances alone a pair of nodes
which are neighboring leaves?
Closest nodes aren’t necessarily neighboring leaves.
Next we show one way to find neighbors from distances.
Neighbor Finding: Saitou & Nei
algorithm
Definitions
For a leaf i, let $r_i = \sum_{u \text{ is a leaf}} d(i,u)$.
For leaves i, j: $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$, where L is the number of leaves.
Theorem (Saitou & Nei) Assume all edge weights are positive. If
D(i,j) is minimal (among all pairs of leaves), then i and j are
neighboring leaves in the tree.
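The criterion can be contrasted with the naive closest-pair rule on a small example. The matrix below comes from a hypothetical tree in which A and B are neighbors (edges 3 and 10 to their parent), C and D are neighbors (edges 1 and 10), and the middle edge has length 1, so the closest pair (A, C) is not a pair of neighbors. Function names are illustrative.

```python
from itertools import combinations


def symmetric(pairs: dict) -> dict:
    """Build a symmetric dict-of-dicts distance matrix from {(a, b): distance}."""
    D = {}
    for (a, b), w in pairs.items():
        D.setdefault(a, {})[b] = w
        D.setdefault(b, {})[a] = w
    for a in D:
        D[a][a] = 0
    return D


def closest_pair(D: dict):
    """Naive rule: the pair of leaves with the smallest distance."""
    return min(combinations(list(D), 2), key=lambda p: D[p[0]][p[1]])


def nj_pair(D: dict):
    """Pair minimizing D(i,j) = (L-2) d(i,j) - (r_i + r_j), with r_i = sum_u d(i,u)."""
    leaves = list(D)
    L = len(leaves)
    r = {i: sum(D[i][u] for u in leaves) for i in leaves}
    return min(combinations(leaves, 2),
               key=lambda p: (L - 2) * D[p[0]][p[1]] - (r[p[0]] + r[p[1]]))


D = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 14,
               ("B", "C"): 12, ("B", "D"): 21, ("C", "D"): 11})
print("closest pair:", closest_pair(D))
print("pair chosen by the criterion:", nj_pair(D))
```

With only four leaves the two true neighbor pairs, (A, B) and (C, D), tie for the minimum, so either may be returned depending on iteration order.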
Complexity of Neighbor Joining
Algorithm
Naive Implementation:
Initialization: $\Theta(L^2)$ to compute d(r,i) and
C(i,j) for all $i, j \in L$.
Each Iteration:
• $O(L^2)$ to find the maximal C(i,j).
• $O(L)$ to compute $\{C(m,k) : m \in L\}$ for the
new node k.
Total of $O(L^3)$.
Complexity of Neighbor Joining Algorithm
Using Heap to store the C(i,j)’s:
Input: Distance matrix D= d(i,j), and an arbitrary object r.
Initialization: $\Theta(L^2)$ to compute and heapify the C(i,j)’s in a heap H.
Each Iteration:
• O(log L) to find and delete the maximal C(i,j) from H.
• O(L) to add the values {d(k,m)} to D, for all objects m.
• O(L) to delete {d(m,i), d(m,j)} from D (for all m).
• O(L log L) to delete {C(i,m), C(j,m)} and add C(k,m) from H, for all
objects m.
Total of $O(L^2 \log L)$.
(implementation details are omitted)
Neighbor Joining Algorithm
• Applicable to matrices which are not additive
• Known to work well in practice
• The algorithm and its variants are the most widely
used distance-based algorithms today.
The Four Point Condition
Compute: 1. Dij + Dkl, 2. Dik + Djl, 3. Dil + Djk
• Sums 2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice)
• Sum 1 represents a smaller number: the length of all edges – the middle edge
The Four Point Condition: Theorem
• The four point condition for the quartet
i,j,k,l is satisfied if two of these sums are
the same, with the third sum smaller than
these first two
• Theorem : An n x n matrix D is additive if
and only if the four point condition holds
for every quartet 1 ≤ i,j,k,l ≤ n
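The theorem gives a direct, if slow, test for additivity: check every quartet. The sketch below is one way to do it, assuming a symmetric matrix stored as a dict of dicts and a small numerical tolerance. It also accepts the degenerate case in which all three sums are equal (a middle edge of length zero), which is a choice made here rather than something stated on the slide.

```python
from itertools import combinations


def symmetric(pairs: dict) -> dict:
    """Build a symmetric dict-of-dicts distance matrix from {(a, b): distance}."""
    D = {}
    for (a, b), w in pairs.items():
        D.setdefault(a, {})[b] = w
        D.setdefault(b, {})[a] = w
    for a in D:
        D[a][a] = 0
    return D


def four_point_ok(D, i, j, k, l, tol=1e-9) -> bool:
    """True if the two largest of the three pairwise sums are equal (within tol)."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D, tol=1e-9) -> bool:
    """Check the four point condition for every quartet of leaves."""
    return all(four_point_ok(D, *q, tol=tol) for q in combinations(list(D), 4))


tree_like = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 14,
                       ("B", "C"): 12, ("B", "D"): 21, ("C", "D"): 11})
perturbed = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 15,
                       ("B", "C"): 12, ("B", "D"): 21, ("C", "D"): 11})
print(is_additive(tree_like), is_additive(perturbed))
```

The check examines all quartets, so it takes time proportional to n^4; it is meant as an illustration of the theorem, not as an efficient algorithm.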
Least Squares Distance Phylogeny
Problem
• If the distance matrix D is NOT additive, then we look
for a tree T that approximates D the best:
Squared Error: $\sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$
• Squared Error is a measure of the quality of the fit
between distance matrix and the tree: we want to
minimize it.
• Least Squares Distance Phylogeny Problem: finding
the best approximation tree T for a non-additive matrix
D (NP-hard).
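Evaluating the squared error for one candidate tree is easy; the hard part is searching over all trees. A minimal sketch of the evaluation, assuming the tree distances dij(T) have already been computed and that each unordered pair of leaves is counted once (summing over ordered pairs would simply double the value). The matrices and names are invented for illustration.

```python
from itertools import combinations


def symmetric(pairs: dict) -> dict:
    """Build a symmetric dict-of-dicts distance matrix from {(a, b): distance}."""
    D = {}
    for (a, b), w in pairs.items():
        D.setdefault(a, {})[b] = w
        D.setdefault(b, {})[a] = w
    for a in D:
        D[a][a] = 0
    return D


def squared_error(d_tree: dict, D: dict) -> float:
    """Sum over unordered leaf pairs of (d_ij(T) - D_ij)^2."""
    return sum((d_tree[i][j] - D[i][j]) ** 2 for i, j in combinations(list(D), 2))


# Distances realized by a hypothetical candidate tree T.
d_tree = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 14,
                    ("B", "C"): 12, ("B", "D"): 21, ("C", "D"): 11})
# Hypothetical observed, non-additive matrix D.
observed = symmetric({("A", "B"): 13, ("A", "C"): 5, ("A", "D"): 15,
                      ("B", "C"): 10, ("B", "D"): 21, ("C", "D"): 11})
print(squared_error(d_tree, observed))
```

A tree that realizes D exactly has squared error 0, which happens only when D is additive.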