The distance between sequences

The distance between sequences,
Part I.
Foundations
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension
A simple start
Suppose we have two sequences, A and B
A = {a1, a2, …, am}
and
B = {b1, b2, …, bn}
and we want to know how similar they are.
What is the basis for their similarity?
The practical measure
What we usually do is obtain an alignment
and then score using the sum of the pairwise
scores:
S_AB = Σ_{i=1..n} s(a_i, b_i)
The nice metric
Wouldn’t it be nice if we could simply say
that the distance between sequences was the
geometric sum of the distances between loci
in the sequences?
D_AB = √( Σ_{i=1..n} (a_i − b_i)² )
Dynamic programming methods
• Computational method dating from the 1950s, introduced to biology as "Needleman-Wunsch" in 1970.
• A numerical value is assigned to every cell in the array giving the similarity/dissimilarity of residues.
• The example shown:
  – match = +1
  – mismatch = null (value 0)
[Figure: the example dynamic programming array, comparing two short sequences letter by letter; matching cells are marked 1 and mismatches are left blank.]
Dynamic programming methods
• GOAL: for each cell, find the maximum possible score for an alignment ending at that point.
• Searches the subrow and subcolumn, as shown, for the highest score.
• Adds this to the score for the current cell.
• Proceeds row by row through the array.
[Figure: the same array after the dynamic programming pass; each cell now holds the best achievable alignment score from that cell onward, with the overall best score (4 in this example) appearing near the top left.]
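To make the pass concrete, here is a minimal Python sketch in the spirit of the scheme above (match = +1, mismatch = 0, gaps free). The sequences are illustrative, and the brute-force sub-matrix maximum is chosen for simplicity rather than efficiency; this is a sketch, not the original implementation.

```python
# Minimal sketch of the scoring pass described above: match = +1, mismatch = 0,
# gaps cost nothing.  The example sequences are invented for illustration.
import numpy as np

def dp_score_matrix(seq_a, seq_b, match=1, mismatch=0):
    """Fill each cell with the best score of any alignment starting at that cell."""
    m, n = len(seq_a), len(seq_b)
    F = np.zeros((m, n))
    # Work from the bottom-right corner toward the top-left, so that the
    # sub-row/sub-column region is already filled when a cell is visited.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            s = match if seq_a[i] == seq_b[j] else mismatch
            best_rest = F[i + 1:, j + 1:].max() if (i + 1 < m and j + 1 < n) else 0
            F[i, j] = s + best_rest
    return F

F = dp_score_matrix("ajcjnrck", "aabcnjrq")
print(F)          # the filled array
print(F.max())    # best total alignment score under this scheme
```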
Maximum bipartite matching
Series of solutions, starting with Dijkstra, 1950s
[Diagram: bipartite graph with the elements of A on one side, the elements of B on the other, and an edge a_i–b_j of weight s(a_i, b_j) for each pair.]
Find the set of matches that provides maximum flow. Each match, a_i to b_j, has a capacity equal to its pairwise score.
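The matching view can be made concrete with an assignment solver rather than an explicit flow network. The sketch below uses SciPy's linear_sum_assignment on an invented score matrix; it finds the one-to-one matching of maximum total score, which is the same objective as the flow formulation above, not the slide's max-flow construction itself.

```python
# Sketch: treat s(a_i, b_j) as the weight of edge a_i -> b_j and find the
# one-to-one matching with maximum total weight.  The score matrix is invented.
import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([[4, 1, 0],     # s(a1, b1), s(a1, b2), s(a1, b3)
                   [2, 5, 1],
                   [0, 2, 3]])

rows, cols = linear_sum_assignment(scores, maximize=True)
print(list(zip(rows, cols)))       # matched pairs (i, j)
print(scores[rows, cols].sum())    # total score of the matching
```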
Alignment’s not really the
problem
• Optimal alignment falls into a set of
problems with a long history in computer
science.
• The underlying metric for distances
between sequences falls in the province of
biology.
Beguiled by a matrix
(PAM)
PAM
• PAM starts with closely related sequences
from 34 superfamilies, grouped into 71
evolutionary trees.
• PAM rests on a measure of amino acid
“mutability”.
• PAM attempts to capture a representative
slice of evolutionary behavior.
PAM
(From Dayhoff, Schwartz and Orcutt)
• Obtain alignments for homologous proteins
• Compute scoring matrix elements using:
  s_ij(t_0) = G · m_i · a_ij / Σ_i a_ij ,  i ≠ j
  s_ii(t_0) = 1 − G · m_i
  where a_ij is the substitution frequency, m_i is the mutability of residue i, and G is a proportionality constant.
• Extrapolate to longer evolutionary distances by using {S(t_0)}^n
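A toy sketch of that extrapolation step: raise the one-step substitution (probability) matrix to the n-th power. The 3-letter alphabet and the numbers are invented for illustration; real PAM matrices are 20 × 20 and built from the Dayhoff alignment counts.

```python
# Toy sketch of PAM-style extrapolation: {S(t0)}^n models n units of
# evolutionary time.  The matrix below is invented (rows sum to 1).
import numpy as np

S1 = np.array([[0.98, 0.01, 0.01],
               [0.02, 0.97, 0.01],
               [0.01, 0.02, 0.97]])

S250 = np.linalg.matrix_power(S1, 250)   # "PAM250"-style extrapolation
print(S250.round(3))
```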
Limitations of PAM matrices
PAM matrices are built from alignments with
> 85% identity.
The entries in the initial scoring matrix, S(t=1)
arise from short time interval substitutions;
raising S(1) to a higher power may not
capture some interesting substitutions with
longer rate constants.
The Gutzwiller temptation
• An abstract dynamic system (M, m, ft)
– a measurable space, M, composed of the set of
all sequences.
– a measure m based on transition probabilities.
– a group of automorphisms, f_t, that map M onto itself, preserve m, and for which the variable t runs through the integers.
What’s Bernoulli got to do with
it?
• A scheme with subshift
– The measure on M is generated by the sets A_{i,j,k} = {a | a_i = j, a_{i+1} = k}, whose measure is given by a matrix of transition probabilities p_jk ≥ 0.
– A future event a_1 depends on a_0; hence, memory.
– Realized in the geodesic flow on a compact
closed surface of constant negative curvature.
System behaviors
• Ergodicity: Transition probabilities are positive
recurrent and aperiodic.
• Mixing: Inheritance and Mendelian exceptions
lead to mixing.
• K-systems: Speciation events rigidly segregate
M; other segregations exist.
Our salad days
• Jukes-Cantor
• HKY
• Kimura 2-Parameter
• PAM
• BLOSUM
General Stationary Time-reversible Model
R =
      A          C          G          T
A     ·          p_C·r_CA   p_G·r_GA   p_T·r_TA
C     p_A·r_AC   ·          p_G·r_GC   p_T·r_TC
G     p_A·r_AG   p_C·r_CG   ·          p_T·r_TG
T     p_A·r_AT   p_C·r_CT   p_G·r_GT   ·
(Diagonal elements such that rows sum to zero)
Time reversibility: p_i·r_ij = p_j·r_ji
General Stationary Time-reversible Model
P(t) = e^(Rt)
Given rates, one can find transition
probabilities, and vice-versa.
Jukes-Cantor
R =
      A      C      G      T
A    −3a     a      a      a
C     a     −3a     a      a
G     a      a     −3a     a
T     a      a      a     −3a
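A minimal numerical check of P(t) = e^(Rt) for this Jukes-Cantor rate matrix, compared against the closed form P_ii(t) = ¼ + ¾·e^(−4at). The rate a and time t below are arbitrary illustrative values.

```python
# Sketch: compute P(t) = expm(R*t) for the Jukes-Cantor matrix above and
# check it against the closed form for the diagonal entry.
import numpy as np
from scipy.linalg import expm

a, t = 0.25, 0.5                                  # illustrative rate and time
R = a * (np.ones((4, 4)) - 4 * np.eye(4))         # off-diagonals a, diagonal -3a
P = expm(R * t)

print(P[0, 0], 0.25 + 0.75 * np.exp(-4 * a * t))  # matrix exponential vs. closed form
print(P.sum(axis=1))                              # each row sums to 1
```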
Kimura 2-Parameter
R =
      A    C    G    T
A     ·    b    a    b
C     b    ·    b    a
G     a    b    ·    b
T     b    a    b    ·
a/b = transition/transversion bias
HKY (Hasegawa, Kishino, Yano)
R =
      A         C         G         T
A     ·         m·p_C     m·k·p_G   m·p_T
C     m·p_A     ·         m·p_G     m·k·p_T
G     m·k·p_A   m·p_C     ·         m·p_T
T     m·p_A     m·k·p_C   m·p_G     ·
k = transition/transversion bias
The BLOSUMn matrices
• Start with multiple, ungapped alignments of
proteins found using PROTOMAT.
• Build clusters by grouping sequences with at least N% identity.
• Measure the score for each pair, defined as:
  s_ij = 2 · log₂(p_ij / e_ij)
  e_ij is the expected probability of occurrence of the i,j pair;
  p_ij is the observed probability of the i,j pair.
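A tiny sketch of the half-bit log-odds calculation with invented pair counts. Real BLOSUM matrices start from clustered blocks and use a slightly more careful expected-pair model; this only illustrates the s_ij = 2·log₂(p_ij/e_ij) step.

```python
# Sketch of the half-bit log-odds score from a (made-up) symmetric table of
# aligned-pair counts.
import numpy as np

counts = np.array([[90., 10.,  5.],   # toy counts c_ij for a 3-letter alphabet
                   [10., 80., 15.],
                   [ 5., 15., 70.]])

p = counts / counts.sum()             # observed pair probabilities p_ij
q = p.sum(axis=1)                     # background frequencies q_i
e = np.outer(q, q)                    # simple expected model e_ij = q_i * q_j
s = 2 * np.log2(p / e)                # half-bit log-odds scores

print(np.round(s, 1))
```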
Limitations
Naive approach: measure frequencies of
aligned pairs and gaps in randomly selected
confirmed alignments to get pij, use a
“random” set of sequences to obtain eij.
• Difficulty 1: it is difficult to get a good random
sample of sequences or alignments – databases are
biased.
• Difficulty 2: When sequences diverge from a
common ancestor recently, pij is small and s is
strongly negative. When sequences diverged long
ago, pij tends to eij and s approaches zero.
A short compendium of distances
and scores
• Jukes-Cantor distance
• Kimura distance
• Dayhoff evolutionary distance
• BLOSUM scores
• Profile scores
• Average scores
References
• Gu, X. & Li, W, 1996. A general additive distance with
time-reversibility and rate variation among nucleotide
sites. Proc. Natl. Acad. Sci. USA 93: 4671-4676.
• Hasegawa, M., Kishino, H., & Yano, T., 1985. Dating of
the human-ape splitting by a molecular clock of
mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Sanderson, M. J. & Shaffer, H. B., 2002. Troubleshooting
molecular phylogenetic analyses. Annu. Rev. Ecol. Syst.
33: 49-72.
The distance between sequences,
Part II.
Careful Measures
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension
Exceptions to Mendel’s Laws
The theory: a chromosomal basis of inheritance
Some so-called exceptions:
• linkage and recombination
• gene conversion
• transposition and mobile genetic elements
• A plethora of other mutations: point mutations,
reversions, deletions, frameshifts, duplications,
inversions
“Exceptions” do not result in rejection of Mendelian genetics
but a better understanding of the mechanisms underlying
Mendelian inheritance.
Mutation frequencies
(#mutations/generation)
• Frequency of point mutation: 10⁻⁷ to 10⁻⁸
• Reversion of point mutations: ~10⁻⁸. Sometimes called back mutation, sometimes called convergence.
• Reversion of deletion mutations: undetectably small.
"Loss of function" mutations result in grossly lower biological fitness. The rate of extinction due to gross "loss of function" is much greater than the rate of reversion, so the line will die long before reversion can occur. In the aggregate, the record will show a pseudo-reversion.
Mutation frequencies
• Deletions: 10⁻⁶ – dependent on chromosomal region. Caveat: may be underestimated; less detectable because they are often lethal.
• Frameshifts: 10⁻⁶ – often repaired.
• Duplications: 10⁻³ – in E. coli, approximately 0.1% of a culture for a given region of the chromosome.
• Inversions: hard to detect, not always mutations.
• Gene conversions: still unknown. Reparative.
Mutators increase mutation frequencies by ~100; they work on "hot spots".
Protein-based inheritance –
Prions
• Proteins that change their shape in response to
fluctuating environmental pressures, and then
maintain that shape during mitosis and meiosis,
constitute a form of cellular memory.
• Various structural conformations are propagated
outside of the traditional genetic framework.
Hsp90 and Sup35
• A buffer for silent polymorphisms: Hsp90
– promotes the folding of signal transducers
– buffers the effects of many silent polymorphisms
– may serve as a capacitor of evolutionary change –
storing and releasing genetic variation
• “Epigenetic inheritance”: The Sup35 prion
James Joyce’s List
Milk
Call mom!
Lettuce
Plumb the smithy of my soul for the unborn race-consciousness…
Rent
-------------------------------------
Thriving in fluctuating environments by exploiting
pre-existing genetic variations.
References
Recent Publications on Conformational Change
and Evolution
• Queitsch, C., Sangster, T.A. and Lindquist, S. 2002. Hsp90 as a
capacitor of phenotypic variation. Nature 417: 618-624.
• Jensen, M.A., True, H.L., Chernoff, Y.O., and Lindquist, S., 2001.
Molecular Population Genetics and Evolution of a Prion-like
Protein in Saccharomyces cerevisiae. Genetics 159: 527-535.
• True, H.L., and Lindquist, S.L. 2000. A yeast prion provides an
exploratory mechanism for genetic variation and phenotypic
diversity. Nature 407: 477-483.
• Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as a capacitor
for morphological evolution. Nature 396: 336-342.
Mutations and time
Take a series of sequences and figure out how
different they are by counting up their
substitutions.
[Tree diagram: three sequences A, B and C, with 6, 5 and 3 substitutions counted along the connecting branches.]
Mutations and time
What process takes us from A to B to C?
[Tree diagram: A, B and C share no direct ancestry; the branches are labeled with the processes involved – a gene conversion, a repairable frameshift, and 2 point accepted mutations.]
Counting mutations
Consider a counting process {N(t), t ∈ T}, where N(t_i) − N(t_j) is the number of mutations in the time interval (t_j, t_i].
[Tree diagram: no direct ancestry between A, B and C, but we can still count substitutions on the edges: N(t_AC) = 6 PM, N(t_AB) = 1 GC, N(t_BC) = 1 FS + 2 PM.]
Times on the edges of the tree
The "interoccurrence" times between mutations, τ_1 = t_1 − t_0 (with t_0 = 0), τ_2 = t_2 − t_1, …, τ_i = t_i − t_{i−1}, are exponential random variables with mean 1/b, such that
P[τ_i > h] = e^(−bh)
and
P[τ_i ≤ h] = 1 − e^(−bh)
for h ≥ 0.
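A short simulation sketch of these exponential interoccurrence times. The rate b and edge length T are the illustrative values used on the next slide; the Poisson comparison at the end is a property of this model, shown here as a sanity check.

```python
# Sketch: interoccurrence times are Exponential(mean = 1/b), so the number of
# mutations on an edge of length T has mean b*T (Poisson).  Values illustrative.
import numpy as np

rng = np.random.default_rng(0)
b, T = 1 / 10_000, 60_000            # e.g. one point mutation per 10,000 years

waits = rng.exponential(scale=1 / b, size=100)   # interoccurrence times
print(waits.mean())                              # should be close to 1/b

counts = [(rng.exponential(1 / b, size=50).cumsum() < T).sum() for _ in range(1000)]
print(np.mean(counts), b * T)                    # simulated mean vs. b*T = 6
```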
Edge times
Gene conversions: b_gc = 1 gc / 2,000* years
Frame shifts: b_fs = 1 shift / 5,000 years
Point mutations: b_pm = 1 pm / 10,000 years
[Tree diagram, expected edge times: A–B: 1/b_gc = 2,000 yrs; A–C: 6/b_pm = 60,000 yrs; B–C: 2/b_pm + 1/b_fs = 25,000 yrs.]
*Just a wild guess
Edge times
Population of A = 10⁵
Population of B = 10⁶
Population of C = I don't care.
[Tree diagram, edge times scaled by population size: A–C: 6/(N_A·b_pm) = 60 × 10⁻² yrs; A–B: 1/(N_A·b_gc) = 20 × 10⁻² yrs; B–C: 2/(N_B·b_pm) + 1/(N_B·b_fs) = 25 × 10⁻³ yrs.]
Calculating divergence times
Doolittle, R.F., Feng, D.-F., Tsang, S., Cho, G. and Little, E. "Determining Divergence Times of the Major Kingdoms of Living Organisms With a Protein Clock." Science 271, pp. 470-477, 1996.
Calculating divergence times
Task: Build a model for evolutionary time
based on pairwise distances, dij, and the
fossil record
– Start with the vertebrate fossil record - the
biogeochemistry gives reliable times.
– Map the fossil-based phylogeny to the sequence-based phylogeny and compare edge lengths.
– Adjust the sequence-based time model to match
the vertebrate fossil record.
Using the fossil record
Vertebrates: Time of first appearance in fossil record
versus sequence similarity
[Scatter plot: distance measure (0–30) versus time of first appearance in the fossil record (0–600 Ma).]
Readjusting the clock
After sampling the vertebrate fossil record and fitting the sequence data to it, they keep the same clock for the rest of the tree.
Result: Eukaryotes and Prokaryotes diverged about 2.5 billion years ago.
On fitting the fossil record to
sequence data
Challenges: unequal rates of change in different
species due to:
– different reproductive cycles in different species
– different base population sizes in different species.
Obtaining bacterial mutation rates using vertebrate
mutation rates when we are looking at the
evolution of populations: how viable is it?
Population mutation
Suppose an average rate of mutation per site is about 10⁻⁷ (ignoring duplications).
Compare lengths of reproductive cycles:
– Prokaryotes (blue-green algae and bacteria): 20 minutes to an hour per generation.
– Humans: in the US, the average time to first child is 24.8 years.
How many times does a bacterium reproduce in the time it takes a human being to reproduce?
24 × 365 × 25 = 219,000
So if we are comparing bacterial mutation rates to human mutation rates and we are looking at aggregate populations, we have to adjust by a factor of ~10⁶.
Population mutation
Size of the base population on planet earth:
– 5 × 10³⁰ prokaryotes (UG, Bill Whitman) – including about a mole of bacteria
– 3 × 10⁹ humans
How many bacteria are there, propagating how fast, in comparison to humans? Worst case ratio?
Calculate using base population × rate of generation × number of mutable genes:
(10²³ × 10⁶ × 10³) / (10⁹ × 1 × 10⁴) = 10¹⁹
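A quick check of the order-of-magnitude arithmetic above, using the rounded figures from the slide.

```python
# Order-of-magnitude check: base population x generations per human generation
# x mutable genes, bacteria vs. humans (rounded slide figures).
bacteria = 1e23 * 1e6 * 1e3
humans   = 1e9  * 1   * 1e4
print(bacteria / humans)   # 1e+19
```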
One final issue: The Success
Question
When mutations succeed, they succeed within an ecological
niche.
So when we ask “When did a species arise?”, it is not enough
to ask about the likelihood of a certain kind of mutation,
one must also ask: what is the likelihood that that mutation
arose in a niche that would support it?
So, don’t forget about acceptance rates.
The FOXP2 point mutation
Enard et al, “Molecular evolution of
FOXP2, a gene involved in speech and
language”, Nature, Vol. 418, August 22,
2002
Silent/expressed mutations in
FOXP2
[Tree relating orangutan, gorilla and human through the ancestral nodes OHG and HG; the edges carry the labels 1/2, 0/7, 2/2 and 0/2.]
Edge labels are: amino acid / DNA substitutions
Selective sweeps
Measures for determining the existence of a
sweep:
– Tajima’s D: from Genetics, 1989 (conservative)
– Fay and Wu’s H: from “Hitchhiking under
positive Darwinian selection”, Genetics, 2000.
Also, Griffiths and Tavare estimate selection
using linked SNP data
Population mutation rates
P_ia = 4·N_a·b_i – the population mutation rate for site i in species a, where N_a is the effective population size of species a and b_i is the mutation rate per generation at site i.
[Histogram: relative frequency (0–0.2) versus number of point mutations (1–16).]
Tajima's D for FOXP2
D = −2.20
P < 0.03%
S/a_n = 0.079%
S is the number of segregating sites; a_n is a normalizing constant that depends on the sample size n (a_n = Σ_{i=1}^{n−1} 1/i).
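For reference, a small sketch of the Watterson-style estimate S/a_n with the definitions above; the numbers in the example call are invented, not the FOXP2 data.

```python
# Sketch of Watterson's estimate theta_W = S / a_n, where S is the number of
# segregating sites and a_n = sum_{i=1}^{n-1} 1/i depends on the sample size n.
def watterson_theta(num_segregating_sites, sample_size):
    a_n = sum(1.0 / i for i in range(1, sample_size))
    return num_segregating_sites / a_n

print(watterson_theta(12, 20))   # illustrative numbers only
```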
Discovering different rate
constants
Finding the time of appearance of
the FOXP2 segregation
• Sample current human population
worldwide.
• Generate trees with different times for
the human sequence data.
• Measure the likelihood of the different
trees.
Multiple rates
The automorphism f, mapping M onto itself, used
to be a simple shift operation.
Now, it incorporates several underlying processes,
including:
– mutation of the bases (mutation rate)
– expression of the mutations (expression rate)
– stabilization of a conformational phenotype
(stabilization rate)
– success of the substitution (acceptance rate)
The distance between sequences,
Part III.
Algorithms for phylogenies
M. Elizabeth Corey, Ph.D.
UCSC Extension and UC Berkeley Extension
Motivation
• Phylogenies provide measures of similarity and
can lay a foundation for scoring alignments.
• Rate structures provide indicators for motifs.
• Branch points allow us to identify and classify
interesting bases.
– If the branch points are in phenotypic trees, the
mutating bases can be used as phenotypic identifiers.
– If the branch points are in genotypic trees, mutating
(nonsilent) bases can be used as genetic identifiers.
What goes into a phylogeny?
[Diagram: pairwise and multiple alignments feed into phylogenies via distance measures (UPGMA, NN) and site information (MLE and Parsimony), together with substitution scores, equilibrium distributions for MLE, and transitional probability data.]
What do we get in return?
[Diagram: phylogenies feed back into pairwise and multiple alignment as guide trees, rates and probabilities, scoring matrices, and transitional probability data.]
Part III: Goals
• Depict methods for finding guide trees for
progressive multiple alignment.
• Clarify the differences between MLE,
Maximum Parsimony and Distance
Methods and identify the optimization
techniques appropriate for each.
• Define a new approach for faster
identification of near-optimal phylogenies.
Progressive multiple alignment
• Choose a set of scores for sequence comparison
– Alignment scores from Needleman-Wunsch, Smith-Waterman and
variants.
– Consensus word score from BLAST, PSI-BLAST and others
– Substitution (scoring) matrices – PAM, BLOSUM, Jukes-Cantor,
etc.
• Construct a reputable guide tree
– Hierarchical clustering (UPGMA, Neighbor-Joining, Fitch and Margoliash)
– Maximum Parsimony (simple or weighted).
– Maximum Likelihood Estimation (MLE)
• Use the guide tree to produce an alignment
Tree evaluation - Parsimony
• Given a semi-labeled tree,
it is possible
to determine the tree’s internal nodes
(ancestral sequences) using a parsimony
algorithm.
• Evaluation function: A summation of the
scored mutations in the parsimonious tree.
Parsimony - Illustrated
[Tree: node 1 joins leaves ABC and ADC, giving A(B or D)C at cost 1; node 2 joins leaves ABE and ACC, giving A(B or C)(E or C) at cost 2; node 3 (the root) joins these, giving ABC at a total cost of 3.]
Example: Simple Parsimony
Initialization:
Set the cost, C = 0. Set k = 2n-1, where n is the
number of sequences.
Recursion to compute node N_k:
if k is a leaf node, N_k = sequence k
if k is not a leaf node:
  Compute N_i and N_j for the daughter nodes of N_k.
  Where the intersection of N_i and N_j is nonempty, set N_k = N_i ∩ N_j;
  otherwise increment the cost by the number of non-matching residues and set N_k = N_i ∪ N_j.
Termination:
Minimum cost of tree = C.
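A minimal Python sketch of this simple (Fitch-style) parsimony count, applied position by position. The nested-tuple tree and sequences are taken from the "Parsimony – Illustrated" slide and reproduce its total cost of 3; everything else is an assumption of the sketch.

```python
# Minimal sketch of the simple parsimony count described above.
def parsimony(node):
    """Return (candidate character sets per site, cost) for the subtree at node."""
    if isinstance(node, str):                      # leaf: a sequence
        return [{c} for c in node], 0
    left, right = node
    sets_l, cost_l = parsimony(left)
    sets_r, cost_r = parsimony(right)
    sets_k, cost = [], cost_l + cost_r
    for si, sj in zip(sets_l, sets_r):
        if si & sj:
            sets_k.append(si & sj)                 # non-empty intersection
        else:
            sets_k.append(si | sj)                 # union, and count one mutation
            cost += 1
    return sets_k, cost

tree = (("ABC", "ADC"), ("ABE", "ACC"))            # leaves from the illustration
print(parsimony(tree)[1])                          # minimum cost of the tree: 3
```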
Tree evaluation – Distance
methods
• Given a set of alignment scores, but without
assuming a tree topology, it is possible to
determine a tree and its edge lengths using a
distance method. This is sometimes called
minimum evolution and includes the
hierarchical clustering methods.
• Evaluation function: The sum of the edge
lengths.
Hierarchical Clustering –
Illustrated UPGMA
[Figure: UPGMA clustering of five sequences, joining the closest pair at each step and placing each new node at half the pairwise distance (e.g. t_1 = t_2 = ½d_12); from Durbin et al., 2001.]
Algorithm: UPGMA
Input: N sequences and their relative distances, dij
Initialization:
Assign each sequence to its own cluster, Ci.
Define a leaf of T for each sequence and place at height = 0.
Iteration:
  Pick two clusters C_i, C_j such that d_ij is minimal.
  Define a new cluster k by C_k = C_i ∪ C_j.
  Define a new set of distances {d_kl} between C_k and all current clusters.
  Define a node k with daughter nodes i and j, and place it at height h_k = ½d_ij.
  Add k to the set of current clusters and remove i and j.
Termination:
  Rooted: when only two clusters i, j remain, add the root at height ½d_ij.
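A compact sketch of the UPGMA loop as summarized above, using the average leaf-pair distance between clusters (the definition given in the comparison slide later). The input distance matrix is invented.

```python
# Compact UPGMA sketch: clusters are tuples of leaf names; cluster-to-cluster
# distance is the average of the leaf-to-leaf distances.  Distances invented.
from itertools import combinations

def upgma(leaf_dist, leaves):
    """leaf_dist[(a, b)] = d_ab for leaf names.  Returns (tree, root height)."""
    def d(p, q):
        return leaf_dist[(p, q)] if (p, q) in leaf_dist else leaf_dist[(q, p)]

    def cluster_dist(ci, cj):                       # average leaf-pair distance
        return sum(d(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

    clusters = {(leaf,): (leaf, 0.0) for leaf in leaves}   # members -> (subtree, height)
    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2), key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(ci, cj) / 2.0                # new node at half the distance
        subtree = (clusters.pop(ci)[0], clusters.pop(cj)[0])
        clusters[ci + cj] = (subtree, height)
    (members, (tree, height)), = clusters.items()
    return tree, height

dists = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 6, ("C", "D"): 4}
print(upgma(dists, ["A", "B", "C", "D"]))   # (('A', 'B'), ('C', 'D')), root height 3.0
```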
Tree evaluation - MLE
• Given a tree topology and sequences
preassigned to each leaf, it is possible to
determine a tree’s edge lengths using
maximum likelihood estimation.
• Evaluation function: the likelihood of the
tree.
Estimating Likelihood
• Estimate branch lengths by viewing evolution
as a random process
• Requires a probability model of evolution as a
function of time.
– For DNA one can use Jukes-Cantor model (all
nucleotides have same substitution rates), or
Kimura model (different rates for transitions
and transversions).
– For proteins one can use Dayhoff, but in the
probability form not the log-odds form.
Estimating Likelihood
S1, etc. are the bases or residues observed in the extant
and ancestral taxa.
v = λt, where λ is the substitution rate and t is absolute time.
P_ij(v) is the probability that the residue at node s_i becomes the residue at node s_j in time v.
p_0 is the prior probability of the bases or residues at any position.
The likelihood for this tree (a four-taxon tree with root 0, whose children 5 and 6 lead to leaves 1, 2 and 3, 4) is:
L = p_0 · P_0,5(v_5) · P_5,1(v_1) · P_5,2(v_2) · P_0,6(v_6) · P_6,3(v_3) · P_6,4(v_4)
Example: Likelihood
For each mutating site in a set of sequences
Initialization:
Set k = 2n-1, where n is the number of sequences.
Recursion:
Compute P(L_k|a) for each symbol, a, in the alphabet as follows:
  If k is a leaf node:
    if x_{k,u} = a, then P(L_k|a) = 1, else P(L_k|a) = 0.
  If k is not a leaf node:
    Compute P(L_i|a), P(L_j|a) for all a at daughters i, j.
    Set P(L_k|a) = Σ_{b,c} P(b|a,t_i) P(L_i|b) P(c|a,t_j) P(L_j|c).
Termination:
  Likelihood for site u = Σ_a p_a P(L_{2n−1}|a)
  (p_a is the equilibrium value of the probability distribution for a.)
Concluding step: combine the likelihoods for each site.
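A small sketch of this per-site recursion for a two-leaf tree under Jukes-Cantor. The branch length, rate and tree are placeholders, and the leaf representation is an assumption of the sketch.

```python
# Sketch of the per-site pruning recursion above, for a two-leaf tree under
# Jukes-Cantor.  Branch lengths, rate and tree are illustrative.
import numpy as np

BASES = "ACGT"

def jc_prob(t, a=0.25):
    """P(b | a, t) under Jukes-Cantor: 4x4 matrix for branch length t."""
    same = 0.25 + 0.75 * np.exp(-4 * a * t)
    diff = 0.25 - 0.25 * np.exp(-4 * a * t)
    return np.where(np.eye(4, dtype=bool), same, diff)

def site_likelihood(node, t=0.1):
    """node is a base ('A'...) or a pair (left, right); returns P(L_k | a) over a."""
    if isinstance(node, str):                              # leaf
        return (np.array(list(BASES)) == node).astype(float)
    left, right = node
    Li, Lj = site_likelihood(left, t), site_likelihood(right, t)
    P = jc_prob(t)
    # P(L_k|a) = [sum_b P(b|a,t) P(L_i|b)] * [sum_c P(c|a,t) P(L_j|c)]
    return (P @ Li) * (P @ Lj)

pi = np.full(4, 0.25)                                      # equilibrium distribution
L_root = site_likelihood(("C", "T"))
print(float(pi @ L_root))                                  # likelihood of this site
```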
Maximizing Likelihood
Estimation over edge times
Likelihood estimation includes a step for computing the likelihood of some character "a" at node k given the subtree of k.
While we know that there is the possibility of substitutions leading to a, these depend on how long a time we have to make those substitutions, and we do not know the edge times of the tree. We must explore a series of possible times in order to maximize the likelihood.
• A method that maximizes likelihoods over edge times is what is
usually referred to as MLE.
• Standard MLE procedures do not maximize likelihoods over all
topologies of the tree.
Comparisons between MLE, Parsimony and
Distance Methods
Algorithm | Requires semi-labeled tree | Requires scored alignments | Order | Results – edge weights | Results – internal tree nodes | Resulting tree is ultrametric
MLE | Yes | No | L·a^(2n−1); 2an² | Transitional probabilities | Subtree probability | Yes
Parsimony | Yes | No | 2an² | Mutation counts | Ancestral sequences | No
Distance methods | No | Yes | 2n² | Distance measures (e.g. alignment scores) | UPGMA: a cluster of sequences | UPGMA: no; NN: yes
Exploring different topologies
• Successive addition and rearrangement
– Very common method (see Phylip programs including:
PROTPARS, DNAPARS, DNACOMP, DNAML,
DNAMLK, RESTML, KITSCH, FITCH, CONTML,
MIX and DOLLOP)
– Sequences are taken in the order that they appear in the
input file and successively added to a tree.
• MCMC
Successive addition
• Initialization:
– Place the set of sequences into L.
– Create a tree, T, with one node – the root.
• Iteration: for each sequence in L
– Remove a sequence from L and add it as a leaf to T.
– Apply a process of local rearrangement (in Felsenstein’s package,
there are (n-1)(2n-3) arrangements.)
– Score each locally arranged tree.
– set T to equal the best scoring tree.
• Termination: Globally rearrange the tree by swapping subtrees,
score each globally rearranged tree and accept the tree with the best
score.
Markov Chain Monte Carlo
A Bayesian method for phylogenetic inference
– Moderately new method rooted in molecular dynamics.
– Topologies are randomly generated and scored so that a
representative set of most likely tree topologies can be
identified.
Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem. The following
explanation is based on their methodology - the mistakes
are mine, the facts and foundations, theirs.
Introduction to the method
{t} is the set of all semi-labeled trees
Introduction to the method
Sampling the set of trees
[Diagram: successive proposals Q1, Q2, Q3 move the sampler through the set of trees.]
Introduction to the method
A Chain of Accepted Samples
[Diagram: a chain of accepted samples t_0 → t_1 → t_2 → …, connected by proposals Q_01, Q_12, Q_23, with the leaf grouping changing along the chain (e.g. from "a bc d" to "a c bd" and back).]
Introduction to the method
[Diagram: the partitioned space of trees, with representatives {t_1, …, t_3} marking the regions.]
MCMC propaganda
• allow exact inference provided certain convergence criteria are demonstrated.
• are efficient and can handle many more taxa
or sequences.
• measure uncertainty during tree
construction (no bootstrapping needed.)
Summary of the Algorithm
1. Choose a starting tree
2. Perturb the current tree’s topology and branch
lengths to find a new tree.
3. Measure the likelihood for the new tree.
4. Compare the new tree to the last tree and
decide whether or not to accept it into the
chain.
5. If you've got a sufficiently long chain, check the characteristics of your sample to see if there is convergence to a set of representative topologies. If so, stop. Otherwise, go to 2.
Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Cophenetic matrices.
2. What is our perturbation operator?
3. How do we build our sampling chain?
4. When are we done sampling?
The Cophenetic Matrix
Some Notation
t – a topology
n – a node
a(n) – the ancestor of a node
L – a leaf node (the leaves are the current record)
I – an internal node (the historical record)
Cophenetic Trees
Labeled history (t_1, t_2) provides an order on coalescent levels.
[Diagram: a three-leaf tree (L1, L2, L3) with internal nodes I0, I1, I2 at coalescent levels 0, 1 and 2; t_1 and t_2 mark the times between successive levels.]
Example: A Cophenetic Tree
[Tree with branch times t_1 = 0.8, t_2 = 0.3, t_3 = 0.7, t_4 = 0.5, t_5 = 0.9, t_6 = 1.5; total 4.7.]
These trees are described in terms of nodes coalescing or merging
backwards in time.
Example: Cophenetic Matrix
The cophenetic matrix for the previous tree.
Leaf   5     7     4     1     2     6     3
5      0    9.4   9.4   9.4   9.4   9.4   9.4
7            0    1.6   4.6   6.4   6.4   6.4
4                  0    4.6   6.4   6.4   6.4
1                        0    6.4   6.4   6.4
2                              0    3.6   3.6
6                                    0    2.2
3                                          0
The tree representation (s, a) is
{(5,7,4,1,2,6,3), (4.7, 0.8, 2.3, 3.2, 1.8, 1.1)}
The Cophenetic Matrix
Theorem: For any weighted binary tree with labeled
leaf nodes, the tree topology and branch lengths
can be uniquely determined using the within-tree
distances between all pairs of leaf nodes. (Lapointe and Legendre, 1992)
Note, each permutation of the leaf labels generates a different n × n symmetric matrix of distances.
What is the perturbation operator?
Q is the proposal function and it has two
stages:
• Q1 randomly selects a new leaf order
• Q2 perturbs the values of the matrix
supradiagonals.
The proposal mechanism is symmetrical
Q(tn,tn+1) = Q(tn+1,tn)
Details on Q1 and Q2
Q1 samples one of the 2^(n−1) leaf orderings of the current tree model.
Q2 simultaneously and independently modifies the elements of the superdiagonal by drawing uniformly from (a_i − d, a_i + d), where d is a constant.
By applying both types of perturbations, Q1 and Q2, all the permutations of trees can be reached.
Illustration of Q2
Subproblems to be discussed
1. How do we represent the tree so that it is easy to operate on? Use cophenetic matrices.
2. What is our perturbation operator? Q.
3. How do we build our sampling chain?
Apply Metropolis-Hastings
4. When are we done?
Acceptance with Metropolis-Hastings
Given a tree t, Metropolis-Hastings:
1. Applies Q to build a new tree, t*.
2. Always accepts the new tree when it is more
likely than the old one and sometimes accepts it
when it is less likely than the old one.
Acceptance with Metropolis-Hastings – the algorithm
If P(t*) > P(t)
accept t* into the chain.
else
accept t* into the chain with
probability P(t*) / P(t)
Acceptance with Metropolis-Hastings
The final step in evaluating the acceptance test is evaluating
P(t*) / P(t)
This is easy: P(t) is approximated using the likelihood estimate of t.
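A bare-bones sketch of the acceptance step. Here propose() and log_likelihood() are placeholders standing in for the cophenetic-matrix perturbation Q and the tree likelihood; working in log space avoids underflow.

```python
# Sketch of the Metropolis-Hastings acceptance step with a symmetric proposal.
import math, random

def metropolis_hastings(tree, propose, log_likelihood, n_steps=1000):
    chain = [tree]
    logL = log_likelihood(tree)
    for _ in range(n_steps):
        candidate = propose(chain[-1])               # symmetric proposal Q
        logL_new = log_likelihood(candidate)
        # accept with probability min(1, P(t*)/P(t)), done in log space
        if math.log(random.random()) < logL_new - logL:
            chain.append(candidate)
            logL = logL_new
        else:
            chain.append(chain[-1])
    return chain
```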
Size of chain and convergence
• How many trees do you have to propose before
you begin to get a good enough sample? Mau et
al 1998 sample over about 2500 trees for Clarkia,
a phylogeny with 9 leaves
• How do you test that you are done? At the end of
the run, we say that we have convergence if there
is a small set of topologies with high relative
frequency in the chain.
• What’s the result? The topologies with the highest
frequencies are the reported reconstructions.
Mixing
• To obtain a confidence measure, the
algorithm must be run more than once:
each run generates a chain of accepted trees.
• Chains "mix" well when they come up with the same representative topologies, starting from different tree topologies.
• If running a sufficient number of
independent chains is computationally
prohibitive, Suchard et al, 2002, provide a
“poor man's estimate of the uncertainty”.
Example with binary data
(from Mau, et al, 1998)
9 species of genus Clarkia (California plants)
120 restriction sites
Data translated into a 9 x 120 matrix of zeroes
and ones, representing the absence or
presence of a restriction site in the genome
of each species.
Running the MCMC algorithm
Random starting trees
Chains of length 250,000 were subsampled at
rate of 1/100 = 2500 trees
Each run took 20 minutes on a Sparc 10.
Convergence was inferred by reproducibility
across runs with very different starting
trees.
The most common topologies for
Clarkia
A = 1,2; B = 3,4; C = 5,6; D=8,9
References
Smouse and Li (1989) introduced the Bayesian
paradigm, but not the notation, to the phylogeny
reconstruction problem.
Goldman (1993) used non-Bayesian Monte Carlo
tests of significance to assess the adequacy of
evolutionary models.
Griffiths and Tavare (1994) constructed Markov
chains to compute likelihoods for ancestral
inference.
Mau, Newton and Larget (1998) apply MCMC to
sample trees using Bayes theorem.
Drill-down: Rates
The way I use it, and I admit this is quirky, motif means the genetic profile
for a functional structure. Using the following definitions:
– Let rG be the rate of mutation for a gene.
– Let rE be the rate of expressed mutation for the protein G encodes.
– Let rS be the rate of structural mutation for the protein G encodes.
– Let rF be the rate of functional mutation for the protein G encodes.
r_G > r_E > r_S > r_F
Note that the rate of neutral mutations is r_N = r_G − r_F.
The "true" rate of mutation for a motif is r_F; the observed rate of mutation for members of a motif in a genotypic tree is r_G. If we want motif branchings, we eliminate all branchings in the phylogeny occurring with rates r_N.
Drill-down: Semi-labeled trees
Trees with a defined branching pattern and defined leaf labels
but WITHOUT edge lengths or internal node labels.
[Tree with a fixed branching pattern over the labeled leaves nccbac, nacbac, ncbbbc and nccnaa, but no edge lengths or internal node labels.]
In our terms, phylogenies with known branching patterns but
without information about ancestors or mutation times.
Drill-down: Progressive
Alignments
• As you move up the tree, add to the sum of characters in the growing alignment.
Progressive Alignments
The sum of characters in the growing alignment can be represented in a table of values called a frequency matrix or a profile.
Progressive Alignments
Alignments are frozen once they are made. Scores are then
calculated between aligned positions tabulated in a
frequency matrix, using a scoring table
[Table: a frequency matrix (profile) of counts for the residues A, G, S, T at each aligned position, scored against a substitution table; e.g. for one column, S_ij = 2 × s(G,G) + 1 × s(A,G).]
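A small sketch of one frequency-matrix (profile) column and a profile-to-residue score. The column, counts and scoring table are invented, chosen so that the score reproduces the 2 × G:G + 1 × A:G example above.

```python
# Sketch: a profile column as residue counts, scored against a new residue.
from collections import Counter

column = ["G", "G", "A"]                       # one column of the frozen alignment
profile = Counter(column)                      # {'G': 2, 'A': 1}

score_table = {("G", "G"): 3, ("A", "G"): 1, ("A", "A"): 4}   # toy substitution scores

def column_score(profile, residue):
    """Sum of pairwise scores between the profile counts and a new residue."""
    total = 0
    for aa, count in profile.items():
        pair = (aa, residue) if (aa, residue) in score_table else (residue, aa)
        total += count * score_table[pair]
    return total

print(column_score(profile, "G"))              # 2 x s(G,G) + 1 x s(A,G) = 7
```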
Algorithm: Neighbor-joining
Input: N sequences and their relative distances, dij
Initialization:
Define a leaf of T for each sequence
Iteration
Pick two nodes i, j such that d_ij − (r_i + r_j) is minimal.
Define a new set of distances, {d_kl}, between k and all current nodes.
Define a node k with daughter nodes i and j, and place it at edge lengths e_ik = ½(d_ij + r_i − r_j) and e_jk = d_ij − e_ik.
Add k to the set of current nodes and remove i and j.
Termination:
Unrooted: when only two nodes i, j remain, add an edge of length d_ij between them.
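A compact sketch of the neighbor-joining iteration above on an invented additive distance matrix; with these distances it recovers the A–B and C–D cherries and their edge lengths. Everything beyond the formulas on this slide is an assumption of the sketch.

```python
# Compact neighbor-joining sketch.  d[i][j] holds the (symmetric) distances.
def neighbor_joining(d, nodes):
    edges, nodes = {}, list(nodes)
    while len(nodes) > 2:
        L = len(nodes)
        r = {i: sum(d[i][k] for k in nodes if k != i) / (L - 2) for i in nodes}
        # pick the pair minimizing d_ij - (r_i + r_j)
        i, j = min(((a, b) for a in nodes for b in nodes if a != b),
                   key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]))
        k = (i, j)                                          # new internal node
        edges[(k, i)] = 0.5 * (d[i][j] + r[i] - r[j])
        edges[(k, j)] = d[i][j] - edges[(k, i)]
        d[k] = {}
        for m in nodes:
            if m not in (i, j):                             # update distances to k
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    i, j = nodes
    edges[(i, j)] = d[i][j]                                 # final edge
    return edges

d = {"A": {"B": 5, "C": 9, "D": 8},
     "B": {"A": 5, "C": 10, "D": 9},
     "C": {"A": 9, "B": 10, "D": 5},
     "D": {"A": 8, "B": 9, "C": 5}}
print(neighbor_joining(d, "ABCD"))   # A-B cherry (2, 3), C-D cherry (3, 2), internal edge 4
```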
Comparison:
Neighbor-joining and UPGMA
Minimization:
– UPGMA uses d_ij
– Neighbor-joining uses d_ij − (r_i + r_j), where r_i = (1 / (|L| − 2)) Σ_{k∈L} d_ik.
Distance measures:
For distances between leaves i and j:
• d_ij is the same in both algorithms.
For distances between clusters/nodes k and m:
• UPGMA uses d_km = (1 / (|C_k|·|C_m|)) Σ_{p∈C_k, q∈C_m} d_pq
• Neighbor-joining uses d_km = ½(d_im + d_jm − d_ij), where i and j are the daughters of k.
Edge lengths:
UPGMA sets the height of node k to ½ the distance between daughters i, j (½d_ij).
Neighbor-joining sets the edge length between k and daughter i to ½(d_ij + r_i − r_j), and between k and daughter j to d_ij − e_ik.
Drill-down: MLE
[Diagram, simplest case, site u = 3: two leaves with sequences nccbabc and ncbbcbc, so P(L_i|c) = 1 and P(L_j|b) = 1, and at their parent P(L_k|a) = P(c|a,t_i) · P(b|a,t_j).]
Drill-down: Enumerating
topologies
|t_unlabeled| = (1/n) · C(2(n−1), n−1)
|t_semilabeled| = (2n − 3)!! = (2n − 2)! / (2^(n−1) (n − 1)!)
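A quick numerical check of these counting formulas for small n.

```python
# Evaluate the two counting formulas above for a few values of n.
from math import comb, factorial

def n_unlabeled(n):      # (1/n) * C(2(n-1), n-1)
    return comb(2 * (n - 1), n - 1) // n

def n_semilabeled(n):    # (2n-3)!! = (2n-2)! / (2^(n-1) (n-1)!)
    return factorial(2 * n - 2) // (2 ** (n - 1) * factorial(n - 1))

for n in (3, 4, 5, 9):
    print(n, n_unlabeled(n), n_semilabeled(n))
```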
Drill-down: Acceptance with
Metropolis-Hastings
A proposed tree t* is accepted with probability:
min( 1, P(t*)·Q(t*, t) / (P(t)·Q(t, t*)) )
However, because the proposal mechanism is symmetric, you can step forward or backward with equal probability:
Q(t, t*) = Q(t*, t)
Hence our test becomes
min( 1, P(t*) / P(t) )