B - geneprediction.org

advertisement
Evolutionary Models for
Multiple Sequence Alignment
CBB/CS 261
B. Majoros
Part I
Evolutionary Sequence Models
The Utility of Evolutionary Models
Evolutionary sequence models make use of the assumption that natural
selection operates more strongly on some genomic features than others
(i.e., functional versus non-functional DNA elements), resulting in a
detectable bias in sequence conservation for the features of interest.
(Siepel et al., 2005)
More generally, conservation patterns may differ between levels of
DNA organization (i.e., amino acids in coding segments, versus
individual nucleotides in conserved noncoding elements).
Recall: Multivariate HMMs
CCATATAATCCCAGGCTCCGCTTCA
AGATATCATCAGATGCTACGTATCT
...
push
N tracks
ACATAATATCCGAGGCTCCGCTTCG
single emission from a single state
Each state emits N residues, one per track—i.e., a column of a
multiple alignment.
Thus, each state must have a model for the joint distribution of
the tracks (to represent emission probabilities).
Q: how might we model dependencies between tracks?
Non-independence of Sequences
Due to their common ancestry, genomic sequences for related taxa are
not independent. We can control for that non-independence by explicitly
modeling their dependence structure using a phylogenetic tree:
Branch lengths represent
evolutionary distance, which
conflates the distinct
phenomena of elapsed time
and substitution rate (as well
as selection and drift).
We will see later that a phylogenetic tree (or “phylogeny”) can be
interpreted as a special type of Bayesian network, in which sequence
conservation probabilities are expressed as a function of the branch
lengths.
PhyloHMMs
A PhyloHMM is a discrete multivariate HMM in which each state qi has
an associated evolution model i describing the expected rates and
patterns of evolution in the class of features represented by that state.
(Siepel et al., 2005)
Evaluating the Emission Probability
Bayesian network
alignment
P(G)
...A...
...B...
...C...
...D...
G
human
unobservables
chimp
mouse
E
rat
A
P(A,B,C,D) 
F
B
C
D
observables
 P(G)P(E G)P(A E)P(B E)P(F G)P(C F)P(D F)
G,E,F




P(observables)   P(root)  P(v parent(v))

unobservables
nonroot
v


A Recursion for the Emission Likelihood
The likelihood can be computed using a recursion known as
Felsenstein’s pruning algorithm:
 (u,a)
if u is a leaf


Lu (a)  
   Lc (b)P(c  b | u  a) otherwise
c
b

children(u)
=P(descendents of u|u=a).
P(c=b|u=a) = probability of observing b in the child, given that we observe a in the
parent.
 We can model this using a matrix of substitution probabilities, parameterized
by the evolutionary time t that has passed between the ancestor and descendant taxa:
pC G
pGC
pGG
pT C
pT G

pC T 
pGT 

pT T 
ancestor
pC C
A C G T
A
pAA

pCA

P(t) 
pGA

pT A
descendant
C p G pT 
pAC
AG
AT
Substitution Matrix vs. Rate Matrix
There is an important distinction between a substitution matrix
and an instantaneous rate matrix.
Q = Instantaneous rate matrix.
Gives instantaneous
rates of substitutions (not parameterized by time).
P(t) = Substitution matrix.
Gives the probabilities of
substitutions for a specific branch length, t (“time”).
Given Q and a set of phylogeny branch lengths
{ti}, we can compute a substitution matrix P(ti)
for each branch...
One Q, Many P(t)’s
P(t4 )
P(t1 )
P(t5 )

P(t2 )
human

dolphin
 P(t3 )
whale
kangaroo 

P(t)  e
P(t6 )
Qt
Continuous-time Markov Chains (CTMCs)
Substitution models are typically based on
continuous-time Markov chains. The
Markov property for CTMCs states that: P(t  s )  P(t )P( s )
We can derive an
instantaneous rate matrix
Q from P(t), where we
make use of the fact that
P(0)=I.
dP(t )
P (t  t )  P(t )
P(t )P(t )  P(t )I
 lim
 lim
t  0
t  0
dt
t
t
P(t )  P(0)
 P(t ) lim
 P(t )Q
t  0
t
does not depend on t
The solution to this
n n
2 2

Q
t
Q
t
 I  tQ 
 ...
differential equation is: P(t )  eQt  
n 0
n!
2!
eQt (the “matrix exponential”) denotes a Taylor expansion, which we
can solve via spectral (eigenvector) decomposition:
1
 2
Q  G



 1
G

n 
 e t 1
t 2

e
P(t )  G 



 1
G

t n
e 
Desirable Properties of Substitution Matrices
Reversibility:
(“detailed balance”)
ij(πiPij(t)=πjPji(t))
where πi is the equilibrium frequency of base i.
Transition/Transversion Discrimination:
Using different parameters for transition and transversion rates.
Transitions:
purinepurine,
pyrimidinepyrimidine
Transversions: purinepyrimidine
Some Common Forms for Q
Jukes-Cantor:
Q JK
Kimura:
    
    


      Q K 2 P       
    

    
    


Felsenstein:
  C G T 
A  G T 
Q FEL  
A C  T 


A C G  
these assume uniform equilibrium frequencies!
Hasegawa, Kishino, Yano:
Q HKY
 
A


 A
A
C G
 G
C 
C G
T 
T 
T 

 
General reversible model:
  C G T 
A  G T 
Q REV  
A C  T 


  A C G  
Part II
Multiple Alignment
Recall: Pairwise alignment with PHMMs
• Emission probabilities assess
similarity between aligned residues
BLOSUM62
• Transition probabilities can be used
to penalize gaps
δ = gap open ε = gap extend
• Viterbi decoding finds the optimal
alignment
ε
ε
I
δ
1-ε
M
1-2δ
D
δ
Pair HMM  Triple HMM  Quadruple HMM ?
for 100bp sequences...
pair HMM
aligns 2 species
triple HMM
aligns 3 species
#gallons of
water in the
Pacific ocean
quadruple HMM
aligns 4 species
(Holmes & Bruno, 2001)
Progressive Alignment: One Pair at a Time
-AAGTAGCATACG--GGCA-ATT
-AACT--CATACG--CCCAGATT
T-AGT--CATACGGG--CA-ATC
T-AGT--CGTATGGGGGCA-GTT
profile-to-profile
alignment
TAGTCATACGGG--CAATC
TAGTCGTATGGGGGCAGTT
AAGTAGCATACGGGCA-ATT
AACT--CATACGCCCAGATT
AACTCATACGCCCAGATT
AAGTAGCATACGGGCAATT
TAGTCATACGGGCAATC
sequence-to-sequence
alignment
TAGTCGTATGGGGGCAGTT
• First, align the leaves of the tree (using a Pair HMM)
• Then align ancestral taxa, using either a “consensus” sequence
for ancestors, or averaging over all pairs of leaf residues
A more principled approach: model the ancestral sequences
explicitly, using a probabilistic evolutionary model...
Networks of Residues
Problem:
Align the sequences ABC, DE, FG, and HI.
Solution:
AB-C-D-E--FG---HI
PQRS T
J KL
ABC
DE
MNO
FG
H I
The multiple alignment problem is precisely the problem
of inferring the network of residue homologies—i.e., the
evolutionary history of each base.
ABC
Building the Network
FG
DE
H I
ABC
DE
FG
HI
ABC
Building the Network
DE
FG
FG-HI
H I
ABC
DE
FG
HI
ABC
Building the Network
DE
unobservables
FG
FG-HI
MNO
FG
H I
H I
observables
ABC
DE
FG
HI
ABC
Building the Network
ABC
-DE
DE
FG
FG-HI
FG
H I
ABC
DE
MNO
FG
HI
H I
ABC
Building the Network
ABC
-DE
ABC
DE
FG
FG-HI
DE
FG
HI
DE
MNO
FG
H I
ABC
J KL
H I
ABC
Building the Network
ABC
DE
FG
MNO
FG
H I
ABC
DE
DE
FG
HI
H I
J KL
J KL
MNO
ABC
Building the Network
ABC
DE
FG
MNO
FG
H I
ABC
DE
DE
FG
HI
H I
J KL
J KL
MNO
JK-L--MNO
ABC
Building the Network
J KL
DE
J KL
ABC
DE
FG
MNO
FG
H I
MNO
H I
JK-L--MNO
PQRS T
J KL
ABC
DE
FG
HI
ABC
DE
MNO
FG
H I
ABC
Building the Network
J KL
FG
MNO
FG
H I
AB-C-D-E--FG---HI
ABC
DE
J KL
ABC
DE
DE
FG
HI
MNO
H I
PQRS T
J KL
ABC
DE
MNO
FG
H I
Evaluating Emission Probabilities
B
K
X
MNO
X
D
DE
J KL
ABC
P(X)
N
FG
H I
G
P(B,D,G,H) 
H
 P(X)P(K X)P(B K)P(DK)P(N X)P(G N)P(H N)
X ,K ,N




P(observables)   P(root)  P(v parent(v))

unobservables
nonroot
v


Sampling Alignments
= a sequence S
= (B,H), a “Branch-HMM” (transducer) B
describing the evolutionary process whereby
the child evolves from the parent, and the
actual indel history H which is a specific
realization of this process (a “draw”)
Sampling of alignments proceeds by sampling pairwise “branch
alignments” (or “indel histories”) H that live within the yellow
squares. An indel history is simply a path through a Pair HMM.
Sampling branch alignments is simple: just sample from a PHMM via
Forward or Backward: B
P (y y )P (s ,s y )
i1, j 1,y
k 1
k

Bi, j,y k1

Bi, j 1,y Pt (y k y k 1)Pe (,s j,2 y k )
k
P(y k y k 1,i, j,S1,S2 )  
Bi, j,y k1

Bi1, j,y Pt (y k y k 1)Pe (si,1, y k )
k

Bi, j,y k1

t
k
e
i,1
j,2
k
if y k  QM
if y k  QI
if y k  QD
Posterior Alignment Matrix
sequence 2...
pixel intensity =
posterior
probability of a
match in that
cell
(posterior
probability:
conditional on
the full input
sequences)
sequence 1...
Banding: Reduce the Search Space
Block Rearrangements are a Problem!
The simple
case:
Translocation
Duplication
Inversion
Summary
• Optimal MSA computation is intractible in the general case
• Progressive alignment is more tractible, but is greedy
• Iterative refinement attempts to undo greedy decisions
• PairHMM’s provide a principled way to perform pairwise steps
• Felsenstein’s algorithm computes likelihoods on phylogenies
• Substitution models can use continuous-time Markov chains
• Large-scale rearrangements are a problem
• Banding can improve alignment speed
Download