Ch 5. Profile HMMs for sequence families
Biological sequence analysis: Probabilistic models of proteins and nucleic acids
Richard Durbin
Sean R. Eddy
Anders Krogh
Graeme Mitchison
Contents
• Components of profile HMMs
• HMMs from multiple alignments
• Searching with profile HMMs
• Variants for non-global alignments
• More on estimation of probabilities
• Optimal model construction
• Weighting training sequences
Introduction
• Interest in sequence families
• Profile HMMs
– Consensus modeling
• Theory of inference and learning for profile HMMs
• Figure 5.1: multiple alignment of seven globin protein sequences
Ungapped score matrices
• Only considering ungapped regions
– Probability model:
$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$
• PSSM (position specific score matrix)
– Log-odds ratio:
$S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$
Components of profile HMMs (1)
• Consideration of gaps
– Henikoff & Henikoff [1991]
• Combining multiple ungapped block models
– Allowing gaps at each position, but with the same gap score (g) at every position
• Profile HMMs
– Repetitive structure of states
– Different probabilities in each position
– Full probabilistic model for sequences in the
sequence family
Components of profile HMMs (2)
• Match states
– Emission probabilities
$e_{M_j}(a)$
[Diagram: a linear chain of match states, Begin → ... → M_j → ... → End]
Components of profile HMMs (3)
• Insert states
– Emission prob. $e_{I_j}(a)$
• Usually the background distribution $q_a$
– Transition prob.
• $M_j \to I_j$, $I_j \to I_j$ (self-loop), $I_j \to M_{j+1}$
– Log-odds score of a gap of length k (no log-odds contribution from emissions; a sketch follows the diagram below):
$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\log a_{I_j I_j}$
[Diagram: insert state I_j looping above the match states, between Begin and End]
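A one-function sketch of this gap score; the transition probabilities passed in are hypothetical values, not taken from any particular model.

```python
import math

def insert_gap_score(k, a_MI, a_IM_next, a_II):
    """Log-odds cost of an insertion of length k at position j; emissions
    contribute nothing because e_{I_j}(a) = q_a cancels against the background."""
    return math.log(a_MI) + math.log(a_IM_next) + (k - 1) * math.log(a_II)

# A longer gap pays one extra log a_{I_j I_j} per added residue:
print(insert_gap_score(1, 0.05, 0.5, 0.4))
print(insert_gap_score(3, 0.05, 0.5, 0.4))
```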
Components of profile HMMs (4)
• Delete states
– No emission prob.
– Cost of a deletion
• M→D, D→D, D→M
• Each D→D transition probability might be different
[Diagram: silent delete state D_j above the match states, between Begin and End]
Components of profile HMMs (5)
• Combining all parts
[Diagram: match states M_j, insert states I_j, and delete states D_j combined]
Figure 5.2 The transition structure of a profile HMM.
HMMs from multiple alignments (1)
• Key idea behind profile HMMs
– Model representing the consensus for the family
– Not the sequence of any particular member
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1. The starred columns are the ones that will be treated as ‘matches’ in the profile HMM.
HMMs from multiple alignments (2)
• Non-probabilistic profiles
– Gribskov, McLachlan & Eisenberg [1987]
• Score for residue a in column 1:
$\frac{5}{7}\, s(\mathrm{V}, a) + \frac{1}{7}\, s(\mathrm{F}, a) + \frac{1}{7}\, s(\mathrm{I}, a)$
– Disadvantages
• Strongly conserved regions may be poorly captured
• The scores have no likelihood interpretation
• The scores for gaps do not behave as expected
HMMs from multiple alignments (3)
• Basic profile HMM parameterization
– Aim: make the distribution peak around members of the family
• Parameters
– The probability values: estimation is straightforward if many independent aligned sequences are given (see the sketch after this list):
$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$
– Length of the model: chosen heuristically or in a systematic way
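A small sketch of this counting step; `normalize_counts` is a hypothetical helper applied to whichever count table ($A_{kl}$ or $E_k(a)$) is at hand.

```python
def normalize_counts(counts):
    """ML estimate: divide each count by the row total, e.g.
    a_{kl} = A_{kl} / sum_{l'} A_{kl'}  or  e_k(a) = E_k(a) / sum_{a'} E_k(a')."""
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}

# Toy transition counts out of one match state (hypothetical numbers):
print(normalize_counts({"M2": 7, "I1": 1, "D2": 2}))  # {'M2': 0.7, 'I1': 0.1, 'D2': 0.2}
```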
HMMs from multiple alignments (4)
• Figure 5.4: the profile HMM estimated from the alignment in Figure 5.3
Searching with profile HMMs (1)
• Main usage of profile HMMs
– Detecting potential membership in a family
– Matching a sequence to the profile HMM
– Viterbi equations or forward equations
– Scores are log-odds ratios relative to the random model:
$P(x \mid R) = \prod_i q_{x_i}$
Searching with profile HMMs (2)
• Viterbi equations:

$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\begin{cases}V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j},\\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j},\\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j};\end{cases}$

$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\begin{cases}V^M_j(i-1) + \log a_{M_j I_j},\\ V^I_j(i-1) + \log a_{I_j I_j},\\ V^D_j(i-1) + \log a_{D_j I_j};\end{cases}$

$V^D_j(i) = \max\begin{cases}V^M_{j-1}(i) + \log a_{M_{j-1}D_j},\\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j},\\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j}.\end{cases}$
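A log-space sketch of these recursions, assuming a hypothetical `model` object with length `L`, emissions `eM[j][a]` and `eI[j][a]`, background `q[a]`, and log transition scores `t[j][(s, s2)]` out of position index j (index 0 stands for Begin, L+1 for End); the flanking insert state I_0 is omitted for brevity.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_odds(x, model):
    """Best log-odds path score of sequence x against the profile HMM."""
    n, L, t = len(x), model.L, model.t
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # treat Begin as M_0 with nothing emitted yet
    for j in range(1, L + 1):
        for i in range(0, n + 1):
            if i > 0:  # match and insert states emit residue x_i
                a = x[i - 1]
                V["M"][j][i] = math.log(model.eM[j][a] / model.q[a]) + max(
                    V["M"][j - 1][i - 1] + t[j - 1].get(("M", "M"), NEG_INF),
                    V["I"][j - 1][i - 1] + t[j - 1].get(("I", "M"), NEG_INF),
                    V["D"][j - 1][i - 1] + t[j - 1].get(("D", "M"), NEG_INF))
                V["I"][j][i] = math.log(model.eI[j][a] / model.q[a]) + max(
                    V["M"][j][i - 1] + t[j].get(("M", "I"), NEG_INF),
                    V["I"][j][i - 1] + t[j].get(("I", "I"), NEG_INF),
                    V["D"][j][i - 1] + t[j].get(("D", "I"), NEG_INF))
            V["D"][j][i] = max(  # delete states are silent: same i, previous j
                V["M"][j - 1][i] + t[j - 1].get(("M", "D"), NEG_INF),
                V["I"][j - 1][i] + t[j - 1].get(("I", "D"), NEG_INF),
                V["D"][j - 1][i] + t[j - 1].get(("D", "D"), NEG_INF))
    # transition into End (index L+1) after emitting all of x
    return max(V["M"][L][n] + t[L].get(("M", "M"), NEG_INF),
               V["I"][L][n] + t[L].get(("I", "M"), NEG_INF),
               V["D"][L][n] + t[L].get(("D", "M"), NEG_INF))
```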
Searching with profile HMMs (3)
• Forward algorithm:

$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\big[a_{M_{j-1}M_j}\exp(F^M_{j-1}(i-1)) + a_{I_{j-1}M_j}\exp(F^I_{j-1}(i-1)) + a_{D_{j-1}M_j}\exp(F^D_{j-1}(i-1))\big];$

$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\big[a_{M_j I_j}\exp(F^M_j(i-1)) + a_{I_j I_j}\exp(F^I_j(i-1)) + a_{D_j I_j}\exp(F^D_j(i-1))\big];$

$F^D_j(i) = \log\big[a_{M_{j-1}D_j}\exp(F^M_{j-1}(i)) + a_{I_{j-1}D_j}\exp(F^I_{j-1}(i)) + a_{D_{j-1}D_j}\exp(F^D_{j-1}(i))\big].$
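The forward recursion is the Viterbi recursion with max replaced by log-sum-exp; a stable helper, as a sketch:

```python
import math

def log_sum_exp(*vals):
    """Stable log(sum_i exp(v_i)): swap this in for `max` in the Viterbi
    recursion and it becomes the forward recursion above."""
    m = max(vals)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in vals))

# e.g. the delete recursion becomes:
#   F_D[j][i] = log_sum_exp(F_M[j-1][i] + t_MD, F_I[j-1][i] + t_ID, F_D[j-1][i] + t_DD)
print(log_sum_exp(math.log(0.5), math.log(0.25), math.log(0.25)))  # log(1.0) = 0.0
```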
Variants for non-global alignments (1)
• Local alignments (flanking model)
– Emission probabilities in the flanking states use the background values $q_a$.
– Looping probability close to 1, e.g. $(1-\epsilon)$ for some small $\epsilon$.
[Diagram: the profile HMM with flanking (looping) states Q before Begin and after End]
Variants for non-global alignments (2)
• Overlap alignments
– Only transitions to the first model state are allowed.
– Used when the match is expected to be either present as a whole or absent
– A transition to the first delete state allows the first residue to be missing
[Diagram: overlap-alignment variant, with flanking states Q at both ends]
Variants for non-global alignments (3)
• Repeat alignments
– Transition from the right flanking state back to the random model
– Can find multiple matching segments in the query sequence
[Diagram: repeat-alignment variant, with a transition from the right flanking state Q back toward Begin]
More on estimation of prob. (1)
• Maximum likelihood (ML) estimation
– Given observed frequencies $c_{ja}$ of residue a in position j:
$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$
• Problems with ML estimation
– Residues never observed in the alignment get probability zero
– Especially serious when only a few examples are available
More on estimation of prob. (2)
• Simple pseudocounts
– qa: background distribution
– A: weight factor
$e_{M_j}(a) = \frac{c_{ja} + A q_a}{A + \sum_{a'} c_{ja'}}$
– Laplace’s rule: $A q_a = 1$
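A sketch of the pseudocount estimate; the column counts and the value of A below are made-up numbers.

```python
def pseudocount_emissions(counts, background, A):
    """e_{M_j}(a) = (c_{ja} + A q_a) / (A + sum_{a'} c_{ja'}).
    With q uniform over 20 amino acids, A = 20 gives Laplace's rule (A q_a = 1)."""
    total = A + sum(counts.values())
    return {a: (counts.get(a, 0) + A * q) / total for a, q in background.items()}

# Toy DNA column: four observed residues, all 'A'; pseudocounts keep e(C) > 0
bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(pseudocount_emissions({"A": 4}, bg, A=1.0))
```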
• Bayesian framework
– Dirichlet prior:
$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$
More on estimation of prob. (3)
• Dirichlet mixtures
– A mixture of Dirichlet priors: often better than a single Dirichlet prior
– With K pseudocount priors $\alpha^1, \dots, \alpha^K$ and mixture coefficients $p_k$:

$e_{M_j}(a) = \sum_k P(k \mid c_j)\,\frac{c_{ja} + \alpha^k_a}{\sum_{a'}\big(c_{ja'} + \alpha^k_{a'}\big)}$

$P(k \mid c_j) = \frac{p_k\, P(c_j \mid k)}{\sum_{k'} p_{k'}\, P(c_j \mid k')}$

$P(c_j \mid k) = \frac{\big(\sum_a c_{ja}\big)!}{\prod_a c_{ja}!}\;\frac{\prod_a \Gamma\big(c_{ja} + \alpha^k_a\big)}{\Gamma\big(\sum_a (c_{ja} + \alpha^k_a)\big)}\;\frac{\Gamma\big(\sum_a \alpha^k_a\big)}{\prod_a \Gamma\big(\alpha^k_a\big)}$
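A sketch of these formulas using `math.lgamma`; the multinomial coefficient $(\sum_a c_{ja})! / \prod_a c_{ja}!$ is the same for every component k, so it cancels in $P(k \mid c_j)$ and is dropped here. The prior components below are hypothetical.

```python
import math

def log_dirichlet_marginal(c, alpha):
    """log P(c|k) up to the multinomial coefficient: uses the Gamma-function
    ratios from the formula above."""
    s = sum(math.lgamma(ca + aa) for ca, aa in zip(c, alpha))
    s -= math.lgamma(sum(c) + sum(alpha))
    s += math.lgamma(sum(alpha)) - sum(math.lgamma(aa) for aa in alpha)
    return s

def mixture_emissions(c, priors):
    """e(a) = sum_k P(k|c) (c_a + alpha_a^k) / sum_a'(c_a' + alpha_a'^k);
    `priors` is a list of (p_k, alpha^k) pairs."""
    logs = [math.log(p) + log_dirichlet_marginal(c, a) for p, a in priors]
    m = max(logs)
    post = [math.exp(v - m) for v in logs]      # unnormalized P(k | c)
    z, n = sum(post), sum(c)
    return [sum((w / z) * (c[i] + a[i]) / (n + sum(a))
                for w, (p, a) in zip(post, priors))
            for i in range(len(c))]

# Toy: DNA counts with two hypothetical Dirichlet components
c = [4, 0, 1, 0]                                # counts for A, C, G, T
priors = [(0.5, [1.0, 1.0, 1.0, 1.0]),          # near-uniform component
          (0.5, [5.0, 0.5, 2.0, 0.5])]          # purine-rich component
print(mixture_emissions(c, priors))
```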
Optimal model construction (1)
• Model construction
– Which columns should be assigned to match states, and which to insert states?
– If the marked multiple alignment has no errors, the optimal model can be constructed.
– There are $2^L$ possible markings of L columns.
– Manual construction
– Maximum a posteriori (MAP) construction
Optimal model construction (2)
(a) Multiple alignment:

         x x . . . x
bat      A G - - - C
rat      A - A G - C
cat      A G - A A -
gnat     - - A A A C
goat     A G - - - C
         1 2 . . . 3

(b) Profile-HMM architecture:

[Diagram: match states 1–3 between beg (0) and end (4), with delete states above and insert states below]

(c) Observed emission/transition counts:

                        0   1   2   3
match        A          -   4   0   0
emissions    C          -   0   0   4
             G          -   0   3   0
             T          -   0   0   0
insert       A          0   0   6   0
emissions    C          0   0   0   0
             G          0   0   1   0
             T          0   0   0   0
state        M-M        4   3   2   4
transitions  M-D        1   1   0   0
             M-I        0   0   1   0
             I-M        0   0   2   0
             I-D        0   0   1   0
             I-I        0   0   4   0
             D-M        0   0   0   1
             D-D        0   1   0   0
             D-I        0   0   2   0
Optimal model construction (3)
• MAP match-insert assignment
– Recursive calculation of a number $S_j$
• $S_j$: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked
• $S_j$ is calculated from $S_i$ and the summed log probability between columns i and j
• $T_{ij}$: summed log probability of all the state transitions between marked columns i and j:
$T_{ij} = \sum_{x, y \in \{M, D, I\}} c_{xy} \log a_{xy}$
– The counts $c_{xy}$ are obtained from the partial state paths implied by marking columns i and j.
Optimal model construction (4)
• Algorithm: MAP model construction
– Initialization:
• $S_0 = 0$, $M_{L+1} = 0$.
– Recurrence: for $j = 1, \dots, L+1$:
$S_j = \max_{0 \le i < j}\big[S_i + T_{ij} + M_j + I_{i+1,\,j-1} + \lambda\big];$
$\sigma_j = \operatorname{argmax}_{0 \le i < j}\big[S_i + T_{ij} + M_j + I_{i+1,\,j-1} + \lambda\big];$
– Traceback: from $j = \sigma_{L+1}$, while $j > 0$:
• Mark column j as a match column.
• Set $j = \sigma_j$.
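A sketch of this dynamic program, assuming the score terms have already been precomputed from the marked alignment: `T[i][j]` and `I[i][j]` as (L+2)×(L+2) arrays (with `I[i][j] = 0` for empty column ranges), `M` of length L+2 with `M[L+1] = 0`, and `lam` the log prior term $\lambda$.

```python
def map_construction(L, T, M, I, lam):
    """S_j = max_{0<=i<j} S_i + T[i][j] + M[j] + I[i+1][j-1] + lam;
    column L+1 acts as a fictitious end column."""
    NEG = float("-inf")
    S = [NEG] * (L + 2)
    sigma = [0] * (L + 2)
    S[0] = 0.0
    for j in range(1, L + 2):
        best_i, best = 0, NEG
        for i in range(j):
            cand = S[i] + T[i][j] + M[j] + I[i + 1][j - 1] + lam
            if cand > best:
                best_i, best = i, cand
        S[j], sigma[j] = best, best_i
    marked, j = [], sigma[L + 1]
    while j > 0:                 # traceback: collect the marked (match) columns
        marked.append(j)
        j = sigma[j]
    return S[L + 1], sorted(marked)
```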
Weighting training sequences (1)
• Do you have a good random sample?
• The assumption that all examples are independent samples may be incorrect
• Solutions
– Weight sequences based on similarity
Weighting training sequences (2)
• Simple weighting schemes derived from a tree
– A phylogenetic tree of the sequences is given
• [Thompson, Higgins & Gibson 1994b]
– Kirchhoff’s law analogy
• [Gerstein, Sonnhammer & Chothia 1994]
$\Delta w_i = t_n\,\frac{w_i}{\sum_{\text{leaves } k \text{ below } n} w_k}$
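A sketch of this bottom-up rule on a hypothetical nested-dict tree: each leaf starts with its own edge length, and each internal edge $t_n$ is shared among the leaves below n in proportion to their current weights.

```python
def gsc_weights(node, weights):
    """Returns the leaf indices below `node`, updating `weights` in place."""
    if not node["children"]:                 # leaf: starts with its own edge length
        weights[node["leaf"]] = node["t"]
        return [node["leaf"]]
    below = [i for ch in node["children"] for i in gsc_weights(ch, weights)]
    total = sum(weights[i] for i in below)
    for i in below:                          # Δw_i = t_n * w_i / Σ_{k below n} w_k
        weights[i] += node["t"] * weights[i] / total
    return below

# Toy tree ((1:2, 2:2):5, 3:8) with a zero-length root edge (made-up numbers):
tree = {"t": 0, "children": [
    {"t": 5, "children": [
        {"t": 2, "children": [], "leaf": 1},
        {"t": 2, "children": [], "leaf": 2}]},
    {"t": 8, "children": [], "leaf": 3}]}
w = {}
gsc_weights(tree, w)
print(w)  # leaves 1 and 2 share the internal edge; leaf 3 keeps its long branch
```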
Weighting training sequences (3)
[Diagram: example tree with edge lengths $t_1 = 2$, $t_2 = 2$, $t_3 = 5$, $t_4 = 8$, $t_5 = 3$, $t_6 = 3$; the Kirchhoff analogy assigns voltages $V_5, V_6, V_7$ to internal nodes and currents $I_1, \dots, I_4$ to the leaf edges]

$w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64$
$I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47$
Weighting training sequences (4)
• Root weights from Gaussian parameters
– Influence of the leaves on the root distribution
– Altschul-Carroll-Lipman weights
• Model residues with Gaussian distributions along the tree edges
• Mean: a linear combination of the leaf values $x_i$
• The combination weights represent the influence of each leaf
Weighting training sequences (5)
$P(x \text{ at node } 4 \mid L_1, L_2) = K_1\, e^{-(x - v_1 x_1 - v_2 x_2)^2 / (2 t_{12})}$

$v_1 = t_2/(t_1 + t_2), \quad v_2 = t_1/(t_1 + t_2), \quad t_{12} = t_1 t_2/(t_1 + t_2)$

$\mu = v_1 x_1 + v_2 x_2$

[Diagram: internal node 4 joining leaves $x_1$ and $x_2$ via edges $t_1$ and $t_2$, with node 5 above connecting toward leaf $x_3$ via $t_3$]
Weighting training sequences (6)
• Voronoi weights
– Weights proportional to the volume of empty space around each sequence
– The family is viewed as a set of points in sequence space
– Algorithm
• Random samples: at the kth position, choose uniformly from the set of residues occurring at the kth position of the family
• $n_i$: count of samples closest to the ith family member
• ith weight: $w_i = n_i / \sum_k n_k$
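A Monte Carlo sketch of this scheme, using Hamming distance and breaking ties toward the first-listed member (a simplification; splitting ties fractionally is also possible).

```python
import random

def voronoi_weights(family, n_samples=10000, seed=0):
    """w_i = n_i / sum_k n_k, with n_i the number of random samples whose
    nearest family member (Hamming distance) is sequence i."""
    rng = random.Random(seed)
    # residue *sets* per column: sampling is uniform over the types seen there
    columns = [sorted({seq[k] for seq in family}) for k in range(len(family[0]))]
    counts = [0] * len(family)
    for _ in range(n_samples):
        sample = [rng.choice(col) for col in columns]
        dists = [sum(a != b for a, b in zip(sample, seq)) for seq in family]
        counts[dists.index(min(dists))] += 1   # ties go to the first-listed member
    total = sum(counts)
    return [c / total for c in counts]

print(voronoi_weights(["AFA", "AAC", "DAC"]))
```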
Weighting training sequences (7)
• Maximum discrimination weights
– Focus: the decision whether sequences are members of the family or not
– Discrimination: $D = \prod_k P(M \mid x^k)$
$P(M \mid x) = \frac{P(x \mid M)\, P(M)}{P(x \mid M)\, P(M) + P(x \mid R)\,(1 - P(M))}$
– Weight: $1 - P(M \mid x^k)$
– Effect: hard-to-recognize members are given large weights
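A sketch computing these weights from per-sequence log-odds scores $\log[P(x \mid M)/P(x \mid R)]$ (e.g. forward scores) and an assumed prior $P(M)$; the scores below are made-up.

```python
import math

def discrimination_weights(log_odds, prior=0.5):
    """Weights 1 - P(M|x^k) from log-odds scores s_k = log[P(x|M)/P(x|R)]:
    P(M|x) = sigmoid(s + log(prior / (1 - prior)))."""
    b = math.log(prior / (1.0 - prior))
    return [1.0 - 1.0 / (1.0 + math.exp(-(s + b))) for s in log_odds]

print(discrimination_weights([5.0, 0.0, -2.0]))  # weights grow as scores drop
```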
Weighting training sequences (8)
• Maximum entropy weights (1)
– Intuition
• $k_{ia}$: number of residues of type a in column i of a multiple alignment
• $m_i$: number of different types of residues in column i
• Make the weighted residue distributions as uniform as possible
– Weight contribution for sequence k from column i: $1/(m_i\, k_{i x^k_i})$
– ML estimation under these weights gives $p_{ia} = 1/m_i$
– Averaging over all columns [Henikoff & Henikoff 1994]:
$w_k = \sum_i \frac{1}{m_i\, k_{i x^k_i}}$
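A sketch of these position-based weights, reusing the toy alignment from the example on the next slide; the weights are normalized to sum to one at the end.

```python
from collections import Counter

def henikoff_weights(alignment):
    """w_k = sum_i 1 / (m_i * k_{i, x_i^k}), then normalized."""
    weights = [0.0] * len(alignment)
    for i in range(len(alignment[0])):
        col = [seq[i] for seq in alignment]
        counts = Counter(col)              # k_{ia}: copies of residue a in column i
        m = len(counts)                    # m_i: distinct residue types in column i
        for k, a in enumerate(col):
            weights[k] += 1.0 / (m * counts[a])
    total = sum(weights)
    return [w / total for w in weights]

print(henikoff_weights(["AFA", "AAC", "DAC"]))  # the middle sequence gets the least weight
```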
Weighting training sequences (9)
• Maximum entropy weights (2)
– Entropy: a measure of ‘uniformity’ [Krogh & Mitchison 1995]:
$H_i(\mathbf{w}) = -\sum_a p_{ia} \log p_{ia}$
– Maximize $\sum_i H_i(\mathbf{w})$ subject to $\sum_k w_k = 1$ (sum-to-one constraint)
– Example: $x^1 = \mathrm{AFA}$, $x^2 = \mathrm{AAC}$, $x^3 = \mathrm{DAC}$
$H_1(\mathbf{w}) = -(w_1 + w_2)\log(w_1 + w_2) - w_3 \log w_3$
$H_2(\mathbf{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$
$H_3(\mathbf{w}) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$
• Solution: $w_1 = w_3 = 0.5$, $w_2 = 0$