Lecture 8

Profile HMMs for sequence families and Viterbi equations
Linda Muselaars and Miranda Stobbe
Overview chapter 5

Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
More on estimation of probabilities.
Optimal model construction.
Weighting training sequences.

Key issues

Identifying the relationship of an individual sequence to a sequence family.
How to build a profile HMM.
How to use profile HMMs to detect potential membership in a family.
How to use profile HMMs to give an alignment of a sequence to the family.

Key issues (2)

Lollipops for a valuable contribution to this lecture (the speakers decide what counts as valuable).

Needed theory

Emission probabilities.
Silent states.
Pair HMMs.
The Viterbi algorithm.
The Forward algorithm.

Contents

Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
– Non-probabilistic profiles
– Basic profile HMM parameterisation
Searching with profile HMMs.
Profile HMM variants for non-global alignments.

Example alignment

[Figure: multiple alignment of seven globin sequences: HBA_HUMAN, HBB_HUMAN, MYG_PHYCA, GLB3_CHITP, GLB5_PETMA, LGB2_LUPLU and GLB1_GLYDI.]

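Ungapped regions

Gaps tend to line up.
We can consider models for ungapped regions.
Specify independent probabilities e_i(a) of observing residue a in position i:
\[ P(x \mid M) = \prod_{i=1}^{L} e_i(x_i) \]
But of course we score with the log-odds ratio against the background distribution q:
\[ S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}} \]
The values \log(e_i(a) / q_a) form a position specific score matrix (PSSM).
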
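The ungapped log-odds score can be sketched in a few lines of Python. This is only an illustration: the two-letter alphabet, the probabilities and the function names `pssm_from_probs` and `score` are assumptions made for the example, not part of the lecture.

```python
import math

def pssm_from_probs(emissions, background):
    """Turn per-position emission probabilities e_i(a) into log-odds scores."""
    return [
        {a: math.log(e_i[a] / background[a]) for a in e_i}
        for e_i in emissions
    ]

def score(pssm, x):
    """Sum the position specific scores of x; x must be as long as the model."""
    if len(x) != len(pssm):
        raise ValueError("ungapped model: sequence and profile lengths must match")
    return sum(column[a] for column, a in zip(pssm, x))

# Toy two-letter alphabet; every number here is an illustrative assumption.
background = {"A": 0.5, "B": 0.5}
emissions = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
pssm = pssm_from_probs(emissions, background)

print(score(pssm, "AB"))  # log(0.9/0.5) + log(0.8/0.5)
print(score(pssm, "BA"))  # negative: a poor match to the profile
```

A positive score means the sequence is more likely under the profile than under the background model; a negative score means the opposite.
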
Drawbacks

Multiple alignments do have gaps.
These need to be accounted for.
One approach, used for example with the BLOCKS database, is to combine the scores of several ungapped regions.
We will instead develop a single probabilistic model for the whole extent of the alignment.

Short review

Emission probabilities: the probability that a certain symbol is seen when in a certain state k.
Silent states: states in an HMM that do not emit symbols.

Building the model (1)

We need position sensitive gap scores.
Start from an HMM with a repetitive structure of (match) states.
Transitions between consecutive match states have probability 1.
Emission probabilities: e_{M_i}(a).
[Diagram: Begin → ... → Mj → ... → End]

Building the model (2)

Deal with insertions: introduce a set of new states I_i.
The I_i have emission distribution e_{I_i}(a).
This is normally set to the background distribution q_a.
[Diagram: Begin, Mj and End, with insert state Ij added.]

Building the model (3)

Deal with deletions.
These could be handled by forward jumps between non-neighbouring match states.
To allow arbitrarily long gaps, use silent delete states D_j instead.
[Diagram: Begin, Mj and End, with delete state Dj added.]

Costs for additional states

Insertions: the cost is the sum of the costs of the transitions and emissions (one M→I transition, a number of I→I transitions, and one I→M transition).
Deletions: the cost is the sum of the costs of an M→D transition, a number of D→D transitions, and a D→M transition.
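
As a rough illustration of the insertion cost, the sketch below adds up the transition terms for an insertion of k residues. It assumes that insert states emit from the background distribution, so their log-odds emission terms are zero; the function name and the transition probabilities are hypothetical.

```python
import math

def insertion_cost(k, a_mi, a_ii, a_im):
    """Log-odds cost of an insertion of k >= 1 residues.

    Assumes the insert emissions equal the background distribution, so only
    the M->I, (k - 1) I->I and I->M transitions contribute.
    """
    if k < 1:
        raise ValueError("an insertion emits at least one residue")
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)

# Hypothetical transition probabilities for one model position.
a_mi, a_ii, a_im = 0.1, 0.4, 0.6
for k in (1, 2, 3):
    print(k, insertion_cost(k, a_mi, a_ii, a_im))
```

Each extra inserted residue adds the same term log a_II, so the cost has the shape of an affine gap penalty whose open and extend costs can differ from position to position.
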
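
Full model

[Diagram: the full profile HMM, with delete states Dj, insert states Ij and match states Mj between Begin and End.]

Comparison with pair HMM

[Diagram: pair HMM with states Begin, M (emitting aligned pairs with probability p_{x_i y_j}), X (emitting x_i with probability q_{x_i}), Y (emitting y_j with probability q_{y_j}) and End.]
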
Non-probabilistic profiles

Profiles with the same structure as a profile HMM can be defined without an underlying probabilistic model.
Scores are set to averages of standard substitution scores.
Anomalies:
– Conservation of columns is not taken into account.
– Scores for gaps do not behave properly.

Example

HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****

The score for residue a in column 1 would be set to:
\[ \frac{5}{7} s(V, a) + \frac{1}{7} s(F, a) + \frac{1}{7} s(I, a) \]
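
The averaging for column 1 can be written out as a small Python sketch. The substitution scores below are illustrative assumptions rather than values from a particular published matrix, and `average_profile_score` is a hypothetical helper name.

```python
def average_profile_score(column, a, s):
    """Average of s(b, a) over the residues b observed in one alignment column.

    Gap characters are skipped and the average is taken over the remaining
    residues; that treatment of gaps is an assumption of this sketch.
    """
    residues = [b for b in column if b != "-"]
    return sum(s[(b, a)] for b in residues) / len(residues)

# Column 1 of the example alignment: five V, one F, one I.
column_1 = "VVVVVFI"

# Illustrative substitution scores s(b, V); the values are assumptions.
s = {("V", "V"): 4, ("F", "V"): -1, ("I", "V"): 3}

score_v = average_profile_score(column_1, "V", s)
print(score_v)  # (5*4 + 1*(-1) + 1*3) / 7
```

The same average is obtained however many sequences support the column, which is one way to see why column conservation is not reflected in these scores.
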
Basic profile HMM parameterisation

Objective: make the probability distribution peak around members of the family.
Available parameters:
– Length of the model.
– Transition and emission probabilities.

Length of the model

Which multiple alignment columns do we assign to match states?
And which to insert states?
Heuristic rule: columns in which more than 50% of the characters are gaps should be modelled by insert states.
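
The heuristic rule is easy to try on the ten globin columns from the earlier example. The sketch below is a minimal version; the function name `match_columns` and the handling of a column with exactly 50% gaps (kept as a match column) are assumptions.

```python
def match_columns(alignment, gap="-", max_gap_fraction=0.5):
    """Indices of the columns assigned to match states.

    A column is modelled by insert states when more than max_gap_fraction
    of its characters are gaps; all other columns become match states.
    """
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    chosen = []
    for c in range(width):
        gaps = sum(row[c] == gap for row in alignment)
        if gaps / len(alignment) <= max_gap_fraction:
            chosen.append(c)
    return chosen

# The ten columns from the globin example (flanking dots removed).
rows = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]
print(match_columns(rows))  # the two mostly-gap columns are left out
```
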
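Probability parameters

Transition probability:
\[ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \]
where A_{kl} is the number of transitions from state k to state l, so the denominator is the number of transitions from state k to any state.
Emission probability:
\[ e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')} \]
where E_k(a) is the number of times symbol a is emitted in state k.
In the limit of many training sequences this is an accurate and consistent estimate.
With few sequences, use a pseudocount method such as Laplace's rule: add one to each count.
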
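A minimal sketch of the estimates with Laplace's rule, using exact fractions. The counts are assumptions chosen for illustration (four A's observed in a match column over the alphabet ACGT, and three M→M plus one M→D transition out of one match state); the helper name `laplace_estimate` is hypothetical.

```python
from fractions import Fraction

def laplace_estimate(counts, outcomes):
    """Relative frequencies after adding a pseudocount of one to every outcome."""
    total = sum(counts.get(o, 0) for o in outcomes) + len(outcomes)
    return {o: Fraction(counts.get(o, 0) + 1, total) for o in outcomes}

# Assumed counts: four A's observed in one match column.
emissions = laplace_estimate({"A": 4}, "ACGT")
print(emissions["A"], emissions["C"])

# Assumed counts: three M->M transitions, one M->D, no M->I.
transitions = laplace_estimate({"M2": 3, "D2": 1}, ["M2", "D2", "I1"])
print(transitions["M2"], transitions["D2"], transitions["I1"])
```

Without the pseudocount, the unobserved symbols and the unobserved M→I transition would get probability zero, and any sequence using them would be ruled out entirely.
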
Example

[Figure: alignment of the five sequences Bat, Rat, Cat, Gnat and Goat over the alphabet A, C, G, T; asterisks mark the four columns assigned to match states.]

Example continued

[Diagram: profile HMM with states Begin, M1 to M4, I0 to I4, D1 to D4 and End.]

Emission probabilities of the match states:
     M1    M2    M3    M4
A   5/8   1/7   3/7   1/8
C   1/8   1/7   1/7   5/8
G   1/8   4/7   2/7   1/8
T   1/8   1/7   1/7   1/8

Transition probabilities out of M1:
a_{M1 M2} = 4/7, a_{M1 D2} = 2/7, a_{M1 I1} = 1/7

Searching with profile HMMs

Obtaining significant matches of a sequence to the profile HMM:
– Viterbi algorithm: P(x, π* | M).
– Forward algorithm: P(x | M).
Giving an alignment of a sequence to the family:
– Highest scoring, or Viterbi, alignment.

Viterbi equations

V_j^M(i): log-odds score of the best path matching subsequence x_{1...i} to the submodel up to state j, ending with x_i being emitted by state M_j.
V_j^I(i): log-odds score of the best path ending in x_i being emitted by I_j.
V_j^D(i): log-odds score of the best path ending in state D_j.
For comparison, the pair HMM recurrence:
\[ v^M(i, j) = p_{x_i y_j} \max \begin{cases} (1 - 2\delta - \tau)\, v^M(i-1, j-1), \\ (1 - \varepsilon - \tau)\, v^X(i-1, j-1), \\ (1 - \varepsilon - \tau)\, v^Y(i-1, j-1). \end{cases} \]

Viterbi equations

\[ V_j^M(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V_{j-1}^M(i-1) + \log a_{M_{j-1} M_j}, \\ V_{j-1}^I(i-1) + \log a_{I_{j-1} M_j}, \\ V_{j-1}^D(i-1) + \log a_{D_{j-1} M_j}; \end{cases} \]
\[ V_j^I(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V_j^M(i-1) + \log a_{M_j I_j}, \\ V_j^I(i-1) + \log a_{I_j I_j}, \\ V_j^D(i-1) + \log a_{D_j I_j}; \end{cases} \]
\[ V_j^D(i) = \max \begin{cases} V_{j-1}^M(i) + \log a_{M_{j-1} D_j}, \\ V_{j-1}^I(i) + \log a_{I_{j-1} D_j}, \\ V_{j-1}^D(i) + \log a_{D_{j-1} D_j}. \end{cases} \]

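The three Viterbi recurrences can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: it assumes the same transition log-probabilities at every model position, it assumes that insert states emit from the background (so their log-odds emission term is zero), and it returns only the best score without the traceback. The name `viterbi_score` and all numbers are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_score(x, match_scores, t):
    """Log-odds score of the best path of sequence x through a profile HMM.

    match_scores[j-1][a] is log(e_Mj(a) / q_a) for match state j = 1..L.
    t maps transition types ("MM", "MI", ..., "DD") to log-probabilities;
    sharing them across positions is a simplifying assumption.
    """
    L, n = len(match_scores), len(x)
    VM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0  # the Begin state plays the role of M_0

    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:  # M_j emits x_i
                VM[j][i] = match_scores[j - 1][x[i - 1]] + max(
                    VM[j - 1][i - 1] + t["MM"],
                    VI[j - 1][i - 1] + t["IM"],
                    VD[j - 1][i - 1] + t["DM"],
                )
            if i >= 1:  # I_j emits x_i with log-odds 0
                VI[j][i] = max(
                    VM[j][i - 1] + t["MI"],
                    VI[j][i - 1] + t["II"],
                    VD[j][i - 1] + t["DI"],
                )
            if j >= 1:  # D_j is silent, so i does not advance
                VD[j][i] = max(
                    VM[j - 1][i] + t["MD"],
                    VI[j - 1][i] + t["ID"],
                    VD[j - 1][i] + t["DD"],
                )
    # Move into the End state once all len(x) residues have been emitted.
    return max(VM[L][n] + t["MM"], VI[L][n] + t["IM"], VD[L][n] + t["DM"])

# Hypothetical one-column model over a two-letter alphabet.
probs = {"MM": 0.8, "MI": 0.1, "MD": 0.1, "IM": 0.6, "II": 0.3,
         "ID": 0.1, "DM": 0.7, "DI": 0.1, "DD": 0.2}
t = {k: math.log(p) for k, p in probs.items()}
match_scores = [{"A": 1.0, "B": -1.0}]
print(viterbi_score("A", match_scores, t))
```

Storing back-pointers next to each maximum would recover the Viterbi alignment itself; they are left out here to keep the sketch short.
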
Forward algorithm

\[ F_j^M(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \log \big[ a_{M_{j-1} M_j} \exp(F_{j-1}^M(i-1)) + a_{I_{j-1} M_j} \exp(F_{j-1}^I(i-1)) + a_{D_{j-1} M_j} \exp(F_{j-1}^D(i-1)) \big]; \]
\[ F_j^I(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \log \big[ a_{M_j I_j} \exp(F_j^M(i-1)) + a_{I_j I_j} \exp(F_j^I(i-1)) + a_{D_j I_j} \exp(F_j^D(i-1)) \big]; \]
\[ F_j^D(i) = \log \big[ a_{M_{j-1} D_j} \exp(F_{j-1}^M(i)) + a_{I_{j-1} D_j} \exp(F_{j-1}^I(i)) + a_{D_{j-1} D_j} \exp(F_{j-1}^D(i)) \big]. \]

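Evaluating log[ a exp(F) + ... ] directly underflows when the F values are very negative. A common remedy, sketched below, is to rewrite each term as exp(log a + F) and factor out the largest exponent. The helper names `log_sum_exp` and `forward_match_update` and the numbers are assumptions for illustration; only the match-state update is shown.

```python
import math

def log_sum_exp(terms):
    """Stable log(sum(exp(v) for v in terms)); -inf if every term is -inf."""
    m = max(terms)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in terms))

def forward_match_update(emission_log_odds, prev, log_a):
    """F_j^M(i) from (F_{j-1}^M, F_{j-1}^I, F_{j-1}^D) at position i - 1.

    prev holds the three previous forward values and log_a the matching
    log transition probabilities into M_j.
    """
    return emission_log_odds + log_sum_exp(
        [f + la for f, la in zip(prev, log_a)]
    )

# Hypothetical values: exp(-1000) underflows to 0.0 in floating point,
# so the direct formula would try to take log(0).
prev = [-1000.0, -1001.0, float("-inf")]
log_a = [math.log(0.8), math.log(0.6), math.log(0.7)]
print(forward_match_update(0.0, prev, log_a))
```

The insert and delete updates have the same shape, with the indices taken from the corresponding recurrences.
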
Initialisation and termination

The Begin state is treated as M_0 and the End state as M_{L+1}; the End state is silent, so the termination step has no emission term.

Viterbi algorithm:
– Initialisation: \( V_0^M(0) = 0 \)
– Termination:
\[ V_{L+1}^M(n) = \max \begin{cases} V_L^M(n-1) + \log a_{M_L M_{L+1}}, \\ V_L^I(n-1) + \log a_{I_L M_{L+1}}, \\ V_L^D(n-1) + \log a_{D_L M_{L+1}}. \end{cases} \]

Forward algorithm:
– Initialisation: \( F_0^M(0) = 0 \)
– Termination:
\[ F_{L+1}^M(n) = \log \big[ a_{M_L M_{L+1}} \exp(F_L^M(n-1)) + a_{I_L M_{L+1}} \exp(F_L^I(n-1)) + a_{D_L M_{L+1}} \exp(F_L^D(n-1)) \big] \]

Alternative to log-odds scoring

Log-likelihood score (LL score): log P(x | M).
The LL score is strongly length dependent.
Solutions:
– Divide by the sequence length.
– Use a Z-score.
Which method is preferred?
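
Both corrections can be sketched in a few lines. This is a simplification under stated assumptions: the Z-score here compares an LL score with a sample of LL scores from unrelated sequences of comparable length, and all scores below are made up for illustration. The function names are hypothetical.

```python
import statistics

def per_residue_score(ll, length):
    """LL score divided by the sequence length."""
    return ll / length

def z_score(ll, unrelated_lls):
    """Number of standard deviations ll lies above the mean of unrelated scores.

    unrelated_lls is assumed to come from sequences of comparable length
    that do not belong to the family.
    """
    mu = statistics.mean(unrelated_lls)
    sigma = statistics.stdev(unrelated_lls)
    return (ll - mu) / sigma

# Made-up LL scores for sequences of about 100 residues.
unrelated = [-210.0, -200.0, -190.0]
print(per_residue_score(-150.0, 100))
print(z_score(-150.0, unrelated))
```
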

Demo

[Figure: Part of the profile HMM]
[Figure: Scoring]
[Figure: Part of the multiple alignment]
[Figure: Relative frequencies]

Flanking model states

Used to model the flanking sequences on either side of the actual profile match.
Extra probabilities needed:
– Emission probability: q_a.
– 'Looping' transition probability: (1 - η).
– Transition probability from the left flanking state: depends on the application.

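Model for local alignment (Smith-Waterman style)

[Diagram: the profile HMM states Mj, Ij and Dj with their own Begin and End, enclosed by two flanking states Q and an outer Begin and End.]

Model for overlap matches

[Diagram: the profile HMM states Mj, Ij and Dj between Begin and End, with two flanking states Q.]

Model for repeat matches

[Diagram: the profile HMM states Mj, Ij and Dj with their own Begin and End, and a single flanking state Q between an outer Begin and End.]
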
Summary

Construction of a profile HMM for different kinds of alignments.
Use of profile HMMs to detect potential membership in a family.
Use of profile HMMs to give an alignment of a sequence to the family.

Discussion subject
BLAST versus profile HMM