PHMM

Introduction to Profile Hidden Markov Models
Mark Stamp
Hidden Markov Models
- Here, we assume you know about HMMs
  - If not, see "A revealing introduction to hidden Markov models"
- Executive summary of HMMs
  - HMM is a machine learning technique
  - Also, a discrete hill-climbing technique
  - Train the model based on an observation sequence
  - Score a given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications
HMM Notation
- Recall, an HMM is denoted λ = (A, B, π)
- Observation sequence is O
- Notation (as in "A revealing introduction to hidden Markov models"):
  - T = length of the observation sequence
  - N = number of states in the model
  - M = number of observation symbols
  - Q = {q0, q1, …, qN−1} = the states of the Markov process
  - V = {0, 1, …, M−1} = set of possible observations
  - A = state transition probabilities
  - B = observation probability matrix
  - π = initial state distribution
  - O = (O0, O1, …, OT−1) = observation sequence
Hidden Markov Models
- Among the many uses for HMMs…
  - Speech analysis
  - Music search engine
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time
Limitations of HMMs
- Positional information not considered
  - HMM has no "memory"
  - Higher order models have some memory
  - But no explicit use of positional information
- Does not handle insertions or deletions
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur
Profile HMM
- Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
- In some ways, PHMM is easier than HMM
- In some ways, PHMM is more complex
- The basic idea of PHMM
  - Define multiple B matrices
  - Almost like having an HMM for each position in the sequence
PHMM
- In bioinformatics, begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like the training phase for an HMM
- Generate PHMM based on the given MSA
  - Easy, once the MSA is known
  - Hard part is generating the MSA
- Then can score sequences using the PHMM
  - Use forward algorithm, like HMM
Generic View of PHMM
- Circles are delete states
- Diamonds are insert states
- Rectangles are match states
  - Match states correspond to HMM states
- Arrows are possible transitions
  - Each transition has an associated probability
- Transition probabilities are the A matrix
- Emission probabilities are the B matrices
  - In PHMM, observations are emissions
  - Match and insert states have emissions
Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states
- Also, begin and end states
PHMM Notation
- Notation
PHMM
- Match state probabilities easily determined from the MSA, that is
  - aMi,Mi+1 transitions between match states
  - eMi(k) emission probability at match state
- Note: other transition probabilities
  - For example, aMi,Ii and aMi,Di+1
- Emissions at all match & insert states
  - Remember, emission == observation
MSA
- First we show MSA construction
  - This is the difficult part
  - Lots of ways to do this
  - "Best" way depends on the specific problem
- Then construct PHMM from the MSA
  - The easy part
  - Standard algorithm for this
- How to score a sequence?
  - Forward algorithm, similar to HMM
MSA
- How to construct an MSA?
  - Construct pairwise alignments
  - Combine pairwise alignments to obtain the MSA
- Allow gaps to be inserted
  - Makes better matches
  - But gaps tend to weaken scoring
  - So there is a tradeoff
Global vs Local Alignment
- In these pairwise alignment examples
  - "-" is a gap
  - "|" marks aligned symbols
  - "*" marks omitted beginning and ending symbols
Global vs Local Alignment
- Global alignment is lossless
  - But gaps tend to proliferate
  - And gaps increase when we do MSA
  - More gaps implies more sequences match
  - So, the result is less useful for scoring
- We usually only consider local alignment
  - That is, omit ends for better alignment
- For simplicity, we assume global alignment here
Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on an n x n substitution matrix S
  - Where n is the number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, HMM is used
  - Other?
- Local alignment --- more issues
Pairwise Alignment
- Example
- Note gaps vs misaligned elements
  - Depends on S and the gap penalty
Substitution Matrix
- Masquerade detection
  - Detect an imposter using an account
- Consider 4 different operations
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?
Substitution Matrix
- Consider 4 different operations: E, G, C, J
- Possible substitution matrix:
  - Diagonal --- matches
  - High positive scores
- Which others most similar?
  - J and C, so substituting C for J is a high score
  - Game playing/programming, very different
  - So substituting G for C is a negative score
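The idea can be sketched in code. The score values below are made-up placeholders (the actual matrix values from the slide are not reproduced here); they just encode the pattern described above: high scores on the diagonal, positive for the similar pair C/J, negative for dissimilar pairs.

```python
# Hypothetical substitution matrix for the four operations
# E (email), G (games), C (C programming), J (Java programming).
SYMBOLS = ["E", "G", "C", "J"]

S = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,
    ("C", "J"): 5,                    # both programming: high score
    ("E", "G"): -1,
    ("E", "C"): -2, ("E", "J"): -2,
    ("G", "C"): -4, ("G", "J"): -4,   # games vs programming: very different
}

def score(a, b):
    """Look up the substitution score; the matrix is symmetric."""
    return S.get((a, b), S.get((b, a)))

print(score("C", "J"))  # 5
print(score("J", "C"))  # 5 (symmetry)
print(score("G", "C"))  # -4
```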
Substitution Matrix
- Depending on the problem, it might be easy or very difficult to get a useful S matrix
  - Consider masquerade detection based on UNIX commands
  - Sometimes difficult to say how "close" 2 commands are
- Suppose aligning DNA sequences
  - Biological rationale for closeness of symbols
Gap Penalty
- Generally must allow gaps to be inserted
- But gaps make the alignment more generic
  - So, less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps?
- Linear gap penalty function
  - f(g) = dg (i.e., constant penalty d per gap)
- Affine gap penalty function
  - f(g) = a + e(g − 1)
  - Gap opening penalty a, then constant factor of e
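The two penalty functions are simple to code. The parameter values d, a, and e below are arbitrary illustrations, not values from the slides.

```python
def linear_gap_penalty(g, d=4):
    """Linear penalty: f(g) = d*g, a constant penalty d per gap symbol."""
    return d * g

def affine_gap_penalty(g, a=5, e=1):
    """Affine penalty: f(g) = a + e*(g - 1).
    Opening a gap costs a; each additional symbol in the same gap costs e."""
    return a + e * (g - 1)

# One gap of length 3 vs three separate gaps of length 1:
print(affine_gap_penalty(3))      # 7
print(3 * affine_gap_penalty(1))  # 15
```

Note that under the affine penalty one long gap is cheaper than several short gaps of the same total length, which is usually the behavior we want.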
Pairwise Alignment Algorithm
- We use dynamic programming
- Based on the S matrix and gap penalty function
- Notation:
Pairwise Alignment DP
- Initialization:
- Recursion:
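The initialization and recursion formulas do not survive in this text version, so here is a sketch of the standard global-alignment dynamic program with a linear gap penalty d (an affine penalty requires extra state per cell). The scoring function and penalty value are assumptions for illustration.

```python
def global_align_score(x, y, score, d=2):
    """Global alignment score via dynamic programming.
    F[i][j] = best score aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning against an empty prefix costs only gaps
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - d
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - d
    # Recursion: align symbols, or insert a gap in either sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # (mis)match
                F[i - 1][j] - d,                              # gap in y
                F[i][j - 1] - d,                              # gap in x
            )
    return F[n][m]

# Toy scoring: +1 for a match, -1 for a mismatch
s = lambda a, b: 1 if a == b else -1
print(global_align_score("CJC", "CJJC", s, d=2))  # 1
```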
MSA from Pairwise Alignments
- Given pairwise alignments…
- …how to construct an MSA?
- Generic approach is "progressive alignment"
  - Select one pairwise alignment
  - Select another and combine with the first
  - Continue to add more until all are combined
- Relatively easy (good)
- Gaps may proliferate, unstable (bad)
MSA from Pairwise Alignments
- Lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach
  - Not necessarily "best" or most popular
- Feng-Doolittle progressive alignment
  - Compute scores for all pairs of n sequences
  - Select n − 1 alignments that a) "connect" all sequences and b) maximize pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree
MSA Construction
- Create pairwise alignments
  - Generate substitution matrix
  - Dynamic programming for pairwise alignments
- Use pairwise alignments to make the MSA
  - Use pairwise alignments to construct a spanning tree (e.g., Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (from highest score, insert gaps as needed)
  - Note: gap penalty is used
MSA Example
- Suppose 10 sequences, with the following pairwise alignment scores:
MSA Example: Spanning Tree
- Spanning tree based on scores
- So process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
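The spanning-tree order can be generated with a Prim-style greedy sketch: repeatedly attach the unvisited sequence with the highest alignment score to any sequence already in the tree. The four-sequence score table below is a made-up illustration (the slide's actual ten-sequence scores are not reproduced here), so its output order is not the slide's.

```python
def prim_order(scores, start):
    """Grow a maximum spanning tree greedily, recording each (tree
    node, new node) pair in the order it is processed.
    scores[u][v] = pairwise alignment score between sequences u and v."""
    visited = [start]
    order = []
    remaining = set(scores) - {start}
    while remaining:
        # Pick the highest-scoring edge leaving the current tree
        u, v = max(
            ((u, v) for u in visited for v in remaining),
            key=lambda e: scores[e[0]][e[1]],
        )
        order.append((u, v))
        visited.append(v)
        remaining.remove(v)
    return order

# Hypothetical pairwise scores among four sequences:
scores = {
    1: {2: 10, 3: 2, 4: 1},
    2: {1: 10, 3: 8, 4: 3},
    3: {1: 2, 2: 8, 4: 9},
    4: {1: 1, 2: 3, 3: 9},
}
print(prim_order(scores, start=1))  # [(1, 2), (2, 3), (3, 4)]
```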
MSA Snapshot
- Intermediate step and final
- Use "+" for neutral symbol
  - Then "-" for gaps in the MSA
- Note increase in gaps
PHMM from MSA
- For PHMM, must determine match and insert states & probabilities from the MSA
- "Conservative" columns are match states
  - Half or less of the symbols are gaps
- Other columns are insert states
  - Majority of the symbols are gaps
- Delete states are a separate issue
PHMM States from MSA
- Consider a simpler MSA…
- Columns 1, 2, 6 are match states 1, 2, 3, respectively
  - Since less than half of the symbols are gaps
- Columns 3, 4, 5 are combined to form insert state 2
  - Since more than half of the symbols are gaps
  - Insert state 2 lies between match states 2 and 3
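The match/insert rule is a one-pass count over columns: compare the number of gaps to half the number of rows. The small MSA below is a made-up example shaped like the slide's (columns 1, 2, 6 mostly symbols; columns 3-5 mostly gaps).

```python
def classify_columns(msa):
    """Label each MSA column: "match" if half or fewer of its symbols
    are gaps, otherwise "insert" (adjacent insert columns are then
    merged into a single insert state)."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):
        gaps = col.count("-")
        labels.append("match" if gaps <= n_rows / 2 else "insert")
    return labels

msa = [
    "ACA-TT",
    "AC-G-T",
    "AG---G",
    "CG---T",
]
print(classify_columns(msa))
# ['match', 'match', 'insert', 'insert', 'insert', 'match']
```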
PHMM Probabilities from MSA
- Emission probabilities
  - Based on symbol distribution in match and insert states
- State transition probabilities
  - Based on transitions in the MSA
PHMM Probabilities from MSA
- Emission probabilities:
- But 0 probabilities are bad
  - Model "overfits" the data
  - So, use the "add one" rule
  - Add one to each numerator, add the total to the denominators
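The add-one rule for one match state can be sketched as follows: start every symbol count at one, and add the alphabet size to the denominator, so no emission probability is zero. The column and alphabet are illustrative.

```python
def emission_probs(column, alphabet):
    """Emission probabilities for one match state, with the add-one
    rule: add one to each symbol's count and the alphabet size to the
    denominator, so no symbol gets probability zero."""
    counts = {s: 1 for s in alphabet}   # "add one" to each numerator
    total = 0
    for symbol in column:
        if symbol != "-":               # gaps are not emissions
            counts[symbol] += 1
            total += 1
    total += len(alphabet)              # add the total to the denominator
    return {s: counts[s] / total for s in alphabet}

# Column with symbols A, A, C and one gap, over alphabet {A, C, G, T}:
probs = emission_probs("AAC-", "ACGT")
print(probs["A"])  # (2 + 1) / (3 + 4) = 3/7
print(probs["G"])  # (0 + 1) / (3 + 4) = 1/7
```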
PHMM Probabilities from MSA
- More emission probabilities:
- But 0 probabilities are bad
  - Model "overfits" the data
  - Again, use the "add one" rule
  - Add one to each numerator, add the total to the denominators
PHMM Probabilities from MSA
- Transition probabilities:
- We look at some examples
  - Note that "-" is the delete state
- First, consider the begin state:
- Again, use the add one rule
PHMM Probabilities from MSA
- Transition probabilities
- When there is no information in the MSA, set the probabilities to uniform
- For example, I1 does not appear in the MSA, so
  - aI1,M2 = aI1,I1 = aI1,D2 = 1/3
PHMM Probabilities from MSA
- Transition probabilities, another example
- What about transitions from state D1?
- Can only go to M2, so
  - aD1,M2 = 1 and aD1,I1 = aD1,D2 = 0
- Again, use the add one rule:
  - aD1,M2 = 2/4 and aD1,I1 = aD1,D2 = 1/4
PHMM Emission Probabilities
- Emission probabilities for the given MSA
- Using the add-one rule
PHMM Transition Probabilities
- Transition probabilities for the given MSA
- Using the add-one rule
PHMM Summary
- Construct pairwise alignments
  - Usually, use dynamic programming
- Use these to construct the MSA
  - Lots of ways to do this
- Using the MSA, determine probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
- Now what???
PHMM Scoring
- Want to score sequences to see how closely they match the PHMM
- How did we score sequences with HMM?
  - Forward algorithm
- How to score sequences with PHMM?
  - Forward algorithm
- But, the algorithm is a little more complex
  - Due to complex state transitions
Forward Algorithm
- Notation
  - Indices i and j are columns in the MSA
  - xi is the ith observation symbol
  - qxi is the distribution of xi in the "random model"
- Base case is F^M_0(0) = 0
- F^M_j(i) is the score of x1,…,xi up to state j (note that in PHMM, i and j may not agree)
- Some states undefined
  - Undefined states are ignored in the calculation
Forward Algorithm
- Compute P(X|λ) recursively
- For example, the match-state recursion (in the log-odds form of Durbin, et al) is
  F^M_j(i) = log(e_Mj(xi) / q_xi) + log[ a_Mj−1,Mj exp(F^M_j−1(i−1)) + a_Ij−1,Mj exp(F^I_j−1(i−1)) + a_Dj−1,Mj exp(F^D_j−1(i−1)) ]
- Note that F^M_j(i) depends on F^M_j−1(i−1), F^I_j−1(i−1), and F^D_j−1(i−1)
- And the corresponding state transition probabilities
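As a sketch, here is the plain-probability version of the PHMM forward recursion (the slides use a log-odds form; plain probabilities keep the code short, at the cost of underflow on long sequences). The state encoding, dict-based model representation, and the toy model in the example are all assumptions made for illustration.

```python
from collections import defaultdict

def phmm_forward(x, N, a, eM, eI):
    """Forward algorithm for a profile HMM with N match states.
    ("M", 0) is the begin state and ("M", N+1) the end state.
    a[(s, t)] is the transition probability from state s to state t
    (missing entries are treated as 0); eM[j][c] and eI[j][c] are the
    match and insert emission probabilities. Returns P(x | model)."""
    n = len(x)
    fM = defaultdict(float); fI = defaultdict(float); fD = defaultdict(float)
    fM[(0, 0)] = 1.0  # base case: in the begin state, nothing emitted yet
    for i in range(0, n + 1):
        for j in range(0, N + 1):
            if j >= 1 and i >= 1:   # match state Mj emits x[i-1]
                fM[(j, i)] = eM[j][x[i - 1]] * (
                    a.get((("M", j - 1), ("M", j)), 0) * fM[(j - 1, i - 1)]
                    + a.get((("I", j - 1), ("M", j)), 0) * fI[(j - 1, i - 1)]
                    + a.get((("D", j - 1), ("M", j)), 0) * fD[(j - 1, i - 1)])
            if j >= 1:              # delete state Dj emits nothing
                fD[(j, i)] = (
                    a.get((("M", j - 1), ("D", j)), 0) * fM[(j - 1, i)]
                    + a.get((("I", j - 1), ("D", j)), 0) * fI[(j - 1, i)]
                    + a.get((("D", j - 1), ("D", j)), 0) * fD[(j - 1, i)])
            if i >= 1:              # insert state Ij emits x[i-1]
                fI[(j, i)] = eI[j][x[i - 1]] * (
                    a.get((("M", j), ("I", j)), 0) * fM[(j, i - 1)]
                    + a.get((("I", j), ("I", j)), 0) * fI[(j, i - 1)]
                    + a.get((("D", j), ("I", j)), 0) * fD[(j, i - 1)])
    # Terminate: transition into the end state M_{N+1}
    return (a.get((("M", N), ("M", N + 1)), 0) * fM[(N, n)]
            + a.get((("I", N), ("M", N + 1)), 0) * fI[(N, n)]
            + a.get((("D", N), ("M", N + 1)), 0) * fD[(N, n)])

# Toy model: one match state, which may be followed by one insertion
a = {(("M", 0), ("M", 1)): 1.0,
     (("M", 1), ("M", 2)): 0.5, (("M", 1), ("I", 1)): 0.5,
     (("I", 1), ("M", 2)): 1.0}
eM = {1: {"A": 1.0}}
eI = {0: {"A": 1.0}, 1: {"A": 1.0}}
print(phmm_forward("A", 1, a, eM, eI))   # 0.5 (match, then end)
print(phmm_forward("AA", 1, a, eM, eI))  # 0.5 (match, insert, end)
```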
PHMM
- We will see examples of PHMM later
- In particular,
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands
References
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al
- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169