Profile Hidden Markov Models
Mark Stamp
PHMM
1
Hidden Markov Models

• Here, we assume you know about HMMs
o If not, see “A revealing introduction to hidden Markov models”

Executive summary of HMMs
• HMM is a machine learning technique
o Also, a discrete hill climb technique
o Train model based on observation sequence
o Score a given sequence to see how closely it matches the model
o Efficient algorithms, many useful applications
HMM Notation
• Recall, HMM model denoted λ = (A,B,π)
• Observation sequence is O
• Notation:
o T = length of observation sequence
o N = number of states in the model
o M = number of observation symbols
o A = state transition probability matrix
o B = observation probability matrix
o π = initial state distribution
Hidden Markov Models
• Among the many uses for HMMs…
• Speech analysis
• Music search engine
• Malware detection
• Intrusion detection systems (IDS)
• Many more, and more all the time
Limitations of HMMs

• Positional information not considered
o HMM has no “memory”
o Higher order models have some memory
o But no explicit use of positional information
• Does not handle insertions or deletions
• These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions occur
Profile HMM
• Profile HMM (PHMM) designed to overcome limitations on previous slide
o In some ways, PHMM easier than HMM
o In some ways, PHMM more complex
• The basic idea of PHMM
o Define multiple B matrices
o Almost like having an HMM for each position in the sequence
PHMM
• In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o This is like the training phase for an HMM
• Generate PHMM based on given MSA
o Easy, once MSA is known
o Hard part is generating the MSA
• Then can score sequences using PHMM
o Use forward algorithm, like HMM
Training: PHMM vs HMM
• Training PHMM
o Determine MSA → nontrivial
o Determine PHMM matrices → trivial
• Training HMM
o Append training sequences → trivial
o Determine HMM matrices → nontrivial
• These are opposites…
o In some sense
Generic View of PHMM
• Have delete, insert, and match states
o Match states correspond to HMM states
• Arrows are possible transitions
o Each transition has a probability
• Transition probabilities are the A matrix
• Emission probabilities are the B matrices
o In PHMM, observations are emissions
o Match and insert states have emissions
Generic View of PHMM
• Circles are delete states, diamonds are insert states, squares are match states
• Also, begin and end states
PHMM Notation
• Notation:
o X = (x1,x2,…,xn) is the observation (emission) sequence
o Mi, Ii, Di denote match, insert, and delete state i
o aMi,Mi+1 and similar denote state transition probabilities
o eMi(k) is the emission probability of symbol k at match state Mi
PHMM
• Match state probabilities easily determined from MSA
o aMi,Mi+1 transitions between match states
o eMi(k) emission probability at match state Mi
• Many other transition probabilities
o For example, aMi,Ii and aMi,Di+1
• Emissions at all match & insert states
o Remember, emission == observation
Multiple Sequence Alignment

• First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o “Best” way depends on specific problem
• Then construct PHMM from MSA
o This is the easy part
o Standard algorithm for this
• How to score a sequence?
o Forward algorithm, similar to HMM
MSA
• How to construct MSA?
o Construct pairwise alignments
o Combine pairwise alignments for MSA
• Allow gaps to be inserted
o To make better matches
• Gaps tend to weaken PHMM scoring
o A tradeoff between gaps and scoring
Global vs Local Alignment

• In these pairwise alignment examples
o “-” is a gap
o “|” means elements aligned
o “*” for omitted beginning/ending symbols
Global vs Local Alignment
• Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do MSA
o More gaps, more random sequences match…
o …and result is less useful for scoring
• We usually only consider local alignment
o That is, omit ends for better alignment
• For simplicity, assume global alignment in examples presented here
Pairwise Alignment
• Allow gaps when aligning
• How to score an alignment?
o Based on n x n substitution matrix S
o Where n is number of symbols
• What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, HMM is used
o Other?
• Local alignment → creates more issues
Pairwise Alignment
• Example
• Tradeoff of gaps vs misaligned elements
o Depends on matrix S and gap penalty
Substitution Matrix
• Masquerade detection
o Detect imposter using an account
• Consider 4 different operations
o E == send email
o G == play games
o C == C programming
o J == Java programming
• How similar are these to each other?
Substitution Matrix

• Consider 4 different operations:
o E, G, C, J
• Possible substitution matrix:
• Diagonal → matches
o High positive scores
• Which others most similar?
o J and C, so substituting C for J is a high score
• Game playing/programming, very different
o So substituting G for C is a negative score
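The matrix values themselves are on a slide image not reproduced here. Below is a minimal sketch of what such a matrix might look like, with hypothetical scores chosen to match the bullets above (high diagonal, C and J similar, G dissimilar to the programming operations); the scoring function and the gap penalty value are also assumptions, not values from the slides.

```python
# Hypothetical substitution matrix S for the operations E, G, C, J:
# high positive diagonal, C/J mildly similar, G vs programming negative.
S = {
    "E": {"E": 9, "G": -2, "C": -3, "J": -3},
    "G": {"E": -2, "G": 9, "C": -4, "J": -4},
    "C": {"E": -3, "G": -4, "C": 9, "J": 5},
    "J": {"E": -3, "G": -4, "C": 5, "J": 9},
}

def alignment_score(x, y, S, gap_penalty=4):
    """Score two already-aligned, equal-length sequences, where '-' is a gap."""
    score = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            score -= gap_penalty   # linear penalty for each gap position
        else:
            score += S[a][b]       # substitution matrix lookup
    return score
```

For example, scoring the aligned pair `CJ-E` vs `CCGE` adds 9 + 5 + 9 for the three aligned columns and subtracts 4 for the gap.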
Substitution Matrix
• Depending on problem, might be easy or very difficult to find useful S matrix
• Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how “close” 2 commands are
• Suppose aligning DNA sequences
o Biological rationale for closeness of symbols
Gap Penalty
• Generally must allow gaps to be inserted
• But gaps make alignment more generic
o Less useful for scoring, so we penalize gaps
• How to penalize gaps?
• Linear gap penalty function:
o g(x) = ax (constant penalty a for every gap position)
• Affine gap penalty function:
o g(x) = a + b(x - 1)
o Gap opening penalty a, and constant penalty b for each extension of existing gap
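The two penalty functions above can be sketched directly; the parameter defaults a = 4, b = 1 are hypothetical, not values from the slides.

```python
def linear_gap(x, a=4):
    """Linear gap penalty: g(x) = a*x, same cost for every gap position."""
    return a * x

def affine_gap(x, a=4, b=1):
    """Affine gap penalty: g(x) = a + b*(x - 1) for a gap of length x >= 1.
    Opening a gap costs a; each extension of an existing gap costs b."""
    return a + b * (x - 1)
```

With b < a, the affine penalty charges long gaps less than the linear penalty does, which favors a few long gaps over many scattered ones.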
Pairwise Alignment Algorithm
• We use dynamic programming
o Based on S matrix, gap penalty function
• Notation:
Pairwise Alignment DP

• Initialization: F(0,0) = 0, F(i,0) = F(i-1,0) - g, F(0,j) = F(0,j-1) - g
• Recursion:
o F(i,j) = max{ F(i-1,j-1) + S(xi,yj), F(i-1,j) - g, F(i,j-1) - g }
o where F(i,j) is the score of the best alignment of the first i symbols of x with the first j symbols of y, and g is the gap penalty
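A minimal sketch of the standard global-alignment dynamic program (Needleman-Wunsch style), assuming a substitution matrix given as a nested dict and a linear gap penalty g per gap position; it computes only the optimal score, not the alignment itself.

```python
def global_align_score(x, y, S, g=4):
    """Dynamic program for the best global alignment score of x and y
    under substitution matrix S and linear gap penalty g."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):   # initialization: prefix of x vs all gaps
        F[i][0] = F[i-1][0] - g
    for j in range(1, n + 1):   # initialization: prefix of y vs all gaps
        F[0][j] = F[0][j-1] - g
    for i in range(1, m + 1):   # recursion
        for j in range(1, n + 1):
            F[i][j] = max(F[i-1][j-1] + S[x[i-1]][y[j-1]],  # align symbols
                          F[i-1][j] - g,                    # gap in y
                          F[i][j-1] - g)                    # gap in x
    return F[m][n]

# Tiny hypothetical two-symbol substitution matrix: match +2, mismatch -1
S2 = {"A": {"A": 2, "B": -1}, "B": {"A": -1, "B": 2}}
```

For instance, aligning "AB" against "AB" scores 4 (two matches), while "A" against "AB" must pay for one gap.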
MSA from Pairwise Alignments
• Given pairwise alignments…
• How to construct MSA?
• Generally use “progressive alignment”
o Select one pairwise alignment
o Select another and combine with first
o Continue to add more until all are combined
• Relatively easy (good)
• Gaps proliferate, and it’s unstable (bad)
MSA from Pairwise Alignments

• Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily “best” or most popular
• Feng-Doolittle progressive alignment
o Compute scores for all pairs of n sequences
o Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
o Then generate a minimum spanning tree
o For MSA, add sequences in the order that they appear in the spanning tree
MSA Construction

• Create pairwise alignments
o Generate substitution matrix
o Dynamic program for pairwise alignments
• Use pairwise alignments to make MSA
o Use pairwise alignments to construct spanning tree (e.g., Prim’s algorithm)
o Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)
o Note: gap penalty is used
MSA Example

Suppose 10 sequences, with the following
pairwise alignment scores
PHMM
28
MSA Example: Spanning Tree
• Spanning tree based on scores
• So process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
MSA Snapshot
• Intermediate step and final MSA
o Use “+” for neutral symbol
o Then “-” for gaps in MSA
• Note increase in gaps
PHMM from MSA
• In PHMM, determine match and insert states & probabilities from MSA
• “Conservative” columns → match states
o Half or less of symbols are gaps
• Other columns are insert states
o Majority of symbols are gaps
• Delete states are a separate issue
PHMM States from MSA
• Consider a simpler MSA…
• Columns 1,2,6 are match states 1,2,3, respectively
o Since less than half gaps
• Columns 3,4,5 are combined to form insert state 2
o Since more than half gaps
o Insert state 2 lies between match states 2 and 3
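The half-or-less rule above can be sketched as a column classifier; the MSA is assumed to be a list of equal-length rows, with "-" for gaps. Consecutive insert columns would then be merged into a single insert state between the surrounding match states.

```python
def classify_columns(msa):
    """Label each MSA column: 'M' (match) if half or less of its symbols
    are gaps, otherwise 'I' (insert). msa is a list of equal-length rows."""
    nrows = len(msa)
    labels = []
    for j in range(len(msa[0])):
        gaps = sum(1 for row in msa if row[j] == "-")
        labels.append("M" if gaps <= nrows / 2 else "I")
    return labels
```

On a hypothetical 4-row, 6-column MSA where columns 3, 4, 5 are mostly gaps, this yields M, M, I, I, I, M, matching the pattern on the slide.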
Probabilities from MSA
• Emission probabilities
o Based on symbol distribution in match and insert states
• State transition probs
o Based on transitions in the MSA
Probabilities from MSA

• Emission probabilities:
• But 0 probabilities are bad
o Model “overfits” the data
o So, use “add one” rule
o Add one to each numerator, and add the total number of symbols to each denominator
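The add-one rule for emissions can be sketched as follows; gaps are skipped since gaps are not emissions, and the function name and layout are my own.

```python
def emission_probs(column_symbols, alphabet):
    """Add-one rule: add 1 to each symbol count, and add the alphabet size
    to the denominator, so no emission probability is ever zero."""
    counts = {s: 0 for s in alphabet}
    for sym in column_symbols:
        if sym != "-":          # gaps are not emissions
            counts[sym] += 1
    total = sum(counts.values())
    return {s: (counts[s] + 1) / (total + len(alphabet)) for s in alphabet}
```

For example, a match column containing C, C, J over the 4-symbol alphabet {E, G, C, J} gives C a probability of (2+1)/(3+4) = 3/7, and every unseen symbol 1/7.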
Probabilities from MSA

• More emission probabilities:
• But 0 probabilities still bad
o Model “overfits” the data
o Again, use “add one” rule
o Add one to each numerator, and add the total number of symbols to each denominator
Probabilities from MSA

• Transition probabilities:
• We look at some examples
o Note that “-” is the delete state
• First, consider the begin state:
• Again, use add one rule
Probabilities from MSA
• Transition probabilities
• When no information in MSA, set probs to uniform
• For example, I1 does not appear in MSA, so
o aI1,M2 = aI1,I1 = aI1,D2 = 1/3
Probabilities from MSA
• Transition probabilities, another example
• What about transitions from state D1?
• Can only go to M2, so aD1,M2 = 1
• Again, use add one rule:
o aD1,M2 = 2/4, aD1,I1 = 1/4, aD1,D2 = 1/4
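The same add-one rule applies to transitions. With the D1 example above, the single observed transition to M2 and three possible successors give (1+1)/(1+3) = 2/4 for M2 and 1/4 for each of the others, while all-zero counts (the "no information" case) give the uniform 1/3.

```python
def transition_probs(counts):
    """Add-one rule for transitions out of one state.
    counts maps each possible successor state to its observed count
    in the MSA; the denominator grows by the number of successors."""
    total = sum(counts.values())
    k = len(counts)
    return {s: (c + 1) / (total + k) for s, c in counts.items()}
```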
PHMM Emission Probabilities

• Emission probabilities for the given MSA
o Using add-one rule
PHMM Transition Probabilities

• Transition probabilities for the given MSA
o Using add-one rule
PHMM Summary

• Construct pairwise alignments
o Usually, use dynamic programming
• Use these to construct MSA
o Lots of ways to do this
• Using MSA, determine probabilities
o Emission probabilities
o State transition probabilities
• Then we have trained a PHMM
o Now what???
PHMM Scoring
• Want to score sequences to see how closely they match the PHMM
• How did we score using HMM?
o Forward algorithm
• How to score sequences with PHMM?
o Forward algorithm (surprised?)
• But, algorithm is a little more complex
o Due to complex state transitions
Forward Algorithm

• Notation
o Indices i and j are columns in MSA
o xi is the ith observation symbol
o qxi is the distribution of xi in the “random model”
o FMj(i) is the score of x1,…,xi up to state j (note that in PHMM, i and j may not agree)
o Base case is FM0(0) = 0
o Some states undefined
o Undefined states ignored in calculation
Forward Algorithm


• Compute P(X|λ) recursively
• Note that FMj(i) depends on FMj-1(i-1), FIj-1(i-1), and FDj-1(i-1)
o And corresponding state transition probs
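A minimal sketch of the PHMM forward algorithm, written in plain probability space (computing P(X|λ) directly) rather than the log-odds form relative to the random model used on these slides; the state naming convention ("M0" for the begin state, "E" for the end state) and the data layout are my own assumptions.

```python
def phmm_forward(x, N, eM, eI, a):
    """Forward algorithm for a profile HMM, in plain probability space.

    x  : observation sequence (string or list of symbols)
    N  : number of match states
    eM : eM[j][sym] = emission prob at match state j (index 1..N)
    eI : eI[j][sym] = emission prob at insert state j (index 0..N)
    a  : a[(s, t)] = transition prob from state s to state t; states are
         named "M0" (begin), "M1".."MN", "I0".."IN", "D1".."DN", "E" (end);
         missing entries are treated as probability 0
    Returns P(x | model).
    """
    def t(s, u):
        return a.get((s, u), 0.0)

    n = len(x)
    # f*[j][i] = prob of having emitted x[0..i-1] and being in state *j
    fM = [[0.0] * (n + 1) for _ in range(N + 1)]
    fI = [[0.0] * (n + 1) for _ in range(N + 1)]
    fD = [[0.0] * (n + 1) for _ in range(N + 1)]
    fM[0][0] = 1.0  # begin state, nothing emitted yet
    for i in range(n + 1):
        for j in range(N + 1):
            if j >= 1:
                if i >= 1:  # match state j emits x[i-1]
                    fM[j][i] = eM[j][x[i-1]] * (
                        fM[j-1][i-1] * t(f"M{j-1}", f"M{j}")
                        + fI[j-1][i-1] * t(f"I{j-1}", f"M{j}")
                        + fD[j-1][i-1] * t(f"D{j-1}", f"M{j}"))
                # delete state j is silent: same i, previous column j-1
                fD[j][i] = (fM[j-1][i] * t(f"M{j-1}", f"D{j}")
                            + fI[j-1][i] * t(f"I{j-1}", f"D{j}")
                            + fD[j-1][i] * t(f"D{j-1}", f"D{j}"))
            if i >= 1:  # insert state j emits x[i-1], stays at column j
                fI[j][i] = eI[j][x[i-1]] * (
                    fM[j][i-1] * t(f"M{j}", f"I{j}")
                    + fI[j][i-1] * t(f"I{j}", f"I{j}")
                    + fD[j][i-1] * t(f"D{j}", f"I{j}"))
    # end state: sum over the ways to finish after emitting all of x
    return (fM[N][n] * t(f"M{N}", "E")
            + fI[N][n] * t(f"I{N}", "E")
            + fD[N][n] * t(f"D{N}", "E"))
```

Note how the recursion mirrors the dependency on the slide: the match value at (j, i) combines the match, insert, and delete values at (j-1, i-1), weighted by the corresponding transition probabilities.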
PHMM
• We will see examples of PHMM later
• In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands
References



• Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
• L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
• S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009