Introduction to Profile Hidden Markov Models
Mark Stamp

Hidden Markov Models
- Here, we assume you know about HMMs; if not, see "A revealing introduction to hidden Markov models"
- Executive summary of HMMs:
  - An HMM is a machine learning technique; also a discrete hill climb technique
  - Train a model based on an observation sequence
  - Score a given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications

HMM Notation
- Recall, an HMM model is denoted λ = (A, B, π)
- The observation sequence is O
- Notation: (table not shown)

Hidden Markov Models
- Among the many uses for HMMs:
  - Speech analysis
  - Music search engines
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time

Limitations of HMMs
- Positional information is not considered
  - An HMM has no "memory"; higher order models have some memory, but make no explicit use of positional information
- Insertions and deletions are not handled
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical, and insertions and deletions occur

Profile HMM
- A profile HMM (PHMM) is designed to overcome the limitations above
  - In some ways, a PHMM is easier than an HMM; in some ways, more complex
- The basic idea of a PHMM: define multiple B matrices
  - Almost like having an HMM for each position in the sequence

PHMM
- In bioinformatics, we begin by aligning multiple related sequences
  - This multiple sequence alignment (MSA) is like the training phase for an HMM
- Generating the PHMM is easy once the MSA is known; the hard part is generating the MSA
- Then we can score sequences using the PHMM forward algorithm, as with an HMM

Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states
  - Match states correspond to HMM states
- Arrows are the possible transitions; each transition has an associated probability
  - Transition probabilities form the A matrix
  - Emission probabilities form the B matrices
- In a PHMM, observations are emissions; match and insert states
have emissions

Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states
- There are also begin and end states
- (diagram not shown)

PHMM Notation
- Notation: (table not shown)

PHMM
- Match state probabilities are easily determined from the MSA; that is:
  - a_{M_i, M_{i+1}}: transitions between match states
  - e_{M_i}(k): emission probability at match state M_i
- Note: there are other transition probabilities too, for example a_{M_i, I_i} and a_{M_i, D_{i+1}}
- Emissions occur at all match and insert states
  - Remember, emission == observation

MSA
- First we show MSA construction
  - This is the difficult part: there are lots of ways to do it, and the "best" way depends on the specific problem
- Then we construct the PHMM from the MSA
  - This is the easy part: there is a standard algorithm for it
- How to score a sequence? The forward algorithm, similar to an HMM

MSA
- How to construct an MSA?
  - Construct pairwise alignments
  - Combine the pairwise alignments to obtain the MSA
- We allow gaps to be inserted
  - Gaps make for better matches, but they tend to weaken scoring, so there is a tradeoff

Global vs Local Alignment
- In these pairwise alignment examples:
  - "-" is a gap
  - "|" marks aligned symbols
  - "*" marks omitted beginning and ending symbols
- (examples not shown)

Global vs Local Alignment
- Global alignment is lossless, but gaps tend to proliferate
  - And gaps increase further when we do the MSA
- More gaps implies more sequences match, so the result is less useful for scoring
- We usually only consider local alignment; that is, we omit the ends for a better alignment
- For simplicity, we assume global alignment here

Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on an n x n substitution matrix S, where n is the number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Other?
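To make the substitution-matrix scoring concrete, here is a minimal sketch of scoring an already-aligned pair of sequences with a substitution matrix S and a linear per-gap penalty. The alphabet, matrix values, and penalty below are toy assumptions for illustration, not values from these slides:

```python
# Score an aligned pair of sequences using a substitution matrix S
# and a linear per-gap penalty d. Matrix values are toy numbers.

def alignment_score(x, y, S, d):
    """x and y are equal-length aligned strings; '-' marks a gap."""
    assert len(x) == len(y)
    score = 0
    for a, b in zip(x, y):
        if a == '-' or b == '-':
            score -= d          # linear gap penalty: d per gap symbol
        else:
            score += S[(a, b)]  # substitution matrix lookup
    return score

# Toy substitution matrix over {A, C, G, T}: +3 for a match, -1 otherwise
alphabet = "ACGT"
S = {(a, b): (3 if a == b else -1) for a in alphabet for b in alphabet}

print(alignment_score("AG-C", "A-TC", S, d=2))  # 3 - 2 - 2 + 3 = 2
```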
- Local alignment raises more issues

Pairwise Alignment Example
- Note gaps vs misaligned elements; the alignment depends on S and the gap penalty
- (example not shown)

Substitution Matrix
- Consider masquerade detection: detecting an imposter who is using another user's account
- Suppose there are 4 different operations:
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?

Substitution Matrix
- Consider the 4 operations E, G, C, J
- A possible substitution matrix: (not shown)
  - The diagonal entries are matches
- Which others are most similar?
  - J and C, so substituting C for J gets a high positive score
  - Game playing and programming are very different, so substituting G for C gets a negative score

Substitution Matrix
- Depending on the problem, it might be easy or very difficult to obtain a useful S matrix
  - Consider masquerade detection based on UNIX commands: it is sometimes difficult to say how "close" two commands are
  - Suppose instead we are aligning DNA sequences: there is a biological rationale for the closeness of symbols

Gap Penalty
- Generally we must allow gaps to be inserted
  - But gaps make the alignment more generic, and hence less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps?
  - Linear gap penalty function: f(g) = dg (i.e., a constant penalty d per gap)
  - Affine gap penalty function: f(g) = a + e(g - 1), with a gap-opening penalty a, then a constant factor of e per additional gap

Pairwise Alignment Algorithm
- We use dynamic programming, based on the S matrix and the gap penalty function
- Notation: (not shown)

Pairwise Alignment DP
- Initialization: (not shown)
- Recursion: (not shown)

MSA from Pairwise Alignments
- Given pairwise alignments, how do we construct the MSA?
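The initialization and recursion formulas from the Pairwise Alignment DP slide are not reproduced here, so as a stand-in, this is a sketch of the standard Needleman-Wunsch global alignment recursion with a linear gap penalty f(g) = dg, consistent with the setup above. The scoring function and penalty values are illustrative assumptions:

```python
# Needleman-Wunsch global alignment score via dynamic programming,
# with substitution function s and linear gap penalty d per gap.
# A textbook sketch, not the exact notation from these slides.

def nw_score(x, y, s, d):
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: all-gap prefixes
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):          # recursion
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align
                          F[i - 1][j] - d,                          # gap in y
                          F[i][j - 1] - d)                          # gap in x
    return F[n][m]

# Toy scoring: +3 for a match, -1 for a mismatch, gap penalty d = 2
s = lambda a, b: 3 if a == b else -1
print(nw_score("GATT", "GCT", s, 2))  # 3
```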
- The generic approach is "progressive alignment":
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue to add more until all are combined
- Relatively easy (good), but gaps may proliferate and the result can be unstable (bad)

MSA from Pairwise Alignments
- There are lots of ways to improve on generic progressive alignment; here we mention one such approach, not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment:
  - Compute scores for all pairs of the n sequences
  - Select n - 1 alignments that (a) "connect" all sequences and (b) maximize the pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
- Create pairwise alignments
  - Generate a substitution matrix
  - Use dynamic programming for the pairwise alignments
- Use the pairwise alignments to make the MSA
  - Use the pairwise alignment scores to construct a spanning tree (e.g., Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (starting from the highest score, inserting gaps as needed)
  - Note: the gap penalty is used here

MSA Example
- Suppose we have 10 sequences, with the following pairwise alignment scores: (table not shown)

MSA Example: Spanning Tree
- Spanning tree based on the scores (not shown)
- So we process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
- Intermediate step and final result (not shown)
  - We use "+" for a neutral symbol, then "-" for gaps in the MSA
- Note the increase in gaps

PHMM from MSA
- For the PHMM, we must determine the match and insert states and their probabilities from the MSA
- "Conservative" columns become match states
  - Half or less of the symbols are gaps
- Other columns become insert states
  - The majority of the symbols are gaps
- Delete states are a separate issue

PHMM States from MSA
- Consider a simpler MSA (not shown)
- Columns 1, 2, 6 are match states 1, 2, 3, respectively, since less than half of their symbols are gaps
- Columns 3, 4, 5 are combined to form insert state 2, since more than half of their symbols are gaps
  - The insert state lies between match states

PHMM Probabilities from MSA
- Emission probabilities are based on the symbol distribution in the match and insert states
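The rules above (conservative columns become match states, and emissions follow the symbol distribution, smoothed with the add-one rule used later in these slides) can be sketched as follows. The MSA and alphabet are hypothetical, chosen so that columns 3 through 5 collapse into an insert state; delete states and transition probabilities are omitted from this sketch:

```python
# From an MSA, determine match/insert columns and match-state emission
# probabilities with the "add one" rule. Simplified sketch: delete
# states and transition probabilities are not handled here.

def classify_columns(msa):
    """Match ('M') if half or less of a column's symbols are gaps."""
    n_seqs = len(msa)
    return ['M' if sum(s[j] == '-' for s in msa) <= n_seqs / 2 else 'I'
            for j in range(len(msa[0]))]

def match_emissions(msa, alphabet):
    """Add-one emission probabilities for each match-state column."""
    probs = []
    for j, state in enumerate(classify_columns(msa)):
        if state != 'M':
            continue
        column = [s[j] for s in msa if s[j] != '-']
        total = len(column) + len(alphabet)  # add one count per symbol
        probs.append({a: (column.count(a) + 1) / total for a in alphabet})
    return probs

# Hypothetical 4-sequence MSA: columns 1, 2, 6 are match states,
# while columns 3-5 (mostly gaps) merge into a single insert state
msa = ["AG---T", "ACA--T", "-G-A-T", "AG--CT"]
print(classify_columns(msa))            # ['M', 'M', 'I', 'I', 'I', 'M']
print(match_emissions(msa, "ACGT")[0])  # e.g. 'A': (3 + 1)/(3 + 4) = 4/7
```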
- State transition probabilities are based on the transitions in the MSA

PHMM Probabilities from MSA
- Emission probabilities: (example not shown)
- But 0 probabilities are bad: the model "overfits" the data
- So, use the "add one" rule: add one to each numerator, and add the total added to each denominator

PHMM Probabilities from MSA
- More emission probabilities: (example not shown)
- Again, 0 probabilities are bad (the model "overfits" the data), so use the add-one rule

PHMM Probabilities from MSA
- Transition probabilities: we look at some examples
- Note that "-" corresponds to a delete state
- First, consider the begin state: (example not shown)
- Again, use the add-one rule

PHMM Probabilities from MSA
- Transition probabilities: when there is no information in the MSA, set the probabilities to uniform
- For example, I_1 does not appear in the MSA, so its transition probabilities are set to uniform (not shown)

PHMM Probabilities from MSA
- Transition probabilities, another example: what about transitions from state D_1?
- It can only go to M_2, so (not shown)
- Again, use the add-one rule

PHMM Emission Probabilities
- Emission probabilities for the given MSA, using the add-one rule (table not shown)

PHMM Transition Probabilities
- Transition probabilities for the given MSA, using the add-one rule (table not shown)

PHMM Summary
- Construct pairwise alignments, usually using dynamic programming
- Use these to construct the MSA; there are lots of ways to do this
- Using the MSA, determine the probabilities:
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
- Now what?

PHMM Scoring
- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with an HMM? The forward algorithm
- How to score sequences with a PHMM?
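As a reminder of the HMM case, the standard forward algorithm can be sketched as below; the PHMM version follows the same dynamic-programming pattern, but over match, insert, and delete states. The two-state model at the bottom is a made-up example:

```python
# Standard HMM forward algorithm: computes P(O | lambda) for the model
# lambda = (A, B, pi) and observation sequence O.

def hmm_forward(A, B, pi, obs):
    """A[i][j]: state transitions, B[i][k]: emissions, pi[i]: initial."""
    N = len(pi)
    # alpha[i] = P(observations so far, ending in state i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

# Made-up two-state, two-symbol model
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(hmm_forward(A, B, pi, [0, 1]))  # ~0.209
```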
- The forward algorithm again; but the PHMM forward algorithm is a little more complex, due to the more complex state transitions

Forward Algorithm
- Notation:
  - Indices i and j are columns in the MSA
  - x_i is the i-th observation symbol
  - q_{x_i} is the distribution of x_i in the "random model"
- The base case is (not shown)
- F_j(i) is the score of x_1, ..., x_i up to state j (note that in a PHMM, i and j may not agree)
- Some states are undefined; undefined states are simply ignored in the calculation

Forward Algorithm
- Compute P(X | λ) recursively
- Note that F_j(i) depends on F_{j-1}(i-1), F_j(i-1), and F_{j-1}(i), along with the corresponding state transition probabilities
- (recursion not shown)

PHMM
- We will see examples of PHMMs later; in particular:
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands

References
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al.
- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee, and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169