Introduction to Profile Hidden Markov Models
Mark Stamp

Hidden Markov Models
• Here, we assume you know about HMMs
• If not, see "A revealing introduction to hidden Markov models"
• Executive summary of HMMs:
  • HMM is a machine learning technique
  • Also a discrete hill climb technique
  • Train a model based on an observation sequence
  • Score a given sequence to see how closely it matches the model
  • Efficient algorithms, many useful applications

HMM Notation
• Recall, an HMM is denoted λ = (A, B, π)
  • A is the state transition probability matrix
  • B is the observation probability matrix
  • π is the initial state distribution
• The observation sequence is O

Hidden Markov Models
• Among the many uses for HMMs...
  • Speech analysis
  • Music search engines
  • Malware detection
  • Intrusion detection systems (IDS)
• Many more, and more all the time

Limitations of HMMs
• Positional information is not considered
  • An HMM has no "memory"
  • Higher order models have some memory
  • But no explicit use of positional information
• Does not handle insertions or deletions
• These limitations are serious problems in some applications
  • In bioinformatics string comparison, sequence alignment is critical
  • Also, insertions and deletions occur

Profile HMM
• Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
  • In some ways, a PHMM is easier than an HMM
  • In some ways, a PHMM is more complex
• The basic idea of the PHMM:
  • Define multiple B matrices
  • Almost like having an HMM for each position in the sequence

PHMM
• In bioinformatics, begin by aligning multiple related sequences
  • Multiple sequence alignment (MSA)
  • This is like the training phase for an HMM
• Generate the PHMM based on a given MSA
  • Easy, once the MSA is known
  • The hard part is generating the MSA
• Then we can score sequences using the PHMM
  • Use the forward algorithm, like an HMM

Generic View of PHMM
• Circles are delete states
• Diamonds are insert states
• Rectangles are match states
  • Match states correspond to HMM states
• Arrows are possible transitions
  • Each transition has an associated probability
• Transition probabilities form the A matrix
• Emission probabilities form the B matrices
  • In a PHMM, observations are emissions
  • Match and insert states have emissions
• There are also begin and end states

PHMM Notation
• Notation used in what follows:
  • Mi, Ii, and Di are the match, insert, and delete states at position i
  • aMi,Mi+1 and the like are state transition probabilities
  • eMi(k) is the probability of emitting symbol k at match state Mi
  • X = (x1, x2, ..., xn) is an observation (emission) sequence

PHMM
• Match state probabilities are easily determined from the MSA
  • aMi,Mi+1, that is, transitions between match states
  • eMi(k), the emission probability at match state Mi
• Note: there are other transition probabilities
  • For example, aMi,Ii and aMi,Di+1
• Emissions occur at all match and insert states
  • Remember, emission == observation

MSA
• First, we show MSA construction
  • This is the difficult part
  • Lots of ways to do this
  • The "best" way depends on the specific problem
• Then we construct the PHMM from the MSA
  • The easy part
  • Standard algorithm for this
• How to score a sequence?
  • Forward algorithm, similar to HMM
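To ground the executive summary above, here is a minimal sketch of HMM scoring via the forward algorithm. The two-state, two-symbol model and all of its numbers are invented for illustration; nothing here comes from the slides.

```python
import numpy as np

def hmm_forward_score(A, B, pi, obs):
    """Score an observation sequence against an HMM lambda = (A, B, pi).

    A[i][j] = P(state j at t+1 | state i at t)
    B[i][k] = P(observing symbol k | state i)
    pi[i]   = P(starting in state i)
    Returns P(obs | lambda) via the forward algorithm.
    """
    alpha = pi * B[:, obs[0]]          # base case: alpha_0(i) = pi_i * b_i(O_0)
    for o in obs[1:]:                  # recursion over the sequence
        alpha = (alpha @ A) * B[:, o]  # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(O_t)
    return alpha.sum()                 # P(O | lambda) = sum_i alpha_{T-1}(i)

# Toy two-state, two-symbol model (all numbers made up for illustration)
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(hmm_forward_score(A, B, pi, [0, 1, 1, 0]))  # P(O | lambda)
```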
MSA
• How to construct the MSA?
  • Construct pairwise alignments
  • Combine pairwise alignments to obtain the MSA
• Allow gaps to be inserted
  • Makes for better matches
  • But gaps tend to weaken scoring
  • So there is a tradeoff

Global vs Local Alignment
• In these pairwise alignment examples:
  • "-" is a gap
  • "|" marks aligned symbols
  • "*" marks omitted beginning and ending symbols

Global vs Local Alignment
• Global alignment is lossless
  • But gaps tend to proliferate
  • And gaps increase when we do the MSA
  • More gaps implies more sequences match
  • So, the result is less useful for scoring
• We usually only consider local alignment
  • That is, omit the ends for a better alignment
• For simplicity, we assume global alignment here

Pairwise Alignment
• We allow gaps when aligning
• How to score an alignment?
  • Based on an n x n substitution matrix S
  • Where n is the number of symbols
• What algorithm(s) to align sequences?
  • Usually, dynamic programming
  • Sometimes, an HMM is used
  • Other?
• Local alignment raises more issues

Pairwise Alignment
• Example
• Note gaps vs misaligned elements
  • Depends on S and the gap penalty

Substitution Matrix
• Consider masquerade detection
  • Detect an imposter using an account
• Consider 4 different operations:
  • E == send email
  • G == play games
  • C == C programming
  • J == Java programming
• How similar are these to each other?

Substitution Matrix
• Consider the 4 operations: E, G, C, J
• Possible substitution matrix:
  • Diagonal: matches, high positive scores
• Which others are most similar?
  • J and C, so substituting C for J gets a high score
  • Game playing and programming are very different
  • So substituting G for C gets a negative score

Substitution Matrix
• Depending on the problem, it might be easy or very difficult to get a useful S matrix
• Consider masquerade detection based on UNIX commands
  • Sometimes difficult to say how "close" 2 commands are
• Suppose we are aligning DNA sequences
  • Biological rationale for closeness of symbols

Gap Penalty
• Generally, we must allow gaps to be inserted
  • But gaps make alignments more generic
  • So, less useful for scoring
  • Therefore, we penalize gaps
• How to penalize gaps?
• Linear gap penalty function: f(g) = dg
  • That is, a constant penalty d per gap
• Affine gap penalty function: f(g) = a + e(g - 1)
  • Gap opening penalty a, then a constant penalty e for each additional gap

Pairwise Alignment Algorithm
• We use dynamic programming
  • Based on the S matrix and the gap penalty function
• Notation: let F(i, j) be the score of the best alignment of the first i symbols of x with the first j symbols of y

Pairwise Alignment DP
• Initialization: F(0, 0) = 0, F(i, 0) = -id, F(0, j) = -jd (for a linear gap penalty d)
• Recursion: F(i, j) = max{ F(i-1, j-1) + S(xi, yj), F(i-1, j) - d, F(i, j-1) - d }
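A minimal sketch of the pairwise alignment DP above, assuming the linear gap penalty f(g) = dg. The substitution scores for the E/G/C/J example are invented for illustration (the actual matrix from the slides is not reproduced here); only their qualitative pattern follows the discussion: high diagonal, C/J similar, game playing vs programming negative.

```python
def align_score(x, y, S, d):
    """Global pairwise alignment score via dynamic programming.

    S is a dict of substitution scores, d is the linear gap penalty:
    F(i, j) = max(diagonal + S, up - d, left - d).
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d                      # initialization: all gaps
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[x[i - 1], y[j - 1]],  # align x_i, y_j
                          F[i - 1][j] - d,                          # gap in y
                          F[i][j - 1] - d)                          # gap in x
    return F[m][n]

# Hypothetical substitution scores for the E/G/C/J example (values invented)
off_diag = {("C", "J"): 5, ("G", "C"): -4, ("G", "J"): -4,
            ("E", "G"): -1, ("E", "C"): -2, ("E", "J"): -2}
S = {}
for a in "EGCJ":
    S[a, a] = 9                               # diagonal: exact matches
for (a, b), v in off_diag.items():
    S[a, b] = S[b, a] = v

print(align_score("EECJ", "ECCJG", S, d=3))
```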
MSA from Pairwise Alignments
• Given pairwise alignments...
• ...how to construct the MSA?
• The generic approach is "progressive alignment"
  • Select one pairwise alignment
  • Select another and combine it with the first
  • Continue to add more until all are combined
• Relatively easy (good)
• Gaps may proliferate, unstable (bad)

MSA from Pairwise Alignments
• Lots of ways to improve on generic progressive alignment
  • Here, we mention one such approach
  • Not necessarily the "best" or most popular
• Feng-Doolittle progressive alignment:
  • Compute scores for all pairs of the n sequences
  • Select n-1 alignments that a) "connect" all sequences and b) maximize the pairwise scores
  • Then generate a spanning tree from these alignments
  • For the MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
• Create pairwise alignments:
  • Generate a substitution matrix
  • Dynamic programming for pairwise alignments
• Use pairwise alignments to make the MSA:
  • Use pairwise alignments to construct a spanning tree (e.g., by Prim's algorithm)
  • Add sequences to the MSA in spanning tree order (from highest score, inserting gaps as needed)
  • Note: the gap penalty is used here

MSA Example
• Suppose we have 10 sequences, with the following pairwise alignment scores (score table omitted)

MSA Example: Spanning Tree
• Spanning tree based on the scores
• So, process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
• Intermediate step and final result
  • Use "+" for the neutral symbol
  • Then "-" for gaps in the MSA
• Note the increase in gaps

PHMM from MSA
• For the PHMM, we must determine the match and insert states, and the probabilities, from the MSA
• "Conservative" columns are match states
  • Half or less of the symbols are gaps
• Other columns are insert states
  • A majority of the symbols are gaps
• Delete states are a separate issue

PHMM States from MSA
• Consider a simpler MSA...
• Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
  • Since less than half of the symbols are gaps
• Columns 3, 4, and 5 are combined to form insert state 2
  • Since more than half of the symbols are gaps
  • The insert state lies between match states 2 and 3

PHMM Probabilities from MSA
• Emission probabilities
  • Based on the symbol distribution in the match and insert states
• State transition probabilities
  • Based on the transitions in the MSA

PHMM Probabilities from MSA
• Emission probabilities: count a symbol's occurrences in a state's column(s) and divide by the total
• But 0 probabilities are bad
  • The model "overfits" the data
• So, use the "add one" rule
  • Add one to each numerator, and add the total (the number of possibilities) to each denominator

PHMM Probabilities from MSA
• Transition probabilities: we look at some examples
  • Note that "-" corresponds to the delete state
• First, consider the begin state
• Again, use the add one rule

PHMM Probabilities from MSA
• Transition probabilities, continued
• When there is no information in the MSA, set the probabilities to uniform
• For example, I1 does not appear in the MSA, so
  • aI1,M2 = aI1,I1 = aI1,D2 = 1/3

PHMM Probabilities from MSA
• Transition probabilities, another example
• What about transitions from state D1?
• In the MSA, D1 can only go to M2, so aD1,M2 = 1
• Again, use the add one rule:
  • aD1,M2 = 2/4 and aD1,I1 = aD1,D2 = 1/4

PHMM Emission Probabilities
• Emission probabilities for the given MSA
• Using the add-one rule

PHMM Transition Probabilities
• Transition probabilities for the given MSA
• Using the add-one rule

PHMM Summary
• Construct pairwise alignments
  • Usually, use dynamic programming
• Use these to construct the MSA
  • Lots of ways to do this
• Using the MSA, determine the probabilities
  • Emission probabilities
  • State transition probabilities
• In effect, we have trained the PHMM
• Now what???
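To make the training step concrete, here is a sketch of deriving match-state emission probabilities from an MSA with the add-one rule. It covers only match-state emissions (insert states and transition probabilities are omitted), and the toy MSA over the E/G/C/J alphabet is made up for illustration.

```python
from collections import Counter

GAP = "-"

def match_emissions(msa, alphabet):
    """Sketch: match-state emission probabilities from an MSA, add-one rule.

    A column is a match state if half or less of its symbols are gaps.
    Returns one {symbol: probability} dict per match state.
    """
    n_rows, n_cols = len(msa), len(msa[0])
    match_cols = [c for c in range(n_cols)
                  if sum(row[c] == GAP for row in msa) <= n_rows / 2]
    emissions = []
    for c in match_cols:
        counts = Counter(row[c] for row in msa if row[c] != GAP)
        total = sum(counts.values())
        # add-one rule: +1 to each numerator, +|alphabet| to the denominator
        e = {s: (counts[s] + 1) / (total + len(alphabet)) for s in alphabet}
        emissions.append(e)
    return emissions

# Toy MSA over the E/G/C/J alphabet (made-up data for illustration)
msa = ["ECJ-",
       "EC-J",
       "E-CJ",
       "ECCJ"]
for i, e in enumerate(match_emissions(msa, "EGCJ"), start=1):
    print(f"match state {i}: {e}")
```

Note how the add-one rule guarantees that no emission probability is exactly zero, which is the overfitting fix described above.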
PHMM Scoring
• We want to score sequences to see how closely they match the PHMM
• How did we score sequences with an HMM?
  • Forward algorithm
• How to score sequences with a PHMM?
  • Forward algorithm
• But, the algorithm is a little more complex
  • Due to the more complex state transitions

Forward Algorithm
• Notation:
  • Indices i and j are columns in the MSA
  • xi is the ith observation symbol
  • qxi is the distribution of xi in the "random model"
• The base case starts the recursion at the begin state
• The forward value at state j is the score of x1, ..., xi up to state j
  • Note that in a PHMM, i and j may not agree
• Some states are undefined
  • Undefined states are ignored in the calculation

Forward Algorithm
• Compute P(X | λ) recursively
• Note that each match, insert, and delete score depends on the scores at the previous position
  • And on the corresponding state transition probabilities

PHMM
• We will see examples of PHMMs later
• In particular:
  • Malware detection based on opcodes
  • Masquerade detection based on UNIX commands

References
• R. Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press
• L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, to appear in Computers & Security
• S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169
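Tying together the forward-algorithm slides, here is a simplified sketch of PHMM scoring. It works in probability space for readability; a practical implementation would use log-odds scores against the random model qxi, as the notation above suggests. Every model parameter below is invented for illustration, and the dummy array entries stand in for the undefined states that the slides say to ignore.

```python
import numpy as np

def phmm_forward(x, eM, eI, aMM, aMI, aMD, aIM, aII, aID, aDM, aDI, aDD):
    """P(x | PHMM) via the forward algorithm (probability space sketch).

    States: begin (treated as M_0), matches M_1..M_N, inserts I_0..I_N,
    deletes D_1..D_N, and an end state. Transition arrays have length N+1:
    aMM[j] = P(M_j -> M_{j+1}) with aMM[N] = P(M_N -> end), aMI[j] = P(M_j -> I_j),
    aMD[j] = P(M_j -> D_{j+1}), and similarly for the I* and D* arrays
    (index 0 of aDM/aDI/aDD is a dummy, since there is no D_0).
    """
    N, L = eM.shape[0], len(x)
    fM = np.zeros((N + 1, L + 1))  # fM[j][i] = P(x_1..x_i emitted, now in M_j)
    fI = np.zeros((N + 1, L + 1))
    fD = np.zeros((N + 1, L + 1))
    fM[0][0] = 1.0                 # base case: begin state, nothing emitted yet
    for i in range(L + 1):
        for j in range(N + 1):
            if i > 0:
                if j >= 1:         # match states emit symbol x_i
                    fM[j][i] = eM[j - 1][x[i - 1]] * (
                        fM[j - 1][i - 1] * aMM[j - 1]
                        + fI[j - 1][i - 1] * aIM[j - 1]
                        + fD[j - 1][i - 1] * aDM[j - 1])
                fI[j][i] = eI[j][x[i - 1]] * (   # insert states also emit x_i
                    fM[j][i - 1] * aMI[j]
                    + fI[j][i - 1] * aII[j]
                    + fD[j][i - 1] * aDI[j])
            if j >= 1:             # delete states emit nothing
                fD[j][i] = (fM[j - 1][i] * aMD[j - 1]
                            + fI[j - 1][i] * aID[j - 1]
                            + fD[j - 1][i] * aDD[j - 1])
    # terminate: transition from M_N, I_N, or D_N into the end state
    return fM[N][L] * aMM[N] + fI[N][L] * aIM[N] + fD[N][L] * aDM[N]

# Toy PHMM with N = 2 match states over symbols {0, 1} (all numbers made up)
N = 2
eM = np.array([[0.8, 0.2], [0.3, 0.7]])  # match emissions
eI = np.full((N + 1, 2), 0.5)            # inserts emit uniformly
aMM = np.array([0.8, 0.8, 0.9]); aMI = np.array([0.1, 0.1, 0.1])
aMD = np.array([0.1, 0.1, 0.0])          # no D_3, so aMD[2] = 0
aIM = np.array([0.7, 0.7, 0.9]); aII = np.array([0.2, 0.2, 0.1])
aID = np.array([0.1, 0.1, 0.0])
aDM = np.array([0.0, 0.8, 1.0])          # index 0 entries are dummies (no D_0)
aDI = np.array([0.0, 0.1, 0.0]); aDD = np.array([0.0, 0.1, 0.0])
print(phmm_forward([0, 1], eM, eI, aMM, aMI, aMD, aIM, aII, aID, aDM, aDI, aDD))
```

Each fM value at column i depends on the match, insert, and delete values at the previous position, exactly the dependence the slides describe; the zero entries in fM, fI, and fD play the role of the undefined states.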