Introduction to Profile Hidden Markov Models
Mark Stamp

Hidden Markov Models
- Here, we assume you know about HMMs; if not, see "A revealing introduction to hidden Markov models"
- Executive summary of HMMs:
  - An HMM is a machine learning technique; also a discrete hill climb technique
  - Train a model based on an observation sequence
  - Score a given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications

HMM Notation
- Recall, an HMM model is denoted λ = (A, B, π)
- The observation sequence is O
- Notation: (table not shown)

Hidden Markov Models
- Among the many uses for HMMs:
  - Speech analysis
  - Music search engines
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time

Limitations of HMMs
- Positional information is not considered
  - An HMM has no "memory"; higher order models have some memory, but make no explicit use of positional information
- Insertions and deletions are not handled
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical, and insertions and deletions occur

Profile HMM
- A profile HMM (PHMM) is designed to overcome the limitations above
  - In some ways, a PHMM is easier than an HMM; in some ways, more complex
- The basic idea of a PHMM: define multiple B matrices
  - Almost like having an HMM for each position in the sequence

PHMM
- In bioinformatics, we begin by aligning multiple related sequences
  - This multiple sequence alignment (MSA) is like the training phase for an HMM
- Generating the PHMM is easy once the MSA is known; the hard part is generating the MSA
- Then we can score sequences using the PHMM forward algorithm, as with an HMM

Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states
  - Match states correspond to HMM states
- Arrows are the possible transitions; each transition has an associated probability
  - Transition probabilities form the A matrix
  - Emission probabilities form the B matrices
- In a PHMM, observations are emissions; match and insert states
have emissions

Generic View of PHMM
- Circles are delete states, diamonds are insert states, rectangles are match states
- There are also begin and end states
- (diagram not shown)

PHMM Notation
- Notation: (table not shown)

PHMM
- Match state probabilities are easily determined from the MSA; that is:
  - a_{M_i, M_{i+1}}: transitions between match states
  - e_{M_i}(k): emission probability at match state M_i
- Note: there are other transition probabilities too, for example a_{M_i, I_i} and a_{M_i, D_{i+1}}
- Emissions occur at all match and insert states
  - Remember, emission == observation

MSA
- First we show MSA construction
  - This is the difficult part: there are lots of ways to do it, and the "best" way depends on the specific problem
- Then we construct the PHMM from the MSA
  - This is the easy part: there is a standard algorithm for it
- How to score a sequence? The forward algorithm, similar to an HMM

MSA
- How to construct an MSA?
  - Construct pairwise alignments
  - Combine the pairwise alignments to obtain the MSA
- We allow gaps to be inserted
  - Gaps make for better matches, but they tend to weaken scoring, so there is a tradeoff

Global vs Local Alignment
- In these pairwise alignment examples:
  - "-" is a gap
  - "|" marks aligned symbols
  - "*" marks omitted beginning and ending symbols
- (examples not shown)

Global vs Local Alignment
- Global alignment is lossless, but gaps tend to proliferate
  - And gaps increase further when we do the MSA
- More gaps implies more sequences match, so the result is less useful for scoring
- We usually only consider local alignment; that is, we omit the ends for a better alignment
- For simplicity, we assume global alignment here

Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on an n x n substitution matrix S, where n is the number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Other?
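To make the substitution-matrix scoring concrete, here is a minimal sketch of scoring an already-aligned pair of sequences with a substitution matrix S and a linear per-gap penalty. The alphabet, matrix values, and penalty below are toy assumptions for illustration, not values from these slides:

```python
# Score an aligned pair of sequences using a substitution matrix S
# and a linear per-gap penalty d. Matrix values are toy numbers.

def alignment_score(x, y, S, d):
    """x and y are equal-length aligned strings; '-' marks a gap."""
    assert len(x) == len(y)
    score = 0
    for a, b in zip(x, y):
        if a == '-' or b == '-':
            score -= d          # linear gap penalty: d per gap symbol
        else:
            score += S[(a, b)]  # substitution matrix lookup
    return score

# Toy substitution matrix over {A, C, G, T}: +3 for a match, -1 otherwise
alphabet = "ACGT"
S = {(a, b): (3 if a == b else -1) for a in alphabet for b in alphabet}

print(alignment_score("AG-C", "A-TC", S, d=2))  # 3 - 2 - 2 + 3 = 2
```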
- Local alignment raises more issues

Pairwise Alignment Example
- Note gaps vs misaligned elements; the alignment depends on S and the gap penalty
- (example not shown)

Substitution Matrix
- Consider masquerade detection: detecting an imposter who is using another user's account
- Suppose there are 4 different operations:
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?

Substitution Matrix
- Consider the 4 operations E, G, C, J
- A possible substitution matrix: (not shown)
  - The diagonal entries are matches
- Which others are most similar?
  - J and C, so substituting C for J gets a high positive score
  - Game playing and programming are very different, so substituting G for C gets a negative score

Substitution Matrix
- Depending on the problem, it might be easy or very difficult to obtain a useful S matrix
  - Consider masquerade detection based on UNIX commands: it is sometimes difficult to say how "close" two commands are
  - Suppose instead we are aligning DNA sequences: there is a biological rationale for the closeness of symbols

Gap Penalty
- Generally we must allow gaps to be inserted
  - But gaps make the alignment more generic, and hence less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps?
  - Linear gap penalty function: f(g) = dg (i.e., a constant penalty d per gap)
  - Affine gap penalty function: f(g) = a + e(g - 1), with a gap-opening penalty a, then a constant factor of e per additional gap

Pairwise Alignment Algorithm
- We use dynamic programming, based on the S matrix and the gap penalty function
- Notation: (not shown)

Pairwise Alignment DP
- Initialization: (not shown)
- Recursion: (not shown)

MSA from Pairwise Alignments
- Given pairwise alignments, how do we construct the MSA?
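The initialization and recursion formulas from the Pairwise Alignment DP slide are not reproduced here, so as a stand-in, this is a sketch of the standard Needleman-Wunsch global alignment recursion with a linear gap penalty f(g) = dg, consistent with the setup above. The scoring function and penalty values are illustrative assumptions:

```python
# Needleman-Wunsch global alignment score via dynamic programming,
# with substitution function s and linear gap penalty d per gap.
# A textbook sketch, not the exact notation from these slides.

def nw_score(x, y, s, d):
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initialization: all-gap prefixes
        F[i][0] = -d * i
    for j in range(1, m + 1):
        F[0][j] = -d * j
    for i in range(1, n + 1):          # recursion
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align
                          F[i - 1][j] - d,                          # gap in y
                          F[i][j - 1] - d)                          # gap in x
    return F[n][m]

# Toy scoring: +3 for a match, -1 for a mismatch, gap penalty d = 2
s = lambda a, b: 3 if a == b else -1
print(nw_score("GATT", "GCT", s, 2))  # 3
```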
- The generic approach is "progressive alignment":
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue to add more until all are combined
- Relatively easy (good), but gaps may proliferate and the result can be unstable (bad)

MSA from Pairwise Alignments
- There are lots of ways to improve on generic progressive alignment; here we mention one such approach, not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment:
  - Compute scores for all pairs of the n sequences
  - Select n - 1 alignments that (a) "connect" all sequences and (b) maximize the pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
- Create pairwise alignments
  - Generate a substitution matrix
  - Use dynamic programming for the pairwise alignments
- Use the pairwise alignments to make the MSA
  - Use the pairwise alignment scores to construct a spanning tree (e.g., Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (starting from the highest score, inserting gaps as needed)
  - Note: the gap penalty is used here

MSA Example
- Suppose we have 10 sequences, with the following pairwise alignment scores: (table not shown)

MSA Example: Spanning Tree
- Spanning tree based on the scores (not shown)
- So we process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
- Intermediate step and final result (not shown)
  - We use "+" for a neutral symbol, then "-" for gaps in the MSA
- Note the increase in gaps

PHMM from MSA
- For the PHMM, we must determine the match and insert states and their probabilities from the MSA
- "Conservative" columns become match states
  - Half or less of the symbols are gaps
- Other columns become insert states
  - The majority of the symbols are gaps
- Delete states are a separate issue

PHMM States from MSA
- Consider a simpler MSA (not shown)
- Columns 1, 2, 6 are match states 1, 2, 3, respectively, since less than half of their symbols are gaps
- Columns 3, 4, 5 are combined to form insert state 2, since more than half of their symbols are gaps
  - The insert state lies between match states

PHMM Probabilities from MSA
- Emission probabilities are based on the symbol distribution in the match and insert states
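The rules above (conservative columns become match states, and emissions follow the symbol distribution, smoothed with the add-one rule used later in these slides) can be sketched as follows. The MSA and alphabet are hypothetical, chosen so that columns 3 through 5 collapse into an insert state; delete states and transition probabilities are omitted from this sketch:

```python
# From an MSA, determine match/insert columns and match-state emission
# probabilities with the "add one" rule. Simplified sketch: delete
# states and transition probabilities are not handled here.

def classify_columns(msa):
    """Match ('M') if half or less of a column's symbols are gaps."""
    n_seqs = len(msa)
    return ['M' if sum(s[j] == '-' for s in msa) <= n_seqs / 2 else 'I'
            for j in range(len(msa[0]))]

def match_emissions(msa, alphabet):
    """Add-one emission probabilities for each match-state column."""
    probs = []
    for j, state in enumerate(classify_columns(msa)):
        if state != 'M':
            continue
        column = [s[j] for s in msa if s[j] != '-']
        total = len(column) + len(alphabet)  # add one count per symbol
        probs.append({a: (column.count(a) + 1) / total for a in alphabet})
    return probs

# Hypothetical 4-sequence MSA: columns 1, 2, 6 are match states,
# while columns 3-5 (mostly gaps) merge into a single insert state
msa = ["AG---T", "ACA--T", "-G-A-T", "AG--CT"]
print(classify_columns(msa))            # ['M', 'M', 'I', 'I', 'I', 'M']
print(match_emissions(msa, "ACGT")[0])  # e.g. 'A': (3 + 1)/(3 + 4) = 4/7
```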
- State transition probabilities are based on the transitions in the MSA

PHMM Probabilities from MSA
- Emission probabilities: (example not shown)
- But 0 probabilities are bad: the model "overfits" the data
- So, use the "add one" rule: add one to each numerator, and add the total added to each denominator

PHMM Probabilities from MSA
- More emission probabilities: (example not shown)
- Again, 0 probabilities are bad (the model "overfits" the data), so use the add-one rule

PHMM Probabilities from MSA
- Transition probabilities: we look at some examples
- Note that "-" corresponds to a delete state
- First, consider the begin state: (example not shown)
- Again, use the add-one rule

PHMM Probabilities from MSA
- Transition probabilities: when there is no information in the MSA, set the probabilities to uniform
- For example, I_1 does not appear in the MSA, so its transition probabilities are set to uniform (not shown)

PHMM Probabilities from MSA
- Transition probabilities, another example: what about transitions from state D_1?
- It can only go to M_2, so (not shown)
- Again, use the add-one rule

PHMM Emission Probabilities
- Emission probabilities for the given MSA, using the add-one rule (table not shown)

PHMM Transition Probabilities
- Transition probabilities for the given MSA, using the add-one rule (table not shown)

PHMM Summary
- Construct pairwise alignments, usually using dynamic programming
- Use these to construct the MSA; there are lots of ways to do this
- Using the MSA, determine the probabilities:
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
- Now what?

PHMM Scoring
- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with an HMM? The forward algorithm
- How to score sequences with a PHMM?
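As a reminder of the HMM case, the standard forward algorithm can be sketched as below; the PHMM version follows the same dynamic-programming pattern, but over match, insert, and delete states. The two-state model at the bottom is a made-up example:

```python
# Standard HMM forward algorithm: computes P(O | lambda) for the model
# lambda = (A, B, pi) and observation sequence O.

def hmm_forward(A, B, pi, obs):
    """A[i][j]: state transitions, B[i][k]: emissions, pi[i]: initial."""
    N = len(pi)
    # alpha[i] = P(observations so far, ending in state i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

# Made-up two-state, two-symbol model
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(hmm_forward(A, B, pi, [0, 1]))  # ~0.209
```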
- The forward algorithm again; but the PHMM forward algorithm is a little more complex, due to the more complex state transitions

Forward Algorithm
- Notation:
  - Indices i and j are columns in the MSA
  - x_i is the i-th observation symbol
  - q_{x_i} is the distribution of x_i in the "random model"
- The base case is (not shown)
- F_j(i) is the score of x_1, ..., x_i up to state j (note that in a PHMM, i and j may not agree)
- Some states are undefined; undefined states are simply ignored in the calculation

Forward Algorithm
- Compute P(X | λ) recursively
- Note that F_j(i) depends on F_{j-1}(i-1), F_j(i-1), and F_{j-1}(i), along with the corresponding state transition probabilities
- (recursion not shown)

PHMM
- We will see examples of PHMMs later; in particular:
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands

References
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al.
- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee, and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169