b2_PHMM

Profile Hidden Markov Models
Mark Stamp

Hidden Markov Models
• Here, we assume you know about HMMs
  o If not, see “A revealing introduction to hidden Markov models”
• Executive summary of HMMs
  o HMM is a machine learning technique
  o Also, a discrete hill climb technique
  o Train model based on observation sequence
  o Score given sequence to see how closely it matches the model
  o Efficient algorithms, many useful applications

HMM Notation
• Recall, HMM model denoted λ = (A,B,π)
• Observation sequence is O
• Notation: [table not preserved in source]

Hidden Markov Models
• Among the many uses for HMMs…
• Speech analysis
• Music search engine
• Malware detection
• Intrusion detection systems (IDS)
• Many more, and more all the time

Limitations of HMMs
• Positional information not considered
  o HMM has no “memory”
  o Higher order models have some memory
  o But no explicit use of positional information
• Does not handle insertions or deletions
• These limitations are serious problems in some applications
  o In bioinformatics string comparison, sequence alignment is critical
  o Also, insertions and deletions occur

Profile HMM
• Profile HMM (PHMM) designed to overcome limitations on previous slide
  o In some ways, PHMM easier than HMM
  o In some ways, PHMM more complex
• The basic idea of PHMM
  o Define multiple B matrices
  o Almost like having an HMM for each position in sequence

PHMM
• In bioinformatics, begin by aligning multiple related sequences
  o Multiple sequence alignment (MSA)
  o This is like training phase for HMM
• Generate PHMM based on given MSA
  o Easy, once MSA is known
  o Hard part is generating MSA
• Then can score sequences using PHMM
  o Use forward algorithm, like HMM

Training: PHMM vs HMM
• Training PHMM
  o Determine MSA (nontrivial)
  o Determine PHMM matrices (trivial)
• Training HMM
  o Append training sequences (trivial)
  o Determine HMM matrices (nontrivial)
• These are opposites…
  o In some sense

Generic View of PHMM
• Have delete, insert, and match states
  o Match states correspond to HMM states
• Arrows are possible transitions
  o Each transition has a probability
• Transition probabilities are A matrix
• Emission probabilities are B matrices
  o In PHMM, observations are emissions
  o Match and insert states have emissions

Generic View of PHMM
[Figure: generic PHMM state diagram]
• Circles are delete states, diamonds are insert states, squares are match states
• Also, begin and end states

PHMM Notation
• Notation: [table not preserved in source]

PHMM
• Match state probabilities easily determined from MSA
  o $a_{M_i,M_{i+1}}$: transitions between match states
  o $e_{M_i}(k)$: emission probability at match state
• Many other transition probabilities
  o For example, $a_{M_i,I_i}$ and $a_{M_i,D_{i+1}}$
• Emissions at all match & insert states
  o Remember, emission == observation

Multiple Sequence Alignment
• First we show MSA construction
  o This is the difficult part
  o Lots of ways to do this
  o “Best” way depends on specific problem
• Then construct PHMM from MSA
  o This is the easy part
  o Standard algorithm for this
• How to score a sequence?
  o Forward algorithm, similar to HMM

MSA
• How to construct MSA?
  o Construct pairwise alignments
  o Combine pairwise alignments for MSA
• Allow gaps to be inserted
  o To make better matches
• Gaps tend to weaken PHMM scoring
  o A tradeoff between gaps and scoring

Global vs Local Alignment
[Figure: example global and local pairwise alignments]
• In these pairwise alignment examples
  o “-” is gap
  o “|” means elements aligned
  o “*” for omitted beginning/ending symbols

Global vs Local Alignment
• Global alignment is lossless
  o But gaps tend to proliferate
  o And gaps increase when we do MSA
  o More gaps, more random sequences match…
  o …and result is less useful for scoring
• We usually only consider local alignment
  o That is, omit ends for better alignment
• For simplicity, assume global alignment in examples presented here

Pairwise Alignment
• Allow gaps when aligning
• How to score an alignment?
  o Based on n x n substitution matrix S
  o Where n is number of symbols
• What algorithm(s) to align sequences?
  o Usually, dynamic programming
  o Sometimes, HMM is used
  o Other?
• Local alignment creates more issues

Pairwise Alignment
• Example: [alignment example not preserved in source]
• Tradeoff gaps vs misaligned elements
  o Depends on matrix S and gap penalty

Substitution Matrix
• Masquerade detection
  o Detect imposter using an account
• Consider 4 different operations
  o E == send email
  o G == play games
  o C == C programming
  o J == Java programming
• How similar are these to each other?

Substitution Matrix
• Consider 4 different operations:
  o E, G, C, J
• Possible substitution matrix: [matrix not preserved in source]
• Diagonal entries are matches
  o High positive scores
• Which others most similar?
  o J and C, so substituting C for J is a high score
• Game playing/programming, very different
  o So substituting G for C is a negative score

Substitution Matrix
• Depending on problem, might be easy or very difficult to find useful S matrix
• Consider masquerade detection based on UNIX commands
  o Sometimes difficult to say how “close” 2 commands are
• Suppose aligning DNA sequences
  o Biological rationale for closeness of symbols

Gap Penalty
• Generally must allow gaps to be inserted
• But gaps make alignment more generic
  o Less useful for scoring, so we penalize gaps
• How to penalize gaps?
• Linear gap penalty function:
  o g(x) = ax (constant penalty for every gap)
• Affine gap penalty function:
  o g(x) = a + b(x – 1)
  o Gap opening penalty a and constant penalty of b for each extension of existing gap

Pairwise Alignment Algorithm
• We use dynamic programming
  o Based on S matrix, gap penalty function
• Notation: [table not preserved in source]

Pairwise Alignment DP
• Initialization: [equation not preserved in source]
• Recursion: [equation not preserved in source]

MSA from Pairwise Alignments
• Given pairwise alignments…
• How to construct MSA?
• Generally use “progressive alignment”
  o Select one pairwise alignment
  o Select another and combine with first
  o Continue to add more until all are combined
• Relatively easy (good)
• Gaps proliferate, and it’s unstable (bad)

MSA from Pairwise Alignments
• Lots of ways to improve on generic progressive alignment
  o Here, we mention one such approach
  o Not necessarily “best” or most popular
• Feng-Doolittle progressive alignment
  o Compute scores for all pairs of n sequences
  o Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
  o Then generate a minimum spanning tree
  o For MSA, add sequences in the order that they appear in the spanning tree

MSA Construction
• Create pairwise alignments
  o Generate substitution matrix
  o Dynamic program for pairwise alignments
• Use pairwise alignments to make MSA
  o Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)
  o Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)
  o Note: gap penalty is used

MSA Example
• Suppose 10 sequences, with the following pairwise alignment scores
[Table: pairwise alignment scores, not preserved in source]

MSA Example: Spanning Tree
[Figure: spanning tree]
• Spanning tree based on scores
• So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

MSA Snapshot
[Figure: intermediate and final MSA]
• Intermediate step and final
  o Use “+” for neutral symbol
  o Then “-” for gaps in MSA
• Note increase in gaps

PHMM from MSA
• In PHMM, determine match and insert states & probabilities from MSA
• “Conservative” columns are match states
  o Half or less of symbols are gaps
• Other columns are insert states
  o Majority of symbols are gaps
• Delete states are a separate issue

PHMM States from MSA
• Consider a simpler MSA…
[Figure: simpler MSA]
• Columns 1,2,6 are match states 1,2,3, respectively
  o Since less than half gaps
• Columns 3,4,5 are combined to form insert state 2
  o Since more than half gaps
  o Match states between insert

Probabilities from MSA
• Emission probabilities
  o Based on symbol distribution in match and insert states
• State transition probs
  o Based on transitions in the MSA

Probabilities from MSA
• Emission probabilities: [values not preserved in source]
• But 0 probabilities are bad
  o Model “overfits” the data
  o So, use “add one” rule
  o Add one to each numerator, add total to denominators

Probabilities from MSA
• More emission probabilities: [values not preserved in source]
• But 0 probabilities still bad
  o Model “overfits” the data
  o Again, use “add one” rule
  o Add one to each numerator, add total to denominators

Probabilities from MSA
• Transition probabilities:
• We look at some examples
  o Note that “-” is delete state
• First, consider begin state: [values not preserved in source]
• Again, use add one rule

Probabilities from MSA
• Transition probabilities
• When no information in MSA, set probs to uniform
• For example, $I_1$ does not appear in MSA, so $a_{I_1,M_2} = a_{I_1,I_1} = a_{I_1,D_2} = 1/3$

Probabilities from MSA
• Transition probabilities, another example
• What about transitions from state $D_1$?
• Can only go to $M_2$, so [values not preserved in source]
• Again, use add one rule: [values not preserved in source]

PHMM Emission Probabilities
• Emission probabilities for the given MSA
  o Using add-one rule
[Table: emission probabilities, not preserved in source]

PHMM Transition Probabilities
• Transition probabilities for the given MSA
  o Using add-one rule
[Table: transition probabilities, not preserved in source]

PHMM Summary
• Construct pairwise alignments
  o Usually, use dynamic programming
• Use these to construct MSA
  o Lots of ways to do this
• Using MSA, determine probabilities
  o Emission probabilities
  o State transition probabilities
• Then we have trained a PHMM
  o Now what???

PHMM Scoring
• Want to score sequences to see how closely they match PHMM
• How did we score using HMM?
  o Forward algorithm
• How to score sequences with PHMM?
  o Forward algorithm (surprised?)
• But, algorithm is a little more complex
  o Due to complex state transitions

Forward Algorithm
• Notation
  o Indices i and j are columns in MSA
  o $x_i$ is the ith observation symbol
  o $q_{x_i}$ is the distribution of $x_i$ in the “random model”
  o Base case is $F^M_0(0) = 0$
  o $F^M_j(i)$ is the score of $x_1,\ldots,x_i$ up to state j (note that in PHMM, i and j may not agree)
  o Some states undefined
  o Undefined states ignored in calculation

Forward Algorithm
• Compute P(X|λ) recursively: [recursion not preserved in source]
• Note that $F^M_j(i)$ depends on $F^M_{j-1}(i-1)$, $F^I_{j-1}(i-1)$, and $F^D_{j-1}(i-1)$
  o And corresponding state transition probs

PHMM
• We will see examples of PHMM later
• In particular,
  o Malware detection based on opcodes
  o Masquerade detection based on UNIX commands

References
• Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
• L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
• S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009
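Worked sketch: pairwise alignment by dynamic programming. The pairwise alignment slides describe a dynamic program driven by a substitution matrix S and a gap penalty. The Python below is a minimal sketch of one common form of that idea, global alignment with a linear gap penalty. The scores in S, the gap value, and the function name global_align are assumptions made for illustration; they are not the matrix or notation from the slides.

```python
SYMBOLS = "EGCJ"  # E = email, G = games, C = C programming, J = Java programming

# Hypothetical substitution matrix: high score on the diagonal,
# a small positive score for the "close" pair C/J, negative otherwise.
S = {(a, b): (9 if a == b else -4) for a in SYMBOLS for b in SYMBOLS}
S[("C", "J")] = S[("J", "C")] = 2


def global_align(x, y, S, gap=3):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y.

    Uses a linear gap penalty g(k) = gap * k, with "-" marking a gap.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    # Recursion: best of (mis)match, gap in y, gap in x.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])],
                F[i - 1][j] - gap,
                F[i][j - 1] - gap,
            )
    # Traceback from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S[(x[i - 1], y[j - 1])]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = global_align("ECJ", "EGCJ", S)
print(score, ax, ay)  # 24 E-CJ EGCJ
```

With these assumed scores, the sequence ECJ aligns to EGCJ by opening a single gap opposite G. Raising the gap penalty or lowering the mismatch penalty shifts the tradeoff between gaps and misaligned elements.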
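Worked sketch: spanning tree order for progressive alignment. The MSA construction slides build a spanning tree from pairwise scores (for example, with Prim's algorithm) and add sequences in tree order, starting from the highest score. The sketch below follows that description on a small set of hypothetical scores. Because it always picks the highest-scoring edge, it grows a maximum-weight spanning tree on the scores, which is the same as a minimum spanning tree on the negated scores. The function name and the score values are assumptions, and a score is assumed to exist for every pair.

```python
def spanning_tree_order(scores):
    """Return edges in the order a Prim-style pass adds them.

    scores maps an unordered pair (i, j) to its pairwise alignment score.
    The pass starts at the highest-scoring pair and repeatedly attaches the
    outside sequence with the best score to any sequence already in the tree.
    """
    nodes = {n for pair in scores for n in pair}

    def score(a, b):
        # Pairs are stored once, in either orientation.
        return scores[(a, b)] if (a, b) in scores else scores[(b, a)]

    first = max(scores, key=scores.get)
    in_tree = set(first)
    order = [first]
    while in_tree != nodes:
        candidates = [(a, b) for a in in_tree for b in nodes - in_tree]
        best = max(candidates, key=lambda e: score(*e))
        order.append(best)
        in_tree.add(best[1])
    return order


# Hypothetical scores for 4 sequences (not the 10-sequence example in the slides).
scores = {(1, 2): 85, (1, 3): 63, (1, 4): 42, (2, 3): 79, (2, 4): 50, (3, 4): 71}
print(spanning_tree_order(scores))  # [(1, 2), (2, 3), (3, 4)]
```

Each edge in the returned list names a sequence already in the MSA and the next sequence to merge, which is the form of the processing order shown on the spanning tree slide.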
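Worked sketch: match states and add-one emission probabilities from an MSA. The "PHMM from MSA" and "Probabilities from MSA" slides classify columns as match states when half or less of the symbols are gaps, and smooth emission counts with the add-one rule. The Python below is a minimal sketch of those two steps for match states only. The MSA shown is a made-up example, not the one from the slides, and the function names are assumptions. Insert-state emissions could be handled the same way by pooling the symbols of consecutive insert columns.

```python
from collections import Counter


def classify_columns(msa):
    """Label each MSA column "M" (match) or "I" (insert).

    A column is a match state when half or less of its symbols are gaps.
    """
    n_seq = len(msa)
    return ["M" if 2 * col.count("-") <= n_seq else "I" for col in zip(*msa)]


def match_emissions(msa, alphabet):
    """Return one add-one-smoothed emission distribution per match state."""
    emissions = []
    for kind, col in zip(classify_columns(msa), zip(*msa)):
        if kind != "M":
            continue
        counts = Counter(s for s in col if s != "-")
        total = sum(counts.values())
        # Add-one rule: add 1 to each numerator and the number of
        # symbols to the denominator, so no probability is zero.
        emissions.append(
            {s: (counts[s] + 1) / (total + len(alphabet)) for s in alphabet}
        )
    return emissions


# Hypothetical MSA over the symbols E, G, C, J ("-" is a gap).
msa = ["EG--C",
       "EGJ-C",
       "E---C",
       "-G-JC"]

print(classify_columns(msa))               # ['M', 'M', 'I', 'I', 'M']
print(match_emissions(msa, "EGCJ")[2]["C"])  # 0.625
```

In this example the last column holds C four times, so the smoothed probability of C at match state 3 is (4 + 1) / (4 + 4) = 5/8, and each unseen symbol gets 1/8 instead of 0.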
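Worked sketch: forward recursion for PHMM scoring. The forward algorithm slides refer to a recursion over match, insert, and delete states. For reference, the log-odds form commonly given in Durbin et al. is reproduced below, as recalled from that text rather than taken from these slides. Here $F^M_j(i)$, $F^I_j(i)$, and $F^D_j(i)$ are the scores of $x_1,\ldots,x_i$ ending in match, insert, and delete state $j$, and the base case is assumed to be $F^M_0(0) = 0$. Some presentations drop the insert-to-delete and delete-to-insert terms.

```latex
\begin{align*}
F^M_j(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
            + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
            + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big] \\
F^I_j(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_j,I_j}\, e^{F^M_j(i-1)}
            + a_{I_j,I_j}\, e^{F^I_j(i-1)}
            + a_{D_j,I_j}\, e^{F^D_j(i-1)} \Big] \\
F^D_j(i) &= \log\Big[ a_{M_{j-1},D_j}\, e^{F^M_{j-1}(i)}
            + a_{I_{j-1},D_j}\, e^{F^I_{j-1}(i)}
            + a_{D_{j-1},D_j}\, e^{F^D_{j-1}(i)} \Big]
\end{align*}
```

Match and insert states consume an observation, so their terms step from $i-1$ to $i$. Delete states emit nothing, so $F^D_j(i)$ keeps the same $i$. This is why the indices $i$ and $j$ may not agree. Terms that refer to undefined states are left out of the sums.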
