Chapter 4: Hidden Markov Models
4.4 HMM Applications
Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University

Overview
- Alignment with HMMs
- Gene prediction
- Protein structure prediction

Sequence Analysis Using HMMs
- Step 1: construct an HMM model
  - Design an HMM generator for the observed sequences
  - Assign hidden states to the underlying sequence regions
  - Model the questions to be answered in terms of hidden pathways
- Step 2: train the HMM
  - Supervised/unsupervised
- Step 3: analyze sequences
  - Viterbi decoding: compute the most likely hidden pathway
  - Forward/backward: compute the likelihood of a sequence (and p/Z-score)

Alignment

Profile HMM
[Figure: profile HMM with a chain of match states (M) from start (S) to end (E), insert states (I), and delete states (D); each match state carries an emission distribution over {A, C, T, G}, e.g., A=.6, C=.1, T=.2, G=.1.]

Searching With Profile HMM
- Is the target sequence S generated by profile H?
  - Viterbi: use H to compute the most probable alignment of S
  - Forward: compute the probability of S being generated by H
- Example: searching for globins
  - Create a profile HMM from known globin sequences
  - Use the forward algorithm to search SWISS-PROT for globins
  - Compare against the null hypothesis (a random sequence)
  - Result: globins are highly distinct

Pairwise Alignment With HMM
- Key idea: recast DP as Viterbi decoding
- "Alignment" = a sequence of pairs {<X,Y>, <X,_>, <_,Y>}
- Consider an HMM that emits pairs (pair HMM):
[Figure: pair-HMM state diagram. State M emits aligned pairs <X,Y> with probability P_XY; state X emits <X,_> with q_X; state Y emits <_,Y> with q_Y. From Begin or M: go to X or Y with probability δ each, to End with τ, and stay in M with 1−2δ−τ; from X or Y: self-loop with ε, return to M with 1−ε−τ, and exit to End with τ.]

Viterbi Decoding Pairwise Alignment
- Viterbi decoding: the optimal path of pairs
- Forward step: select among {XY, -Y, X-}
[Figure: Viterbi DP lattice for X=TGCGCA against Y=TGGCTA, with moves XY (match), X- (gap in Y), and -Y (gap in X), alongside the pair-HMM state diagram.]

Comparing With Standard DP
- Compute relationships: scores ↔ probabilities
- Pair-HMM parameters <δ, ε, τ, P, q> correspond to DP parameters <s(x,y), d, e>
  - E.g., gap-extension penalty e = -log[ε/(1−τ)]
[Figure: the pair-HMM diagram annotated with DP scores: match score s(Xi,Yj) on M, gap-open penalty -d on the M→X and M→Y edges, gap-extension penalty -e on the X and Y self-loops.]

Gene Prediction

Schematics
- Prokaryotic genes: promoter, start, gene(s), stop, terminator
- Eukaryotic
genes: promoter, exon (start codon), donor site, intron, acceptor site, exon, intron, ..., exon (stop codon)

Modeling Gene Structure With HMM
- Gene regions are modeled as HMM states
- Emission probabilities reflect nucleotide statistics
[Figure: splice-site nucleotide statistics.]

Simple Gene Models
- Consider a region of random length L
- In a Markov model, L is geometrically distributed:
  P[L=k] = p^k(1−p), E[L] = p/(1−p)
- How do we model an ORF?
  - Codons can be modeled as higher-order states
[Figure: a single self-looping state with an example emission distribution A=.29, C=.31, G=.04, T=.36.]

VEIL: Viterbi Exon-Intron Locator (Henderson, Salzberg & Fasman 1997)
- Contains 9 hidden states: upstream, start codon, exon, 3' splice site, intron, 5' splice site, stop codon, downstream, poly-A site
- Each state is a Markovian model of a region
  - Exons, introns, intergenic regions, splice sites, etc.
- VEIL architecture (exon HMM model):
  - Enter an exon through the start codon or from an intron (3' splice site)
  - Exit through a 5' splice site or one of the three stop codons (taa, tag, tga)

Genie (Kulp 96)
- Uses a generalized HMM (GHMM)
  - Edges in the model are complete HMMs
  - States are neural networks for signal finding
- States: J5' (5' UTR), EI (initial exon), E (internal exon), I (intron), EF (final exon), ES (single exon), J3' (3' UTR)
[Figure: Begin → sequence → translation start → donor/acceptor splice sites → translation stop → sequence → End.]

GENSCAN (Burge & Karlin 1997), J. Mol. Biol. 268, 78-94
- Base model: the overall parse of the gene
  - 5' UTR / 3' UTR; promoter; poly-A; ...
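The geometric length distribution of a single self-looping HMM state, noted in the Simple Gene Models slide, can be checked by simulation. This is a minimal sketch; the parameter value p = 0.9 is an illustrative choice, not one taken from the slides:

```python
import random

def sample_state_duration(p, rng):
    """Number of symbols emitted while the chain keeps taking the
    self-loop (probability p) before leaving (probability 1 - p)."""
    length = 0
    while rng.random() < p:
        length += 1
    return length

rng = random.Random(0)
p = 0.9
samples = [sample_state_duration(p, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# Theory: P[L = k] = p^k (1 - p), so E[L] = p / (1 - p) = 9 for p = 0.9
print(mean)
```

The simulated mean lands close to p/(1−p) = 9, matching the formula on the slide; this geometric shape is exactly what GENSCAN's explicit length distributions are introduced to replace.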
[Figure: GENSCAN state diagram: intergenic region, promoter, 5' UTR, Einit, exon states E0/E1/E2 alternating with intron states I0/I1/I2, Eterm, single-exon gene, 3' UTR, poly-A signal.]
- Exon-intron-exon structure of the gene
  - Intron states model 3 scenarios (codon phases):
    - I0: the intron falls between two codons
    - I1: the intron falls right after the first codon base
    - I2: the intron falls right after the second codon base
  - Exon states model the respective scenarios

GENSCAN Strategy
- HMM as a sequence generator; state = region
- Decoding: parse the sequence into regions
  - Viterbi: maximum-likelihood decoding via DP
- These structures admit generalizations (e.g., forward-strand and reverse-strand copies of the model)

Extending The HMM Model
- Decode ....ACAGTTATAGCGTAATTGCGAGCTATCATGAG...... into intergenic, promoter, exons, introns, ..., poly-A
- Problem: the HMM does not model nature
  - Sequence lengths are not geometrically distributed
  - Some regions have special features (e.g., splice sites use GT...)
- Challenge: generalize the HMM while retaining its algorithms

GENSCAN: Extended Sequence Generator Model
- Add to state k a length distribution f_k(L)
  - Semi-Markov model
  - Extend the Viterbi DP recursion to reflect this
- Incorporate the structure of special regions
  - Weight matrix model for splice sites, ...
  - Preserve the recursion structure

Measuring Classifier Performance
- The challenge: how do we evaluate the performance of a classifier?
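The Viterbi decoding that the gene finders above build on can be sketched on a toy two-state HMM. The states ('G' for a GC-rich gene-like region, 'N' for an AT-rich intergenic region) and all probabilities are illustrative assumptions, not parameters from VEIL, Genie, or GENSCAN:

```python
import math

# Toy 2-state HMM: 'G' = gene-like (GC-rich), 'N' = intergenic (AT-rich).
states = ['G', 'N']
start = {'G': 0.5, 'N': 0.5}
trans = {'G': {'G': 0.9, 'N': 0.1}, 'N': {'G': 0.1, 'N': 0.9}}
emit = {'G': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
        'N': {'A': 0.35, 'C': 0.15, 'G': 0.15, 'T': 0.35}}

def viterbi(seq):
    """Most likely hidden state path, computed by DP in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            best = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            row[s] = V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][x])
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))

print(viterbi("ATTATAGCGCGCGCATTATT"))  # → NNNNNNGGGGGGGGNNNNNN
```

The decoder labels the GC-rich middle as 'G' and the flanks as 'N', i.e., it parses the sequence into regions exactly as the GENSCAN Strategy slide describes; real gene finders replace these two states with the full exon/intron state diagram.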
- E.g., consider a test to decide whether a person is sick (p) or healthy (n)
- Measured performance (confusion matrix):
  - TP = true positive; TN = true negative; FP = false positive; FN = false negative
- Sensitivity: SN = TP/(TP+FN)
  - The fraction of correct predictions when the actual state is positive
- Specificity: SP = TN/(TN+FP)
  - The fraction of correct predictions when the actual state is negative
- Receiver Operating Characteristic (ROC) curves
  - Represent the tradeoff between sensitivity and specificity
[Figure: 2x2 confusion matrix (actual P/N vs. predicted P'/N') and ROC curves plotting sensitivity against 1−specificity.]

Performance Comparisons
[Figure: performance comparison of gene finders.]

Identifying Transmembrane Proteins

Transmembrane Proteins
- Transmembrane proteins are the key tools for cell interactions
  - Signaling, transport, adhesion, ...
- A common structural motif: core helices
- Example: receptors
  - Function: receive signals, activate pathways
  - Operations: 1. signal recognition; 2. signal transduction; 3. pathway activation; 4. down-regulation
[Figure: transmembrane receptor structures, www.pdb.org]

Example: The Insulin Pathway
- Insulin binds to the InsR receptor
- The receptor activates the pathway (phosphorylated InsR complex)
- Metabolic network processing
- Glut-4 is transported to the membrane to generate glucose influx
[Figure: the insulin signaling pathway, from Wikipedia]

TMHMM (Sonnhammer et al. 1998)
- An HMM to identify transmembrane proteins
- Architecture:
  - The helix core is key
  - Globular and loop domains
  - Modeling challenge: variable length
- Training:
  - The maximization step of Baum-Welch uses simulated annealing
  - 3-stage training
- "A Hidden Markov Model for Predicting Transmembrane Helices in Protein Sequences." In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182. AAAI Press, 1998.

TMHMM Performance
[Figure: TMHMM prediction accuracy results.]

Protein Structure Prediction

Proteins Are Formed From Peptide Bonds
[Figure: peptide bond geometry; Ramachandran regions.]

Levels of Protein Structure

Structure Formation

Helix Is Formed Through H Bonds

Tertiary Structures: β-Barrel Example

Why Is Structure Important?
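Returning to the Measuring Classifier Performance slide, the sensitivity and specificity definitions can be turned into a small worked example; the confusion-matrix counts below are hypothetical, not results from any of the gene finders:

```python
def sensitivity(tp, fn):
    """Fraction of actual positives correctly predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives correctly predicted: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical per-nucleotide confusion matrix for a gene finder:
tp, fn, fp, tn = 90, 10, 30, 870
sn = sensitivity(tp, fn)   # 90 / 100 = 0.9
sp = specificity(tn, fp)   # 870 / 900 ≈ 0.967
print(sn, 1 - sp)          # one (sensitivity, 1-specificity) ROC point
```

Sweeping the classifier's decision threshold and recomputing this pair at each setting traces out the ROC curve described on the slide.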
- Protein functions are handled by structural elements
  - Enzyme active sites
  - The HIV protease drug-binding site
  - Antibody-protein interfaces

Structure Determination
- Maps sequence → structure
- Structure determination is handled through crystallography
  - X-ray; NMR
- PDB: a database of structures

The Problem Of Structure Prediction
- Derive conformation geometry from sequence
  - Ab-initio techniques
  - Homology techniques
- Secondary structure is organized in folds
  - α-helices, β-sheets, ...

The Structure Prediction Problem
- Given: an amino acid sequence
- Output: a structure annotation {H, B, ...}

HMM Model
- Hidden states: folds; observables: the AA sequence
- Viterbi decoding: find the most likely annotation
- Training: supervised learning using known structures
- Performance: HMMs work well for limited protein types
  - E.g., TMHMM, ...
- For general structure prediction, neural nets are superior (e.g., PHD, B. Rost '93)
  - Why? An HMM works as long as the underlying conformation is consistent with the Markovian model. Protein conformations may depend on complex long-range non-Markovian interactions; e.g., consider the impact of a single AA change on hemoglobin conformation in sickle-cell anemia.

Example (Chu et al., 2004)
www.gatsby.ucl.ac.uk/~chuwei/paper/icml04_talk.pdf
- Step 1: align sequences to partition them into regions
- Step 2: build an extended HMM (semi-Markov lengths)

Performance Challenges
- Long-range interactions (e.g., β-sheets)

Conclusions
- HMMs are very useful when a Markovian model is appropriate
- They solve three central problems:
  - Decoding sequences to classify their components
  - Computing the likelihood of the classification
  - Training
- HMMs may be extended to handle non-Markovian elements
- But they can lose their predictive power when Markovian assumptions diverge from nature
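The second central problem in the conclusions, computing the likelihood of a sequence, is the forward algorithm. Below is a log-space sketch on a toy two-state model (all parameters are illustrative assumptions) that also scores a sequence against a uniform background, in the spirit of the globin search example from the alignment section:

```python
import math

# Toy two-state model: 'G' = GC-rich region, 'N' = AT-rich region.
states = ['G', 'N']
start = {'G': 0.5, 'N': 0.5}
trans = {'G': {'G': 0.9, 'N': 0.1}, 'N': {'G': 0.1, 'N': 0.9}}
emit = {'G': {'A': 0.15, 'C': 0.35, 'G': 0.35, 'T': 0.15},
        'N': {'A': 0.35, 'C': 0.15, 'G': 0.15, 'T': 0.35}}

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_loglik(seq):
    """log P(seq | model): sums over ALL hidden paths, unlike Viterbi."""
    f = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    for x in seq[1:]:
        f = {s: logsumexp([f[r] + math.log(trans[r][s]) for r in states])
                + math.log(emit[s][x])
             for s in states}
    return logsumexp(list(f.values()))

seq = "GCGCGCGC"
score = forward_loglik(seq) - len(seq) * math.log(0.25)  # log-odds vs. null
print(score > 0)  # GC-rich sequence scores above the uniform background
```

A positive log-odds score means the model explains the sequence better than the random-sequence null hypothesis, which is exactly the decision rule used when searching SWISS-PROT with a profile HMM.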