Chapter 4: Hidden Markov Models

4.4 HMM Applications
Prof. Yechiam Yemini (YY)
Computer Science Department
Columbia University
Overview
• Alignment with HMM
• Gene predictions
• Protein structure predictions
Sequence Analysis Using HMM
Step 1: construct an HMM model
• Design an HMM generator for the observed sequences
• Assign hidden states to underlying sequence regions
• Model the questions to be answered in terms of the hidden pathway
Step 2: train the HMM
• Supervised/unsupervised
Step 3: analyze sequences
• Viterbi decoding: compute the most likely hidden pathway
• Forward/Backward: compute the likelihood of the sequence (and p/Z-score)
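Step 3 can be sketched on a toy model. The two-state HMM below (states "exon" and "intron", with made-up start, transition, and emission probabilities) is an assumption for illustration only; it is not a trained gene model. `viterbi` returns the most likely hidden pathway, and `forward` returns the likelihood of the sequence summed over all pathways.

```python
import math

# Toy two-state HMM; every number here is an illustrative assumption.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.2, "intron": 0.8}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}


def viterbi(seq):
    """Return (most likely hidden path, log-probability of that path)."""
    v = [{s: math.log(START[s] * EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for sym in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda r: v[-1][r] + math.log(TRANS[r][s]))
            ptr[s] = prev
            scores[s] = v[-1][prev] + math.log(TRANS[prev][s] * EMIT[s][sym])
        v.append(scores)
        back.append(ptr)
    last = max(STATES, key=lambda s: v[-1][s])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the end
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[-1][last]


def forward(seq):
    """Return P(seq | model), summed over all hidden paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        f = {s: EMIT[s][sym] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())
```

For long sequences the forward probabilities underflow; a practical version would rescale them or work in log space.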
Alignment
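Profile HMM
[Figure: profile HMM with states S (start), M (match), I (insert), D (delete), and E (end); an example emission distribution over A, C, T, G with probabilities .6, .1, .2, .1]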
Searching With Profile HMM
• Is the target sequence S generated by profile H?
  • Viterbi: use H to compute the most probable alignment of S
  • Forward: compute the probability that S was generated by H
• Example: searching for globins
  • Create a profile HMM from known globin sequences
  • Use the forward algorithm to search SWISS-PROT for globins
  • Compare against the null hypothesis (random sequence)
  • Result: globins are highly distinct
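The comparison against the null hypothesis is usually expressed as a log-odds score. The sketch below is a deliberate simplification: the "profile" has match states only (no insert or delete states), works on DNA rather than protein, and uses invented emission values. `PROFILE`, `BACKGROUND`, `log_odds`, and `best_hit` are hypothetical names.

```python
import math

# Hypothetical match-state emissions for a 3-column DNA profile (assumed values).
PROFILE = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Null hypothesis: a random sequence with uniform base frequencies (assumed).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(seq):
    """log2 of P(seq | profile) / P(seq | null) for a gap-free match."""
    if len(seq) != len(PROFILE):
        raise ValueError("this sketch only scores gap-free, full-length matches")
    return sum(math.log2(col[sym] / BACKGROUND[sym])
               for col, sym in zip(PROFILE, seq))


def best_hit(target):
    """Slide the profile along target; return (score, offset) of the best window."""
    k = len(PROFILE)
    return max((log_odds(target[i:i + k]), i)
               for i in range(len(target) - k + 1))
```

A positive score means the profile explains the window better than the null model. A full profile HMM would replace `log_odds` with the forward algorithm over match, insert, and delete states.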
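Pairwise Alignment With HMM
• Key idea: recast DP as Viterbi decoding
• “Alignment” = sequence of pairs {<X,Y>, <X,_>, <_,Y>}
• Consider an HMM that emits pairs (pair HMM):
[Figure: pair HMM with states Begin, M (emits an aligned pair with probability PXY), X (emits a symbol of X against a gap with probability qX), Y (emits a symbol of Y against a gap with probability qY), and End. Transitions: Begin or M to M with 1−2δ−τ; Begin or M to X, and to Y, with δ each; X to X and Y to Y with ε; X to M and Y to M with 1−ε−τ; every state to End with τ.]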
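The transition structure of the pair HMM can be written down as a small table. This is a sketch under the reading of the figure given by its labels (δ opens a gap, ε extends it, τ ends the alignment); the function name `pair_hmm_transitions` is hypothetical.

```python
def pair_hmm_transitions(delta, epsilon, tau):
    """Transition probabilities of the pair HMM, keyed by source state."""
    if min(delta, epsilon, tau) <= 0:
        raise ValueError("delta, epsilon and tau must be positive")
    if 2 * delta + tau >= 1 or epsilon + tau >= 1:
        raise ValueError("parameters leave no probability for returning to M")
    from_match = {"M": 1 - 2 * delta - tau, "X": delta, "Y": delta, "End": tau}
    # A gap state can extend itself, return to M, or end; X and Y never
    # follow each other directly in this model.
    from_x = {"M": 1 - epsilon - tau, "X": epsilon, "End": tau}
    from_y = {"M": 1 - epsilon - tau, "Y": epsilon, "End": tau}
    # Begin has the same outgoing transitions as M.
    return {"Begin": dict(from_match), "M": from_match, "X": from_x, "Y": from_y}
```

Each row sums to 1, which is a quick sanity check on any choice of δ, ε, and τ.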
Viterbi Decoding ⇔ Pairwise Alignment
• Viterbi decoding: optimum path of pairs
• Forward step: select among {XY, -Y, X-}
• Example: X=TGCGCA, Y=TGGCTA
[Figure: Viterbi trellis over states B, M, X, Y, E and positions 1 to 6 for the example sequences, shown beside the pair HMM (Begin, M: PXY, X: qX, Y: qY, End; transitions δ, ε, τ, 1−2δ−τ, 1−ε−τ)]
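A minimal sketch of Viterbi decoding on the pair HMM, applied to the example sequences. The emission model is an assumption (uniform q over the four bases; PXY puts total mass `p_match` on identical pairs and spreads the rest over mismatches), as are the default values of δ, ε, and τ.

```python
import math

NEG_INF = float("-inf")


def pair_viterbi(x, y, delta=0.2, epsilon=0.1, tau=0.1, p_match=0.6):
    """Most probable alignment of x and y under a pair HMM (log space)."""
    q = math.log(0.25)                       # assumed uniform gap emission
    p_same = math.log(p_match / 4)           # assumed P_XY for identical pairs
    p_diff = math.log((1 - p_match) / 12)    # assumed P_XY for mismatches
    l_mm = math.log(1 - 2 * delta - tau)     # M -> M
    l_gm = math.log(1 - epsilon - tau)       # X -> M, Y -> M
    l_open, l_ext = math.log(delta), math.log(epsilon)

    n, m = len(x), len(y)
    # v[k][i][j]: best log-prob of aligning x[:i] with y[:j], ending in state k.
    v = {k: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for k in "MXY"}
    ptr = {k: [[None] * (m + 1) for _ in range(n + 1)] for k in "MXY"}
    v["M"][0][0] = 0.0  # Begin behaves like M

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # emit the pair <x_i, y_j>
                emit = p_same if x[i - 1] == y[j - 1] else p_diff
                best, ptr["M"][i][j] = max(
                    (v["M"][i - 1][j - 1] + l_mm, "M"),
                    (v["X"][i - 1][j - 1] + l_gm, "X"),
                    (v["Y"][i - 1][j - 1] + l_gm, "Y"))
                v["M"][i][j] = emit + best
            if i > 0:            # emit <x_i, _>
                best, ptr["X"][i][j] = max(
                    (v["M"][i - 1][j] + l_open, "M"),
                    (v["X"][i - 1][j] + l_ext, "X"))
                v["X"][i][j] = q + best
            if j > 0:            # emit <_, y_j>
                best, ptr["Y"][i][j] = max(
                    (v["M"][i][j - 1] + l_open, "M"),
                    (v["Y"][i][j - 1] + l_ext, "Y"))
                v["Y"][i][j] = q + best

    state = max("MXY", key=lambda k: v[k][n][m])
    score = v[state][n][m] + math.log(tau)   # transition to End
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:                    # trace back to Begin
        prev = ptr[state][i][j]
        if state == "M":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif state == "X":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
        state = prev
    return "".join(reversed(ax)), "".join(reversed(ay)), score
```

Calling `pair_viterbi("TGCGCA", "TGGCTA")` returns the two gapped strings and the log-probability of the best path; which alignment wins depends on the assumed parameters.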
Comparing With Standard DP
• Compute relationships: scores ↔ probabilities
• Consider an HMM that emits pairs (pair HMM)
<δ,ε,τ,P,q> → <s(x,y),d,e>
e.g., e = −log[ε/(1−τ)]
[Figure: the standard DP state machine (states M, X, Y; match score s(Xi,Yj), gap-open score −d, gap-extension score −e) shown beside the pair HMM (Begin, M: PXY, X: qX, Y: qY, End; transitions δ, ε, τ, 1−2δ−τ, 1−ε−τ)]
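The mapping <δ,ε,τ,P,q> → <s(x,y),d,e> can be sketched as follows. Only the expression for e appears on the slide; the expressions for s and d are the log-odds forms commonly quoted from Durbin et al., Biological Sequence Analysis, and should be treated as an assumption here. In that formulation the random model has its own termination parameter η; setting η = τ (the default below) reproduces the slide's e = −log[ε/(1−τ)].

```python
import math


def pair_hmm_to_scores(p, q, delta, epsilon, tau, eta=None):
    """Map pair-HMM parameters to DP scores (s, d, e) in log-odds form.

    p[(a, b)] is P_XY for the pair (a, b); q[a] is the single-symbol
    emission probability. eta is the random model's termination
    parameter (assumed equal to tau when omitted).
    """
    if eta is None:
        eta = tau
    # Constant added to every substitution score (assumed form).
    shift = math.log((1 - 2 * delta - tau) / (1 - eta) ** 2)
    s = {(a, b): math.log(p_ab / (q[a] * q[b])) + shift
         for (a, b), p_ab in p.items()}
    # Gap-open penalty (assumed form) and gap-extension penalty (from the slide).
    d = -math.log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
    e = -math.log(epsilon / (1 - eta))
    return s, d, e
```

With scores defined this way, standard affine-gap DP and Viterbi decoding of the pair HMM select the same alignment.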
Gene Predictions
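Schematics
• Prokaryotic genes
[Figure: a promoter, followed by consecutive genes with start and stop marked, followed by a terminator]
• Eukaryotic genes
[Figure: a promoter, followed by exons separated by introns; start and stop mark the coding region, and donor and acceptor mark the splice sites at the two ends of an intron]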
Modeling Gene Structure With HMM
Gene regions are modeled as HMM states
Emission probabilities reflect nucleotide statistics
[Figure: gene-structure HMM, with a splice site state labeled]
Simple Gene Models
• Consider a region of random length L
  • Markov model ⇒ L is geometrically distributed: P[L=k] = p^k (1−p)
  • E[L] = p/(1−p)
• How do we model an ORF?
  • Codons can be modeled as higher-order states
[Figure: example emission probabilities for a state: A=.29, C=.31, G=.04, T=.36]
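The geometric length distribution can be checked numerically. The sketch assumes a single state that loops back to itself with probability p and leaves with probability 1−p, so that L counts the self-loop steps; the function names are hypothetical.

```python
import random


def pmf(k, p):
    """P[L = k] = p^k (1 - p), for k = 0, 1, 2, ..."""
    return p ** k * (1 - p)


def expected_length(p):
    """E[L] = p / (1 - p)."""
    return p / (1 - p)


def sample_length(p, rng):
    """Count self-loop steps taken before the chain leaves the state."""
    k = 0
    while rng.random() < p:
        k += 1
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_length(0.9, rng) for _ in range(100000)]
    print("sample mean:", sum(draws) / len(draws))
    print("E[L]       :", expected_length(0.9))
```

The most probable length is always k = 0, and the tail decays exponentially; real exon and intron lengths do not follow this shape.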
VEIL: Viterbi Exon-Intron Locator
(Henderson, Salzberg, & Fasman 1997)
• Contains 9 hidden states
• Each state is a Markovian model of a region
  • Exons, introns, intergenic regions, splice sites, etc.
[Figure: VEIL Architecture, with states Upstream, Start Codon, Exon, 5’ Splice Site, Intron, 3’ Splice Site, Stop Codon, Downstream, and 5’ Poly-A Site]
Exon HMM Model
• Enter: start codon or intron (3’ Splice Site)
• Exit: 5’ Splice Site or one of three stop codons (taa, tag, tga)
Genie (Kulp 96)
• Uses a generalized HMM (GHMM)
• Edges in model are complete HMMs
• States are neural networks for signal finding
• J5’ – 5’ UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3’ – 3’ UTR
[Figure: Genie GHMM, with signal states Begin Sequence, Start Translation, Donor splice site, Acceptor splice site, Stop Translation, and End Sequence]
GENSCAN (Burge & Karlin 97)
J. Mol. Biol. (1997) 268, 78-94
• Base models overall parse of the gene
  • 5’UTR/3’UTR; Promoter; Poly A….
• Exon-Intron-Exon structure of gene
• Intron states model 3 scenarios:
  • I0: intron is between two codons
  • I1: intron is right after the first codon base
  • I2: intron is right after the second codon base
• Exons model respective scenarios
[Figure: a gene parse (Intergenic, Promoter, 5’ UTR, Ex1, In1, Ex2, In2, Ex3, In3, Ex4, In4, Ex5, 3’ UTR, Poly A) above the state diagram, with states E0, E1, E2, I0, I1, I2, Einit, Eterm, Single exon gene, 5’ UTR, 3’ UTR, Poly A signal, Promoter, and Intergenic region]
GENSCAN Strategy
• HMM: sequence generator; state = region
• Decoding: parse sequence into regions
• Viterbi: maximum likelihood decoder via DP
• These structures admit generalizations
[Figure: a gene parse (Intergenic, Promoter, 5’ UTR, Ex1, In1, Ex2, In2, Ex3, In3, Ex4, In4, Ex5, 3’ UTR, Poly A) above the forward-strand state diagram, with states E0, E1, E2, I0, I1, I2, Einit, Eterm, Single exon gene, 5’ UTR, 3’ UTR, Poly A signal, Promoter, and Intergenic region]
Extending The HMM Model
….ACAGTTATAGCGTAATTGCGAGCTATCATGAG……
Decode
[Figure: the sequence decoded into regions: Intergenic, Promoter, Ex1, In1, Ex2, In2, Ex3, In3, Ex4, In4, Ex5, Poly A]
• Problem: the HMM does not model nature
  • Sequence lengths are not geometrically distributed
  • Some regions have special features (e.g., splice sites use GT…)
  • …….
• Challenge: generalize HMM while retaining algorithms
GENSCAN: Extended Sequence Generator Model
• Add to state k a length distribution fk(L)
  • Semi-Markov model
  • Extend the Viterbi DP recursion to reflect this
• Incorporate special region structures
  • Weight matrix model for splice site
  • ….
• Preserve the recursion structure
[Figure: state diagram with states E0, E1, E2, I0, I1, I2, Einit, Eterm, Single exon gene, 5’ UTR, 3’ UTR, Poly A signal, Promoter, and Intergenic region]
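The extended recursion can be sketched as a Viterbi pass over segments rather than single positions. This is a generic explicit-duration (semi-Markov) decoder under stated assumptions, not GENSCAN's actual model: `length_pmf[k][L]` plays the role of fk(L), a state never follows itself, all emission probabilities are positive, and the two-state example model at the bottom is invented.

```python
import math

NEG_INF = float("-inf")


def segment_viterbi(seq, states, start, trans, emit, length_pmf, max_len):
    """Best segmentation of seq; returns (log-prob, [(state, begin, end), ...])."""
    n = len(seq)
    # best[t][k]: best log-prob of seq[:t] when a segment of state k ends at t.
    best = [{k: NEG_INF for k in states} for _ in range(n + 1)]
    back = [{k: None for k in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for k in states:
            emit_sum = 0.0
            for L in range(1, min(t, max_len) + 1):
                emit_sum += math.log(emit[k][seq[t - L]])  # covers seq[t-L:t]
                f = length_pmf[k].get(L, 0.0)
                if f == 0.0:
                    continue
                if L == t:   # the segment starts the sequence
                    cands = [(math.log(start[k]), None)] if start.get(k, 0) > 0 else []
                else:        # the segment follows one of another state
                    cands = [(best[t - L][j] + math.log(trans[j][k]), j)
                             for j in states
                             if j != k and trans[j].get(k, 0) > 0]
                for prev_score, j in cands:
                    score = prev_score + math.log(f) + emit_sum
                    if score > best[t][k]:
                        best[t][k], back[t][k] = score, (j, L)
    k = max(states, key=lambda s: best[n][s])
    score = best[n][k]
    if score == NEG_INF:
        raise ValueError("no segmentation is compatible with the length model")
    segments, t = [], n
    while t > 0:             # trace back one whole segment at a time
        j, L = back[t][k]
        segments.append((k, t - L, t))
        t, k = t - L, j
    segments.reverse()
    return score, segments


# Invented example model: GC-rich "exon" segments, AT-rich "intron" segments.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
LENGTHS = {"exon": {3: 0.5, 6: 0.5}, "intron": {2: 0.5, 4: 0.5}}
```

The recursion keeps the shape of ordinary Viterbi; the added cost is the inner loop over segment lengths, which is why a bound such as `max_len` is needed in practice.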
Measuring Classifier Performance
• The challenge: how do we evaluate the performance of a classifier?
  • E.g., consider a test to decide whether a person is sick (p) or healthy (n)
• Measured performance
  • TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative
  • Sensitivity: SN = TP/(TP+FN)
  • Sensitivity = percentage of correct predictions when the actual state is positive
  • Specificity: SP = TN/(TN+FP)
  • Specificity = percentage of correct predictions when the actual state is negative
• Receiver Operating Characteristic (ROC) curves
  • Represent the tradeoff between sensitivity and specificity
Confusion matrix (rows: actual; columns: prediction):
        P’    N’
  P     TP    FN
  N     FP    TN
[Figure: ROC curves, with sensitivity on the vertical axis and 1−specificity on the horizontal axis]
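The definitions above translate directly into code. The sketch assumes labels are 1 (positive) or 0 (negative) and that a ROC curve is traced by predicting positive whenever a score is at or above a threshold; the function names are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FN, FP, TN) for paired actual labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """SN = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SP = TN / (TN + FP)."""
    return tn / (tn + fp)


def roc_points(labels, scores):
    """(1 - specificity, sensitivity) at each threshold, highest threshold first."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predictions = [s >= threshold for s in scores]
        tp, fn, fp, tn = confusion_counts(labels, predictions)
        points.append((1 - specificity(tn, fp), sensitivity(tp, fn)))
    return points
```

Lowering the threshold moves along the curve toward the upper right: more positives are caught (higher sensitivity) at the price of more false alarms (lower specificity).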
Performance Comparisons
Identifying Transmembrane Proteins
Transmembrane Proteins
• Transmembrane proteins are the key tools for cell interactions
  • Signaling, transport, adhesion…
  • A common structural motif: core helices
• Example: receptors
  • Function: receive signals, activate pathways
  • Operations:
1. Signal recognition
2. Signal transduction
3. Pathway activation
4. Down regulation
www.pdb.org
Example: The Insulin Pathway
[Figure (from Wikipedia): insulin binds to the InsR receptor; the receptor activates the pathway (InsR phosphorylated complex); metabolic network processing; Glut-4 is transported to the membrane to generate glucose influx]
TMHMM [Sonnhammer et al 98]
• An HMM to identify transmembrane proteins
• Architecture:
  • The helix core is key
  • Globular and loop domains
  • Modeling challenge: variable length
• Training
  • The maximization step of Baum-Welch uses simulated annealing
  • 3-stage training
A hidden Markov Model for Predicting Transmembrane Helices in Protein
In J. Glasgow et al., eds., Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182. AAAI Press, 1998.
TMHMM Performance
Proteins: Structure Prediction
Proteins Are Formed From Peptide Bonds
Ramachandran Regions
Levels of Protein Structure
Structure Formation
Helix is Formed Through H Bonds
Tertiary Structures: β-Barrel Example
Why Is Structure Important?
Protein functions are handled by structural elements
Enzyme active site
HIV protease drug binding site
Antibody-protein interfaces
Structure Determination
Map: sequence → structure
• Structure determination is handled through crystallography
  • X-Ray; NMR
PDB: a database of structures
The Problem Of Structure Prediction
• Derive conformation geometry from sequence
  • Ab-initio techniques
  • Homology techniques
• Secondary structure is organized in folds
  • α-helix, β-sheets….
The Structure Prediction Problem
Given: an amino acid sequence
Output: structure annotation {H,B…}
HMM Model
• Hidden states: folds
• Observable: AA sequence
• Viterbi decoding: find most likely annotation
• Training: supervised learning using known structures
• Performance: HMM works well for limited protein types
  • E.g., TMHMM…
  • For general structure predictions, neural nets are superior (e.g., PHD, B. Rost 93)
  • Why?
    o [HMM works as long as the underlying conformation is consistent with the Markovian model. Protein conformations may depend on complex long-range non-Markovian interactions. E.g., consider the impact of a single AA change on hemoglobin conformation in sickle-cell anemia.]
Example (Chu et al, 2004)
www.gatsby.ucl.ac.uk/~chuwei/paper/icml04_talk.pdf
Step 1: align sequences to partition into regions
Step 2: build extended HMM (semi-Markov length)
Performance
Challenges: long-range interactions (e.g., β-sheets)
Conclusions
• HMMs are very useful when a Markovian model is appropriate
• Solve three central problems:
  • Decoding sequences to classify their components
  • Computing likelihood of the classification
  • Training
• May be extended to handle non-Markovian elements
• But can lose their predictive power when Markovian assumptions diverge from nature