Document

advertisement
Hidden
Markov
Models
A profile HMM
What
is
a
HMM?
• It’s a model. It condenses information.
• Models 1D discrete data.
• Directed graph.
• Nodes emit discrete data
• Edges transition between nodes.
• What’s hidden about it?
• Node identity.
• Who is Markov?
2
Markov processes
time
sequence
Markov process is any process where the next item in the list depends
on the current item. The dimension can be time, sequence position, etc
Modeling protein secondary structure
using Markov chains
“states” connected by “transitions”
H
E
H=helix
E=extended (strand)
L
L=loop
Setting the parameters of a Markov
model from data.
Secondary structure data
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLEEEEELLLLLLLLLLLEEEEEEEEELLLLLEEEEEEEEELL
LLLLLLEEEEEELLLLLEEEEEELLLLLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLLLLLLEEEELLLLEEEELLLLEEE
EEEEELLLLLLEEEEEEEEELLLLLLEELLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLEEEEEELLLLLLLLLLEEEEEEL
LLLLEEELLLLLLLLLLLLLEEEEEEEEELLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHLLLLLLLEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHHLLLLLLLELHHHHHHHHHHHHLLLLLHHHHHHHHHHHHHLLLLLEEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHEELLLLLLHHHHHHHHHHHLLLLLLLHHHHHHHHHHHHHHHHHHHHH
HHHHHHLLLLLLHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLLLLLEEEEL
LLLHHHHHHHHHHHHHHHLLLLLLLLEELLLLLHHHHHHHHHHHHHHLLLLLLEEEEELLLLLLLLLLHHHHHHHH
HHHHHHHHHHHHHHHLLLLLHHHHHHHHHLLLLHHHHHHHLLHHHHHHHHHHHHHHHHHHHH
Count the pairs to get the transition probability.
E
P(L|E)
L
P(L|E) = P(EL)/P(E) = counts(EL)/counts(E)
counts(E) = counts(EE) + counts(EL) + counts(EH)
Therefore: P(E|E) + P(L|E) + P(H|E) = 1.
A transition matrix
P(H|H)
P(E|H)
P(qt|qt-1)
P(E|E)
H
E
P(H|E)
P(L|E)
H
E
L
H
.93
.01 .06
E
.01
.80
.19
L
.04
.06
.90
P(L|H)
P(H|L)
P(E|L)
L
P(L|L)
**This is a “first-order” MM. Transition probabilities depend
on only the current state.
P(S|λ), the probability of a
sequence, given the model.
P(“HHEELL”| λ)
λ
H
E
L
H
.93
.01 .06
E
.01
.80
.19
L
.04
.06
.90
=P(H)P(H|H)P(E|H)P(E|E)P(L|E)P(L|L)
=(.33)(.93)(.01)(.80)(.19)(.90)
=4.2E-4
P(“HHHHHH” | λ) =0.69
common protein secondary structure
P(“HEHEHE” | λ) =1E-6
not protein secondary structure
Probability discriminates between realistic and unrealistic sequences
What is the maximum likelihood model
given a dataset of sequences?
Dataset.
HHEELL
H
E
H
1
1
0
H
0.5
0.5 0
HHEELL
E
0
1
1
E
0
0.5
0.5
HHEELL
L
0
0
1
L
.0
0
1.0
HHEELL
L
HHEELL
HHEELL
Count the state pairs.
H
E
L
Maximum likelihood model
Normalize by row.
Is this model too simple?
H
E
L
Frequency
Synthetic helix length data from
this model
1 2 3 4 5 6 7 8 9 10
Real helix length data
*L.Pal et al, J. Mol. Biol. (2003)
326, 273–291
“A model should be as simple as possible but not simpler” --Einstein
A pseudo-higher-order HMM
A Markov chain for proteins where helices are always
exactly 4 residues long
H
H
L
H
H
E
A Markov chain for proteins where helices are always
at least 4 residues long
H
H
L
H
H
E
Can you draw a Markov chain where helices are
always a multiple of 4 long?
H
H
H
L
H1
H1
H2
H3
H4
E
E
L
0.1
0.4
1
H2
1
H3
1
H4
0.5
H
E
0.2
0.7
0.1
L
0.2
0.2
0.6
Calculate probability of
EHHHHHLLE.
Example application: A Markov chain for CpG
islands
A
T
a “saturated” 4-state MM
C
G
P(ATCGCGTA...) = πAaATaTCaCGaGCaCGaGTaTA …
1
CpG Islands
DNA ...
+
methylated
+
-
-
...
Not methylated
DNA is methylated on C to protect against endonucleases.
Using mass spectroscopy we can find regions of DNA that are
methylated and regions that are not. Regions that are protected
from methylation may be functionally important, i.e.
transcription factor binding sites.
NNNCGNNN
NNNTGNNN
During the course of evolution. Methylated CpG’s get mutated
to TpG’s
Using Markov chains for descrimination:
CpG Islands in human chromosome sequences
+
-
CpG rich
CpG poor
...
-
+
...
CpG poor= "-"
CpG rich= "+"
+
A
C
G
T
-
A
C
G
T
A
0.180
0.274
0.426
0.120
A
0.300
0.205
0.285
0.210
C
0.171
0.368
0.274
0.188
C
0.322
0.298
0.078
0.302
G
0.161
0.339
0.385
0.125
G
0.248
0.246
0.298
0.208
T
0.079
0.355
0.384
0.182
T
0.177
0.239
0.292
0.292
P(CGCG|+) = πC(0.274)(0.339)(0.274) = πC 0.0255
P(CGCG|-) = πC(0.078)(0.246)(0.078) = πC 0.0015
From Durbin,Eddy, Krogh and Mitcheson “Biological Sequence Analysis” (1998) p.50
2
Rabiner’s notation
P(y, x) F(y, x)
a yx = P(x | y) =
=
P(y)
F(y)
...the conditional probability of x given y, or the
transition probability of state y to state x.
πx = P(x) = F(x)/N
...the unconditional probability of being in state x . (used
to start a state pathway)
bx (y) = P(y|x)
...the conditional probability of emitting character
y given state x.
Comparing two MMs
The log likelihood ratio (LLR)
L
log ∏
i =1
+
x i −1 x i
−
x i −1 x i
a
a
β
Log-likelihood ratios
for transitions:
L
= ∑ log
i =1
A
C
G
a
a
+
xi −1 xi
−
xi −1 xi
L
= ∑ βx i−1x i
i =1
T
A
-0.740 0.419 0.580 -0.803
C
-0.913 0.302 1.812 -0.685
G
-0.624 0.461 0.331 -0.730
T
-1.169 0.573 0.393 -0.679
Sum the LLRs.
If the result is positive, its a CpG island, otherwise not.
LLR(CGCG)=1.812 + 0.461 + 1.812 = 4.085
yes
3
Combining two Markov chains to make a hidden Markov model
A hidden Markov model can have
multiple paths for a sequence
C
"+" model
"–"model
A
T
G
Transitions
between +/models
A
T
G
In Hidden Markov models (HMM), there is no one-to-one correspondence
between the state and the emitted symbol.
4
Probability of a sequence using a HMM
Different state sequences can produce the same emitted sequence
Nucleotide sequence (S): C
G
C
G
P(sequence,path)
State sequences (Q):
C+ G+ C+ G+
C- G- C- G-
C+ G+ C- G-
C+ G- C- G+
etc....
P(CGCG|λ) =
πC+ aC+G+aG+C+aC+G+
πC- aC-G- aG-C- aC-GπC+ aC+G+aG+C-aC-GπC+ aC+G- aG-C- aC-G+
etc....
Σ
P(Q)
All paths Q
Each state sequence has a probability. The sum of all state
sequences that emit CGCG is the P(CGCG).
5
Three HMM Algorithms
1. The Viterbi algorithm: get the optimal state pathway.
Maximum joint prob.
2. The Forward/Backward algorithm: get the probability of
each state at each position. Sum over all joint probs.
3. Expectation/Maximization: refine the parameters of the
model using the data
Back to secondary structure prediction....
Parallel HMM: emits sec struct and amino acid
bH(i)
The marble bag represents
a probability distribution
of amino acids, b. ( a
profile )
H
E
L
probability distribution == a set of
probabilities (0 ≤ p ≤ 1) that sum to 1.
0.
Each state emits one
amino acid from the
marblebag, for each visit.
stacked odds?
1.
states emit aa and ss.
Amino acid
Sequence
H
E
L
λ
Given an amino acid sequence, what is the
most probable state sequence?
State sequence
(secondary
structure)
HMM data structure for parallel HMM
in fortran...
type HMMNODE
integer :: id
type (HMMEDGE), pointer :: a(:)
real :: b(20), emit(:)
logical :: emitting
end type HMMNODE
type HMMEDGE
integer :: id
real :: p
type (HMMNODE), pointer :: q
end type HMMEDGE
type (HMMSTATE), pointer :: hmm_root
A linked list...
hmm_root should be the “begin” state, with hmm_root%emitting set to .false. If
(emitting==.true.) then the state emits amino acid profile b, and optionally something
else called emit(:).
23
In class exercise: what’s the LLR?
What is the LLR that this seq is a CpG Island?
ATGTCTTAGCGCGATCAGCGAAAGCCACG
β
A
C
G
T
A
-0.740
0.419
0.580
-0.803
C
-0.913
0.302
1.812
-0.685
G
-0.624
0.461
0.331
-0.730
T
-1.169
0.573
0.393
-0.679
L
LLR = ∑ β xi−1xi
= _______________
i=1
3
HMM: assigning the states given the
sequence is not as easy.
Typically, when using a HMM, the task is to determine the
optimal state pathway given the sequence. The state pathway
provides some predictive feature, such as secondary structure,
or splice site/not splice site, or CpG island/not CpG island, etc.
In Principle, we can do this task by trying all state
pathways Q, and choosing the optimal. In Practice, this is
usually impossible, because the number of pathways increases
as the number of states to the power of the length, i.e. O(nm).
How do we do it, then?
Maximize:
Joint probability of a sequence and pathway
Q = {q1,q2,q3,…qT} = sequence of Markov states, or pathway
S = {s1,s2,s3,…sT} = sequence of amino acids or nucleotides
T = length of S and Q.
Joint probability of a pathway and sequence, given a HMM λ.
S=
A
G
P
L
V
D
Q=
P=
H
H
H
H
H
H
E
E
E
E
E
E
L
L
L
L
L
L
πHbH(A)
× aHHbH(G) × aHEbE(P) × aEEbE(L) × aEEbE(V) × aELbL(D)
Joint probability : general expression
General expression for pathway Q through HMM λ :
P(S,Q | λ ) = π q 1 ∏ bq t (st )aq t q t +1
t =1,T
**
**when t=T, there is no qt+1. Use a = 1
A
G
P
L
V
D
H
H
H
H
H
H
E
E
E
E
E
E
L
L
L
L
L
L
The Viterbi algorithm:the maximum probability path
For all states k
vk(t) = MAX vl(t-1) alk bk(st)
l
Trck(t) = ARGMAX vl(t-1) alk bk(st)
l
Recursive. We save the value v
and also a traceback arrow Trc as
we go along.
Plot state versus position. Each v is a MAX over the whole previous column of v’s.
l
vl(i)
Markov states
...
1 2 3
k
T-1
sequence position t
T
When t = T the last position, the
traceback arrow from the MAX give the
optimal state sequence.
Exercise: Write the Viterbi algorithm
5
4
6
3
2
states 1..L
1
positions 1..T
vk(t) = MAX vl(t-1) alk bk(st)
Trck(t) = ARGMAX vl(t-1) alk bk(st)
Exercise: Write the Viterbi algorithm
5
4
6
3
1
2
vk(t) = MAX vl(t-1) alk bk(st)
Trck(t) = ARGMAX vl(t-1) alk bk(st)
initialize vk(1)=bk(s1)
for t=2,T {
for k=1,L {
}
states 1..L
positions 1..T
}
α
The Forward algorithm:all paths to a state
This is alpha, the forward probability
This is ‘a’, the ‘arrow’ between states.
αk(t) = Σ αl(t-1) alk bk(t)
l
“Forward” stands for
“forward recursion”
After the first row, each α depends on the whole previous row of α’s.
l
αlt
Markov states
...
1 2 3
k
t-1
...
Sum of P over all paths up to state k
at t
= αk(t)
t
sequence position i
At the end of the sequence, when t=T, the sum of αk(T) equals the total
probability of the seuqence given the model, P(S|λ).
β
The Backward algorithm:all paths from a state
“Backward” stands for
“backward recursion”. The
algorithm starts at t=T, the end
of the sequence. (The
transitions are still forward.)
βk(t) = Σ βl(t+1) akl bk(t)
l
Each β depends on the whole next row of β’s.
βlt
= βk(t)
...
Markov states
Sum over all paths to state k from t+1
...
k
l
t
t+1
T-2 T-1 T
sequence position i
At the beginning of the sequence, when t=1, the sum of βk(1) equals the total
probability of the sequence given the model, P(S|λ).
Exercise: Write the Forward algorithm
5
4
6
3
1
2
αk(t) = Σ αl(t-1) alk bk(t)
initialize αk(1)=πk(s1)
for t=2,T {
for k=1,L {
}
states 1..L
positions 1..T
}
γ
Forward/Backward algorithm:all paths through a state.
γk(t) = αk(t) *βk(t)
γk(t) is the total probability of state k at t,
given the sequence S and the model, λ.
βlt
Markov states
l
αlt
...
k
l
Markov states
...
1 2 3
t-1
t
t+1
sequence position t
The bottleneck through which all paths must travel.
T-2 T-1 T
Expectation/Maximization: refining the model
Example: refining bk(G) (i.e. the number of Gly’s in the kth marble bag)
all parameters can be refined simultaneously
First calculate P(k|t) for all states k and all positions t using
Forward/Backward.
Step 1) Sum the probability of state k given aa @ t is G, P(k|G).
Step 2) Normalize it by dividing by the sum of P(k) .
Step 3) Set bk(G) in that value. P(G|k)
Step 4) Repeat for all states k in λ and all 20 amino acids.
Recalculate P(k|t) using the new parameters.
Repeat steps 1-4.
Iterate to convergence.
Expectation/Maximization: refining the model
Example: refining bk(G)
Σ over all G in all sequences, S
To count the Glycines, we calculate the Forward/Backward value for state k at
every Glycine in the database. Then sum them.
P(k|t,S,λ) = Σ all paths though k at t = γk(t) = αk(t) *βk(t)
b’k(G) =
SDKPHS
G
SDKPQ
LKVSDE
G
LKVSDEFF
+
SDKPHSSIK
G
SDE
G
INC LKVHRSDE
SDKPHSEEE
G
SDE
+
D
G
+
KP
LKVSDEGWWNN
+
KS
+
+
+
G
LKVSDEGQGQ
+
SDKPHSGM
G
LKE
ASDKPH
G
LKVSDE
bk is then normalized to sum to 1 over all 20 AA’s.
Expectation/Maximization: refining the model
Example: refining ajk, the probability of a transition from state j to state k.
Step 1) Get the probability of ending in state j at t
--> αj(t)
Step 2) Get the probability of starting in state k at t+1
--> βk(t)
Step 3) Multiply these by the current ajk
Step 4) Do Steps 1-3 for all positions t and all sequences, S.
Sum--> a’. Then normalize. Reset ajkin the new model to a’.
Do 1-4 using the new model. Repeat until convergence.
Expectation/Maximization: refining the model
αj(t)
βk(t+1)
Σ Σ αj(t) ajk βk(t+1)
S t
+
+
+
+
+
+
ajk
+
...
= a’
After summing all a’, they are
normalized to sum to 1.
Σ over all t in all sequences, S
Example: refining ajk, the probability of a transition from state j to state k.
Draw this HMM
What is the probability of this sequence?
39
Unrolling the Viterbi algorithm: underflow problems solved
by going to log space
Algorithm:
vk(t) = MAX vl(t-1) alk bk(st)
l
Algorithm unrolled:
vk(2) = bq (s1) aq k bk(s2)
1
1
vk(3) = bq (s1) aq q2 bq2(s2) aq2q3 bq3(s3)
1
1
vk(4) = bq (s1) aq q2 bq2(s2) aq2q3 bq3(s3) aq3q4 bq4(s4)
1
1
...
Why will this calculation fail for vk(100) ?
Hint: Try multiplying 200 numbers (all between 0 and 1) together.
Viterbi algorithm underflow: log space solution
Algorithm:
vk(t) = MAX vl(t-1) alk bk(st)
l
Log space:
Log(vk(t)) = MAX[ Log(vl(t-1)) + Log(alk) + Log( bk(st)) ]
l
The Forward algorithm: underflow problem
Algorithm:
αk(t) = Σ αl(t-1) alk bk(t)
l
Algorithm unrolled:
αk(2) = bk(2) [α1(1) a1k + α2(1) a2k + α3(1) a3k + α4(1) a4k ]
αk(3) = bk(3) [b1(2) [α1(1) a11 + α2(1) a21 + α3(1) a31 + α4(1) a41 ] a1k +
b2(2) [α1(1) a12 + α2(1) a22 + α3(1) a32 + α4(1) a42 ] a2k +
b3(2) [α1(1) a13 + α2(1) a23 + α3(1) a33 + α4(1) a43 ] a3k +
b4(2) [α1(1) a14 + α2(1) a24 + α3(1) a34 + α4(1) a44 ] a4k ]
...
Can’t do this one in Log space,
because Log(a+b) can’t be simplified.
Solution: dynamic scaling to keep numbers within a
reasonable range
42
The Forward algorithm: underflow problem, scaling solution
αk(t) = Σ αl(t-1) alk bk(t)
l
Let, α’k(2) = c2
Σ αl(1) alk bk(2)
where c1 is any scale factor.
And, α’k(3) = c3
Σ α’l(2) alk bk(3)
where c2 is any scale factor.
And so on, for all sequence positions t.
Then, α’k(t) = ct
Σ α’l(t-1) alk bk(t-1)
If we choose, ct
= 1/Σ α’k(t) ,
then, ct α’k(t) has a mean value of 1.
43
The Forward / Backward algorithm: scaling solution does not
change gamma.
Re-writing, α’k(t) = ctΣl (Πi=1,t-1ci) αl(t-1) alk bk(t-1)
So, α’k(t) = (Πi=1,t ci) αk(t)
If we apply the same scale factors to the backward value β,
then, β’k(t) = (Πi=t,T ci) βk(t)
Then the calculation of the a posteriori value γ, is
γ’k(t) = α’k(t) β’k(t) = (Πi=1,t ci) αk(t)(Πi=t,T ci) βk(t)
= (Πi=1,t ci) (Πi=t,T ci) αk(t) βk(t)
= (Πi=1,T ci) ct αk(t) βk(t) = CT ct αk(t) βk(t) = CT ct γk(t)
where, CT = (Πi=1,T ci)
Since γ is normalized to sum to 1,
γ’k(t) = CT ct γk(t) / Σk CT ct γk(t) = γk(t)
Gamma from scaled summation is the same as gamma unscaled.
44
How do you define the
TOPOLOGY of the HMM?
how much wood would a wood chuck
chuck if a wood chuck would
chuck wood?
can you can a can as a canner
can can a can?
I wish to wish the wish you wish to
wish, but if you wish the wish the
witch wishes, I won’t wish the wish
you wish to wish.
1
61
121
181
241
301
361
421
tgattggtct
tttgggggat
tggtatttta
gatttcggga
tacttgattt
ggattttaag
ttttaggatt
ctgaatataa
ctctgccacc
tttaggatta
ggatttactt
tttcaggatt
tgggatttta
ttttcttgat
acgggatttt
atgctctgct
gggagatttc
taggattacg
gattttggga
ttaagttttc
ggattacggg
tttatgattt
agggtgctca
gctctcgctg
cttatttgga
ggattttagg
ttttaggatt
ttgattttat
attttagggt
taagatttta
ctatttatag
atgtcattgt
ggtgatggag
gttctaggat
gagggatttt
gattttaaga
ttcaggattt
ggatttactt
aactttcatg
tctcataata
gatttcagga
tttaggatta
agggtttcag
ttttaggatt
cgggatttca
gattttggga
gtttaacata
cgttcctttg
Transposable elements: junk dealers
Transposable
elements “jumping
genes” lead to rapid
germline variation.
Barbara McClintock
“Out standing in her field”
Transposase,
transposasome
Excision of transposon may leave a “scar”.
TR
TR=tandem repeat
IR=inverted repeat
IR
IR
TR
cruciform structure
repaired DNA with copied TR and
added IR
Millions of years of accumulated TE
“scars”
Some genomes contain a large accumulation of transposon scars.
Estimated Transposable element-associated
DNA content in selected genomes
35%
TEs
>50%
Everything else
15%
H.sapiens
Z. mays
2%
Drosophila Arabidopsis
1.8%
C. elegans
3.1%
S. cerevisiae
How do you recognize a repeat sequence?
•High scoring self-alignments
•High dot plot density
•Compositional bias
A repeat region
in a dot plot.
Types of repeat sequences
Satellites -- 1000+ bp in
heterochromatin: centromeres, telomeres
Simple Sequence Repeats (SSRs),
in euchromatin :
Minisatellites -- ~15bp (VNTR)
Microsatellites -- 2-6 bp
heterochromatin=compact, light bands
euchromatin=loose, dark bands.
microsatellite
541
601
661
721
781
841
gagccactag
aggtgtgtgt
taaagtaggc
gccactggtt
ccgatcaaat
tcctatttcc
tgcttcattc
gtgtgtgtgt
agtcagtcaa
ggagagctga
aagcataagg
aagctgtggg
tctcgctcct
gtgtgtgtgt
cagtaagaac
tccgcaagct
tcttccaacc
gtcttgagga
actagaatga
gtgtgtgtgt
ttggtgccgg
gcaagacctc
actagcattt
gatcatttca
acccaagatt
gtatagcaga
aggtttgggg
tctatgcttt
ctgtcataaa
ctggccggac
gcccaggccc
gatggtttcc
tcctggccct
ggttctctaa
atgagcactg
cccatttcac
a microsatellite in a dog (canis familiaris) gene.
Minisatellite
1
61
121
181
241
301
361
421
tgattggtct
tttgggggat
tggtatttta
gatttcggga
tacttgattt
ggattttaag
ttttaggatt
ctgaatataa
ctctgccacc
tttaggatta
ggatttactt
tttcaggatt
tgggatttta
ttttcttgat
acgggatttt
atgctctgct
gggagatttc
taggattacg
gattttggga
ttaagttttc
ggattacggg
tttatgattt
agggtgctca
gctctcgctg
cttatttgga
ggattttagg
ttttaggatt
ttgattttat
attttagggt
taagatttta
ctatttatag
atgtcattgt
ggtgatggag
gttctaggat
gagggatttt
gattttaaga
ttcaggattt
ggatttactt
aactttcatg
tctcataata
gatttcagga
tttaggatta
agggtttcag
ttttaggatt
cgggatttca
gattttggga
gtttaacata
cgttcctttg
This 8bp tandem repeat has a consensus sequence AGGATTTT,
but is almost never a perfect match to the consensus.
fun with bioinformatics jargon
ACRONYMS for satellites and transposons
SSR
Short Sequence Repeat
STR
Short Tandem Repeat
VNTR Variable Number Tandem Repeat
LTR
Long Terminal Repeat
LINE Long Interspersed Nuclear Element
SINE
MITE
Short Interspersed Nuclear Element
Miniature Inverted repeat Transposable Element (class III TE)
TE
IS
Transposable Element
Insertion Sequence
IR
Inverted Repeat
RT
Reverse Transcriptase
TPase
Transposase
Class I TE, uses RT.
Class II TE, uses TPase.
Class III TE, MITEs*
Alu 11% of primate genome (SINE)
LINE1 14.6% of human genome
Tn7,Tn3,Tn10,Mu,IS50 transposons or transposable bacteriophage
*Cl,ass III are now merged with Class II TEs.
retroposon=retrotransposon
How significant is that?
...a specific
inference. Thanks.
...as opposed to...
...how likely the data would not
have been the result of chance,...
Please give me a number for...
(How) do you align repeat sequences?
A: Don’t align. Mask them out instead.
B: Dynamic Programming with special null model.
(Use EVD to fit
random scores.)
Remember: Low complexity repeat sequences will have high-scoring
alignments randomly. For example, random A/T repeat...
ATTTATATAATTAATATATAAATATAATAAATAT
aligned to
TATTATATATATATATATATTATATATATATATA
Random score is has >50% identity!!
How do I align repeat sequences?
• Align using dynamic programming.
• Assess significance using extreme value
distribution fit to random data from HMM.
• Results are e-values. Use e-values to build
multiple sequence alignments, phylogenetic
trees, etc.
57
Generating random low complexity
sequences, to estimate significance.
Dinucleotide composition model.
Generate random sequences based on dinucleotide model. Align
them to generate random score distribution.
Null model =
P(random alignment)
C
T
G
freq
A
alignment score
Getting expectation values for low
complexity/repeat sequences.
Trinucleotide composition model.
after A
after C
A
C
T
G
A
C
T
G
A
C
T
G
A
C
T
G
after T
after G
Only the arrows into the 4 “after A” states are shown
Getting expectation values for low
complexity/repeat sequences.
Motif null model. (Grammatical model.)
Repeats are possibly misspelled words.
A G K V T T T H
N
8 character misspelled-word repeat model, with occasional extra character(s).
Try this: create a HMM for a microsatellite.
•Using Netscape: Go to the NCBI database and download
the nucleotide sequence with GenBank identifier (gi)
21912445
•Import it into UGENE.
•Find the microsatellite that starts at around 330.
Draw a motif HMM.
•Generate and align random microsatellite sequences.
What are the scores?
MARCOIL predicts coiled coils
MARCOIL consists of 9 groups of 7 states. Each of the 7 models a position in the helix, a-f.
There are 4 special pre-coil and 4 post-coil groups, one repeating coil state (5), and one generic
state (0).
connections between groups
Helix positions
a
e
d
g
b
f
c
groups
...
...
MARCOIL. Delorenzi & Speed, 2002
HMM for dicodon preferences: gene design
•
•
Codon preferences exist due to differences in [tRNA] in the cell.
•
Di-codon preferences are preserved* in the DNA sequences of
ORFs in the genome.
•
To find the optimal set of codons for a protein sequence, design a
HMM based on codons, emitting amino acids in parallel. A parallel
HMM. Maximum likelihood transitions between codons are the
dicodon frequencies.
•
Use Viterbi to assign codons to an amino acid sequence.
Di-codon preferences exist due to interactions between neighboring
tRNAs on the ribosome.
vk(t) = MAX vl(t-1) alk bk(st)
l
l= all codons
emissions = 1.00 or 0.00
dicodon frequencies
TMHMM -- transmembrane helices
TM helices
TMHMM (Krogh, 2001) models
transmembrane helices in eukaryotic and
inner-bacterial membranes. First attempt was
a composition-based model, 2 states.
Example of a prediction result using V, F/B
V, M in red
F/B
AA
I
F
L
V
M
W
A
Y
G
C
T
S
P
H
N
Q
D
E
K
R
Tot
TM helices
count
freq
1826
0.120
1370
0.090
2562
0.168
1751
0.115
616
0.040
414
0.027
1657
0.109
615
0.040
1243
0.082
289
0.019
755
0.050
806
0.053
423
0.028
121
0.008
250
0.016
141
0.009
104
0.007
110
0.007
78
0.005
83
0.005
15214
1.000
Other regions
count
freq
2187
0.046
1854
0.039
4156
0.087
2935
0.061
1201
0.025
819
0.017
3382
0.071
1616
0.034
3352
0.070
960
0.020
2852
0.060
3410
0.071
2640
0.055
1085
0.023
2279
0.048
2054
0.043
2551
0.053
2983
0.062
2651
0.055
2933
0.061
47900
1.000
Over-rep
2.61
2.31
1.93
1.89
1.60
1.59
1.54
1.18
1.17
0.95
0.83
0.75
0.51
0.35
0.33
0.21
0.13
0.11
0.09
0.08
TMHMM -- 15 state version
New version has specific pre-helix and
post-helix states and a more reasonable Mlength distribution, ranging from 14-28+,
not 1-28+
Better F/B prediction, but V is worse. Why?
Modularity...
HMMs within HMMs
Connect HMMs by their begin/end states
begin
H
H
L
E
end
begin
We can define a very simple HMM for
secondary structure. But here, each state
is a macro-state emitting a variable
length string of amino acids, from an
internal HMM.
end
The topology of the helix (H) unit used by Asai et al [1993] to predict
secondary structure. The periodicity of amphipathic helices is
approximately modeled by the cycle of states. States 1 and 5 represent
the start and end of the helix, respectively.
Other topologies were explored for E and L macro-states.
Applications of metagenomics
Next-gen sequencing of un-amplified, un-cultured DNA.
•
Soil health
•
Water pollution
•
Human health and nutrition
•
Paleobiology, Paleogenomics
•
Forensic science?
A community if microorganisms is required to recycle nutrients. Farmers
want to know what is there.
Microorganism populations respond to the content of the water.**
The microbial community of the gut, nose/throat, skin, vagina, are indicators
of infected state, risk, and may be biomarkers of cancer.
DNA from frozen mammoth, iceman reveal their diet,
phylogeny.
** A stream contaminated by acid coal mine runoff
and discarded metal accumulates iron hydroxide
and evolves its own bacterial community
Meta genome assembly
ACGGCTCGCTAAT
CGCTAATTGTGACTGC
GACTGCAAAGGCTAGA
AAAGGCTAGACCG
•
•
•
•
•
•
nodes = short sequence reads
edge = sequential order
path = genome
Edges with only one occurence can be pruned (errors tend to NOT occur in the same
place twice).
Bubbles and tips may be pruned by the “Tour bus” algorithm.
Ambiguous paths may represent multiple strains of a species, or very similar species,
and may be separated out using abundances.
Abundance data
close homologs
Species A
1000 copies
Species B
100 copies
This T goes with this G
1000x
100x
ACGGCTCGCTAAT
ACGGCTCGCTAAT
1000x CGCTAATTGCGACTGC
100x CGCTAATTGTGACTGC
1000x GACTGCAAAGGCTAGA
100x GACTGCAAAGGCTAGA
1000x AAAGGCTAGACCG
100x AAAGGCTAGACGG
Relative abundance of mutations should match relative abundance of
species, and can be used to resolve ambiguous assemblies.
Using relative abundance for path finding in De Bruijn graphs
Draw an edge only where overlap is exact.
Initialize all edge weights = number of occurences (arrow thickness)
Identify branched pathways, where there is > 1 way to connect any two verteces.
Classify branches by occurrence weight.
Find a path that stays within an occurrence class.
C
C
C
C
C
G
TAATTGCG
G
G
G
CTAATTGC
G
G
GCTAATTGC/T GACTGCAAAGGCTAGACG/CGTCA
CTAATTGT
TAATTGTG
T
T
T
T
T
C
C
C
C
C
UGENE exercise: meta genome
assembly
1.
2.
3.
Download reference genomes (two) and reads from course web site (metagenomics_ref,
metagenomics_reads)
in UGENE Tools/Align to reference/Align short reads...
1.
Under “reference seqeunce” enter the metagenomics_ref filename.
2.
Under “Short reads” Add the metagenomics_reads file.
3.
Mismatch allowed, set to 10%
4.
Check “align reverse complement”, “Use best-mode”
5.
Start
Determine the number of species present.
1.
2.
Random errors generally do not repeat. So, ignore mutations that don’t occur ≥ twice.
Find multiple strains by looking at the relative abundances of bases in aligned positions
1.
Assuming two strains, how well are the abundances estimated by the sample?
2.
What bases go together in change positions?
Profile HMMs: models for MSAs
State emissions:
I = insert state, one character from the background profile
D = delete state, non-emitting. A connector.
M = match state, one character from a specific profile.
Begin = non-emitting. Source state.
End = non-emitting. Sink state.
All π(q)=0, except π(Begin)=1
To get the scores of a sequence to a profile HMM, we use the F/B algorithm
to get P(End). This is the measure of how well the sequence fits the model.
Then we can test several models.
Make a HMM from Blast data
Score
E
Sequences producing significant alignments:
(bits) Value
gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P...
gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P...
gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P...
gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation ...
gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a...
gi|1361925|pir||S54734 translation elongation factor aEF-1 alpha...
gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation ...
136
59
56
42
40
39
37
QUERY
18977279
14521217
14591052
18313751
729396
1361925
18312680
3
2
1
1
243
236
239
487
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
-MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V
-MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V
--------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
-----------------------------------IVGV-KVL-AGTIKPGVT----L-V
QUERY
18977279
14521217
14591052
18313751
729396
1361925
18312680
60
59
54
53
275
269
272
505
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF---KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY-VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV-------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV-------KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--
5e-32
8e-09
8e-08
9e-04
0.007
0.008
0.060
59
58
53
52
274
268
271
504
109
108
100
99
322
314
317
555
Make a HMM from Blast data
Delete states
begin
Insert states
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
-MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V
-MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V
--------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
-----------------------------------IVGV-KVL-AGTIKPGVT----L-V
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF---KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY-VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV-------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV-------KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--
Match states
end
Paths
through
Profile
HMM
is
like
a
path
through
an
alignment
matrix
A path through an edit graph and the corresponding
path through a profile HMM
Generating a profile HMM from a multiple
sequence alignment
VGA--H
V----N
VEA--D
VKG--VYS--T
FNA--N
IAGADN
base the
model on, say,
this one
Make four
match states
begin
M
M
M
M
end
Generating a profile HMM from a multiple
sequence alignment
base the
model on, say,
this one
VGA--H
V----N
VEA--D
VKG--VYS--T
FNA--N
IAGADN
Make four
match states
Generating a profile HMM from a multiple
sequence alignment
VGA--H
V----N
VEA--D
VKG--VYS--T
FNA--N
IAGADN
Add insertion
states where
there are
insertions.
(red)
I
begin
M
M
M
M
end
Generating a profile HMM from a multiple
sequence alignment
VGA--H
V----N
VEA--D
VKG--VYS--T
FNA--N
IAGADN
Add deletion
states where
there are
deletions. (red
dashes)
D
D
D
I
begin
M
M
M
M
...now optimize using expectation
maximization.
end
Getting profiles for every Match state
w1
w2
w3
w4
w5
w6
w7
VGA--H
V----N
VEA--D
VKG--VYS--T
FNA--N
IAGADN
Count the
frequency of
each amino
acid, scaled by
sequence
weights, w.
D
D
D
I
M
M
P(V ) =
∑w
∑w
begin
M
M
i
si = V
i
all i
bM1(V) = (w1+w2+w3+w4+w5)/
(w1+w2+w3 +w4+w5 +w6+w7)
end
Calculating the probability of a sequence
given the model: P(s|λ)
D
D
D
I
begin
M
M
M
M
end
Sum forward (forward algorithm) using the sequence s.
For each Match state, multiply by the transition (a) and the
profile value, bM(si), and increment i
For each Deletion state, multiply by a, do not increment i.
For each Insertion state, multiply by a, increment i.
Picking a parent sequence
• The parent defines the number of Match states
• A Match state should conserve the chemical nature of the
sidechain as much as possible.
• A Match state implies structural similarity.
Homolog detection using a library of profile HMMs
Get P(S|λ) for each λ
1
2
P(s|λ2)
MYSEQUENCE
3
4
P(s|λ3)
P(s|λ4)
Pick the model with the max P
P(s|λ1)
Added value
In DP, we assumed insertions and deletions were equally
probable, and that the probability was independent of position.
With Profile HMMs we allow insertions and deletions to have
different probabilities, and to be dependent on the position.
HMMs give better alignments than DP.
Pfam: Protein families
A searchable database of multiple sequence
alignments and profile-HMMs.
database
sequence alignments from
curated?
Pfam-A
UniProtKB
yes
Pfam-B
ADDA (Holm)
no
PFAM visualization of profile HMM
•Logos for Match states.
•
--AAAs stacked by probability
--Color is AA type
--Width is relative contribution.
--Height is shannon entropy
Pink bars for insert states
-- thin line means no I state
--dark pink M->I,
-- light pink I->I
Bayes Block Alignment (BBA)
Find all alignments that have at most K gaps.
Sankoff, 1972; Zhu, Liu & Lawrence, 1998
start
D
D
D
D
I
I
I
I
M
M
M
M
k=1
k=2
...
k=K
Maximum number of indels = K = 20 or L/10, whichever is less.
end
Algorithm for BBA forward probabilities
D[k,i,j] =
Σ
D
I
M[k,i-1,j]
I[k,i-1,j]
D[k,i-1,j]
I[k,i,j] =
Σ
M[k,i,j-1]
I[k,i,j-1]
M
M[k,i,j] = LR(i,j) x
Σ
M[k,i-1,j-1]
I[k-1,i-1,j-1]
D[k-1,i-1,j-1]
LR(i,j) = likelihood ratio = substitution probability
The forward product must be scaled
Underflow problems....
Solution: re-scaling at each step and saving the scale factors.
See slide 5
Algorithm for BBA forward sums: M
Illustration of the paths into M[k,i,j], alignment of i to j
after k gaps.
D D
I I
M M
M[k,i,j] = LR(i,j) x
indel
after k
blocks
Σ
M[k,i-1,j-1]
I[k-1,i-1,j-1]
D[k-1,i-1,j-1]
match after k gaps
Algorithm for BBA forward sums: I
D
I
M
I[k,i,j] =
Σ
M[k,i,j-1]
I[k,i,j-1]
after k blocks
Algorithm for BBA forward sums: D
D
I
M
D[k,i,j] =
Σ
M[k,i-1,j]
I[k,i-1,j]
D[k,i-1,j]
after k blocks
“sampleback” instead of traceback.
Start “sampleback” at lower right, M(K,I,J),D(K,I,J), or I(K,I,J) then
do the following:
i=I; j=J; k=K; histogram(..)=0;
While (i>0 and j>0) do
If current state is M,
add 1 to histogram(i,j)
y=D(k-1,i-1,j-1)+I(k-1,i-1,j-1)+M(k,i-1,j-1)
x = random number 0≤x≤1.
If (x < D(k-1,i-1,j-1)/y) then
next state is D(k-1,i-1,j-1)
else if (x<(D(k-1,i-1,j-1)+ I(k-1,i-1,j-1))/y) then
next state is I(k-1,i-1,j-1)
else
next state is M(k,i-1,j-1)
end if
else if current state is I, ... (try filling this in)
else if current state is D, ... (try filling this in)
end do
Illustration of one sample-back
do this 10,000 times
start
Alignment
starts in any
upper-left box.
small orange
boxes show a path
through the blocks
D
D
D
D
I
I
I
I
M
M
M
M
MMDDMMMIIMDDD
AGCGCGC~~TTCA
AG~~CGCCCT~~~
end
Alignment
ends in any
lower-right
box.
Result is a histogram , P(i,j)
j-->
We can plot the probability of a match for every ij. In the example
below the proteins are NOT homologous, but short stretches of
similarity are found nonetheless.
i-->
Testing DP versus BBA for hard cases
(Huang & Bystroff, 2006)
Above: Optimal DP alignments of 1R69 to
1NEQ (distant homolog proteins) using various
subsitution matrices. The last line “BAliBase” is
the true alignment.
Left: Bayesian Adaptive Alignment results for
the same pair. True alignment in black outline.
Color indicates (white: zero probability, blue:
low P, red: high P) probability of a match
between positions in the sequences. Sometimes
the true alignment is sub-optimal.
error rate
BBA is better than DP.
coverage
Review
•
What is a Markov Model?
•
What is a Hidden Markov Model?
•
What kind of problem can a HMM solve?
•
How do you find the best state pathway given a sequence?
•
What does “state pathway” mean?
•
How do you find the probability of a state q at sequence position k given a
sequence s and a HMM λ?
•
What does it mean for a state to “emit”?
•
What use are non-emitting states?
•
How is the joint probability of a sequence and a pathway calculated?
•
What algorithm can be used to determine the maximum likelihood value for a
parameter of a HMM?
•
What is a profile HMM?
•
How is the topology of a profile HMM initialized? How is it refined?
97
Download