Hidden Markov Models
(figure: a profile HMM)

What is a HMM?
• It's a model: it condenses information.
• It models one-dimensional discrete data.
• It is a directed graph: nodes emit discrete symbols, and edges are transitions between nodes.
• What's hidden about it? The node identity.
• Who is Markov?

Markov processes
A Markov process is any process in which the next item depends only on the current item. The dimension can be time, sequence position, etc.

Modeling protein secondary structure using Markov chains
"States" connected by "transitions": H = helix, E = extended (strand), L = loop.

Setting the parameters of a Markov model from data
Secondary structure data:

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLLEEEEELLLLLLLLLLLEEEEEEEEELLLLLEEEEEEEEELL
LLLLLLEEEEEELLLLLEEEEEELLLLLLLLLLLLLLLEEEEEELLLLLEEEEELLLLLLLLLLEEEELLLLEEEELLLLEEE
EEEEELLLLLLEEEEEEEEELLLLLLEELLLLLLLLLLLLLLLLLLLLLLLLEEEEELLLEEEEEELLLLLLLLLLEEEEEEL
LLLLEEELLLLLLLLLLLLLEEEEEEEEELLLEEEEEELLLLLLLLLLLLLLLLLLLHHHHHHHHHHHHHLLLLLLLEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHHLLLLLLLELHHHHHHHHHHHHLLLLLHHHHHHHHHHHHHLLLLLEEEL
HHHHHHHHHHLLLLLLHHHHHHHHHHEELLLLLLHHHHHHHHHHHLLLLLLLHHHHHHHHHHHHHHHHHHHHH
HHHHHHLLLLLLHHHHHHHHHHHHHHHHHHLLLHHHHHHHHHHHHHHLLLLEEEELLLLLLLLLLLLLLLLEEEEL
LLLHHHHHHHHHHHHHHHLLLLLLLLEELLLLLHHHHHHHHHHHHHHLLLLLLEEEEELLLLLLLLLLHHHHHHHH
HHHHHHHHHHHHHHHLLLLLHHHHHHHHHLLLLHHHHHHHLLHHHHHHHHHHHHHHHHHHHH

Count the pairs to get the transition probabilities:

P(L|E) = P(EL)/P(E) = counts(EL)/counts(E)
counts(E) = counts(EE) + counts(EL) + counts(EH)
Therefore: P(E|E) + P(L|E) + P(H|E) = 1.

A transition matrix, P(qt|qt-1):

      H     E     L
H   0.93  0.01  0.06
E   0.01  0.80  0.19
L   0.04  0.06  0.90

(figure: three-state diagram with transitions P(H|H), P(E|H), P(E|E), P(H|E), P(L|E), P(L|H), P(H|L), P(E|L), P(L|L))

**This is a "first-order" Markov model: the transition probabilities depend only on the current state.

P(S|λ), the probability of a sequence, given the model:

P("HHEELL"|λ) = P(H)·P(H|H)·P(E|H)·P(E|E)·P(L|E)·P(L|L)
             = (0.33)(0.93)(0.01)(0.80)(0.19)(0.90)
             = 4.2e-4

P("HHHHHH"|λ) = 0.69: common protein secondary structure.
P("HEHEHE"|λ) = 1e-6: not protein secondary structure.
Probability discriminates between realistic and unrealistic sequences.

What is the maximum likelihood model given a dataset of sequences?

Dataset: six copies of HHEELL.

Count the state pairs:          Normalize by row (the maximum likelihood model):

     H  E  L                         H    E    L
H    1  1  0                   H    0.5  0.5  0
E    0  1  1                   E    0    0.5  0.5
L    0  0  1                   L    0    0    1.0

Is this model too simple?
(figure: frequency histogram of synthetic helix lengths, 1-10, generated from this model, compared with real helix length data*)
*L. Pal et al., J. Mol. Biol. (2003) 326, 273-291

"A model should be as simple as possible but not simpler" --Einstein

A pseudo-higher-order HMM
• A Markov chain for proteins where helices are always exactly 4 residues long: a chain of helix states H1 → H2 → H3 → H4, which then exits to E or L.
• A Markov chain for proteins where helices are always at least 4 residues long: the same chain, but the last helix state may also repeat.
• Can you draw a Markov chain where helices are always a multiple of 4 long?
(figure: transition diagram with states H1-H4, E and L; the labeled probabilities are H1→H2 = H2→H3 = H3→H4 = 1; H4→H1 = 0.5, H4→E = 0.1, H4→L = 0.4; E→H1 = 0.2, E→E = 0.7, E→L = 0.1; L→H1 = 0.2, L→E = 0.2, L→L = 0.6)
Exercise: calculate the probability of EHHHHHLLE.

Example application: a Markov chain for CpG islands
A "saturated" 4-state Markov model: states A, C, G, T, with every transition allowed.
P(ATCGCGTA...) = πA aAT aTC aCG aGC aCG aGT aTA ...

CpG islands
DNA is methylated on C to protect against endonucleases. Using mass spectrometry we can find regions of DNA that are methylated and regions that are not. Regions that are protected from methylation may be functionally important, e.g. transcription factor binding sites.
(figure: a stretch of DNA annotated with methylated (+) and unmethylated (-) regions)
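The counting-and-normalizing procedure on the secondary-structure slides is easy to make concrete. Below is a minimal Python sketch (mine, not from the slides): it trains a first-order Markov chain by counting state pairs and scores a sequence against it. The π values are the assumed starting probabilities used in the P("HHEELL") example above.

    from collections import Counter

    def train(seqs, states="HEL"):
        """Estimate first-order transition probabilities by counting state pairs."""
        counts = Counter()
        for s in seqs:
            for x, y in zip(s, s[1:]):
                counts[(x, y)] += 1
        a = {}
        for x in states:
            row_total = sum(counts[(x, y)] for y in states)
            a[x] = {y: (counts[(x, y)] / row_total if row_total else 0.0)
                    for y in states}
        return a

    def prob(seq, a, pi):
        """P(S|lambda) = pi(s1) times the product of transition probabilities."""
        p = pi[seq[0]]
        for x, y in zip(seq, seq[1:]):
            p *= a[x][y]
        return p

    # transition matrix from the slide; pi is an assumption (0.33 as in the example)
    a = {'H': {'H': 0.93, 'E': 0.01, 'L': 0.06},
         'E': {'H': 0.01, 'E': 0.80, 'L': 0.19},
         'L': {'H': 0.04, 'E': 0.06, 'L': 0.90}}
    pi = {'H': 0.33, 'E': 0.33, 'L': 0.34}
    print(prob("HHEELL", a, pi))             # ~4.2e-4, as on the slide
    print(train(["HHEELL"] * 6)['H']['E'])   # 0.5, matching the ML model above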
During the course of evolution, methylated CpG's get mutated to TpG's: NNNCGNNN → NNNTGNNN.

Using Markov chains for discrimination: CpG islands in human chromosome sequences
Train two models: "+" = CpG rich, "-" = CpG poor.

"+" model:
      A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.385  0.125
T   0.079  0.355  0.384  0.182

"-" model:
      A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

P(CGCG|+) = πC (0.274)(0.339)(0.274) = πC × 0.0255
P(CGCG|-) = πC (0.078)(0.246)(0.078) = πC × 0.0015

From Durbin, Eddy, Krogh and Mitchison, "Biological Sequence Analysis" (1998), p. 50.

Rabiner's notation
ayx = P(x|y) = P(y,x)/P(y) = F(y,x)/F(y) ... the conditional probability of x given y, i.e. the transition probability from state y to state x.
πx = P(x) = F(x)/N ... the unconditional probability of being in state x (used to start a state pathway).
bx(y) = P(y|x) ... the conditional probability of emitting character y given state x.

Comparing two MMs: the log likelihood ratio (LLR)

LLR = log Π(i=1..L) [ a+(xi-1,xi) / a-(xi-1,xi) ]
    = Σ(i=1..L) log [ a+(xi-1,xi) / a-(xi-1,xi) ]
    = Σ(i=1..L) β(xi-1,xi)

Log-likelihood ratios (log base 2) for transitions, β:
      A      C      G      T
A  -0.740  0.419  0.580  -0.803
C  -0.913  0.302  1.812  -0.685
G  -0.624  0.461  0.331  -0.730
T  -1.169  0.573  0.393  -0.679

Sum the LLRs. If the result is positive, it's a CpG island; otherwise it is not.
LLR(CGCG) = 1.812 + 0.461 + 1.812 = 4.085 → yes.

Combining two Markov chains to make a hidden Markov model
A hidden Markov model can have multiple paths for a sequence.
(figure: the "+" model and the "-" model, each with states A, C, G, T, joined by transitions between the +/- sets of states)
In a hidden Markov model (HMM), there is no one-to-one correspondence between the state and the emitted symbol.

Probability of a sequence using a HMM
Different state sequences can produce the same emitted sequence.
Nucleotide sequence (S): CGCG
State sequences (Q): C+G+C+G+, C-G-C-G-, C+G+C-G-, C+G-C-G+, etc.

P(CGCG|λ) = Σ(all paths Q) P(S,Q)
          = πC+ aC+G+ aG+C+ aC+G+
          + πC- aC-G- aG-C- aC-G-
          + πC+ aC+G+ aG+C- aC-G-
          + πC+ aC+G- aG-C- aC-G+
          + ...

Each state sequence has a probability. The sum over all state sequences that emit CGCG is P(CGCG).

Three HMM algorithms
1. The Viterbi algorithm: get the optimal state pathway (the maximum joint probability).
2. The Forward/Backward algorithm: get the probability of each state at each position (a sum over all joint probabilities).
3. Expectation/Maximization: refine the parameters of the model using the data.

Back to secondary structure prediction: a parallel HMM emits secondary structure and amino acid
The marble bag attached to each state represents a probability distribution over amino acids, b (a profile), e.g. bH(i) for the helix state. A probability distribution is a set of probabilities (0 ≤ p ≤ 1) that sum to 1.
• Each state emits one amino acid from its marble bag on each visit.
• States emit both an amino acid and a secondary structure symbol.
Given an amino acid sequence, what is the most probable state sequence (the secondary structure)?

HMM data structure for a parallel HMM, in Fortran:

    type HMMNODE
       integer :: id
       type (HMMEDGE), pointer :: a(:)   ! outgoing transitions
       real :: b(20)                     ! amino acid emission profile
       real, pointer :: emit(:)          ! optional parallel emissions
       logical :: emitting
    end type HMMNODE

    type HMMEDGE
       integer :: id
       real :: p                         ! transition probability
       type (HMMNODE), pointer :: q      ! target node
    end type HMMEDGE

    type (HMMNODE), pointer :: hmm_root

A linked data structure. hmm_root should be the "begin" state, with hmm_root%emitting set to .false. If emitting is .true., then the state emits an amino acid from profile b, and optionally something else, called emit(:).
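Before the in-class exercise that follows, here is the LLR discrimination from the CpG slides as a minimal Python sketch (illustrative, not course code), using the β table above; it reproduces LLR(CGCG) = 4.085.

    # beta[x][y] = log2( a+_xy / a-_xy ), copied from the table above
    beta = {
        'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
        'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
        'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
        'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
    }

    def llr(seq):
        """Sum beta over adjacent pairs; positive => CpG island, negative => not."""
        return sum(beta[x][y] for x, y in zip(seq, seq[1:]))

    print(llr("CGCG"))  # 1.812 + 0.461 + 1.812 = 4.085 -> CpG island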
In-class exercise: what's the LLR? What is the LLR that this sequence is a CpG island?

ATGTCTTAGCGCGATCAGCGAAAGCCACG

β:
      A      C      G      T
A  -0.740  0.419  0.580  -0.803
C  -0.913  0.302  1.812  -0.685
G  -0.624  0.461  0.331  -0.730
T  -1.169  0.573  0.393  -0.679

LLR = Σ(i=1..L) β(xi-1,xi) = _______________

HMM: assigning the states given the sequence is not as easy.
Typically, when using a HMM, the task is to determine the optimal state pathway given the sequence. The state pathway provides some predictive feature, such as secondary structure, splice site/not splice site, CpG island/not CpG island, etc.
In principle, we can do this task by trying all state pathways Q and choosing the optimal one. In practice, this is usually impossible, because the number of pathways increases as the number of states raised to the power of the sequence length, i.e. O(N^T). How do we do it, then? Maximize the joint probability of a sequence and pathway.

Q = {q1,q2,q3,...,qT} = sequence of Markov states, or pathway
S = {s1,s2,s3,...,sT} = sequence of amino acids or nucleotides
T = length of S and Q

Joint probability of a pathway and sequence, given a HMM λ. For example, S = AGPLVD with pathway Q = HHEEEL:

πH bH(A) × aHH bH(G) × aHE bE(P) × aEE bE(L) × aEE bE(V) × aEL bL(D)

Joint probability: general expression for pathway Q through HMM λ:

P(S,Q|λ) = πq1 Π(t=1..T) bqt(st) aqt,qt+1 **

**when t=T, there is no qt+1; use a = 1.

The Viterbi algorithm: the maximum probability path
For all states k:
vk(t) = MAX over l of [ vl(t-1) alk ] bk(st)
Trck(t) = ARGMAX over l of [ vl(t-1) alk ]
Recursive: we save the value v and also a traceback arrow Trc as we go along. Plot state versus position: each v is a MAX over the whole previous column of v's. When t = T, the last position, the traceback arrows from the MAX give the optimal state sequence.

Exercise: write the Viterbi algorithm.
(figure: a small six-state HMM; states 1..L, positions 1..T)

initialize vk(1) = πk bk(s1)
for t = 2,T {
   for k = 1,L {
      vk(t) = MAX over l of vl(t-1) alk bk(st)
      Trck(t) = ARGMAX over l of vl(t-1) alk
   }
}

α: the Forward algorithm, all paths to a state
αk(t) = Σ over l of αl(t-1) alk bk(st)
α is the forward probability; a is the "arrow" (transition) between states. "Forward" stands for forward recursion. After the first column, each α depends on the whole previous column of α's. αk(t) is the sum of P over all paths up to state k at position t. At the end of the sequence, when t = T, the sum of αk(T) equals the total probability of the sequence given the model, P(S|λ).

β: the Backward algorithm, all paths from a state
"Backward" stands for backward recursion: the algorithm starts at t = T, the end of the sequence. (The transitions are still forward.)
βk(t) = Σ over l of βl(t+1) akl bl(st+1)
Each β depends on the whole next column of β's: βk(t) sums over all paths from state k to the end, starting at t+1. At the beginning of the sequence, when t = 1, the sum of πk bk(s1) βk(1) over k equals the total probability of the sequence given the model, P(S|λ).

Exercise: write the Forward algorithm.
initialize αk(1) = πk bk(s1)
for t = 2,T {
   for k = 1,L {
      αk(t) = Σ over l of αl(t-1) alk bk(st)
   }
}

γ: the Forward/Backward algorithm, all paths through a state
γk(t) = αk(t) βk(t), normalized by P(S|λ) so that Σk γk(t) = 1
γk(t) is the total probability of state k at position t, given the sequence S and the model λ.
(figure: state k at position t is the bottleneck through which all paths must travel)
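Returning to the Viterbi exercise above: one possible solution, as an illustrative Python sketch (names are my own). It fills the v and Trc tables column by column, then traces back from the best final state.

    def viterbi(seq, states, pi, a, b):
        """v[k][t] = max over l of v[l][t-1]*a[l][k], times b[k][s_t]; trc saves the argmax."""
        T = len(seq)
        v   = {k: [0.0] * T for k in states}
        trc = {k: [None] * T for k in states}
        for k in states:
            v[k][0] = pi[k] * b[k][seq[0]]       # initialize: v_k(1) = pi_k b_k(s1)
        for t in range(1, T):
            for k in states:
                best = max(states, key=lambda l: v[l][t-1] * a[l][k])
                v[k][t] = v[best][t-1] * a[best][k] * b[k][seq[t]]
                trc[k][t] = best
        k = max(states, key=lambda l: v[l][T-1])  # best final state
        path = [k]
        for t in range(T - 1, 0, -1):             # follow the traceback arrows
            k = trc[k][t]
            path.append(k)
        return "".join(reversed(path))

With the H/E/L transition matrix, a per-state amino acid profile b, and starting probabilities pi, viterbi(sequence, "HEL", pi, a, b) returns the most probable secondary-structure pathway.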
Expectation/Maximization: refining the model
Example: refining bk(G) (i.e. the number of glycines in the k-th marble bag). All parameters can be refined simultaneously.
First calculate P(k|t) for all states k and all positions t, using Forward/Backward.
Step 1) Sum the probability of state k over all positions t where the amino acid is G.
Step 2) Normalize by dividing by the sum of P(k) over all positions.
Step 3) Set bk(G) to that value.
Step 4) Repeat for all states k in λ and all 20 amino acids.
Recalculate P(k|t) using the new parameters. Repeat steps 1-4. Iterate to convergence.

To count the glycines, we calculate the Forward/Backward value for state k at every glycine in the database, then sum them:

P(k | t, S, λ) = Σ(all paths through k at t) = γk(t) = αk(t) βk(t)

b'k(G) = Σ(over all G in all sequences S) γk(t)

(figure: database sequences with the G positions highlighted)
bk is then normalized to sum to 1 over all 20 amino acids.

Expectation/Maximization: refining ajk, the probability of a transition from state j to state k
Step 1) Get the probability of ending in state j at position t → αj(t)
Step 2) Get the probability of starting in state k at position t+1 → βk(t+1)
Step 3) Multiply these by the current ajk (and by the emission bk(st+1)).
Step 4) Do steps 1-3 for all positions t and all sequences S. Sum → a'. Then normalize.
Reset ajk in the new model to a'. Do steps 1-4 using the new model. Repeat until convergence.

a' = Σ(S) Σ(t) αj(t) ajk bk(st+1) βk(t+1)

After summing, all the a' are normalized to sum to 1.

Exercise: draw this HMM. What is the probability of this sequence? (figure on slide)

Unrolling the Viterbi algorithm: underflow problems, solved by going to log space
Algorithm: vk(t) = MAX over l of vl(t-1) alk bk(st)
Algorithm unrolled:
vk(2) = bq1(s1) aq1,q2 bq2(s2)
vk(3) = bq1(s1) aq1,q2 bq2(s2) aq2,q3 bq3(s3)
vk(4) = bq1(s1) aq1,q2 bq2(s2) aq2,q3 bq3(s3) aq3,q4 bq4(s4)
...
Why will this calculation fail for vk(100)? Hint: try multiplying 200 numbers (all between 0 and 1) together.

Viterbi algorithm underflow: log space solution
Log(vk(t)) = MAX over l of [ Log(vl(t-1)) + Log(alk) + Log(bk(st)) ]

The Forward algorithm: underflow problem
Algorithm: αk(t) = Σ over l of αl(t-1) alk bk(st)
Algorithm unrolled (four states):
αk(2) = bk(s2) [ α1(1) a1k + α2(1) a2k + α3(1) a3k + α4(1) a4k ]
αk(3) = bk(s3) [ b1(s2) [α1(1) a11 + α2(1) a21 + α3(1) a31 + α4(1) a41] a1k
              + b2(s2) [α1(1) a12 + α2(1) a22 + α3(1) a32 + α4(1) a42] a2k
              + b3(s2) [α1(1) a13 + α2(1) a23 + α3(1) a33 + α4(1) a43] a3k
              + b4(s2) [α1(1) a14 + α2(1) a24 + α3(1) a34 + α4(1) a44] a4k ]
...
We can't do this one in log space, because Log(a+b) can't be simplified.
Solution: dynamic scaling to keep the numbers within a reasonable range.

The Forward algorithm: underflow problem, scaling solution
αk(t) = Σ over l of αl(t-1) alk bk(st)
Let α'k(2) = c2 Σl αl(1) alk bk(s2), where c2 is any scale factor.
And α'k(3) = c3 Σl α'l(2) alk bk(s3), where c3 is any scale factor.
And so on, for all sequence positions t:
α'k(t) = ct Σl α'l(t-1) alk bk(st)
If we choose ct = 1/Σk α'k(t) (the reciprocal of the column sum before scaling), then the scaled values at each position sum to 1.
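A sketch of the scaled Forward recursion just described (illustrative): each column of α is renormalized to sum to 1 and the scale factors are accumulated. Since P(S|λ) equals the product of the unscaled column sums, i.e. Π(1/ct), the log-probability is recovered as -Σ log ct.

    import math

    def forward_scaled(seq, states, pi, a, b):
        """Scaled Forward algorithm: returns log P(S|lambda) without underflow."""
        logP = 0.0
        alpha = {k: pi[k] * b[k][seq[0]] for k in states}
        for t in range(len(seq)):
            if t > 0:
                alpha = {k: b[k][seq[t]] * sum(alpha[l] * a[l][k] for l in states)
                         for k in states}
            c = 1.0 / sum(alpha.values())                 # scale factor c_t
            alpha = {k: c * v for k, v in alpha.items()}  # scaled column sums to 1
            logP -= math.log(c)                           # log P(S|lambda) = -sum(log c_t)
        return logP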
The Forward/Backward algorithm: the scaling solution does not change gamma
Rewriting:
α'k(t) = ct Σl (Π(i=1..t-1) ci) αl(t-1) alk bk(st)
So, α'k(t) = (Π(i=1..t) ci) αk(t)
If we apply the same scale factors to the backward value β, then
β'k(t) = (Π(i=t..T) ci) βk(t)
Then the calculation of the a posteriori value γ is
γ'k(t) = α'k(t) β'k(t)
       = (Π(i=1..t) ci) αk(t) (Π(i=t..T) ci) βk(t)
       = (Π(i=1..T) ci) ct αk(t) βk(t)
       = CT ct γk(t),   where CT = Π(i=1..T) ci
Since γ is normalized to sum to 1,
γ'k(t) = CT ct γk(t) / Σk CT ct γk(t) = γk(t)
Gamma from the scaled summation is the same as gamma unscaled.

How do you define the TOPOLOGY of the HMM?
how much wood would a wood chuck chuck if a wood chuck would chuck wood?
can you can a can as a canner can can a can?
I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won't wish the wish you wish to wish.

1   tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga
61  tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg

Transposable elements: junk dealers
Transposable elements ("jumping genes") lead to rapid germline variation.
Barbara McClintock: "Out standing in her field."
Transposase, transposasome: excision of a transposon may leave a "scar".
TR = tandem repeat; IR = inverted repeat.
(figure: a transposon flanked by TRs and IRs, a cruciform structure, and the repaired DNA with a copied TR and an added IR)

Millions of years of accumulated TE "scars"
Some genomes contain a large accumulation of transposon scars. Estimated transposable element-associated DNA content in selected genomes: H. sapiens ~35% TEs, Z. mays >50% TEs; much smaller fractions in Drosophila, Arabidopsis, C. elegans and S. cerevisiae (slide labels: 15%, 2%, 1.8%, 3.1%).

How do you recognize a repeat sequence?
• High-scoring self-alignments
• High dot plot density
• Compositional bias
(figure: a repeat region in a dot plot; see the sketch below)

Types of repeat sequences
• Satellites: 1000+ bp, in heterochromatin (centromeres, telomeres).
• Simple Sequence Repeats (SSRs), in euchromatin:
  - Minisatellites: ~15 bp (VNTRs)
  - Microsatellites: 2-6 bp
(heterochromatin = compact, light bands; euchromatin = loose, dark bands)

Microsatellite:

541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc
601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc
661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct
721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa
781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg
841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac

a microsatellite in a dog (Canis familiaris) gene
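The GT microsatellite above is easy to see in a dot plot. Here is a tiny illustrative sketch (mine) of the self-comparison idea from the "high dot plot density" bullet: a tandem repeat shows up as a dense block of parallel diagonals.

    def dotplot(seq, window=4):
        """Print a '*' wherever two windows of the sequence match exactly."""
        n = len(seq) - window + 1
        for i in range(n):
            print("".join("*" if seq[i:i+window] == seq[j:j+window] else "."
                          for j in range(n)))

    dotplot("aggtgtgtgtgtgtgtgtga")  # the GT repeat fills the plot with diagonals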
Minisatellite
The repeat sequence shown above (under "How do you define the TOPOLOGY of the HMM?") is a minisatellite: an 8 bp tandem repeat with the consensus sequence AGGATTTT, but it is almost never a perfect match to the consensus.

Fun with bioinformatics jargon: ACRONYMS for satellites and transposons

SSR    Short Sequence Repeat
STR    Short Tandem Repeat
VNTR   Variable Number Tandem Repeat
LTR    Long Terminal Repeat
LINE   Long Interspersed Nuclear Element
SINE   Short Interspersed Nuclear Element
MITE   Miniature Inverted-repeat Transposable Element (class III TE)
TE     Transposable Element
IS     Insertion Sequence
IR     Inverted Repeat
RT     Reverse Transcriptase
TPase  Transposase

Class I TEs use RT; Class II TEs use TPase; Class III TEs are MITEs.*
Alu: 11% of the primate genome (a SINE). LINE1: 14.6% of the human genome. Tn7, Tn3, Tn10, Mu, IS50: transposons or transposable bacteriophage.
*Class III TEs are now merged with Class II.
retroposon = retrotransposon

How significant is that?
Please give me a number for how likely it is that the data would not have been the result of chance: a specific inference, as opposed to a vague claim of similarity.

(How) do you align repeat sequences?
A: Don't align; mask them out instead.
B: Dynamic programming with a special null model. (Use an EVD to fit the random scores.)
Remember: low-complexity repeat sequences will have high-scoring alignments randomly. For example, a random A/T repeat:
ATTTATATAATTAATATATAAATATAATAAATAT
aligned to
TATTATATATATATATATATTATATATATATATA
The random score has >50% identity!

How do I align repeat sequences?
• Align using dynamic programming.
• Assess significance using an extreme value distribution fit to random data generated from a HMM.
• The results are e-values. Use the e-values to build multiple sequence alignments, phylogenetic trees, etc.

Generating random low-complexity sequences, to estimate significance
Dinucleotide composition model: generate random sequences based on the dinucleotide model, then align them to generate a random score distribution. Null model = P(random alignment).
(figure: a four-state A/C/G/T chain, and the resulting histogram of random alignment scores)

Getting expectation values for low-complexity/repeat sequences
Trinucleotide composition model: sixteen states, grouped as "after A", "after C", "after G", and "after T", each group containing A, C, T, and G states. (figure: only the arrows into the four "after A" states are shown)

Motif null model (a grammatical model)
Repeats are possibly misspelled words.
(figure: an 8-character misspelled-word repeat model, A G K V T T T H N, with occasional extra character(s))

Try this: create a HMM for a microsatellite
• In a web browser, go to the NCBI database and download the nucleotide sequence with GenBank identifier (gi) 21912445.
• Import it into UGENE.
• Find the microsatellite that starts at around position 330. Draw a motif HMM.
• Generate and align random microsatellite sequences (see the sketch below). What are the scores?
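For the last step of the exercise, here is an illustrative sketch (mine) of the dinucleotide null model from the earlier slide: count dinucleotides in a repeat region, then sample random sequences with the same local composition. Aligning many such samples gives the random score distribution to which an EVD can be fit.

    import random
    from collections import Counter

    def dinuc_model(seq):
        """First-order (dinucleotide) model: P(y|x) estimated from pair counts."""
        counts = Counter(zip(seq, seq[1:]))
        model = {}
        for x in set(seq):
            total = sum(counts[(x, y)] for y in "ACGT")
            if total:
                model[x] = {y: counts[(x, y)] / total for y in "ACGT"}
        return model

    def sample(model, start, length):
        """Generate a random sequence from the dinucleotide model."""
        s = [start]
        for _ in range(length - 1):
            nxt = model[s[-1]]
            s.append(random.choices("ACGT", weights=[nxt[y] for y in "ACGT"])[0])
        return "".join(s)

    model = dinuc_model("AGGATTTTAGGATTTTAGGATTTTAGGATTTT")  # toy training repeat
    print(sample(model, "A", 40))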
MARCOIL predicts coiled coils
MARCOIL consists of 9 groups of 7 states. Each of the 7 states models a position in the helix heptad, a-g. There are 4 special pre-coil groups and 4 post-coil groups, one repeating coil group (5), and one generic state (0).
(figure: connections between groups; helix heptad positions a-g)
MARCOIL: Delorenzi & Speed, 2002.

HMM for dicodon preferences: gene design
• Codon preferences exist due to differences in [tRNA] in the cell.
• Di-codon preferences exist due to interactions between neighboring tRNAs on the ribosome.
• Di-codon preferences are preserved in the DNA sequences of ORFs in the genome.
• To find the optimal set of codons for a protein sequence, design a HMM based on codons, emitting amino acids in parallel: a parallel HMM. The maximum-likelihood transitions between codons are the dicodon frequencies.
• Use Viterbi to assign codons to an amino acid sequence:
  vk(t) = MAX over l of vl(t-1) alk bk(st)
  where l runs over all codons, the emissions are 1.00 or 0.00, and the transition probabilities are the dicodon frequencies.

TMHMM -- transmembrane helices
TMHMM (Krogh, 2001) models transmembrane helices in eukaryotic and inner-bacterial membranes. The first attempt was a composition-based model with 2 states.
(figure: an example prediction result, comparing Viterbi (V, in red) with Forward/Backward (F/B))

Amino acid composition in TM helices versus other regions:

AA    TM helices        Other regions       Over-rep
      count    freq     count    freq
I      1826   0.120      2187   0.046        2.61
F      1370   0.090      1854   0.039        2.31
L      2562   0.168      4156   0.087        1.93
V      1751   0.115      2935   0.061        1.89
M       616   0.040      1201   0.025        1.60
W       414   0.027       819   0.017        1.59
A      1657   0.109      3382   0.071        1.54
Y       615   0.040      1616   0.034        1.18
G      1243   0.082      3352   0.070        1.17
C       289   0.019       960   0.020        0.95
T       755   0.050      2852   0.060        0.83
S       806   0.053      3410   0.071        0.75
P       423   0.028      2640   0.055        0.51
H       121   0.008      1085   0.023        0.35
N       250   0.016      2279   0.048        0.33
Q       141   0.009      2054   0.043        0.21
D       104   0.007      2551   0.053        0.13
E       110   0.007      2983   0.062        0.11
K        78   0.005      2651   0.055        0.09
R        83   0.005      2933   0.061        0.08
Tot   15214   1.000     47900   1.000

TMHMM -- 15-state version
The new version has specific pre-helix and post-helix states and a more reasonable membrane-helix length distribution, ranging over 14-28+ rather than 1-28+. Better F/B prediction, but Viterbi is worse. Why?

Modularity: HMMs within HMMs
Connect HMMs by their begin/end states. We can define a very simple HMM for secondary structure: begin → {H, E, L} → end. But here, each state is a macro-state emitting a variable-length string of amino acids from an internal HMM.
(figure: the topology of the helix (H) unit used by Asai et al. [1993] to predict secondary structure. The periodicity of amphipathic helices is approximately modeled by the cycle of states; states 1 and 5 represent the start and end of the helix, respectively. Other topologies were explored for the E and L macro-states.)

Applications of metagenomics
Next-gen sequencing of un-amplified, un-cultured DNA.
• Soil health: a community of microorganisms is required to recycle nutrients, and farmers want to know what is there.
• Water pollution: microorganism populations respond to the content of the water.**
• Human health and nutrition: the microbial communities of the gut, nose/throat, skin, and vagina are indicators of infected state and risk, and may be biomarkers of cancer.
• Paleobiology, paleogenomics: DNA from the frozen mammoth and iceman reveals their diet and phylogeny.
• Forensic science?
**A stream contaminated by acid coal mine runoff and discarded metal accumulates iron hydroxide and evolves its own bacterial community.

Metagenome assembly

ACGGCTCGCTAAT
      CGCTAATTGTGACTGC
                GACTGCAAAGGCTAGA
                        AAAGGCTAGACCG

• nodes = short sequence reads
• edge = sequential order (overlap)
• path = genome
• Edges with only one occurrence can be pruned (errors tend to NOT occur in the same place twice).
• Bubbles and tips may be pruned by the "Tour bus" algorithm.
• Ambiguous paths may represent multiple strains of a species, or very similar species, and may be separated out using abundances (a sketch of the graph construction follows below).
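An illustrative sketch (mine) of the graph construction described above: break the reads into k-mers, connect exact (k-1)-overlaps, and weight each edge by its number of occurrences, so that single-occurrence (likely error) edges can be pruned. With these four toy reads every edge occurs once; with real sequencing depth, true edges occur many times.

    from collections import Counter

    def de_bruijn(reads, k=8):
        """One edge (prefix -> suffix) per k-mer occurrence; weight = occurrence count."""
        edges = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i+k]
                edges[(kmer[:-1], kmer[1:])] += 1
        return edges

    reads = ["ACGGCTCGCTAAT", "CGCTAATTGTGACTGC",
             "GACTGCAAAGGCTAGA", "AAAGGCTAGACCG"]
    graph = de_bruijn(reads, k=8)
    pruned = {e: w for e, w in graph.items() if w > 1}  # drop single-occurrence edges
    print(len(graph), "edges before pruning,", len(pruned), "after")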
Abundance data
Species A: 1000 copies. Species B (a close homolog): 100 copies.

1000x ACGGCTCGCTAAT          100x ACGGCTCGCTAAT
1000x CGCTAATTGCGACTGC       100x CGCTAATTGTGACTGC
1000x GACTGCAAAGGCTAGA       100x GACTGCAAAGGCTAGA
1000x AAAGGCTAGACCG          100x AAAGGCTAGACGG

"This T goes with this G": the 100x T variant co-occurs with the 100x G variant. The relative abundance of mutations should match the relative abundance of species, and can be used to resolve ambiguous assemblies.

Using relative abundance for path finding in De Bruijn graphs
• Draw an edge only where the overlap is exact.
• Initialize all edge weights = number of occurrences (arrow thickness).
• Identify branched pathways, where there is more than one way to connect two vertices.
• Classify branches by occurrence weight.
• Find a path that stays within an occurrence class.
(figure: a bubble in the graph at GCTAATTGC/T; the heavy 1000x path reads ...TAATTGCG... and the light 100x path reads ...TAATTGTG..., and the paths rejoin afterward)

UGENE exercise: metagenome assembly
1. Download the two reference genomes and the reads from the course web site (metagenomics_ref, metagenomics_reads).
2. In UGENE: Tools / Align to reference / Align short reads...
   a. Under "Reference sequence", enter the metagenomics_ref filename.
   b. Under "Short reads", add the metagenomics_reads file.
   c. Set mismatches allowed to 10%.
   d. Check "align reverse complement" and "use best mode".
   e. Start.
3. Determine the number of species present.
   a. Random errors generally do not repeat, so ignore mutations that do not occur at least twice.
   b. Find multiple strains by looking at the relative abundances of bases at aligned positions (see the sketch below).
      - Assuming two strains, how well are the abundances estimated by the sample?
      - Which bases go together at the change positions?
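For the last two questions, an illustrative sketch (mine, a simplification): drop singleton mutations as probable sequencing errors, then split each variable column of the alignment into a majority and a minority allele. The minority alleles across columns belong together in the low-abundance strain, as in the "this T goes with this G" example above.

    def split_strains(columns, min_count=2):
        """columns: per-position base counts from the read alignment.
        Ignore singletons (likely errors); at variable positions, give the
        majority base to strain A and the minority base to strain B."""
        strainA, strainB = "", ""
        for col in columns:
            bases = {b: n for b, n in col.items() if n >= min_count}
            major = max(bases, key=bases.get)
            minor = min(bases, key=bases.get) if len(bases) > 1 else major
            strainA += major
            strainB += minor
        return strainA, strainB

    # toy columns mimicking the 1000x/100x example above
    cols = [{'G': 1100}, {'C': 1099, 'T': 1},
            {'G': 1000, 'T': 100}, {'C': 1000, 'G': 100}]
    print(split_strains(cols))  # ('GCGC', 'GCTG'): the 100x T and 100x G go together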
Profile HMMs: models for MSAs
State emissions:
• I = insert state: emits one character from the background profile.
• D = delete state: non-emitting; a connector.
• M = match state: emits one character from a position-specific profile.
• Begin = non-emitting source state. End = non-emitting sink state.
All π(q) = 0, except π(Begin) = 1.
To get the score of a sequence against a profile HMM, we use the F/B algorithm to get P(End). This is the measure of how well the sequence fits the model. Then we can test several models.

Make a HMM from BLAST data

Sequences producing significant alignments:                           Score (bits)  E value
gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P...      136       5e-32
gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P...       59       8e-09
gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P...       56       8e-08
gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation ...       42       9e-04
gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a...       40       0.007
gi|1361925|pir||S54734 translation elongation factor aEF-1 alpha...       39       0.008
gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation ...       37       0.060

QUERY      3 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V  59
18977279   2 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V  58
14521217   1 -MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V  53
14591052   1 -MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V  52
18313751 243 --------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V 274
729396   236 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 268
1361925  239 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 271
18312680 487 -----------------------------------IVGV-KVL-AGTIKPGVT----L-V 504

QUERY     60 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 109
18977279  59 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 108
14521217  54 --RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF-- 100
14591052  53 --KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--  99
18313751 275 -VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL---- 322
729396   269 ---FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV----- 314
1361925  272 ---FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV----- 317
18312680 505 --KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY-- 555

(figure: the same alignment with columns assigned to Match states, insertions assigned to Insert states, and gaps assigned to Delete states, between the begin and end states)

Paths through a profile HMM
A path through a profile HMM is like a path through an alignment matrix: a path through an edit graph corresponds to a path through the profile HMM. (figure)

Generating a profile HMM from a multiple sequence alignment

VGA--H
V----N
VEA--D
VKG---
VYS--T
FNA--N
IAGADN

• Base the model on one of the sequences, say VGA--H: make four Match states.
  begin → M → M → M → M → end
• Add Insertion states where there are insertions (e.g. the AD in IAGADN).
• Add Deletion states where there are deletions (the dashes).
• ...now optimize using expectation maximization.

Getting profiles for every Match state
Sequences have weights w1...w7. Count the frequency of each amino acid, scaled by the sequence weights w:

P(V) = Σ(i: si = V) wi / Σ(all i) wi

For the first Match state (first column V,V,V,V,V,F,I):
bM1(V) = (w1+w2+w3+w4+w5) / (w1+w2+w3+w4+w5+w6+w7)
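A sketch of the weighted counting formula above (illustrative): compute the emission profile b_M for one Match-state column from sequence weights. With equal weights it reproduces bM1(V) = 5/7; the gap handling is my assumption (the slide's column has no gaps).

    def match_emissions(column, weights):
        """b_M(aa) = (sum of weights of sequences showing aa) / (total weight).
        column: one MSA column, e.g. "VVVVVFI"; weights: one per sequence."""
        total = sum(w for aa, w in zip(column, weights) if aa != '-')  # gaps don't count
        b = {}
        for aa, w in zip(column, weights):
            if aa != '-':
                b[aa] = b.get(aa, 0.0) + w / total
        return b

    print(match_emissions("VVVVVFI", [1.0] * 7))  # {'V': 5/7, 'F': 1/7, 'I': 1/7}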
Calculating the probability of a sequence given the model: P(s|λ)
Sum forward (the Forward algorithm) using the sequence s:
• For each Match state, multiply by the transition probability a and by the profile value bM(si), and increment i.
• For each Deletion state, multiply by a; do not increment i.
• For each Insertion state, multiply by a, and increment i.

Picking a parent sequence
• The parent defines the number of Match states.
• A Match state should conserve the chemical nature of the side chain as much as possible.
• A Match state implies structural similarity.

Homolog detection using a library of profile HMMs
Get P(s|λ) for each model λ1, λ2, λ3, λ4, ... and pick the model with the maximum P.

Added value
In DP, we assumed insertions and deletions were equally probable, and that the probability was independent of position. With profile HMMs we allow insertions and deletions to have different probabilities, and to be dependent on the position. HMMs give better alignments than DP.

Pfam: protein families
A searchable database of multiple sequence alignments and profile HMMs.

database   sequence alignments from   curated?
Pfam-A     UniProtKB                  yes
Pfam-B     ADDA (Holm)                no

PFAM visualization of a profile HMM
• Logos for Match states: amino acids stacked by probability; color is the amino acid type; width is the relative contribution; height is the Shannon entropy.
• Pink bars for Insert states: a thin line means no Insert state; dark pink shows M→I, light pink shows I→I.

Bayes Block Alignment (BBA)
Find all alignments that have at most K gaps. (Sankoff, 1972; Zhu, Liu & Lawrence, 1998)
(figure: start → columns of D, I, M states for k = 1, 2, ..., K → end)
Maximum number of indels: K = 20 or L/10, whichever is less.

Algorithm for BBA forward probabilities

D[k,i,j] = M[k,i-1,j] + I[k,i-1,j] + D[k,i-1,j]
I[k,i,j] = M[k,i,j-1] + I[k,i,j-1]
M[k,i,j] = LR(i,j) × ( M[k,i-1,j-1] + I[k-1,i-1,j-1] + D[k-1,i-1,j-1] )

LR(i,j) = likelihood ratio = substitution probability.
M[k,i,j] sums the paths that align i to j: continuing a match block after k gaps (M[k,i-1,j-1]) or starting a new block after an indel (I[k-1,i-1,j-1], D[k-1,i-1,j-1]). D[k,i,j] and I[k,i,j] sum the paths ending in a deletion or insertion after k blocks.
The forward product must be scaled: underflow problems. Solution: re-scaling at each step and saving the scale factors (see the Forward-algorithm scaling solution above).

"Sampleback" instead of traceback
Start the sampleback at the lower right, at M(K,I,J), D(K,I,J), or I(K,I,J), then do the following:

i=I; j=J; k=K; histogram(..)=0
while (i>0 and j>0) do
   if current state is M, add 1 to histogram(i,j)
      y = D(k-1,i-1,j-1) + I(k-1,i-1,j-1) + M(k,i-1,j-1)
      x = random number, 0 ≤ x ≤ 1
      if (x < D(k-1,i-1,j-1)/y) then next state is D(k-1,i-1,j-1)
      else if (x < (D(k-1,i-1,j-1) + I(k-1,i-1,j-1))/y) then next state is I(k-1,i-1,j-1)
      else next state is M(k,i-1,j-1)
      end if
   else if current state is I, ... (try filling this in)
   else if current state is D, ... (try filling this in)
   end if
end do

Do this 10,000 times.

Illustration of one sample-back
(figure: the alignment starts in any upper-left box and ends in any lower-right box; small orange boxes show one path through the blocks)

MMDDMMMIIMDDD
AGCGCGC~~TTCA
AG~~CGCCCT~~~

The result is a histogram, P(i,j): we can plot the probability of a match for every i,j.
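The forward pass that generates these sums, transcribed into Python (illustrative, not the authors' code). The initialization is a simplifying assumption, namely that a first match block may start at any cell, and the rescaling described above is omitted for clarity, so this is only safe for short sequences.

    import numpy as np

    def bba_forward(x, y, LR, K):
        """Forward sums over all block alignments with at most K gaps.
        LR(a, b) = substitution likelihood ratio for aligning a with b."""
        I, J = len(x), len(y)
        M   = np.zeros((K + 1, I + 1, J + 1))
        D   = np.zeros((K + 1, I + 1, J + 1))
        Ins = np.zeros((K + 1, I + 1, J + 1))
        for k in range(1, K + 1):
            for i in range(1, I + 1):
                for j in range(1, J + 1):
                    start = 1.0 if k == 1 else 0.0   # assumption: blocks may start anywhere
                    M[k, i, j] = LR(x[i-1], y[j-1]) * (M[k, i-1, j-1]        # match after k gaps
                                                       + Ins[k-1, i-1, j-1]  # new block after an insertion
                                                       + D[k-1, i-1, j-1]    # new block after a deletion
                                                       + start)
                    D[k, i, j]   = M[k, i-1, j] + Ins[k, i-1, j] + D[k, i-1, j]  # indel after k blocks
                    Ins[k, i, j] = M[k, i, j-1] + Ins[k, i, j-1]
        # alignments may end in any state at the lower right
        return sum(M[k, I, J] + D[k, I, J] + Ins[k, I, J] for k in range(1, K + 1))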
In the example shown, the proteins are NOT homologous, but short stretches of similarity are found nonetheless. (figure: the match-probability histogram P(i,j), plotted over positions i and j)

Testing DP versus BBA for hard cases (Huang & Bystroff, 2006)
Above: optimal DP alignments of 1R69 to 1NEQ (distantly homologous proteins) using various substitution matrices. The last line, "BAliBase", is the true alignment.
Left: Bayesian Adaptive Alignment results for the same pair, with the true alignment in black outline. Color indicates the probability of a match between positions in the two sequences (white: zero probability; blue: low P; red: high P). Sometimes the true alignment is sub-optimal.
(figure: error rate versus coverage; BBA is better than DP)

Review
• What is a Markov model?
• What is a hidden Markov model?
• What kind of problem can a HMM solve?
• How do you find the best state pathway, given a sequence?
• What does "state pathway" mean?
• How do you find the probability of a state q at sequence position k, given a sequence s and a HMM λ?
• What does it mean for a state to "emit"?
• What use are non-emitting states?
• How is the joint probability of a sequence and a pathway calculated?
• What algorithm can be used to determine the maximum-likelihood value for a parameter of a HMM?
• What is a profile HMM?
• How is the topology of a profile HMM initialized? How is it refined?