Profile HMMs for sequence families and Viterbi equations
Linda Muselaars and Miranda Stobbe

Example alignment
HBA_HUMAN HBB_HUMAN MYG_PHYCA GLB3_CHITP GLB5_PETMA LGB2_LUPLU GLB1_GLYDI
-HGSAQVKGHGKKVADALTNAVAHVVMGNPKVKAHGKKVLGAFSDGLAHLMKASEDLKKHGVTVLTALGAILKK-IKGTAPFETHANRIVGFFSKIIGELLKKSADVRWHAERIINAVNDAVASMPQNNPELQAHAGKVFKLVYEAAIQLQ ---DPGVAALGAKVLAQIGVAVSHL-

Overview chapter 5
Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
More on estimation of probabilities.
Optimal model construction.
Weighting training sequences.

Key-issues
Identifying the relationship of an individual sequence to a sequence family.
How to build a profile HMM.
Use profile HMMs to detect potential membership in a family.
Use profile HMMs to give an alignment of a sequence to the family.

Key-issues (2)
Lollypops for a valuable (up to the speakers to decide) contribution to this lecture.

Needed theory
Emission probabilities.
Silent states.
Pair HMMs.
The Viterbi algorithm.
The Forward algorithm.

Contents
Ungapped score matrices.
Adding insert and delete states to obtain profile HMMs.
Deriving profile HMMs from multiple alignments.
– Non-probabilistic profiles
– Basic profile HMM parameterisation
Searching with profile HMMs.
Profile HMM variants for non-global alignments.
Example alignment
HBA_HUMAN HBB_HUMAN MYG_PHYCA GLB3_CHITP GLB5_PETMA LGB2_LUPLU GLB1_GLYDI
-HGSAQVKGHGKKVADALTNAVAHVVMGNPKVKAHGKKVLGAFSDGLAHLMKASEDLKKHGVTVLTALGAILKK-IKGTAPFETHANRIVGFFSKIIGELLKKSADVRWHAERIINAVNDAVASMPQNNPELQAHAGKVFKLVYEAAIQLQ ---DPGVAALGAKVLAQIGVAVSHL
*********************

Ungapped regions
Gaps tend to line up.
We can consider models for ungapped regions.
Specify independent probabilities $e_i(a)$ for observing residue $a$ in position $i$:
\[ P(x \mid M) = \prod_{i=1}^{L} e_i(x_i) \]
But of course we score with the log-odds ratio against the background distribution $q_a$:
\[ S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}} \]
This gives a position specific score matrix.

Drawbacks
Multiple alignments do have gaps, and these need to be accounted for.
For example: the BLOCKS database, with combined scores of ungapped regions.
We will develop a single probabilistic model for the whole extent of the alignment.

Short review
Emission probabilities: the probability that a certain symbol is seen when in a certain state k.
Silent states: states that do not emit symbols in an HMM.

Building the model (1)
We need position sensitive gap scores.
HMM with a repetitive structure of (match) states.
Transitions of probability 1.
Emission probabilities: $e_{M_i}(a)$.
[Figure: chain of match states Begin, ..., Mj, ..., End.]

Building the model (2)
Deal with insertions: a set of new states $I_i$.
The $I_i$ have emission distribution $e_{I_i}(a)$, set to the background distribution $q_a$.
[Figure: insert states Ij above the match states Mj, between Begin and End.]

Building the model (3)
Deal with deletions.
One possibility: forward jumps that skip match states.
For arbitrarily long gaps: silent states $D_j$.
[Figure: delete states Dj above the match states Mj, between Begin and End.]

Costs for additional states
Insertions: the sum of the costs of the transitions and emissions (one M→I transition, a number of I→I transitions, one I→M transition).
Deletions: the sum of the costs of an M→D transition, a number of D→D transitions and a D→M transition.

Full model
[Figure: full profile HMM with match states Mj, insert states Ij and delete states Dj, between Begin and End.]

Comparison with pair HMM
[Figure: pair HMM with states Begin, M (emits $x_i, y_j$ with probability $p_{x_i y_j}$), X (emits $x_i$ with probability $q_{x_i}$), Y (emits $y_j$ with probability $q_{y_j}$) and End.]

Non-probabilistic profiles
Profile without an underlying probabilistic model.
Set scores to averages of standard substitution scores.
Anomalies:
– Conservation of columns is not taken into account.
– Scores for gaps do not behave properly.

Example
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
The score for residue $a$ in column 1 would be set to:
\[ \frac{5}{7}\, s(\mathrm{V}, a) + \frac{1}{7}\, s(\mathrm{F}, a) + \frac{1}{7}\, s(\mathrm{I}, a) \]

Basic profile HMM parameterisation
Objective: make the probability distribution peak around members of the family.
Available parameters:
– Length of the model.
– Transition and emission probabilities.

Length of the model
Which multiple alignment columns do we assign to match states? And which to insert states?
Heuristic rule: columns in which more than 50% of the characters are gaps are modelled by insert states; all other columns are modelled by match states.
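The 50% heuristic above can be sketched in a few lines of Python. This is only an illustration: the function name `match_columns` and the parameter `max_gap_fraction` are hypothetical, and the rows are the globin fragment from the example slide with the leading and trailing dots removed.

```python
def match_columns(rows, max_gap_fraction=0.5):
    """Return the 0-based indices of columns assigned to match states.

    A column is modelled by an insert state when more than
    max_gap_fraction of its characters are gaps ('-').
    """
    n_rows = len(rows)
    width = len(rows[0])
    selected = []
    for col in range(width):
        gaps = sum(1 for row in rows if row[col] == "-")
        if gaps / n_rows <= max_gap_fraction:
            selected.append(col)
    return selected


# Globin fragment from the example slide (flanking dots stripped).
rows = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

print(match_columns(rows))
```

Columns 3 and 4 contain six gaps out of seven characters, so they fall to insert states; the remaining eight columns are the ones marked with asterisks on the slide.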
Probability parameters
Transition probability:
\[ a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \]
where $A_{kl}$ is the number of transitions from state $k$ to state $l$, and the denominator is the number of transitions from state $k$ to any state.
Emission probability:
\[ e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')} \]
where $E_k(a)$ is the number of times residue $a$ is emitted by state $k$.
In the limit of many training sequences these are accurate and consistent estimates.
Pseudocount method: Laplace's rule (add one to every count).

Example
        1 2 . 3 . 4
Bat     A G - - - C
Rat     A - A G - C
Cat     A G - A A -
Gnat    - - A A A C
Goat    A G - - - C
        * *   *   *

Example continued
[Figure: profile HMM with states Begin, M1 to M4, I0 to I4, D1 to D4 and End.]
Match state emission probabilities:
M1: A 5/8, C 1/8, G 1/8, T 1/8
M2: A 1/7, C 1/7, G 4/7, T 1/7
M3: A 3/7, C 1/7, G 2/7, T 1/7
M4: A 1/8, C 5/8, G 1/8, T 1/8
Transition probabilities from M1: $a_{M_1 M_2} = 4/7$, $a_{M_1 D_2} = 2/7$, $a_{M_1 I_1} = 1/7$.

Searching with profile HMMs
Obtaining significant matches of a sequence to the profile HMM:
– Viterbi algorithm: $P(x, \pi^* \mid M)$.
– Forward algorithm: $P(x \mid M)$.
Give an alignment of a sequence to the family:
– Highest scoring, or Viterbi, alignment.
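The Laplace-rule estimates from the Bat/Rat/Cat/Gnat/Goat example can be checked with a short Python sketch. It rests on assumptions: the alignment rows below are one reading of the flattened slide (chosen because it reproduces the probabilities quoted there), the columns marked with asterisks are taken to be columns 0, 1, 3 and 5, and the helper names `match_emissions` and `transitions_from_match` are hypothetical.

```python
from fractions import Fraction

ALPHABET = "ACGT"

# Assumed reading of the example alignment; '-' is a gap.
alignment = {
    "Bat":  "AG---C",
    "Rat":  "A-AG-C",
    "Cat":  "AG-AA-",
    "Gnat": "--AAAC",
    "Goat": "AG---C",
}
match_cols = [0, 1, 3, 5]  # assumed positions of the '*' marks


def match_emissions(alignment, col):
    """Laplace-smoothed emission probabilities for the match state at col."""
    counts = {a: 0 for a in ALPHABET}
    for seq in alignment.values():
        if seq[col] != "-":
            counts[seq[col]] += 1
    total = sum(counts.values()) + len(ALPHABET)  # one pseudocount per residue
    return {a: Fraction(c + 1, total) for a, c in counts.items()}


def transitions_from_match(alignment, match_cols, j):
    """Laplace-smoothed transitions out of match state j (0-based, not the last)."""
    col, nxt = match_cols[j], match_cols[j + 1]
    counts = {"M": 0, "I": 0, "D": 0}
    for seq in alignment.values():
        if seq[col] == "-":
            continue  # this sequence passes through the delete state, not the match state
        if any(c != "-" for c in seq[col + 1:nxt]):
            counts["I"] += 1  # residue in an insert column before the next match column
        elif seq[nxt] == "-":
            counts["D"] += 1
        else:
            counts["M"] += 1
    total = sum(counts.values()) + len(counts)  # one pseudocount per transition type
    return {k: Fraction(v + 1, total) for k, v in counts.items()}


print(match_emissions(alignment, match_cols[0]))
print(transitions_from_match(alignment, match_cols, 0))
```

Under these assumptions the first match column gives A 5/8 and the transitions out of M1 come out as 4/7 (to M2), 2/7 (to D2) and 1/7 (to I1), the values shown on the slide.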
Viterbi equations
$V_j^M(i)$: log-odds score of the best path matching subsequence $x_{1 \ldots i}$ to the submodel up to state $j$, ending with $x_i$ being emitted by state $M_j$.
$V_j^I(i)$: log-odds score of the best path ending in $x_i$ being emitted by $I_j$.
$V_j^D(i)$: log-odds score of the best path ending in state $D_j$.
For comparison, the pair HMM recurrence:
\[ v^M(i, j) = p_{x_i y_j} \max \begin{cases} (1 - 2\delta)\, v^M(i-1, j-1), \\ (1 - \varepsilon)\, v^X(i-1, j-1), \\ (1 - \varepsilon)\, v^Y(i-1, j-1); \end{cases} \]

Viterbi equations
\[ V_j^M(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V_{j-1}^M(i-1) + \log a_{M_{j-1} M_j}, \\ V_{j-1}^I(i-1) + \log a_{I_{j-1} M_j}, \\ V_{j-1}^D(i-1) + \log a_{D_{j-1} M_j}; \end{cases} \]
\[ V_j^I(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V_j^M(i-1) + \log a_{M_j I_j}, \\ V_j^I(i-1) + \log a_{I_j I_j}, \\ V_j^D(i-1) + \log a_{D_j I_j}; \end{cases} \]
\[ V_j^D(i) = \max \begin{cases} V_{j-1}^M(i) + \log a_{M_{j-1} D_j}, \\ V_{j-1}^I(i) + \log a_{I_{j-1} D_j}, \\ V_{j-1}^D(i) + \log a_{D_{j-1} D_j}; \end{cases} \]

Forward algorithm
\[ F_j^M(i) = \log \frac{e_{M_j}(x_i)}{q_{x_i}} + \log \left[ a_{M_{j-1} M_j} \exp(F_{j-1}^M(i-1)) + a_{I_{j-1} M_j} \exp(F_{j-1}^I(i-1)) + a_{D_{j-1} M_j} \exp(F_{j-1}^D(i-1)) \right]; \]
\[ F_j^I(i) = \log \frac{e_{I_j}(x_i)}{q_{x_i}} + \log \left[ a_{M_j I_j} \exp(F_j^M(i-1)) + a_{I_j I_j} \exp(F_j^I(i-1)) + a_{D_j I_j} \exp(F_j^D(i-1)) \right]; \]
\[ F_j^D(i) = \log \left[ a_{M_{j-1} D_j} \exp(F_{j-1}^M(i)) + a_{I_{j-1} D_j} \exp(F_{j-1}^I(i)) + a_{D_{j-1} D_j} \exp(F_{j-1}^D(i)) \right]; \]

Initialisation and termination
The Begin state is treated as $M_0$ and the End state as $M_{L+1}$, which emits no residue; $n$ is the length of $x$.
Viterbi algorithm:
– Initialisation: $V_0^M(0) = 0$
– Termination:
\[ V_{L+1}^M(n) = \max \begin{cases} V_L^M(n) + \log a_{M_L M_{L+1}}, \\ V_L^I(n) + \log a_{I_L M_{L+1}}, \\ V_L^D(n) + \log a_{D_L M_{L+1}}; \end{cases} \]
Forward algorithm:
– Initialisation: $F_0^M(0) = 0$
– Termination:
\[ F_{L+1}^M(n) = \log \left[ a_{M_L M_{L+1}} \exp(F_L^M(n)) + a_{I_L M_{L+1}} \exp(F_L^I(n)) + a_{D_L M_{L+1}} \exp(F_L^D(n)) \right] \]

Alternative to log-odds scoring
Log likelihood score (LL score).
Strongly length dependent.
Solutions:
– Divide by sequence length
– Z-score
Which method is preferred?
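The Viterbi recurrences, initialisation and termination from the slides above can be sketched as a small dynamic program in Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `viterbi_log_odds` and the parameter layout (`e_match`, `trans`, `q`) are hypothetical, insert states are assumed to emit with the background distribution $q_a$ (so their log-odds emission term is zero), Begin is treated as $M_0$, and End is treated as a non-emitting $M_{L+1}$. Missing transitions are taken to have probability zero.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_odds(x, e_match, trans, q):
    """Best-path log-odds score of sequence x against a profile HMM.

    e_match[j-1][a] : emission probability of residue a in match state M_j
    trans[j][(s, t)]: transition probability from state s in {'M','I','D'} at
                      position j to state type t; 'M' and 'D' lead to position
                      j+1, 'I' stays at position j (j = 0..L)
    q[a]            : background probability of residue a
    """
    n, L = len(x), len(e_match)
    # V[s][j][i]: best score of a path that has emitted x[:i] and ends in state s_j.
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # initialisation: Begin plays the role of M_0

    def best(j, i, target):
        # Maximise over the three possible predecessor state types at position j.
        return max(V[s][j][i] + _log(trans[j].get((s, target), 0.0)) for s in "MID")

    for i in range(n + 1):
        for j in range(L + 1):
            if i > 0 and j > 0:
                odds = e_match[j - 1].get(x[i - 1], 0.0) / q[x[i - 1]]
                V["M"][j][i] = _log(odds) + best(j - 1, i - 1, "M")
            if i > 0:
                V["I"][j][i] = best(j, i - 1, "I")  # insert emits with q: log-odds 0
            if j > 0:
                V["D"][j][i] = best(j - 1, i, "D")  # silent state: same i
    # Termination: transition from position L into End (written as M_{L+1}).
    return best(L, n, "M")


# Toy model with two match states; Begin may enter M_1 or skip it through D_1.
q = {"A": 0.5, "B": 0.5}
e_match = [{"A": 0.9, "B": 0.1}, {"A": 0.9, "B": 0.1}]
trans = [
    {("M", "M"): 0.5, ("M", "D"): 0.5},
    {("M", "M"): 1.0, ("D", "M"): 1.0},
    {("M", "M"): 1.0},
]
print(viterbi_log_odds("A", e_match, trans, q))
```

For the one-residue sequence in this toy model the only complete path is Begin, D1, M2, End, so the score is log 0.5 + log(0.9/0.5). Replacing `max` in `best` with a log-sum-exp over the same three terms would give the corresponding Forward score.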
Demo
[Figures: part of the profile HMM; scoring; part of the multiple alignment; relative frequencies.]

Flanking model states
Used to model the flanking sequences to the actual profile match itself.
Extra probabilities needed:
– Emission probability: $q_a$.
– 'Looping' transition probability: $(1 - \eta)$.
– Transition probability from left flanking state: depends on application.

Model for local alignment
Smith-Waterman style.
[Figure: profile HMM with states Dj, Ij, Mj between Begin and End, and flanking states Q on both sides.]

Model for overlap matches
[Figure: profile HMM with states Dj, Ij, Mj between Begin and End, and flanking states Q.]

Model for repeat matches
[Figure: profile HMM with states Dj, Ij, Mj, Begin and End, and a single flanking state Q.]

Summary
Construction of a profile HMM for different kinds of alignments.
Use profile HMMs to detect potential membership in a family.
Use profile HMMs to give an alignment of a sequence to the family.

Discussion subject
BLAST versus profile HMM