Molecular Bioinformatics, X3
Lab 4, 030129
Hidden Markov Models

Part 1

Consider the short nucleotide motif below. Given the marked columns and a maximum likelihood estimate for the transition probabilities, a hidden Markov model M can be constructed.

[Figure: alignment with marked columns] C G C C C * T G T C A G T G G * A A A A *

[Figure: model M with transition counts] D3 I1 2 B 5 M1 5 5 M2 3 M3 E 4 5 4

Columns 0, 4 and 5 correspond to the match states M1, M2 and M3, and columns 1, 2 and 3 to the insert state I1. Some of the transition parameters have already been estimated based on the alignment above. Calculate the missing probabilities.

Estimate the emission parameters using the Add-one Pseudocount method, given the background vector q = (1/4, 1/4, 1/4, 1/4). Make a table of the emission probabilities of the nucleotides A, C, G and T for each of the emitting states M1, I1, M2 and M3.

Q1. The sequence s1 = 'GTA' can be generated by two paths, π1 and π2, through the model. Write their corresponding alignments.

Q2. Calculate the probability that the sequence has been generated by each path, i.e. calculate P(s1 | M, π1) and P(s1 | M, π2).

Q3. Evaluate the corresponding log-odds scores, score(s1 | M, π1) and score(s1 | M, π2), given q = (1/4, 1/4, 1/4, 1/4) and using the natural logarithm.

The optimal path for another, longer sequence, s2 = 'CTAGA', yields a score of +0.05212 nits. The probability, however, is as low as 0.00103.

Q4. Compare the score and probability for this known motif member to the optimal ones for sequence s1. Why is the log-odds score more convenient to use for discrimination purposes than the probability P?

Part 2

Building your first model

First make sure your screen resolution is at least 1152x864 pixels by right-clicking anywhere on the desktop and then selecting Properties. Choose the Settings tab and change the 'Screen Area' to 1152x864. Press OK and then OK again. If the FarOut program is present at H:\FarOut, click on its icon to start the program.
Otherwise, download the file http://xray.bmc.uu.se/~patrik/FarOut/farout.jar and all of the files in http://xray.bmc.uu.se/~patrik/FarOut/examples/.

When the program's main window appears, click the Start JAMB! button to open 'the Java Markov Builder'. Go to the File menu and select Load Alignment, change the file type to All files and open the file example1.fasta. The short alignment contained in the file is displayed together with sequence names, weights and a consensus sequence. Click the Settings button to open the Settings window. Click the Background Pseudocount field to enable background pseudocount parameter estimation. You'll end up with a window looking something like this:

[Screenshot]

In the Settings window you can adjust the amount of a priori information incorporated into your models, for example by tuning the emission pseudocount constant A and the transition pseudocount constant B. Try building your first model by clicking the Scan button; the program now creates the HMM graph and estimates all of its parameters. Look at the estimated emission probabilities.

Q5. What is the estimated probability of emitting a Tyrosine, Y, in column 1?

Q6. How frequent is Glutamic Acid, E, in the sequence universe? Which is the most common amino acid in protein space?

Start by evaluating one or a few of the sequences from the alignment (e.g. QEMNGYHI or MDLNGHF..) by typing it into the Sequence Evaluation field and hitting Enter. Look at the score in the bottom-left field of the window and examine the alignment corresponding to the most probable path through the model by clicking the Show Alignment checkbox. As one might suspect, the model is pretty good at generating its own sequences; the scores of the model sequences in some sense set a rough upper limit on the maximum achievable score.
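As a reminder of how such a score relates to a path probability, here is a minimal sketch of the log-odds calculation from Part 1. It assumes the score is ln(P(s | M, path) / P(s | q)), where P(s | q) is the product of the background frequencies of the residues in s. The function name log_odds is made up for this illustration, and whether JAMB computes its displayed score in exactly this way is an assumption.

```python
import math

def log_odds(p_path, seq, background):
    """Natural-log odds of a path probability against the background model.

    p_path     -- P(s | M, path), the probability of seq along one path
    background -- dict mapping each residue to its background frequency q
    """
    # Probability of the same sequence under the background (null) model.
    p_background = math.prod(background[x] for x in seq)
    return math.log(p_path / p_background)

# Uniform nucleotide background from Part 1.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# s2 = 'CTAGA' with the rounded optimal-path probability quoted in Part 1.
score_s2 = log_odds(0.00103, "CTAGA", q)
print(f"{score_s2:.4f}")  # about 0.053
```

The result agrees with the +0.05212 quoted for s2 once the rounding of the probability to 0.00103 is taken into account. Note that the background probability shrinks with sequence length just as the path probability does, so the ratio stays comparable between sequences of different lengths.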
Since the model is so short, the difference between the model and the background distribution isn't really significant, but it is enough to illustrate some basic properties of hidden Markov models.

The Workset environment

Evaluate the slightly different sequence 'EMKGDH' and look at its alignment. Try changing the emission pseudocount constant A to a large number, e.g. 9999. Hit Scan and evaluate the sequence again. Examine the alignment.

Q7. What happens, and why?

Reset A = 5 and add a 'P' at the end of the sequence. Compare the score of this new sequence to the first one (i.e. evaluate the sequence 'EMKGDHP').

Q8. How come the score for the first sequence is so much lower even though Proline clearly isn't such a favourable residue in the last position? (Hint: transition probabilities.)

Click Add Model and enter a name for your newly built model; it now ends up in the Workset database of models. The Workset can contain all the models and sequences you are working with during a session. Reopen your model by hitting Edit and evaluate the sequence 'MEVEGHV'. Look at the score. Now select Total Probability in the Workset window to sum the score over all possible alignments instead of just the optimal one. Evaluate the sequence again. (Note: there's no need to rebuild your model, since we're just using a different algorithm to calculate the score.)

Q9. Compare the scores. What does this tell you about all other valid paths through the model compared to the optimal one?

Subsequence Search & Classification

Reselect the Maximum Probability box and close your old model window. In the Workset window press Build. Now it's time to look at a real example: read the alignment file gh10_seed.fasta and build the model using Background Pseudocount estimation. Add the model to the Workset and name it e.g. 'gh10_backgr'. Repeat the same procedure, first using Substitution Mixtures and then using Dirichlet Mixtures.
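For the simplest of the three methods, background pseudocount estimation, here is a minimal sketch of how a pseudocount constant A can enter the emission estimate. It assumes the common form e(a) = (c_a + A*q_a) / (N + A), where c_a is the observed count of residue a in a column, N is the total count and q is the background. Whether FarOut uses exactly this formula is an assumption. The function name emission_probs and the counts are made up for illustration and are not taken from any alignment in this lab.

```python
def emission_probs(counts, background, A):
    """Pseudocount estimate e(a) = (c_a + A*q_a) / (N + A) for one column.

    counts     -- dict of observed residue counts in the column
    background -- dict of background frequencies q (defines the alphabet)
    A          -- pseudocount constant, i.e. the weight given to the prior
    """
    total = sum(counts.values())
    return {a: (counts.get(a, 0) + A * background[a]) / (total + A)
            for a in background}

# Uniform nucleotide background, as in Part 1.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical column counts, for illustration only.
counts = {"A": 3, "C": 0, "G": 1, "T": 1}

# With a uniform q over four letters, A = 4 adds exactly one count per letter,
# so this reduces to the add-one rule (c_a + 1) / (N + 4).
add_one = emission_probs(counts, q, A=4)
print(add_one)

# The same column with A = 5.
with_a5 = emission_probs(counts, q, A=5)
print(with_a5)
```

Try a few other values of A on the same counts to see how the balance between observed counts and prior shifts.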
Now you will have three models built from the same data but by different methods. In the Workset Sequences field of the Workset window, click Add to load a sequence file. Open the file gh10_family.fasta and the file gh5_family.fasta. First mark one of the models and some of the first sequences, then hit Eval. Look at the scores. Now unselect the Subsequence Search checkbox in the Local Evaluation field. Eval the same sequences again.

Q10. What happens? What is special about these sequences?

Look at the alignment made for one of them by selecting it and copying it (right mouse button). Hit Edit on the current model and evaluate the sequence.

The Submodel Search option, in turn, deals with modelling a local motif, i.e. the case where the model is much shorter than the evaluated sequences. Both options imply changes to the basic hidden Markov architecture. The Subsequence Search option enables direct transitions between the begin state and all match states, and between all match states and the end state.

Q11. How do you think the architecture and/or the parameters can be changed to deal with the second case (Submodel Search), where a shorter domain is modelled and only a small segment of each search sequence should be considered while the rest is ignored?

The sequences contained in the first file have already been classified as family 10 members by hand. This means that we can assess the quality of our three models, each built from a small subsection of the family, by investigating how many putative new members they find. Re-enable the Subsequence Search and select one or several of your gh10 models. Select the sequences you want to evaluate and hit Eval. If you select several models, all will be evaluated and the highest-scoring one will be displayed in the 'Max Prob Model' column. Since the program isn't optimised for fast sequence evaluation, this might be pretty slow on the lab computers.
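If you note down the scores, the comparison can be summarised along the following lines. This is only a sketch with hypothetical numbers. The model names other than 'gh10_backgr', the scores, and the threshold of 0 are all assumptions made for illustration, not output from FarOut.

```python
def best_model(scores_by_model):
    """Return (name, score) of the highest-scoring model for one sequence."""
    name = max(scores_by_model, key=scores_by_model.get)
    return name, scores_by_model[name]

def fraction_above(scores, threshold):
    """Fraction of the scores that lie strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical scores of one sequence against three models.
one_sequence = {"gh10_backgr": 12.4, "gh10_subst": 15.1, "gh10_dirichlet": 14.0}
print(best_model(one_sequence))  # ('gh10_subst', 15.1)

# Hypothetical scores of known family and non-family sequences for one model.
family_scores = [15.1, 9.8, 22.3, -1.2]
nonfamily_scores = [-8.5, 0.7, -3.1]
threshold = 0.0  # assumed cut-off: log-odds above 0 favours the model

found = fraction_above(family_scores, threshold)            # family members recovered
false_hits = fraction_above(nonfamily_scores, threshold)    # non-members accepted
print(found, false_hits)
```

A model that recovers a larger fraction of known family members while accepting fewer non-members at the same threshold is the better classifier.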
Play around a bit to find the best of the three models and then maybe evaluate all of the sequences with that model. Also try out some of the family 5 sequences. Switch between enabling and disabling the Subsequence Search. This will also change how high or low the non-family sequences score. Since using the Subsequence Search makes the model much less length-dependent, it is not just short family sequences that will score higher: all of the sequence universe will score higher. The model becomes more general, i.e. more sensitive but less selective.

Sequence Weighting

As you might have noticed while looking at emission and transition counts earlier, not all model sequences are treated equally by the program. Open one of your models and look at the weight column at the far left of the window. Click on the sequence names and/or on the weights to get a bar representation instead of numbers. These weights are calculated to maximise the entropy, i.e. the information, of the model sequences. A description of the weighting algorithm can be found in section 3.4.3 in the Appendix.

Now try changing the weights by using your mouse to increase some of the bars. This will bias your model towards those sequences. Doing this might actually be useful in some cases, for example when doing homology modelling using one sequence/structure as a template while still wanting some amount of information from the other family members. By assigning the template sequence a large weight and letting the others split the remainder, a better template-sequence mapping can be made.

Another important parameter when building a model is the total weight. Normally one implicitly uses a total weight equal to the number of model sequences.

Q12. What would happen to selectivity and sensitivity if we were to increase or decrease the total weight? (Hint: what would happen to the amount of a priori information incorporated?)

Building a StructureHMM

Delete your existing Workset sequences and models.
Now go back to the main FarOut window and click on the Structural Alignment tab. Hit Open Structure File and read in the .pdb file 3MAN, a mannanase (family 5). Also open the structure file 1QNO. Deselect 'View Results in FarOut' and instead choose View Results in Jamb. Click Align Structures. A model window and a molecule viewer showing the structural alignment appear after a few seconds.

First look at the interpolated 3D structures to get a rough idea of how similar they are. Note the pretty well conserved helices in contrast to the long non-conserved loops. Go back to the model window. Here you'll see the sequences and their structurally conserved parts. The pairwise sequence identity is 13%. Locate the catalytic triad, the conserved residues 'NEP', and the two active-site stabilising residues 'H-Y'.

Open the Settings window. In the 'StructureHMM settings' field, use the second slider to decrease the total weight of non-structurally conserved regions to 1 (look at the weight column to see the total weight of conserved and non-conserved segments, e.g. Tot : 2 / 1). Build the model and add it to the Workset. Read in the sequences 1EGZ and 1GO1 and evaluate them against your model. Look at the alignment of the highest-scoring sequence. Now incorporate it into the model by clicking Add Alignment to Model. Evaluate the two sequences again and examine the alignments.

Q13. How come a shift of the total weight between structurally conserved and non-conserved regions may improve the model's ability to recognise distant homologs? Can you think of any other architecture and/or parameter changes that would help?

If you have some energy left, you can try to structurally align 3MAN to 1QNO as before, then also 1EGZ, and compare this true alignment to the one obtained by your model. Since structural alignment isn't trivial, try using the 'fast' Levitt-Gerstein 2D method and lowering the Minimum Segment length to about 5. Compare with the slower Brute Force 3D method.
View your results in FarOut. The sequence identity among the first three structures is 12-17%; 1GO1 is a bit more similar to 1EGZ.

Part 3

Appendix

Excerpt from 'Classification of glycoside hydrolase sequences using hidden Markov models', P. Johansson (2001), UPTEC F01 002