Lab Hidden Markov Models - Structural Biology Labs

Molecular Bioinformatics, X3
Lab 4, 030129
Hidden Markov Models
Part 1
Consider the short nucleotide motif below. Given the marked columns and a maximum likelihood
estimate for the transition probabilities, a hidden Markov model M can be constructed.
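[Figure: alignment of the nucleotide motif, with the marked columns indicated by *, and the graph of model M with the states B, M1, I1, M2, M3, D3 and E, with some of the transitions annotated.]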
Columns 0, 4 and 5 correspond to the matchstates M1, M2 and M3, and columns 1, 2 and 3 to the insertstate I1.
Some of the transition parameters have already been estimated based on the alignment above.
Calculate the missing probabilities.
Estimate the emission parameters using the Add-one Pseudocount method given the background
vector q = (1/4, 1/4, 1/4, 1/4). Make a table for the emissions of the nucleotides A, C, G and T for each of the
emitting states M1, I1, M2 and M3.
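As an illustration of the add-one estimate, the minimal Python sketch below adds one pseudocount per nucleotide to the observed counts of a single state and normalises. With the uniform background q this is the same as adding 4 x 1/4 to each count. The function name add_one_emissions and the example column are hypothetical and are not taken from the lab alignment.

```python
from collections import Counter

ALPHABET = "ACGT"

def add_one_emissions(column):
    """Estimate emission probabilities from the residues observed in one state.

    Characters outside the alphabet (e.g. gaps) are skipped, and every
    nucleotide receives exactly one pseudocount.
    """
    counts = Counter(ch for ch in column if ch in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)
    return {a: (counts[a] + 1) / total for a in ALPHABET}

# Hypothetical column, NOT one of the columns of the lab alignment.
emissions = add_one_emissions("AAGT")
for a in ALPHABET:
    print(a, round(emissions[a], 3))
```

Note that a nucleotide never observed in the column (C in this example) still gets a non-zero probability, which is the purpose of the pseudocount.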
Q1. The sequence s1=‘GTA’ can be generated by two paths, 1 & 2, through the model.
Write their corresponding alignments.
Q2. Calculate the probability that the sequence has been generated by each of the two paths, i.e.
calculate P(s1 | M,1) & P(s1 | M,2)
Q3. Evaluate the corresponding log-odds scores, score(s1 | M,1) & score(s1 | M,2), given
q = (1/4, 1/4, 1/4, 1/4) and using the natural logarithm.
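The two quantities asked for in Q2 and Q3 can be sketched in Python as follows. The probability of a path is the product of every transition and emission probability used along it, and the log-odds score compares that product with the probability of the same sequence under the background. The helper names and all numeric values below are hypothetical placeholders, not the parameters of model M.

```python
import math

def path_probability(transitions, emissions):
    """Multiply all transition and emission probabilities used along one path."""
    return math.prod(transitions) * math.prod(emissions)

def log_odds(p_model, seq_len, q=0.25):
    """Natural-log odds (nits) of a path probability against a uniform background."""
    return math.log(p_model) - seq_len * math.log(q)

# Hypothetical values for a three-residue sequence on a path B -> M1 -> M2 -> M3 -> E.
trans = [0.8, 0.5, 0.9, 1.0]   # one value per transition on the path
emits = [0.5, 0.4, 0.6]        # one value per emitted residue
p = path_probability(trans, emits)
score = log_odds(p, seq_len=3)
print(p, score)
```

A positive score means the path through the model explains the sequence better than the background does; a negative score means the opposite.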
The optimal path for another, longer, sequence s2=‘CTAGA’ yields a score of +0.05212 nits. The
probability however is as low as 0.00103.
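The score and the probability quoted for s2 can be checked against each other. The sketch below assumes that the score was computed against the uniform background q = 1/4 per nucleotide from Part 1 and solves the log-odds definition for the path probability.

```python
import math

score_nits = 0.05212   # log-odds score of s2 quoted above
length = 5             # s2 = 'CTAGA'
q = 0.25               # assumed uniform background probability per nucleotide

# score = ln P(s2 | M, path) - length * ln q, solved for the probability
p_path = math.exp(score_nits + length * math.log(q))
print(round(p_path, 5))
```

Under that assumption the result rounds to the quoted 0.00103.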
Q4. Compare the score & probability for this known motif-member to the optimal ones for
sequence s1. Why is the log-odds score more convenient to use for discrimination purposes
compared to the probability P ?
Part 2
Building your first model
First make sure your screen resolution is at least 1152x864 pixels by clicking the right mouse button
anywhere on the desktop and then selecting Properties. Choose the Settings-tab and change the
‘Screen Area’ to 1152x864. Press ok and then ok again.
If the FarOut program is present at H:\FarOut, click on its icon to start the program. Otherwise,
download the file http://xray.bmc.uu.se/~patrik/FarOut/farout.jar and all of the files in
http://xray.bmc.uu.se/~patrik/FarOut/examples/.
When the program’s main window appears, click on the button labelled Start JAMB! to open ‘the
Java Markov Builder’. Go to the File menu and select Load Alignment, change the filetypes to
All files and open the file example1.fasta.
The short alignment contained in the file is displayed together with sequence names, weights and
a consensus sequence. Click on the Settings button to open the Settings window. Hit the
Background Pseudocount field to enable background pseudocount parameter estimation. You’ll
end up with a window looking something like this ;
In the Settings window you can adjust the amount of a priori information incorporated into your
models, for example by tuning the emission pseudocount constant A and the transition
pseudocount constant B.
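As a rough guide to what the two constants do, a commonly used form of background-pseudocount estimation is given below. This is an assumption made for illustration: the exact definitions used by the program are those in the Appendix and may differ. Here c_j(a) is the (weighted) count of residue a in state j, q_a the background frequency, n_kl the count of transitions from state k to state l, and r_l a hypothetical prior over the transitions leaving state k.

```latex
e_j(a) = \frac{c_j(a) + A\, q_a}{\sum_{a'} c_j(a') + A},
\qquad
t_{kl} = \frac{n_{kl} + B\, r_l}{\sum_{l'} n_{kl'} + B}
```

In this form, A and B act as the number of virtual observations drawn from the prior, so their size relative to the observed counts decides how much the data influence the estimate. With A = 4 and q_a = 1/4, the emission formula reduces to the add-one estimate of Part 1.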
Try building your first model by clicking on the Scan button ; now the program creates the
HMM-graph and estimates all of its parameters. Look at the estimated emission probabilities.
Q5. What is the estimated probability of emitting a Tyrosine, Y, in column nr 1 ?
Q6. How frequent is Glutamic Acid, E, in the sequence universe ? Which is the most
common amino acid in the protein space ?
Start by evaluating one or a few of the sequences from the alignment (e.g. QEMNGYHI or
MDLNGHF..) by typing it into the Sequence Evaluation field and hitting enter. Look at the
score in the bottom left field of the window and examine the alignment corresponding to the most
probable path through the model by clicking on the Show Alignment checkbox.
As one might suspect, the model is pretty good at generating its own sequences; the scores of the
model-sequences in some sense set a rough upper limit on the maximum achievable score. Since
the model is so short, the difference between the model and the background distribution isn’t
really so significant but enough to serve as an illustration for some basic properties of hidden
Markov models.
The Workset environment
Evaluate the slightly different sequence ‘EMKGDH’ and look at its alignment. Try changing the
emission pseudocount constant A to a large number, like e.g. 9999. Hit Scan and evaluate the
sequence again. Examine the alignment.
Q7. What happens, why ?
Reset A=5 and add a ‘P’ at the end of the sequence. Compare the score of this new sequence to
the first one (i.e. evaluate the sequence ‘EMKGDHP’).
Q8. How come the score for the first sequence is so much lower even though Proline clearly
isn’t such a favourable residue in the last position ? (Hint ; transition probabilities..)
Click Add Model and enter a name for your newly built model; it now ends up in the workset
database of models. The Workset can contain all the models and sequences you are working with
during a session. Reopen your model by hitting Edit and evaluate the sequence ‘MEVEGHV’.
Look at the score. Now select Total Probability in the Workset window to sum the score over all
possible alignments instead of just the optimal one. Evaluate the sequence again. (note ; there’s no
need to rebuild your model since we’re just using a different algorithm to calculate the score)
Q9. Compare the scores – what does this tell you about all other valid paths through the
model compared to the optimal one ?
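The difference between the two scoring modes can be sketched with a toy example. The code below uses a hypothetical two-state HMM over nucleotides (not the JAMB profile architecture and not its parameters) and runs the same dynamic programming recursion twice: with max it gives the probability of the single best path (the Viterbi algorithm), and with sum it gives the total probability over all paths (the forward algorithm).

```python
import math

# Hypothetical two-state HMM; all numbers are made up for illustration.
STATES = ("X", "Y")
START = {"X": 0.6, "Y": 0.4}
TRANS = {"X": {"X": 0.7, "Y": 0.3}, "Y": {"X": 0.4, "Y": 0.6}}
EMIT = {
    "X": {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
    "Y": {"A": 0.1, "C": 0.4, "G": 0.1, "T": 0.4},
}

def evaluate(seq, combine):
    """Run the recursion over seq; combine is max (Viterbi) or sum (forward)."""
    col = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for ch in seq[1:]:
        col = {
            s: EMIT[s][ch] * combine(col[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return combine(col.values())

seq = "GATC"
p_best = evaluate(seq, max)    # probability of the optimal path only
p_total = evaluate(seq, sum)   # probability summed over all paths
print(p_best, p_total, math.log(p_total / p_best))
```

The last printed value is the gain in nits from summing over all paths instead of keeping only the best one. It can never be negative, since the optimal path is one of the terms in the sum.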
Subsequence Search & Classification
Reselect the box Maximum Probability and close your old model window. In the Workset window, press Build. Now it’s time to look at a real example ; read the alignment file
gh10_seed.fasta and build the model using Background Pseudocount estimation. Add the model
to the Workset and name it, e.g., ‘gh10_backgr’. Repeat the same procedure, first
using Substitution Mixtures and then using Dirichlet Mixtures. Now you will have three models
built from the same data but by different methods.
In the Workset window, Workset Sequences-field, click Add to load a sequence file. Open the
file gh10_family.fasta and the file gh5_family.fasta.
First mark one of the models and mark some of the first sequences ; hit Eval. Look at the scores.
Now unselect the Subsequence Search checkbox in the Local Evaluation field. Eval the same
sequences again.
Q10. What happens, what is special with these sequences ? Look at the alignment made of
one of them by selecting it and copying (mouse-right-button). Hit Edit on the current model
and evaluate the sequence.
The Submodel Search option, in turn, deals with the concept of modelling a local motif, i.e.
when the model is much shorter than the evaluated sequences. Both of these options imply changes
in the basic hidden Markov architecture. The Subsequence Search option enables direct
transitions between the begin state and all matchstates – and between all matchstates and the end
state.
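The extra transitions that the Subsequence Search option enables can be listed with a short Python sketch. It only enumerates edges for a simplified model in which insert and delete states are omitted, and it says nothing about how the program assigns probabilities to the new transitions. The function names are hypothetical.

```python
def profile_edges(n_match):
    """Edges of the plain chain B -> M1 -> ... -> Mn -> E (insert/delete states omitted)."""
    states = ["B"] + [f"M{k}" for k in range(1, n_match + 1)] + ["E"]
    return set(zip(states, states[1:]))

def subsequence_edges(n_match):
    """Same chain plus direct begin -> Mk and Mk -> end transitions for every matchstate."""
    edges = profile_edges(n_match)
    for k in range(1, n_match + 1):
        edges.add(("B", f"M{k}"))
        edges.add((f"M{k}", "E"))
    return edges

# Transitions that exist only when Subsequence Search is enabled, for three matchstates.
extra = sorted(subsequence_edges(3) - profile_edges(3))
print(extra)
```

With these edges a sequence can enter the model at any matchstate and leave it at any matchstate, so a sequence covering only part of the model no longer has to pay for a long run of delete states.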
Q11. How do you think the architecture and / or the parameters can be changed to deal with
the second case, when modelling a shorter domain and just a small segment of the search
sequences should be considered while the rest should be ignored ; Submodel Search ?
The sequences contained in the first file have already been classified as family 10 members by
hand. This means that we can assess the quality of our three models built from a small subsection
of the family by investigating how many putative new members they find.
Re-enable the Subsequence Search and select one or several of your gh10-models. Select the
sequences you want to evaluate and hit Eval. If you select several models, all will be evaluated
and the highest scoring one will be displayed in the ‘Max Prob Model’ column. Since the program
isn’t optimised for fast sequence-evaluation this might be pretty slow on the lab computers. Play
around a bit to find the best of the three models and then maybe evaluate all of the sequences with
that model.
Also try out some of the family 5 sequences. Switch between enabling and disabling the
Subsequence Search. This will also change how high or how low the nonfamily sequences will
score. Since using the Subsequence Search will make the model much less length-dependent, not
just short family sequences will score higher – all of the sequence universe will score higher. The
model becomes more general i.e. more sensitive but less selective.
Sequence Weighting
As you might have noticed while looking at emission- and transition-counts earlier, not all model-sequences are treated equally by the program. Open one of your models and look at the weight
column at the far left of the window. Click on the sequence names / and on the weights to get a
bar-representation instead of numbers.
These weights are calculated to maximise the entropy, i.e. the information, of the model-sequences. A description of the weighting algorithm can be found in section 3.4.3 in the Appendix.
Now try changing the weights by using your mouse to increase some of the bars. This will bias
your model towards those sequences. Doing this might actually be useful in some cases, for
example when one is doing homology modeling using one sequence/structure as a template but
still would like some amount of information from the other family members. By assigning the
template sequence a large weight and letting the others split the remainder, a better template-sequence mapping can be made.
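How sequence weights enter the counts can be sketched as below: each sequence contributes its weight, rather than 1, to the count of the residue it has in a column. This is a minimal illustration under that assumption; the function name weighted_counts, the example column and the weights are hypothetical, and the entropy-based calculation of the weights themselves is not reproduced here.

```python
from collections import defaultdict

def weighted_counts(column, weights, total_weight=None):
    """Sum the sequence weights per residue in one alignment column.

    If total_weight is given, the weights are first rescaled to sum to it.
    Gap characters ('-') contribute nothing.
    """
    scale = 1.0 if total_weight is None else total_weight / sum(weights)
    counts = defaultdict(float)
    for residue, w in zip(column, weights):
        if residue != "-":
            counts[residue] += w * scale
    return dict(counts)

column = "EEQD"  # hypothetical column from four sequences
print(weighted_counts(column, [1, 1, 1, 1]))          # equal weights
print(weighted_counts(column, [2.5, 0.5, 0.5, 0.5]))  # first sequence used as template
```

In the second call the total weight is unchanged, but the residue of the first sequence dominates the column counts, which is what dragging up one bar in the weight column amounts to.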
Another important parameter when building a model is the total weight. Normally one implicitly
uses a total weight equal to the number of model-sequences.
Q12. What would happen to selectivity and sensitivity if we were to increase or decrease the
total weight ? (Hint, what would happen to the amount of a priori information incorporated?)
Building a StructureHMM
Delete your existing Workset-sequences and models. Now go back to the main FarOut-window
and click on the Structural Alignment tab. Hit Open Structure File and read in the .pdb file
3MAN, a mannanase (family 5). Also open the structure file 1QNO. Deselect ‘View Results in
FarOut’ and instead choose to View Results in Jamb. Click Align Structures.
A model-window and a molecule-viewer of the structural alignment appear after a few seconds.
First look at the interpolated 3D-structures to get a rough idea of how similar they are. Note the
pretty well conserved helices in contrast to the long nonconserved loops.
Go back to the model window. Here you’ll see the sequences and their structurally conserved
parts. The pairwise sequence identity is 13%. Locate the catalytic triad, the conserved residues
‘NEP’, and the two active site stabilising-residues ‘H-Y’. Open the Settings window. In the
‘StructureHMM settings’ field, use the second slider to decrease the total weight of non-structurally conserved regions to 1 (look at the weight column to see the total weight of conserved
and nonconserved segments, e.g. Tot : 2 / 1). Build and add the model to the Workset.
Read in the sequences 1EGZ and 1GO1 and evaluate them against your model. Look at the
alignment of the highest scoring sequence. Now incorporate it into the model by clicking Add
Alignment to Model. Evaluate the two sequences again and examine the alignments.
Q13. How come a shift of the total weight between structurally conserved regions and
nonconserved may improve the model’s ability to recognise distant homologs ? Can you
think of any other architecture and/or parameter changes that would help ?
If you have some energy left you can try to structurally align 3MAN to 1QNO as before and then
also 1EGZ, and compare this true alignment to the one obtained by your model. Since doing
structural alignment isn’t trivial, try using the ‘fast’ Levitt-Gerstein 2D method and lowering the
Minimum Segment length to about 5. Compare with the slower Brute Force 3D method.
View your results in FarOut. The sequence identity among the first three structures is
12-17% ; 1GO1 is a bit more similar to 1EGZ.
Part 3
Appendix
Excerpt from
‘Classification of glycoside hydrolase sequences using hidden Markov models’
P.Johansson (2001) UPTEC F01 002