BS123A/MB223 Introduction to Computational Biology

Copyright © 2003 Dr. Ray Luo, All rights reserved.
Unless otherwise indicated, all materials on these pages are copyrighted by Dr. Ray Luo. All rights
reserved. No part of these pages, either text or image may be used for any purpose other than personal use.
Therefore, reproduction, modification, storage in a retrieval system or retransmission, in any form or by
any means, electronic, mechanical or otherwise, for reasons other than personal use, is strictly prohibited
without prior written permission.
BS123A/MB223 Introduction to Computational Biology
Final Exam
Due 4:00pm Dec 5th, 2003
Note: This is an open-book/open-note final exam. You may use any resources in the
library, in the computer lab, and on the web. However, you must work out your
solutions independently. Type your solutions in an MS Word file and email the file to
rluo@uci.edu, just as you did for your projects. Please use your name as the file name
and write your name at the beginning of the file.
Absolutely no late returns! If you can’t email the file, please drop it off at NSI3206 by
the due time.
I. Molecular Mechanics and Its Applications in Biochemistry (20pts)
(a) Suppose you are interested in the mechanisms of a noncovalent interaction of a
natural ligand and its receptor. Given that you have identified the complex structure, you
would like to know what drives the complex formation, i.e. you would like to know
which energy component, polar or nonpolar interactions, is important in the complex
formation. To achieve this goal, you need to perform several computational mutagenesis
studies to mutate key residues in the receptor active site. The first residue to be analyzed
is a charged residue -- aspartic acid because its salt-bridge interaction with a nearby
residue Lysine is perturbed by the ligand binding. Suppose their interactions can be
modeled by two dummy atoms of charges of +e and –e, respectively, and with a distance
of 3 Å; and suppose their van der Waals parameters are A (= 4 kcal/mol) and B (= 3
kcal/mol). What is the scaling of the charge-charge interaction with respect to their
distance? What is the scaling of the van der Waals interaction with respect to their
distance? Please write down the formula for the Coulombic and van der Waals energies
of the two dummy atoms to show these. Suppose the perturbation of ligand binding moves
the two dummy atoms further apart, i.e. they are separated by a longer distance of 4 Å.
What is the Coulombic energy change upon this perturbation? Note that 1.0 e²/Å is equal
to 332 kcal/mol. What is the van der Waals energy change upon this perturbation?
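As a way to check the arithmetic in (a), here is a minimal Python sketch. It assumes the common 12-6 form A/r^12 - B/r^6 for the van der Waals term and a dielectric constant of 1 for the Coulomb term. The exam does not state either choice explicitly, and the function names are illustrative only.

```python
COULOMB_CONSTANT = 332.0  # kcal/mol per (e^2/Å), from the note in the question


def coulomb_energy(q1, q2, r):
    """Coulomb energy in kcal/mol for charges in units of e and distance r in Å.

    Assumes a dielectric constant of 1; the energy scales as 1/r.
    """
    return COULOMB_CONSTANT * q1 * q2 / r


def van_der_waals_energy(a, b, r):
    """12-6 van der Waals energy A/r^12 - B/r^6 (an assumed functional form)."""
    return a / r**12 - b / r**6


r_initial, r_final = 3.0, 4.0  # Å, before and after the perturbation

delta_coulomb = coulomb_energy(1.0, -1.0, r_final) - coulomb_energy(1.0, -1.0, r_initial)
delta_vdw = van_der_waals_energy(4.0, 3.0, r_final) - van_der_waals_energy(4.0, 3.0, r_initial)

print(f"Coulombic energy change:     {delta_coulomb:.3f} kcal/mol")
print(f"van der Waals energy change: {delta_vdw:.6f} kcal/mol")
```

Comparing the two printed changes gives a feel for how differently a 1/r term and a 1/r^6 or 1/r^12 term respond to the same change in distance.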
(b) Suppose you have studied the biochemical reaction of the ligand within the
receptor and found that it is a two-state reaction, i.e. there are the state of the
reactant (REAC), the state of the product (PROC), and a transition state (TS) somewhere
along the reaction coordinate. The nice thing about the receptor acting as an enzyme is
that it dramatically lowers the transition state energy barrier, ΔG = G_TS - G_REAC, from 20
kcal/mol to 5 kcal/mol. Suppose that the reaction rate is directly related to the energy
barrier as exp(-ΔG/KBT) (KBT = 0.6 kcal/mol). Compute the acceleration of the reaction
by the enzyme.
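The rate relation in (b) can be evaluated directly. The sketch below assumes that the prefactor of the rate is the same with and without the enzyme, so that it cancels in the ratio. The function name is illustrative.

```python
import math

KBT = 0.6  # kcal/mol at 300 K, as given in the question


def relative_rate(barrier, kbt=KBT):
    """Rate up to a constant prefactor: exp(-barrier / kBT)."""
    return math.exp(-barrier / kbt)


# Ratio of catalyzed (5 kcal/mol) to uncatalyzed (20 kcal/mol) rates.
acceleration = relative_rate(5.0) / relative_rate(20.0)
print(f"Acceleration factor: {acceleration:.3e}")
```

The ratio depends only on the difference between the two barriers, not on their absolute values.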
(c) Suppose you did a short Metropolis Monte Carlo simulation at 300K (KBT = 0.6
kcal/mol) with the following data for the energies (kcal/mol): -3.0, -3.0, -2.5, and -4.0.
Compute the average energy.
(d) Suppose you did a short Molecular Dynamics simulation at 300K (KBT = 0.6
kcal/mol) and obtained the following energy data (kcal/mol) during the equilibrium phase:
-1.0, -3.0, -4.0, and -2.0. Compute the average energy. What is the assumption used in the
computation of the average energy with Molecular Dynamics?
(e) There is a system of three states with energies (kcal/mol) of -2.1, -3.2, and -4.0.
Compute the average energy at 300K (KBT = 0.6 kcal/mol).
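Parts (c) to (e) contrast two kinds of averages. The sketch below is one reading of the questions, not an official solution: it assumes that energies recorded along a Metropolis Monte Carlo or equilibrated Molecular Dynamics run are already sampled from the Boltzmann distribution, so a plain arithmetic mean applies, whereas a list of distinct states must be weighted by exp(-E/KBT) explicitly. The function names are illustrative.

```python
import math

KBT = 0.6  # kcal/mol at 300 K


def sample_mean(energies):
    """Arithmetic mean over configurations already drawn from the ensemble."""
    return sum(energies) / len(energies)


def boltzmann_average(energies, kbt=KBT):
    """Average over enumerated states, weighting each by exp(-E / kBT).

    Energies are shifted by the minimum before exponentiating, which leaves
    the average unchanged but avoids very large exponentials.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kbt) for e in energies]
    partition = sum(weights)
    return sum(w * e for w, e in zip(weights, energies)) / partition


mc_average = sample_mean([-3.0, -3.0, -2.5, -4.0])     # part (c)
md_average = sample_mean([-1.0, -3.0, -4.0, -2.0])     # part (d)
state_average = boltzmann_average([-2.1, -3.2, -4.0])  # part (e)

print(f"(c) {mc_average:.3f}  (d) {md_average:.3f}  (e) {state_average:.3f} kcal/mol")
```

In (e) the lowest-energy state dominates the weighted average, so the result lies much closer to -4.0 than the unweighted mean of the three energies would.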
II. Hidden Markov Model (20pts)
Consider the following HMM for secondary structure prediction. The three states are
helical (1), beta (2), and coil (3). The letters are from a subset of amino acids, 3 amino
acids only, labeled as 1, 2, and 3 below. The transition probabilities are: a11 = 0.4, a12 =
0.2, a13 = 0.4; a21=0.2, a22=0.3, a23=0.5; a31=0.3, a32=0.3, a33=0.4. The emission
probabilities are: e1(1)=0.3, e1(2)=0.3, e1(3)=0.4 for state 1; e2(1)=0.4, e2(2)=0.4,
e2(3)=0.2 for state 2; e3(1)=0.3, e3(2)=0.5, e3(3)=0.2 for state 3. Let’s assume all states
have the same chance to start, i.e. a01=a02=a03=1/3.
(a) Given a sequence of 123123123123: if we parse the sequence as all helical, what is
the probability? How about all beta? How about all coil?
(b) Compute the probability of generating the sequence of 123 by enumerating all
possibilities.
(c) Use the forward algorithm to compute the above probability.
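The three calculations in this section (a single fixed path, brute-force enumeration, and the forward algorithm) can be sketched in Python as follows. The sketch assumes there is no explicit end state, so the total probability is the sum of the final forward values. The dictionary layout and function names are illustrative.

```python
from itertools import product

STATES = (1, 2, 3)  # 1 = helical, 2 = beta, 3 = coil
START = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
TRANS = {
    1: {1: 0.4, 2: 0.2, 3: 0.4},
    2: {1: 0.2, 2: 0.3, 3: 0.5},
    3: {1: 0.3, 2: 0.3, 3: 0.4},
}
EMIT = {
    1: {1: 0.3, 2: 0.3, 3: 0.4},
    2: {1: 0.4, 2: 0.4, 3: 0.2},
    3: {1: 0.3, 2: 0.5, 3: 0.2},
}


def path_probability(symbols, path):
    """Joint probability of the symbols and one specific state path."""
    prob = START[path[0]] * EMIT[path[0]][symbols[0]]
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][sym]
    return prob


def enumerate_probability(symbols):
    """Sum the joint probability over every possible state path."""
    return sum(
        path_probability(symbols, path)
        for path in product(STATES, repeat=len(symbols))
    )


def forward_probability(symbols):
    """Forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl."""
    forward = {k: START[k] * EMIT[k][symbols[0]] for k in STATES}
    for sym in symbols[1:]:
        forward = {
            l: EMIT[l][sym] * sum(forward[k] * TRANS[k][l] for k in STATES)
            for l in STATES
        }
    return sum(forward.values())


long_sequence = [1, 2, 3] * 4
for state in STATES:
    print(state, path_probability(long_sequence, [state] * len(long_sequence)))

print(enumerate_probability([1, 2, 3]), forward_probability([1, 2, 3]))
```

Enumeration visits 3^L paths for a sequence of length L, while the forward recursion needs only on the order of 9L multiplications. Both return the same total probability.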
III. Simulation of Crystal Structures by Molecular Dynamics (30pts)
Retrieve the pdb files 1pga.pdb and 1pgb.pdb for a small alpha/beta protein,
Protein G. Remove all water or other heteroatom entries from the files. Also remove all
non ATOM entries from the files. There is a 100 ps Molecular Dynamics simulation for
this protein. Its output file, test.out, trajectory file, test.crd, and the parameter
file, test.parm, are stored under the /tmp/final/ directory on the server.
(a) Make sure there are no water molecules or heteroatoms in your pdb files. Now use tleap to
generate a parameter file and restart file for each pdb entry. Do not use any water
molecules. Use ptraj to compute the backbone atom RMSD of the 100 ps trajectory
with each of the two restart files as reference. Please use all residues in the calculation.
Copy your ptraj input file into your answer. Plot the simulation time vs. backbone atom
RMSD for both calculations in one figure. Copy the figure into your solution. Is there any
significant difference between the two RMSD fluctuations?
(b) You can also compute the B factor of the simulated protein with ptraj. Please consult the
online manual to figure out how to do this. Please compute the B factor using the
backbone CA, C, N atoms and use the “byres” option. Plot the residue number vs. B
factor.
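For orientation, the quantity being plotted can be related to atomic fluctuations through the standard isotropic relation B = (8π²/3)⟨Δr²⟩. The sketch below illustrates that relation only. It is not a description of ptraj's implementation, it assumes the frames have already been superimposed on a common reference, and the toy coordinates are made up for illustration.

```python
import math


def mean_square_fluctuation(positions):
    """Mean square deviation of one atom from its average position.

    positions: list of (x, y, z) tuples in Å, one per trajectory frame,
    assumed to be superimposed on a common reference beforehand.
    """
    n_frames = len(positions)
    mean = [sum(p[axis] for p in positions) / n_frames for axis in range(3)]
    return sum(
        sum((p[axis] - mean[axis]) ** 2 for axis in range(3)) for p in positions
    ) / n_frames


def b_factor(positions):
    """Isotropic B factor in Å^2 from the mean square fluctuation."""
    return (8.0 * math.pi ** 2 / 3.0) * mean_square_fluctuation(positions)


# Hypothetical two-frame trajectory of a single atom (not real data).
toy_positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(f"B factor: {b_factor(toy_positions):.3f} Å^2")
```

A per-residue value, as requested with the “byres” option, would then be an average of such per-atom values over the selected backbone atoms of each residue.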
(c) B factor is also provided in both pdb entries. Please use the command getbfact
under /tmp/final to obtain a data file for residue number vs. B factor for each pdb
entry. Make sure your pdb file is without water and heteroatoms before you do this.
/tmp/final/getbfact [your_pdb_file_here] > [your_bfactor_data_file_here]
Plot the data files for both pdb entries. Compare the two crystal B factor plots and the
computational B factor plot. Note that the absolute B factor values are meaningless, so we
only need to compare the shapes of the three plots. Which crystal B factor plot agrees
better with the computational B factor plot?
(d) Read the paper provided, Gallagher et al, Two crystal structures of the B1
Immunoglobulin-binding domain of Streptococcal protein G and comparison with NMR,
Biochemistry, vol. 33, 4721-4729, 1994, to find out why your observation in (c) is the
case. Focus on the Results and Discussion sections. You don’t have to write too much; a few
sentences will do.
IV. Detecting Subtle Sequence Signals by Multiple Alignment (30pts)
Read the paper, Lawrence et al, Detecting subtle sequence signals: A Gibbs sampling
strategy for multiple alignment, Science, vol. 262, 208-214, 1993. Answer the questions
below. Note that you don’t have to understand a paper 100% to get the information you
want.
(a) Multiple sequence alignments can be divided into which two classes? Which class does this
paper focus on?
(b) The proposed method has three fundamental characteristics. What are these?
(c) What is the biological significance of gaps? Why does the current method avoid the
excessive use of gaps?
(d) What influences the occurrence probability of a particular residue at a particular
position of a sequence? As pointed out in the paper, patterns in closely related sequences
are relatively easy to locate. The current method is more interested in locating distantly
related sequence patterns. Why is the focus on the frequencies of each residue within the
pattern a natural choice to achieve this goal?
Now some background on the optimization procedure used in this paper. Recall that we
have used the Metropolis Monte Carlo method to obtain an equilibrium distribution of a system.
Note also that when high temperatures are used (i.e. KBT is large), a Metropolis
Monte Carlo simulation can easily cross energy barriers. This observation is frequently
exploited in designing optimization methods. The Gibbs sampler proposed in this paper is
very similar to the Metropolis Monte Carlo method. You don’t have to understand the
“predictive update” feature of the method, which is used to further enhance the sampling
efficiency. Let’s first understand the basic algorithm: identification of a single sequence
pattern of fixed width.
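The remark about temperature can be made concrete with the standard Metropolis acceptance rule, in which an uphill move of size ΔE is accepted with probability exp(-ΔE/KBT). The sketch below is a generic illustration. The 3 kcal/mol step and the elevated value KBT = 3.0 kcal/mol are arbitrary choices for the example, not values from the paper or the lecture notes.

```python
import math
import random


def acceptance_probability(delta_e, kbt):
    """Metropolis probability of accepting a move with energy change delta_e."""
    if delta_e <= 0.0:
        return 1.0  # downhill moves are always accepted
    return math.exp(-delta_e / kbt)


def metropolis_accept(delta_e, kbt, rng=random):
    """Draw one accept/reject decision for a proposed move."""
    return rng.random() < acceptance_probability(delta_e, kbt)


uphill_step = 3.0  # kcal/mol, arbitrary illustrative barrier step
for kbt in (0.6, 3.0):
    print(f"KBT = {kbt}: accept uphill step with p = "
          f"{acceptance_probability(uphill_step, kbt):.4f}")
```

Raising KBT makes the same uphill step far more likely to be accepted, which is why high-temperature sampling crosses barriers more readily.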
(e) The first data structure is the pattern description, a list of residue frequencies, q_{i,1}, …, q_{i,20},
for positions i = 1 to W, the width of the pattern. Why do we need 20 of them at each position?
Why do we need a list of “background frequencies”, p_1, …, p_20? The second data structure
is used to store the positions of residues in each sequence. What is the objective of the
algorithm?
(f) Given q_{i,1}, …, q_{i,20} in step 1, how do we calculate Q_x of segment x? Given p_1, …, p_20,
how do we compute P_x?
(g) Given the above two probabilities, we can compute the weight for segment x, A_x.
This weight is similar to the probability distribution of a state in the original Metropolis
Monte Carlo. Now review the lecture notes on Metropolis Monte Carlo. Why does the
Metropolis Monte Carlo method tend to visit states of high probability,
i.e. lower-energy states? How does the Gibbs sampler sample segments of
higher weight during a search?
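The quantities in (f) and (g) can be sketched as follows. This is one reading of the sampling step in the paper: Q_x is taken as the product of the pattern frequencies of the residues in segment x, P_x as the product of their background frequencies, and A_x as the ratio Q_x/P_x, with a segment then drawn at random in proportion to its weight. The two-letter alphabet, the frequencies, and the function names are toy assumptions to keep the example short. The real method uses 20 residue types.

```python
import random


def pattern_probability(segment, q):
    """Q_x: product over pattern positions i of q[i][residue at position i]."""
    prob = 1.0
    for i, residue in enumerate(segment):
        prob *= q[i][residue]
    return prob


def background_probability(segment, p):
    """P_x: product of background frequencies of the residues in the segment."""
    prob = 1.0
    for residue in segment:
        prob *= p[residue]
    return prob


def segment_weights(sequence, q, p):
    """A_x = Q_x / P_x for every possible start position in the sequence."""
    width = len(q)
    weights = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        weights.append(
            pattern_probability(segment, q) / background_probability(segment, p)
        )
    return weights


def sample_start(weights, rng=random):
    """Draw a start position with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Toy two-letter alphabet and width-2 pattern (hypothetical values).
background = {"A": 0.5, "B": 0.5}
pattern = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]
weights = segment_weights("BABB", pattern, background)
print(weights, sample_start(weights))
```

Because the draw is random rather than a strict maximum, low-weight segments are still chosen occasionally, much as a Metropolis simulation sometimes accepts an uphill move.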
(h) Suppose that pseudocounts are given. Do we need to use a training set of sequences to
obtain any parameters needed in this method? Why?
(i) What is the “phase” problem? What is the analogue of this problem in a Metropolis
Monte Carlo sampling of a molecular system? The paper proposes a way to work around
it. Do you have a better way to work around it, given the background information on the
optimization method above?
(j) How can different pattern widths be dealt with in the most straightforward way?
What is the difficulty of doing so? What is the reason for this difficulty?