Copyright © 2003 Dr. Ray Luo. All rights reserved. Unless otherwise indicated, all materials on these pages are copyrighted by Dr. Ray Luo. No part of these pages, either text or image, may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system, or retransmission, in any form or by any means, electronic, mechanical, or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

BS123A/MB223 Introduction to Computational Biology
Final Exam
Due 4:00pm Dec 5th, 2003

Note: This is an open-book/open-note final exam. You may use any resources in the library, in the computer lab, and on the web. However, you must work out your solutions independently. Type your solutions in an MS Word file and email the file to rluo@uci.edu, just as you did for your projects. Please use your name as the file name and write your name at the beginning of the file. Absolutely no late submissions! If you can't email the file, please drop it off at NSI3206 by the due time.

I. Molecular Mechanics and Its Applications in Biochemistry (20pts)

(a) Suppose you are interested in the mechanism of a noncovalent interaction between a natural ligand and its receptor. Given that you have determined the structure of the complex, you would like to know what drives complex formation, i.e. which energy component, polar or nonpolar interactions, is important. To achieve this goal, you need to perform several computational mutagenesis studies that mutate key residues in the receptor active site. The first residue to be analyzed is a charged residue, aspartic acid, because its salt-bridge interaction with a nearby lysine residue is perturbed by ligand binding.
Suppose their interaction can be modeled by two dummy atoms with charges of +e and -e, respectively, separated by a distance of 3 Å, and suppose their van der Waals parameters are A (= 4 kcal/mol) and B (= 3 kcal/mol). How does the charge-charge interaction scale with distance? How does the van der Waals interaction scale with distance? Please write down the formulas for the Coulombic and van der Waals energies of the two dummy atoms to show this. Suppose the perturbation caused by ligand binding moves the two dummy atoms farther apart, to a distance of 4 Å. What is the change in Coulombic energy upon this perturbation? Note that 1.0 e²/Å is equal to 332 kcal/mol. What is the change in van der Waals energy upon this perturbation?

(b) Suppose you have studied the biochemical reaction of the ligand within the receptor and found that it is a two-state reaction, i.e. there are the reactant state (REAC), the product state (PROC), and a transition state (TS) somewhere along the reaction coordinate. The nice thing about the receptor acting as an enzyme is that it dramatically lowers the transition state energy barrier, ΔG = G_TS - G_REAC, from 20 kcal/mol to 5 kcal/mol. Suppose that the reaction rate is directly related to the energy barrier as exp(-ΔG/k_BT) (k_BT = 0.6 kcal/mol). Compute the acceleration of the reaction by the enzyme.

(c) Suppose you ran a short Metropolis Monte Carlo simulation at 300 K (k_BT = 0.6 kcal/mol) and obtained the following energies (kcal/mol): -3.0, -3.0, -2.5, and -4.0. Compute the average energy.

(d) Suppose you ran a short Molecular Dynamics simulation at 300 K (k_BT = 0.6 kcal/mol) and obtained the following energies (kcal/mol) during the equilibrium phase: -1.0, -3.0, -4.0, and -2.0. Compute the average energy. What assumption is used when computing the average energy with Molecular Dynamics?
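The distance dependence asked about in part (a) can be explored numerically. The following is a minimal sketch, not a model answer: it assumes the common 12-6 form A/r^12 - B/r^6 for the van der Waals term (the exam text does not state the functional form), uses the conversion factor of 332 kcal/mol per e²/Å given in the note above, and all function names are illustrative.

```python
# Illustrative sketch of pairwise nonbonded energies.
# ASSUMPTION: van der Waals energy has the 12-6 form A/r**12 - B/r**6.
# Function names are hypothetical and not taken from any package.

COULOMB_CONSTANT = 332.0  # kcal/mol for 1.0 e^2/Angstrom, as given in the exam


def coulomb_energy(q1, q2, r):
    """Coulomb energy (kcal/mol) of charges q1, q2 (units of e) at distance r (Angstrom)."""
    return COULOMB_CONSTANT * q1 * q2 / r


def vdw_energy(a, b, r):
    """12-6 van der Waals energy: repulsive a/r^12 minus attractive b/r^6."""
    return a / r**12 - b / r**6


def energy_change(energy_fn, r_start, r_end):
    """Energy difference when the pair distance changes from r_start to r_end."""
    return energy_fn(r_end) - energy_fn(r_start)


if __name__ == "__main__":
    # Doubling the distance shows the different scaling of the two terms:
    # the Coulomb term falls off as 1/r, the attractive vdW term as 1/r^6.
    for r in (1.0, 2.0, 4.0):
        print(r, coulomb_energy(1.0, -1.0, r), vdw_energy(0.0, 1.0, r))
```

Passing a one-argument function to `energy_change`, for example `lambda r: coulomb_energy(1.0, -1.0, r)`, keeps the same helper usable for either energy term.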
(e) Consider a system of three states with energies (kcal/mol) of -2.1, -3.2, and -4.0. Compute the average energy at 300 K (k_BT = 0.6 kcal/mol).

II. Hidden Markov Model (20pts)

Consider the following HMM for secondary structure prediction. The three states are helical (1), beta (2), and coil (3). The letters are drawn from a subset of only 3 amino acids, labeled 1, 2, and 3 below.

The transition probabilities are:
a11 = 0.4, a12 = 0.2, a13 = 0.4;
a21 = 0.2, a22 = 0.3, a23 = 0.5;
a31 = 0.3, a32 = 0.3, a33 = 0.4.

The emission probabilities are:
e1(1) = 0.3, e1(2) = 0.3, e1(3) = 0.4 for state 1;
e2(1) = 0.4, e2(2) = 0.4, e2(3) = 0.2 for state 2;
e3(1) = 0.3, e3(2) = 0.5, e3(3) = 0.2 for state 3.

Assume all states have the same chance to start, i.e. a01 = a02 = a03 = 1/3.

(a) Given the sequence 123123123123: if we parse the sequence as all helical, what is the probability? What if it is all beta? All coil?

(b) Compute the probability of generating the sequence 123 by enumerating all possible state paths.

(c) Use the forward algorithm to compute the same probability.

III. Simulation of Crystal Structures by Molecular Dynamics (30pts)

Retrieve the pdb files 1pga.pdb and 1pgb.pdb for a small alpha/beta protein, Protein G. Remove all water and other heteroatom entries from the files. Also remove all non-ATOM entries from the files. A 100 ps Molecular Dynamics simulation has been run for this protein. Its output file, test.out, trajectory file, test.crd, and parameter file, test.parm, are stored under the /tmp/final/ directory on the server.

(a) Make sure there are no water molecules or heteroatoms in your pdb files. Now use tleap to generate a parameter file and a restart file for each pdb entry. Do not use any water molecules. Use ptraj to compute the backbone atom RMSD of the 100 ps trajectory with each of the two restart files as the reference. Please use all residues in the calculation. Copy your ptraj input file into your answer. Plot the simulation time vs.
backbone atom RMSD for both calculations in one figure. Copy the figure into your solution. Is there any significant difference between the two RMSD fluctuations?

(b) You can also compute the B factor of the simulated protein with ptraj. Please consult the online manual to figure out how to do this. Compute the B factor using the backbone CA, C, and N atoms and use the "byres" option. Plot the residue number vs. B factor.

(c) B factors are also provided in both pdb entries. Please use the command getbfact under /tmp/final to obtain a data file of residue number vs. B factor for each pdb entry. Make sure your pdb file contains no water or heteroatoms before you do this.

/tmp/final/getbfact [your_pdb_file_here] > [your_bfactor_data_file_here]

Plot the data files for both pdb entries. Compare the two crystal B factor plots with the computed B factor plot. Note that the absolute B factor values are meaningless, so we only need to compare the shapes of the three plots. Which crystal B factor plot agrees better with the computed B factor plot?

(d) Read the paper provided, Gallagher et al, Two crystal structures of the B1 Immunoglobulin-binding domain of Streptococcal protein G and comparison with NMR, Biochemistry, vol. 33, 4721-4729, 1994, to find out why your observation in (c) is the case. Focus on the Results and Discussion sections. You don't have to write much; a few sentences will do.

IV. Detecting Subtle Sequence Signals by Multiple Alignment (30pts)

Read the paper, Lawrence et al, Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment, Science, vol. 262, 208-214, 1993. Answer the questions below. Note that you don't have to understand a paper 100% to get the information you want.

(a) Multiple sequence alignments can be divided into which two classes? Which class does this paper focus on?

(b) The proposed method has three fundamental characteristics. What are they?

(c) What is the biological significance of gaps?
Why does the current method avoid the excessive use of gaps?

(d) What influences the occurrence probability of a particular residue at a particular position of a sequence? As pointed out in the paper, patterns in closely related sequences are relatively easy to locate. The current method is more concerned with locating patterns in distantly related sequences. Why is focusing on the frequencies of each residue within the pattern a natural choice for achieving this goal?

Now some background on the optimization procedure used in this paper. Recall that we have used Metropolis Monte Carlo to obtain an equilibrium distribution of a system. Recall also that when high temperatures (i.e. large k_BT) are used, a Metropolis Monte Carlo simulation can easily cross energy barriers. This observation is frequently exploited in designing optimization methods. The Gibbs sampler proposed in this paper is very similar to the Metropolis Monte Carlo method. You don't have to know the "predictive update" feature, which is used to further enhance the sampling efficiency. Let's first understand the basic algorithm: identification of a single sequence pattern of fixed width.

(e) The first data structure is the pattern description, a list of residue frequencies, q_i,1, ..., q_i,20, for positions i from 1 to W, the width of the pattern. Why do we need 20 of them at each position? Why do we need a list of "background frequencies", p_1, ..., p_20? The second data structure is used to store the positions of the pattern in each sequence. What is the objective of the algorithm?

(f) Given q_i,1, ..., q_i,20 in step 1, how do we calculate Q_x for segment x? Given p_1, ..., p_20, how do we compute P_x?

(g) Given the above two probabilities, we can compute the weight of segment x, A_x. This weight is similar to the probability distribution of a state in the original Metropolis Monte Carlo.
Now review the lecture notes on Metropolis Monte Carlo. Why does the Metropolis Monte Carlo method tend to visit states of high probability, i.e. lower energy states? How does the Gibbs sampler sample segments of higher weight during a search?

(h) Suppose that pseudocounts are given. Do we need a training set of sequences to obtain any parameters needed in this method? Why?

(i) What is the "phase" problem? What is the analogue of this problem in Metropolis Monte Carlo sampling of a molecular system? The paper proposes a way to work around it. Can you suggest a better way to work around it, given the background information on the optimization method above?

(j) What is the most straightforward way to deal with patterns of different widths? What is the difficulty of doing so? What is the reason for this difficulty?
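The sampling step referred to in parts (f) and (g) can be sketched in code. This is a toy illustration under stated assumptions, not the authors' implementation: it takes the segment weight to be the ratio of the pattern probability Q to the background probability P, uses a two-letter alphabet instead of 20 amino acids, omits pseudocounts and the predictive update, and all names and frequency values are hypothetical.

```python
import random

# Toy sketch of the segment-weighting and sampling step of a Gibbs sampler.
# ASSUMPTIONS: weight = Q / P; two-letter alphabet; made-up frequencies.


def segment_weights(sequence, pattern_freqs, background_freqs):
    """Return one weight per possible start position of a width-W segment.

    pattern_freqs: list of W dicts, residue -> frequency at that pattern position.
    background_freqs: dict, residue -> background frequency.
    """
    width = len(pattern_freqs)
    weights = []
    for start in range(len(sequence) - width + 1):
        q_x = 1.0  # probability of the segment under the pattern model
        p_x = 1.0  # probability of the segment under the background model
        for i, residue in enumerate(sequence[start:start + width]):
            q_x *= pattern_freqs[i][residue]
            p_x *= background_freqs[residue]
        weights.append(q_x / p_x)
    return weights


def sample_start(weights, rng):
    """Draw a start position with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    pattern = [{"A": 0.9, "B": 0.1}, {"A": 0.1, "B": 0.9}]  # hypothetical W = 2 pattern
    background = {"A": 0.5, "B": 0.5}
    weights = segment_weights("AABB", pattern, background)
    print(weights, sample_start(weights, random.Random(0)))
```

Because positions are drawn at random in proportion to their weights rather than always taking the largest weight, lower-weight segments are still visited occasionally, which is what lets the search move away from a poor alignment.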