BCB 444/544 Fall 07 Oct 26 Exam 2

BCB 444/544 - F07 Exam 2 (100 pts)
Name: KEY – Please note Revised Answers for Questions C2 & F

A. Cells, Nucleus, Genetic Code, Transcription Factors, etc. (15 pts TOTAL)

Below the DNA sequence shown, write the RNA sequence that would be transcribed from the top strand of DNA, assuming it is copied completely from the 3' to the 5' end. On the RNA sequence, circle the START and STOP codons for translation. Translate the RNA sequence into amino acids and write the deduced protein sequence below, too.

DNA      3'-TATATCGTACAGACGATTTTGATCTATTGACTGG-5'
         5'-ATATAGCATGTCTGCTAAAACTAGATAACTGACC-3'
RNA      5'-AUAUAGCAUGUCUGCUAAAACUAGAUAACUGACC-3'
         (START codon: AUG; STOP codon: UAA)
Protein: NH2 – Met - Ser - Ala - Lys - Thr - Arg - Stop

Suppose the DNA sequence in another individual is different, due to a single base-pair deletion, resulting in the sequence shown below. What effects (if any) would you expect this SNP to have on the sequence and function of the expected protein product?

DNA      3'-TATATCGTACGACGATTTTGATCTATTGACTGG-5'
         5'-ATATAGCATGCTGCTAAAACTAGATAACTGACC-3'
RNA      5'-AUAUAGCAUGCUGCUAAAACUAGAUAACUGACC-3'
         (START codon: AUG; STOP codon: UGA)
Protein: NH2 – Met - Leu - Leu - Lys - Leu - Asp - Asn - Stop

The single base deletion caused a frameshift, resulting in a completely different protein sequence after the initial Met. I would expect this mutation to make a protein that does not have the same function as the one above, and it may not function at all.

B. HMM (20 pts TOTAL)

Consider the occasionally dishonest casino example discussed in class. The system has 3 states:
B denotes the start state
F denotes the state when a fair die is used
L denotes the state when a loaded die is used

The transition probabilities between these states are shown in the diagram (the worked answers below use B→F = B→L = 1/2, F→F = 0.99, F→L = 0.01, L→L = 0.8, L→F = 0.2). The emission probabilities are:
for state F, eF(1) = eF(2) = … = eF(6) = 1/6
for state L, eL(1) = eL(2) = … = eL(5) = 0.1, eL(6) = 0.5

1.
What is the most probable sequence of states, starting from state B, to produce the sequence of die tosses 1,6? For full credit, you must show your work and fill in the table below.

State   Start   Toss 1              Toss 6
B       1       0                   0
F       0       1/2 * 1/6 = 1/12    max {1/12 * 0.99 * 1/6 = 0.01375, 1/20 * 0.2 * 1/6 = 0.001667} = 0.01375
L       0       1/2 * 1/10 = 1/20   max {1/12 * 0.01 * 1/2 = 0.000417, 1/20 * 0.8 * 1/2 = 0.02} = 0.02

The larger final value (0.02) is in state L, and it was reached from state L at toss 1.
The most probable sequence of states is: B – L – L.

2. What is the total probability of the sequence 1,6? Show your work and fill in the table below.

State   Start   Toss 1              Toss 6
B       1       0                   0
F       0       1/2 * 1/6 = 1/12    sum {1/12 * 0.99 * 1/6 = 0.01375, 1/20 * 0.2 * 1/6 = 0.001667} = 0.015417
L       0       1/2 * 1/10 = 1/20   sum {1/12 * 0.01 * 1/2 = 0.000417, 1/20 * 0.8 * 1/2 = 0.02} = 0.020417

The total probability of the sequence 1,6 is: 0.015417 + 0.020417 = 0.035833

C. Motifs, Domains, Structure & Structure Prediction (10 pts TOTAL)

C1. (2 pts) Why is a profile more sensitive and flexible than a PSSM for detecting sequence motifs in proteins?

A profile allows for gaps in the sequence while a PSSM does not.

C2. (4 pts) Briefly explain the roles of base-pairing vs base-stacking interactions in RNA structure prediction.

Both base-pairing and base-stacking interactions stabilize RNA structures, but the major energetic contribution is from base-stacking. Most prediction methods attempt to identify potential structures by optimizing base-pairing, but the energies and ranking of potential structures are calculated based on nearest-neighbor base-stacking interactions.

C3. (2 pts) Suggest one physical explanation for the fact that secondary structure prediction algorithms can more accurately identify alpha-helical segments than beta-strands/sheets.

Secondary structure prediction algorithms more accurately identify alpha-helical segments because they are local structures; that is, they are stabilized by hydrogen bonds between near neighbors in an amino acid sequence.
Beta-strands/sheets are harder to predict because they involve long-range interactions stabilized by hydrogen bonds between portions of the sequence that can be far apart.

C4. (2 pts) According to the paper by Ginalski et al., why are meta predictors better than individual methods for predicting the tertiary structure of proteins?

Meta predictors benefit from being able to choose between multiple predicted conformations. They are able to identify the structure (or portions of a structure) that occurs more often than expected in the set of structures returned. This procedure gives meta predictors a significant advantage over individual methods and results in better models.

D. Longer answers/problems (20 pts TOTAL)

D1. (10 pts)

a) Briefly outline the steps used to predict a protein structure by threading.

1. Align the target sequence with all template structures.
2. Calculate the energy score to evaluate how well the sequence fits on the structure.
3. Rank models based on energy scores.

b) What were the key "simplifications" exploited in the Ho method to make it fast enough to be used for genome-wide threading?

1. Simplify the template structure representation by using the contact matrix (2D) instead of the full 3D structure. This representation is further reduced by considering only the dominant eigenvector of the contact matrix, resulting in a one-dimensional representation of the template structure.
2. Simplify the target sequence representation by using the Li-Tang-Wingreen representation.
3. Simplify the energy function by only counting the number of contacts between hydrophobic residues.

D2. (10 pts) Given a single RNA sequence, GGCGCGGCACCGUCCGCGGAACAAACGG, we predict the structure shown below. We then perform a database search and discover 5 homologous sequences and align them.
The MSA is shown below with nucleotides that are base-paired in the structure highlighted and numbered so that base-paired positions have the same number above them.

a) Based on the information in this MSA, is our predicted structure likely to be correct? Explain.

Maybe. Most likely, the base pair at position 3 below does not form in all of the other sequences. Also, all of the other sequences can form base pairs with the bases immediately before and after the numbered region – see the UC and the GA bases that are present in all other sequences besides our first sequence.

b) Are there any base-pairs in the predicted structure that are unlikely to form in structures corresponding to the additional sequences in the MSA? Explain.

Yes, the base pair at position 3 is not likely to form because the bases in all of the other sequences vary but do not conserve the ability to base pair.

  12345      54321
GGCGCGGCACCGUCCGCGGAACAAACGG
UCCGGGUCACCGUACGCGGAACAAACGG
UCCGUGCCACCGUGCGCGGAACAAACGG
UCCGUGACACCGUUCCCGGAACAAACGG
UCCGAGUCACCGUACGCGGAACAAACGG
UCCGCGUCACCGUACGCGGAACAAACGG

E. Molecular Biology & Bioinformatics Terms (20 pts TOTAL)

(1 pt each) Fill in the box beside each definition with one term or acronym that corresponds to the definition provided. (Some have more than one correct answer.)

Term                                           Definition
E1.  Viterbi algorithm                         Algorithm used to determine the most probable path for a sequence of observed variables from a HMM
E2.  GOR V, PHD, PSIPRED, etc.                 A program for predicting protein secondary structure
E3.  CpG island                                Genomic region with a high frequency of CG dinucleotides, likely to be near the transcription start site for genes
E4.  CATH, SCOP                                A protein structure classification database
E5.  PyMol, MolMol, Cn3D, etc.                 A program for visualizing protein structures
E6.  CASP                                      A protein structure prediction "contest"
E7.  Cytoskeleton                              Internal structural scaffold that organizes the cytoplasm in eukaryotic cells
E8.  Peptide bond                              Type of covalent bond that links amino acids in a polypeptide chain
E9.  Protein domain                            Independent structural or functional unit of a protein
E10. NMR spectroscopy, X-ray crystallography   Experimental method for determining the 3-D structure of a macromolecule

(2 pts each) Short answer: Answer each of the following questions (one phrase or sentence should be sufficient in most cases).

E11. What is the difference between a protein motif and a protein domain?

A protein motif is a short, conserved sequence pattern. A protein domain is a larger unit, corresponding to a longer sequence that usually represents a functionally or structurally independent unit (and may contain one or more motifs).

E12. What is "hidden" in a hidden Markov model?

The actual state is hidden. We can’t "see" the underlying state that emits the observed variables.

E13. What is meant by co-variation in the context of RNA structure prediction?

Co-variation is used to determine which residues are more likely to be base-paired. In a MSA, if every time one position changes from an A to a C, another position changes from a U to a G, those two positions are likely to be base-paired because the mutations/variations change the sequence of the RNA but conserve the ability to base-pair.

E14. Name the 3 basic computational approaches for protein structure prediction.

Homology modeling, threading, and ab initio.

E15. Which experimental protein structure determination method can provide information about protein dynamics?

NMR spectroscopy

F. Something a bit more thought provoking!! (10 pts TOTAL)

You are given a "mystery" gene sequence (M gene) from a newly discovered bacterium. The M gene has only one large open reading frame (ORF), which would encode a protein ~200 amino acids in length.
Outline and briefly describe the types of computational analyses you would perform to try to annotate this gene and its potential product(s). Be sure to provide relevant details and explain how you would proceed if a proposed approach does not provide useful information. You should begin like this -- but please replace underlined bits and "x"s with your own words:

First, I would perform a "xBLASTx" search using the M gene sequence to query the "x" database. If no hits with e-values better than "x" are returned using default parameters, I would….

1) BLAST; if no significant hits, adjust default parameters, including substitution matrices
2) If still no hits, try PSI-BLAST to identify potential remote homologs
3) If still no hits, search for protein sequence or portions of it in Pfam & other protein family databases
4) Try to predict protein structure – protein may have function related to template protein chosen by homology modeling or threading program
5) Search for:
   a. functional motifs or domains, e.g., kinase or DNA binding motifs
   b. potentially co-regulated groups of proteins, based on microarray or proteomics data
   c. references to the protein in the literature
Other ideas?

G. The Question I Didn't Ask (5 pts TOTAL)

Describe something you have learned about an ISU scientist or research project, somehow related to bioinformatics or computational biology (from lectures, reading, labs, seminars), which you think is worth 5 pts!

We have mentioned quite a few of these in class. Some examples are the Ho group’s threading method for protein structure prediction, the Jernigan group’s GOR V, CDM, and FDM methods for protein secondary structure prediction, and the Brendel group’s methods for gene prediction. There are lots of other possibilities as well.

Genetic Code Table
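
The transcription and translation asked for in Part A can be checked mechanically against the genetic code. The following is a minimal Python sketch, not part of the original exam. The names CODON_TABLE, THREE_LETTER, transcribe, and translate are illustrative. The table is the standard genetic code, and the sketch assumes that translation begins at the first AUG in the RNA and ends at the first in-frame stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order
# (first base, then second, then third). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def transcribe(template_3_to_5):
    """Return the RNA (5'->3') copied from a DNA template written 3'->5'."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_3_to_5)


def translate(rna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(rna) - 2, 3):
        residue = THREE_LETTER[CODON_TABLE[rna[i:i + 3]]]
        protein.append(residue)
        if residue == "Stop":
            break
    return protein


# Top (template) strands from Part A, written 3'->5' as on the exam.
normal = "TATATCGTACAGACGATTTTGATCTATTGACTGG"
deleted = "TATATCGTACGACGATTTTGATCTATTGACTGG"

for name, dna in (("normal", normal), ("deletion", deleted)):
    rna = transcribe(dna)
    print(name, rna, "-".join(translate(rna)))
# normal   ... Met-Ser-Ala-Lys-Thr-Arg-Stop
# deletion ... Met-Leu-Leu-Lys-Leu-Asp-Asn-Stop
```

Because the template strand is written 3'->5', base-by-base complementation already yields the RNA in 5'->3' order, so no reversal is needed. The deletion shifts every codon after AUG by one base, which is why the two protein sequences differ from the second residue onward.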