BCB 444/544 Fall 07 Study Guide #2 - KEY - Oct 24

BCB 444/544 - F07 Study Guide #2 – KEY (Partial Answers)
For Exam 2 (Oct 26)
Complete answers will be discussed in the Review Session on Thurs Oct 25

General comments
• Exam 2 will cover all topics covered in class, lab and assigned readings:
    • Lectures 13 - 26 (Wed Sept 19 thru Mon Oct 22)
    • Labs 5 - 8
    • HW# 3 & 4 (not 5)
    • All assigned reading & URLs indicated in PPTs, including:
        Xiong: Chps 6 (beginning with HMMs), 7, 8, 12 - 15 (not 10 & 11)
        Eddy: What is a hidden Markov Model?
        Ginalski: Practical Lessons from Protein Structure Prediction
• This study guide covers ~90% of material important for Exam 2 - no guarantees about other 10%!
• Exam 2 will be a closed-book, closed-notes, 50-minute exam.
• Some questions will involve computation; bring your calculators if you like.
• All required formulae or tables will be provided.
• Some questions will require short essay-like answers that demonstrate your understanding of key concepts covered in the course.

Topics & Study Questions:
• Review: Nucleus, Chromosomes, Genes
    o Name two primary differences in the organization of eukaryotic versus prokaryotic cells.
• Review: RNA, Proteins, Promoters, Transcription factors
    o What is the relationship between protein sequence – structure – function?
    o Eukaryotic gene structure: Introns vs Exons
    o Regulation of gene expression
        • What is a promoter? An enhancer?
        • What is a transcription factor?
• PSSMs, Profiles & HMMs
    o What are sequence logos?
    o What is the main difference between a PSSM and a profile?
    o How do you calculate the probability of a given path through an HMM?
    o How do you calculate the most probable path through an HMM?
    o How do you calculate the total probability of an observed sequence from an HMM?
• Protein Structure: Basics, Visualization, Classification
    o Amino acid properties: Why is Glycine "special"? Proline? Cysteine?
    o What are the 4 main levels of protein structure?
    o What are 3 main types of secondary structural elements?
    o Name 2 databases in which proteins are categorized on the basis of their structural class.
    o What is the major database for protein structure information?
    o Name 2 protein structure visualization tools.
    o What are the 2 main methods used to obtain high-resolution experimental structures of proteins?
• Protein Secondary Structure Prediction
    o Why are different programs needed to predict the secondary structures of globular vs. membrane proteins?
    o What features of membrane proteins make them "easier" targets for protein secondary structure prediction?
• Protein Tertiary Structure & Prediction
    o What is meant by the "protein folding" problem?
    o What is meant by the "inverse folding" problem?
    o What is the primary goal of the Structural Genomics Project?
    o What are the 3 major methods for protein tertiary structure prediction?
    o What is the CASP contest?
    o Name 1 program for protein structure comparison.
• RNA Structure & Prediction
    o List 4 different types of functional RNAs.
    o Why are there so many different types of base-pairs in RNA structures?
    o What types of bonds/molecular interactions are primarily responsible for stabilizing RNA structures?
    o What are 3 main methods for RNA structure prediction?
    o How is covariation used in RNA structure prediction?
    o What type of protein structure prediction method was developed by KaiMing Ho's group at ISU?
• Gene Structure and (just a little bit on) Prediction
    o Name 3 differences in the structural features of genes in eukaryotes vs prokaryotes.
    o Name 4 DNA sequence "signals" in genes that are often used in computational gene prediction approaches.

Sample Questions/Problems

1. Label each of the following statements either True (T) or False (F).

   a. F   The cytoskeleton in eukaryotic cells organizes the extracellular space.
          [Corrected: the extracellular matrix organizes the extracellular space.]
   b. T   The process by which information in RNA is used to make proteins is called translation.
   c. T   The process by which information in DNA is copied into RNA is called transcription.
   d. F   Peptide bonds are both planar and flexible.
          [Corrected: peptide bonds are both planar and rigid.]
   e. F   Enzymes that catalyze reactions in the cell are always proteins.
          [Corrected: enzymes are not always proteins.]
   f. F   Protein interactions are not required for the functions of most proteins.
   g. F   An exon is a segment of a eukaryotic gene that does not encode protein.
          [Corrected: an exon is a segment that encodes protein.]
   h. T   In eukaryotes, one gene can sometimes encode several proteins.
   i. T   Transcription factors are proteins that often bind specific DNA sequences and promote the initiation of transcription.
   j. T   Non-coding RNAs can play important roles in cells, even though they do not produce proteins.

2. Short answer questions

a. Briefly describe how PSSMs and profiles are generated, and how they differ. Name two applications for which either a PSSM or a profile could be used. Why are PSSMs or profiles used (i.e., what advantage do they confer over a single consensus sequence)?

   PSSMs and profiles are generated from multiple sequence alignments (MSAs) of similar sequences. Both capture the frequency of amino acid substitutions observed at each position of the protein sequence in the MSA. The main difference between a PSSM and a profile is that a profile allows gaps and a PSSM does not.

   PSSMs or profiles can be used to identify and characterize protein families, discover conserved regions of a sequence, generate consensus sequences, find conserved sequences in a database, etc.

   The advantage of PSSMs and profiles over a consensus sequence is that they store information about the relative importance of each position of the sequence, which cannot be captured by a single sequence.

b. What are the 4 main hierarchical levels of protein structure?
What types of bonds (covalent or non-covalent) are most important for stabilizing structure at each level?

   Primary structure – linear sequence of amino acids; stabilized by covalent bonds (peptide bonds between amino acids)
   Secondary structure – alpha-helices, beta-strands & loops formed via short-range interactions between amino acids; stabilized by non-covalent bonds (mostly hydrogen bonds)
   Tertiary structure – overall 3-D fold of a single polypeptide chain; stabilized mainly by non-covalent bonds (disulfide bonds are the exception here: they are covalent bonds that stabilize tertiary structure in some proteins)
   Quaternary structure – interactions between different polypeptide chains or subunits to form a functional multi-subunit protein; stabilized by non-covalent bonds

c. What is the difference between a protein motif and a protein domain?

   A protein motif is a short, conserved sequence pattern. A protein domain is a larger unit, corresponding to a longer sequence, that usually represents a functionally or structurally independent unit (and may contain one or more motifs).

d. What is an HMM? Why is it more "powerful" and "flexible" than either a PSSM or a profile?

   An HMM is a hidden Markov model. HMMs are mathematical models with special properties, including “hidden” states. In our toy example of the fair and loaded dice, the hidden state was whether the fair or loaded die was being tossed. We cannot see the hidden state, only the observed variables; in the die example, we only see what number is rolled.

   One reason an HMM is more powerful and flexible than either a PSSM or a profile is that HMMs explicitly include and distinguish between insertions and deletions, whereas PSSMs do not allow gaps, and profiles allow gaps but do not distinguish insertions from deletions.

3. "Problem Solving" & Short Essay

A. Protein Structure Basics & Prediction

According to Ginalski et al., 2005, what are the most important problems that must be addressed to improve protein tertiary structure prediction?

Some important areas for improvement are:

1. Increasing the number of known protein structures through structural genomics projects. This will be the most helpful advance for protein structure prediction because homology modeling and threading rely on known structures.
2. Improving accuracy of energy functions. Ab initio methods, especially, are hampered by inaccurate energy functions and the huge search space of possible conformations. Improvements in both ab initio and comparative methods will require better energy functions, increased computational power, and more efficient algorithms for exploring the conformational search space.
3. Improving structure comparison methods. The difficulty in comparing predicted structures with actual structures is another major problem at present. It is hard to improve your prediction method if you cannot accurately determine where you were wrong in the first place.

B. RNA Structure Basics & Prediction

What are the three main approaches to RNA secondary structure prediction? Explain the advantages and disadvantages (if any) of each.

1. Ab initio –
    • Advantage – requires only a single RNA sequence.
    • Disadvantage – relies on finding the minimum free energy structure based on energy models that may not be accurate.
2. Comparative –
    • Advantage – uses multiple sequences to find covariation information or consensus structures between sequences.
    • Disadvantage – still relies on energy models, and the multiple sequences provided as input must be biologically related; in other words, if you give these programs irrelevant information as input, they cannot produce good output.
3. Combined Computational and Experimental –
    • Advantage – can be much more accurate due to experimental evidence.
    • Disadvantage – requires wet-lab experiments that can be expensive and time-consuming, or digging through the literature to find experimental evidence that cost somebody else a lot of time and money.

C. Gene Structure Basics & Prediction

Using this hidden Markov model, calculate the most probable path for sequence ACTG.

[Figure: HMM with states Start, E, 5, I and End]

Your probability table will begin like this (rows: sequence positions; columns: states):

Position  Start  E                                 5                              I   End
Begin     1      0                                 0                              0   0
A         0      1 * 0.25 = 0.25                   0                              0   0
C         0      0.25 * (0.9 * 0.25) = 0.056       0.25 * 0.1 * 0 = 0             0   0
T         0      0.056 * (0.9 * 0.25) = 0.0126     0.056 * 0.1 * 0 = 0            0   0
G         0      0.0126 * (0.9 * 0.25) = 0.0028    0.0126 * 0.1 * 0.95 = 0.0012   0   0
End       0      0                                 0                              0   0

Sorry about this problem! Because of the structure of this HMM, we cannot make it all the way from Start to End with this sequence. We can’t even make it to state I; we can get to either E or 5. The most probable path for this sequence through this model is: Start - E - E - E - E, and then we are stuck because we can’t get all the way to End!

D. Transcription/translation:

Below the DNA sequence shown, write the RNA sequence that would be transcribed from the top strand of DNA, assuming it is copied completely from the 3' to the 5' end. On the RNA sequence, circle the START and STOP codons for translation. Translate the RNA sequence into amino acids and write the deduced protein sequence below, too.

DNA      3'-TATATCGCGTTACGATCTGCTGAAGATCATC-5'
         5'-ATATAGCGCAATGCTAGACGACTTCTAGTAG-3'

RNA      5'-AUAUAGCGCA [AUG] CUA GAC GAC UUC [UAG] [UAG]-3'
         (brackets mark the START codon AUG and the STOP codons UAG)

Protein: NH2 - Met - Leu - Asp - Asp - Phe - STOP - Stop!!

Suppose the DNA sequence in another individual is different, due to a single base-pair substitution, as shown below. What effects (if any) would you expect this SNP to have on the sequence and function of the expected protein product?

DNA      3'-TATATCGCGTTACGATCTGCTGAAGGTCATC-5'
         5'-ATATAGCGCAATGCTAGACGACTTCCAGTAG-3'

RNA      5'-AUAUAGCGCA [AUG] CUA GAC GAC UUC CAG [UAG]-3'

Protein: NH2 - Met - Leu - Asp - Asp - Phe - Gln - Stop

The substitution converts the first stop codon (UAG) into a Gln codon (CAG), so translation continues to the next in-frame UAG. The peptide encoded by the 2nd gene would therefore have one additional amino acid (Gln), relative to the 1st. With only the information provided, it is not possible to determine whether this change would alter its function.

Important "Bioinformatics" vocabulary:

New Molecular Biology Jargon:
    CpG island
    RNA polymerase
    Promoter
    Enhancer
    Post-transcriptional control
    Primary, secondary, tertiary, quaternary structure
    Hydrophobic/hydrophilic
    Motif
    Domain
    Covalent bond
    Non-covalent bond
    Peptide bond
    Hydrogen bond (H-bond)
    Base-stacking interactions
    Rotamer
    Ramachandran plot
    Energy minimization
    X-ray crystallography
    NMR spectroscopy
    AFM
    Co-variation
    Experimental constraints (for RNA structure prediction)
    cDNA
    EST

Bioinformatics Jargon:
    Log odds score
    PSSM
    Profile
    HMM
    Markov model
    1st, 2nd, 3rd order MMs
    Observed vs hidden variables
    Emission vs transition probabilities
    Viterbi algorithm
    Profile HMM
    Regularization
    Pseudocounts
    Dirichlet mixture
    Regular expression
    PDB, MMDB, MSD
    PYMOL, Cn3D
    CATH, SCOP, DALI
    GOR V, CMD
    Q3 SCORE
    Phobius
    Protein Structure Prediction
        1) Ab initio
        2) Threading or fold recognition
        3) Homology modeling
    Miyazawa-Jernigan (MJ) model/potentials
    SWISS-MODEL
    3-D JURY
    RNA structure prediction
        1) Ab initio (thermodynamic)
        2) Comparative
        3) Combined computational & experimental
    Gene structure prediction
        1) Ab initio
        2) Similarity based
        3) Combined
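
Worked check for Problem 3C (Viterbi):

The probability table in Problem 3C can be reproduced with a short Viterbi sketch. This is a minimal illustration in Python, not part of the original key. It models only the two emitting states that are reachable for ACTG (E and 5) and takes the transition and emission values from the table in the problem. The emission probability of A from state 5 (0.05) is an assumption; it is not shown in this guide and does not affect the result, because Start never moves directly to state 5. The names START, TRANS, EMIT and viterbi are hypothetical.

```python
# Viterbi sketch for the toy gene-structure HMM of Problem 3C.
# Only states E and 5 are modeled; I and End are never reached for ACTG.

# Probability of moving from Start into each state.
START = {"E": 1.0, "5": 0.0}

# Transition probabilities between the modeled states (from the table).
TRANS = {
    "E": {"E": 0.9, "5": 0.1},
    "5": {"E": 0.0, "5": 0.0},
}

# Emission probabilities. The value 0.05 for A in state 5 is an assumption.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
}


def viterbi(seq):
    """Return (table, path): best-path probabilities and the best state path."""
    states = list(START)
    # table[i][s] = probability of the best path that emits seq[:i+1]
    # and ends in state s.
    table = [{s: START[s] * EMIT[s][seq[0]] for s in states}]
    back = [{}]  # back[i][s] = best predecessor of state s at position i
    for sym in seq[1:]:
        prev = table[-1]
        col, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * TRANS[p][s])
            col[s] = prev[best] * TRANS[best][s] * EMIT[s][sym]
            ptr[s] = best
        table.append(col)
        back.append(ptr)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: table[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return table, path


if __name__ == "__main__":
    table, path = viterbi("ACTG")
    for sym, col in zip("ACTG", table):
        print(sym, {s: round(p, 4) for s, p in col.items()})
    print(" - ".join(["Start"] + path))  # prints "Start - E - E - E - E"
```

The unrounded values are 0.25, 0.05625, 0.01265625 and 0.00284765625 for state E, which is why the table entries differ slightly depending on when rounding is applied.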
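
Worked check for Problem 3D (transcription/translation):

The two peptides in Problem 3D can be checked with a short sketch. It relies on the fact that an RNA copied from the template (top) strand has the same sequence as the bottom strand, with U in place of T. The codon table below is deliberately partial: it lists only the codons that occur in these two reading frames, so it is an illustration under that assumption and not a general translator. The names CODONS, to_rna and translate are hypothetical.

```python
# Partial codon table: only the codons that appear in Problem 3D.
CODONS = {
    "AUG": "Met",
    "CUA": "Leu",
    "GAC": "Asp",
    "UUC": "Phe",
    "CAG": "Gln",
    "UAG": "STOP",
}


def to_rna(coding_strand):
    """mRNA (5'->3') has the coding-strand sequence with U replacing T."""
    return coding_strand.replace("T", "U")


def translate(rna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(rna) - 2, 3):
        residue = CODONS[rna[i:i + 3]]
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide


if __name__ == "__main__":
    individual_1 = to_rna("ATATAGCGCAATGCTAGACGACTTCTAGTAG")
    individual_2 = to_rna("ATATAGCGCAATGCTAGACGACTTCCAGTAG")
    print("-".join(translate(individual_1)))  # prints "Met-Leu-Asp-Asp-Phe"
    print("-".join(translate(individual_2)))  # prints "Met-Leu-Asp-Asp-Phe-Gln"
```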