Problem Set 1

Please make sure to show your work and calculations and state any assumptions you make in answering the following questions. Include the names of the people you worked with at the top of your problem set.

I. Biology (35 Points total)

1 DNA and RNA structure: Nucleic acid polymers are the basis for genetic information storage and transfer in the cell (5 points)

1.1 What is the monomeric unit of DNA called?
Deoxyribonucleotide. (1 point)

1.2 What is the monomeric unit of RNA called?
Ribonucleotide. (1 point)

1.3 What is the generic term for both of these units?
Nucleotide. (1 point)

1.4 DNA is usually present in the cell in the form of a double-helix. Explain what this structure is and why it is important to the function of DNA. (Keywords: Replication, redundancy, anti-parallel, complementary base pairs.) (2 points)
A DNA double-helix contains two anti-parallel strands of DNA (two chains running in opposite directions) bound together through complementary base pairs, forming a helical spiral. (1 point)
Such a double-helix structure is important to DNA replication because the complementary nature of the structure provides redundant information, allowing the synthesis of one strand from the other. (1 point)
Give credit as long as they demonstrate understanding of the structure.

2 Proteins are polymers that perform the intended function of most genes (5 points)

2.1 What is the monomeric unit of a protein?
Amino Acid or Amino Acid Residue. (1 point)

2.2 How many different types of monomers typically exist?
20. (1 point)
Give credit if the answer is larger than 20 and non-standard amino acids are mentioned.

2.3 Due to the direction of protein synthesis and the structure of proteins, a protein is usually referred to as having an N-terminal and a C-terminal end. Why are the symbols N and C used?
N represents the amino-terminal side of the polypeptide, which contains nitrogen as one of the first atoms in the chain.
C represents the carboxyl terminal of the polypeptide chain. (1 point)
Give credit as long as they know N stands for amino-terminal and C stands for carboxyl-terminal.

2.4 Proteins are described as having four levels of structure (primary, secondary, tertiary, and quaternary structure). For each of the following, list the category (or categories) it belongs to. Note that some may belong to more than one category. (2 points)
2.4.1 Amino Acid Sequence? Primary (0.5 point)
2.4.2 Alpha Helices? Secondary (0.5 point)
2.4.3 HIV protease dimer? Quaternary (0.5 point)
2.4.4 Disulfide bonds? Secondary, tertiary or quaternary (0.5 point)
Give credit if they only mention tertiary and quaternary, or if they include primary structure as well.

3 An understanding of the central dogma and the structure of DNA, RNA and proteins will help you answer these questions (10 points)

3.1 In the eukaryotic cell the size of the mature mRNA that is translated is smaller than the gene sequence. Why is this the case? (Keywords: Transcription, Promoter, Exon, Intron, Splicing)
The transcription start site is usually downstream of the promoter, so only part of the gene sequence is transcribed into RNA. (2 points)
Furthermore, RNA splicing, in which introns in the primary transcript are removed and exons are joined together, further reduces the size of the mature mRNA. (3 points)
Give credit as long as they show understanding of the subject.

3.2 For a given expressed gene, the length (in monomers) of the resultant protein (assuming no post-processing) is less than 1/3 the size of the mature mRNA. Why? (Keywords: Tri-Nucleotide Codon, Start Codon, Stop Codon, t-RNA, 3’ and 5’ UTR, ORF.)
First of all, since one amino acid is coded for by three nucleotides (a trinucleotide codon), translation automatically reduces the size of the polymer by a factor of 3.
(1 point)
Furthermore, since the mRNA contains a 3’ UTR (untranslated region) and a 5’ UTR, the coding sequence that actually gets translated into the peptide is shorter than the mRNA sequence. (4 points, 2 points each for 5’ UTR and 3’ UTR)
Give credit as long as they show understanding of the subject.

4 You will need to understand the genetic code to answer these questions. (5 Points)

4.1 What six codons encode for Serine?
TCT, TCC, TCA, TCG, AGT and AGC (or UCU, UCC, UCA, UCG, AGU and AGC). (1 point)

4.2 List all of the codons that do not code for an amino acid. What is their purpose?
TAA, TAG, TGA (or UAA, UAG, UGA). (1 point)
Their purpose is to signal translation termination. (1 point)

4.3 What amino acid does the codon ATG encode for? Other than coding for an amino acid, does this codon perform any special function in protein biosynthesis?
Met. (1 point)
It signals translation initiation. (1 point)

5 Eukaryotic and prokaryotic organisms differ in many aspects. For each of the cellular structures or characteristics listed below, please identify whether it belongs to eukaryotic or prokaryotic organisms or both. (5 points)
5.1 Membrane-bound organelles: eukaryotic (1 point)
5.2 Nucleus: eukaryotic (1 point)
5.3 70S ribosome: prokaryotic (1 point)
5.4 RNA splicing: eukaryotic (1 point)
5.5 microRNAs: eukaryotic (1 point)
For 5.4, give credit if they mention both prokaryotic and eukaryotic (there are very rare cases of splicing in prokaryotes). No credit will be given if only prokaryotes are mentioned.
For 5.5, give credit if they mention both prokaryotic and eukaryotic (this depends on different definitions of microRNAs).

6 Speculate on a biological problem that might be interesting to investigate with computational methods. Think of this as a possible subject for your final project. (5 Points)
Give full credit (5 points) if they come up with anything that seems like a reasonable start. Detailed comments to be given by section TFs.

II. Perl Program (35 points total + 5 bonus points)

Answers in blue text.

Companion files:
PS1_answer_perl_2003.pl
PS1_answer_perl_bonus_2003.pl
PS1_answer_perl_2003.out
PS1_answer_perl_bonus_2003.out

1. Calculate GC content for a test 70-mer (10 points for this section):

a. In order to do this, you will need the following line of code:
    $c = $oligo =~ s/c//gi;
Explain what this line does as a comment in your Perl script (hint in skeleton). (5 points)

Official answer, as provided in commented lines in the Perl code:
    # The expression s/c//gi means to do a case insensitive global substitution
    # of all instances of the character 'c' with no character, thus removing
    # all 'c' characters from the $oligo variable. The substitution expression
    # itself returns an integer corresponding to the number of times the
    # substitution takes place. This can now be used to determine how many C's,
    # and for that matter, G's, are present in the oligonucleotide sequence.

The answer has two main parts: what s/c//gi does and what $c = $oligo does.
3 points for getting the s/c//gi part
2 points for getting the $c = $oligo part (this is probably a more subtle/difficult portion of the answer and is awarded a lower point value to provide some leeway for people not familiar with Perl)

b. Output the full oligo sequence and its GC content to the screen (5 points).
This ensures that students can use the print statement and, with the assistance of part a, perform a couple of operations (i.e. calculate GC content) and output the result as well.
1 point for the print statement (which is essentially given in the skeleton code)
4 points for the GC content; half credit if the code makes the substitution but does not correctly count the number of substitutions made, and 1 point off if the code is largely correct but reports an incorrect value. Rounding the GC content to a minimum of 2 significant figures is permissible.
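The counting idiom from part 1 can be sketched as follows. This is a minimal illustration, not the companion answer file: the helper name gc_content is hypothetical, and the sequence is the test 70-mer whose official output is listed in this key. Because s/c//gi deletes the matched bases from its target, the sketch counts on a copy so the original oligo can still be printed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the skeleton or the answer key):
# returns the GC content of a sequence as a percentage.
sub gc_content {
    my ($oligo) = @_;
    return 0 unless length $oligo;    # avoid dividing by zero

    my $copy = $oligo;                # s/// modifies its target, so count on a copy

    # s///gi returns the number of substitutions made, or an empty
    # string (false) when nothing matched; "|| 0" turns that into 0.
    my $c = ($copy =~ s/c//gi) || 0;
    my $g = ($copy =~ s/g//gi) || 0;

    return 100 * ($g + $c) / length($oligo);
}

# The test 70-mer from this problem set.
my $oligo = 'CATTACGATGCATTGATTTTTCAAAGGAATGTACTATCGAAATCACAAGTCGTGGACTACGGTTTGCAGT';

print "$oligo\n";
print "GC Content: ", gc_content($oligo), "%\n";
```

A non-destructive alternative is the transliteration operator, for example `my $gc = ($oligo =~ tr/GCgc//);`, which counts the matching characters without changing the string.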
Output:
CATTACGATGCATTGATTTTTCAAAGGAATGTACTATCGAAATCACAAGTCGTGGACTACGGTTTGCAGT
GC Content: 38.5714285714286%

2. Calculate Tm (described in the skeleton code) for a test 70-mer. (5 points)
Output:
Tm: 73.5714285714286 degrees C
4 points for correctly implementing the formula; half credit if the formula contains errors, and 1 point off if the code is largely correct but reports an incorrect value. Rounding the Tm to a minimum of 2 significant figures is permissible.
1 point for the print statement (again, provided in the skeleton code)

3. Read in the SARS genome and parse through all possible 70-mers, calculate GC content and Tm for each 70-mer in the SARS genome, filter out 70-mers that don’t satisfy the Tm requirement (between 67-69 °C), and store the filtered results in an array variable (15 points).
1 point: read in the SARS genome and store it in a variable (this is essentially provided already as a commented line in the skeleton code)
5 points: correct use of the substring function to extract each and every 70-mer in the SARS genome
4 points: code to calculate GC content and Tm. Parts 1 and 2 should have made this easy to incorporate into the loop.
1 point: correct use of the if statement to filter the results; the skeleton code had already provided the if statement
4 points: a reasonable array variable to store the attributes of each qualifying 70-mer
Award up to half credit for pseudocode or code that has the general structure but significant flaws in the implementation. Award full credit minus one point per bug for code that is mostly correct but has small bugs.

4. The oligos you obtained above may overlap with each other. Here you’re asked to filter out the overlapping ones and output a list of non-overlapping oligos with the starting position, oligonucleotide sequence, GC content, and Tm. You should have a total of four tab-separated columns. (5 points)
Hint: to output the STDOUT (i.e. the screen) into a file, reroute the output into the file with the following syntax:
    program.pl [switches] > output.txt
This may be the most challenging part of the problem for some students. The skeleton code, however, does try to provide a hint to the students in constructing the loop. Note that the lines of code in the skeleton are slightly different from those in the answer key code: the answer key code handles a single special-case iteration of the for loop as separate lines of code to take care of the “end-of-fence” problem that often arises in these kinds of loops. Since some students may treat the lines given in the skeleton code as unalterable, credit should not be deducted if the “end-of-fence” oligo cases are missing (e.g. the first or last oligo is missing from the output).
2 points for a correct print statement that produces a reasonable tab-delimited output with 4 columns
3 points for correct implementation of non-overlapping oligos (no credit if the output file is big, say, over 50,000 bytes; the official output file answer key is less than 2,000 bytes).
Output: See PS1_answer_perl_2003.out

Bonus: An important factor in oligo design is to mask repetitive sequences to minimize non-specific hybridization. Most oligo design programs have a fairly comprehensive set of repetitive sequences that are masked from the oligo design space. Here, filter your list of results by removing oligonucleotides that have a homo-polynucleotide tract 5 or more bases in length. Note that you should perform your filtering on the list of all possible qualifying oligos, not the list of non-overlapping oligos (5 points).
3 points for the regular expression filter
2 points for array splicing
1 point off if the filtering is performed on the list of non-overlapping oligos generated in part 4.
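The bonus filter and the non-overlap selection from part 4 can be sketched together as follows. This is only an illustration under stated assumptions, not the companion answer code: each qualifying oligo is assumed to be stored as a hash reference with the hypothetical keys start and seq, the sub names are made up for this example, grep is used in place of the array splicing named in the rubric, and the non-overlap step is a simple greedy left-to-right pass. The sequences are short toy strings rather than real 70-mers.

```perl
use strict;
use warnings;

# Assumed record layout (hypothetical, not from the skeleton code):
#   { start => <0-based position in the genome>, seq => <oligo sequence> }

# Drop oligos that contain a homo-polynucleotide tract of 5 or more bases.
# ([ACGT])\1{4,} matches one base followed by at least 4 more copies of it.
sub remove_homopolymers {
    my (@oligos) = @_;
    return grep { $_->{seq} !~ /([ACGT])\1{4,}/i } @oligos;
}

# Greedy pass: keep an oligo only if it starts at or after the end of
# the most recently kept oligo.
sub non_overlapping {
    my (@oligos) = @_;
    my @kept;
    my $next_free = 0;    # first position not covered by a kept oligo
    for my $o (sort { $a->{start} <=> $b->{start} } @oligos) {
        next if $o->{start} < $next_free;
        push @kept, $o;
        $next_free = $o->{start} + length($o->{seq});
    }
    return @kept;
}

# Toy input standing in for the list of all qualifying oligos.
my @qualifying = (
    { start => 0,  seq => 'ACGTACGTAC' },
    { start => 4,  seq => 'ACGTTTTTAC' },    # contains TTTTT, removed by the filter
    { start => 5,  seq => 'CGTACGTACG' },    # overlaps the first oligo
    { start => 10, seq => 'GATTACAGAT' },
);

# Filter the full qualifying list first, then pick non-overlapping oligos,
# which is the order the bonus question asks for.
my @final = non_overlapping(remove_homopolymers(@qualifying));
print join("\t", $_->{start}, $_->{seq}), "\n" for @final;
```

A full solution would also carry GC content and Tm in each record so that all four output columns can be printed.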
If a student's code employs some other implementation of parsing through the array and produces the correct result, full credit may be awarded; if the implementation is flawed, points are awarded at the grader's discretion.
Output: See PS1_answer_perl_bonus_2003.out

III. Excel tutorial (30 points total + 5 bonus points)

Please see PS1_answer_Excel_2003.xls
The entire Excel file can be found at PS1_answer_Excel_complete_2003.xls (note it’s a huge file).
Bonus A: Please see PS1_answer_Excel_bonusA_2003.xls
Bonus B: Please see PS1_answer_Excel_bonusB_2003.xls