Problem Set 1

Please make sure to show your work and calculations and state any assumptions you
make in answering the following questions. Include the names of the people you worked
with at the top of your problem set.
I. Biology (35 Points total)
1 DNA and RNA structure: Nucleic acid polymers are the basis for genetic information
storage and transfer in the cell (5 points)
1.1 What is the monomeric unit of DNA called?
Deoxyribonucleotide. (1 point)
1.2 What is the monomeric unit of RNA called?
Ribonucleotide (1 point)
1.3 What is the generic term for both of these units?
Nucleotide (1 point)
1.4 DNA is usually present in the cell in the form of a double-helix. Explain what
this structure is and why it is important to the function of DNA. (Keywords:
Replication, redundancy, anti-parallel, complementary base pairs.) (2 points)
A DNA double-helix consists of two anti-parallel strands of DNA (two chains
running in opposite directions) held together by complementary base
pairs and wound into a helix. (1 point)
The double-helix structure is important to DNA replication because the
complementary strands carry redundant information,
allowing either strand to be synthesized from the other. (1 point)
Give credit as long as they demonstrate understanding of the structure.
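The redundancy described in 1.4 can be illustrated with a minimal sketch, written in Python for brevity rather than the Perl used in Part II. The sequence and the function name `reverse_complement` are illustrative assumptions, not part of the problem set.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(strand):
    """Return the partner strand, written 5'->3'."""
    # Complement every base, then reverse, because the partner strand
    # runs anti-parallel to the template.
    return strand.translate(COMPLEMENT)[::-1]


template = "ATGCGTTA"  # toy sequence (assumption, not from the problem set)
partner = reverse_complement(template)
print(partner)  # TAACGCAT
# Either strand can be rebuilt from the other.
print(reverse_complement(partner) == template)  # True
```

Applying the function twice returns the original strand, which is the sense in which one strand fully specifies the other.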
2 Proteins are polymers that perform the intended function of most genes (5 points)
2.1 What is the monomeric unit of a protein?
Amino Acid or Amino Acid Residue (1 point)
2.2 How many different types of monomers typically exist?
20. (1 point)
Give credit if the answer is larger than 20 AND if non-standard AA is mentioned.
2.3 Because of the direction of protein synthesis and the structure of
proteins, a protein is usually described as having an N-terminal and a C-terminal end. Why
are the symbols N and C used?
N represents the amino-terminal end of the polypeptide, which has
nitrogen as one of the first atoms in the chain. C represents the carboxyl-terminal
end of the polypeptide chain. (1 point)
Give credit as long as they know N stands for amino-terminal and C stands for
carboxyl-terminal.
2.4 Proteins are described as having four levels of structure (primary-, secondary-,
tertiary-, and quaternary-structure). For each of the following list the category (or
categories) they belong to. Note that some may belong to more than one
category. (2 points)
2.4.1 Amino Acid Sequence?
Primary (0.5 point)
2.4.2 Alpha Helices?
Secondary (0.5 point)
2.4.3 HIV protease dimer?
Quaternary (0.5 point)
2.4.4 Disulfide bonds?
Secondary, tertiary or quaternary (0.5 point)
Give credit if they only mention tertiary and quaternary, or if they include
primary structure as well.
3 An understanding of the central dogma and the structure of DNA, RNA and proteins
will help you answer these questions (10 points)
3.1 In the eukaryotic cell the size of the mature mRNA that is translated is smaller
than the gene sequence. Why is this the case? (Keywords: transcription,
Promoter, Exon, Intron, Splicing)
The transcription start site is usually downstream of the promoter, so
not all of the gene sequence is transcribed into RNA. (2 points)
Furthermore, RNA splicing, in which introns in the primary transcript are
removed and exons are joined together, further reduces the size of the
mature mRNA. (3 points)
Give credit as long as they show understanding of the subject.
3.2 For a given expressed gene, the length (in monomers) of the resultant protein
(assuming no post processing), is less than 1/3 the size of the mature mRNA.
Why? (Keywords: Tri-Nucleotide Codon, Start Codon, Stop Codon, t-RNA, 3’
and 5’ UTR, ORF.)
First, since each amino acid is encoded by three nucleotides (a tri-nucleotide
codon), translation automatically reduces the length of the polymer by a
factor of 3. (1 point)
Furthermore, since the mRNA contains a 3’UTR (untranslated region) and a 5’UTR,
the coding sequence that actually gets translated into the peptide is shorter
than the mRNA sequence. (4 points, 2 points each for 5’UTR and 3’UTR)
Give credit as long as they show understanding of the subject.
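A toy calculation may make the length argument in 3.2 concrete. The sketch below is in Python rather than the Perl of Part II, and the mRNA, its UTR sequences, and the function name `translated_codon_count` are all hypothetical. It also shows that the stop codon itself is not translated, which shortens the protein by one more position.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def translated_codon_count(mrna):
    """Count the amino acids encoded from the first ATG up to, but not
    including, the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return 0
    count = 0
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


# Hypothetical toy mRNA: 5' UTR + coding sequence + stop codon + 3' UTR.
five_utr = "GGCACC"
coding = "ATGTCTGGA"  # Met-Ser-Gly
three_utr = "CCTTAAAA"
mrna = five_utr + coding + "TAA" + three_utr

print(len(mrna))                     # 26 nucleotides
print(translated_codon_count(mrna))  # 3 amino acids
```

Here 26 nucleotides yield only 3 amino acids, well under 26/3, because the UTRs and the stop codon contribute nothing to the peptide.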
4 You will need to understand the genetic code to answer these questions. (5 Points)
4.1 Which six codons encode serine?
TCT, TCC, TCA, TCG, AGT and AGC (or UCU, UCC, UCA, UCG, AGU
and AGC). (1 point)
4.2 List all of the codons that do not code for an amino acid. What is their purpose?
TAA, TAG, TGA (or UAA, UAG, UGA) (1 point)
Their purpose is to signal translation termination. (1 point)
4.3 What amino acid does the codon ATG encode? Other than coding for an
amino acid, does this codon perform any special function in protein
biosynthesis?
Met (1 point)
It signals translation initiation (1 point)
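The answers to 4.1 through 4.3 can be checked against the standard genetic code. The sketch below is in Python rather than the Perl of Part II and builds the table from the conventional TCAG ordering. The names `CODON_TABLE` and `codons_for` are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order;
# '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return all codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(codons_for("S"))     # ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT']
print(codons_for("*"))     # ['TAA', 'TAG', 'TGA']
print(CODON_TABLE["ATG"])  # M
```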
5 Eukaryotic and prokaryotic organisms differ in many aspects. For each of the cellular
structures or characteristics listed below, please identify whether it belongs to
eukaryotic organisms, prokaryotic organisms, or both. (5 points)
5.1 Membrane-bound organelles:
eukaryotic (1 point)
5.2 Nucleus
eukaryotic (1 point)
5.3 70S ribosome
prokaryotic (1 point)
5.4 RNA splicing
eukaryotic (1 point)
5.5 microRNAs
eukaryotic (1 point)
For 5.4, give credit if they mention both prokaryotic and eukaryotic (there are very
rare cases of splicing in prokaryotes). No credit will be given if only prokaryotes are
mentioned.
For 5.5, give credit if they mention both prokaryotic and eukaryotic (this depends on
different definitions of microRNAs).
6 Speculate on a biological problem that might be interesting to investigate with
computational methods. Think of this as a possible subject for your final project (5
Points)
Give full credit (5 points) if they come up with anything that seems like a reasonable
start. Detailed comments to be given by section TFs.
II. Perl Program (35 points total + 5 bonus points)
Answers in blue text. Companion files:
PS1_answer_perl_2003.pl
PS1_answer_perl_bonus_2003.pl
PS1_answer_perl_2003.out
PS1_answer_perl_bonus_2003.out
1. Calculate GC content for a test 70-mer (10 points for this section):
a. In order to do this, you will need the following line of code:
$c = $oligo =~ s/c//gi;
Explain what this line does as a comment in your Perl script (hint in
skeleton). (5 points).
Official answer, as provided as commented lines in perl code:
# The expression s/c//gi means to do a case insensitive global substitution
# of all instances of the character 'c' with no character, thus removing
# all 'c' characters from the $oligo variable. The substitution expression
# itself returns an integer corresponding to the number of times the
# substitution takes place. This can now be used to determine how many C's,
# and for that matter, G's, are present in the oligonucleotide sequence.
The answer has two main parts to it – what s/c//gi does and what $c = $oligo does.
• 3 points for getting the s/c//gi part
• 2 points for getting the $c = $oligo part
(this is probably a more subtle/difficult portion of the answer and is awarded a
lower point value to provide some leeway for people not familiar with Perl)
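For graders less familiar with Perl, the two effects of that line (the string is altered, and the number of substitutions is returned) have a rough Python analogue in `re.subn`. This is only an illustration of the semantics, not the answer key's code, and the helper name `remove_and_count` and the toy sequence are assumptions.

```python
import re


def remove_and_count(sequence, base):
    """Delete every occurrence of `base`, ignoring case, and report how
    many characters were removed."""
    # re.subn returns a (new_string, number_of_substitutions) pair.
    stripped, n_removed = re.subn(
        re.escape(base), "", sequence, flags=re.IGNORECASE
    )
    return stripped, n_removed


stripped, c_count = remove_and_count("ccATcGC", "c")
print(stripped)  # ATG
print(c_count)   # 4
```

As in the Perl version, the sequence that remains no longer contains any C, so a copy of the original is needed if the full oligo is to be printed afterwards.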
b. Output the full oligo sequence and its GC content to the screen (5 points).
This ensures that students can use the print statement and, with help from part
a, can perform a couple of operations (i.e. calculate GC content) and output the result
as well.
• 1 point for the print statement (which is essentially given in the skeleton code)
• 4 points for the GC content; half-credit if the code makes the substitution but does
not correctly count the number of substitutions made, and 1 point off if the code is
largely correct but reports an incorrect value. Rounding the GC content to no fewer
than 2 significant figures is permissible.
Output:
CATTACGATGCATTGATTTTTCAAAGGAATGTACTATCGAAATCACAAGTCGTGGACTACGGTTTGCAGT
GC Content: 38.5714285714286%
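The reported value can be cross-checked independently of the Perl script. The sketch below is in Python and counts bases directly instead of using a substitution. The function name `gc_content` is an assumption, and the sequence is the test 70-mer shown in the output above.

```python
def gc_content(oligo):
    """Return the percentage of G and C bases in the oligo."""
    seq = oligo.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


TEST_70MER = (
    "CATTACGATGCATTGATTTTTCAAAGGAATGTACTATCGAAATCACAAGTCGTGGACTACGGTTTGCAGT"
)

print(TEST_70MER)
print(f"GC Content: {gc_content(TEST_70MER):.2f}%")  # GC Content: 38.57%
```

The 70-mer contains 12 C's and 15 G's, so the expected value is 27/70, or about 38.57%.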
2. Calculate Tm (described in the skeleton code) for a test 70-mer. (5 points)
Output:
Tm: 73.5714285714286 degrees C


• 4 points for correctly implementing the formula; half-credit if the formula
contains errors, and 1 point off if the code is largely correct but reports an
incorrect value. Rounding the Tm to no fewer than 2 significant figures is
permissible.
• 1 point for the print statement (again, provided in the skeleton code)
3. Read in the SARS genome and parse through all possible 70-mers, calculate GC
content and Tm for each 70-mer in the SARS genome, filter out 70-mers that don’t
satisfy the Tm requirement (between 67 and 69 °C), and store the filtered results in an
array variable (15 points).





• 1 point – read in SARS genome and store it in a variable (this is essentially
provided already as a commented line in the skeleton code)
• 5 points – correctly implemented use of the substring function to extract each and
every 70-mer in the SARS genome
• 4 points – code to calculate GC content and Tm. Parts 1 and 2 should have
made this easy to incorporate into the loop.
• 1 point – correctly using the if statement to filter the results; the skeleton code had
already provided the if statement
• 4 points – a reasonable array variable to store the attributes of each qualifying
70-mer
Award up to half-credit for pseudocode or code that has the general structure but
significant flaws in the implementation. For code that is mostly correct but has small
bugs, award full credit minus one point per bug.
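The overall shape of a part 3 solution can be sketched as follows, in Python rather than Perl. The Tm formula is defined in the skeleton code, which is not reproduced here, so the sketch takes it as a caller-supplied function `tm_fn`. The Wallace rule used in the demonstration is only a stand-in, not the assignment's formula. The 0-based start positions, the inclusive Tm bounds, and all function names are assumptions.

```python
def gc_content(oligo):
    """Return the percentage of G and C bases in the oligo."""
    seq = oligo.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def candidate_oligos(genome, tm_fn, length=70, tm_min=67.0, tm_max=69.0):
    """Slide a window over the genome and keep (start, sequence, gc, tm)
    for every window whose Tm lies within [tm_min, tm_max]."""
    hits = []
    for start in range(len(genome) - length + 1):
        oligo = genome[start:start + length]  # substring extraction
        tm = tm_fn(oligo)
        if tm_min <= tm <= tm_max:            # Tm filter (inclusive: assumption)
            hits.append((start, oligo, gc_content(oligo), tm))
    return hits


def wallace_tm(oligo):
    """Stand-in Tm (Wallace rule, 2*(A+T) + 4*(G+C)); NOT the skeleton's formula."""
    seq = oligo.upper()
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))


# Toy demonstration with 4-mers instead of 70-mers.
hits = candidate_oligos("AATTGGCC", wallace_tm, length=4, tm_min=10, tm_max=12)
for start, seq, gc, tm in hits:
    print(start, seq, gc, tm)
```

A genome of length N yields N - 69 windows of length 70, which is a quick check that every 70-mer, including the last one, was extracted.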
4. The oligos you obtained above may overlap with each other. Here
you’re asked to filter out the overlapping ones and output a list of non-overlapping oligos with the starting position, oligonucleotide sequence, GC
content, and Tm. You should have a total of four tab-separated columns. (5 points)
Hint: to output the STDOUT (i.e. the screen) into a file, reroute the output into
the file with the following syntax:
program.pl [switches] > output.txt
This may be the most challenging part of the problem for some students. The skeleton
code, however, does try to provide a hint to the students in constructing the loop.
Note that the lines of code in the skeleton differ slightly from those in
the answer key code: the answer key code handles a single special-case iteration of the for
loop as separate lines of code to take care of the “end-of-fence” problem that often
arises in these kinds of loops. Since some students may treat the lines of code given in the
skeleton code as unalterable, credit should not be deducted if the “end-of-fence”
oligo cases are missing (e.g. first or last oligo missing in output).


• 2 points for a correct print statement that produces a reasonable tab-delimited
output with 4 columns
• 3 points for correct implementation of non-overlapping oligos (no credit if the
output file is big, say, over 50,000 bytes; the official output file answer key is less
than 2,000 bytes).
Output:
See PS1_answer_perl_2003.out
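One reasonable way to obtain non-overlapping oligos is a greedy left-to-right pass, sketched below in Python rather than Perl. This is not necessarily the selection rule used in the answer key code, so a correct submission may pick a different but equally valid set. The tuple layout (start, sequence, GC, Tm), the two-decimal formatting, and the function names are assumptions.

```python
def non_overlapping(oligos, length=70):
    """Greedy pass over oligos sorted by start position: keep an oligo only
    if it begins at or after the end of the previously kept one."""
    kept = []
    next_free = 0
    for start, seq, gc, tm in oligos:
        if start >= next_free:
            kept.append((start, seq, gc, tm))
            next_free = start + length
    return kept


def format_rows(oligos):
    """Render four tab-separated columns: start, sequence, GC content, Tm."""
    return "\n".join(
        f"{start}\t{seq}\t{gc:.2f}\t{tm:.2f}" for start, seq, gc, tm in oligos
    )


# Toy demonstration with 4-mers; the GC and Tm values are made up.
toy = [
    (0, "AAAA", 0.0, 68.0),
    (1, "AAAC", 25.0, 68.5),
    (4, "CCGG", 100.0, 67.2),
    (6, "GGTT", 50.0, 67.9),
    (8, "TTAA", 0.0, 68.1),
]
kept = non_overlapping(toy, length=4)
print(format_rows(kept))
```

The first qualifying oligo is always kept by this rule, which avoids the missing-first-oligo case mentioned in the grading note.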
Bonus: An important factor in oligo design is masking repetitive sequences to minimize
non-specific hybridization. Most oligo design programs have a fairly comprehensive set
of repetitive sequences that are masked from the oligo design space. Here, filter your list
of results by removing oligonucleotides that have a homopolynucleotide tract (a run of a
single base) 5 or more bases in length. Note that you should perform your filtering on the
list of all possible qualifying oligos, not the list of non-overlapping oligos (5 points).




• 3 points for the regular expression filter
• 2 points for array splicing
• 1 point off if the filtering is performed on the list of non-overlapping oligos
generated in part 4.
If the code employs some other implementation of parsing through the array and
produces the correct result, full credit may be awarded; if the implementation is
flawed, the points awarded are at the discretion of the grader.
Output:
See PS1_answer_perl_bonus_2003.out
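The homopolymer filter can be expressed with a single back-referencing regular expression. The sketch below is in Python rather than Perl and builds a new list instead of splicing elements out of the existing array, so it is an alternative implementation of the kind the grading note allows, not the answer key's code. The names `HOMOPOLYMER`, `has_homopolymer`, and `mask_repeats` are assumptions.

```python
import re

# One base followed by at least four more copies of itself: a run of 5+.
HOMOPOLYMER = re.compile(r"([ACGT])\1{4,}", re.IGNORECASE)


def has_homopolymer(seq):
    """True if the sequence contains a single-base run of 5 or more."""
    return HOMOPOLYMER.search(seq) is not None


def mask_repeats(oligos):
    """Drop every (start, sequence, gc, tm) record whose sequence contains
    a homopolymer tract; apply this to the full list of qualifying oligos."""
    return [record for record in oligos if not has_homopolymer(record[1])]


print(has_homopolymer("ACGTTTTTAC"))  # True  (five T's)
print(has_homopolymer("ACGTTTTAC"))   # False (only four T's)
```

By this rule the part 1 test 70-mer would itself be rejected, since it contains a run of five T's.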
III. Excel tutorial (30 points total + 5 bonus points)
Please see PS1_answer_Excel_2003.xls
The entire Excel file can be found at PS1_answer_Excel_complete_2003.xls
(note it’s a huge file).
Bonus A:
Please see PS1_answer_Excel_bonusA_2003.xls
Bonus B:
Please see PS1_answer_Excel_bonusB_2003.xls