Biotechnology
Homework 4 Answers
Fall 2009
1. (i) Starting at the first nucleotide I looked for triplets that were stop codons and marked them with
an “X” below. Then I shifted by one nucleotide for reading frame 2, repeated the search, and then
shifted to reading frame 3. I then picked the most likely ORF simply as being the longest (a program
could also look for codon usage). For the first segment a stretch of reading frame 2, starting TCC and
ending TAA, is the longest, spanning almost the entire segment. For the second segment reading
frame 3 has the longest ORF, from CGA to (beyond) the end of the sequence shown.
……….GTAATCCATCCCACCTCAGGCGTACCACATGTCCTACATGGCTAAAACGCTGAA
GGAGCAGAGTAATCCCGTGCAGTTAGTGCGACAGTACTTTAAGAAACTCGAGAGGGTA
AGTCCAAAACCAAAATTATAAATAATA….…AAGCAAACCCAAAATACTAACGATAT
TGTTATTCATATCTTTGGCAGGAACACGTCACCCTGGATGCCAGGAGACAGGTAATCAG
CCCCGATGACTGCGAAGATCTTAGCCCGATGACACAAACAGCTGCTCCACCGGTTCGTA
GGCTAGGCGGAATC…………
[Rows beneath each sequence line marked the stop codons in reading frames 1, 2 and 3 with an “X”.]
For (i) and (ii) there was no specific request to “Explain” but it is always a good idea to explain your
logic. That allows mis-understandings to be exposed and for you to gain partial credit. It also, of
course, shows that you derived the answer.
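The stop-codon scan described in (i) can be sketched in a few lines of Python. This is only an illustration, not the method used to prepare the answer: the function names are made up for the example, positions are 0-based, and the sample is just the first printed line of segment 1 (54 nt), so the coordinates do not refer to the whole segment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions(seq, frame):
    """Return 0-based start indices of stop codons in reading frame 1, 2 or 3."""
    seq = seq.upper()
    return [i for i in range(frame - 1, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]

def longest_open_stretch(seq, frame):
    """Return (start, end) of the longest stop-free run of whole codons in one frame.

    start is 0-based inclusive, end is exclusive. No ATG is required and the
    run may stay open at either end of the sequence.
    """
    seq = seq.upper()
    run_start = frame - 1
    best = (run_start, run_start)
    i = frame - 1
    while i + 3 <= len(seq):
        if seq[i:i + 3] in STOP_CODONS:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3          # the next open stretch begins after the stop
        i += 3
    if i - run_start > best[1] - best[0]:
        best = (run_start, i)          # stretch still open at the end of the sequence
    return best

# First printed line of segment 1 only (an excerpt, not the full segment).
line1 = "GTAATCCATCCCACCTCAGGCGTACCACATGTCCTACATGGCTAAAACGCTGAA"
for frame in (1, 2, 3):
    print(frame, stop_positions(line1, frame), longest_open_stretch(line1, frame))
```

On this excerpt, frame 2 opens at the TCC that follows the initial TAA, which is consistent with the description above.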
(ii) In searching the first segment for a splice donor I look first for the invariant GT consensus & then
examine each of these examples for a match to the more extensive sequence. Only one position has
a good match: AGGGTAAGT. If that site is used the GT is the start of the intron (which I have
made bold above). If the first segment is translated in reading frame 2 the reading frame
immediately before the splice donor will be CTC GAG AGG (gt…), so this will continue after a
splice acceptor with the first nucleotide of the next exon being the first nucleotide of a codon.
When looking for splice acceptors (AG, then match to CAGG with preceding pyrimidines) there are
two strong candidates in segment two (underlined above; a third sequence is not quite as good).
These two splice acceptors appear (to us) to be roughly equally good. However, their use will
connect the previous exon into different reading frames. The first splice will connect to reading
frame 3, which is the long ORF but the second splice connects to reading frame 2, which ends very
shortly thereafter. Thus, only the first splice acceptor joins the two long ORFs (the intron preceding
the second exon is in bold above).
Some answers looked for splicing events within each segment. While such events may take place,
such splices do not help in connecting ORFs that are separated by stop codons. Also, introns are
always larger than about 30nt, eliminating some of the possibilities.
Many answers did not see or acknowledge that the second segment had two perfectly good splice
acceptors and that the choice depends on maintaining the ORF.
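The donor search in (ii) (find every GT, then compare the surrounding bases with the longer consensus) can be sketched as below. The consensus string MAG|GTRAGT (M = A or C, R = A or G) is the commonly quoted donor consensus and is an assumption here, since the answer does not spell it out; the function name and the simple match count are also illustrative only. The sample is the stretch of segment 1 around the reported site.

```python
# IUPAC codes needed for the donor consensus (M = A/C, R = A/G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}
DONOR_CONSENSUS = "MAGGTRAGT"   # assumed consensus: 3 exon nt, then 6 intron nt starting at GT

def donor_candidates(seq):
    """Return (score, gt_index, window) for every GT, best match first.

    score counts matches to the 9-nt consensus; gt_index is the 0-based
    position of the G of the GT that would start the intron.
    """
    seq = seq.upper()
    hits = []
    for gt in range(3, len(seq) - 5):
        if seq[gt:gt + 2] != "GT":
            continue
        window = seq[gt - 3:gt + 6]
        score = sum(base in IUPAC[code]
                    for base, code in zip(window, DONOR_CONSENSUS))
        hits.append((score, gt, window))
    return sorted(hits, reverse=True)

# Excerpt of segment 1 spanning the end of the second and start of the third printed line.
excerpt = "AAACTCGAGAGGGTAAGTCCAAAACC"
for score, gt, window in donor_candidates(excerpt):
    print(score, gt, window)
```

A real prediction program would use position weight matrices rather than a plain match count, so the scores here only rank candidates roughly.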
(iii) The initiator should be the first ATG in the correct reading frame of the most upstream long
ORF (reading frame 2 of the first segment). That ATG is underlined. The Kozak sequence can
influence ATG choice but being the first ATG is usually the more significant criterion.
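As a rough illustration of the "first ATG in the correct reading frame" rule, the sketch below walks codon by codon from the start of an open stretch and reports the first ATG it meets. The function name is hypothetical, indices are 0-based, Kozak context is ignored, and the sample is again only the first printed line of segment 1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_in_frame_atg(seq, orf_start):
    """Return the 0-based index of the first ATG in frame with orf_start.

    Returns None if a stop codon or the end of the sequence comes first.
    """
    seq = seq.upper()
    for i in range(orf_start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG":
            return i
        if codon in STOP_CODONS:
            return None
    return None

line1 = "GTAATCCATCCCACCTCAGGCGTACCACATGTCCTACATGGCTAAAACGCTGAA"
# Reading frame 2 opens at index 4 (the TCC after the initial TAA).
print(first_in_frame_atg(line1, 4))
```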
(iv) If the initiator codon is not in the sequences shown it is likely to be further upstream. In this
particular example there is a stop codon further upstream in reading frame 2 so there must be at least
one additional exon that splices to the region shown. The general answer is that we need to know the
complete mRNA sequence, especially the most 5’ regions. That is most readily accomplished by
isolating cloned cDNAs and sequencing. If none exist a mini-cDNA library specific to this gene can
be made using PCR primers for the sequences shown in a 5’ RACE procedure, followed by
sequencing cloned cDNA ends. The cDNA will clearly reveal which ATG is first in a long ORF of a
fully spliced mRNA. In this particular example you can see a good splice acceptor sequence early in
segment 1, so it is quite likely that there are upstream sequences that splice to the sequences shown.
Many answers simply wanted to examine genomic sequence, especially limited to the 2kb fragment.
The best this can do is to put you in a position to predict upstream exons with coding potential and
guess whether they are spliced to the rest of the mRNA. This is usually the weakest aspect of
prediction programs because the first coding exon may be small & distant.
Performing primer extension (or S1 nuclease or RNase protection) is good because it examines RNA
but it can only give you length information. Primer extension will not tell you if an upstream splice
was made or where the first exon is in genomic DNA (or, of course, its sequence). S1 nuclease
protection tells you the genomic DNA from the probe that is present in mRNA but cannot tell you if
there is another exon upstream. These methods are also less sensitive than an RT-PCR method (like
5’-RACE), so they are rarely used for this type of purpose.
(v) In practice there might already be a cloned cDNA of known sequence that confirms the predicted
splice but the purpose of the question is to do an experiment in the presumed absence of such
knowledge. You might suggest making a cDNA library, finding the correct cDNA and sequencing.
However, that includes a tremendous amount of unnecessary work. It is much better to use PCR to
amplify the nucleic acid of interest and leave all the other mRNAs in the cell undisturbed. You can
make an anti-sense (compared to mRNA) primer to a stretch of, say, 25nt near the end of the second
sequence segment and use this primer to hybridize to the RNA sample, followed by addition of
reverse transcriptase, suitable buffer and nucleotides. You can then convert the single-stranded DNA
into amplified double-stranded DNA by adding a sense direction primer based on predicted
transcribed sequences from the first segment and performing PCR (both primers present). This
should produce one or more bands (separate on a gel if there is more than one). Remove primers (or
purify single bands if necessary) and use one of the primers to sequence. You should recover cDNA
sequence that spans the splice junction. In practice you would probably want to make a slightly
longer PCR product (for easy visualization); here we are constrained by the small amount of
sequence written down. The advantage of using primers spaced further apart is that you also have
more chance of capturing more than one pattern of splicing in the RT-PCR products.
Some answers suggest first making single or double-stranded cDNA generically (oligodT primer &
then self-priming) before PCR amplification of required cDNA. It is simpler and better to use
gene-specific primers only for RT and PCR steps.
2. (i) The longest “ORFs” found by this program (quotation marks because I firmly believe that
requiring a Met to initiate an ORF is not helpful and ought not to be part of the definition of an ORF)
are 87-1643 in one reading frame, 2018-2320 in another, and 2392-2703 in a third reading frame
(here the stop codon is also included in the numbering; an odd choice).
The ORFs in the other orientation would normally be worth examining, but here in the question I
direct your attention only to the sense strand being written left to right.
(ii) Using this program, which allows a more useful ORF definition of not necessarily beginning with
Met, the largest ORFs are
2299-2703 (frame 1)
401-832 (2), 1679-2320 (2), 2477-2815 (2)
3-1643 (3)
Several answers omitted some of these ORFs, probably by inappropriate setting of length cut-off,
which made it harder to produce a full answer for (iii). Obviously, it is important to conduct these
searches accurately to be able to use the products.
(iii) The second program finds additional long ORFs because some of these lack a Met codon in
frame entirely or near the beginning of the ORF. The ORFs in the second program also start
earlier because they start immediately after the stop codon, not at the first ATG. The end-points
are the same.
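The difference between the two ORF definitions can be shown on a toy sequence. This sketch is not either of the programs used in the question; the function name, the 0-based coordinates and the choice to include the stop codon in the end coordinate are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame, require_atg):
    """Return (start, end) pairs for ORFs closed by a stop codon in one frame.

    With require_atg=False an ORF starts right after the previous stop (or at
    the frame start); with require_atg=True it starts at the first ATG.
    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    found = []
    open_from = frame - 1
    first_atg = None
    for i in range(frame - 1, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon == "ATG" and first_atg is None:
            first_atg = i
        if codon in STOP_CODONS:
            begin = first_atg if require_atg else open_from
            if begin is not None and i > begin:
                found.append((begin, i + 3))
            open_from, first_atg = i + 3, None
    return found

# Toy sequence: CCC AAA ATG GGG TAA CCC GGG TTT TGA
toy = "CCCAAAATGGGGTAACCCGGGTTTTGA"
print(orfs_in_frame(toy, 1, require_atg=True))
print(orfs_in_frame(toy, 1, require_atg=False))
```

In the toy case the Met-requiring definition misses the second ORF entirely and starts the first one later, while both definitions agree on where the first ORF ends.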
The first long ORF ends by 1643, so we would expect that to connect to downstream ORFs by
splicing from a position prior to 1643 (so look for a splice donor upstream of 1643- we cannot
predict the distance but it is likely to be quite short). In looking for a connecting downstream exon
we should be searching anywhere downstream of about 1500 for an ORF of a reasonable size
(with no requirement for a Met codon in that ORF). The second program indicates that we should
be looking for a splice acceptor (probably early) in the segment from 1679-2320 (program 1 would
lead us only to look at 2018-2320 if we did not realize its great shortcoming in defining an ORF). If
an exon within 1679-2320 is used we can look for the ORF to be continued by a second splice.
The splice donor will likely be a little upstream of 2320. Program 2 tells us to look for an acceptor
in the 2299-2703 interval or perhaps the 2477-2815 interval. If the former region has an exon then
we might look for yet another splice connecting to the 2477-2815 region, although that may not
have the potential to extend an ORF very much. We don’t know a priori that the longest possible
ORF will be made. We just know that very long ORFs are not created by random sequences so
whenever we see a long ORF it is usually because there were selective pressures to maintain that
ORF because the ORF is actually used. That argument obviously applies to an ORF of 400
codons but confident predictions are much more difficult for ORFs of 50 codons (which are
common).
The most common problems with answers were
(a) not appreciating that exons downstream of the first coding exon do not need an ATG to start
their ORFs. Translation initiation will have taken place in an upstream exon.
(b) not appreciating that splicing can change the reading frame. In other words, splicing can connect consecutive ORFs
even if they are in different reading frames (relative to an arbitrary, fixed starting position in
genomic DNA).
(c) not responding to the request to estimate where splices would be made & not appreciating that
the order of ORFs is crucial to those estimates.
(iv) This program indicates an intron between 1639 and 1699, and between 2273 and 2342. This
conforms to the possibilities or expectations discussed in (iii), and indeed the splice junctions
conform reasonably well to consensus sequences (though some would be hard to pick out by eye).
The initiator codon is predicted to be at position 87. The program must assume you are giving it
sufficient sequence and hence that there would be no upstream sequences. However, in practice
there might easily be an upstream exon.
Several answers made some incorrect extrapolations. Just because there is a Met codon at 87
does not mean that the exon starts there. Hence, it is incorrect to draw the exon starting there or
call that a splice junction (same goes for position 1). The program happens to predict a polyA
addition signal at 2709-2714. That does not mean there is a splice at 2709 (you must read the program
output carefully). The deduction of an intron between 2704-2708 should in any case strike you as
odd (no introns are that small; there is no space for the chemistry of splicing).
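A quick size check makes the last point concrete. The sketch below assumes each pair of coordinates means (last nucleotide of the upstream exon, first nucleotide of the downstream exon); the programs may number boundaries differently, so the exact lengths are indicative only. The 30 nt threshold is the rough lower bound mentioned in answer 1(ii), and the function names are hypothetical.

```python
MIN_INTRON_NT = 30   # rough lower bound quoted in answer 1(ii)

def intron_length(exon_end, next_exon_start):
    """Length of the intron lying strictly between two exon boundaries (assumed convention)."""
    return next_exon_start - exon_end - 1

def implausible_introns(boundaries, min_len=MIN_INTRON_NT):
    """Return the boundary pairs whose implied intron is shorter than min_len."""
    return [(a, b) for a, b in boundaries if intron_length(a, b) < min_len]

predicted = [(1639, 1699), (2273, 2342), (2704, 2708)]
for a, b in predicted:
    print(a, b, intron_length(a, b))
print(implausible_introns(predicted))
```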
(v) Now we find that the first exon is split into two predicted exons (1 (with Met codon at 87)-197
and 270-1639). If you look at sequences around 197 and around 270 you will see excellent splice
consensus matches. The C to T change was not in these regions. So, from the cell’s point of view
the C to T change likely made no difference at all to the way the RNA was recognized by splicing
factors. In other words the splicing pattern in the cell would be the same for (iv) and (v). We
cannot be sure what it is, however, just by prediction because not every really good-looking
potential splice site is used in vivo (my personal guess in this case would be that the 197-270 splice
does take place). Why does Genscan make different predictions? Clearly, Genscan is weighing
up different factors. On the one hand, if there is a continuous long ORF there is no reason to
predict interrupting it to make a splice site. On the other hand there is an excellent pair of
consensus splice sites that restore the long ORF. Once the disputed intronic/exonic sequence
(197-270) has a stop codon to interrupt the long ORF, however, the choice becomes easy; only
making the splice will preserve the long ORF, and creating a long ORF is a heavily-weighted
parameter in the prediction program.
Most answers did not discuss explicitly why the prediction program was so heavily influenced by
insertion of a stop codon.
(vi) This comparison clearly shows introns from 197-270, 1639-1699 and 2273-2342.
(vii) For the normal sequence (without the C to T change) Genscan appears to have made an
incorrect prediction about the 197-270 splice, but was correct elsewhere. Genscan was incorrect in
failing to point out the possibility of the 197-270 splice. However, it might be making a valuable
prediction. Perhaps two different splicing patterns are possible. One could try and find additional
cDNAs derived perhaps from a variety of tissues to see if any matched the Genscan prediction.
Alternatively, you could use RT-PCR primers from before & after the contested splice to amplify
cDNAs for that region from various RNA samples. Do you see one band size or two? Here you
would have to be careful with controls to make sure you are not amplifying genomic DNA or
incompletely processed RNAs (make one primer downstream of a second splice so that you only
count molecules spliced at that second site). It is important to emphasize the use of a variety of
RNA samples whenever searching for variety in mRNA forms.
This question illustrates a couple of generalities. Gene prediction programs are not all that good
(here a problem within coding regions in a compact gene with small introns is exposed but the
programs are much worse at connecting across long introns and in finding exons with only 5’UTR
or 3’ UTR). Alignments among multiple species and combining several programs help, but do
not solve the problem. Thus, determining mRNA structure by experiment (mainly cDNA
sequences) is the essential gold standard. “Predictions” will often confirm those demonstrated
RNA structures and sometimes, as here, will point out reasonable additional possibilities to be
investigated.
A small point: several answers claimed that BLAST shows the last exon beginning at 2340 instead
of 2341 (including the G of the intron’s AG). That appears to be the case only because the last
nucleotide of the preceding exon is a G. In the BLAST program that G is used twice (in comparison to
the genomic sequence around the splice donor and around the splice acceptor). If you were to write
out this region of genomic vs cDNA comparison you would assign the G to the penultimate exon
and not the terminal exon so that the GT and AG intron sequence consensus was preserved.
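The ambiguity can be reproduced on a toy example. The sketch below enumerates every way of splitting a cDNA so that both halves match the genomic sequence, and flags the placement that leaves a GT…AG intron. The sequences are invented, the function name is hypothetical, and it assumes a single intron with the genomic sequence starting and ending exactly where the cDNA does.

```python
def junction_placements(genomic, cdna):
    """Return (split, intron, is_gt_ag) for every split of cdna consistent with genomic.

    split is the number of cDNA nucleotides assigned to the upstream exon.
    Assumes one intron and that genomic begins and ends where the cDNA does.
    """
    placements = []
    for split in range(1, len(cdna)):
        tail = len(cdna) - split
        if genomic[:split] == cdna[:split] and genomic[len(genomic) - tail:] == cdna[split:]:
            intron = genomic[split:len(genomic) - tail]
            is_gt_ag = intron.startswith("GT") and intron.endswith("AG")
            placements.append((split, intron, is_gt_ag))
    return placements

# Invented example: the upstream exon ends in G and the intron ends in AG.
EXON1 = "ATGCCG"
INTRON = "GTAAGT" + "TC" * 12 + "TTCAG"
EXON2 = "CTTAA"
genomic = EXON1 + INTRON + EXON2
cdna = EXON1 + EXON2

for split, intron, ok in junction_placements(genomic, cdna):
    print(split, intron[:2], intron[-2:], ok)
```

Two placements fit the alignment equally well, but only the one that gives the G to the upstream exon preserves the GT and AG ends of the intron.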
3.
(i) Simply look for the longest ORF and then for the most upstream Met codon in that ORF. It is
important to mention the longest ORF. The first Met is not always used. There is also a loose
consensus sequence for favored Met codons (known as the Kozak sequence). You then translate
each codon until you reach a stop codon.
(ii) Most long ORFs have a stop codon upstream of the first Met codon. If your long ORF is still
open at the upstream end of the cDNA there is a good chance you are missing some part of the
mRNA and the ORF. If your cDNA does not have any remnant of a polyA tail you are likely missing
some portion of the mRNA 3’ end being represented.
You can compare the size of the cDNA with the size of band(s) on Northern blots using the cDNA
as probe. A polyA tail may occupy about 200nt & will not be fully represented on your cDNA and
length estimation from a Northern is imprecise but if your cDNA is more than 300-400 bp shorter
than the RNA you are missing something.
If you isolated or examined several cDNAs you could compare them by overall size and restriction
digests to see if they are co-extensive at each end. You could analyze cloned 5’-RACE products or
bulk PCR-amplified RACE product in the same way. If your cDNA were no shorter than others and
if several had the same size you can be fairly confident that they are full-length.
From sequencing a cDNA you cannot tell anything about whether the template had a cap. The
mRNA cap is joined by a 5’-5’ linkage and so will not be copied into cDNA.
(iii) You can design compatible primers for exon 1 and 3 and use RT-PCR on the two types of RNA
sample. Products can be examined on a gel for size and, if necessary for confirmation, by
sequencing gel-purified fragments with one of the primers. You can make primers that span
specific splice junctions but that seems unnecessarily complicated and less clear to me. If the two
RNAs had sufficiently different sizes you might be able to see two species by Northern blot.
However, the experiment is more arduous, less sensitive and less precise- differences in overall
mRNA length could be due to other reasons.
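To see why a gel readout is usually enough for (iii), it helps to predict the band sizes first. The sketch below assumes the two RNA forms differ by inclusion or skipping of exon 2 (the answer only specifies primers in exons 1 and 3), and every length in it is a made-up number for illustration.

```python
def product_size(fwd_to_exon1_end, exon3_start_to_rev, included_exon_lengths):
    """RT-PCR product length for primers in exon 1 and exon 3.

    fwd_to_exon1_end: nt from the 5' end of the forward primer to the end of exon 1.
    exon3_start_to_rev: nt from the start of exon 3 to the 5' end of the reverse primer.
    included_exon_lengths: lengths of any exons spliced in between.
    """
    return fwd_to_exon1_end + sum(included_exon_lengths) + exon3_start_to_rev

# Hypothetical lengths: 80 nt in exon 1, 90 nt in exon 3, exon 2 is 120 nt.
with_exon2 = product_size(80, 90, [120])
without_exon2 = product_size(80, 90, [])
print(with_exon2, without_exon2)
```

A size difference equal to the length of the alternative exon is what distinguishes the two bands; unspliced genomic DNA would give a still larger product.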
As in some previous questions, when you know sequences of some (or all) exons and you are
concentrating on one gene you should use gene-specific primers for RT-PCR and not oligodT,
which will create cDNA copies of a huge number of mRNAs that are of no interest and capable only
of contributing background to subsequent PCR amplifications. In any reaction that requires
template & primers it is important to specify the template and primers (as precisely as your
information allows) in order to describe what you are doing.
(iv) The essential idea is to hybridize first strand cDNA copied from an RNA sample to an excess of
the same RNA sample. Hybrids are removed and single-stranded cDNAs converted to ds-cDNA
and cloned. During the hybridization, the proportion of abundant cDNA species will be reduced
relative to rare cDNA species because there is a greater concentration of abundant RNA species in
the hybridization.
While all aspects of a technique like this must be optimized for it to work, the use of excess mRNA
as a driver for hybridization is more important than details of duplex removal by magnetic beads to
the general principle of normalization.
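The principle in (iv) can be illustrated with a simple kinetic model. With RNA in excess as driver, the fraction of a cDNA species still single-stranded falls roughly as exp(-k · [RNA] · t), so species with abundant RNA are depleted faster. The abundances and the lumped rate constant below are invented numbers, and the model ignores sequence length and complexity effects.

```python
import math

def remaining_single_stranded(abundance, rate_times_time):
    """cDNA left single-stranded per species after RNA-driven hybridization.

    abundance maps species name to relative mRNA level; the starting cDNA is
    assumed proportional to it. rate_times_time lumps the rate constant and time.
    """
    return {name: level * math.exp(-rate_times_time * level)
            for name, level in abundance.items()}

# Hypothetical relative abundances in the RNA sample.
rna = {"abundant": 1000.0, "medium": 50.0, "rare": 1.0}
left = remaining_single_stranded(rna, rate_times_time=0.005)

ratio_before = rna["abundant"] / rna["rare"]
ratio_after = left["abundant"] / left["rare"]
print(ratio_before, round(ratio_after, 1))
```

In this toy run the abundant-to-rare ratio collapses from 1000 to under 10, while almost all of the rare cDNA survives to be cloned.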
(v) Most differentially spliced RNAs share most of their sequences. Hence, a cDNA copy of one
form will be depleted by hybridization to RNA of a differently spliced form. So, isoform diversity will
be reduced as the overall representation of a certain gene is reduced. In other words,
normalization does not equalize the prevalence of different splice variants.
Many answers did not consider that alternatively spliced RNAs will likely overlap extensively and
hence permit extensive hybridization between the cDNA of one form and mRNA of another form.
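The point in (v) follows from the same kind of toy model if the isoforms are assumed to cross-hybridize completely, which is a simplification. The driver seen by each isoform's cDNA is then the combined RNA of all isoforms of the gene, so both are depleted by the same factor. All numbers and names below are hypothetical.

```python
import math

def isoforms_after_normalization(isoform_levels, rate_times_time):
    """cDNA left per isoform when every isoform's RNA can drive every isoform's cDNA.

    Assumes complete cross-hybridization through shared exons, so the
    depletion factor depends only on the gene's total RNA level.
    """
    total = sum(isoform_levels.values())
    factor = math.exp(-rate_times_time * total)
    return {name: level * factor for name, level in isoform_levels.items()}

# Hypothetical gene with a major and a minor splice form.
gene = {"major_form": 900.0, "minor_form": 100.0}
left = isoforms_after_normalization(gene, rate_times_time=0.005)
print(gene["major_form"] / gene["minor_form"],
      left["major_form"] / left["minor_form"])
```

The major-to-minor ratio is unchanged, while the minor form is depleted as heavily as the major one instead of being protected like a genuinely rare transcript.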