Biotechnology Homework 4 Answers Fall 2009

1. (i) Starting at the first nucleotide I looked for triplets that were stop codons and marked them with an “X” below. Then I shifted by one nucleotide for reading frame 2, repeated the search, and then shifted to reading frame 3. I then picked the most likely ORF simply as being the longest (a program could also look at codon usage). For the first segment a stretch of reading frame 2, starting TCC and ending TAA, is the longest, spanning almost the entire segment. For the second segment reading frame 3 has the longest ORF, from CGA to (beyond) the end of the sequence shown.

……….GTAATCCATCCCACCTCAGGCGTACCACATGTCCTACATGGCTAAAACGCTGAA
GGAGCAGAGTAATCCCGTGCAGTTAGTGCGACAGTACTTTAAGAAACTCGAGAGGGTA
AGTCCAAAACCAAAATTATAAATAATA….…AAGCAAACCCAAAATACTAACGATAT
TGTTATTCATATCTTTGGCAGGAACACGTCACCCTGGATGCCAGGAGACAGGTAATCAG
CCCCGATGACTGCGAAGATCTTAGCCCGATGACACAAACAGCTGCTCCACCGGTTCGTA
GGCTAGGCGGAATC…………

[In the original, rows of dashes beneath each sequence line marked stop codons (X) in reading frames 1, 2 and 3; those rows, and the bold and underlined formatting referred to below, are not reproduced here.]

For (i) and (ii) there was no specific request to “Explain” but it is always a good idea to explain your logic. That allows misunderstandings to be exposed and allows you to gain partial credit. It also, of course, shows that you derived the answer.

(ii) In searching the first segment for a splice donor I look first for the invariant GT consensus & then examine each of these examples for a match to the more extensive sequence. Only one position has a good match: AGGGTAAGT. If that site is used the GT is the start of the intron (which I have made bold above). If the first segment is translated in reading frame 2 the reading frame immediately before the splice donor will be CTC GAG AGG (gt…), so this will continue after a splice acceptor with the first nucleotide of the next exon being the first nucleotide of a codon. When looking for splice acceptors (AG, then match to CAGG with preceding pyrimidines) there are two strong candidates in segment two (underlined above; a third sequence is not quite as good). These two splice acceptors appear (to us) to be roughly equally good. However, their use will connect the previous exon into different reading frames.
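The reading-frame bookkeeping in (ii) can be checked mechanically. The sketch below is only an illustration: it assumes segment 2 is exactly the sequence printed above, numbers nucleotides from the first one shown, treats every CAGG as a candidate acceptor (ignoring the preceding pyrimidines), and uses a made-up helper name, `acceptor_frames`. Because the upstream exon ends on a complete codon, the frame in which the next exon is read is simply the frame of its first nucleotide.

```python
# Segment 2 as printed in the answer (the "…" gaps are omitted).
SEGMENT_2 = (
    "AAGCAAACCCAAAATACTAACGATAT"
    "TGTTATTCATATCTTTGGCAGGAACACGTCACCCTGGATGCCAGGAGACAGGTAATCAG"
    "CCCCGATGACTGCGAAGATCTTAGCCCGATGACACAAACAGCTGCTCCACCGGTTCGTA"
    "GGCTAGGCGGAATC"
)


def acceptor_frames(seq, motif="CAGG"):
    """Return (exon_start, frame) for each candidate acceptor.

    exon_start is the 1-based position of the first exon nucleotide
    (the G that follows the intron's terminal AG); frame is 1, 2 or 3,
    counted from the first nucleotide of seq.
    """
    hits = []
    start = seq.find(motif)
    while start != -1:
        exon_index = start + 3  # 0-based index of the first exon nucleotide
        hits.append((exon_index + 1, exon_index % 3 + 1))
        start = seq.find(motif, start + 1)
    return hits


candidates = acceptor_frames(SEGMENT_2)
for exon_start, frame in candidates:
    print(f"exon would start at nt {exon_start}, reading frame {frame}")
```

Three CAGG matches turn up. The first two fall in different reading frames, which is the point of the argument; judging which candidates are "strong" still requires looking at the pyrimidine tract, which this sketch does not do.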
The first splice will connect to reading frame 3, which is the long ORF, but the second splice connects to reading frame 2, which ends very shortly thereafter. Thus, only the first splice acceptor joins the two long ORFs (and the intron preceding the second exon is in bold above).

Some answers looked for splicing events within each segment. While such events may take place, such splices do not help in connecting ORFs that are separated by stop codons. Also, introns are always larger than about 30 nt, eliminating some of the possibilities. Many answers did not see or acknowledge that the second segment had two perfectly good splice acceptors and that the choice depends on maintaining the ORF.

(iii) The initiator should be the first ATG in the correct reading frame of the most upstream long ORF (reading frame 2 of the first segment). That ATG is underlined. The Kozak sequence can influence ATG choice but being the first ATG is usually the more significant criterion.

(iv) If the initiator codon is not in the sequences shown it is likely to be further upstream. In this particular example there is a stop codon further upstream in reading frame 2 so there must be at least one additional exon that splices to the region shown. The general answer is that we need to know the complete mRNA sequence, especially the most 5’ regions. That is most readily accomplished by isolating cloned cDNAs and sequencing. If none exist, a mini-cDNA library specific to this gene can be made using PCR primers for the sequences shown in a 5’-RACE procedure, followed by sequencing cloned cDNA ends. The cDNA will clearly reveal which ATG is first in a long ORF of a fully spliced mRNA. In this particular example you can see a good splice acceptor sequence early in segment 1, so it is quite likely that there are upstream sequences that splice to the sequences shown.

Many answers simply wanted to examine genomic sequence, especially limited to the 2 kb fragment.
The best this can do is to put you in a position to predict upstream exons with coding potential and guess whether they are spliced to the rest of the mRNA. This is usually the weakest aspect of prediction programs because the first coding exon may be small & distant. Performing primer extension (or S1 nuclease or RNase protection) is good because it examines RNA but it can only give you length information. Primer extension will not tell you if an upstream splice was made or where the first exon is in genomic DNA (or, of course, its sequence). S1 nuclease protection tells you the genomic DNA from the probe that is present in mRNA but cannot tell you if there is another exon upstream. These methods are also less sensitive than an RT-PCR method (like 5’-RACE), so they are rarely used for this type of purpose.

(v) In practice there might already be a cloned cDNA of known sequence that confirms the predicted splice but the purpose of the question is to do an experiment in the presumed absence of such knowledge. You might suggest making a cDNA library, finding the correct cDNA and sequencing. However, that includes a tremendous amount of unnecessary work. It is much better to use PCR to amplify the nucleic acid of interest and leave all the other mRNAs in the cell undisturbed. You can make an anti-sense (compared to mRNA) primer to a stretch of, say, 25 nt near the end of the second sequence segment and use this primer to hybridize to the RNA sample, followed by addition of reverse transcriptase, suitable buffer and nucleotides. You can then convert the single-stranded DNA into amplified double-stranded DNA by adding a sense direction primer based on predicted transcribed sequences from the first segment and performing PCR (both primers present). This should produce one or more bands (separate on a gel if there is more than one). Remove primers (or purify single bands if necessary) and use one of the primers to sequence.
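The primer choice in (v) can be sketched in a few lines. This is an illustration only: it uses the sequence printed for question 1, assumes the last 25 nt shown of segment 2 and the chosen stretch of segment 1 are both exonic, and ignores real design criteria such as melting temperature and primer compatibility. The helper `reverse_complement` and the variable names are made up for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# End of segment 2 as printed (mRNA-sense strand, 5' to 3').
SEGMENT_2_END = (
    "CCCCGATGACTGCGAAGATCTTAGCCCGATGACACAAACAGCTGCTCCACCGGTTCGTA"
    "GGCTAGGCGGAATC"
)

# Part of segment 1 upstream of the predicted splice donor (sense strand).
SEGMENT_1_EXON = "GGAGCAGAGTAATCCCGTGCAGTTAGTGCGACAGTACTTTAAGAAACTCGAGAGG"

# Anti-sense primer: complementary to the mRNA, so it can prime reverse
# transcription and then serve as the reverse primer in PCR.
rt_primer = reverse_complement(SEGMENT_2_END[-25:])

# Sense primer: same sequence as the mRNA, used as the forward primer in PCR.
sense_primer = SEGMENT_1_EXON[:25]

print("anti-sense (RT and reverse) primer:", rt_primer)
print("sense (forward) primer:            ", sense_primer)
```

Both primers are written 5' to 3'. A product amplified between them from spliced RNA would lack the intron, so sequencing it from either primer should read across the junction.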
You should recover cDNA sequence that spans the splice junction. In practice you would probably want to make a slightly longer PCR product (for easy visualization); here we are constrained by the small amount of sequence written down. The advantage of using primers spaced further apart is that you also have more chance of capturing more than one pattern of splicing in the RT-PCR products.

Some answers suggest first making single or double-stranded cDNA generically (oligodT primer & then self-priming) before PCR amplification of the required cDNA. It is simpler and better to use gene-specific primers only for the RT and PCR steps.

2. (i) The longest “ORFs” found by this program (quotation marks because I firmly believe that requiring a Met to initiate an ORF is not helpful and ought not to be part of the definition of an ORF) are 87-1643 in one reading frame, 2018-2320 in another, and 2392-2703 in a third reading frame (here the stop codon is also included in the numbering; an odd choice). The ORFs in the other orientation would normally be worth examining, but here in the question I direct your attention only to the sense strand being written left to right.

(ii) Using this program, which allows a more useful ORF definition of not necessarily beginning with Met, the largest ORFs are:

2299-2703 (frame 1)
401-832 (frame 2)
1679-2320 (frame 2)
2477-2815 (frame 2)
3-1643 (frame 3)

Several answers omitted some of these ORFs, probably by inappropriate setting of the length cut-off, which made it harder to produce a full answer for (iii). Obviously, it is important to conduct these searches accurately to be able to use the products.

(iii) The second program finds additional long ORFs because some of these lack a Met codon in frame entirely or near the beginning of the ORF. The ORFs in the second program also start earlier because they start immediately after the stop codon, not at the first ATG. The end-points are the same.
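The difference between the two ORF definitions is easy to see in code. The sketch below is not either program's actual algorithm, just a minimal illustration. Since the sequence analysed in question 2 is not reproduced here, it borrows the first segment from question 1 as sample data, and the function name `orfs_in_frame` is made up.

```python
STOPS = {"TAA", "TAG", "TGA"}

# First segment from question 1, as printed (used only as sample data).
SEGMENT_1 = (
    "GTAATCCATCCCACCTCAGGCGTACCACATGTCCTACATGGCTAAAACGCTGAA"
    "GGAGCAGAGTAATCCCGTGCAGTTAGTGCGACAGTACTTTAAGAAACTCGAGAGGGTA"
    "AGTCCAAAACCAAAATTATAAATAATA"
)


def orfs_in_frame(seq, frame, require_atg):
    """List (start, end) of ORFs in one frame, 1-based, stop codon included.

    require_atg=False: an ORF starts right after the previous stop codon
    (or at the first complete codon of the frame).
    require_atg=True: an ORF starts at the first ATG after the previous stop.
    Stretches still open at the end of seq are not reported.
    """
    orfs = []
    start = None if require_atg else frame - 1  # 0-based index of ORF start
    for i in range(frame - 1, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            if start is not None and i > start:
                orfs.append((start + 1, i + 3))
            start = None if require_atg else i + 3
        elif require_atg and start is None and codon == "ATG":
            start = i
    return orfs


stop_to_stop = orfs_in_frame(SEGMENT_1, 2, require_atg=False)
atg_to_stop = orfs_in_frame(SEGMENT_1, 2, require_atg=True)
print("stop-to-stop:", stop_to_stop)
print("ATG-to-stop: ", atg_to_stop)
```

In reading frame 2 of this sample both definitions report one ORF with the same end-point, but the stop-to-stop version starts earlier, immediately after the upstream stop codon rather than at the first ATG.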
The first long ORF ends by 1643, so we would expect that to connect to downstream ORFs by splicing from a position prior to 1643 (so look for a splice donor upstream of 1643; we cannot predict the distance but it is likely to be quite short). In looking for a connecting downstream exon we should be searching anywhere downstream of about 1500 for an ORF of a reasonable size (with no requirement for a Met codon in that ORF). The second program indicates that we should be looking for a splice acceptor (probably early) in the segment from 1679-2320 (program 1 would lead us only to look at 2018-2320 if we did not realize its great shortcoming in defining an ORF). If an exon within 1679-2320 is used we can look for the ORF to be continued by a second splice. The splice donor will likely be a little upstream of 2320. Program 2 tells us to look for an acceptor in the 2299-2703 interval or perhaps the 2477-2815 interval. If the former region has an exon then we might look for yet another splice connecting to the 2477-2815 region, although that may not have the potential to extend an ORF very much.

We don’t know a priori that the longest possible ORF will be made. We just know that very long ORFs are not created by random sequences, so whenever we see a long ORF it is usually because there were selective pressures to maintain that ORF because the ORF is actually used. That argument obviously applies to an ORF of 400 codons but confident predictions are much more difficult for ORFs of 50 codons (which are common).

The most common problems with answers were
(a) not appreciating that exons downstream of the first coding exon do not need an ATG to start their ORFs. Translation initiation will have taken place in an upstream exon.
(b) not appreciating that splicing can change the reading frame. In other words, splicing can connect consecutive ORFs even if they are in different reading frames (relative to an arbitrary, fixed starting position in genomic DNA).
(c) not responding to the request to estimate where splices would be made & not appreciating that the order of ORFs is crucial to those estimates.

(iv) This program indicates an intron between 1639 and 1699, and another between 2273 and 2342. This conforms to the possibilities or expectations discussed in (iii), and indeed the splice junctions conform reasonably well to consensus sequences (though some would be hard to pick out by eye). The initiator codon is predicted to be at position 87. The program must assume you are giving it sufficient sequence and hence that there would be no upstream sequences. However, in practice there might easily be an upstream exon.

Several answers made some incorrect extrapolations. Just because there is a Met codon at 87 does not mean that the exon starts there. Hence, it is incorrect to draw the exon starting there or call that a splice junction (the same goes for position 1). The program happens to predict a polyA addition signal at 2709-2714. That does not mean there is a splice at 2709 (you must read the program output carefully). The deduction of an intron between 2704-2708 should in any case strike you as odd (no introns are that small; there is no space for the chemistry of splicing).

(v) Now we find that the first exon is split into two predicted exons (1-197, with the Met codon at 87, and 270-1639). If you look at sequences around 197 and around 270 you will see excellent splice consensus matches. The C to T change was not in these regions. So, from the cell’s point of view the C to T change likely made no difference at all to the way the RNA was recognized by splicing factors. In other words the splicing pattern in the cell would be the same for (iv) and (v). We cannot be sure what it is, however, just by prediction because not every really good-looking potential splice site is used in vivo (my personal guess in this case would be that the 197-270 splice does take place).

Why does Genscan make different predictions?
Clearly, Genscan is weighing up different factors. On the one hand, if there is a continuous long ORF there is no reason to predict interrupting it to make a splice site. On the other hand there is an excellent pair of consensus splice sites that restore the long ORF. Once the disputed intronic/exonic sequence (197-270) has a stop codon to interrupt the long ORF, however, the choice becomes easy; only making the splice will preserve the long ORF, and creating a long ORF is a heavily-weighted parameter in the prediction program. Most answers did not discuss explicitly why the prediction program was so heavily influenced by insertion of a stop codon.

(vi) This comparison clearly shows introns from 197-270, 1639-1699 and 2273-2342.

(vii) For the normal sequence (without the C to T change) Genscan appears to have made an incorrect prediction about the 197-270 splice, but was correct elsewhere. Genscan was incorrect in failing to point out the possibility of the 197-270 splice. However, it might be making a valuable prediction. Perhaps two different splicing patterns are possible. One could try to find additional cDNAs, derived perhaps from a variety of tissues, to see if any matched the Genscan prediction. Alternatively, you could use RT-PCR primers from before & after the contested splice to amplify cDNAs for that region from various RNA samples. Do you see one band size or two? (Here you would have to be careful with controls to make sure you are not amplifying genomic DNA or incompletely processed RNAs; make one primer downstream of a second splice so that you only count molecules spliced at that second site.) It is important to emphasize the use of a variety of RNA samples whenever searching for variety in mRNA forms.

This question illustrates a couple of generalities.
Gene prediction programs are not all that good (here a problem within coding regions in a compact gene with small introns is exposed, but the programs are much worse at connecting across long introns and at finding exons with only 5’ UTR or 3’ UTR). Alignments among multiple species and combining several programs help, but do not solve the problem. Thus, determining mRNA structure by experiment (mainly cDNA sequences) is the essential gold standard. “Predictions” will often confirm those demonstrated RNA structures and sometimes, as here, will point out reasonable additional possibilities to be investigated.

A small point: several answers claimed that BLAST shows the last exon beginning at 2340 instead of 2341 (including the G of the intron’s AG). That appears to be the case only because the last nucleotide of the preceding exon is a G. In the BLAST program that G is used twice (in comparison to the genomic sequence around the splice donor and around the splice acceptor). If you were to write out this region of genomic vs cDNA comparison you would assign the G to the penultimate exon and not the terminal exon so that the GT and AG intron sequence consensus was preserved.

3. (i) Simply look for the longest ORF and then for the most upstream Met codon in that ORF. It is important to mention the longest ORF. The first Met is not always used. There is also a loose consensus sequence for favored Met codons (known as the Kozak sequence). You then translate each codon until you reach a stop codon.

(ii) Most long ORFs have a stop codon upstream of the first Met codon. If your long ORF is still open at the upstream end of the cDNA there is a good chance you are missing some part of the mRNA and the ORF. If your cDNA does not have any remnant of a polyA tail you are likely missing some portion of the 3’ end of the mRNA. You can compare the size of the cDNA with the size of band(s) on Northern blots using the cDNA as probe.
A polyA tail may occupy about 200 nt & will not be fully represented on your cDNA, and length estimation from a Northern is imprecise, but if your cDNA is more than 300-400 bp shorter than the RNA you are missing something. If you isolated or examined several cDNAs you could compare them by overall size and restriction digests to see if they are co-extensive at each end. You could analyze cloned 5’-RACE products or bulk PCR-amplified RACE product in the same way. If your cDNA were no shorter than others and if several had the same size you can be fairly confident that they are full-length.

From sequencing a cDNA you cannot tell anything about whether the template had a cap. The mRNA cap is a 5’-5’ linkage and so will not be copied into cDNA.

(iii) You can design compatible primers for exons 1 and 3 and use RT-PCR on the two types of RNA sample. Products can be examined on a gel for size and, if necessary for confirmation, by sequencing gel-purified fragments with one of the primers. You can make primers that span specific splice junctions but that seems unnecessarily complicated and less clear to me. If the two RNAs had sufficiently different sizes you might be able to see two species by Northern blot. However, the experiment is more arduous, less sensitive and less precise; differences in overall mRNA length could be due to other reasons.

As in some previous questions, when you know sequences of some (or all) exons and you are concentrating on one gene you should use gene-specific primers for RT-PCR and not oligodT, which will create cDNA copies of a huge number of mRNAs that are of no interest and capable only of contributing background to subsequent PCR amplifications. In any reaction that requires template & primers it is important to specify the template and primers (as precisely as your information allows) in order to describe what you are doing.
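A quick size calculation shows why primers in exons 1 and 3 are enough for (iii). The sketch assumes the two RNAs differ by inclusion or skipping of exon 2, which is an assumption about a question not reproduced here. All exon lengths and primer positions are invented numbers, and `product_size` is a hypothetical helper.

```python
# Invented exon lengths (nt), purely for illustration.
EXON_LENGTHS = {1: 180, 2: 120, 3: 240}


def product_size(included_exons, fwd_start_in_first, rev_end_in_last):
    """Expected RT-PCR product length for an mRNA made of included_exons.

    fwd_start_in_first: 1-based position within the first exon of the
        forward primer's 5' end.
    rev_end_in_last: 1-based position within the last exon (mRNA-sense
        coordinates) of the reverse primer's 5' end.
    """
    first = included_exons[0]
    middle = sum(EXON_LENGTHS[e] for e in included_exons[1:-1])
    from_first = EXON_LENGTHS[first] - fwd_start_in_first + 1
    return from_first + middle + rev_end_in_last


with_exon_2 = product_size([1, 2, 3], fwd_start_in_first=101, rev_end_in_last=100)
without_exon_2 = product_size([1, 3], fwd_start_in_first=101, rev_end_in_last=100)
print("exon 2 included:", with_exon_2, "bp")
print("exon 2 skipped: ", without_exon_2, "bp")
```

With the same primer pair the two products differ by exactly the length of the alternative exon, which is what the gel resolves.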
(iv) The essential idea is to hybridize first strand cDNA copied from an RNA sample to an excess of the same RNA sample. Hybrids are removed and single-stranded cDNAs converted to ds-cDNA and cloned. During the hybridization, the proportion of abundant cDNA species will be reduced relative to rare cDNA species because there is a greater concentration of abundant RNA species in the hybridization. While all aspects of a technique like this must be optimized for it to work, the use of excess mRNA as a driver for hybridization is more important to the general principle of normalization than details of duplex removal by magnetic beads.

(v) Most differentially spliced RNAs share most of their sequences. Hence, a cDNA copy of one form will be depleted by hybridization to RNA of a differently spliced form. So, isoform diversity will be reduced as the overall representation of a certain gene is reduced. In other words, normalization does not equalize the prevalence of different splice variants.

Many answers did not consider that alternatively spliced RNAs will likely overlap extensively and hence permit extensive hybridization between the cDNA of one form and mRNA of another form.
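The argument in (iv) and (v) can be illustrated with a toy model. It assumes pseudo-first-order kinetics with RNA in large excess, so the fraction of a cDNA species still single-stranded is exp(-k·t·R), where R is the concentration of RNA able to hybridize to it. The value of k·t, the abundances and the function name are invented, and cDNA of one isoform is assumed to hybridize equally well to RNA of the other.

```python
import math


def single_stranded_cdna(abundance, driver_rna, kt=0.05):
    """cDNA left unhybridized, in the same arbitrary units as abundance.

    abundance: starting amount of this cDNA species.
    driver_rna: total RNA able to hybridize to it.
    kt: rate constant multiplied by time (made-up value).
    """
    return abundance * math.exp(-kt * driver_rna)


# Two unrelated genes: each cDNA is driven only by its own mRNA.
abundant = single_stranded_cdna(100.0, driver_rna=100.0)
rare = single_stranded_cdna(1.0, driver_rna=1.0)

# Two splice isoforms of one gene that share most of their sequence:
# both cDNAs are driven by the combined RNA of both isoforms (90 + 10).
major = single_stranded_cdna(90.0, driver_rna=100.0)
minor = single_stranded_cdna(10.0, driver_rna=100.0)

print(f"abundant:rare was 100:1, becomes {abundant / rare:.2f}:1")
print(f"major:minor was 9:1, becomes {major / minor:.2f}:1")
```

In this model the ratio between unrelated genes shrinks sharply, while the ratio between isoforms of one gene is unchanged because both are depleted by the same shared driver.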