BME 1450/Winter 2005/994957169

Statistical analysis of errors created during amplification of transcript tags for serial analysis of gene expression

Karyn S. Ho

Abstract—Serial analysis of gene expression (SAGE) is a quantitative technique for profiling gene transcripts through the rapid generation of unique sequence tags. The resulting data describe the size and distribution of the sample transcriptome and allow easy comparison of gene expression under different disease or stress conditions. However, the amount of starting material required to sequence the transcripts often exceeds the starting sample size. Because polymerase fidelity is not perfect, the required polymerase chain reaction (PCR) amplification steps introduce false tags while reducing the relative abundances of true tags. To simulate the creation and propagation of false tags, tag replication has been modeled as a Galton-Watson branching process. The proportion of authentic tags remaining after amplification is a function of polymerase fidelity, PCR efficiency, SAGE tag length, and the number of PCR cycles run.

Index Terms—Branching Process, PCR Amplification, Sequence Errors, Serial Analysis of Gene Expression.

Manuscript received November 7, 2005. Submitted in partial fulfillment of requirements for BME 1450. Karyn S. Ho is an M.A.Sc. candidate with the Institute of Biomaterials and Biomedical Engineering at the University of Toronto (email: karyn.ho@utoronto.ca).

I. INTRODUCTION

When subjected to different disease or stress conditions, cells modify their gene expression patterns in order to survive and thrive. Consequently, profiling the transcriptome (mRNA) indicates how cells react to better their own chances at survival, or to protect surrounding tissues. Although several methods exist to measure gene expression, serial analysis of gene expression (SAGE) is unique in its ability to quantify transcript abundance for subsequent comparison [1]. Other methods lack sensitivity or can only evaluate a limited number of genes at once [2]. SAGE can also be used in gene discovery because it does not require prior knowledge of the genome in order to proceed [1, 3].

SAGE is based on the premise that a 10 base pair (bp) nucleotide tag contains enough information to relate it back to its original complete mRNA strand, provided it can be taken from a defined position [2]. Given a 10 bp tag length, there are over 1 million possible nucleotide sequences ($4^{10}$), and only around 80 000 transcripts in the entire human genome [2]. Although possible, it is unlikely that multiple genes will share a SAGE tag. To further reduce the possibility of shared tags, LongSAGE is a modified protocol that gives 17 bp tags [4].

The SAGE process was originally proposed by Velculescu et al. [2]. The process is summarized as a flow diagram up to the PCR amplification step in Fig. 1.

[Figure 1: flow diagram. Steps shown: (1) isolate mRNA (3’-poly(A) tails); (2) RT-PCR with biotinylated poly(dT) primer; (3) cleave with restriction enzyme NlaIII (recognizes 5’-CATG-3’ and leaves a CATG overhang); (4) isolate fragments closest to the original 3’-end by conjugation of biotin (*) onto streptavidin beads; (5) divide beads in half (pool A and pool B) and ligate to linker sequences containing Type IIS restriction sites and separate PCR primers; (6) digest with Type IIS enzyme to release blunt-ended tags of defined length from the beads; (7) ligate pool A and pool B to form ditags and amplify by PCR.]

Fig. 1. Process flow diagram of protocol for conversion of raw mRNA sample to SAGE ditags.
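As a quick check of the tag-space argument in the Introduction, the following minimal sketch counts the possible tags for the 10 bp SAGE and 17 bp LongSAGE tag lengths and compares them with the roughly 80 000 transcripts quoted from [2]. The function name `tag_space` and the constant names are illustrative choices, not part of any SAGE software.

```python
# Illustrative arithmetic only; names are hypothetical.
SAGE_TAG_BP = 10          # conventional SAGE tag length (bp)
LONGSAGE_TAG_BP = 17      # LongSAGE tag length (bp)
N_TRANSCRIPTS = 80_000    # approximate transcript count quoted in the text [2]


def tag_space(length_bp: int) -> int:
    """Number of distinct nucleotide sequences of the given length (4 bases per position)."""
    return 4 ** length_bp


if __name__ == "__main__":
    for name, length in (("SAGE", SAGE_TAG_BP), ("LongSAGE", LONGSAGE_TAG_BP)):
        n_tags = tag_space(length)
        # Ratio of available tags to transcripts: larger means shared tags are less likely.
        print(f"{name}: {n_tags:,} possible tags, "
              f"{n_tags / N_TRANSCRIPTS:,.1f} per transcript")
```

The ratio of possible tags to transcripts is only a rough indicator; it does not by itself give the probability that two genes share a tag.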
A few points are necessary to note in order to understand the flow diagram in Fig. 1: unmodified mRNA has a 3’-poly(A) tail, which has high affinity for its complement, poly(T); the mRNA strand can be converted to double-stranded cDNA through reverse transcriptase-polymerase chain reaction (RT-PCR); to generate SAGE tags, a series of enzymatic cleavage and ligation steps is used; restriction enzymes can recognize specific nucleotide sequences, then cleave leaving overhangs (e.g. NlaIII) or blunt ends, sometimes a prescribed distance from the recognition site (e.g. Type IIS); biotin and streptavidin have exceptionally high affinity for one another; prior to amplification, tags ligate in pairs called ditags, which are flanked on both ends by linkers containing primer sequences, and when used as PCR templates will be copied from both ends, giving rise to two copies in each cycle.

Following amplification, the ditags are digested again to release the linker sequences and concatenated into long strands. The 5’-CATG-3’ sequences are retained and punctuate the boundaries between ditags in the concatenated strands, which are then cloned and sequenced. The tag counts are subsequently related back to their original genes, and gene expression patterns can then be compared between samples. When comparing gene expression libraries, it is common to have only a single-digit number of samples measured because the SAGE protocol is labour intensive and cost prohibitive [5]. Any representation biases can later be misinterpreted as differential expression.

II. SOURCES OF ERROR

Due to the large number of manipulations necessary to arrive at the final tag sequencing stage, there are many ways in which sequence errors can be introduced. Some material can be lost during the initial mRNA purification from the cell sample, due to lack of coupling of the poly(A) tail to poly(dT)-biotin, or lack of attachment of poly(dT)-biotin to the streptavidin beads.
As long as there are excess reagents available and enough time for binding to occur, these losses will be minor because of the strong affinities between poly(A):poly(T) sequences and biotin:streptavidin. The remaining losses are assumed to affect all transcripts proportionally.

Digestion with the anchoring enzyme is nearly complete given a long incubation time. This is important because any strand that is not cut at the 3’-most anchoring enzyme recognition site will lose its defined tag position. However, strands that completely lack the recognition sequence will remain uncut and will not be included in the analysis because of their inability to ligate to a linker sequence. For this reason, restriction enzymes having 4 bp recognition sites are commonly used because they cleave every 256 bp ($4^4$) on average. Most transcripts are much longer, ensuring that almost every transcript will be included [2]. The remaining digestion and ligation steps are considered nearly complete. Also, because ditags are punctuated by the 4 bp recognition site after concatenation, any frameshift errors occurring in a single ditag will not affect surrounding ditags.

The most significant sources of representation bias, then, are attributed to SAGE tag amplification [4, 6]. These errors are inevitable because the fidelity of DNA polymerases is not perfect [3, 4, 6-8]. As a result, tag mutations can be observed as base substitutions or as insertions or deletions, which are also known as frameshift mutations. Polymerase inefficiency can also result in certain sequences being skipped or copied only partially during a PCR cycle [9]. Some errors may occur during RT-PCR conversion of the original mRNA to cDNA, but because only one cycle is necessary and no duplication occurs, these errors are dwarfed by the propensity of PCR amplification to create and propagate sequence errors. The result can be particularly drastic for transcripts having low copy numbers [4, 6].
Sequence errors not only create false tags, but they also remove true tags from the sequenced pool, thereby reducing the relative abundances of true tags [3, 4, 6].

III. STATISTICAL MODEL OF SEQUENCE ERRORS

Sequence errors occur during PCR amplification because DNA polymerases do not have perfect fidelity. The error rate per base duplication, $\lambda$, depends on the polymerase and temperatures used. Because the error rate is on a per base duplication basis, the fraction of mutant tags is then dependent on the number of bases within each ditag, $N$. Errors that are created in early PCR cycles are also propagated in subsequent cycles, so replication of existing false ditags and introduction of new false ditags depend on the cycle number, $n$. The PCR reaction can also be characterized in terms of its efficiency, $f$, which is a measure of how many sequences are successfully duplicated during a given cycle. The fraction of falsely generated tags, then, is a function of polymerase fidelity, PCR efficiency, ditag length, and cycle number.

The behaviour of any given cycle $n+1$ depends only on the outcome of cycle $n$, so no information is required about earlier cycles in order to calculate the next probabilities. Also, the complete replication of any given ditag is independent of the replication of all other ditags. These two characteristics of PCR amplification allow the process to be modeled as a Galton-Watson branching process [4, 9]. In addition, it can be assumed that the number of mutations introduced in one ditag copy follows a Poisson distribution, which has been shown to give more accurate error predictions than a Gaussian distribution [9, 10]. Under this assumption, a ditag of $N$ bases is copied without error with probability $\exp(-\lambda N)$. It has also been assumed that mutations will only occur once at a given position such that a second mutation cannot recover the original sequence. The equations that follow are based on these assumptions. The expected number of correct ditag sequences is denoted $EZ_n^{(1)}$, and the expected number of incorrect ditag sequences $EZ_n^{(2)}$.
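The assumptions above can be sketched as a small Monte Carlo simulation of the two-type branching process. This is only an illustrative sketch, not the method used to produce the results in this paper: the function name `simulate_pcr` and the parameters `z0` (number of starting ditags) and `seed` are hypothetical, and each molecule is looped over individually, so it is only practical for a modest number of cycles.

```python
import math
import random


def simulate_pcr(n_cycles, f, lam, n_bases, z0=1, seed=None):
    """Simulate one PCR run as a two-type Galton-Watson branching process.

    n_cycles: number of PCR cycles
    f:        PCR efficiency (probability a template is copied in a cycle)
    lam:      polymerase error rate per base duplication
    n_bases:  number of bases per ditag
    z0:       number of correct starting ditags (hypothetical parameter)
    Returns (correct, mutant) ditag counts after n_cycles.
    """
    rng = random.Random(seed)
    # Poisson assumption: probability that one copy contains zero errors.
    p_exact = math.exp(-lam * n_bases)
    correct, mutant = z0, 0
    for _ in range(n_cycles):
        new_correct = 0
        new_mutant = 0
        # Each correct template is copied with probability f;
        # the copy is error-free with probability p_exact.
        for _ in range(correct):
            if rng.random() < f:
                if rng.random() < p_exact:
                    new_correct += 1
                else:
                    new_mutant += 1
        # Each mutant template is copied with probability f; its copies
        # stay mutant because reversion is assumed not to occur.
        for _ in range(mutant):
            if rng.random() < f:
                new_mutant += 1
        # Templates persist, so the pool grows by the new copies.
        correct += new_correct
        mutant += new_mutant
    return correct, mutant


if __name__ == "__main__":
    c, m = simulate_pcr(n_cycles=10, f=0.88, lam=2.0e-4, n_bases=20, z0=200, seed=1)
    print(f"correct={c}, mutant={m}, mutant fraction={m / (c + m):.4f}")
```

Averaging the mutant fraction over many runs with different seeds gives an empirical estimate that can be compared against the expected values of the model; a single run can deviate noticeably because mutations arising in early cycles are amplified in all later cycles.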
At the beginning of the amplification, $EZ_0^{(1)} = 1$ and $EZ_0^{(2)} = 0$. Equation (1) describes the expected number of correctly replicated ditags at cycle $n$ [9],

$$EZ_n^{(1)} = [1 + f \exp(-\lambda N)]^n \tag{1}$$

where $f$ is the PCR efficiency, $\lambda$ the polymerase error rate per base duplication, and $N$ the number of bases per ditag. The total number of ditags at cycle $n$ expands exponentially and is a function of the PCR efficiency [4].

$$EZ_n^{(1)} + EZ_n^{(2)} = [1 + f]^n \tag{2}$$

To better understand (2), it is possible to imagine that at perfect PCR efficiency, the amplified pool doubles in size in every cycle. At null efficiency, the pool never increases in size. The limits of (2) are therefore intuitive.

$$\lim_{f \to 1} \left( EZ_n^{(1)} + EZ_n^{(2)} \right) = 2^n \tag{3}$$

$$\lim_{f \to 0} \left( EZ_n^{(1)} + EZ_n^{(2)} \right) = 1 \tag{4}$$

It is informative to estimate the fraction of tags having at least one mutation after amplification is complete. An amplified ditag sample will have a proportion of mutant tags, $R_{PCR}$, which can be obtained by rearranging (2).

$$R_{PCR} = \frac{EZ_n^{(2)}}{(1 + f)^n} = 1 - \frac{EZ_n^{(1)}}{(1 + f)^n} \tag{5}$$

It can be noted from (5) that any starting value for $EZ_0^{(1)}$ can be assumed, because it scales the numerator and denominator equally and would cancel. Substituting (1) into (5) gives an estimate of the fraction of mutant tags at any cycle $n$.

$$R_{PCR} = 1 - \frac{[1 + f \exp(-\lambda N)]^n}{(1 + f)^n} \tag{6}$$

IV. DISCUSSION OF STATISTICAL MODEL PREDICTIONS

A. Effects of PCR Cycle Number and SAGE ditag length

Using (6), the fraction of ditags expected to have at least one mutation was calculated, fixing $f = 0.88$ and $\lambda = 2.0 \times 10^{-4}$ mutations/base duplication, both of which were found experimentally for Taq polymerase at 70°C [7]. Fig. 2 shows the dependence of $R_{PCR}$ on the number of cycles run for SAGE ditags ($N = 20$) and LongSAGE ditags ($N = 34$). The relationship is nearly linear at this low range of cycle number, and $25 < n < 30$ is a typical range for SAGE experiments [4]. The model reflects the tendency for errors to accumulate from one cycle to the next. It also demonstrates the greater likelihood for point mutations to appear on a greater fraction of sequences as ditag length increases.

[Figure 2: expected fraction of ditags with mutations versus cycle number, n (0 to 30), for SAGE (20 bp ditags) and LongSAGE (34 bp ditags); f = 0.88, λ = 2.0 × 10^-4 mutations/base duplication.]

Fig. 2. The expected fraction of ditags in the amplified pool containing an error increases as more PCR cycles are run; errors accumulate and are replicated along with the remaining true ditags, in addition to the introduction of new false ditags. The expected mutant fraction is also higher for longer ditag lengths because more bases are duplicated in their synthesis.

B. Effect of PCR efficiency

The effects of varying $f$ can also be seen through (6). As shown in Fig. 3, $R_{PCR}$ is not a strong function of $f$. This is the expected result, as inefficiency implies that certain ditags will not be replicated, but not that mutation frequency will increase. However, reduced PCR efficiency introduces a different representation bias. Those transcripts that are not replicated become proportionally under-represented in the amplified ditag pool. This can be particularly problematic for genes with low expression levels in the transcriptome; they can become diluted out quickly if skipped, leading to an even greater probability of being skipped in subsequent cycles. Even for transcripts with higher initial representation, the effects can be significant when comparing libraries because the resulting bias can be misinterpreted as differential expression. It is uncommon to have enough replicates to be able to resolve these errors through averaging because the SAGE protocol is cost prohibitive.

[Figure 3: expected fraction of ditags with mutations versus PCR efficiency, f (0.8 to 1), for n = 5, 10, 15, 20, 25, and 30 cycles; N = 20 bp ditags, λ = 2.0 × 10^-4 mutations/base duplication.]

Fig. 3. The PCR efficiency does not have a strong effect on the expected fraction of mutant SAGE ditags. However, there still exists an impact on overall representation bias by dilution of tags missed in each PCR cycle.

C. Effect of Polymerase Fidelity

Polymerase fidelity has a strong influence on the fraction of mutant ditags. When the error rate increases, it follows that the mutation frequency increases, and therefore so does the proportion of error-containing ditags. The effect was calculated using (6) and is shown in Fig. 4. In order to minimize the fraction of mutant ditags it is important to minimize the polymerase error rate. The error rates of most DNA polymerases fall on the order of $10^{-6} < \lambda < 10^{-4}$ mutations/base duplication [4]. Each polymerase has an optimal temperature to ensure high fidelity, and the PCR reaction should be carried out at this temperature with sufficient amounts of each nucleotide base for accurate synthesis.

[Figure 4: expected fraction of ditags with mutations versus polymerase error rate, λ (0 to 0.0008 mutations/base duplication), for n = 5, 10, 15, 20, 25, and 30 cycles; f = 0.88, N = 20 bp ditags; the error rate of Taq polymerase at 70°C is marked.]

Fig. 4. The expected fraction of mutant SAGE ditags increases as the error rate of the DNA polymerase used increases. The effect is more pronounced as cycle number increases. Most DNA polymerases have error rates on the order of $10^{-6} < \lambda < 10^{-4}$ mutations/base duplication. The error rate of Taq polymerase at 70°C is shown; this is the optimal temperature for Taq, which is the most commonly used polymerase in the PCR amplification of SAGE ditags.

V. RECOMMENDATIONS FOR ERROR REDUCTION

Several algorithms exist to correct tags with mutation errors by looking at single tag occurrences and postulating which tags are the likely “parents” of the copies. If successful, these tags can be added back into the counts for the original tags [4, 6]. However, these algorithms are not able to process frameshift mutations and become progressively less accurate as ditags with multiple mutations arise. It follows that the best methods of error reduction involve direct reduction of sequence errors, allowing easier subsequent application of correction algorithms.

The simplest modifications include using the shortest possible ditag length without risking tag overlap between different transcripts. Another modification would be to use as few PCR cycles as possible while retaining a sufficient number of tags for cloning and sequencing. However, this implies increasing the amount of starting material, which is not practical for all applications.

Linear tag replication has the potential to partially resolve this issue. The current protocol amplifies tags exponentially, thereby propagating any errors that occur. If the original tags can be isolated from the newly synthesized tags, then they will always serve as templates in the next PCR cycle. For example, this can be done by keeping the tags attached to the streptavidin beads [3]. This is an important technique when dealing with very small starting sample sizes. When the original sample size is very limited, such as in the case of micro-dissected tissues, the number of required PCR cycles prior to sequencing precludes conventional PCR amplification. To test the validity of linear amplification, Vilain et al. split a tissue sample into two fractions, one the size of a micro-dissected sample and the other requiring no amplification [3]. It was found that SAGE analyses of the linearly amplified and unamplified fractions were comparable. However, linear amplification requires many more cycles and is therefore more time consuming. Also, the strategy of keeping tags bound to the original streptavidin beads implies a much longer sequence length and requires many more starting free nucleotide bases.

VI. CONCLUSION

SAGE is a powerful method to survey thousands of gene transcripts in parallel and generate global profiles of the transcriptome. PCR amplification of SAGE ditags is often necessary in order to carry out the protocol, but polymerase fidelity is not perfect and false tags are introduced. The rate at which these errors occur has been modeled using a Galton-Watson branching process. This model assumes that conventional PCR amplification is used and that ditags are expanded exponentially. Although replication errors will always occur during synthesis, the propagation of these errors through exponential expansion can be eliminated using linear amplification methods. This becomes important when the amount of starting material is limiting, but is not practical when sufficient starting material is available. The reduction of errors introduced during amplification makes it simpler to apply valid tag artifact correction algorithms and obtain transcript profiles that accurately reflect raw mRNA samples.

REFERENCES

[1] J. D. Pollock, “Gene expression profiling: methodological challenges, results, and prospects for addiction research,” Chemistry and Physics of Lipids, vol. 121(1-2), Dec. 2002, pp. 241-256.
[2] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. Kinzler, “Serial Analysis of Gene Expression,” Science, vol. 270(5235), Oct. 1995, pp. 484-487.
[3] C. Vilain, F. Libert, D. Venet, S. Costagliola, and G. Vassart, “Small amplified RNA-SAGE: an alternative approach to study transcriptome from limiting amount of mRNA,” Nucleic Acids Research, vol. 31(6), Mar. 2003, pp. e24.
[4] V. R. Akmaev and C. J. Wang, “Correction of sequence-based artifacts in serial analysis of gene expression,” Bioinformatics, vol. 20(8), May 2004, pp. 1254-1263.
[5] K. A. Baggerly, L. Deng, J. S. Morris, and C. M. Aldaz, “Overdispersed logistic regression for SAGE: Modelling multiple groups and covariates,” BMC Bioinformatics, vol. 5(144), Oct. 2004, pp. 144.
[6] T. Beissbarth, L. Hyde, G. K. Smyth, C. Job, W.-M. Boon, S.-S. Tan, J. S. Scott, and T. P. Speed, “Statistical modeling of sequencing errors in SAGE libraries,” Bioinformatics, vol. 20 suppl. 1, Aug. 2004, pp. i31-i39.
[7] P. Keohavong and W. G. Thilly, “Fidelity of DNA polymerases in DNA amplification,” Proc. Natl. Acad. Sci. USA, vol. 86(23), Dec. 1989, pp. 9253-9257.
[8] K. R. Tindall and T. A. Kunkel, “Fidelity of DNA Synthesis by the Thermus aquaticus DNA polymerase,” Biochemistry, vol. 27(16), Aug. 1988, pp. 6008-6013.
[9] F. Sun, “The polymerase chain reaction and branching processes,” Journal of Computational Biology, vol. 2(1), Spring 1995, pp. 63-86.
[10] L. Cai, H. Huang, S. Blackshaw, J. S. Liu, C. Cepko, and W. H. Wong, “Clustering analysis of SAGE data using a Poisson approach,” Genome Biology, vol. 5(7), June 2004, pp. R51.