Statistical analysis of errors created during amplification of transcript

advertisement
BME 1450/Winter 2005/994957169
1
Statistical analysis of errors created
during amplification of transcript tags
for serial analysis of gene expression
Karyn S. Ho

Abstract—Serial analysis of gene expression (SAGE) is a
quantitative technique for profiling gene transcripts through the
rapid generation of unique sequence tags. The resulting data
describe the size and distribution of the sample transcriptome and
allow easy comparison of gene expression under different disease
or stress conditions. However, the amount of starting material
required to sequence the transcripts often exceeds the starting
sample size. Because polymerase fidelity is not perfect, the
required polymerase chain reaction (PCR) amplification steps
introduce false tags while reducing the relative abundances of
true tags. To simulate the creation and propagation of false tags,
tag replication has been modeled as a Galton-Watson branching
process. The proportion of authentic tags remaining after
amplification is a function of polymerase fidelity, PCR efficiency,
SAGE tag length, and the number of PCR cycles run.
a SAGE tag. To further reduce the possibility of shared tags,
LongSAGE is a modified protocol that gives 17 bp tags [4].
The SAGE process was originally proposed by Velculescu
et al. [2]. The process is summarized as a flow diagram up to
the PCR amplification step in Fig. 1.
Isolate mRNA
3’ - AAAAA
3’ - AAAAA
3’ - AAAAA
RT-PCR with biotinylated
poly(dT) primer
3’ - AAAAA
* - 5’ - TTTTT
Cleave with restriction enzyme
NlaIII (recognizes 5’ – CATG –
3’ and leaves 5’ overhang )
Index Terms—Branching Process, PCR Amplification,
Sequence Errors, Serial Analysis of Gene Expression.
CATG
Divide beads in half (pool A and pool B) and
ligate to linker sequences containing Type IIS
restriction sites and separate PCR primers
W
Manuscript received November 7, 2005. Submitted in partial fulfillment
of requirements for BME 1450.
Karyn S. Ho is an M.A.Sc. candidate with the Institute of Biomaterials and
Biomedical Engineering at the University of Toronto (email:
karyn.ho@utoronto.ca).
Isolate fragments closest
to original 3’-end by
conjugation of biotin (*)
onto streptavidin beads
3’ - AAAAA
* - 5’ - TTTTT
I. INTRODUCTION
HEN subjected to different disease or stress conditions,
cells modify their gene expression patterns in order to
survive and thrive. Consequently, profiling the transcriptome
(mRNA) indicates how cells react to better their own chances
at survival, or to protect surrounding tissues. Although several
methods exist to measure gene expression, serial analysis of
gene expression (SAGE) is unique in its ability to quantify
transcript abundance for subsequent comparison [1]. Other
methods lack sensitivity or can only evaluate a limited number
of genes at once [2]. SAGE can also be used in gene
discovery because it does not require prior knowledge of the
genome in order to proceed [1, 3].
SAGE is based on the premise that a 10 base pair (bp) long
nucleotide tag contains enough information to relate it back to
its original complete mRNA strand, provided it can be taken
from a defined position [2]. Given a 10 bp tag length, there
are over 1 million possible nucleotide sequences (410), and
only around 80 000 transcripts in the entire human genome [2].
Although possible, it is unlikely that multiple genes will share
GTAC
CATG
GTAC
CATG
3’ - AAAAA
* - 5’ - TTTTT
Digest with Type IIS to release
blunt ended tag with defined
length from beads
Tag A
Primer A
Type IIS
Type IIS
Primer A or B
Ligate pool A and pool B to
form ditags and amplify by
PCR
Tag B
GTAC
CATG
CATG
GTAC
Type IIS
Primer B
Ditag
Fig. 1. Process flow diagram of protocol for conversion of raw mRNA
sample to SAGE ditags.
A few points are necessary to note in order to understand
the flow diagram: unmodified mRNA has a 3’-poly(A) tail,
which has high affinity for its complement, poly(T); the
mRNA strand can be converted to double-stranded cDNA
through reverse transcriptase-polymerase chain reaction (RTPCR); to generate SAGE tags, a series of enzymatic cleavage
and ligation steps are used; restriction enzymes can recognize
specific nucleotide sequences, then cleave leaving overhangs
(eg. NlaIII) or blunt ends, sometimes a prescribed distance
from the recognition site (eg. Type IIS); biotin and streptavidin
BME 1450/Winter 2005/994957169
have exceptionally high affinity for one another; prior to
amplification, tags ligate in pairs called ditags, which are
flanked on both ends by linkers containing primer sequences,
and when used as PCR templates will be copied from both
ends, giving rise to two copies in each cycle.
Following amplification, the ditags are digested again to
release the linker sequences and concatenated into long
strands.
The 5’-CATG-3’ sequences are retained and
punctuate the boundaries between ditags, which are then
cloned and sequenced. The tag counts are subsequently
related back to their original genes, and gene expression
patterns can then be compared between samples. When
comparing gene expression libraries, it is common to have
only a single-digit number of samples measured because the
SAGE protocol is labour intensive and cost prohibitive [5].
Any representation biases can later be misinterpreted as
differential expression.
II. SOURCES OF ERROR
Due to the large number of manipulations necessary to
arrive at the final tag sequencing stage, there are many ways in
which sequence errors can be introduced. Some material can
be lost during the initial mRNA purification from the cell
sample, due to lack of coupling of the poly(A) tail to poly(dT)biotin, or lack of attachment of poly(dT)-biotin to the
streptavidin beads. As long as there are excess reagents
available and enough time for binding to occur, these losses
will be minor because of the strong affinities between
poly(A):poly(T) sequences and biotin:streptavidin.
The
remaining losses are assumed to be proportional.
Digestion with the anchoring enzyme is nearly complete
given a long incubation time. This is important because any
strand that is not cut at the 3’-most anchoring enzyme
recognition site will lose its defined tag position. However,
strands that completely lack the recognition sequence will
remain uncut and will not be included in the analysis because
of their inability to ligate to a linker sequence. For this reason,
restriction enzymes having 4 bp recognition sites are
commonly used because they cleave every 256 bp (44) on
average. Most transcripts are much longer, ensuring that
almost every transcript will be included [2]. The remaining
digestion and ligation steps are considered nearly complete.
Also, because ditags are punctuated by the 4 bp recognition
site after concatenation, any frameshift errors occurring in a
single ditag will not affect surrounding ditags.
The most significant sources of representation bias, then, are
attributed to SAGE tag amplification [4, 6]. These errors are
inevitable because the fidelity of DNA polymerases is not
perfect [3, 4, 6-8]. As a result, tag mutations can be observed
as base substitutions or as insertions or deletions, which are
also known as frame-shift mutations. Polymerase inefficiency
can also result in certain sequences being skipped or copied
only partially during a PCR cycle [9].
Some errors may occur during RT-PCR conversion of the
original mRNA to cDNA, but because only one cycle is
necessary and no duplication occurs, these errors are dwarfed
2
by the propensity of PCR amplification to create and
propagate sequence errors. The result can be particularly
drastic for transcripts having low copy numbers [4, 6].
Sequence errors not only create false tags, but they remove
true tags from the sequenced pool, thereby reducing the
relative abundances of true tags [3, 4, 6].
III. STATISTICAL MODEL OF SEQUENCE ERRORS
Sequence errors occur during PCR amplification because
DNA polymerases do not have perfect fidelity. The error rate
per base duplication,depends on the polymerase and
temperatures used. Because the error rate is on a per base
duplication basis, the fraction of mutant tags is then dependent
on the number of bases within each ditag, N. Errors that are
created in early PCR cycles are also propagated in subsequent
cycles, so replication of existing false ditags and introduction
of new false ditags depend on the cycle number, n. The PCR
reaction can also be characterized in terms of its efficiency, f,
which is a measure of how many sequences are successfully
duplicated during a given cycle. The fraction of false
generated tags, then, is a function of polymerase fidelity, PCR
efficiency, ditag length, and cycle number.
The behaviour of any given cycle n+1 depends only on the
outcome of cycle n, so no information is required about
previous cycles in order to calculate the next probabilities.
Also, the complete replication of any given ditag is
independent of the replication of all other ditags. These two
characteristics of PCR amplification allow the process to be
modeled as a Galton-Watson branching process [4, 9]. In
addition, it can be assumed that the number of mutations is
described by a Poisson process, which has been shown to give
more accurate error predictions than a Gaussian distribution
[9, 10]. It has also been assumed that mutations will only
occur once at a given position such that a second mutation
cannot recover the original sequence. The equations that
follow are based on these assumptions.
The expected number of correct ditag sequences is denoted
EZn(1), and the number of incorrect ditag sequences EZn(2). At
the beginning of the amplification, EZ0(1) = 1 and EZ0(2) = 0.
Equation (1) describes the expected values for the expected
number of correctly replicated tags at cycle n [9].
EZ n(1)  [1  f exp(  N )] n
(1)
The total number of ditags at cycle n expands exponentially
and is a function of the PCR efficiency [4].
EZ n(1)  EZ n( 2)  [1  f ]n
(2)
To better understand (2), it is possible to imagine that at
perfect PCR efficiency, the amplified pool doubles in size in
every cycle. At null efficiency, the pool never increases in
size. The limits of (2) are therefore intuitive.
lim EZ n(1)  EZ n( 2)  2 n
(3)
lim EZ n(1)  EZ n( 2)  1
(4)
f 1
f 0
It is informative to estimate the fraction of tags having at
BME 1450/Winter 2005/994957169
3
RPCR 
EZ n( 2)
(1  f ) n
 1
EZ n(1)
(5)
(1  f ) n
It can be noted from (5) that any starting value for EZn(1) can
be assumed, because it would cancel. Substituting (1) into (5)
gives an estimate of the fraction of mutant tags at any cycle n.
RPCR  1 
[1  f exp(  N )] n
(6)
(1  f ) n
IV. DISCUSSION OF STATISTICAL MODEL PREDICTIONS
A. Effects of PCR Cycle Number and SAGE ditag length
Using (6) the fraction of ditags expected to have at least one
mutation was calculated, fixing f = 0.88 and  = 2.0×10-4
mutations/base duplification, both of which were found
experimentally for Taq polymerase at 70°C [7]. Fig. 2 shows
the dependence of RPCR on the number of cycles run for SAGE
ditags (N = 20) and LongSAGE ditags (N = 34).
The relationship is nearly linear at this low range of cycle
number, and 25<n<30 is a typical range for SAGE
experiments [4]. The model reflects the tendency for errors to
accumulate from one cycle to the next. It also demonstrates
the greater likelihood for point mutations to appear on a
greater fraction of sequences as ditag length increases.
Expected Fraction of Ditags with Mutations
0.1
 = 2.0 × 10-4 mutations/base duplication
0.07
N = 20 bp ditags
0.06
n = 30
0.05
n = 25
n = 20
0.04
n = 15
0.03
n = 10
0.02
n =5
0.01
0.8
f = 0.88
0.08
0.08
0
 = 2.0 × 10-4 mutations/base duplication
0.09
not be replicated, but not that mutation frequency will
increase.
However, reduced PCR efficiency introduces a different
representation bias. Those transcripts that are not replicated
become proportionally under-represented in the amplified
ditag pool. This can be particularly problematic for genes with
low expression levels in the transcriptome; they can become
diluted out quickly if skipped, leading to an even greater
probability of being skipped in subsequent cycles. Even for
transcripts with higher initial representation, the effects can be
significant when comparing libraries because it can be
misinterpreted as differential expression. It is uncommon to
have enough replicates to be able to resolve these errors
through averaging because the SAGE protocol is cost
prohibitive.
Expected Fraction of Ditags with Mutations
least one mutation after amplification is complete. An
amplified ditag sample will have a proportion of mutant tags,
RPCR, which can be obtained by rearranging (2).
0.85
0.9
0.95
1
PCR Efficiency, f
Fig. 3. The PCR efficiency does not have a strong effect on the expected
fraction of mutant SAGE ditags. However, there still exists an impact on
overall representation bias by dilution of tags missed in each PCR cycle.
0.07
0.06
0.05
0.04
0.03
0.02
SAGE (20 bp ditags)
0.01
LongSAGE (34 bp ditags)
0
0
5
10
15
20
25
30
Cycle Number, n
Fig. 2. The expected fraction of ditags in the amplified pool containing an
error increases as more PCR cycles are run; errors accumulate and are
replicated along with the remaining true ditags, in addition to the introduction
of new false ditags. The expected mutant fraction is also higher for longer
ditag lengths because more bases are duplicated in their synthesis.
B. Effect of PCR efficiency
The effects of varying f can also be seen through (6). As
shown in Fig. 3, RPCR is not a strong function of f. This is the
expected result, as inefficiency implies that certain ditags will
C. Effect of Polymerase Fidelity
Polymerase fidelity has a strong influence on the fraction of
mutant ditags. When the error rate increases, it follows that
the mutation frequency increases, and therefore the proportion
of error-containing ditags. The effect was calculated using (6)
and is shown in Fig. 4.
In order to minimize the fraction of mutant ditags it is
important to minimize the polymerase error rate. The error
rates of most DNA polymerases fall on the order of
10-6<<10-4 mutations/base duplication [4]. Each polymerase
has an optimal temperature to ensure high fidelity, and the
PCR reaction should be carried out at this temperature with
sufficient amounts of each nucleotide base for accurate
synthesis.
BME 1450/Winter 2005/994957169
4
It was found that SAGE analysis on the linearly amplified and
unamplified fractions were comparable. However, linear
amplification requires many more cycles and is therefore more
time consuming. Also, the strategy of keeping tags bound to
the original streptavidin beads implies a much longer sequence
length and requires many more starting free nucleotide bases.
Expected Fraction of Ditags with Mutations
0.18
f = 0.88
0.16
N = 20 bp ditags
n = 30
0.14
n = 25
0.12
Taq polymerase
at 70°C
0.1
n = 20
VI. CONCLUSION
n = 15
0.08
0.06
n = 10
0.04
n =5
0.02
0
0
0.0002
0.0004
0.0006
0.0008
Polymerase Error Rate, (mutations/base duplication)
Fig. 4. The expected fraction of mutant SAGE ditags increases as the error
rate of the DNA polymerase used increases. The effect is more pronounced as
cycle number increases. Most DNA polymerases have error rates on the order
of 10-6<<10-4 mutations/base duplication. The error rate of Taq polymerase
at 70°C is shown; this is the optimal temperature for Taq, which is the most
commonly used polymerase in the PCR amplification of SAGE ditags.
V. RECOMMENDATIONS FOR ERROR REDUCTION
Several algorithms exist to correct tags with mutation errors
by looking at single tag occurrences and postulating which tags
are the likely “parents” of the copies. If successful, these tags
can be added back into the counts for the original tags [4, 6].
However, these algorithms are not able to process frame-shift
mutations and become less and less accurate as ditags with
multiple mutations arise. It follows that the best methods of
error reduction involve direct reduction of sequence errors,
allowing easier subsequent application of correction
algorithms.
The simplest modifications include using the shortest
possible ditag length without risking tag overlap between
different transcripts. Another modification would be to use as
few PCR cycles as possible while retaining a sufficient number
of tags for cloning and sequencing. However, this implies
increasing the amount of starting material, which is not
practical for all applications.
Linear tag replication has the potential to partially resolve
this issue. The current protocol amplifies tags exponentially,
thereby propagating any errors that occur. If the original tags
can be isolated from the newly synthesized tags, then they will
always serve as templates in the next PCR cycle. For example,
this can be done by keeping the tags attached to the
streptavidin beads [3]. This is an important technique when
dealing with very small starting sample sizes. When the
original sample size is very limited, such as in the case of
micro-dissected tissues, the number of required PCR cycles
prior to sequencing precludes conventional PCR amplification.
To test the validity of linear amplification, Vilain et al. split a
tissue sample into two fractions, one the size of a microdissected sample and the other requiring no amplification [3].
SAGE is a powerful method to survey thousands of gene
transcripts in parallel and generate global profiles of the
transcriptome. PCR amplification of SAGE ditags is often
necessary in order to carry out the protocol, but polymerase
fidelity is not perfect and false tags are introduced. The rate at
which these errors occur has been modeled using a GaltonWatson branching process.
This model assumes that
conventional PCR amplification is used and that ditags are
expanded exponentially. Although replication errors will
always occur during synthesis, the propagation of these errors
through exponential expansion can be eliminated using linear
amplification methods. This becomes important when the
amount of starting material is limiting, but is not practical
when sufficient starting material is available. The reduction of
errors introduced during amplification makes it simpler to
apply valid tag artifact correction algorithms and obtain
transcript profiles that accurately reflect raw mRNA samples.
REFERENCES
J. D. Pollock, “Gene expression profiling: methodological challenges,
results, and prospects for addiction research,” Chemistry and Physics of
Lipids, vol. 121(1-2), Dec. 2002, pp. 241-256.
[2] V. E. Velculescu, L. Zhang, B. Volgelstein, and K. Kinzler, “Serial
Analysis of Gene Expression,” Science, vol. 270(5235), Oct. 1995, pp.
484-487.
[3] C. Vilain, F. Libert, D. Venet, S. Costagliola, and G. Vassart, “Small
amplified RNA-SAGE: an alternative approach to study transcriptome
from limiting amount of mRNA,” Nucleic Acids Research, vol. 31(6) ,
Mar. 2003, pp. e24.
[4] V. R. Akmaev and C. J. Wang, “Correction of sequence-based artifacts
in serial analysis of gene expression,” Bioinformatics, vol. 20(8), May
2004, pp. 1254-1263.
[5] K. A. Baggerly, L. Deng, J. S. Morris, and C. M. Aldaz, “Overdispersed
logistic regression for SAGE:
Modelling multiple groups and
covariates,” BMC Bioinformatics, vol. 5(144), Oct. 2004, pp. 144.
[6] T. Beibarth, L. Hyde, G. K. Smyth, C. Job, W.-M. Boon, S.-S. Tan, J.
S. Scott, and T. P. Speed, “Statistical modeling of sequencing errors in
SAGE libraries,” Bioinformatics, vol. 20 suppl. 1, Aug. 2004, pp. i31i39.
[7] P. Keohavong and W. G. Thilly, “Fidelity of DNA polymerases in DNA
amplification,” Proc. Natl. Acad. Sci. USA, vol. 86(23), Dec. 1989, pp.
9253-9257.
[8] K. R. Tindall and T. A. Kunkel, “Fidelity of DNA Synthesis by the
Thermus Aquaticus DNA polymerase,” Biochemistry, vol. 27(16), Aug.
1988, pp. 6008-6013.
[9] F. Sun, “The polymerase chain reaction and branching processes,”
Journal of Computational Biology, vol. 2(1), spring 1995, pp. 63-86.
[10] L. Cai, H. Huang., S. Blackshaw, J. S. Liu, C. Cepko, and W. H . Wong,
“ Clustering analysis of SAGE data using a Poisson approach,” Genome
Biology, vol. 5(7), June 2004, pp. R51.
[1]
Download