Designing Methods for Error Correction in Gene Fabrication

advertisement
Designing Methods for Error Correction in Gene Fabrication
by
Jason Park
Submitted to the Department of Mechanical Engineering
in Partial Fulfillment of the Requirements for the Degree of
Bachelor of Science
"MASSACHUSi S INSTT
OF TECHNOLOGY
at the
Massachusetts Institute of Technology
Hi
April 2005
2cot 3
2-
JUN 0 8 2005
LIBRARIES
© 2005 Massachusetts Institute of Technology
All rights reserved.
Signature of Author:
Department of Mechanical Engineering
May 6, 2005
Certified by:
N\~~~
Y
~~~Joseph
Jacobson
Associate Professor of Media Arts and Sciences and Mechanical Engineering
Thesis Supervisor
Accepted by:
I
Ernest G. Cravalho
Professor of Mechanical Engineering
Chairman, Undergraduate Thesis Committee
$.CHIVES
1
E
Designing Methods for Error Correction in Gene Fabrication
by
Jason Park
Submitted to the Department of Mechanical Engineering
on May 6, 2005 in Partial Fulfillment of the
Requirements for the Degree of Bachelor of Science in
Mechanical Engineering: Biotrack
ABSTRACT
Gene Fabrication technology involves the development and optimization of
methods relevant to the in vitro synthesis of any given target gene sequence(s) in the
absence of template. The driving purpose of this field of research is to bring about the
capability for on-demand fabrication of a DNA construct of arbitrary length and sequence
quickly, efficiently, and cost-effectively. One of the main challenges in gene fabrication
is to not only synthesize a given DNA target, but to do so without making any errors. At
high error rates, fabrication of long gene targets is expensive and impractical - in some
cases, it is impossible.
Improvements in error rates are essential for continued progress in the
development of gene fabrication technology. Error reduction technologies can be broadly
split into three categories at present: error filtration, error correction, and error
prevention. This thesis presents the past, present, and future design of a number of quick,
easy, robust, economical, and effective error reduction methods in gene fabrication.
Thesis Supervisor: Joseph Jacobson
Title: Associate Professor of Media Arts and Sciences and Mechanical Engineering
2
1. Introduction
Overview of Gene Fabrication
Gene fabrication involves the development and optimization of methods relevant
to the in vitro synthesis of any given target gene sequence(s) in the absence of template.
The driving purpose of this field of research is to bring about the capability for ondemand fabrication of a DNA construct of arbitrary length and sequence quickly,
efficiently, and cost-effectively. Current gene synthesis technology is effective for
making single genes, but can be slow and costly, especially for long sequences of DNA
or when multiple genes are desired. Through developments in gene fabrication
technology, we hope to remove these boundaries and provide an invaluable tool in
biological and biomedical research and engineering.
There are many uses for synthetic DNA, especially with the continued growth of
DNA sequence and protein structure databases available online. For example, the ability
to directly synthesize a gene and several variants can often be useful in the study of a
gene in biological research. Also, synthesizing a gene de novo eliminates the need to
obtain an organism from its natural habitat in order to study one of its genes. Gene
fabrication technology also allows for the synthesis of genes with novel functionalities
that do not even exist in nature. The ability to synthesize multiple genes at once also
gives researchers the ability to engineer and analyze complete biochemical pathways,
genetic networks, and entire genomes.
Gene ifabrication can be an essential and enabling technology for biological
engineers in various currently developing fields of research. For example, some
researchers have demonstrated the use of DNA as a structural biomaterial and scaffold
apart from its natural function as the genetic material (Chen and Seeman 1991). Others
are developing genetic circuit logic analogous to that of electronic circuits (Elowitz and
Leibler, 2001). Other exciting new fields of work involve research on minimal life" the minimum essential number of genes for an organism to be "'living" - and modifying
the genetic code to remap codons to novel amino acids (Goho, 2003, Wang et al., 2001).
Gene Fabrication Protocols
There exist a number of protocols for the in vitro synthesis of DNA in gene
fabrication. The most prominent among these - and the protocol that we use - involves
the use of the PIolymerase Chain Reaction (PCR). The components of the reaction
include: a pool of oligonucleotides that together make up the entire sequence of the target
DNA, dNTPs, thermostable DNA polymerase, PCR buffer (various components
including salts), and sometimes, oligonucleotide primers that define the ends of the target
DNA. The oligonucleotides in the starting pool are built up into longer constructs
through successive rounds of PCR. The full-length constructs are amplified
exponentially as in traditional PCR by the end oligonucleotide primers.
3
The precise method by which PCR is used for gene fabrication varies. For
example, the gene can be synthesized in either a one-step process or a two-step process.
In a two-step process, the first PCR assembles the oligonucleotide in the oligonucleotide
pool together into the full-length product at some low frequency. The second PCR
includes some of the first PCR product and adds primers that correspond to the ends of
the full-length product sequence. This serves to exponentially amplify the full-length
product sequence like a traditional PCR.
In a one-step process, both the oligonucleotide pool and a high concentration of
the end primers are included in the PCR. Thus, there is at once the linear build-up of the
full-length product from the oligonucleotide pool and the end primers as well as
exponential amplification of the full-length product once the first copy is synthesized.
Though in general the one-step process requires much more fine-tuning of parameters
than the two-step process and is difficult to do for longer gene products, it has the
advantage of being quicker and reducing the amount of necessary sample handling. For
one thing, reducing the number of sample handling steps will be increasingly important
as error correction protocols are ported from the lab bench to automatable microfluidic
devices.
A target DNA sequence for gene fabrication is parsed into a set of overlapping
oligonucleotides (40-70 bp in length) in computer software. In the simplest parsing
scheme, the target sequence can be divided up into equal chunks of n base pairs. In more
complicated algorithms, the sequence can be parsed in ways so as to optimize for certain
parameters including consistent melting temperature, no hairpin formation, no selfannealing, no primer-dimerization, codon frequency in host organisms, and more. An
example of this type of software, hosted at the NIH, is DNAWorks
(http://molbio.info.nih.gov/dnaworks/). The choice of parsing algorithm depends on the
particular needs of a project.
The parsed oligonucleotides are purchased from a commercial vendor. Choice of
vendor is important, as different vendors provide DNA with very different error rates.
For example, we have determined that Integrated DNA Technologies provides
oligonucleotides several times better in error rate than a competitor, MWG Biotech. This
is particularly important considering that we have determined in our work that the errors
in a starting oligonucleotide pool make up a high proportion of the errors present in its
final synthesized gene product (unpublished data).
Through a series of PCR cycles in which the parsed oligonucleotides extend by
essentially priming off of its complements, the final full-length DNA construct is
eventually made. Then, end primers as found in traditional PCR anneal to the ends of the
full-length sequence and exponential amplification of the DNA construct occurs.
Optimizing Gene Fabrication
4
In order to fulfill its potential as an enabling technology, gene fabrication
technology must be done cheaper, faster, better, and more robustly. Optimization of the
process through a number of vital parameters must be considered. These include
sequence design and software engineering of the parsing software, temperature profile of
the PCR cycles, automation of the process, high throughput processes, enzyme choice,
reaction conditions, and more. However, development of better techniques in error
correction is one of the most crucial elements to the progress of gene fabrication
technology.
Nature routinely performs DNA replication in the cell with high fidelity, with
error rates ranging from error per 08 base pairs synthesized to I error in 101° base pairs
synthesized. It does so by the use of various error detection and correction mechanisms.
In contrast, widespread gene fabrication technology methods found in the literature
currently demonstrates error rates on the order of in 600 base pairs. Roughly speaking,
a given error rate allows for facile synthesis and selection of a perfect copy of a gene that
is of a length that is the reciprocal of the error rate (e.g., 600 bp in length for 1 in 600).
The number of perfect DNA constructs produced in gene fabrication becomes
exponentially less with increased length of the target. At high error rates, fabrication of
long gene targets is expensive and impractical - in some cases, it is impossible.
Improvements in error rates via error correction protocols are essential for continued
progress in the development of gene fabrication technology.
Error Correction
The theory behind error correction in gene fabrication lies at the intersection of
computer science and biology: error-correcting codes. The central idea behind errorcorrecting codes is that multiple noisy or unreliable inputs can be combined to give high-
fidelity output. In "consensus voting," copies of a signal are compared against one
another and the consensus is chosen as the output. In this way, each input signal "votes"
on the output signal.
Von Neumann formulated the requirements to simulate a given number of perfect
inputs (Winograd and Cohen, 1967): "A circuit containing N (error-free) gates can be
simulated with probability of error at most Fusing N log (N/s) faulty gates, which fail
with probability p, so long as p < Pth." The key feature here is the logarithmic
relationship. In terms of gene fabrication, in order to improve the reliability of the
system by a factor of x, the number of DNA molecules required for error correction
scales by only log x.
5
Figure 1 Example of a
circuit employing error
correcting codes.
Multiple noisy signals
xi are routed to
components fi for
"consensus voting." The
result proceeds to a
second round of
'consensus voting" by
similar components di.
(Winograd and Cohen,
1967).
Y1 Y2
In either an electronic or a biological setting, one needs a means of
detecting errors in order to apply the principles of error-correcting codes. In
nature, biological systems maintain low error rates in DNA replication by
comparing newly synthesized DNA to its original template. For example,
methylation of a strand of DNA at a recognition site marks it as an original - and
presumably error-free - template. In the case of an error, a number of repair
mechanisms are set in motion. The mismatch repair system of many organisms
detects an error where a misincorporated base fails to form a Watson-Crick base
pair with the base on its complementary strand (Modrich, 1991).
The concept of an original, presumably error-free copy of DNA acting as a
template for detection of errors is key component of error correction in vivo. However,
gene fabrication, which is de novo DNA synthesis, is quite different. There is no
distinguishable original, perfect copy of DNA. All DNA molecules produced have a
discrete probability of containing fairly randomly distributed errors in their sequence.
The standard of quality also differs between gene fabrication and natural processes.
Biological systems, because of silent mutations, mutations with insignificant effect, or
compensatory mutations, are more fault-tolerant than gene fabrication processes. In
contrast, a perfect copy of DNA must be made from scratch in gene fabrication. Error
filtration, correction, and prevention methods are therefore of critical importance.
The error rate of a gene fabrication process directly affects its usability to
synthesize DNA for its various applications. What error rate is required to economically
and realistically make DNA of a given sequence length? In terms of molecular biology,
this translates to: "How many clones must be sequenced to have adequate confidence of
obtaining at least one perfect clone?" The calculation goes as follows:
For a per-base error rate of P and a DNA target of length N, the probability that a
given clone contains no errors at any position is (I -P)N . For X clones sequenced, the
6
confidence (C) of obtaining at least one perfect clone is C = -(1 -( I -p)N)x . Conversely,
for a given confidence value, the number of clones one expects to sequence is X = In( S) / ln(1-(1-P)"'). Figure 1 below indicates how much more difficult it is to synthesize, for
example, a 750(0-mer than a 750-mer. At a standard error rate of I error in 750 bp
(Stemmer et al, 1995, Withers-Martinez, et al., 1999, Hoover and Lubkowski, 2002, as
well as our own observed rates) a 750-mer can be produced with high confidence and
only a few clones sequenced. On the other hand, without a better error rate, one can see
that synthesizing a 7500-mer is unrealistic and nearly impossible. One way to get around
this problem would be two synthesize ten 750-mer pieces and clone and sequence them,
and then assemble these together into a 7500mer and clone and sequence them again. On
the other hand, an improvement in error rate to I in 7500 would allow one to generate a
DNA target about 7500 bp in length with little trouble.
-
,
100%
90%
80%
70%
o 1
60%
o
c
o
X
50%
Z
*.
X'
e
10000
-1
1 in 750
in750
1 in 1400
1 in 1400
1000 -
/
1 n 7500-f
1 in 7500
o
100
40°
a .2
30%
;
20%
n0
/
10t/i
1 r__________c
E
.
10%
0%
0
0
o
0:
lng
0
0
o
"I
0
0
o
Dl
0
0
o
0
0
0
0
o
o
targe
0
0
o
0
0
o0
a
oo
oq
0
0o
o
o
length of DNAtarget (bp)
.
o0
o
o0
o0 - o
o
o:
o:
o
o
o2
length of target DNA(bp)
length of target DNA(bp)
I
Figure 2 Gene tfabrication of long targets requires improved error rates (From Carr et al., 2004)
Error correction is also a vital aspect of gene fabrication because it cuts
down on sequencing and cloning. Cloning, preparing samples, and sequencing
take up almost two-thirds of the total process time in gene fabrication. Figure 2
shows the process of gene fabrication a time pie chart, illustrating this point. By
more accurately synthesizing gene targets, one can cut down on the number of
clones necessary to get one perfect copy, as well as avoid the need for any time-
consuming additional steps such as site-directed mutagenesis.
7
A
_
I~~~I
-
i rd
V
- ............
....
_ "_
,m
__ _ __________________
211
v
un
IFcticn
.z
Mett
Ie-a
-
1
:1
I
%i,.
. ,
~.I-'
,,
.,
: h .L,
_IS
iI
v
/--J-UL.- -L--7 k... t-clovex
Vu:6
-
- - _
-c
:Lo~
l ret.n
A:
4
I
. '
i
Figure 3 (A) Main steps performed in construction of genes in gene fabrication, using a MutS pull-down
procedure for error-filtration. The pie chart indicates the approximate amount of time consumed by each
step (in hours), with a red arrow indicating the order of operations. The most time-consuming steps in this
process are oftenoligonucleotide synthesis and DNA sequencing (including plasmid production). The 24+
and 48+ hours indicated for each of these represent lower bounds on theseprocesses, possible if performed
with immediate access to the appropriate equipment. If these steps are performed by outside providers, 3-5
days are typical of each step. (From Carr et al., 2004)
In order to employ the concept of error-correcting codes in gene
fabrication, one makes the assumption that errors are far outnumbered by correct
DNA. Therefore, with many copies of DNA, a consensus vote between the DNA
strands allows for determination of the correct sequence and allows for errors to
be found and fixed. To accomplish this, an error correction system must be able
to accomplish error detection, error removal, and error repair. The overarching
functional requirements in the development of error correction protocols are: a)
utilization of a molecule with affinity for errors, b) a method for removal of the
errors which are bound by aforementioned molecules, and c) a process that can be
cycled for improvement of error rate by multiple rounds of consensus voting. The
ideal result is a cheap, robust, and hopefully machine-automatable procedure with
short cycle time that results in large improvements of error rate.
11. Current and Future Work
Error detection using MutS
The error-binding molecule of choice in much of our current and future work in the
development of error correction protocols employs the protein MutS. As part of a
common natural mismatch repair mechanism involving MutL, MutH, and MutS, MutS is
8
a protein with affinity for binding to DNA duplexes in places where they deviate from
Waston-Crick base pairings. MutS, which has homologs in many organisms, exhibits
different sensitivity to different types of mismatches (Brown et al. 2001). The sensitivity
and specificity of the mismatch binding also varies across species.
__A_
Affinity is highest for single base deletions, which we
have determined to be the most prominent error in gene
synthesis. The different types of mismatches are: AA, CC, GG,
TT, AC, AG, TC, TG, and the bulges caused by short insertions
and deletions. E. coli MutS is known to have poor affinity for
C03
CC mismatches (Brown et al., 2001). In eukaryotes, mismatch
recognition is carried about by multiple proteins in MutS
homlogs (Eisen, 1998). MutS proteins from thermophilic
bacteria, seem to have affinity for all of the types of
mismatches (Biswas and Hsieh, 1995, Whitehouse et al. 1997).
~~Figure
4 details the mechanism of action of MutS,
l~
MutL, and MutH. As mentioned earlier, methylation of one of
Ia3
. om"*-
· * -----CH
DNAym
the strands of DNA identifies it as the original copy. In the
absence of strand methylation, both strands are cut by MutH
(Smith and Modrich, 1997). This feature has been utilized in
' the literature for the removal of errors from synthesized DNA.
#eme, The cleaved, error-containing products are separated from non-
__________error-containing
products by size separation in gel
Figure 4 MutS, MutH,
and MutL mismatch
correction mechanism in
electrophoresis.
In ce novo gene fabrication, we do not have the luxury
of knowing which DNA strands are original" and correct."
viv;o
_
Further, because
of the nature of PCR-based gene synthesis,
errors in DNA are copied into the comlementary strands of DNA as well. Thus there is
the necessity fr dissociating and reassociating DNA duplexes in order to re-assort DNA
strands and create error heteroduplexes in which errors are matched with their
corresponding correct bases. This mismatch can then be recognized by a protein such as
MutS. For lengths of DNA <10-20 kbp, this re-assortment can be accomplished by
thermal enaturation and re-annealing. For larger targets, one might utilize proteins such
as RecA to effect strand transfer between duplexes.
One of the main issues with MutS for use in error correction in gene fabrication is
nonspecific binding. Especially with longer DNA constructs, nonspecific binding may be
a problem. Depending on the application, our data shows that MutS can be used
effectively to remove errors from DNA of length about 1000 bp or less. In the synthesis
of various gene products including GFP (1000 bp) and Thermlts aquaticus MutS (2500
bp), we circumvented this issue parsed the sequences into chunks of -300 bp, errorcorrecting these pieces, and then assembling them into the full gene targets through a
PCR-based method. However, a method of gene fabrication and error correction that
does not require this additional step would be preferable.
There exist a number of other molecules used for binding to DNA mismatches,
including T7 endonuclease I (Babon et al., 2000), T4 endonuclease VII (Youil, 1996),
and Cell endonuclease (Yang et al., 2000). These may serve as ready alternatives to the
use of MutS and its homologs.
9
Assays for determination of error rate
In assessing our progress on protocols in error correction, it is beneficial to have a
cheap, quick, and easy assay that gives a crude estimate of error rate. Though there is no
ultimate replacement for the information provided by DNA sequencing, it is too costly in
terms of time and money to always sequence DNA when we want a rough estimate of
error rate for a given error correction or gene synthesis protocol. We have found two
gene targets particularly useful in our work in that they provide an easily measurable
read-out when cloned and transformed into cells: LacZ-alpha and GFP.
By cloning and transforming LacZ-alpha constructs into E. coli cells and plating
them on media with IPTG inducer, we have a blue/white screen that allows us to crudely
back out an error rate from the resulting percentage of blue colonies. Similarly for GFP,
counting fluorescent colonies of E. coli on a plate allows one to back out an error rate.
In order to back-calculate out an error rate from a blue/white LacZ-alpha or
green/white GFP screen:
(average % silent mutations)-' * (% blue/green) = (I-error rate per bp)DNAlength
1 - ((average % silent mutations)- * (% blue/green)) t /(DNAlength)= error rate per bp
Fluorescence activated cell sorting (FACS) analysis is another useful tool for
determining the number of fluorescent colonies in a population. By analyzing tens of
thousands of cells individually and assessing each for presence and intensity of
fluorescence, one can quickly get a sense of the error rate for a given GFP construct
transformed into cells.
Figure 5 Gene fabrication of GFP and
transformation of E. coli. Green colonies
contain working copies of GFP; white
colonies contain lethal error-containing GFP
genes (or no GFP)
Performing a MutS "pull-down" (as described in the next session) is another way
to quickly assess error rates. One can roughly compare error rates between various DNA
samples by binding MutS to each sample and running them through polyacrylamide gel
10
electrophoresis (PAGE). The more DNA is shifted (decreased mobility through the gel),
the higher the error rate.
Sequencing is the ultimate method for determining the error rate in a given DNA
sample. However, it is costly both in terms of time as well as money. Only when an
error correction protocol has been thoroughly optimized through the use of the assays
listed above do we use DNA sequencing to determine obtain more accurate error rate data
as well as data regarding the types of errors that survive each error correction process.
Methods in Error Correction
In seeking an ideal error correction protocol for use in gene fabrication
technology, we have explored a number of different methods to date. In designing
protocols, we keep in mind the principle design considerations described earlier in this
document. These include best improvement of error rate, robustness, ease of use, speed,
ability to automrnate,
and ability to be cycled. Additionally, "error correction" protocols
will be divided into three general categories: error filtration, error correction, and error
prevention.
Below, each protocols examined thus far in the laboratory will be described.
MttS pl/-cdotn
-
error filtration
Newly synthesized DNA product (or fragments thereof, in the case of gene targets too
large for this procedure) are melted and re-annealed to re-assort errors and create
heterodimers of errors as described earlier. The DNA is then exposed to MutS from T.
taquiticus in an appropriate stoichiometric ratio and under a certain set of reaction
conditions (temperature, time, magnesium ions, etc). DNA with errors is selectively
bound by the MVutS.The whole pool is run through a TBE non-denaturing
polyacrylamide gel. DNA is visualized by staining with SYBR Gold. A shifted band of
D)NA Dounc to MvutScan De
seen, due to the reduced
mobility of the complex
relative to unbound DNA.
Figure 6 MutS pull-down filter. Lane 1: kb ladder. Lanes 2,3,4,5:
The unbound DNA is cut out
and recovered from the gel by
the crush and soak method.
PCR is used to amplify the
DNA recovered from the gel,
which i nreFsnt in miniitf
-
-..
-_
The entire filtration
-300mer pieces of GFP (993bp), treated with MutS. Lanes 6,7,8,9:
amounts.
Same as lanes 2,3.45, except without MutS treatment. (From Carr et
al., 2004)
procedure is either repeated
transformed into E. coli, and expressed.
11
or the amplified DNA is
cloned into a vector,
Figure 5 shows the TBE non-denaturing polyacrylamide gel in the MutS pull-down
procedure. The lanes treated with MutS show how MutS decreases the mobility of a
fraction of the DNA - presumably the error-containing material.
Green Fluorescent Protein (GFP) was synthesized using the MutS pull-down error
filter. We performed extensive analysis on the synthesized DNA, using FACS analysis,
green/white colony count, and sequencing in order to quantitate errors and assess the
error rate of the MutS pull-down error filter protocol.
FACS analysis (see Figure 7) showed substantial improvement in fluorescence in
those samples that were error-filtered, both in terms of percent green cells as well as in
mean intensity of green fluorescence.
A
3A
Ijj-tc'~il
1'.
im
B
a
¢ *L
..
...............
.'J
h. .
[t
4;1
'J I
gated
F Wre
3. ~lJ }ll eci .~ .m r rein v i : r {] .~ 'c r~ee
-. 0I': .is. Flos .z~'orra:try ne l'mll: m ts.) .>
dl e xp r ~ql ne, £] P I n*m~yt,,e u ~r's. E[sIsf rnzo
ls&qi: wn In
,ercuzcrs.
Hofiizonitl axes irticat¢ 11utrcS-r*:
111te.ri- .peelti,: t) L, gers-. whle *¢mcal txes
i~.ure ' ,,. ."*u& zLse' :o lmrx:,~e :re _L.Jlit~ .* :ce. -yt.ctsa
rn I¢',rel -t'
Ir~cnc:,
'Dxlm. celv wt'Jc.comnti.5 mceestl'l,!
,^T~tcad
GI-'P n.~ car¢ e xpa:ted to cisplJ' I rrinin
ir .:i a.- rA.n .por i 'i~ 11Lo*.^r,_
cn~
st. itl;2rcn
,. EitrerncI.dha
liu~replotolr,
t .nc.a.edrcr;on
in ~e Iowrrit[e.h
I¢ ltL[.c.r:J
c,;r.m
ILzrc ucxt
~~)
'1rlv In ,trilly
IA. GCre.,er
ue2r.P
¥1r1p r
\ell
N
M1g:-i..
. Ir.
-.:
er
T-.r. I Lt Tel
r
rrlC
iFP gere
p~ Lunx
c In
,l S.>.1
r DNA I nrrSr..:
lu.r ;
CFtJ res prQ, ccd 1. c'ttT:i
en¢ .Vrtl..
s'l Itt.
ll p Ce.,lr. tor,:rn
g
ree
entr3: Em,lTr
dt icer:
..
¢',
t.rr
r-,:L
Pit
' -It'i.e ¢ .:
;:oy
:pl.'c-' G VPte'o., wJt.a .
ne
;
.4
_rrrrn.o3:
ls
I : I r'rivr
he pr.peIr
r
tlt
f¢~lz3;,:,r
popc.dan,:-erl
.lt
,ft
.'ll.t) per ,fe.tl'rln'.~t
I
'iFP,er f . IC r.e -.The vCar. B I it, tlL.rTyq:¢re
~t'.t*
~xres. ' hl.
vicld
*~
.~I ~rio'cer.
in tl~ , 'ul^3ilt'.) :" .ntliCt
,.J'. ,r. F
e "t ele-:
~
C_ iritc lto,= Ic per. l 1~L. [:.C~ irJl, 11[ n .-f'*i elTOr-1ren!ql 1[ ['~"qs
1: .rr:,r-em-l. 'ce: 'iaak
--=t& 1:.a~re
L,.:rr
ec cla r..csl:' .ntr,:ltec'
DNA ~.1Dle..2.¢uto '~ .:lr- n'pl lpuators -l'aw.,'
lrcle~l: regltl.e- .,~r.rII.
*;:1'tl,',trl !lclren
;FPDNA ~-mph/Wem
t~i :.c qppliao
r. -1'.ILTS pt t)/eir: hl-t Nnik
itlie 1:DNA err. -. ~ piee..mre
L,in r lhkSp~:~tCir: hlactCI riatrle. cr, <.*rrx'
It, Fir ~.re =. :',= tT In~:~anu~nnrnni:I~nulndn3rrnri
I-nl-1
,l
It~
hin
--.: I 1,,
Jp,:o Lrn
r.
f.r tc p.>iunn'.c A itginZ
ir. -.- .trnplifted
ty d~I~ptr,
nL.
.,re-n'R
,.rpl¢ . tr i: e 1¢.;.-A iCe': lack i'rrc.
,= VAA~pnnr
-1: pinpi'eNakdlnnnIPh,
riu. al-i
V
'acx hrn rr.alie
-~
.
t1.
,.A -,eLetXIC Icr.7 Cr.%~T,~'J~pr..:L. teymrTxll..-h)wr
in.
rreanI iir.[ F~-i['.
-l t p :
.'.Tlll
.,r.,I
-et ,: I 1..r]. . Srl.,.hr
ir l.~le -et.~:'At1~ ,JA-[' _Lle,_lec
FP-,1:
Drn.precr
h
~
Figure 7 FACS data for GFP synthesized with MutS pull-down error filter
(from Carr et al. 2004)
12
Sequencing data was used to quantitate errors and characterize the types and
locations of errors surviving the error-filtration procedure. (See Table 1 and Figure 7).
Error rate was decreased from about I in 600 to
U
UI
in 10000.
an
!i
.01
I
4
,,, 4,i
:
I
414 t *4
,
*.4
*
t
*
* 4
0
t *+
-
-
l
0.00010
c
0.0200026
0.0018
I
. :,
, "
:
7,
'.7
,.
-,
.
,
I
--:,-
-
EE
:.
0.0030
I
4 -F]-n . I I,
-
,--
4,
-
-
,4
,
-
4
-
4
-
4,
-
4,
-
4
,
:plcE:
h.
:irl:_..r.c
m'
:'/,ice
pcc
'FC.
<.r.a
r. :rr ,r',r1-:
f±.
-,siicat.
i
,-
4 P .:il.r
iturs:
'-:.rltird
pr.:
r. t-
.de
e'tl
rl
'. L~ir
L
FP [Nf
-. A
h.
*++*
.:-.F'i.
r m:r
-[n
p.
l.l :
.:
p L'
'.rrcr: :x~bcil ±e (iFP ereranrcarkir -. lcr. a
r..ca
f.,r:. tir. rLrt . , rr.r pltir.: poitiar_, t rrr pr:.ertir. Trc rrer.ri
.p
Irp:
:
r:tcI;
pr.cr,:..
,' rhipplir.J
I
,h : , ~..-rlcjuicc<
lkir.
j
,-.ix :,rrcr-
Jr.: :,ao
c'ir..
- *.± Lt>i'
ojc,: . f i
ri1 .--r.-:2.ri
l.
.aEI .el.
:r-itrc't¢.:
I N.r:cr:iJnrlr,. ,error-
Pr-htl:
..r:-
I-ct.
Figure 8 Locations of errors within the GFP DNA synthesis product - MuLtSpull-down error filter (from
CaTrret al. 2004).
.r ir GFP * r-- T.Th}.C-
irv
.,
1 ahl¢ 1.. ,rrrr'.
:
Er1'.,r
rorriotLe:
i r.r'i.
-
U r2 r- .2 -.
.
:c
rr. rQp.'2pl2,
I.-f
.
kr.
ltc
twice
.i'I,-,.l .£1i
; - '.:
Izi
,(
-l ir.
' ,Xir.-
-I? :
I.'(
I
, i
:]1
,
Ir.-e r.i.,r
Sin. ,t ir- r:i.:,r.
-( ;.:'{'
-
I
i./
,
:
-A: F*
Ml-a~pls'Ir,c ijor
5.:n- li .',r.
:i:L.
:
,!
;)
9l
I
,
)
)
!
Tr .r.i :i.r.
,l:.' ;.-rA,.
'-I
:a
I:
;{
I
I
,)
TrJr..: ri. r.
.
:
:)
7~~~~~~~~~~~~~~~~~~~~~~
[): r
:A:.I.'B
: rr. r< ce~
i:..
ir < ,et...
!.'a r'~
I
II';'
*~KIA
.14(
l
,
1
.?7
1
. 0 ,:,)!OI)
.)'.i i!X-?
-:rr.,rrt:. P.r'l:X):.:,
r .t!,.).):.l..
e*'
i:.lO:)1
..
Table 1 Summary of errors in GFP gene synthesis - MutS
pull-down error filter (from Carr et al. 2004).
13
I
MutS-His6 with resin or magnetic beads - error filtration
Similar in concept to the previous protocol, MutS is bound to error-containing
copies of newly synthesized and re-assorted DNA. However, in this protocol, a
polyhistidine tag is added to the MutS. Polyhistidine tags have affinity for certain metal
ions like Ni2+ . The tag can be used to separate the MutS-bound error-containing DNA
from unbound DNA. The main advantages of this method over the previous
polyacrylamide gel-based method are vastly shortened cycle time, ease of use, the ability
to automate, and the expectation of more precise separation between bound and unbound
DNA.
On the other hand, some of the primary challenges in developing this protocol
include determining a workable separation method based on the polyhistidine tag and
finding an optimized set of variables for the process. Some of the parameters that must
be decided include length and composition of the amino acid linker between the MutS
and the polyhistidine tag, the type of tag (if the polyhistidine tag turns out to be
inadequate), reaction buffer conditions, reaction time, and temperature.
One of the possible separation methods involves the use of a Ni-NTA-conjugated
resin and a size-based separation spin filter. The MutS-DNA complexes bind to the resin,
and then the entire mixture is passed through the filter. Free DNA, enriched in perfect
DNA, flows through the filter and is amplified via PCR. In initial tests, we performed the
separation using sets of 70-mer oligonucleotides. One set was all perfect DNA, one set
was all error-containing DNA, and another set was half of each. Unfortunately, the resinbased separation method was abandoned after initial attempts because of basic problems
getting the MutS-DNA complexes to reliably bind to the resin. Possible reasons include
steric issues with the positioning of the polyhistidine tag on the MutS and the length of its
linker, and problems with the binding capacity and effectiveness of the Ni-NTA resin.
Another separation method that was explored used magnetic agarose beads
conjugated with Ni-NTA. In this method, a magnet is used to separate the bead-MutSDNA complexes from the free DNA. This technique was also abandoned after initial
tests for the similar reasons given above for the resin filter.
MS-Fokfision
- error correction.
MzitS-Fok fitsion - error correction
Figure 9 MutS (Sixma, 2001)
14
Figure 10 FokI nuclease (Wah et al., 1997)
Another protocol involves the creation of a fusion protein with the mismatch
binding capabilities of MutS and DNA cleaving activity of the FokI nuclease. In
principle, this fusion protein would bind to errors in the DNA and cleave around them.
The cut-out error-containing DNA fragments would remain bound to the protein and be
removed from the mixture using one of several possible approaches, most of which are
covered in the two previous protocols. This is the first example of a true error correction
protocol in this document (See Figures 8 and 9). Fold is a member of the type IIS
outside cutter" class of endonucleases, which bind to a specific DNA sequence but
cleave at a remote site (Modrich, 1991). FokI is an ideal nuclease for the creation of this
fusion protein because the nuclease domain of Fold has already been used for several
fusions that incorporate a DNA-binding domain from another protein. These fusions
have the ability to cleave adjacent to whatever site its DNA-binding domain has affinity
for (Kim and Chandrasegaran, 1994, Kim et al. 1997, Kim et al. 1998, Kim et al. 1999,
Porteus and Baltimore, 2003).
Much of the challenge of this protocol is in the engineering and expression of the
fusion protein itself: The crystal structures of both Fold (Wah et al., 1997) and E. coli
MutS (Sixma, 2001) are known, and were used to design the fusion protein. It is
expected that the fusion protein would dimerize, since MutS forms a dimer around the
DNA in solution. The footprint of the fusion protein dimer would be about 40 bp,
meaning that for each error found, 40 bp of material would have to be excised and
sacrificed in order to replace the incorrect DNA with correct DNA.
An amino acid linker would connect the MutS and the nuclease domain of the
FokI protein. 'The composition and length of this linker are also important variables to
optimize. Proper kinematic constraint is necessary for the two domains of the fusion
protein to have their intended activity.
Initial attempts at expressing the designed MutS-FokI fusion protein have
displayed problems with proper folding and solubility. The protein must be obtained in
soluble form for it to be useful in error correction. There are a number of strategies for
attempting to obtain the protein in a soluble and active form. Aggregated inclusion
bodies can be denatured chemically with guanidinium chloride and/or urea and exposed
to a variety of re-folding protocols (Sambrook and Russell, 2001). The protein can be
15
fused to more soluble protein domains like thioredoxin or NusA (Harrison, 2000).
Cysteine residues present in the protein, which have no significant structural role nor
participate in disulfide bonds, may be removed by mutagenesis, as they may interfere
with proper folding.
MutS-His6 plus nucleease- error correction
In another approach, the mismatch-binding ability of the MutS and the DNA-
cleaving ability of the FokI protein above are decoupled. MutS is bound to the errors on
the DNA, and a non-specific nuclease that makes double-stranded cuts (like DNase I in
the presence of Mn2 ions) is used to cleave the DNA into fragments of 100-300 bp in
length. Those regions bound by MutS would tend not to be cleaved. The errorcontaining DNA would be separated from the "good" DNA using an affinity columnbased method or gel electrophoresis, and the good DNA would be amplified using PCR.
A good deal of effort was invested in working out the appropriate reaction
conditions for this procedure. First, working buffer conditions were determined for the
reaction. Though both MutS and DNase I are known to have activity in the presence of
Mg2+ DNase I only had the desired double-stranded cutting activity in the presence of
Mn- + . Tests were performed to ensure that MutS had sufficient activity in a buffer of
Mn> (gels not shown).
Once the buffer conditions were decided, we performed a time-course assay on
the activity of DNase I. Since DNase I is a non-specific nuclease, we wanted to
determine an appropriate concentration and reaction time for digestion with the nuclease
in order to obtain DNA fragments of optimal size for binding and removal by MutS
(-100-300 b). See figure IL.
The next step, in which we applied both
MutS and DNase I to the DNA, turned out to be
much more difficult to troubleshoot. MutS was
applied to the DNA samples, and then DNase I
was added at various different concentrations.
The resulting solution was visualized on a nondenaturing TBE polyacrylamide gel. See Figure
11.
effect on synthesized DNA. From left to
Next, the DNA was extracted from the
gel by the crush-and-soak method from those
right: kb ladder; DNA + MutS + 0, 0.002,
by the crush-and-soak method from those
0007
044 00.010,
10 0.015
015U
0 000.0067,
0.003, 0.0044,
U :gel
DNase ; DNA
+
O,
0.002,
0.003,
0.0044,
D~ae
+ I , DN
0002 0.03,0.044,regions
that we believed corresponded to "good"
DNA that had not been bound my MutS. In
0.0067, 0.010, 0.015 U DNase I.
principle, any error-free DNA that had been
bound by MutS would be found shifted up in the gel, presumably above the unbound
1000 bp DNA material. Thus, DNA was extracted from the region of about 100-300bp,
and re-amplified in a PCR.
Though this step was repeated multiple times, taking DNA from slightly different
regions of the gel and using different concentrations of DNase I, we were unable to
achieve rebuilding of the DNA fragments into the full length GFP construct. There are
many possible reasons for this. First of all, there is the question of DNase I activity of
16
double-strand vs. single-strand cutting. Though at certain concentrations of Mn + , DNase
I makes double-stranded cuts, even then it cuts single-stranded with at most a relative
frequency of about 30% (Campbell, 1980). The single-strand cuts in the DNA would not
result in fragments of the desired length that could be screened for errors by MutS.
Also, we are affected by the inability of the gel to give us specific information
about whether visualized DNA migrated a given distance because it is a shorter piece of
DNA bound by MutS or a longer piece of DNA that was not bound by MutS. Because of
this, when we extract DNA from the gel via crush-and-soak, we may not be extracting
material that is of the appropriate length to rebuild into the full length construct.
Recently, a group from the University of Wisconsin-Madison has demonstrated
the ability to improve the error rate of DNA encoding a synthetic green fluorescent
protein gene by 3.5-4.3 fold to a final error rate of about in 3500 bp (Brock et al.,
2005). Their error correction protocol, termed consensus shuffling, is very similar to the
MutS and nuclease procedure discussed in this section. Instead of a gel-, resin-, or
magnetic bead-based separation though, they utilized a MutS fusion to maltose-binding
protein to precipitate the MutS. Also, instead of a generic non-specific nuclease like
DNase , they used a number of restriction endonucleases and combined pools of DNA
each treated with a different endonuclease.
Because the consensus shuffling paper is so similar in strategy to the MutS plus
nuclease strategy of error correction described here, we have dropped further
development of this protocol.
MSPCR -- error prevention
MutS-mn-PCR (MSPCR) is a novel idea for error prevention that involves the use
of the MutS protein in the gene fabrication PCR reaction. It is unique from the
previously discussed ideas in that it attempts to address errors before they are integrated
into synthesized DNA. The PCR cycle is designed to allow MutS to bind to mismatches
and to misprining events in the PCR, both of which lead to erroneous PCR product. It
has been noted in previous work that the errors originating in the oligonucleotide pool
make up a substantial portion of the errors found in the fully synthesized gene product.
Before PCR cycling begins, MutS is incubated with the oligonucleotide pool and
all PCR reagents besides polymerase. After a 60°C, l0 minutes incubation, polymerase is
added and a modified MSPCR program begins. For the first several cycles of the PCR, a
10 minute step at 40°C precedes the regular 72°C extension step. At this temperature, the
polymerase is relatively inactive but the thermostable T. aquaticus MutS is quite active.
MutS binds mismatches and mispriming events as shown in Figures 12 and 13, giving
other "good" DNA a competitive advantage over the error-containing DNA. It has also
been noted that MSPCR appears to give cleaner" bands on the gel electrophoresis of
synthesized gene products, presumably because of decreased mispriming events (gel not
shown).
17
4k
_
_
_
_
_
_ %1k
1%
N I
r, ounler ,
I
I
I
Figure 12 One proposed mechanism of action of MutS in MSPCR - Steric blocking
of polymerase for short pieces of DNA and fbr mismatches near 3' ends (* denotes
error)
"
N
_11%,
MW
7-. , =
r 'n 4
12L_
%,L
1%
Figure 13 Another proposed mechanism of
action of MutS in MSPCR - MutS binds error;
polymerase copies everywhere but mismatch
(falls off once it reaches MutS)
Depending on whether MSPCR is used in the first step of a two-step protocol for
gene fabrication or in one-step protocol, the error prevention protocol seems to give some
small but real 1.5- to 2-fold improvement in error rate of a -400 bp target (E. coli LacZalpha). Though this is so far only a modest improvement in error rate, it comes at little
added cost in terms of time, handling, or reagents. In addition, MSPCR is an example of
error prevention as opposed to error correction, so there are no wasted reagents and no
need for an additional step of error correction.
MSPCR is more effective in the one-step gene synthesis protocol than in the two-
step process, presumably because the competitive advantage given to "good" DNA has a
greater effect when the entire gene synthesis process is carried out at once. By
optimizing certain parameters in MSPCR, we hope to improve the capability of MSPCR
to decrease the error rate of gene fabrication. As noted earlier in this document, even
small improvements in error rate can make a big difference in improving the usability and
applicability of gene fabrication technology. Some relatively simple variables to be
18
adjusted include the temperature profile of thermal cycling
in MSPCR, the stoichiometry of MutS and DNA in the
reaction, and species of MutS.
Studying the resistance of T aquaticus MutS to
thermal degradation has recently led to an insight into
improving MSPCR. As shown in Figure 14, repeated
temperature cycling from 94°C to 55°C of MutS has a
drastic effect on the ability of MutS to bind DNA. Further
studies will be undertaken to determine just. Though we
Figure 14 Effectof
expected T. quacticusto have high resistance to thermal
temperature cycling on MutS
activity. From left to right: kb
ladder, perfect DNA + MutS,
mismatch DNA+ MutS after
0,5, 10, 15, 20, 25, and 30
temperature cycles (30 sec @
94oC, 30 sec @ 55o C)
denaturation, this is apparently not the case. Given this
data, it is encouraging that MSPCR currently gives an
improvement in error rate, even with the thermal
denaturation of MutS within a few temperature cycles.
One option is to add MutS to the MSPCR as it is
running after every couple cycles - this is reminiscent of
PCR back before the use of thermophilic DNA
polymerases. Another option is to use MutS from a more hyperthermophilic organism
such as T thermophilts or P. firiosuts. We will try both of these options in an attempt to
get better performance out of MSPCR. If by optimizing parameters in MSPCR, we can
improve the error rate improvement of the process (for example, to five-fold as opposed
to two-fold), N4SPCR will become a potent and very useful low-cost method for error
prevention at the gene synthesis level.
However, in order to harness the real power of the concept behind MSPCR, it
may be necessary to not only rely on the non-covalent interaction of MutS with mismatch
DNA, but to initiate a covalent protein-DNA or protein-protein cross-linking event when
MutS binds to a target. This would provide a means of effectively irreversibly removing
error-containing DNA from the gene synthesis reaction. We are currently considering
possibilities for the appropriate chemistry to facilitate such a cross-linking event. One
possibility is the site-directed mutagenesis of the MutS protein with a cysteine residue,
whose thiol group could be used to attach a photo-activated cross-linker or some other
chemical group that could bind to DNA.
Thermophilic MutS-FokI fusion protein
The concept for the design of a thermophilic MutS-Fokl fusion protein is itself a
fusion of two ideas explained earlier in this document: the MutS-FokI error-correcting
fusion protein and MSPCR. A thermophilic MutS-FokI fusion protein, whose respective
domains could be taken from thermophilic organisms such as T. thermophilits, could
safely be thermocycled in a gene synthesis PCR. Its function would be error prevention,
like MSPCR, but the fusion protein would not only bind error-containing DNA but cleave
the good DNA around it. The bad DNA could then selectively be removed from the PCR
without wasting good material or continuing the extension at the ends of an errorcontaining piece of DNA that is bound by MutS in the middle.
19
III. Closing Statement
Many examples of past, present, and future error filtration, correction, and
prevention protocols have been presented in this document. The design of all of these
"error correction" protocols keep in mind a number of central goals: a quick, easy, robust,
economical, and effective method to reduce errors in a gene product synthesized in gene
fabrication. A number of error filtration and some early error correction methods have
been published in recent years. Though all of them have weaknesses, they are significant
early steps toward effective error reduction, a necessary part of gene fabrication if it is to
reach its potential as a revolutionary enabling technology in biological engineering and
biotechnology. Possibly one of our greatest hopes for error reduction in gene fabrication
is the creation of a so-called "golden bullet" - a reagent that could be added to a gene
synthesis PCR and enable the synthesis of genes with very low error rates at essentially
no added cost. We have hope for ideas such as the thermophilic MutS-FokI fusion
protein to be such a golden bullet for gene fabrication.
IV. References
Babon JJ, McKenzie M, Cotton RG. (2000) The use of resolvases T4
endonuclease VII and T7 endonuclease I in mutation detection. Methods
Mol Biol. 152:187-99.
Biswas, I. and Hsieh, P. (1996) Identification and Characterization of a
Thermostable MutS Homolog from Thermus aquaticus. J. Biol. Chem
271:5040-5048.
Brock F. Binkowski, Kathryn E. Richmond, James Kaysen, Michael R. Sussman,
and Peter J. Belshaw. (2005) Correcting errors in synthetic DNA through
consensus shuffling. Nucleic Acids Research, Vol. 33, No. 6, e55.
Brown, J., Brown, T, and Fox, K.R. (2001) Affinity of mismatch-binding protein
MutS for heteroduplexes containing different mismatches. Biochem. J.
354:627-633
Campbell, Virginia W. and David A. Jackson. (1980) The Effect of Divalent Cations on
the Mode of Action of DNase I. The Journal of Biological Chemistry. Vol. 255,
No. 8, April 25, pp. 3726-3735.
Carr, P.A., Park J.S., Lee Y.J., Yu T., Zhang, S., Jacobson J.M. (2004) Protein-mediated
error correction for de novo DNA synthesis. Nucleic Acids Research. Vol. 32 No.
20 e162
Chen, Junghuei and Seeman, Nadrian C. (1991) Synthesis from DNA of a molecule with
the connectivity of a cube. Nature Apr 18;350(6319):631-3.
20
Eisen, J.A. (1998) A phylogenomic study of the MutS family of proteins. Nucleic
Acids Research 26:4291-4300
Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators.
(2000) Nature 403:335-8.
Goho, A.M. (003)
Life Made to Order. Technology Review (April)
Harrison, R.G. (2000) Expression of soluble heterologous proteins via fusion with
NusA. inNovations 11:4-7.
Hoover D.M., and Lubkowski J. (2002) DNAWorks: an automated method for
designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids
Res. 30:43-
Kim Y.CG.,Chandrasegaran S. Chimeric restriction endonuclease. Proc Natl Acad
Sci U S A. 1994 Feb 1;91(3):883-7.
Modrich, P. Mechanisms and biological effects of mismatch repair. (1991) Annilt.
Rev. Genet. 25:229-53.
Porteus M.H. and Baltimore, D. (2003) Chimeric nucleases stimulate gene
targeting in human cells. Science 300:763.
Sambrook, and Russell, (2001) Molecular Cloning: A Laboratory Manual. Cold
Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
Sixma, T.K (2001) DNA mismatch repair: MutS structures bound to mismatches.
Curr. (O)p.Struct. Biol. 11:47-52
Smith J, Modrich P. (1997) Removal of polymerase-produced
mutant sequences
from PCR products. PNAS 94:6847-50.
Stemmer WP, Crameri A, Ha KD, Brennan TM, Heyneker HL. (1995) Single-step
assembly of a gene and entire plasmid from large numbers of
oligodeoxyribonucleotides.
Gene. 164:49-53.
Wah, D.A. et al. (1997) Structure of themultimodular endonuclease Fokli bound to
DNA Nature 388:97.
Wang, L., Brock, A., Herberich, B., and Schultz, P.G. (2001) Expanding the Genetic
Code of Escherichia coli. Science 292:498-500.
Whitehouse, A. et al. (1997) Analysis of the Mismatch and Insertion/Deletion
Binding Properties of Thermus thermophilus, HB8, MutS. Biochem Biophs Res.
Commun. 233:834-837
21
Winograd, S., and Cowan, J.D. (1967) Reliable computation in the presence of
noise. MIT Press, Cambridge, MA.
Withers-Martinez, C. et al. (1999) PCR-based gene synthesis as an efficient
approach for expression of the A+T-rich malaria genome. Protein Engineering
12:1113-1120.
Yang, B., et al. (2000) Purification, Cloning, and Characterization of the CEL I
Nuclease. Biochemistry 39:3533-3541
Youil, R., Kemper,R. and Cotton, R.H.G (1996) Detection of 81 of 81 Known
Mouse b-Globin Promoter Mutations with T4 Endonuclease VII-The
EMC Method. Genomics 32:431-435. Genomics 32:431-435.
22
Download