Designing Methods for Error Correction in Gene Fabrication by Jason Park Submitted to the Department of Mechanical Engineering in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science "MASSACHUSi S INSTT OF TECHNOLOGY at the Massachusetts Institute of Technology Hi April 2005 2cot 3 2- JUN 0 8 2005 LIBRARIES © 2005 Massachusetts Institute of Technology All rights reserved. Signature of Author: Department of Mechanical Engineering May 6, 2005 Certified by: N\~~~ Y ~~~Joseph Jacobson Associate Professor of Media Arts and Sciences and Mechanical Engineering Thesis Supervisor Accepted by: I Ernest G. Cravalho Professor of Mechanical Engineering Chairman, Undergraduate Thesis Committee $.CHIVES 1 E Designing Methods for Error Correction in Gene Fabrication by Jason Park Submitted to the Department of Mechanical Engineering on May 6, 2005 in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Mechanical Engineering: Biotrack ABSTRACT Gene Fabrication technology involves the development and optimization of methods relevant to the in vitro synthesis of any given target gene sequence(s) in the absence of template. The driving purpose of this field of research is to bring about the capability for on-demand fabrication of a DNA construct of arbitrary length and sequence quickly, efficiently, and cost-effectively. One of the main challenges in gene fabrication is to not only synthesize a given DNA target, but to do so without making any errors. At high error rates, fabrication of long gene targets is expensive and impractical - in some cases, it is impossible. Improvements in error rates are essential for continued progress in the development of gene fabrication technology. Error reduction technologies can be broadly split into three categories at present: error filtration, error correction, and error prevention. This thesis presents the past, present, and future design of a number of quick, easy, robust, economical, and effective error reduction methods in gene fabrication. Thesis Supervisor: Joseph Jacobson Title: Associate Professor of Media Arts and Sciences and Mechanical Engineering 2 1. Introduction Overview of Gene Fabrication Gene fabrication involves the development and optimization of methods relevant to the in vitro synthesis of any given target gene sequence(s) in the absence of template. The driving purpose of this field of research is to bring about the capability for ondemand fabrication of a DNA construct of arbitrary length and sequence quickly, efficiently, and cost-effectively. Current gene synthesis technology is effective for making single genes, but can be slow and costly, especially for long sequences of DNA or when multiple genes are desired. Through developments in gene fabrication technology, we hope to remove these boundaries and provide an invaluable tool in biological and biomedical research and engineering. There are many uses for synthetic DNA, especially with the continued growth of DNA sequence and protein structure databases available online. For example, the ability to directly synthesize a gene and several variants can often be useful in the study of a gene in biological research. Also, synthesizing a gene de novo eliminates the need to obtain an organism from its natural habitat in order to study one of its genes. Gene fabrication technology also allows for the synthesis of genes with novel functionalities that do not even exist in nature. The ability to synthesize multiple genes at once also gives researchers the ability to engineer and analyze complete biochemical pathways, genetic networks, and entire genomes. Gene ifabrication can be an essential and enabling technology for biological engineers in various currently developing fields of research. For example, some researchers have demonstrated the use of DNA as a structural biomaterial and scaffold apart from its natural function as the genetic material (Chen and Seeman 1991). Others are developing genetic circuit logic analogous to that of electronic circuits (Elowitz and Leibler, 2001). Other exciting new fields of work involve research on minimal life" the minimum essential number of genes for an organism to be "'living" - and modifying the genetic code to remap codons to novel amino acids (Goho, 2003, Wang et al., 2001). Gene Fabrication Protocols There exist a number of protocols for the in vitro synthesis of DNA in gene fabrication. The most prominent among these - and the protocol that we use - involves the use of the PIolymerase Chain Reaction (PCR). The components of the reaction include: a pool of oligonucleotides that together make up the entire sequence of the target DNA, dNTPs, thermostable DNA polymerase, PCR buffer (various components including salts), and sometimes, oligonucleotide primers that define the ends of the target DNA. The oligonucleotides in the starting pool are built up into longer constructs through successive rounds of PCR. The full-length constructs are amplified exponentially as in traditional PCR by the end oligonucleotide primers. 3 The precise method by which PCR is used for gene fabrication varies. For example, the gene can be synthesized in either a one-step process or a two-step process. In a two-step process, the first PCR assembles the oligonucleotide in the oligonucleotide pool together into the full-length product at some low frequency. The second PCR includes some of the first PCR product and adds primers that correspond to the ends of the full-length product sequence. This serves to exponentially amplify the full-length product sequence like a traditional PCR. In a one-step process, both the oligonucleotide pool and a high concentration of the end primers are included in the PCR. Thus, there is at once the linear build-up of the full-length product from the oligonucleotide pool and the end primers as well as exponential amplification of the full-length product once the first copy is synthesized. Though in general the one-step process requires much more fine-tuning of parameters than the two-step process and is difficult to do for longer gene products, it has the advantage of being quicker and reducing the amount of necessary sample handling. For one thing, reducing the number of sample handling steps will be increasingly important as error correction protocols are ported from the lab bench to automatable microfluidic devices. A target DNA sequence for gene fabrication is parsed into a set of overlapping oligonucleotides (40-70 bp in length) in computer software. In the simplest parsing scheme, the target sequence can be divided up into equal chunks of n base pairs. In more complicated algorithms, the sequence can be parsed in ways so as to optimize for certain parameters including consistent melting temperature, no hairpin formation, no selfannealing, no primer-dimerization, codon frequency in host organisms, and more. An example of this type of software, hosted at the NIH, is DNAWorks (http://molbio.info.nih.gov/dnaworks/). The choice of parsing algorithm depends on the particular needs of a project. The parsed oligonucleotides are purchased from a commercial vendor. Choice of vendor is important, as different vendors provide DNA with very different error rates. For example, we have determined that Integrated DNA Technologies provides oligonucleotides several times better in error rate than a competitor, MWG Biotech. This is particularly important considering that we have determined in our work that the errors in a starting oligonucleotide pool make up a high proportion of the errors present in its final synthesized gene product (unpublished data). Through a series of PCR cycles in which the parsed oligonucleotides extend by essentially priming off of its complements, the final full-length DNA construct is eventually made. Then, end primers as found in traditional PCR anneal to the ends of the full-length sequence and exponential amplification of the DNA construct occurs. Optimizing Gene Fabrication 4 In order to fulfill its potential as an enabling technology, gene fabrication technology must be done cheaper, faster, better, and more robustly. Optimization of the process through a number of vital parameters must be considered. These include sequence design and software engineering of the parsing software, temperature profile of the PCR cycles, automation of the process, high throughput processes, enzyme choice, reaction conditions, and more. However, development of better techniques in error correction is one of the most crucial elements to the progress of gene fabrication technology. Nature routinely performs DNA replication in the cell with high fidelity, with error rates ranging from error per 08 base pairs synthesized to I error in 101° base pairs synthesized. It does so by the use of various error detection and correction mechanisms. In contrast, widespread gene fabrication technology methods found in the literature currently demonstrates error rates on the order of in 600 base pairs. Roughly speaking, a given error rate allows for facile synthesis and selection of a perfect copy of a gene that is of a length that is the reciprocal of the error rate (e.g., 600 bp in length for 1 in 600). The number of perfect DNA constructs produced in gene fabrication becomes exponentially less with increased length of the target. At high error rates, fabrication of long gene targets is expensive and impractical - in some cases, it is impossible. Improvements in error rates via error correction protocols are essential for continued progress in the development of gene fabrication technology. Error Correction The theory behind error correction in gene fabrication lies at the intersection of computer science and biology: error-correcting codes. The central idea behind errorcorrecting codes is that multiple noisy or unreliable inputs can be combined to give high- fidelity output. In "consensus voting," copies of a signal are compared against one another and the consensus is chosen as the output. In this way, each input signal "votes" on the output signal. Von Neumann formulated the requirements to simulate a given number of perfect inputs (Winograd and Cohen, 1967): "A circuit containing N (error-free) gates can be simulated with probability of error at most Fusing N log (N/s) faulty gates, which fail with probability p, so long as p < Pth." The key feature here is the logarithmic relationship. In terms of gene fabrication, in order to improve the reliability of the system by a factor of x, the number of DNA molecules required for error correction scales by only log x. 5 Figure 1 Example of a circuit employing error correcting codes. Multiple noisy signals xi are routed to components fi for "consensus voting." The result proceeds to a second round of 'consensus voting" by similar components di. (Winograd and Cohen, 1967). Y1 Y2 In either an electronic or a biological setting, one needs a means of detecting errors in order to apply the principles of error-correcting codes. In nature, biological systems maintain low error rates in DNA replication by comparing newly synthesized DNA to its original template. For example, methylation of a strand of DNA at a recognition site marks it as an original - and presumably error-free - template. In the case of an error, a number of repair mechanisms are set in motion. The mismatch repair system of many organisms detects an error where a misincorporated base fails to form a Watson-Crick base pair with the base on its complementary strand (Modrich, 1991). The concept of an original, presumably error-free copy of DNA acting as a template for detection of errors is key component of error correction in vivo. However, gene fabrication, which is de novo DNA synthesis, is quite different. There is no distinguishable original, perfect copy of DNA. All DNA molecules produced have a discrete probability of containing fairly randomly distributed errors in their sequence. The standard of quality also differs between gene fabrication and natural processes. Biological systems, because of silent mutations, mutations with insignificant effect, or compensatory mutations, are more fault-tolerant than gene fabrication processes. In contrast, a perfect copy of DNA must be made from scratch in gene fabrication. Error filtration, correction, and prevention methods are therefore of critical importance. The error rate of a gene fabrication process directly affects its usability to synthesize DNA for its various applications. What error rate is required to economically and realistically make DNA of a given sequence length? In terms of molecular biology, this translates to: "How many clones must be sequenced to have adequate confidence of obtaining at least one perfect clone?" The calculation goes as follows: For a per-base error rate of P and a DNA target of length N, the probability that a given clone contains no errors at any position is (I -P)N . For X clones sequenced, the 6 confidence (C) of obtaining at least one perfect clone is C = -(1 -( I -p)N)x . Conversely, for a given confidence value, the number of clones one expects to sequence is X = In( S) / ln(1-(1-P)"'). Figure 1 below indicates how much more difficult it is to synthesize, for example, a 750(0-mer than a 750-mer. At a standard error rate of I error in 750 bp (Stemmer et al, 1995, Withers-Martinez, et al., 1999, Hoover and Lubkowski, 2002, as well as our own observed rates) a 750-mer can be produced with high confidence and only a few clones sequenced. On the other hand, without a better error rate, one can see that synthesizing a 7500-mer is unrealistic and nearly impossible. One way to get around this problem would be two synthesize ten 750-mer pieces and clone and sequence them, and then assemble these together into a 7500mer and clone and sequence them again. On the other hand, an improvement in error rate to I in 7500 would allow one to generate a DNA target about 7500 bp in length with little trouble. - , 100% 90% 80% 70% o 1 60% o c o X 50% Z *. X' e 10000 -1 1 in 750 in750 1 in 1400 1 in 1400 1000 - / 1 n 7500-f 1 in 7500 o 100 40° a .2 30% ; 20% n0 / 10t/i 1 r__________c E . 10% 0% 0 0 o 0: lng 0 0 o "I 0 0 o Dl 0 0 o 0 0 0 0 o o targe 0 0 o 0 0 o0 a oo oq 0 0o o o length of DNAtarget (bp) . o0 o o0 o0 - o o o: o: o o o2 length of target DNA(bp) length of target DNA(bp) I Figure 2 Gene tfabrication of long targets requires improved error rates (From Carr et al., 2004) Error correction is also a vital aspect of gene fabrication because it cuts down on sequencing and cloning. Cloning, preparing samples, and sequencing take up almost two-thirds of the total process time in gene fabrication. Figure 2 shows the process of gene fabrication a time pie chart, illustrating this point. By more accurately synthesizing gene targets, one can cut down on the number of clones necessary to get one perfect copy, as well as avoid the need for any time- consuming additional steps such as site-directed mutagenesis. 7 A _ I~~~I - i rd V - ............ .... _ "_ ,m __ _ __________________ 211 v un IFcticn .z Mett Ie-a - 1 :1 I %i,. . , ~.I-' ,, ., : h .L, _IS iI v /--J-UL.- -L--7 k... t-clovex Vu:6 - - - _ -c :Lo~ l ret.n A: 4 I . ' i Figure 3 (A) Main steps performed in construction of genes in gene fabrication, using a MutS pull-down procedure for error-filtration. The pie chart indicates the approximate amount of time consumed by each step (in hours), with a red arrow indicating the order of operations. The most time-consuming steps in this process are oftenoligonucleotide synthesis and DNA sequencing (including plasmid production). The 24+ and 48+ hours indicated for each of these represent lower bounds on theseprocesses, possible if performed with immediate access to the appropriate equipment. If these steps are performed by outside providers, 3-5 days are typical of each step. (From Carr et al., 2004) In order to employ the concept of error-correcting codes in gene fabrication, one makes the assumption that errors are far outnumbered by correct DNA. Therefore, with many copies of DNA, a consensus vote between the DNA strands allows for determination of the correct sequence and allows for errors to be found and fixed. To accomplish this, an error correction system must be able to accomplish error detection, error removal, and error repair. The overarching functional requirements in the development of error correction protocols are: a) utilization of a molecule with affinity for errors, b) a method for removal of the errors which are bound by aforementioned molecules, and c) a process that can be cycled for improvement of error rate by multiple rounds of consensus voting. The ideal result is a cheap, robust, and hopefully machine-automatable procedure with short cycle time that results in large improvements of error rate. 11. Current and Future Work Error detection using MutS The error-binding molecule of choice in much of our current and future work in the development of error correction protocols employs the protein MutS. As part of a common natural mismatch repair mechanism involving MutL, MutH, and MutS, MutS is 8 a protein with affinity for binding to DNA duplexes in places where they deviate from Waston-Crick base pairings. MutS, which has homologs in many organisms, exhibits different sensitivity to different types of mismatches (Brown et al. 2001). The sensitivity and specificity of the mismatch binding also varies across species. __A_ Affinity is highest for single base deletions, which we have determined to be the most prominent error in gene synthesis. The different types of mismatches are: AA, CC, GG, TT, AC, AG, TC, TG, and the bulges caused by short insertions and deletions. E. coli MutS is known to have poor affinity for C03 CC mismatches (Brown et al., 2001). In eukaryotes, mismatch recognition is carried about by multiple proteins in MutS homlogs (Eisen, 1998). MutS proteins from thermophilic bacteria, seem to have affinity for all of the types of mismatches (Biswas and Hsieh, 1995, Whitehouse et al. 1997). ~~Figure 4 details the mechanism of action of MutS, l~ MutL, and MutH. As mentioned earlier, methylation of one of Ia3 . om"*- · * -----CH DNAym the strands of DNA identifies it as the original copy. In the absence of strand methylation, both strands are cut by MutH (Smith and Modrich, 1997). This feature has been utilized in ' the literature for the removal of errors from synthesized DNA. #eme, The cleaved, error-containing products are separated from non- __________error-containing products by size separation in gel Figure 4 MutS, MutH, and MutL mismatch correction mechanism in electrophoresis. In ce novo gene fabrication, we do not have the luxury of knowing which DNA strands are original" and correct." viv;o _ Further, because of the nature of PCR-based gene synthesis, errors in DNA are copied into the comlementary strands of DNA as well. Thus there is the necessity fr dissociating and reassociating DNA duplexes in order to re-assort DNA strands and create error heteroduplexes in which errors are matched with their corresponding correct bases. This mismatch can then be recognized by a protein such as MutS. For lengths of DNA <10-20 kbp, this re-assortment can be accomplished by thermal enaturation and re-annealing. For larger targets, one might utilize proteins such as RecA to effect strand transfer between duplexes. One of the main issues with MutS for use in error correction in gene fabrication is nonspecific binding. Especially with longer DNA constructs, nonspecific binding may be a problem. Depending on the application, our data shows that MutS can be used effectively to remove errors from DNA of length about 1000 bp or less. In the synthesis of various gene products including GFP (1000 bp) and Thermlts aquaticus MutS (2500 bp), we circumvented this issue parsed the sequences into chunks of -300 bp, errorcorrecting these pieces, and then assembling them into the full gene targets through a PCR-based method. However, a method of gene fabrication and error correction that does not require this additional step would be preferable. There exist a number of other molecules used for binding to DNA mismatches, including T7 endonuclease I (Babon et al., 2000), T4 endonuclease VII (Youil, 1996), and Cell endonuclease (Yang et al., 2000). These may serve as ready alternatives to the use of MutS and its homologs. 9 Assays for determination of error rate In assessing our progress on protocols in error correction, it is beneficial to have a cheap, quick, and easy assay that gives a crude estimate of error rate. Though there is no ultimate replacement for the information provided by DNA sequencing, it is too costly in terms of time and money to always sequence DNA when we want a rough estimate of error rate for a given error correction or gene synthesis protocol. We have found two gene targets particularly useful in our work in that they provide an easily measurable read-out when cloned and transformed into cells: LacZ-alpha and GFP. By cloning and transforming LacZ-alpha constructs into E. coli cells and plating them on media with IPTG inducer, we have a blue/white screen that allows us to crudely back out an error rate from the resulting percentage of blue colonies. Similarly for GFP, counting fluorescent colonies of E. coli on a plate allows one to back out an error rate. In order to back-calculate out an error rate from a blue/white LacZ-alpha or green/white GFP screen: (average % silent mutations)-' * (% blue/green) = (I-error rate per bp)DNAlength 1 - ((average % silent mutations)- * (% blue/green)) t /(DNAlength)= error rate per bp Fluorescence activated cell sorting (FACS) analysis is another useful tool for determining the number of fluorescent colonies in a population. By analyzing tens of thousands of cells individually and assessing each for presence and intensity of fluorescence, one can quickly get a sense of the error rate for a given GFP construct transformed into cells. Figure 5 Gene fabrication of GFP and transformation of E. coli. Green colonies contain working copies of GFP; white colonies contain lethal error-containing GFP genes (or no GFP) Performing a MutS "pull-down" (as described in the next session) is another way to quickly assess error rates. One can roughly compare error rates between various DNA samples by binding MutS to each sample and running them through polyacrylamide gel 10 electrophoresis (PAGE). The more DNA is shifted (decreased mobility through the gel), the higher the error rate. Sequencing is the ultimate method for determining the error rate in a given DNA sample. However, it is costly both in terms of time as well as money. Only when an error correction protocol has been thoroughly optimized through the use of the assays listed above do we use DNA sequencing to determine obtain more accurate error rate data as well as data regarding the types of errors that survive each error correction process. Methods in Error Correction In seeking an ideal error correction protocol for use in gene fabrication technology, we have explored a number of different methods to date. In designing protocols, we keep in mind the principle design considerations described earlier in this document. These include best improvement of error rate, robustness, ease of use, speed, ability to automrnate, and ability to be cycled. Additionally, "error correction" protocols will be divided into three general categories: error filtration, error correction, and error prevention. Below, each protocols examined thus far in the laboratory will be described. MttS pl/-cdotn - error filtration Newly synthesized DNA product (or fragments thereof, in the case of gene targets too large for this procedure) are melted and re-annealed to re-assort errors and create heterodimers of errors as described earlier. The DNA is then exposed to MutS from T. taquiticus in an appropriate stoichiometric ratio and under a certain set of reaction conditions (temperature, time, magnesium ions, etc). DNA with errors is selectively bound by the MVutS.The whole pool is run through a TBE non-denaturing polyacrylamide gel. DNA is visualized by staining with SYBR Gold. A shifted band of D)NA Dounc to MvutScan De seen, due to the reduced mobility of the complex relative to unbound DNA. Figure 6 MutS pull-down filter. Lane 1: kb ladder. Lanes 2,3,4,5: The unbound DNA is cut out and recovered from the gel by the crush and soak method. PCR is used to amplify the DNA recovered from the gel, which i nreFsnt in miniitf - -.. -_ The entire filtration -300mer pieces of GFP (993bp), treated with MutS. Lanes 6,7,8,9: amounts. Same as lanes 2,3.45, except without MutS treatment. (From Carr et al., 2004) procedure is either repeated transformed into E. coli, and expressed. 11 or the amplified DNA is cloned into a vector, Figure 5 shows the TBE non-denaturing polyacrylamide gel in the MutS pull-down procedure. The lanes treated with MutS show how MutS decreases the mobility of a fraction of the DNA - presumably the error-containing material. Green Fluorescent Protein (GFP) was synthesized using the MutS pull-down error filter. We performed extensive analysis on the synthesized DNA, using FACS analysis, green/white colony count, and sequencing in order to quantitate errors and assess the error rate of the MutS pull-down error filter protocol. FACS analysis (see Figure 7) showed substantial improvement in fluorescence in those samples that were error-filtered, both in terms of percent green cells as well as in mean intensity of green fluorescence. A 3A Ijj-tc'~il 1'. im B a ¢ *L .. ............... .'J h. . [t 4;1 'J I gated F Wre 3. ~lJ }ll eci .~ .m r rein v i : r {] .~ 'c r~ee -. 0I': .is. Flos .z~'orra:try ne l'mll: m ts.) .> dl e xp r ~ql ne, £] P I n*m~yt,,e u ~r's. E[sIsf rnzo ls&qi: wn In ,ercuzcrs. Hofiizonitl axes irticat¢ 11utrcS-r*: 111te.ri- .peelti,: t) L, gers-. whle *¢mcal txes i~.ure ' ,,. ."*u& zLse' :o lmrx:,~e :re _L.Jlit~ .* :ce. -yt.ctsa rn I¢',rel -t' Ir~cnc:, 'Dxlm. celv wt'Jc.comnti.5 mceestl'l,! ,^T~tcad GI-'P n.~ car¢ e xpa:ted to cisplJ' I rrinin ir .:i a.- rA.n .por i 'i~ 11Lo*.^r,_ cn~ st. itl;2rcn ,. EitrerncI.dha liu~replotolr, t .nc.a.edrcr;on in ~e Iowrrit[e.h I¢ ltL[.c.r:J c,;r.m ILzrc ucxt ~~) '1rlv In ,trilly IA. GCre.,er ue2r.P ¥1r1p r \ell N M1g:-i.. . Ir. -.: er T-.r. I Lt Tel r rrlC iFP gere p~ Lunx c In ,l S.>.1 r DNA I nrrSr..: lu.r ; CFtJ res prQ, ccd 1. c'ttT:i en¢ .Vrtl.. s'l Itt. ll p Ce.,lr. tor,:rn g ree entr3: Em,lTr dt icer: .. ¢', t.rr r-,:L Pit ' -It'i.e ¢ .: ;:oy :pl.'c-' G VPte'o., wJt.a . ne ; .4 _rrrrn.o3: ls I : I r'rivr he pr.peIr r tlt f¢~lz3;,:,r popc.dan,:-erl .lt ,ft .'ll.t) per ,fe.tl'rln'.~t I 'iFP,er f . IC r.e -.The vCar. B I it, tlL.rTyq:¢re ~t'.t* ~xres. ' hl. vicld *~ .~I ~rio'cer. in tl~ , 'ul^3ilt'.) :" .ntliCt ,.J'. ,r. F e "t ele-: ~ C_ iritc lto,= Ic per. l 1~L. [:.C~ irJl, 11[ n .-f'*i elTOr-1ren!ql 1[ ['~"qs 1: .rr:,r-em-l. 'ce: 'iaak --=t& 1:.a~re L,.:rr ec cla r..csl:' .ntr,:ltec' DNA ~.1Dle..2.¢uto '~ .:lr- n'pl lpuators -l'aw.,' lrcle~l: regltl.e- .,~r.rII. *;:1'tl,',trl !lclren ;FPDNA ~-mph/Wem t~i :.c qppliao r. -1'.ILTS pt t)/eir: hl-t Nnik itlie 1:DNA err. -. ~ piee..mre L,in r lhkSp~:~tCir: hlactCI riatrle. cr, <.*rrx' It, Fir ~.re =. :',= tT In~:~anu~nnrnni:I~nulndn3rrnri I-nl-1 ,l It~ hin --.: I 1,, Jp,:o Lrn r. f.r tc p.>iunn'.c A itginZ ir. -.- .trnplifted ty d~I~ptr, nL. .,re-n'R ,.rpl¢ . tr i: e 1¢.;.-A iCe': lack i'rrc. ,= VAA~pnnr -1: pinpi'eNakdlnnnIPh, riu. al-i V 'acx hrn rr.alie -~ . t1. ,.A -,eLetXIC Icr.7 Cr.%~T,~'J~pr..:L. teymrTxll..-h)wr in. rreanI iir.[ F~-i['. -l t p : .'.Tlll .,r.,I -et ,: I 1..r]. . Srl.,.hr ir l.~le -et.~:'At1~ ,JA-[' _Lle,_lec FP-,1: Drn.precr h ~ Figure 7 FACS data for GFP synthesized with MutS pull-down error filter (from Carr et al. 2004) 12 Sequencing data was used to quantitate errors and characterize the types and locations of errors surviving the error-filtration procedure. (See Table 1 and Figure 7). Error rate was decreased from about I in 600 to U UI in 10000. an !i .01 I 4 ,,, 4,i : I 414 t *4 , *.4 * t * * 4 0 t *+ - - l 0.00010 c 0.0200026 0.0018 I . :, , " : 7, '.7 ,. -, . , I --:,- - EE :. 0.0030 I 4 -F]-n . I I, - ,-- 4, - - ,4 , - 4 - 4 - 4, - 4, - 4 , :plcE: h. :irl:_..r.c m' :'/,ice pcc 'FC. <.r.a r. :rr ,r',r1-: f±. -,siicat. i ,- 4 P .:il.r iturs: '-:.rltird pr.: r. t- .de e'tl rl '. L~ir L FP [Nf -. A h. *++* .:-.F'i. r m:r -[n p. l.l : .: p L' '.rrcr: :x~bcil ±e (iFP ereranrcarkir -. lcr. a r..ca f.,r:. tir. rLrt . , rr.r pltir.: poitiar_, t rrr pr:.ertir. Trc rrer.ri .p Irp: : r:tcI; pr.cr,:.. ,' rhipplir.J I ,h : , ~..-rlcjuicc< lkir. j ,-.ix :,rrcr- Jr.: :,ao c'ir.. - *.± Lt>i' ojc,: . f i ri1 .--r.-:2.ri l. .aEI .el. :r-itrc't¢.: I N.r:cr:iJnrlr,. ,error- Pr-htl: ..r:- I-ct. Figure 8 Locations of errors within the GFP DNA synthesis product - MuLtSpull-down error filter (from CaTrret al. 2004). .r ir GFP * r-- T.Th}.C- irv ., 1 ahl¢ 1.. ,rrrr'. : Er1'.,r rorriotLe: i r.r'i. - U r2 r- .2 -. . :c rr. rQp.'2pl2, I.-f . kr. ltc twice .i'I,-,.l .£1i ; - '.: Izi ,( -l ir. ' ,Xir.- -I? : I.'( I , i :]1 , Ir.-e r.i.,r Sin. ,t ir- r:i.:,r. -( ;.:'{' - I i./ , : -A: F* Ml-a~pls'Ir,c ijor 5.:n- li .',r. :i:L. : ,! ;) 9l I , ) ) ! Tr .r.i :i.r. ,l:.' ;.-rA,. '-I :a I: ;{ I I ,) TrJr..: ri. r. . : :) 7~~~~~~~~~~~~~~~~~~~~~~ [): r :A:.I.'B : rr. r< ce~ i:.. ir < ,et... !.'a r'~ I II';' *~KIA .14( l , 1 .?7 1 . 0 ,:,)!OI) .)'.i i!X-? -:rr.,rrt:. P.r'l:X):.:, r .t!,.).):.l.. e*' i:.lO:)1 .. Table 1 Summary of errors in GFP gene synthesis - MutS pull-down error filter (from Carr et al. 2004). 13 I MutS-His6 with resin or magnetic beads - error filtration Similar in concept to the previous protocol, MutS is bound to error-containing copies of newly synthesized and re-assorted DNA. However, in this protocol, a polyhistidine tag is added to the MutS. Polyhistidine tags have affinity for certain metal ions like Ni2+ . The tag can be used to separate the MutS-bound error-containing DNA from unbound DNA. The main advantages of this method over the previous polyacrylamide gel-based method are vastly shortened cycle time, ease of use, the ability to automate, and the expectation of more precise separation between bound and unbound DNA. On the other hand, some of the primary challenges in developing this protocol include determining a workable separation method based on the polyhistidine tag and finding an optimized set of variables for the process. Some of the parameters that must be decided include length and composition of the amino acid linker between the MutS and the polyhistidine tag, the type of tag (if the polyhistidine tag turns out to be inadequate), reaction buffer conditions, reaction time, and temperature. One of the possible separation methods involves the use of a Ni-NTA-conjugated resin and a size-based separation spin filter. The MutS-DNA complexes bind to the resin, and then the entire mixture is passed through the filter. Free DNA, enriched in perfect DNA, flows through the filter and is amplified via PCR. In initial tests, we performed the separation using sets of 70-mer oligonucleotides. One set was all perfect DNA, one set was all error-containing DNA, and another set was half of each. Unfortunately, the resinbased separation method was abandoned after initial attempts because of basic problems getting the MutS-DNA complexes to reliably bind to the resin. Possible reasons include steric issues with the positioning of the polyhistidine tag on the MutS and the length of its linker, and problems with the binding capacity and effectiveness of the Ni-NTA resin. Another separation method that was explored used magnetic agarose beads conjugated with Ni-NTA. In this method, a magnet is used to separate the bead-MutSDNA complexes from the free DNA. This technique was also abandoned after initial tests for the similar reasons given above for the resin filter. MS-Fokfision - error correction. MzitS-Fok fitsion - error correction Figure 9 MutS (Sixma, 2001) 14 Figure 10 FokI nuclease (Wah et al., 1997) Another protocol involves the creation of a fusion protein with the mismatch binding capabilities of MutS and DNA cleaving activity of the FokI nuclease. In principle, this fusion protein would bind to errors in the DNA and cleave around them. The cut-out error-containing DNA fragments would remain bound to the protein and be removed from the mixture using one of several possible approaches, most of which are covered in the two previous protocols. This is the first example of a true error correction protocol in this document (See Figures 8 and 9). Fold is a member of the type IIS outside cutter" class of endonucleases, which bind to a specific DNA sequence but cleave at a remote site (Modrich, 1991). FokI is an ideal nuclease for the creation of this fusion protein because the nuclease domain of Fold has already been used for several fusions that incorporate a DNA-binding domain from another protein. These fusions have the ability to cleave adjacent to whatever site its DNA-binding domain has affinity for (Kim and Chandrasegaran, 1994, Kim et al. 1997, Kim et al. 1998, Kim et al. 1999, Porteus and Baltimore, 2003). Much of the challenge of this protocol is in the engineering and expression of the fusion protein itself: The crystal structures of both Fold (Wah et al., 1997) and E. coli MutS (Sixma, 2001) are known, and were used to design the fusion protein. It is expected that the fusion protein would dimerize, since MutS forms a dimer around the DNA in solution. The footprint of the fusion protein dimer would be about 40 bp, meaning that for each error found, 40 bp of material would have to be excised and sacrificed in order to replace the incorrect DNA with correct DNA. An amino acid linker would connect the MutS and the nuclease domain of the FokI protein. 'The composition and length of this linker are also important variables to optimize. Proper kinematic constraint is necessary for the two domains of the fusion protein to have their intended activity. Initial attempts at expressing the designed MutS-FokI fusion protein have displayed problems with proper folding and solubility. The protein must be obtained in soluble form for it to be useful in error correction. There are a number of strategies for attempting to obtain the protein in a soluble and active form. Aggregated inclusion bodies can be denatured chemically with guanidinium chloride and/or urea and exposed to a variety of re-folding protocols (Sambrook and Russell, 2001). The protein can be 15 fused to more soluble protein domains like thioredoxin or NusA (Harrison, 2000). Cysteine residues present in the protein, which have no significant structural role nor participate in disulfide bonds, may be removed by mutagenesis, as they may interfere with proper folding. MutS-His6 plus nucleease- error correction In another approach, the mismatch-binding ability of the MutS and the DNA- cleaving ability of the FokI protein above are decoupled. MutS is bound to the errors on the DNA, and a non-specific nuclease that makes double-stranded cuts (like DNase I in the presence of Mn2 ions) is used to cleave the DNA into fragments of 100-300 bp in length. Those regions bound by MutS would tend not to be cleaved. The errorcontaining DNA would be separated from the "good" DNA using an affinity columnbased method or gel electrophoresis, and the good DNA would be amplified using PCR. A good deal of effort was invested in working out the appropriate reaction conditions for this procedure. First, working buffer conditions were determined for the reaction. Though both MutS and DNase I are known to have activity in the presence of Mg2+ DNase I only had the desired double-stranded cutting activity in the presence of Mn- + . Tests were performed to ensure that MutS had sufficient activity in a buffer of Mn> (gels not shown). Once the buffer conditions were decided, we performed a time-course assay on the activity of DNase I. Since DNase I is a non-specific nuclease, we wanted to determine an appropriate concentration and reaction time for digestion with the nuclease in order to obtain DNA fragments of optimal size for binding and removal by MutS (-100-300 b). See figure IL. The next step, in which we applied both MutS and DNase I to the DNA, turned out to be much more difficult to troubleshoot. MutS was applied to the DNA samples, and then DNase I was added at various different concentrations. The resulting solution was visualized on a nondenaturing TBE polyacrylamide gel. See Figure 11. effect on synthesized DNA. From left to Next, the DNA was extracted from the gel by the crush-and-soak method from those right: kb ladder; DNA + MutS + 0, 0.002, by the crush-and-soak method from those 0007 044 00.010, 10 0.015 015U 0 000.0067, 0.003, 0.0044, U :gel DNase ; DNA + O, 0.002, 0.003, 0.0044, D~ae + I , DN 0002 0.03,0.044,regions that we believed corresponded to "good" DNA that had not been bound my MutS. In 0.0067, 0.010, 0.015 U DNase I. principle, any error-free DNA that had been bound by MutS would be found shifted up in the gel, presumably above the unbound 1000 bp DNA material. Thus, DNA was extracted from the region of about 100-300bp, and re-amplified in a PCR. Though this step was repeated multiple times, taking DNA from slightly different regions of the gel and using different concentrations of DNase I, we were unable to achieve rebuilding of the DNA fragments into the full length GFP construct. There are many possible reasons for this. First of all, there is the question of DNase I activity of 16 double-strand vs. single-strand cutting. Though at certain concentrations of Mn + , DNase I makes double-stranded cuts, even then it cuts single-stranded with at most a relative frequency of about 30% (Campbell, 1980). The single-strand cuts in the DNA would not result in fragments of the desired length that could be screened for errors by MutS. Also, we are affected by the inability of the gel to give us specific information about whether visualized DNA migrated a given distance because it is a shorter piece of DNA bound by MutS or a longer piece of DNA that was not bound by MutS. Because of this, when we extract DNA from the gel via crush-and-soak, we may not be extracting material that is of the appropriate length to rebuild into the full length construct. Recently, a group from the University of Wisconsin-Madison has demonstrated the ability to improve the error rate of DNA encoding a synthetic green fluorescent protein gene by 3.5-4.3 fold to a final error rate of about in 3500 bp (Brock et al., 2005). Their error correction protocol, termed consensus shuffling, is very similar to the MutS and nuclease procedure discussed in this section. Instead of a gel-, resin-, or magnetic bead-based separation though, they utilized a MutS fusion to maltose-binding protein to precipitate the MutS. Also, instead of a generic non-specific nuclease like DNase , they used a number of restriction endonucleases and combined pools of DNA each treated with a different endonuclease. Because the consensus shuffling paper is so similar in strategy to the MutS plus nuclease strategy of error correction described here, we have dropped further development of this protocol. MSPCR -- error prevention MutS-mn-PCR (MSPCR) is a novel idea for error prevention that involves the use of the MutS protein in the gene fabrication PCR reaction. It is unique from the previously discussed ideas in that it attempts to address errors before they are integrated into synthesized DNA. The PCR cycle is designed to allow MutS to bind to mismatches and to misprining events in the PCR, both of which lead to erroneous PCR product. It has been noted in previous work that the errors originating in the oligonucleotide pool make up a substantial portion of the errors found in the fully synthesized gene product. Before PCR cycling begins, MutS is incubated with the oligonucleotide pool and all PCR reagents besides polymerase. After a 60°C, l0 minutes incubation, polymerase is added and a modified MSPCR program begins. For the first several cycles of the PCR, a 10 minute step at 40°C precedes the regular 72°C extension step. At this temperature, the polymerase is relatively inactive but the thermostable T. aquaticus MutS is quite active. MutS binds mismatches and mispriming events as shown in Figures 12 and 13, giving other "good" DNA a competitive advantage over the error-containing DNA. It has also been noted that MSPCR appears to give cleaner" bands on the gel electrophoresis of synthesized gene products, presumably because of decreased mispriming events (gel not shown). 17 4k _ _ _ _ _ _ %1k 1% N I r, ounler , I I I Figure 12 One proposed mechanism of action of MutS in MSPCR - Steric blocking of polymerase for short pieces of DNA and fbr mismatches near 3' ends (* denotes error) " N _11%, MW 7-. , = r 'n 4 12L_ %,L 1% Figure 13 Another proposed mechanism of action of MutS in MSPCR - MutS binds error; polymerase copies everywhere but mismatch (falls off once it reaches MutS) Depending on whether MSPCR is used in the first step of a two-step protocol for gene fabrication or in one-step protocol, the error prevention protocol seems to give some small but real 1.5- to 2-fold improvement in error rate of a -400 bp target (E. coli LacZalpha). Though this is so far only a modest improvement in error rate, it comes at little added cost in terms of time, handling, or reagents. In addition, MSPCR is an example of error prevention as opposed to error correction, so there are no wasted reagents and no need for an additional step of error correction. MSPCR is more effective in the one-step gene synthesis protocol than in the two- step process, presumably because the competitive advantage given to "good" DNA has a greater effect when the entire gene synthesis process is carried out at once. By optimizing certain parameters in MSPCR, we hope to improve the capability of MSPCR to decrease the error rate of gene fabrication. As noted earlier in this document, even small improvements in error rate can make a big difference in improving the usability and applicability of gene fabrication technology. Some relatively simple variables to be 18 adjusted include the temperature profile of thermal cycling in MSPCR, the stoichiometry of MutS and DNA in the reaction, and species of MutS. Studying the resistance of T aquaticus MutS to thermal degradation has recently led to an insight into improving MSPCR. As shown in Figure 14, repeated temperature cycling from 94°C to 55°C of MutS has a drastic effect on the ability of MutS to bind DNA. Further studies will be undertaken to determine just. Though we Figure 14 Effectof expected T. quacticusto have high resistance to thermal temperature cycling on MutS activity. From left to right: kb ladder, perfect DNA + MutS, mismatch DNA+ MutS after 0,5, 10, 15, 20, 25, and 30 temperature cycles (30 sec @ 94oC, 30 sec @ 55o C) denaturation, this is apparently not the case. Given this data, it is encouraging that MSPCR currently gives an improvement in error rate, even with the thermal denaturation of MutS within a few temperature cycles. One option is to add MutS to the MSPCR as it is running after every couple cycles - this is reminiscent of PCR back before the use of thermophilic DNA polymerases. Another option is to use MutS from a more hyperthermophilic organism such as T thermophilts or P. firiosuts. We will try both of these options in an attempt to get better performance out of MSPCR. If by optimizing parameters in MSPCR, we can improve the error rate improvement of the process (for example, to five-fold as opposed to two-fold), N4SPCR will become a potent and very useful low-cost method for error prevention at the gene synthesis level. However, in order to harness the real power of the concept behind MSPCR, it may be necessary to not only rely on the non-covalent interaction of MutS with mismatch DNA, but to initiate a covalent protein-DNA or protein-protein cross-linking event when MutS binds to a target. This would provide a means of effectively irreversibly removing error-containing DNA from the gene synthesis reaction. We are currently considering possibilities for the appropriate chemistry to facilitate such a cross-linking event. One possibility is the site-directed mutagenesis of the MutS protein with a cysteine residue, whose thiol group could be used to attach a photo-activated cross-linker or some other chemical group that could bind to DNA. Thermophilic MutS-FokI fusion protein The concept for the design of a thermophilic MutS-Fokl fusion protein is itself a fusion of two ideas explained earlier in this document: the MutS-FokI error-correcting fusion protein and MSPCR. A thermophilic MutS-FokI fusion protein, whose respective domains could be taken from thermophilic organisms such as T. thermophilits, could safely be thermocycled in a gene synthesis PCR. Its function would be error prevention, like MSPCR, but the fusion protein would not only bind error-containing DNA but cleave the good DNA around it. The bad DNA could then selectively be removed from the PCR without wasting good material or continuing the extension at the ends of an errorcontaining piece of DNA that is bound by MutS in the middle. 19 III. Closing Statement Many examples of past, present, and future error filtration, correction, and prevention protocols have been presented in this document. The design of all of these "error correction" protocols keep in mind a number of central goals: a quick, easy, robust, economical, and effective method to reduce errors in a gene product synthesized in gene fabrication. A number of error filtration and some early error correction methods have been published in recent years. Though all of them have weaknesses, they are significant early steps toward effective error reduction, a necessary part of gene fabrication if it is to reach its potential as a revolutionary enabling technology in biological engineering and biotechnology. Possibly one of our greatest hopes for error reduction in gene fabrication is the creation of a so-called "golden bullet" - a reagent that could be added to a gene synthesis PCR and enable the synthesis of genes with very low error rates at essentially no added cost. We have hope for ideas such as the thermophilic MutS-FokI fusion protein to be such a golden bullet for gene fabrication. IV. References Babon JJ, McKenzie M, Cotton RG. (2000) The use of resolvases T4 endonuclease VII and T7 endonuclease I in mutation detection. Methods Mol Biol. 152:187-99. Biswas, I. and Hsieh, P. (1996) Identification and Characterization of a Thermostable MutS Homolog from Thermus aquaticus. J. Biol. Chem 271:5040-5048. Brock F. Binkowski, Kathryn E. Richmond, James Kaysen, Michael R. Sussman, and Peter J. Belshaw. (2005) Correcting errors in synthetic DNA through consensus shuffling. Nucleic Acids Research, Vol. 33, No. 6, e55. Brown, J., Brown, T, and Fox, K.R. (2001) Affinity of mismatch-binding protein MutS for heteroduplexes containing different mismatches. Biochem. J. 354:627-633 Campbell, Virginia W. and David A. Jackson. (1980) The Effect of Divalent Cations on the Mode of Action of DNase I. The Journal of Biological Chemistry. Vol. 255, No. 8, April 25, pp. 3726-3735. Carr, P.A., Park J.S., Lee Y.J., Yu T., Zhang, S., Jacobson J.M. (2004) Protein-mediated error correction for de novo DNA synthesis. Nucleic Acids Research. Vol. 32 No. 20 e162 Chen, Junghuei and Seeman, Nadrian C. (1991) Synthesis from DNA of a molecule with the connectivity of a cube. Nature Apr 18;350(6319):631-3. 20 Eisen, J.A. (1998) A phylogenomic study of the MutS family of proteins. Nucleic Acids Research 26:4291-4300 Elowitz MB, Leibler S. A synthetic oscillatory network of transcriptional regulators. (2000) Nature 403:335-8. Goho, A.M. (003) Life Made to Order. Technology Review (April) Harrison, R.G. (2000) Expression of soluble heterologous proteins via fusion with NusA. inNovations 11:4-7. Hoover D.M., and Lubkowski J. (2002) DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 30:43- Kim Y.CG.,Chandrasegaran S. Chimeric restriction endonuclease. Proc Natl Acad Sci U S A. 1994 Feb 1;91(3):883-7. Modrich, P. Mechanisms and biological effects of mismatch repair. (1991) Annilt. Rev. Genet. 25:229-53. Porteus M.H. and Baltimore, D. (2003) Chimeric nucleases stimulate gene targeting in human cells. Science 300:763. Sambrook, and Russell, (2001) Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. Sixma, T.K (2001) DNA mismatch repair: MutS structures bound to mismatches. Curr. (O)p.Struct. Biol. 11:47-52 Smith J, Modrich P. (1997) Removal of polymerase-produced mutant sequences from PCR products. PNAS 94:6847-50. Stemmer WP, Crameri A, Ha KD, Brennan TM, Heyneker HL. (1995) Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene. 164:49-53. Wah, D.A. et al. (1997) Structure of themultimodular endonuclease Fokli bound to DNA Nature 388:97. Wang, L., Brock, A., Herberich, B., and Schultz, P.G. (2001) Expanding the Genetic Code of Escherichia coli. Science 292:498-500. Whitehouse, A. et al. (1997) Analysis of the Mismatch and Insertion/Deletion Binding Properties of Thermus thermophilus, HB8, MutS. Biochem Biophs Res. Commun. 233:834-837 21 Winograd, S., and Cowan, J.D. (1967) Reliable computation in the presence of noise. MIT Press, Cambridge, MA. Withers-Martinez, C. et al. (1999) PCR-based gene synthesis as an efficient approach for expression of the A+T-rich malaria genome. Protein Engineering 12:1113-1120. Yang, B., et al. (2000) Purification, Cloning, and Characterization of the CEL I Nuclease. Biochemistry 39:3533-3541 Youil, R., Kemper,R. and Cotton, R.H.G (1996) Detection of 81 of 81 Known Mouse b-Globin Promoter Mutations with T4 Endonuclease VII-The EMC Method. Genomics 32:431-435. Genomics 32:431-435. 22