Detection of Possible Restriction Sites for Type II Restriction Enzymes in DNA Sequences

P. GAGNIUC1, D. CIMPONERIU1, C. IONESCU-TÎRGOVIŞTE2, ANDRADA MIHAI2, MONICA STAVARACHI1, T. MIHAI1, L. GAVRILĂ1

1 Human Genetics and Molecular Diagnosis Laboratory, Department of Genetics, University of Bucharest, Romania
2 “N.C. Paulescu” National Institute of Diabetes, Nutrition and Metabolic Diseases

In order to take a step forward in understanding the mechanisms operating in complex polygenic disorders such as diabetes and obesity, this paper proposes a new algorithm (PRSD – possible restriction site detection) and its implementation in the Applied Genetics software. This software can be used for in silico detection of potential (hidden) recognition sites for endonucleases and for the identification of nucleotide repeats. Recognition sites for endonucleases may result from hidden sequences through the deletion or insertion of a specific number of nucleotides. Tests were conducted on DNA sequences downloaded from NCBI servers, using the specific recognition sites of common type II restriction enzymes entered in the software database (n = 126). Each possible recognition site indicated by the PRSD algorithm implemented in Applied Genetics was checked and confirmed with the NEBcutter V2.0 and Webcutter 2.0 software. In the sequence NG_008724.1 (which includes 63632 nucleotides) we found a high number of potential restriction sites for EcoRI that may be produced by the deletion (n = 43 sites) or insertion (n = 591 sites) of one nucleotide. The second module of Applied Genetics was designed to find simple repeats of different unit sizes, which may help in understanding the role of SNPs (Single Nucleotide Polymorphisms) in the pathogenesis of complex metabolic disorders. We tested for the presence of simple repetitive sequences in five DNA sequences. The software indicated the exact position of each repeat detected in the tested sequences.
Future development of Applied Genetics may provide an alternative to the powerful tools used to search for restriction sites or repetitive sequences, or may help to improve genotyping methods.

Key words: PRSD algorithm, brute force, DNA sequences, restriction endonucleases, recognition sites, tandem repeats.

ROM. J. INTERN. MED., 2011, 49, 2, 121–128

The genetic bases of several metabolic syndromes with an obvious heritability [1], such as diabetes and obesity, remain a field in which waves of optimism have been followed by disillusion. In less than one decade, the number of genes associated with diabetes and obesity rose to many dozens [2–6]. For a practitioner looking at a map of the various loci associated with diabetes (located more often in the introns of genes, as SNPs, than in exons that encode known or unknown molecules), the confusion is almost total. No coherence can be discerned in such a view. However, the order of the billions of nucleotides in the human genome is arranged in such a way that it ensures the survival and continuity of Homo sapiens sapiens. Hidden inside it there must be a code that results from the sum of many sub-codes, some of them known but most of them unknown. To break such a secret code, wisdom, time and tools are needed. One such tool is presented in this paper, extending previous methods for automatic knowledge extraction from sequences of biological molecules [7].

Restriction enzymes [8] can cut double-stranded DNA molecules that contain a particular recognition sequence. Different software tools (e.g. NEBcutter V2.0 [9], Webcutter 2.0 [10]) can detect the sites where one or several restriction enzymes can cut the sequences of interest. We focused on the analysis of possible restriction sites with a new algorithm called Possible Restriction Sites Detector (PRSD), implemented in Applied Genetics.
The aim of this software is to help researchers find or design new recognition sites for endonucleases through the insertion or deletion of a number of nucleotides within the initial sequence.

Analysis of repetitive sequences has multiple potential applications. For example, the identification of homopolymer tracts located upstream of promoter elements can be useful for detecting protein binding signals [11]. Nucleotide repeats located in exons may cause frame-shift errors or protein toxicity that lead to disease (e.g. dinucleotide repeats are associated with cancers, and (CAG)n/(CTG)n tracts, which can produce polyglutamine tracts, are associated with Huntington’s disease, Machado–Joseph disease and spinocerebellar ataxia [12]). Nucleotide repeats located in noncoding regions have been associated, through less well understood mechanisms, with some human diseases (e.g. Norrie’s disease [13], neurodegenerative disease, chromosomal fragility [14]). The second aim of Applied Genetics is to find repeats with different unit sizes in a DNA sequence.

MATERIALS AND METHODS

The PRSD algorithm detects, in the input DNA sequence, possible recognition sites for restriction enzymes that are frequently used in molecular biology protocols. These restriction enzymes (n = 126) have no ambiguous nucleotides in their recognition sites [15, 16]. The detection method is based on finding, in the input DNA sequence, hidden sequences generated from the specific recognition site of a restriction enzyme. The first step of this process consists of dividing the recognition sequence into all possible combinations of two pieces (strings denoted X and Y). The algorithm then compares all generated X and Y variants with the input sequence (denoted S). Finally, the algorithm checks whether these pieces are located at the distance d (the number of characters between them) selected by the user (Fig. 2). For example, if d = 0, no nucleotide is present between X and Y.
Instead, if d = 3, X and Y are found three nucleotides away from each other, whereas if d = 1 to 3, the algorithm will report all complementary X and Y pieces separated by 1, 2 or 3 nucleotides. The complementary X and Y pieces and the d characters between them represent a hidden sequence.

The pseudocodes for the PRSD algorithms based on hidden sequences are shown in Annex 1 and Annex 2. Briefly, in these pseudocodes A, B and G are boolean variables. A remains TRUE only if every character of piece X matches S at the current position, and B remains TRUE only if every character of piece Y matches S at the corresponding offset. G becomes TRUE when fewer than n characters of S remain to be examined, at which point the search moves on to the next enzyme. A hidden sequence is recorded only when A and B are TRUE and G is FALSE. P(l) is incremented for every hidden sequence found in S within a distance of d = k up to q.

A series of tests was conducted on different biological DNA sequences downloaded from NCBI servers. Two researchers manually checked each possible restriction site indicated by the algorithm. The changes in the input sequence indicated by the PRSD algorithm were tested using NEBcutter V2.0 and Webcutter 2.0.

Brute force algorithms were used to search for short unknown tandem repeats, unknown repetitive sequences, and direct and inverted repeats with dynamic unit sizes. Briefly, the brute force engine generates all possible strings according to four parameters (Pa, Pb, Pc, Pd). These strings are then compared with the analyzed DNA sequence. Pa is the minimum number of letters from which the strings are generated and Pb is the maximum length of a generated string. Pc and Pd are the minimum and maximum number of repetitions of the generated string. Applied Genetics used brute force methods to search for simple tandem repeats in five sequences: NG_008724.1 (NAIP), AB017610 (TT virus genotype 1a DNA), AB025946.2 (TTV SANBAN), AF122913.1 (TT virus isolate GH1) and AF247138.1 (TT virus isolate T3PB).
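The brute force search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Applied Genetics implementation: the function name `brute_force_tandem_repeats` and the lowercase parameter names `pa`, `pb`, `pc`, `pd` (standing in for Pa, Pb, Pc, Pd) are hypothetical, positions are 0-based, only the alphabet A, C, G, T is considered, and runs longer than `pd` copies are skipped rather than truncated.

```python
from itertools import product


def brute_force_tandem_repeats(seq, pa, pb, pc, pd):
    """Return (position, unit, copies) for every tandem run in seq.

    Assumed meaning of the parameters (hypothetical names):
    pa, pb: minimum and maximum length of the generated unit string
    pc, pd: minimum and maximum number of consecutive copies reported
    """
    hits = []
    for length in range(pa, pb + 1):
        # Generate every possible unit of this length over A, C, G, T.
        for letters in product("ACGT", repeat=length):
            unit = "".join(letters)
            pos = 0
            while pos <= len(seq) - length:
                # Count consecutive copies of the unit starting at pos.
                copies = 0
                while seq.startswith(unit, pos + copies * length):
                    copies += 1
                if copies >= pc:
                    if copies <= pd:
                        hits.append((pos, unit, copies))
                    pos += copies * length  # jump past the whole run
                else:
                    pos += 1
    return hits


if __name__ == "__main__":
    example = "GGAAAAATCACACAG"
    print(brute_force_tandem_repeats(example, 1, 2, 3, 10))
    # prints [(2, 'A', 5), (8, 'CA', 3)]
```

The number of generated units grows as 4 to the power of the unit length, so this exhaustive approach is only practical for short units. The sketch also does not filter redundant units: a run of eight A would be reported both as (A)8 and as (AA)4.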
Applied Genetics works with GenBank [17] format or plain text sequences. The software has no theoretical limit on the length of the input data. However, we tested sequences of at most 500 kb. This size was considered sufficient for an analysis, especially if we take into account the estimate that the average gene size in vertebrates is about 30 kb and that some duplicated regions in the human genome can reach this size (e.g. the 5q duplicated region in human has 500 kb). The Applied Genetics program is compatible with the Windows 2000/2003/XP operating systems. After installation, the extensions “pro” and “agx” are associated with the Applied Genetics program, and all Applied Genetics projects are saved under these file extensions.

RESULTS

We have created a software platform named Applied Genetics which incorporates several modules used to search for existing or hidden (possible) recognition sites for endonucleases and to test for the presence of simple repetitive sequences (e.g. short tandem repeats).

1. Applied Genetics detects possible restriction sites for 126 restriction enzymes that are frequently used in recombinant DNA techniques, RFLP analysis and genomic mapping. The tests were performed on different sequences downloaded from the NCBI servers. For example, we tested the sequence Homo sapiens NLR family, Neuronal Apoptosis Inhibitory Protein (NAIP) (NG_008724.1, which includes 63632 nucleotides). Applied Genetics detected in this sequence a high number of potential restriction sites for EcoRI that may be produced by the deletion (n = 43 sites) or insertion (n = 591 sites) of one nucleotide. The results provided by the Applied Genetics software were checked and confirmed with the NEBcutter V2.0 and Webcutter 2.0 software. The number of sites detected increases very fast as the value of the distance d increases (e.g. d > 3) (Table I).

Table I
Applied Genetics detected 10351 potential sites for EcoRI in NG_008724.1 when this sequence is analyzed in both directions

Sequence        Number of deleted nucleotide(s)    Number of inserted nucleotide(s)
polarity            1       2       3                  1       2       3
5’ > 3’            21       9      16                330     951    3031
3’ > 5’            22      19      29                261    1074    4588
Total sites        43      28      45                591    2025    7619

Hidden sequences identified by Applied Genetics in the input sequence represent parts of a potential recognition site that may be created or modified by PCR site-directed mutagenesis.

The program output for each set of restriction enzymes presents the following data: the restriction enzyme; the type of method (insertion or deletion) used to construct a virtual recognition site; the number of nucleotides that are inserted into or deleted from the original sequence; the number of possible restriction sites found for the distance d selected by the user; and the list of restriction sites found within the analyzed sequence (Fig. 1).

2. Although brute force methods were first used in cryptography and computer security, Applied Genetics uses them to search for simple repetitive sequences, here in five DNA sequences (Table II).
Table II
The number of simple repeats detected by Applied Genetics in five sequences

Sequence       Size     Repeats    n>8    n>7    n>6    n>5
NG_008724.1    63632    (C)n         1      2      8     31
                        (G)n         0      0      2     20
                        (A)n        34     52     90    200
                        (T)n        50     69    124    198
AB017610.1      3853    (C)n         1      1      7     13
                        (G)n         2      3      5      7
                        (A)n         1      3      8     14
                        (T)n         0      0      2      3
AB025946.2      3808    (C)n         4      4      9      9
                        (G)n         2      2      3      6
                        (A)n         0      0      2     12
                        (T)n         0      0      1      3
AF122913.1      3852    (C)n         3      3      6     15
                        (G)n         2      3      4      8
                        (A)n         1      3     10     13
                        (T)n         0      0      2      3
AF247138.1      3838    (C)n         2      5      7     12
                        (G)n         1      4      5      8
                        (A)n         0      2      6     17
                        (T)n         0      0      1      2

Fig. 1. Applied Genetics software. On the left side of the figure are the buttons that give access to the Applied Genetics modules. In the middle are the graphic representation of the DNA sequence as text and the results window. On the right side are the window where the parameters of the modules can be changed, the start button, the color window and the statistics window.

Fig. 2. Number of variants for the deletion and insertion methods. The figure shows the PRSD-generated variants for the deletion case (a) and the insertion case (b), in which the imposed limit for the distance is d = k to q (k = 1, q = 3). Both DNA chains and directions (5’–3’, 3’–5’) are taken into account for the EcoRI case.

DISCUSSION

The identification of type II restriction endonuclease sites is a preliminary step towards the in-depth study of genomic DNA arrangements, which might ultimately explain the polygenic interactions that ensure the energy homeostasis of the human body, and then towards understanding the polygenic derangements of the normal genetic architecture that lead to the dysregulation of a large system including many cells, tissues and organs, as is the case in obesity, diabetes and their related chronic vascular complications [18].
Because the “brain” of metabolic regulation is represented by the pancreatic β cells, it is our intention to use the new informatics tool to analyze the secretory components of these cells, pre-proinsulin and pre-proamylin, and to try to understand why, in some circumstances, proinsulin remains largely unsplit and amylin (the second hormone secreted by this cell) undergoes a conformational change that leads to its self-association and to the formation of large amylin deposits, favoring a progressive decrease in β cell mass [19].

One of the most important challenges in bioinformatics is the integration of the available visualization and analysis tools. Visualization can play an important role in exploratory data analysis, where graphical representations build up an understanding of the content of a DNA sequence. The programming objects of the graphic representation are interconnected, and this relationship between objects gives the user a clear view of the analyzed sequence. Navigating through the sequence with the mouse is one of the key features of Applied Genetics. The results provided by Applied Genetics are presented in HTML format for saving or for publishing on Internet servers. The generated HTML file is equipped with a JavaScript search engine that helps users navigate through the Applied Genetics results.

Future development of Applied Genetics may provide an alternative to the powerful tools used to view sequences (e.g. DNAVis [20], Phylo-VISTA [21]) and to search for restriction sites or polymorphisms (e.g. TandemSWAN [22], Phobos [23], Poly [24], MIcroSAtellite identification tool [25], Microsatellite repeats finder [26] or Tandem Repeats Finder [27]), or may help to improve genotyping methods.

Acknowledgements: This work is supported by the Ministry of Education, Research and Innovation, CNCSIS, IDEI contract number: 2150/2008.
In order to take a step forward in understanding the mechanism operating in complex polygenic disorders such as diabetes mellitus and obesity, this paper proposes a new algorithm (PRSD – possible restriction site detection) implemented in the Applied Genetics software application. This software can be used for the in silico detection of possible (hidden) restriction sites, of recognition sites for endonucleases and of nucleotide repeats. Recognition sites for endonucleases may result from hidden sequences through the deletion or insertion of a certain number of nucleotides in the initially analyzed DNA sequence. The tests were performed on sequences downloaded from the NCBI servers, using the specific recognition sites of common type II restriction enzymes, which were also entered in the database of the Applied Genetics program (n = 126). Each possible site detected by the PRSD algorithm implemented in Applied Genetics was checked and confirmed with the NEBcutter V2.0 and Webcutter 2.0 software. In the sequence NG_008724.1 (which includes 63632 nucleotides), a high number of possible restriction sites for EcoRI was detected, which can be produced by the deletion (n = 43 sites) or insertion (n = 591 sites) of nucleotides. The second module of the Applied Genetics software, which holds real promise for understanding the role of SNPs (Single Nucleotide Polymorphisms) in the pathogenesis of complex metabolic disorders, was designed to find repeats of different sizes. Tests were carried out to detect simple repetitive sequences in five DNA sequences. The application indicated the exact position of each repeat detected in the tested sequences. Future development of the Applied Genetics application may offer an alternative to the powerful tools used to search for restriction sites or repetitive sequences, or may help to improve genotyping methods.
___________________________________________________________________
Corresponding author: P. Gagniuc, MD, University of Bucharest, Human Genetics and Molecular Diagnosis Laboratory, Department of Genetics, E-mail: paulgagniuc@yahoo

REFERENCES

1. PERMUTT M.A., WASSON J., COX N., Genetic epidemiology of diabetes. J. Clin. Invest., 2005, 115:1431–1439.
2. LI S., ZHAO J.H., LUAN J., LANGENBERG C., LUBEN R.N., KHAW K.T., WAREHAM N.J., LOOS R.J.F., Genetic predisposition to obesity leads to increased risk of type 2 diabetes. Diabetologia, 2011, 54:776–782.
3. RAMOS E., CHEN G., SHRINER D., DOUMATEY A., GERRY N.P., HERBERT A., HUANG H., ZHOU J., CHRISTMAN M.F., ADEYEMO A., ROTIMI C., Replication of genome-wide association studies (GWAS) loci for fasting plasma glucose in African-Americans. Diabetologia, 2011, 54:783–788.
4. TIAN C., FANG S., DU X., JIA C., Association of the C47T polymorphism in SOD2 with diabetes mellitus and diabetic microvascular complications: a meta-analysis. Diabetologia, 2011, 54:803–811.
5. STANČÁKOVÁ A., PAANANEN J., SOININEN P., KANGAS A.J., BONNYCASTLE L.L., MORKEN M.A., COLLINS F.S., JACKSON A.U., BOEHNKE M.L., KUUSISTO J., ALA-KORPELA M., LAAKSO M., Effects of 34 risk loci for type 2 diabetes or hyperglycemia on lipoprotein subclasses and their composition in 6,580 nondiabetic Finnish men. Diabetes, 2011, 60:1608–1616.
6. JENSEN A.C., BARKER A., KUMARI M., BRUNNER E.J., KIVIMÄKI M., HINGORANI A.D., WAREHAM N.J., TABÁK A.G., WITTE D.R., LANGENBERG C., Associations of common genetic variants with age-related changes in fasting and postload glucose. Evidence from 18 years of follow-up of the Whitehall II Cohort. Diabetes, 2011, 60:1617–1623.
7. PAUL GAGNIUC, DĂNUŢ CIMPONERIU, NICOLAE MIRCEA PANDURU, MONICA STAVARACHI, MIHAI TOMA, CONSTANTIN IONESCU-TÎRGOVIŞTE, LUCIAN GAVRILĂ, A sensitive method for detecting dinucleotide islands and clusters through depth analysis – abstract. RJDNMD, 2011, 18, 2:165–170.
8. ROBERTS R.J., How restriction enzymes became the workhorses of molecular biology. Proc Natl Acad Sci USA. 2005 Apr 26; 102(17):5905–8.
9. http://tools.neb.com/NEBcutter2/index.php.
10. http://rna.lundberg.gu.se/cutter2/.
11. STRUHL K., Naturally occurring poly(dA-dT) sequences are upstream promoter elements for constitutive transcription in yeast. Proc Natl Acad Sci USA. 1985 Dec; 82(24):8419–23.
12. MALLIK M., LAKHOTIA S.C., Modifiers and mechanisms of multi-system polyglutamine neurodegenerative disorders: lessons from fly models. J Genet. 2010 Dec; 89(4):497–526.
13. KENYON J.R., CRAIG I.W., Analysis of the 5’ regulatory region of the human Norrie’s disease gene: evidence that a nontranslated CT dinucleotide repeat in exon one has a role in controlling expression. Gene. 1999 Feb 18; 227(2):181–8.
14. ASHLEY C.T. Jr., WARREN S.T., Trinucleotide repeat expansion and human disease. Annu Rev Genet. 1995; 29:703–28.
15. PINGOUD A., FUXREITER M., PINGOUD V., WENDE W., Type II restriction endonucleases: structure and mechanism. Cell Mol Life Sci. 2005 Mar; 62(6):685–707.
16. BILCOCK D.T., DANIELS L.E., BATH A.J., HALFORD S.E., Reactions of type II restriction endonucleases with 8-base pair recognition sites. J Biol Chem. 1999 Dec 17; 274(51):36379–86.
17. BENSON D.A., KARSCH-MIZRACHI I., LIPMAN D.J., OSTELL J., SAYERS E.W., GenBank. Nucleic Acids Res. 2010 Jan; 38(Database issue):D46–51.
18. IONESCU-TÎRGOVIŞTE C., Insulin resistance – what is myth and what is reality? Acta Endocrinologica. 2011; VII, 1:123–146.
19. IONESCU-TÎRGOVIŞTE C., Proinsulin As The Possible Key In The Pathogenesis Of Type 1 Diabetes. Acta Endocrinologica. 2009; 5(2):233–249.
20. FIERS M.W., VAN de WETERING H., PEETERS T.H., VAN WIJK J.J., NAP J.P., DNAVis: interactive visualization of comparative genome annotations. Bioinformatics. 2006 Feb 1; 22(3):354–5.
21. SHAH N., COURONNE O., PENNACCHIO L.A., BRUDNO M., BATZOGLOU S., BETHEL E.W., RUBIN E.M., HAMANN B., DUBCHAK I., Phylo-VISTA: interactive visualization of multiple DNA sequence alignments. Bioinformatics. 2004 Mar 22; 20(5):636–43.
22. http://favorov.imb.ac.ru/swan/.
23. http://www.ruhr-uni-bochum.de/spezzoo/cm/cm_phobos.htm.
24. http://www.bioinformatics.org/poly/wiki/.
25. http://pgrc.ipk-gatersleben.de/misa/misa.html.
26. http://www.biophp.org/minitools/microsatellite_repeats_finder/demo.php.
27. http://tandem.bu.edu/trf/trf.html.

Received June 3, 2011

ANNEX 1
The pseudocode for the PRSD algorithm based on deletion of nucleotides

    Let k = 1
    Let d = k
    For l = 1 to mu
        Let P(l) = 0
        For alpha = 1 to beta
            For e = 1 to n
                Let A = True
                Let B = True
                Let G = False

                For i = 1 to e
                    if E(l,i) <> N(alpha+i-1) then A = False
                Next i

                For j = 1 to n-e
                    if E(l, j+e+d) <> N(alpha+e+d+j-1) then B = False
                Next j

                if (beta-alpha-1 < n) then
                    Let G = True
                    Let alpha = 1
                    Let l = l + 1
                end if

                if A = True & B = True & G = False then

                    For t = 1 to alpha+i-1
                        xi(d, P(l)) = xi(d, P(l)) and N(t)
                    Next t

                    For e = alpha+i-1+d to beta
                        xi(d, P(l)) = xi(d, P(l)) and N(e)
                    Next e

                    P(l) = P(l)+1
                    alpha = alpha+e+d+e-n

                end if

            Next e
        Next alpha
    Next l

ANNEX 2
The pseudocode for the PRSD algorithm based on insertion of nucleotides

    Let k = 1
    Let d = k
    For l = 1 to mu
        Let P(l) = 0
        For alpha = 1 to beta
            For e = 1 to n-d
                Let A = True
                Let B = True
                Let G = False

                For i = 1 to e
                    if E(l,i) <> N(alpha+i-1) then A = False
                Next i

                For j = 1 to e-i
                    if E(l,j+e+d) <> N(alpha+e+d+j-1) then B = False
                Next j

                if (beta-alpha-1 < n) then
                    Let G = True
                    Let alpha = 1
                    Let l = l+1
                end if

                if A = True & B = True & G = False then

                    For t = 1 to alpha+i-1
                        xi(d, P(l)) = xi(d, P(l)) and N(t)
                    Next t

                    For f = k to q
                        xi(d, P(l)) = xi(d, P(l)) and E(l,f)
                    Next f

                    For e = alpha+i-1+q to beta
                        xi(d, P(l)) = xi(d, P(l)) and N(e)
                    Next e

                    P(l) = P(l)+1
                    alpha = alpha+e+d+e-i

                end if

            Next e
        Next alpha
    Next l
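
The idea behind the two annexes can also be illustrated with a short Python sketch. It is one reading of the hidden-sequence method described in Materials and Methods, not a transcription of the pseudocode and not the Applied Genetics code. The function names `split_site`, `deletion_candidates` and `insertion_candidates` are hypothetical. Further assumptions: positions are 0-based, both pieces X and Y are non-empty, d is at least 1, and only the given strand is scanned. For the insertion case the sketch assumes that a hidden sequence is the recognition site with d interior nucleotides missing, so that X and Y are adjacent in the input sequence.

```python
def split_site(site):
    """All ways to cut a recognition site into two non-empty pieces (X, Y)."""
    return [(site[:e], site[e:]) for e in range(1, len(site))]


def deletion_candidates(seq, site, d):
    """Find X and Y separated by exactly d extra nucleotides.

    Deleting those d nucleotides would create the recognition site.
    Returns a list of (position, X, Y); assumes d >= 1.
    """
    hits = []
    span = len(site) + d  # length of the hidden sequence
    for x, y in split_site(site):
        for pos in range(len(seq) - span + 1):
            if seq.startswith(x, pos) and seq.startswith(y, pos + len(x) + d):
                hits.append((pos, x, y))
    return hits


def insertion_candidates(seq, site, d):
    """Find the site with d interior nucleotides missing.

    Inserting the missing piece between X and Y would create the site.
    Returns a list of (position, X, missing, Y); assumes d >= 1.
    """
    hits = []
    n = len(site)
    for e in range(1, n - d):
        x, missing, y = site[:e], site[e:e + d], site[e + d:]
        pattern = x + y  # X and Y are adjacent in the input sequence
        pos = seq.find(pattern)
        while pos != -1:
            hits.append((pos, x, missing, y))
            pos = seq.find(pattern, pos + 1)
    return hits


if __name__ == "__main__":
    ecori = "GAATTC"
    print(deletion_candidates("TTGAACTTCAA", ecori, 1))
    # prints [(2, 'GAA', 'TTC')]
    print(insertion_candidates("CCGATTCGG", ecori, 1))
    # prints [(2, 'G', 'A', 'ATTC'), (2, 'GA', 'A', 'TTC')]
```

As the second call shows, the same position can be reported for more than one split of the site. The sketch keeps such duplicates, so its counts are not expected to reproduce the figures in Table I.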