Detection of Possible Restriction Sites for Type II

advertisement
Detection of Possible Restriction Sites for Type II Restriction Enzymes
in DNA Sequences
P. GAGNIUC1, D. CIMPONERIU1, C. IONESCU-TÎRGOVIŞTE2, ANDRADA MIHAI2,
MONICA STAVARACHI1, T. MIHAI1, L. GAVRILĂ1
1
Human Genetics and Molecular Diagnosis Laboratory, Department of Genetics, University of Bucharest, Romania
2
“N.C. Paulescu” National Institute of Diabetes, Nutrition and Metabolic Diseases
In order to make a step forward in the knowledge of the mechanism operating in complex
polygenic disorders such as diabetes and obesity, this paper proposes a new algorithm (PRSD –
possible restriction site detection) and its implementation in Applied Genetics software. This software
can be used for in silico detection of potential (hidden) recognition sites for endonucleases and for
nucleotide repeats identification. The recognition sites for endonucleases may result from hidden
sequences through deletion or insertion of a specific number of nucleotides. Tests were conducted on
DNA sequences downloaded from NCBI servers using specific recognition sites for common type II
restriction enzymes introduced in the software database (n = 126). Each possible recognition sites
indicated by the PRSD algorithm implemented in Applied Genetics was checked and confirmed by
NEBcutter V2.0 and Webcutter 2.0 software. In the sequence NG_008724.1 (which includes 63632
nucleotides) we found a high number of potential restriction sites for ECO R1 that may be produced
by deletion (n = 43 sites) or insertion (n = 591 sites) of one nucleotide.
The second module of Applied Genetics has been designed to find simple repeats sizes with a
real future in the understanding the role of SNPs (Single Nucleotide Polymorphisms) in the
pathogenesis of the complex metabolic disorders. We have been tested the presence of simple
repetitive sequences in five DNA sequence. The software indicated exact position of each repeats
detected in tested sequences.
Future development of Applied Genetics can provide an alternative for powerful tools used to
search for restriction sites or repetitive sequences or to improve genotyping methods.
Key words: PRSD algorithm, brute force, DNA sequences, restriction endonucleases, recognition
sites, tandem repeats.
The genetic bases of several metabolic syndromes with an obvious heritability [1] as diabetes
and obesity, remains a territory of waves of optimism
and then by disillusions. In a less than one decade,
the number of genes associated with diabetes and
obesity raised to a dozen of tens [2–6]. For a
practitioner who looked at a map with a location of
various loci associated with diabetes (located more
often in introns of genes as SNPs, then in exons
which encode known or unknown molecules) the
confusion is almost total. Any coherence could be
suggested for such view. However, the order of
many trillions of nucleotides in the human genome
has such an arrangement that can ensure the survival
and continuity of Homo sapiens sapiens. Inside it
must be hidden a code which results from a sum
mum of many sub-codes, some of them known, but
most of them unknown. To break such secret code
wisdom, time and tools are needed. One of such
tool is presented in this paper, enlarging the previous
automatic knowledge extraction methods on
sequences of biological molecules [7].
ROM. J. INTERN. MED., 2011, 49, 2, 121–128
Restriction enzymes [8] can cut doublestranded DNA molecules that contain a particular
recognition sequence. Different software (e.g. NEBcutter V2.0 [9], Webcutter 2.0 [10]) can detect the
sites where one or several restriction enzymes can
cut the sequences of interest. We have been
focused on the analysis of possible restriction sites
with a new algorithm called Possible Restriction
Sites Detector (PRSD) implemented in Applied
Genetics. The aim of this software is to help
researchers to find or to design new recognition
sites for endonucleases through the insertion or
deletion of a number of nucleotides within the
initial sequence.
Analysis of repetitive sequences has multiple
potential applications. For example, identification
of homopolymer tracts located upstream of the
promoter elements can be useful for detecting
protein binding signals [1]. Nucleotide repeats
located in exons may determine frame-shift errors
or protein toxicity which lead to diseases [e.g.
dinucleotides associated with cancers, (CAG)n/
122
P. Gagniuc et al.
(CTG)n tracts which can lead to polyglutamine
tracts are associated with Huntington’s disease,
Machado–Joseph disease or Spinocerebellar ataxia
[12]. Nucleotide repeats located in noncoding regions
have been associated in a more criptic model with
some human diseases (e.g. Norrie’s disease [13],
neurodegenerative disease, chromosomal fragility
[14]). The second aim of Applied Genetics is to find
repeats with different unit sizes in a DNA sequence.
MATERIALS AND METHODS
PRSD algorithm detects possible recognition
sites of use frequently in molecular biology protocols
in the input DNA sequence. These restriction
enzymes (n = 126) have no ambiguous nucleotides
in their recognition sites [15][16]. The method of
detection is based on detecting a hidden sequences
generated from specific recognition site of restriction
enzyme in the input DNA sequence. First step of
this process consists in dividing the specific recognition sequence in all possible combinations of two
pieces (the strings characters are noted as X and Y).
Then algorithm compared all generated X and Y
variants with the input sequence (denoted by S).
Finally, the algorithm search if these combinations
are located at the distance d (d represents the
number of characters) selected by the user (Fig. 2).
For example, if d = 0, no nucleotide is present
between the X and Y. Instead, if d = 3, the X and Y
are found at three nucleotides away from each
other whereas if d = 1 to 3, the algorithm will
indicate all X and Y complementary sequences
separated by 1, 2 or 3 nucleotides. The X and Y
complementary parts and the d characters between
them represent a hidden sequence.
The pseudocodes for the PRSD algorithms
based on hidden sequence are shown in Annex 1
and 2. Briefly, in these pseudocodes A, B and G are
boolean variables. Variable A can be TRUE if B
variable becomes FALSE, however, B variable is
FALSE when G variable becomes TRUE. Variable
B can become FALSE when G variable becomes
FALSE, but when G variable becomes TRUE, B
variable becomes FALSE. P(l) will be incremented
for every hidden sequence found in S within a
distance of d = k up to q.
A series of tests were conducted on different
biological DNA sequences downloaded from NCBI
servers. Two researchers check manually each
possible restriction site indicated by the algorithm.
The changes in the input sequence indicated by
2
PRSD algorithm have been tested using NEBcutter
V2.0, Webcutter 2.0.
Brute force algorithms were used to search
for short unknown tandem repeats, unknown repetitive
sequences, direct and inverted repeats with dynamic
unit sizes. Briefly, the Brute Force engine generates
all possible strings depending on four parameters
(Pa, Pb, Pc, Pd). These strings are then compared
with the analyzed DNA sequence. Pa parameter
represents the minimum number of letters from
which these strings are generated and Pb parameter
is the maximum length of a generated string.
Parameters Pc and Pd are the minimum and maximum
number of repetitions of the generated string.
AppliedGenetics use brute force methods for
searching for simple tandem repeats in sequences in
five sequences: NG_008724.1 (NAIP), AB017610
(TT virus genotype 1a DNA), AB025946.2 (TTV
SANBAN) AF122913.1 (TT virus isolate GH1) and
AF247138.1 (TT virus isolate T3PB).
Applied Genetics works with GenBank [17]
format or plain text sequences. The software does
not have theoretical restrictive length barriers for
data input. However, we tested sequences of maximum
500 kb. This size was considered sufficient for an
analysis especially if we take into account estimation
that the average gene size in vertebrates is about
30 kb and that some duplicated regions in the human
genome can reach this size (e.g. 5q duplicated
region in human has 500 kb).
Applied Genetics program is compatible with
windows 2000/2003/XP operating systems. After
installation, the extensions “pro” and “agx” are
associated with the Applied Genetics program. All
Applied Genetics projects are saved under these file
extensions the software does not have theoretical
restrictive length barriers for data input.
RESULTS
We have created a software platform named
Applied Genetics which incorporates several modules
used to search for existent or hidden (possible)
recognition sites for endonucleases and to test the
presence of simple repetitive sequences (e.g. which
examines short tandem repeats).
1. AppliedGenetics detects possible restriction
sites for 126 restriction enzymes that are used
frequently in recombinant DNA techniques, RFLP
analysis and genomic mapping. The tests were
performed on different sequences downloaded from
the NCBI servers.
3
Type II restriction enzymes in DNA sequences
123
(n = 43 sites) or insertion (n = 591 sites) of one
nucleotide. The results provided by the Applied
Genetics software have been checked and confirmed
by NEBcutter V2.0 and Webcutter 2.0 software.
The number of sites detected increases very fast if
the value of distance d increases (e.g. d > 3) (Table I).
For examples we tested the sequence Homo
sapiens NLR family, Neuronal Apoptosis Inhibitory
Protein (NAIP), (NG_008724.1, which includes
63632 nucleotides). Applied Genetics detected in
this sequence a high number of potential restriction
sites for EcoR1 that may be produced by deletion
Table I
Applied Genetics detected 10351 potential sites for EcoR 1 in NG_008724.1
if this sequence is analyzed in both directions
Sequence
Polarity
5’ > 3’
3’ > 5’
Total sites
Number of deleted nucleotide(s) Number of inserted nucleotide(s)
1
2
3
1
2
3
21
9
16
330
951
3031
22
19
29
261
1074
4588
43
28
45
591
2025
7619
Hidden sequences identified by Applied
Genetics in the input sequence represent parts of
the potential recognition site that may be created or
modified by PCR site-directed mutagenesis.
2. Although the first brute force methods
have been used in cryptography and computer
security. AppliedGenetics use brute force methods
for searching for simple repetitive sequences in five
DNA sequences (Table II).
The program output for each set of restriction
enzymes presents the following data: the restriction
enzyme; the type of the method (e.g. insertion or
deletion) used to construct a virtual recognition
site; the number of nucleotides that are inserted or
deleted from the original sequence; the number of
possible restriction sites found for the distance d
selected by the user; the list of restriction sites
found within the analyzed sequence (Fig. 1).
Table II
The number of simple repeats detected by Applied Genetics in five sequences
Sequence
Size
NG_880724.1
63632
AB017610.1
3853
AB025946.2
3808
AF122913.1
3852
AF247138.1
3838
Repeats
(C)n
(G)n
(A)n
(T)n
(C)n
(G)n
(A)n
(T)n
(C)n
(G)n
(A)n
(T)n
(C)n
(G)n
(A)n
(T)n
(C)n
(G)n
(A)n
(T)n
DISCUSSION
The identification of type 2 restriction endonucleases sites is a preliminary step for the indepth
study of genomic DNA arrangements which might
explain finally the polygenic interactions in ensuring
n>8
1
0
34
50
1
2
1
0
4
2
0
0
3
2
1
0
2
1
0
0
n>7
2
0
52
69
1
3
3
0
4
2
0
0
3
3
3
0
5
4
2
0
n>6
8
2
90
124
7
5
8
2
9
3
2
1
6
4
10
2
7
5
6
1
n>5
31
20
200
198
13
7
14
3
9
6
12
3
15
8
13
3
12
8
17
2
the energy homeostasis of the human body and
then the understanding the polygenic derangements
of the normal genetic architecture which led to the
dysregulation of a huge system including many
cells/tissues/organs, as is the case of obesity, diabetes
and their related chronic vascular complications [18].
P. Gagniuc et al.
Fig. 1.Applied Genetics software. On the left side of the figure there are the buttons which give access to Applied Genetics modules. In the middle there
is the graphic representation of the DNA sequence as text and results window. In the right side there are the window where the parameters of the
modules can be changed, the start button, the color window and the statisticd window.
124
4
5
Type II restriction enzymes in DNA sequences
125
Fig. 2. Number of variants for delition and insertion methods. The figure shows the significance of PRSD generated variants for
deletion case (a) and insertion case (b) in which the imposed limit for distance is d = k to q (k =1, q = 3).
Both DNA chains and directions (5’–3’, 3’–5’) are taken into account for EcoR1 case.
126
P. Gagniuc et al.
Because the “brain” of metabolic regulation
is represented by the pancreatic β cells is our
intention to use the new informatics tool in analyzing
the secretory components of these cells: pre-proinsulin and pre-proamylin and to try to understand
why, in some circumstances, proinsulin remains in
great part unsplitted and amylin (the second hormone
secreted by this cell) suffers a conformational
change leading to their self association with the
formation of large amylin deposits, favoring their
progressive decrease in the β cell mass [19].
The most important challenge in Bioinformatics is the integration of available visualization
and analysis tools. Visualization can play an
important role in exploratory data analysis, where
graphical representations build up an understanding
of the DNA sequence content. Programming objects
of the graphic representation are interconnected, this
6
relationship between objects gives the user an
unquestionable and clear vision of the analyzed
sequence. Navigating through the sequence using
the mouse is one of the key features of Applied
Genetics.
The results provided by Applied Genetics are
presented in HTML format for saving or publishing
on different Internet servers. The generated HTML
file is equipped with a JavaScript search engine used to
navigate the users through Applied Genetics results.
Future development of Applied Genetics can
provide an alternative for powerful tools used to
view sequences (e.g. DNAVis [20], Phylo-VISTA
[21]), to search for restriction sites or polymorphisms (e.g. TandemSWAN [22], Phobos [23],
Poly [24], MIcroSAtellite identification tool [25],
Microsatellite repeats finder [26] or Tandem Repeats
Finder [27]) or to improve genotyping methods.
Acknowledgements: This work is supported by the Ministry of Education, Research and Innovation, CNCSIS, IDEI contract
number: 2150/2008.
Pentru a face un pas înainte în cunoaşterea mecanismului care operează în
afecţiuni poligenice complexe, cum ar fi diabetul zaharat şi obezitatea, această
lucrare propune un nou algoritm (PRSD – detectarea de situsuri de restricţie
posibile) implementat în aplicaţia software Applied Genetics. Acest program
software poate fi folosit pentru detectarea in silico a situsurilor de restricţie
posibile (ascunse), a situsurilor de recunoaştere pentru endonucleaze şi a repetiţiilor
nucleotidice.
Situsurile de recunoaştere pentru endonucleaze ar putea rezulta din secvenţe
ascunse prin ştergerea sau introducerea unui anumit număr de nucleotide în
secvenţa ADN analizată iniţial.
Testele au fost efectuate pe secvenţe descărcate de pe serverele NCBI,
folosind situsuri de recunoaştere specifice pentru enzimele de restricţie comune de
tip II – enzime introduse şi în baza de date a programului Applied Genetics (n = 126).
Fiecare situs posibil detectat de către algoritmul PRSD implementat în
Applied Genetics a fost verificat şi confirmat de programele software NEBcutter
V2.0 şi Webcutter 2.0. În secvenţa NG_008724.1 (care include 63632 nucleotide) a
fost detectat un număr mare de situsuri de restricţie posibile pentru ECO R1, care
pot fi produse prin deleţia (n = 43 situsuri) sau inserţia (n = 591 situsuri) de
nucleotide.
Al doilea modul al aplicaţiei software Applied Genetics, cu un viitor real în
înţelegerea rolului SNP-urilor (Single Nucleotide Polymorphisms) în patogeneza
tulburărilor metabolice complexe, a fost proiectat pentru a găsi repetiţii de diferite
dimensiuni. Au fost efectuate teste pentru detectarea de secvenţe repetitive simple
în cinci secvenţe ADN. Aplicaţia a indicat poziţia exactă a fiecărei repetiţii
detectate în secvenţele testate.
Dezvoltarea pe viitor a aplicaţiei Applied Genetics poate oferi o alternativă
pentru instrumente puternice folosite pentru a căuta situsurile de restricţie,
secvenţe repetitive sau pentru a îmbunătăţi metodele de genotipare.
___________________________________________________________________
Corresponding author: P. Gagniuc, MD
University of Bucharest,
Human Genetics and Molecular Diagnosis Laboratory,
Department of Genetics,
E-mail: paulgagniuc@yahoo
7
Type II restriction enzymes in DNA sequences
127
REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
PERMUT M.A., WASSON J., COX N., Genetic epidemiology of diabetes. J.Clin.Invest., 2005, 115, 1431–1439.
LI S., ZHAO J.H., LUAN J., LANGENBERG C., LUBEN R.N., KHAW K.T., WAREHAM N.J., LOOS R.J.F., Genetic
predisposition to obesity leads to increased risk of type 2 diabetes. Diabetologia, 2011, 54:776–782.
RAMOS E., CHEN G., SHRINER D., DOUMATEY A., GERRY N.P., HERBERT A., HUANG H., ZHOU J., CHRISTMAN
M.F., ADEYEMO A., ROTIMI C., Replication of genome-wide association studies (GWAS) loci for fasting plasma glucose in
African-Americans. Diabetologia, 2011, 54:783–788.
TIAN C., FANG S., DU X., JIA C., Association of the C47T polymorphism in SOD2 with diabetes mellitus and diabetic
microvascular complications: a meta-analysis. Diabetologia 2011, 54: 803–811.
STANČÁKOVÁ A., PAANANEN J., SOININEN P., KANGAS A.J., BONNYCASTLE L.L., MORKEN M.A, COLLINS F.S.,
JACKSON A.U., BOEHNKE M.L., KUUSISTO J., ALA-KORPELA M., LAAKSO M., Effects of 34 risk loci for type 2 diabetes or
hyperglycemia on lipoprotein subclasses and their composition in 6, 580 nondiabetic Finnish men, Diabetes 2011, 60:1608–1616.
JENSEN A.C., BARKER A., KUMARI M., BRUNNER E.J., KIVIMÄKI M., HINGORANI A.D., WAREHAM N.J., TABÁK
A.G., WITTE D.R., LANGENBERG C., Associations of common genetic variants with age-related changes in fasting and
postload glucose. Evidence from 18 years of follow-up of the Whitehall II Cohort. Diabetes 2011, 60:1617–1623.
PAUL GAGNIUC, DĂNUŢ CIMPONERIU, NICOLAE MIRCEA PANDURU, MONICA STAVARACHI, MIHAI TOMA,
CONSTANTIN IONESCU-TÎRGOVIŞTE, LUCIAN GAVRILĂ, A sensitive method for detecting dinucleotide islands and
clusters through depth analysis – abstract, RJDNMD 2011, 18, 2: 165–170.
ROBERTS R.J., How restriction enzymes became the workhorses of molecular biology, Proc Natl Acad Sci USA. 2005 Apr 26;
102(17):5905–8.
http://tools.neb.com/NEBcutter2/index.php.
http://rna.lundberg.gu.se/cutter2/.
STRUHL K., Naturally occurring poly(dA-dT) sequences are upstream promoter elements for constitutive transcription in
yeast. Proc Natl Acad Sci U S A. 1985 Dec; 82(24):8419–23.
MALLIK M., LAKHOTIA S.C., Modifiers and mechanisms of multi-system polyglutamine neurodegenerative disorders:
lessons from fly models. J Genet. 2010 Dec; 89(4):497–526.
KENYON J.R., CRAIG I.W., Analysis of the 5’ regulatory region of the human Norrie’s disease gene: evidence that a nontranslated CT dinucleotide repeat in exon one has a role in controlling expression. Gene. 1999 Feb 18; 227(2):181–8.
ASHLEY C.T. Jr., WARREN S.T., Trinucleotide repeat expansion and human disease. Annu Rev Genet 1995, 29:703–28.
PINGOUD A., FUXREITER M., PINGOUD V., WENDE W., Type II restriction endonucleases: structure and mechanism. Cell
Mol Life Sci. 2005 Mar; 62(6):685–707.
BILCOCK D.T., DANIELS L.E., BATH A.J., HALFORD S.E., Reactions of type II restriction endonucleases with 8-base pair
recognition sites. J Biol Chem. 1999 Dec 17; 274(51):36379–86.
BENSON D.A., KARSCH-MIZRACHI I., LIPMAN D.J., OSTELL J., SAYERS E.W., Gen Bank. Nucleic Acids Res. 2010 Jan;
38(Database issue): D46–51.
IONESCU-TÎRGOVIŞTE C., Insulin resistance – what is myth and what is reality? Acta Endocrinologica 2011; VII, 1:123–146.
IONESCU-TÎRGOVIŞTE C., Proinsulin As The Possible Key In The Pathogenesis Of Type 1 Diabetes. Acta Endocrinologica
2009; 5(2):233–249.
FIERS M.W., VAN de WETERING H., PEETERS T.H., VAN WIJK J.J., NAP J.P., DNAVis: interactive visualization of
comparative genome annotations. Bioinformatics. 2006 Feb 1; 22(3):354–5.
SHAH N., COURONNE O., PENNACCHIO L.A., BRUDNO M., BATZOGLOU S., BETHEL E.W., RUBIN E.M., HAMANN B.,
DUBCHAK I., Phylo-VISTA: interactive visualization of multiple DNA sequence alignments. Bioinformatics. 2004 Mar 22; 20(5):636–43.
http://favorov.imb.ac.ru/swan/.
http://www.ruhr-uni-bochum.de/spezzoo/cm/cm_phobos.htm.
http://www.bioinformatics.org/poly/wiki/.
http://pgrc.ipk-gatersleben.de/misa/misa.html.
http://www.biophp.org/minitools/microsatellite_repeats_finder/demo.php.
http://tandem.bu.edu/trf/trf.html.
Received June 3, 2011
ANNEX 1
The pseudocode for the PRSD algorithm based on deletion of nucleotides
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Let k =1
Let d = k
For l = 1 to mu
Let P(l) = 0
For alpha = 1 to beta
For e = 1 to n
Let A = True
Let B = True
Let G = False
For i = 1 to e
if E(l,i) <> N(alpha+i-1) then A = False
Next i
For j = 1 to n-e
128
P. Gagniuc et al.
18
if E(l, j+e+d) <> N(alpha+e+d+j-1) then B = False
19
Next j
20
21
if (beta-alpha-1 < n) then
22
Let G = True
23
Let alpha = 1
24
Let l = l + 1
25
end if
26
27
if A = True & B = True & G = False then
28
29
For t = 1 to alpha+i-1
30
xi(d, P(l)) = xi(d, P(l)) and N(t)
31
Next t
32
33
For e = alpha+i-1+d to beta
34
xi(d, P(l)) = xi(d, P(l)) and N(e)
35
Next e
36
37
P(l) = P(l)+1
38
alpha = alpha+e+d+e-n
39
40
end if
41
42
Next e
43 Next alpha
44 Next l
ANNEX 2
The pseudocode for the PRSD algorithm based on insertion of nucleotides
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
Let k = 1
Let d = k
For l = 1 to mu
Let P(l) = 0
For alpha = 1 to beta
For e = 1 to n-d
Let A = True
Let B = True
Let G = False
For i = 1 to e
if E(l,i) <> N(alpha+i-1) then A = False
Next i
For j = 1 to e-i
if E(l,j+e+d) <> N(alpha+e+d+j-1) then B = False
Next j
if (beta-alpha-1 < n) then
Let G = True
Let alpha=1
Let l=l+1
end if
if A = True & B = True & G = False then
For t = 1 to alpha+i-1
xi(d, P(l)) = xi(d, P(l)) and N(t)
Next t
For f = k to q
xi(d, P(l)) = xi(d, P(l)) and E(l,f)
Next e
For e = alpha+i-1+q to beta
xi(d, P(l)) = xi(d, P(l)) and N(e)
Next e
P(l)=P(l)+1
alpha=alpha+e+d+e-i
end if
Next e
Next alpha
Next l
8
Download