DNA Sequence Analysis Using Boolean Algebra

advertisement
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
BIOMEDICAL ENGINEERING
IMPROVED ALGORITHM FOR GLOBAL
ALIGNMENT IN DNA SEQUENCING
1 V.
ANITHA , 2 B.POORNA
1 Sr.Lecturer,
2 Corresponding
Panimalar Engineering College,Chennai,India
author: Professor, Easwari Engineering College, Chennai,India
anithamoses_06@hotmail.com , poornasundar@yahoo.com
ABSTRACT : Alignment is the most basic component of biological sequence manipulation and has diverse
applications in sequence assembly, sequence annotation, structural and functional predictions for genes and
proteins, phylogeny and evolutionary analysis. Classical methods like Needleman Wunsch ( for global
alignment) in 1970 and Smith-Waterman (for local alignment) in 1981 suffer from the drawback that it involves
a large number of computational steps and has a statically allocate a large section of memory for computer
implementation. This paper suggests an algorithm for global alignment between two DNA sequences and
compares the performance of the algorithm with Needleman Wunsch. Complexity calculation shows that the
proposed algorithm has a much less time complexity and requires very much less amount of memory storage
than Needleman Wunsch algorithm
1. INTRODUCTION
DNA carries the genetic information for life as we
know it. Before its identification by Watson and
Crick in 1953, the quantum physicist Schrödinger
had already accurately predicted the carrier of genetic
information to be an “a periodic crystal”: a structured
medium (crystal) capable of storing information
because of variation allowed within the structure (a
periodicity)[5]. With more and more complete
genomes of prokaryotes and eukaryotes becoming
available and the completion of human genome
project in the horizon, fundamental questions
regarding the characteristics of these sequences arise.
Life represents order. It is not chaotic or random [7].
Thus, we expect the DNA sequences that encode life
to be non random. In other words, they should be
very compressible. There are also strong biological
evidences that support this claim: it is well-known
that DNA sequences, especially in higher eukaryotes,
contain many (approximate) tandem repeats; it is
believed that there are only about a thousand basic
protein folding patterns; it also has been conjectured
that genes duplicate themselves sometimes for
evolutionary or simply for “selfish: purposes. All
these give more concrete support that the DNA
sequences should be reasonably compressible.
However, such regularities are often blurred by
random mutation, translocation, cross-over and
reversal events, as well as sequencing errors. It is
well recognized that the compression of DNA
sequences is a very difficult task [10]. The DNA
sequences only consists of 4 nucleotide bases { A, C,
G, T}, 2 bits are enough to store each base. So a
mapping between every pair of letters that are
derived from the same ancestral letter through the
replication of cells called evolutionary relationships
is done on some biological sequences. This has the
disadvantage that it is very restrictive in not allowing
point mutation or convergent evolution and too
permissive by not seeking to find the truly
orthologous segments between the sequences of two
species. This finds all sequence similarities that are
more significant than a threshold above random
similarity, implying some common function[6].
The goals of the genomics community are to
sequence and align a large number of species in order
to study their biology and evolution through
comparative sequence analysis. The unsolved
problems in alignment are improved pair wise
alignment with a statistical basis improved alignment
based gene prediction, effective multiple genomic
sequence alignment, better alignment browsers and
rigorous methods for evaluating the accuracy of
alignment. A simple analysis of the algorithm reveals
that it has a significantly high time complexity lying
between nxm and nxm2 and space complexity of the
order of nxm, where n and m denote the lengths of
the shorter and longer sequences respectively.
Sequence evolution is complicated and difficult to
model probabilistically and therefore rigorous yet
practical statistical methods for alignment and
objective criteria for evaluation of alignments are
really hard to obtain.
2. DNA SEQUENCE
DNA molecules are composed of single or double
DNA fragments or often called oligonucleotides
(oligos for short) or strands. Nucleotides form the
basis of DNA. A single stranded fragment has a
phosphor-sugar backbone and four kinds of bases
denoted by the symbols A, T, G and C for the bases
adenine, thymine, guanine and cytosine respectively.
ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2
Page 15
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
BIOMEDICAL ENGINEERING
These four nucleic acids, which can occur in any
order combined in Watson-Crick complementary
pairs to form a double strand helix of DNA. Due to
the hybridization reaction, A is complementary with
T and C is complementary with G. Base pairs are the
most common unit for measuring the length of a
DNA. A DNA can be specified uniquely by listing its
sequence of nucleotides on base pairs.
As an
example, any sequence oligonucleotides, such as 5’ –
ACCTG – 3’ has a complementary sequence, 3’ TGGAC – 5’. Digits 5’ and 3’ denote orientation of
DNA oligonucleotides.
Fig. I Double helix
Needleman Wunsch proposed an algorithm based on
dynamic programming to solve the alignment
problem. The main disadvantage of this method is
that the scoring matrix construction and trace back
causes a significant degradation in the runtime of the
above algorithm. The Altschul’s method to search
for similarities between a query and all the sequences
in a databases matches by first looking for small
segments of the input sequence to match with other
sequences[13]. It then builds from those matched
regions to the largest ungapped region it can find.
The method proposed by Pearson and Lipman
searches similarities between one sequence and any
group of sequences of the same types [2].
2.1 Existing Method
The genome is the complete set of DNA molecules
inside any cell of a living organism that is passed
from one generation to its offspring. The DNA is
considered the blue print of life because it encodes
the information necessary to produce the proteins
required for all cellular processes. This makes living
beings biologically similar or distinct. Early sequence
alignment programs used the unitary scoring matrix.
A unitary matrix scores all matches the same and
penalizes all mismatches the same. Although this
scoring is sometimes appropriate for DNA
comparisons, using a unitary matrix amounts to
proclaiming ignorance about DNA evolution and
structure. Thirty years of research in aligning protein
sequences have shown that different matches and
mismatches among the pairs that are found in
alignments require different scores. Many
alternatives to the unitary scoring matrix have been
suggested [12]. Swagatam Das and Debangshu Dey
[15] is designed a new algorithm for local alignment
in DNA and Subhra Sundar Bandyopadhyay et al. [
14] have designed a algorithm for direct comparison
and compare the performance of our algorithm with
that of the classical method and also propose an
alternate scoring scheme based on fuzzy concept ,
discuss its advantage over conventional PAM matrix.
Guralnik et al. [3] have designed a K-means based
algorithm for clustering protein sequences. Eva
Bolten et al. [16] have used transitive homology, a
graph theoretic based approach for clustering. Protein
sequences for structure prediction. Single linkage
clustering method is computationally very expensive
for large set of protein sequences and it also suffers
from chaining effect, Even in graph based approaches
the distance matrix values are to be calculated and is
also expensive for large data sets. In both the cases,
distance matrix may not be accommodated in main
memory and it increases the disk I/O operations.
Here, we propose an algorithm as simple and the
results are discussed in the following sections.
2.1.1 Needleman-Wunsch Algorithm
Needleman-Wunsch uses dynamic programming in
order to obtain global alignment between two
sequences. Global alignment, as the name suggests
takes into account all the elements of the two
sequences while aligning the two sequences. We can
also call it as an “end to end” alignment.
In Needleman-Wunsch algorithm, a scoring matrix of
size m*n (m being the length of longer sequence and
n being that of the shorter sequence) is first formed.
The optimal score at each matrix position is
calculated by adding the current match score to
previously scored positions and subtracting gap
penalties. Each matrix position may have a positive,
negative or 0 value.
For two sequences
S1= a1a2..................am
S2= b1b2..................bn
where Tij=T(a1a2..................am, b1b2..................bn)
then the element at the i,jth position of the matrix T ij is
given by
Tij =Max{ Ti-1,j-1+s,
Max(Ti-x,j – px), x >=1
Max(Ti,j-y – py),y>=1 }
(1)
Where Tij is the score at position i in the sequence S1
and j in the sequence S2, T(ai bj) is the score for
aligning the characters at positions i and j, p x is the
penalty for a gap of length x in the sequence S1 and
py is the penalty for a gap of length y in the sequence
S2. After the T matrix is filled up, to determine an
optimal alignment of the sequences from scoring
matrix, a method called trace back is used. The trace
back keeps track of the position in the scoring matrix
that contributed to the highest overall score found.
The positions may align or may be next to a gap,
depending on the information in the trace back
matrix. There may exist multiple maximal
alignments.
2.2 Proposed Method
The main kinds of information stored in biological
databases are DNA and protein sequences. The
International
Nucleotide
Sequence
Database
Collaboration, which is comprised of the DNA
ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2
Page 16
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
BIOMEDICAL ENGINEERING
DataBank of Japan(DDBJ), the European Molecular
Biology Laboratory(EMBL), and the GenBank at the
National Center for Biotechnology Information, hosts
more than 25 million sequence records comprising
more than 32 billion nucleotides. The dynamic
programming algorithm for alignment between two
DNA
sequences proposed by Smith and Waterman,
Needleman-Wunsch is a very well known and
versatile algorithm, and has been widely been
referred in the domain of Bioinformatics. The scoring
matrix construction and trace-back causes a
significant degradation in the runtime of the above
algorithm. Sequence similarity is actually a wellknown problem in computer science. For the
computer scientist, biomolecular sequences are just
another source of data. As biological databases grow
in size,
faster algorithms and tools are needed. The
information is saved in binary strings that are made
up of 0 and 1 integers at computers, similarly it is
saved in DNA strings that are build of A, T, C and G
molecules in living individuals. In the other words,
the DNA is capable to perform computing and
processing tasks and in the method of saving the
information on genes. Clustering is an active topic in
pattern recognition, data mining, statistics and
machine learning with diverse emphasis[1]. The
earlier approaches do not adequately consider the fact
that the data set can be too large and may not fit in
the main memory of some computers. In
bioinformatics, the number of DNA sequences is now
more than half a million. It is necessary to devise
efficient algorithms to minimize the disk I/O
operations. The problem we have considered here is
to reduce the time and space complexities.
2.3 Encoding Of Nitrogen Base
Let the four kinds of nitrogen bases Adenosine,
Cytosine, Guanine and Thymine be represented by a
pair of binary numbers 00, 01, 10 and 11
respectively. Thus the given sequence AGCTAT will
be represented by the binary string 00 10 01 11 00
11.
Thus any given DNA sequence may be
represented by a binary string.
3. SIMILARITY BETWEEN SEQUENCES
Let S1 and S2 be two DNA sequences the similarity
between which needs to be determined. For example
S1= AGTCATGGCCAA and its binary equivalent is
001011010011101001010000.
Also
S2=AGTCCTGCCCAC and its binary equivalent is
001011010111100101010001.
A NOR function
operation between two binary variables in Boolean
algebra satisfies the condition that if both the binary
variables are the same the output is 1, else it is equal
to zero [8]. Let X and Y are the two binary variables.
Applying NOR function between the two sequences
gives
X
0
0
1
1
Y
0
1
0
1
X NOR Y
1
0
0
1
table 1 nor gate
Now when this NOR function is applied between the
binary strings corresponding to the DNA sequences
results in another binary string, with a 1 value when
the two corresponding bits are one.
Let R be the resultant string. The number of 1’s in R
gives a measure of the similarity between the two
sequences. Replace two successive one’s by one if
they are same and replace two successive bits by 0 if
they are different. Thus the resultant binary string
will have a 1 when the corresponding characters
match else it is a zero.
3.1 Proposed Algorithm
Step 0: Start
Step 1:
Input two binary strings B1, B2
corresponding to the two DNA sequences S1 and S2
Step 2: Form the binary string of length ‘n’ for S1
Step 3: Form the binary string of length ‘m’ for S2
Step 4: Perform exclusive NOR operation between
corresponding bits of S1 and S2 to form string R of
length p, where p= min (n, m)
j=0
Step 5: For i = 1 to p step 2
Begin
If R (i) = R(i+1) then Oj = 1
else Oj = 0
j = j+1
End
cnt = 0
half = p/2
Step 6: For j = 1 to half
If Oj = 0 then cnt = cnt+1
Step 7: mtch = ((half – cnt) / half ) x 100
Step 7: The two strings match in mtch %
Step 8: End
Thus for the example given above
S1= A G T C A T G G C C A A
B1=00 10 11 0100 11 10 10 01 01 00 00
S2= A G T C C T G C C C A C
B2= 00 10 11 01 01 11 10 01 01 01 00 01
R(B1NORB2)=11 11 11 11 10 11 11 00 11 11 11 10
O=
1 1 1 1 0 1 1 0 1 1 1 0
Match = 75%
A simple subtraction of the corresponding bits will
not result in a binary string, hence the logical
function NOR is chosen.
4.
COMPLEXITY
OF
PROPOSED
ALGORITHM
Let the lengths of the two sequences be n and m,
where sequence1 have length n and sequence2 have
length m. We have to match all the elements of the
sequence of length n and m.
ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2
Page 17
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
BIOMEDICAL ENGINEERING
After matching we have to check remaining (m-i)
elements of sequence of length m. In worst case, for
(i+1)th element the number of match=1 and for other
remaining elements the number of mismatch =0.
Therefore the number of comparisons =n+0+1x(m-i1)
(2)
Again, ith element of sequence2 may have one
matching, let it be denoted by ki. Therefore the total
number of comparisons for ith element in worst case.
=n+ki x[ 0+1x(m-i-1)]
(3)
Therefore the total number of comparisons
p

=
[n+ki x{0+1x(m-i-1)}]
(4)
 
i=1 j=1
During trace-back step, let the optimum global
alignment consists of x number of elements and the
kth element has row number nk, column number mk.
Consider there be more than one number of optimum
global alignments, let it be denoted by p and lth
alignment has a length xl. Therefore the number of
comparisons for the trace back step
p
xl nk mk
   
=px
p

[ki x{0+1x(m-i-1)}]
(5)
i=1
After finding all the alignments, we have to find the
optimum alignment.
p
Now, the number of alignments=

ki
(6)
i=1
therefore the total number of comparisons=
p
=(

ki )-1
(7)
i=1
Since for finding the optimum alignment, we select
the score of the first alignment and compare with
alignments.
Therefore the total number of comparisons of the
proposed algorithm
p
=nxm

[ki x{0+1x(m-i-1)}]+
i=1
p
(

ki ) -1
i=1
=nxm-0+1x
(8)
p

[ki x(m-i)]
(9)
i=1
5. COMPUTATIONAL COMPLEXITY OF
NEEDLEMAN- WUNSCH ALGORITHM
During matrix fill up steps, the score at the position
(i,j) of the matrix is of the form by referring “Eq. (1)”
Tij =Max{ Ti-1,j-1+s,
Max(Ti-x,j – px), x >=1
Max(Ti,j-y – py),y>=1 }
where px is the penalty for a gap of length x and py is
the penalty for a gap of length y and T is the score for
aligning the characters at the position (i,j). The
number of comparisons needed for calculating the
scoring matrix.
N
M
(10)
(i+j)
(11)
l=1 k=1 i=1 j=1
xl
i=1
=nxm
(i+j) =[MxNx(M+N+2)]/2

[{mkxnkx(mk+nk+2)}/2]
(12)
k=1
Therefore, the total number of comparisons
xl
=[MxNx(M+N+2)]/2+px

[{mkxnkx(mk+nk+2)}/
2]
k=1
(13)
in the worst case, the length of local alignment is
equal to the length of global alignment ie x =M
therefore, the total number of comparisons
M
=[MxNx(M+N+2)]/2+px
2]

[{mkxnkx(mk+nk+2)}/
k=1
(14)
6. EXPERIMENTAL RESULTS
To evaluate the performance, a DNA sequence is
considered
6.1 DNA Sequence Data Set
DNA sequences of human mitochondrial genome
have been collected from Genbank. It contains 16571
sequences. From randomly 1919 sequences were
selected for training and 100 for testing. The
experiments were done on Intel Pentium-IV
processor based machine having a clock frequency of
1700 Mhz and 512 MB RAM. The sample result for
the proposed algorithm is shown in the table below.
Sequence1
Sequence2
GTACCTTGGA
GGATCGA
A
GGATCGA
GAATTCAGTT
A
GTACCTGAA
GAACTGA
GTCACGTA
AGTCAGCAT
CTAGCTG
A
ACTGCA
GTACTGATC
TCGATAG
ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2
Global
alignment
GTACCTTGGA
GGATC GAA
GAATTCAGTT
A
GGAT - C -G
A
GTACCTGAA
GAAC- TGAGTCACGTA
CTAGCTGA
AGTCAGCAT
ACT
GCAGTACTGATC - T- C –GATAG
Page 18
JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN
BIOMEDICAL ENGINEERING
GTCGATCGTA
G
CGTACGT
A
GTACGTACGT
A
CGTAGCT
CTAGTCATC
GTACCAT
GACTGATC
CGTAC
GTCGATCGTA
G
CGTACGTAGTACGTACGT
A
CGTAGCTCTAGTCATC
GTA- CCATGACTGAT- C
C- G - TAC
Table 2 sample result for proposed algorithm
7. CONCLUSION
Experimental results on a DNA sequence data set
show that it performs well by comparing at low
computation cost and this may be used to find the
superfamily, family and subfamily relationships in
DNA sequences. A binary representation for DNA
sequence is given. The comparison of sequence using
logic gates is proposed. This method gives the
accuracy with which the two sequences match. It
also provides information about which portion of the
sequence do not match. From the complexity
calculation, it is seen that the proposed algorithm has
a much less time complexity and requires very much
less amount of memory storage with respect to
Needleman-Wunsch algorithm. This is a simple
method of comparing two DNA sequences. This
method not only determines the percentage of
matching of two sequences, but also identified the
portion of the sequence that does not match.
Therefore, it is useful in most of the cases where less
computational burden on the processor and less
system memory usage is encouraged. We further aim
at clustering a large set of DNA sequences consisting
of sequences from various families and evaluate the
performance of the proposed algorithm with
necessary modifications. The proposed algorithm can
also be used on text and web document collection.
[6] K. Lanctot, M. Li, E. H. Yang. Estimating DNA
sequence entropy, to appear in SODA ‘2000.
[7]
M. Li, P. Vitanyi, An introduction to
kolmogorov complexity and its applications,
Springer, 2nd ed., 1997.
[8] M. Morris Mano, Computer System Architecture
[9] Needleman Wunsch Algorithm –Global
Alignment Algorithm using Dynamic Programming.
[10] R.Curnow and T. Kirkwood, Statistical analysis
of deoxyribonucleic acid sequence data-a review.
J. Royal Statistical Society, 152(1989), 199-200.
[11]
S.C. Rastogi, Namita Mendiratta, Parag
Rastogi, Bioinformatics concepts, skills and
applications.
[12] Sequence alignment, Journal of computational
biology, 147, pp.195-197, 1970.
[13] S. Grumbach and F. Tahi, “A new challenge for
compression algorithms: genetic sequences”,Journal
of Information Processing and Management,
30:6(1994), 875-866.
[14] Subhra Sundar Bandyopadhyay, Somnath Paul
and Amit Konar, Improved algorithm for DNA
sequence alignment and revision of Scoring matrix,
Proc. of ICISIP 2005.
[15] Swagatam Das & Debangshu Dey, A new
algorithm for local alignment in DNA sequencing,
Proc. of IEEE conference, INDICON 2004.
[16] V. Guralnik and G.karypis, A scalable algorithm
for clustering sequential data, Proc. of I st IEEE
conference on Data Mining, 2001.
8. REFERENCES
[1] A. K. Jain, M.N. Murty, and P.J. Flynn, Data
clustering: A Review, ACM Computing Surveys,
Vol.
31, 3, pp.264-323, 1999.
[2] E.Rivals, J-P. Delahaye, M. Dauchet and
O.Delgrange. A Guaranteed Compression Scheme for
Repetitive DNA Sequences. LIFL Lille I University,
technical report IT-285, 1995.
[3] Eva Bolten, Alexander Schliep and Sebastian
Schneckener, Clustering Protein Sequences-Structure
prediction by transitive homology, GCB,1999.
[4] Harshawardhan P.Bal, Bioinformatics principles
and applications.
[5] J.H.M. Dassen, DNA Computing promises,
problems
and
perspective,
report
(http://citeseer.nj.nec.com/jhm.html)
ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2
Page 19
Download