JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN BIOMEDICAL ENGINEERING IMPROVED ALGORITHM FOR GLOBAL ALIGNMENT IN DNA SEQUENCING 1 V. ANITHA , 2 B.POORNA 1 Sr.Lecturer, 2 Corresponding Panimalar Engineering College,Chennai,India author: Professor, Easwari Engineering College, Chennai,India anithamoses_06@hotmail.com , poornasundar@yahoo.com ABSTRACT : Alignment is the most basic component of biological sequence manipulation and has diverse applications in sequence assembly, sequence annotation, structural and functional predictions for genes and proteins, phylogeny and evolutionary analysis. Classical methods like Needleman Wunsch ( for global alignment) in 1970 and Smith-Waterman (for local alignment) in 1981 suffer from the drawback that it involves a large number of computational steps and has a statically allocate a large section of memory for computer implementation. This paper suggests an algorithm for global alignment between two DNA sequences and compares the performance of the algorithm with Needleman Wunsch. Complexity calculation shows that the proposed algorithm has a much less time complexity and requires very much less amount of memory storage than Needleman Wunsch algorithm 1. INTRODUCTION DNA carries the genetic information for life as we know it. Before its identification by Watson and Crick in 1953, the quantum physicist Schrödinger had already accurately predicted the carrier of genetic information to be an “a periodic crystal”: a structured medium (crystal) capable of storing information because of variation allowed within the structure (a periodicity)[5]. With more and more complete genomes of prokaryotes and eukaryotes becoming available and the completion of human genome project in the horizon, fundamental questions regarding the characteristics of these sequences arise. Life represents order. It is not chaotic or random [7]. Thus, we expect the DNA sequences that encode life to be non random. In other words, they should be very compressible. There are also strong biological evidences that support this claim: it is well-known that DNA sequences, especially in higher eukaryotes, contain many (approximate) tandem repeats; it is believed that there are only about a thousand basic protein folding patterns; it also has been conjectured that genes duplicate themselves sometimes for evolutionary or simply for “selfish: purposes. All these give more concrete support that the DNA sequences should be reasonably compressible. However, such regularities are often blurred by random mutation, translocation, cross-over and reversal events, as well as sequencing errors. It is well recognized that the compression of DNA sequences is a very difficult task [10]. The DNA sequences only consists of 4 nucleotide bases { A, C, G, T}, 2 bits are enough to store each base. So a mapping between every pair of letters that are derived from the same ancestral letter through the replication of cells called evolutionary relationships is done on some biological sequences. This has the disadvantage that it is very restrictive in not allowing point mutation or convergent evolution and too permissive by not seeking to find the truly orthologous segments between the sequences of two species. This finds all sequence similarities that are more significant than a threshold above random similarity, implying some common function[6]. The goals of the genomics community are to sequence and align a large number of species in order to study their biology and evolution through comparative sequence analysis. The unsolved problems in alignment are improved pair wise alignment with a statistical basis improved alignment based gene prediction, effective multiple genomic sequence alignment, better alignment browsers and rigorous methods for evaluating the accuracy of alignment. A simple analysis of the algorithm reveals that it has a significantly high time complexity lying between nxm and nxm2 and space complexity of the order of nxm, where n and m denote the lengths of the shorter and longer sequences respectively. Sequence evolution is complicated and difficult to model probabilistically and therefore rigorous yet practical statistical methods for alignment and objective criteria for evaluation of alignments are really hard to obtain. 2. DNA SEQUENCE DNA molecules are composed of single or double DNA fragments or often called oligonucleotides (oligos for short) or strands. Nucleotides form the basis of DNA. A single stranded fragment has a phosphor-sugar backbone and four kinds of bases denoted by the symbols A, T, G and C for the bases adenine, thymine, guanine and cytosine respectively. ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2 Page 15 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN BIOMEDICAL ENGINEERING These four nucleic acids, which can occur in any order combined in Watson-Crick complementary pairs to form a double strand helix of DNA. Due to the hybridization reaction, A is complementary with T and C is complementary with G. Base pairs are the most common unit for measuring the length of a DNA. A DNA can be specified uniquely by listing its sequence of nucleotides on base pairs. As an example, any sequence oligonucleotides, such as 5’ – ACCTG – 3’ has a complementary sequence, 3’ TGGAC – 5’. Digits 5’ and 3’ denote orientation of DNA oligonucleotides. Fig. I Double helix Needleman Wunsch proposed an algorithm based on dynamic programming to solve the alignment problem. The main disadvantage of this method is that the scoring matrix construction and trace back causes a significant degradation in the runtime of the above algorithm. The Altschul’s method to search for similarities between a query and all the sequences in a databases matches by first looking for small segments of the input sequence to match with other sequences[13]. It then builds from those matched regions to the largest ungapped region it can find. The method proposed by Pearson and Lipman searches similarities between one sequence and any group of sequences of the same types [2]. 2.1 Existing Method The genome is the complete set of DNA molecules inside any cell of a living organism that is passed from one generation to its offspring. The DNA is considered the blue print of life because it encodes the information necessary to produce the proteins required for all cellular processes. This makes living beings biologically similar or distinct. Early sequence alignment programs used the unitary scoring matrix. A unitary matrix scores all matches the same and penalizes all mismatches the same. Although this scoring is sometimes appropriate for DNA comparisons, using a unitary matrix amounts to proclaiming ignorance about DNA evolution and structure. Thirty years of research in aligning protein sequences have shown that different matches and mismatches among the pairs that are found in alignments require different scores. Many alternatives to the unitary scoring matrix have been suggested [12]. Swagatam Das and Debangshu Dey [15] is designed a new algorithm for local alignment in DNA and Subhra Sundar Bandyopadhyay et al. [ 14] have designed a algorithm for direct comparison and compare the performance of our algorithm with that of the classical method and also propose an alternate scoring scheme based on fuzzy concept , discuss its advantage over conventional PAM matrix. Guralnik et al. [3] have designed a K-means based algorithm for clustering protein sequences. Eva Bolten et al. [16] have used transitive homology, a graph theoretic based approach for clustering. Protein sequences for structure prediction. Single linkage clustering method is computationally very expensive for large set of protein sequences and it also suffers from chaining effect, Even in graph based approaches the distance matrix values are to be calculated and is also expensive for large data sets. In both the cases, distance matrix may not be accommodated in main memory and it increases the disk I/O operations. Here, we propose an algorithm as simple and the results are discussed in the following sections. 2.1.1 Needleman-Wunsch Algorithm Needleman-Wunsch uses dynamic programming in order to obtain global alignment between two sequences. Global alignment, as the name suggests takes into account all the elements of the two sequences while aligning the two sequences. We can also call it as an “end to end” alignment. In Needleman-Wunsch algorithm, a scoring matrix of size m*n (m being the length of longer sequence and n being that of the shorter sequence) is first formed. The optimal score at each matrix position is calculated by adding the current match score to previously scored positions and subtracting gap penalties. Each matrix position may have a positive, negative or 0 value. For two sequences S1= a1a2..................am S2= b1b2..................bn where Tij=T(a1a2..................am, b1b2..................bn) then the element at the i,jth position of the matrix T ij is given by Tij =Max{ Ti-1,j-1+s, Max(Ti-x,j – px), x >=1 Max(Ti,j-y – py),y>=1 } (1) Where Tij is the score at position i in the sequence S1 and j in the sequence S2, T(ai bj) is the score for aligning the characters at positions i and j, p x is the penalty for a gap of length x in the sequence S1 and py is the penalty for a gap of length y in the sequence S2. After the T matrix is filled up, to determine an optimal alignment of the sequences from scoring matrix, a method called trace back is used. The trace back keeps track of the position in the scoring matrix that contributed to the highest overall score found. The positions may align or may be next to a gap, depending on the information in the trace back matrix. There may exist multiple maximal alignments. 2.2 Proposed Method The main kinds of information stored in biological databases are DNA and protein sequences. The International Nucleotide Sequence Database Collaboration, which is comprised of the DNA ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2 Page 16 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN BIOMEDICAL ENGINEERING DataBank of Japan(DDBJ), the European Molecular Biology Laboratory(EMBL), and the GenBank at the National Center for Biotechnology Information, hosts more than 25 million sequence records comprising more than 32 billion nucleotides. The dynamic programming algorithm for alignment between two DNA sequences proposed by Smith and Waterman, Needleman-Wunsch is a very well known and versatile algorithm, and has been widely been referred in the domain of Bioinformatics. The scoring matrix construction and trace-back causes a significant degradation in the runtime of the above algorithm. Sequence similarity is actually a wellknown problem in computer science. For the computer scientist, biomolecular sequences are just another source of data. As biological databases grow in size, faster algorithms and tools are needed. The information is saved in binary strings that are made up of 0 and 1 integers at computers, similarly it is saved in DNA strings that are build of A, T, C and G molecules in living individuals. In the other words, the DNA is capable to perform computing and processing tasks and in the method of saving the information on genes. Clustering is an active topic in pattern recognition, data mining, statistics and machine learning with diverse emphasis[1]. The earlier approaches do not adequately consider the fact that the data set can be too large and may not fit in the main memory of some computers. In bioinformatics, the number of DNA sequences is now more than half a million. It is necessary to devise efficient algorithms to minimize the disk I/O operations. The problem we have considered here is to reduce the time and space complexities. 2.3 Encoding Of Nitrogen Base Let the four kinds of nitrogen bases Adenosine, Cytosine, Guanine and Thymine be represented by a pair of binary numbers 00, 01, 10 and 11 respectively. Thus the given sequence AGCTAT will be represented by the binary string 00 10 01 11 00 11. Thus any given DNA sequence may be represented by a binary string. 3. SIMILARITY BETWEEN SEQUENCES Let S1 and S2 be two DNA sequences the similarity between which needs to be determined. For example S1= AGTCATGGCCAA and its binary equivalent is 001011010011101001010000. Also S2=AGTCCTGCCCAC and its binary equivalent is 001011010111100101010001. A NOR function operation between two binary variables in Boolean algebra satisfies the condition that if both the binary variables are the same the output is 1, else it is equal to zero [8]. Let X and Y are the two binary variables. Applying NOR function between the two sequences gives X 0 0 1 1 Y 0 1 0 1 X NOR Y 1 0 0 1 table 1 nor gate Now when this NOR function is applied between the binary strings corresponding to the DNA sequences results in another binary string, with a 1 value when the two corresponding bits are one. Let R be the resultant string. The number of 1’s in R gives a measure of the similarity between the two sequences. Replace two successive one’s by one if they are same and replace two successive bits by 0 if they are different. Thus the resultant binary string will have a 1 when the corresponding characters match else it is a zero. 3.1 Proposed Algorithm Step 0: Start Step 1: Input two binary strings B1, B2 corresponding to the two DNA sequences S1 and S2 Step 2: Form the binary string of length ‘n’ for S1 Step 3: Form the binary string of length ‘m’ for S2 Step 4: Perform exclusive NOR operation between corresponding bits of S1 and S2 to form string R of length p, where p= min (n, m) j=0 Step 5: For i = 1 to p step 2 Begin If R (i) = R(i+1) then Oj = 1 else Oj = 0 j = j+1 End cnt = 0 half = p/2 Step 6: For j = 1 to half If Oj = 0 then cnt = cnt+1 Step 7: mtch = ((half – cnt) / half ) x 100 Step 7: The two strings match in mtch % Step 8: End Thus for the example given above S1= A G T C A T G G C C A A B1=00 10 11 0100 11 10 10 01 01 00 00 S2= A G T C C T G C C C A C B2= 00 10 11 01 01 11 10 01 01 01 00 01 R(B1NORB2)=11 11 11 11 10 11 11 00 11 11 11 10 O= 1 1 1 1 0 1 1 0 1 1 1 0 Match = 75% A simple subtraction of the corresponding bits will not result in a binary string, hence the logical function NOR is chosen. 4. COMPLEXITY OF PROPOSED ALGORITHM Let the lengths of the two sequences be n and m, where sequence1 have length n and sequence2 have length m. We have to match all the elements of the sequence of length n and m. ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2 Page 17 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN BIOMEDICAL ENGINEERING After matching we have to check remaining (m-i) elements of sequence of length m. In worst case, for (i+1)th element the number of match=1 and for other remaining elements the number of mismatch =0. Therefore the number of comparisons =n+0+1x(m-i1) (2) Again, ith element of sequence2 may have one matching, let it be denoted by ki. Therefore the total number of comparisons for ith element in worst case. =n+ki x[ 0+1x(m-i-1)] (3) Therefore the total number of comparisons p = [n+ki x{0+1x(m-i-1)}] (4) i=1 j=1 During trace-back step, let the optimum global alignment consists of x number of elements and the kth element has row number nk, column number mk. Consider there be more than one number of optimum global alignments, let it be denoted by p and lth alignment has a length xl. Therefore the number of comparisons for the trace back step p xl nk mk =px p [ki x{0+1x(m-i-1)}] (5) i=1 After finding all the alignments, we have to find the optimum alignment. p Now, the number of alignments= ki (6) i=1 therefore the total number of comparisons= p =( ki )-1 (7) i=1 Since for finding the optimum alignment, we select the score of the first alignment and compare with alignments. Therefore the total number of comparisons of the proposed algorithm p =nxm [ki x{0+1x(m-i-1)}]+ i=1 p ( ki ) -1 i=1 =nxm-0+1x (8) p [ki x(m-i)] (9) i=1 5. COMPUTATIONAL COMPLEXITY OF NEEDLEMAN- WUNSCH ALGORITHM During matrix fill up steps, the score at the position (i,j) of the matrix is of the form by referring “Eq. (1)” Tij =Max{ Ti-1,j-1+s, Max(Ti-x,j – px), x >=1 Max(Ti,j-y – py),y>=1 } where px is the penalty for a gap of length x and py is the penalty for a gap of length y and T is the score for aligning the characters at the position (i,j). The number of comparisons needed for calculating the scoring matrix. N M (10) (i+j) (11) l=1 k=1 i=1 j=1 xl i=1 =nxm (i+j) =[MxNx(M+N+2)]/2 [{mkxnkx(mk+nk+2)}/2] (12) k=1 Therefore, the total number of comparisons xl =[MxNx(M+N+2)]/2+px [{mkxnkx(mk+nk+2)}/ 2] k=1 (13) in the worst case, the length of local alignment is equal to the length of global alignment ie x =M therefore, the total number of comparisons M =[MxNx(M+N+2)]/2+px 2] [{mkxnkx(mk+nk+2)}/ k=1 (14) 6. EXPERIMENTAL RESULTS To evaluate the performance, a DNA sequence is considered 6.1 DNA Sequence Data Set DNA sequences of human mitochondrial genome have been collected from Genbank. It contains 16571 sequences. From randomly 1919 sequences were selected for training and 100 for testing. The experiments were done on Intel Pentium-IV processor based machine having a clock frequency of 1700 Mhz and 512 MB RAM. The sample result for the proposed algorithm is shown in the table below. Sequence1 Sequence2 GTACCTTGGA GGATCGA A GGATCGA GAATTCAGTT A GTACCTGAA GAACTGA GTCACGTA AGTCAGCAT CTAGCTG A ACTGCA GTACTGATC TCGATAG ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2 Global alignment GTACCTTGGA GGATC GAA GAATTCAGTT A GGAT - C -G A GTACCTGAA GAAC- TGAGTCACGTA CTAGCTGA AGTCAGCAT ACT GCAGTACTGATC - T- C –GATAG Page 18 JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN BIOMEDICAL ENGINEERING GTCGATCGTA G CGTACGT A GTACGTACGT A CGTAGCT CTAGTCATC GTACCAT GACTGATC CGTAC GTCGATCGTA G CGTACGTAGTACGTACGT A CGTAGCTCTAGTCATC GTA- CCATGACTGAT- C C- G - TAC Table 2 sample result for proposed algorithm 7. CONCLUSION Experimental results on a DNA sequence data set show that it performs well by comparing at low computation cost and this may be used to find the superfamily, family and subfamily relationships in DNA sequences. A binary representation for DNA sequence is given. The comparison of sequence using logic gates is proposed. This method gives the accuracy with which the two sequences match. It also provides information about which portion of the sequence do not match. From the complexity calculation, it is seen that the proposed algorithm has a much less time complexity and requires very much less amount of memory storage with respect to Needleman-Wunsch algorithm. This is a simple method of comparing two DNA sequences. This method not only determines the percentage of matching of two sequences, but also identified the portion of the sequence that does not match. Therefore, it is useful in most of the cases where less computational burden on the processor and less system memory usage is encouraged. We further aim at clustering a large set of DNA sequences consisting of sequences from various families and evaluate the performance of the proposed algorithm with necessary modifications. The proposed algorithm can also be used on text and web document collection. [6] K. Lanctot, M. Li, E. H. Yang. Estimating DNA sequence entropy, to appear in SODA ‘2000. [7] M. Li, P. Vitanyi, An introduction to kolmogorov complexity and its applications, Springer, 2nd ed., 1997. [8] M. Morris Mano, Computer System Architecture [9] Needleman Wunsch Algorithm –Global Alignment Algorithm using Dynamic Programming. [10] R.Curnow and T. Kirkwood, Statistical analysis of deoxyribonucleic acid sequence data-a review. J. Royal Statistical Society, 152(1989), 199-200. [11] S.C. Rastogi, Namita Mendiratta, Parag Rastogi, Bioinformatics concepts, skills and applications. [12] Sequence alignment, Journal of computational biology, 147, pp.195-197, 1970. [13] S. Grumbach and F. Tahi, “A new challenge for compression algorithms: genetic sequences”,Journal of Information Processing and Management, 30:6(1994), 875-866. [14] Subhra Sundar Bandyopadhyay, Somnath Paul and Amit Konar, Improved algorithm for DNA sequence alignment and revision of Scoring matrix, Proc. of ICISIP 2005. [15] Swagatam Das & Debangshu Dey, A new algorithm for local alignment in DNA sequencing, Proc. of IEEE conference, INDICON 2004. [16] V. Guralnik and G.karypis, A scalable algorithm for clustering sequential data, Proc. of I st IEEE conference on Data Mining, 2001. 8. REFERENCES [1] A. K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: A Review, ACM Computing Surveys, Vol. 31, 3, pp.264-323, 1999. [2] E.Rivals, J-P. Delahaye, M. Dauchet and O.Delgrange. A Guaranteed Compression Scheme for Repetitive DNA Sequences. LIFL Lille I University, technical report IT-285, 1995. [3] Eva Bolten, Alexander Schliep and Sebastian Schneckener, Clustering Protein Sequences-Structure prediction by transitive homology, GCB,1999. [4] Harshawardhan P.Bal, Bioinformatics principles and applications. [5] J.H.M. Dassen, DNA Computing promises, problems and perspective, report (http://citeseer.nj.nec.com/jhm.html) ISSN: 0975 – 6752 | NOV 10 TO OCT 11 | Volume 1, Issue 2 Page 19