“A DNA Toolkit for Bioinformatics” Sophia Banton CSC 7351 Summer 2007 ABSTRACT DNA sequence analysis is important to bioinformatics. Processing a raw DNA sequence is usually the first step in carrying out projects in bioinformatics. A total of four programs were created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp. The PairwiseAlignment program generates an alignment of two DNA sequences. The PairwiseProteinAlign program does the same thing, with the exception that the input is a pair of protein sequences. The MSA.cpp completes an alignment of three protein sequences. The Trans.cpp translates a DNA sequence into a primary amino acid sequence. The NeedlemanWunsch and Smith-Waterman algorithms are efficient in finding homologous gene sequences and proteins. Progressive alignments of multiple sequences can be carried out using the Feng Doolittle algorithm. The limitation with the algorithm is that early mistakes cannot be corrected. The genetic code and the proteinic codes continue to be both simple in structure and elusive in nature. INTRODUCTION DNA sequence analysis is important to bioinformatics. Processing a raw DNA sequence is usually the first step in carrying out projects in bioinformatics. The goal was to write a package of small programs that manipulate raw DNA sequences. A total of four programs were created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp. The PairwiseAlignment program reads in a pair of DNA sequences in standard FASTA format and generates an aligned sequence. The PairwiseProteinAlign program does the same thing, with the exception that the input is a pair of protein sequences. The MSA.cpp program extends on these two programs and completes an alignment of three or more protein sequences. The Trans.cpp program is the most unique among the group and this program translates a DNA sequence into a primary amino acid sequence. In vivo DNA is translated from its primary nucleotide sequence to a primary amino acid sequence. Within the nucleus of a cell, the DNA is fist replicated and then exported to the cytoplasm where transcription occurs. Transcription involves the formation of messenger RNA (mRNA) from a DNA template. The mRNA is then brought to a ribosome where translation occurs. In translation the mRNA in converted to amino acids sequences. Each amino acid is derived from three DNA nucleotides. For computational simplicity the transcription step can be ignored because RNA only differs from DNA by a single nucleotide. Ignoring this fact, the DNA can be directly translated to amino acid sequences. Due to the nature of DNA, each sequence has a total of six frames from which it can be read. The code is read in triplets so the DNA can be read for the 1st, 2nd, or 3rd codon. Similarly the DNA can be read for the opposite or 3’-5’ end. At this end the sequence can be read from the nth, nth -1, or nth – 2 position. The figure below illustrates this concept for the 5’ to 3’ order. 5' 3' atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa 1 atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa M P K L N S V E G F S S F E D D V * 2 tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat C P S * I A * R G F H H L R T M Y 3 gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata A Q A E * R R G V F I I * G R C I Sequence alignment is a useful step in trying to determine the function of an unknown sequence. A global alignment is most useful when applied to sequences that are similar in both similar in their sequences and size. To align two sequences a score must be found. DNA sequences are scored using a match-mismatch table of constants. Matches are assigned positive scores while mismatches are assigned negative scores. The protein sequences are aligned using the Block Substitution Matrix 62, which is set of values that are used specifically for protein sequence alignment. To make an alignment such as: MRNDPCQ M -NEPCEach column of the alignment is treated independently. Then we find the score of the total alignment by summing all the columns. This process is called dynamic programming. To find the optimal alignment, sub-alignments are found instead of finding all the possible alignments. Aligned sequences contain amino acid residues or nucleotide bases with gaps. Gaps represent evolutionary based insertions and deletions. The idea is that mutations between DNA nucleotides or between amino acids with similar properties will be more tolerated than mutations between pairs that are different. Thus a gap penalty can be viewed as a level of tolerance for evolution of a residue in a molecular sequence. The alignment programs align the input sequences, both DNA and protein using a gap penalty of -3. Gap penalties contribute to the final score of the alignments. The size of the gap penalty relative to the entries in the scoring table or matrix affects the alignment that is finally selected. A higher gap penalty will cause less favorable characters to be aligned, in an attempt not to create too many gaps. A high number of gaps signal a poor alignment. A gap penalty of -3 is fairly reasonable for using the BLOSUM62 matrix. Multiple sequence alignments are more powerful than pair-wise alignments when trying to find the evolutionary history of a protein. Such alignments allow the grouping of proteins and genes into families. Gene or protein families share similar structure, function, and properties. AlGORITHMS Section I : DNA Translation (Trans.cpp This program converts a raw DNA sequence into a protein primary structure. DNA sequence is read into an array of characters. Then the DNA is read from three frames in the forward direction. Primary amino acid sequences are generated for each frame. The protein generated with the longest length is chosen as best protein. The protein is converted to complementary strand, reversed and the steps are repeated. Class Declaration: class trans { private: DNA[] trans_DNA[] public: trans( char (&input)[]){} void Translate(){} int get_prot_Length(){} void get_Protein(){} void print_best_naive_Prot(){} void execute(){} // }; Read in text file Read sequence into an array of char DNA[i] Frame 1 starts at first location in (DNA) Frame 2 at nucleotide 2 (DNA + 1) Frame 3 at nucleotide 3 (DNA + 2) Read triplets and convert to amino acid Store sequence into array [tempProtein] Find the best start and stop nucleotide for the tempProtein Generate a processed protein sequence using best stop Find best stop by reading the DNA from the end (3’) The longest of the generated sequences is the best Transpose(DNA[]) trans_DNA[] REPEAT PROCESS Section II : Sequence Alignments I. Setting up the Scoring Matrix For the DNA sequence alignment program the following scoring table was used. Each match was rewarded with a positive score while, each A G C T A 10 -1 -3 -4 G -1 7 -5 -3 C -3 -5 9 0 T -4 -3 0 8 mismatch earned a negative mark. Also a mismatch between pyrimidines (C-T) or purines (AG) were assigned a score of zero. This is because from an evolutionary perspective changes between those bases would be more favorable and more conserved. The Scoring matrix used for the protein pairwise alignment and multiple sequence alignment programs is the BLOSUM62 matrix shown below. C C 9 S -1 T -1 P -3 A 0 G -3 N -3 D -3 E -4 Q -3 H -3 R -3 K -3 M -1 I -1 L -1 V -1 F -2 Y -2 W -2 S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3 T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3 P -3 -1 1 7 -1 -2 -2 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4 A 0 1 -1 -1 4 0 -2 -2 -1 -1 -2 -1 -1 -1 -1 -1 0 -2 -2 -3 G -3 0 1 -2 0 6 0 -1 -2 -2 -2 -2 -2 -3 -4 -4 -3 -3 -3 -2 N -3 1 0 -1 -1 -2 6 1 0 0 1 0 0 -2 -3 -3 -3 -3 -2 -4 D -3 0 1 -1 -2 -1 1 6 2 0 1 -2 -1 -3 -3 -4 -3 -3 -3 -4 E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -2 -3 -2 -3 Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2 H -3 -1 0 -2 -2 -2 -1 -1 0 0 8 0 -1 -2 -3 -3 -3 -1 2 -2 R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3 K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -2 -3 -2 -3 M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 1 0 -1 -1 I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 3 0 -1 -3 L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 1 0 -1 -2 V -1 -2 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 4 -1 -1 -3 F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1 Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2 W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 1 1 BLOSUM is an abbreviation for “Blocks of Amino Acid Substitution Matrix”. It is a substitution matrix used for sequence alignment of proteins. It is used frequently in bioinformatics to score alignments between related and non-related proteins. The matrix was created by Henikoff and Henikoff (1992; PNAS 89:10915-10919). They created the matrix by scanning the BLOCKS database for much conserved regions of protein families. The regions lacked gaps and so were truly “conserved”. To generate the matrix the scientists counted the relative frequencies of amino acids and their substitution probabilities. A log-odds score for each of the 210 possible substitutions of the 20 standard amino acids was derived and that is the basis of the BLOSUM 62. So the matrix is not computer generated, but rather based on observed alignments. Each score within a BLOSUM matrix is a measure of the likelihood of two amino acids appearing/evolving by chance. The matrices are based on the minimum percentage identity of the aligned protein sequence used in calculating them. Each possible identity or substitution is assigned a score based on its observed frequencies in the alignment of related proteins. The more likely substitutions are assigned positive scores, and those less likely to occur are give negative scores. BLOSUM62 is the matrix calculated by using the observed substitutions between proteins which have 62% or more sequence identity. It is important to note that the likelihood of substitution is based on the biological properties of the amino acid. Substitutions among the members of each of the following groups of amino acids are more likely: aliphatic, polar, acidic, basic, and aromatic. The aliphatic residues are G, A, V, L, I, and M. The polar residues are S, T, and C. The Aromatic residues are F, W, and Y. The Basic residues are H, K, and R. Lastly; the acidic residues are D and E. II. Aligning the Sequences Global Alignment This project implements the Needleman-Wunsch algorithm for finding a global alignment in both the DNA sequence alignment (PairwiseAlignment.cpp) and the protein sequence alignment (ProteinPairwiseAlign.cpp). The process begins by focusing on the last column of each alignment. These are the only three possibilities for the last column of the alignment: alignments in which the last character of A is paired with the last character in B, alignments in which the last character in A is paired with a gap, and alignments in which the last character in B is paired with a gap. Each of these alignments has effectively two parts, a prefix, and the last column. The prefix of the alignment contains every column but the last. MRNDPCQ M –NEPCThe score for each alignment is calculated by scoring the prefix and adding the score for the last column. In each group all of the alignments in this group have the same last column, preceded by prefixes that are different. With the last column held constant, the alignment with the highest score is the alignment with the highest-scoring prefix. Once the highest scoring alignments for the prefixes of the sequences are found, the highest scoring alignment over the entire lengths of the sequences can be found. The two sequences can be called S1 and S2. For any number i, we'll refer to the first i characters of S1 as S1[1...i]. Analogously, for any number j, we'll refer to the first j characters of S2 as S2[1...j]. For any value of i and j, we can calculate the optimal alignment between S1[1...i] and S2[1...j] by finding the highest score among the three possible groups. The BLOSUM62 matrix is used to determine which of these three options is best, and that yields the optimal alignment between S1[1...i] and S2[1...j]. Let i and j range from 0 to the lengths of the sequences (S1 and S2). The value of the optimal alignment for any particular i and j is then used to find the next larger optimal alignment and so on. The Needleman-Wunsch algorithm is shown below: A two-dimensional array (or matrix) is allocated. This matrix is the F matrix, and its (i,j)th entry is often denoted Fij. There is one column for each character in S1, and one row for each character in S2. for i=0 to length(A)-1 F(i,0) <- d*i for j=0 to length(B)-1 F(0,j) <- d*j for i=1 to length(A) for j = 1 to length(B) { Choice1 <- F(i-1,j-1) + S(A(i-1), B(j-1)) Choice2 <- F(i-1, j) + d Choice3 <- F(i, j-1) + d F(i,j) <- max(Choice1, Choice2, Choice3) } The bottom right hand corner of the matrix is the maximum score for any alignments. To find the alignment which generates this score, start from the bottom right cell, and compare the value with the three possible sources (Choice1, Choice2, and Choice3 above) to see which it came from. If Choice1, then S1 (i) and S2 (i) aligned, if Choice2, then S1 (i) is aligned with a gap, and if Choice3, then S2(i) is aligned with a gap. Embedded in the algorithm is a weight Matrix which gives the alignment between the sequences. Local Alignment The local alignment is very similar to the global alignment. Local AlignmentA <- "" AlignmentB <- "" i <- length(A) j <- length(B) while (i > 0 AND j > 0) { Score <- F(i,j) ScoreDiag <- F(i - 1, j - 1) ScoreUp <- F(i, j - 1) ScoreLeft <- F(i - 1, j) if (Score == ScoreDiag + S(A(i-1), B(j-1))) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- B(j-1) + AlignmentB i <- i - 1 j <- j - 1 } else if (Score == ScoreLeft + d) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- "-" + AlignmentB i <- i - 1 } otherwise (Score == ScoreUp + d) { AlignmentA <- "-" + AlignmentA AlignmentB <- B(j-1) + AlignmentB j <- j - 1 } } while (i > 0) { AlignmentA <- A(i-1) + AlignmentA AlignmentB <- "-" + AlignmentB i <- i - 1 } while (j > 0) { AlignmentA <- "-" + AlignmentA AlignmentB <- B(j-1) + AlignmentB j <- j - 1 } are alignments are best for non-similar sequences that contain regions of homology. This program implements the Smith-Waterman algorithm for finding local alignments. The Smith-Waterman method highlights only those regions of the alignment which have positive scores. For each cell of the matrix, the algorithm considers each path that leads to it. The paths can be of any length and can contain gaps (insertions and deletions). Instead of looking at an entire sequence at once, the S-W algorithm compares multi-lengthed segments, looking for whichever segment maximizes the scoring measure. The goal here is to align the sequences while ignoring the badly aligned areas of the entire alignment. The only change in the algorithm as compared to the global is during the initialization and iteration step. Here the matrix is set to zero at all cells. Initialization: F(0, j) = F(i, 0) = 0 Iteration: 0 F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(xi, yj) When finding the maximum score for a matrix entry, if the maximum score is negative, we instead make it 0. This is done because we are searching for the optimal substring; if a part of the alignment is negative it can be ignored. III. Multiple sequence alignment (MSA) The multiple sequence alignment method used was progressive alignment. The algorithm used was Feng-Doolittle algorithm. First global alignments are generated for all the input sequence, then a distance tree is created to keep track of the alignments, and then all sequences are aligned to the best pair. Due to shortage of time, only the first step of the algorithm was implemented. Global alignments were derived for all the sequences using a for-loop that aligned each pair of sequences by generating instances of the global alignment class. The global alignment class is the same as the one used for the pairwise sequence alignments. RESULTS This section will display the results generated by each program. DNA sequence alignment: DNA Translation (Trans.cpp): Protein sequence Alignement (ProteinPairWiseAlign.cpp): Multiple Protein Sequence Alignment FUTURE WORKS The next logical step is to improve the multiple sequence alignment program. Currently the program does not align the sequences as a group. From the output above it can be inferred which sets of sequences are most similar, but similarities between all the proteins cannot be seen.. With the trans.cpp (translation program), more efficient program would do a better job of trimming the excess amino acids from the “best” proteins. CONCLUSION It is necessary to process raw DNA sequences in order for biologists to obtain meaningful data from the various genomes that have been sequenced. Predicting the primary sequence can assist in understanding the function of gene. Sequence alignment is a useful tool is finding gene or protein analogs. Hypothetical genes and protein don’t always amount to real genes and proteins. The Needleman-Wunsch and Smith-Waterman algorithms are efficient in finding homologous gene sequences and proteins. The genetic code and the proteinic codes continue to be both simple in structure but elusive in nature. REFERENCES Agarwal, Pankaj K., Orlando, David (2003). Lecture 15: Multiple Sequence Alignment. CPS260/BGT204.1 Algorithms in Computational Biology. http://www.cs.duke.edu/courses/cps260/fall03/notes/lecture15.pdf Bhattacharya, D., Haque,R. and Singzh,U. (2005). Coding and Noncoding Genomic Regions of Entamoeba histolytica Have Significantly Different Rates of Sequence Polymorphisms: Implications for Epidemiological Studies. J. Clin. Microbiol. 43 (9), 4815-4819 Champe C., Pamela, Harvey, Richard A. and Ferrier, Denise R. (2005). Lippincott's Illustrated Reviews: Biochemistry (3rd ed.). Lippincott Williams & Wilkins Needleman, S.B., Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequences of two proteins. 1970 Journal of Molecular Biology 48:443-453. Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197. Waterman, M.S., Eggert, M. A new algorithm for best subsequence alignments with applications to tRNA-rRNA comparisons. 1987 Journal of Molecular Biology 197:723-728.