A DNA Toolkit for Bioinformatics

Sophia Banton
CSC 7351
Summer 2007
ABSTRACT
DNA sequence analysis is important to bioinformatics. Processing a raw DNA sequence
is usually the first step in carrying out projects in bioinformatics. A total of four programs were
created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp. The
PairwiseAlignment program generates an alignment of two DNA sequences. The
PairwiseProteinAlign program does the same thing, with the exception that the input is a pair of
protein sequences. The MSA.cpp program performs an alignment of three protein sequences. The
Trans.cpp program translates a DNA sequence into a primary amino acid sequence. The
Needleman-Wunsch and Smith-Waterman algorithms are effective at finding homologous gene sequences
and proteins. Progressive alignments of multiple sequences can be carried out using the
Feng-Doolittle algorithm. A limitation of that algorithm is that early mistakes cannot be corrected.
The genetic code and the protein code continue to be both simple in structure and elusive in
nature.
INTRODUCTION
DNA sequence analysis is important to bioinformatics. Processing a raw DNA sequence
is usually the first step in carrying out projects in bioinformatics. The goal was to write a
package of small programs that manipulate raw DNA sequences. A total of four programs were
created: PairwiseAlignment.cpp, Trans.cpp, PairwiseProteinAlign.cpp, and MSA.cpp. The
PairwiseAlignment program reads in a pair of DNA sequences in standard FASTA format and
generates an aligned sequence. The PairwiseProteinAlign program does the same thing, with the
exception that the input is a pair of protein sequences. The MSA.cpp program extends these
two programs and completes an alignment of three or more protein sequences. The Trans.cpp
program differs most from the rest of the group: it translates a DNA sequence into a
primary amino acid sequence.
In vivo, a DNA sequence is converted to a primary amino acid sequence in two steps.
Within the nucleus of a cell, transcription forms messenger RNA (mRNA) from a DNA
template. The mRNA is then exported to the cytoplasm and brought to a ribosome, where
translation occurs. In translation the mRNA is converted to an amino acid sequence. Each
amino acid is encoded by three nucleotides (a codon). For computational simplicity the
transcription step can be ignored, because the mRNA differs from the coding DNA strand only
in that uracil (U) replaces thymine (T). The DNA can therefore be translated directly to
an amino acid sequence.
Due to the nature of DNA, each sequence has a total of six frames from which it
can be read. The code is read in triplets, so the DNA can be read starting from the 1st, 2nd,
or 3rd nucleotide. Similarly, the DNA can be read from the opposite (3’-5’) end, that is,
along the complementary strand. From this end the sequence can be read starting from the nth,
(n-1)th, or (n-2)th position. The figure below illustrates this concept for the 5’
to 3’ order.
5' atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa 3'

Frame 1: atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
          M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
Frame 2: tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
          C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
Frame 3: gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
          A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I
Sequence alignment is a useful step in trying to determine the function of an unknown
sequence. A global alignment is most useful when applied to sequences that are similar in both
sequence and size. To align two sequences, candidate alignments must be scored. DNA
sequences are scored using a match-mismatch table of constants: matches are assigned positive
scores, while mismatches are assigned zero or negative scores. The protein sequences are aligned
using BLOSUM62 (Blocks Substitution Matrix 62), which is a set of values used specifically for
protein sequence alignment.
To make an alignment such as:
MRNDPCQ
M-NEPC-
each column of the alignment is treated independently, and the score of the total
alignment is found by summing the scores of all the columns. To find the optimal alignment,
optimal sub-alignments are computed and reused instead of enumerating all the possible
alignments. This process is called dynamic programming.
Aligned sequences contain amino acid residues or nucleotide bases with gaps. Gaps
represent evolutionary based insertions and deletions. The idea is that mutations between DNA
nucleotides or between amino acids with similar properties will be more tolerated than mutations
between pairs that are different. Thus a gap penalty can be viewed as a level of tolerance for
evolution of a residue in a molecular sequence. The alignment programs align the input
sequences, both DNA and protein, using a gap penalty of -3. Gap penalties contribute to the final
score of the alignments. The size of the gap penalty relative to the entries in the scoring table or
matrix affects the alignment that is finally selected. A larger (more negative) gap penalty will
cause less favorable characters to be aligned, in an attempt not to create too many gaps. A high
number of gaps signals a poor alignment. A gap penalty of -3 is a reasonable choice for use with
the BLOSUM62 matrix.
Multiple sequence alignments are more powerful than pair-wise alignments when trying to
find the evolutionary history of a protein. Such alignments allow the grouping of proteins and
genes into families. Gene or protein families share similar structure, function, and properties.
ALGORITHMS
Section I : DNA Translation (Trans.cpp)
This program converts a raw DNA sequence into a protein primary structure. The DNA
sequence is read into an array of characters. Then the DNA is read in three frames in the
forward direction. A primary amino acid sequence is generated for each frame. The longest
protein generated is chosen as the best protein. The DNA is then converted to its
complementary strand and reversed, and the steps are repeated.
Class Declaration:
class trans {
private:
    std::string DNA;        // input strand, read 5' to 3'
    std::string trans_DNA;  // complementary strand, reversed
public:
    explicit trans(const std::string& input);
    void Translate();
    int  get_prot_Length();
    void get_Protein();
    void print_best_naive_Prot();
    void execute();
};

Read in text file

Read sequence into an array of char DNA[i]

Frame 1 starts at first location in (DNA)

Frame 2 at nucleotide 2 (DNA + 1)

Frame 3 at nucleotide 3 (DNA + 2)

Read triplets and convert to amino acid

Store sequence into array [tempProtein]

Find the best start and stop nucleotide for the tempProtein

Generate a processed protein sequence using best stop

Find best stop by reading the DNA from the end (3’)

The longest of the generated sequences is the best

Transpose(DNA[]) -> trans_DNA[]

REPEAT PROCESS
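The steps above can be sketched roughly as follows. This is a minimal illustration, not the code of Trans.cpp: the function names (`translate_frame`, `reverse_complement`, `six_frame_translations`, `longest_orf`) are hypothetical, std::string is used in place of raw character arrays, and "best protein" is assumed here to mean the longest stretch that starts at a methionine (M) and runs to the next stop codon.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with T=0, C=1, A=2, G=3.
static const char kCodonTable[] =
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG";

// Map a nucleotide (either case) to its table index, or -1 if unrecognized.
int base_index(char b) {
    switch (std::toupper(static_cast<unsigned char>(b))) {
        case 'T': return 0;
        case 'C': return 1;
        case 'A': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Translate one reading frame (offset 0, 1, or 2). '*' marks a stop codon,
// 'X' marks a codon that contains an unrecognized base.
std::string translate_frame(const std::string& dna, int offset) {
    assert(offset >= 0 && offset <= 2);
    std::string protein;
    for (std::size_t i = static_cast<std::size_t>(offset); i + 3 <= dna.size(); i += 3) {
        const int a = base_index(dna[i]);
        const int b = base_index(dna[i + 1]);
        const int c = base_index(dna[i + 2]);
        protein += (a < 0 || b < 0 || c < 0) ? 'X' : kCodonTable[16 * a + 4 * b + c];
    }
    return protein;
}

// Complement every base and reverse the strand (output is upper case).
std::string reverse_complement(const std::string& dna) {
    std::string rc(dna.rbegin(), dna.rend());
    for (char& b : rc) {
        switch (std::toupper(static_cast<unsigned char>(b))) {
            case 'A': b = 'T'; break;
            case 'T': b = 'A'; break;
            case 'C': b = 'G'; break;
            case 'G': b = 'C'; break;
            default:  b = 'N'; break;
        }
    }
    return rc;
}

// Frames 0-2 read the input strand; frames 3-5 read its reverse complement.
std::array<std::string, 6> six_frame_translations(const std::string& dna) {
    std::array<std::string, 6> frames;
    const std::string rc = reverse_complement(dna);
    for (int f = 0; f < 3; ++f) {
        frames[f] = translate_frame(dna, f);
        frames[f + 3] = translate_frame(rc, f);
    }
    return frames;
}

// Assumed notion of "best": longest run starting at 'M' and ending before '*'.
std::string longest_orf(const std::string& protein) {
    std::string best;
    std::size_t start = std::string::npos;
    for (std::size_t i = 0; i <= protein.size(); ++i) {
        const bool at_stop = (i == protein.size()) || protein[i] == '*';
        if (at_stop) {
            if (start != std::string::npos && i - start > best.size())
                best = protein.substr(start, i - start);
            start = std::string::npos;
        } else if (start == std::string::npos && protein[i] == 'M') {
            start = i;
        }
    }
    return best;
}
```

For the example sequence used in the introduction, `translate_frame(dna, 0)` gives MPKLNSVEGFSSFEDDV*, which matches frame 1 of the figure.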
Section II : Sequence Alignments
I. Setting up the Scoring Matrix
For the DNA sequence alignment program the following scoring table was used. Each match
was rewarded with a positive score, while each mismatch earned a zero or negative score.

      A    G    C    T
A    10   -1   -3   -4
G    -1    7   -5   -3
C    -3   -5    9    0
T    -4   -3    0    8

Mismatches between the two pyrimidines (C-T) or between the two purines (A-G) were penalized
least, with scores of 0 and -1 respectively. This is because, from an evolutionary perspective,
changes between those bases are more common and better tolerated than changes between a
purine and a pyrimidine.
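As an illustration of how such a table is used, the sketch below scores an already-aligned pair of DNA sequences column by column, with the gap penalty of -3 mentioned in the introduction. The names `kDnaScore`, `dna_index`, and `score_alignment` are hypothetical and are not taken from PairwiseAlignment.cpp.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

const int kGapPenalty = -3;

// DNA scoring table from the text, rows and columns in the order A, G, C, T.
const int kDnaScore[4][4] = {
    {10, -1, -3, -4},
    {-1,  7, -5, -3},
    {-3, -5,  9,  0},
    {-4, -3,  0,  8},
};

// Map a base (either case) to its row/column in kDnaScore, or -1 if unknown.
int dna_index(char b) {
    switch (std::toupper(static_cast<unsigned char>(b))) {
        case 'A': return 0;
        case 'G': return 1;
        case 'C': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Sum the column scores of two aligned strings of equal length.
// A column containing '-' contributes the gap penalty.
int score_alignment(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    int total = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        if (a[k] == '-' || b[k] == '-') {
            total += kGapPenalty;
        } else {
            const int r = dna_index(a[k]);
            const int c = dna_index(b[k]);
            assert(r >= 0 && c >= 0);  // only A, C, G, T are expected here
            total += kDnaScore[r][c];
        }
    }
    return total;
}
```

For example, the alignment of AC-T with AGGT scores 10 (A-A) - 5 (C-G) - 3 (gap) + 8 (T-T) = 10.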
The scoring matrix used for the protein pairwise alignment and multiple sequence
alignment programs is the BLOSUM62 matrix shown below.

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11
BLOSUM is an abbreviation for “BLOcks SUbstitution Matrix”. It is a
substitution matrix used for sequence alignment of proteins. It is used frequently in
bioinformatics to score alignments between related and non-related proteins. The matrix was
created by Henikoff and Henikoff (1992; PNAS 89:10915-10919). They created the matrix by
scanning the BLOCKS database for highly conserved regions of protein families. The regions
lacked gaps and so were truly “conserved”. To generate the matrix the scientists counted the
relative frequencies of amino acids and their substitution probabilities. A log-odds score for each
of the 210 possible pairs of the 20 standard amino acids (190 substitutions plus 20 identities)
was derived, and that is the basis of BLOSUM62. So the matrix is not generated from a
theoretical model, but rather based on observed alignments.
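As a sketch of the log-odds idea, the score for a pair of amino acids i and j can be written as below. The symbols q_ij (the observed frequency of the pair in the aligned blocks) and e_ij (the frequency expected by chance from the background amino acid frequencies) are introduced here for illustration; the factor of 2 reflects the usual half-bit scaling of BLOSUM62, and details such as sequence clustering are omitted.

```latex
s_{ij} = \operatorname{round}\!\left( 2 \log_2 \frac{q_{ij}}{e_{ij}} \right)
```

A pair that is observed more often than chance predicts (q_ij > e_ij) receives a positive score, and a pair observed less often receives a negative score.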
Each score within a BLOSUM matrix is a measure of how much more (or less) often two amino
acids are found aligned in related proteins than would be expected by chance. The matrices are
named for the percentage identity threshold applied to the aligned protein sequences used in
calculating them. Each possible identity or substitution is
assigned a score based on its observed frequencies in the alignment of related proteins. The more
likely substitutions are assigned positive scores, and those less likely to occur are given negative
scores. BLOSUM62 is the matrix calculated from blocks in which sequences sharing 62% or more
identity were clustered and counted as a single sequence. It is important to note that the
likelihood of substitution reflects the biological properties of the amino acids. Substitutions
among the members of each of the following groups of amino acids are more likely: aliphatic,
polar, acidic, basic, and aromatic. The aliphatic residues are G, A, V, L, I, and M. The polar
residues are S, T, and C. The aromatic residues are F, W, and Y. The basic residues are H, K, and
R. Lastly, the acidic residues are D and E.
II. Aligning the Sequences
Global Alignment
This project implements the Needleman-Wunsch algorithm for finding a global alignment
in both the DNA sequence alignment (PairwiseAlignment.cpp) and the protein sequence
alignment (ProteinPairwiseAlign.cpp).
The process begins by focusing on the last column of each alignment. There are only
three possibilities for the last column of the alignment: alignments in which the last character of
A is paired with the last character in B, alignments in which the last character in A is paired with
a gap, and alignments in which the last character in B is paired with a gap.
Each of these alignments has effectively two parts: a prefix and the last column. The
prefix of the alignment contains every column but the last.
MRNDPCQ
M-NEPC-
The score for each alignment is calculated by scoring the prefix and adding the score for the last
column. Within each of the three groups, all of the alignments have the same last column, preceded
by prefixes that are different. With the last column held constant, the alignment with the highest
score is the alignment with the highest-scoring prefix. Once the highest scoring alignments for
the prefixes of the sequences are found, the highest scoring alignment over the entire lengths of
the sequences can be found.
The two sequences can be called S1 and S2. For any number i, we'll refer to the first i
characters of S1 as S1[1...i]. Analogously, for any number j, we'll refer to the first j characters of
S2 as S2[1...j]. For any value of i and j, we can calculate the optimal alignment between S1[1...i]
and S2[1...j] by finding the highest score among the three possible groups. The scoring table
(BLOSUM62 for proteins) and the gap penalty are used to determine which of these three options
is best, and that yields the optimal alignment between S1[1...i] and S2[1...j]. Let i and j range
from 0 to the lengths of the sequences (S1 and S2). The value of the optimal alignment for any
particular i and j is then used to find the next larger optimal alignment and so on.
The Needleman-Wunsch algorithm is shown below, with the sequences S1 and S2 written as A and B.
A two-dimensional array (or matrix) is allocated. This matrix is the F matrix, and its
(i,j)th entry is often denoted Fij. There is one row for each character in A and one column for
each character in B, plus a row and a column with index 0 for the empty prefix. Here d is the
gap penalty and S is the scoring table or matrix.
for i = 0 to length(A)
    F(i,0) <- d*i
for j = 0 to length(B)
    F(0,j) <- d*j
for i = 1 to length(A)
    for j = 1 to length(B)
    {
        Choice1 <- F(i-1,j-1) + S(A(i-1), B(j-1))
        Choice2 <- F(i-1, j) + d
        Choice3 <- F(i, j-1) + d
        F(i,j) <- max(Choice1, Choice2, Choice3)
    }
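A minimal C++ sketch of this recurrence is given here, together with a traceback that walks back from the bottom-right cell to recover one optimal alignment. It is an illustration under stated assumptions rather than the code of PairwiseAlignment.cpp: the `Alignment` struct and the `needleman_wunsch` function are hypothetical names, and the scoring function S is passed in by the caller so that either a DNA table or BLOSUM62 could be plugged in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Alignment {
    std::string a;  // first sequence with '-' inserted for gaps
    std::string b;  // second sequence with '-' inserted for gaps
    int score;      // F(length(A), length(B))
};

// Global alignment of A and B. S scores a pair of characters;
// d is the (negative) gap penalty, e.g. -3.
Alignment needleman_wunsch(const std::string& A, const std::string& B,
                           const std::function<int(char, char)>& S, int d) {
    assert(d <= 0);
    const std::size_t n = A.size(), m = B.size();
    std::vector<std::vector<int>> F(n + 1, std::vector<int>(m + 1, 0));

    // Initialization: a prefix aligned against nothing costs one gap per character.
    for (std::size_t i = 0; i <= n; ++i) F[i][0] = d * static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) F[0][j] = d * static_cast<int>(j);

    // Fill: each cell is the best of the three possible last columns.
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int choice1 = F[i - 1][j - 1] + S(A[i - 1], B[j - 1]);
            const int choice2 = F[i - 1][j] + d;  // A(i) against a gap
            const int choice3 = F[i][j - 1] + d;  // B(j) against a gap
            F[i][j] = std::max({choice1, choice2, choice3});
        }
    }

    // Traceback: rebuild the alignment backwards from the bottom-right cell.
    std::string ra, rb;
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && F[i][j] == F[i - 1][j - 1] + S(A[i - 1], B[j - 1])) {
            ra += A[i - 1]; rb += B[j - 1]; --i; --j;
        } else if (i > 0 && F[i][j] == F[i - 1][j] + d) {
            ra += A[i - 1]; rb += '-'; --i;
        } else {
            ra += '-'; rb += B[j - 1]; --j;
        }
    }
    std::reverse(ra.begin(), ra.end());
    std::reverse(rb.begin(), rb.end());
    return {ra, rb, F[n][m]};
}
```

With a toy scoring function of +1 for a match and -1 for a mismatch, and d = -1, aligning ACGT with AGT gives ACGT over A-GT with a score of 2. When several alignments tie for the best score, this sketch returns only one of them.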
The bottom right-hand corner of the matrix holds the maximum score over all alignments. To find
the alignment which generates this score, start from the bottom right cell, and compare the value
with the three possible sources (Choice1, Choice2, and Choice3 above) to see which it came
from. If Choice1, then S1(i) and S2(j) are aligned; if Choice2, then S1(i) is aligned with a gap;
and if Choice3, then S2(j) is aligned with a gap. Embedded in the algorithm is a weight
matrix which gives the alignment between the sequences. The traceback is shown below, with S1
and S2 again written as A and B.
AlignmentA <- ""
AlignmentB <- ""
i <- length(A)
j <- length(B)
while (i > 0 AND j > 0)
{
    Score <- F(i,j)
    ScoreDiag <- F(i - 1, j - 1)
    ScoreUp <- F(i, j - 1)
    ScoreLeft <- F(i - 1, j)
    if (Score == ScoreDiag + S(A(i-1), B(j-1)))
    {
        AlignmentA <- A(i-1) + AlignmentA
        AlignmentB <- B(j-1) + AlignmentB
        i <- i - 1
        j <- j - 1
    }
    else if (Score == ScoreLeft + d)
    {
        AlignmentA <- A(i-1) + AlignmentA
        AlignmentB <- "-" + AlignmentB
        i <- i - 1
    }
    otherwise (Score == ScoreUp + d)
    {
        AlignmentA <- "-" + AlignmentA
        AlignmentB <- B(j-1) + AlignmentB
        j <- j - 1
    }
}
while (i > 0)
{
    AlignmentA <- A(i-1) + AlignmentA
    AlignmentB <- "-" + AlignmentB
    i <- i - 1
}
while (j > 0)
{
    AlignmentA <- "-" + AlignmentA
    AlignmentB <- B(j-1) + AlignmentB
    j <- j - 1
}
Local Alignment
The local alignment is very similar to the global alignment. Local
alignments are best for non-similar sequences that contain regions of homology. This program
implements the Smith-Waterman algorithm for finding local alignments. The Smith-Waterman
method highlights only those regions of the alignment which have positive scores. For each cell
of the matrix, the algorithm considers each path that leads to it. The paths can be of any length
and can contain gaps (insertions and deletions). Instead of looking at an entire sequence at once,
the S-W algorithm compares segments of all possible lengths, looking for whichever segment
maximizes the scoring measure. The goal here is to align the sequences while ignoring the badly
aligned areas of the entire alignment. The main change in the algorithm as compared to the
global one is in the initialization and iteration steps. Here the first row and first column of
the matrix are set to zero. The traceback also changes: it starts from the highest-scoring cell
and stops at the first cell with a score of zero.
Initialization:
F(0, j) = F(i, 0) = 0
Iteration:
F(i, j) = max { 0,
                F(i - 1, j - 1) + s(xi, yj),
                F(i - 1, j) + d,
                F(i, j - 1) + d }
where d is the (negative) gap penalty, as in the global alignment.
When finding the maximum score for a matrix entry, if the maximum score is negative, we
instead make it 0. This is done because we are searching for the optimal substring; if a part of the
alignment is negative it can be ignored.
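The local variant can be sketched in C++ as follows. This is a minimal illustration, not the project's code: `LocalAlignment` and `smith_waterman` are hypothetical names, the scoring function S is supplied by the caller, and d is taken to be the negative gap penalty (for example -3), so it is added rather than subtracted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct LocalAlignment {
    std::string a;  // aligned segment of the first sequence
    std::string b;  // aligned segment of the second sequence
    int score;      // highest cell value in the matrix
};

// Local alignment of A and B with scoring function S and negative gap penalty d.
LocalAlignment smith_waterman(const std::string& A, const std::string& B,
                              const std::function<int(char, char)>& S, int d) {
    assert(d <= 0);
    const std::size_t n = A.size(), m = B.size();
    // Row 0 and column 0 stay at zero (the initialization step).
    std::vector<std::vector<int>> F(n + 1, std::vector<int>(m + 1, 0));

    int best = 0;
    std::size_t bi = 0, bj = 0;  // position of the best-scoring cell
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = F[i - 1][j - 1] + S(A[i - 1], B[j - 1]);
            const int up   = F[i - 1][j] + d;
            const int left = F[i][j - 1] + d;
            F[i][j] = std::max({0, diag, up, left});  // negative scores become 0
            if (F[i][j] > best) { best = F[i][j]; bi = i; bj = j; }
        }
    }

    // Traceback from the best cell until a zero cell is reached.
    std::string ra, rb;
    std::size_t i = bi, j = bj;
    while (i > 0 && j > 0 && F[i][j] > 0) {
        if (F[i][j] == F[i - 1][j - 1] + S(A[i - 1], B[j - 1])) {
            ra += A[i - 1]; rb += B[j - 1]; --i; --j;
        } else if (F[i][j] == F[i - 1][j] + d) {
            ra += A[i - 1]; rb += '-'; --i;
        } else {
            ra += '-'; rb += B[j - 1]; --j;
        }
    }
    std::reverse(ra.begin(), ra.end());
    std::reverse(rb.begin(), rb.end());
    return {ra, rb, best};
}
```

With a toy scoring function of +2 for a match and -1 for a mismatch, and d = -3, aligning CCCGATTACACCC with TTGATTACATT returns the shared segment GATTACA with a score of 14; the poorly matching flanks are ignored.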
III. Multiple sequence alignment (MSA)
The multiple sequence alignment method used was progressive alignment, following the
Feng-Doolittle algorithm. First, global alignments are generated for all pairs of input
sequences; then a distance tree is created to keep track of the alignments; and finally all
sequences are aligned to the best pair. Due to a shortage of time, only the first step of the
algorithm was implemented. Global alignments were derived for all the sequences using a for-loop
that aligned each pair of sequences by generating instances of the global alignment class. The
global alignment class is the same as the one used for the pairwise sequence alignments.
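The all-pairs step can be sketched as below. This is a simplified stand-in for MSA.cpp, not a copy of it: a function computes each global alignment score instead of instantiating an alignment class, the toy scoring of +1 for a match and -1 for a mismatch replaces BLOSUM62, and the names `global_score`, `pairwise_scores`, and `best_pair` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Needleman-Wunsch score only, keeping two rows of the F matrix.
// Toy scoring (an assumption): match +1, mismatch -1; d is the gap penalty.
int global_score(const std::string& a, const std::string& b, int d) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = d * static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = d * static_cast<int>(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int s = (a[i - 1] == b[j - 1]) ? 1 : -1;
            cur[j] = std::max({prev[j - 1] + s, prev[j] + d, cur[j - 1] + d});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Score every pair of sequences once and store the result symmetrically.
std::vector<std::vector<int>> pairwise_scores(const std::vector<std::string>& seqs, int d) {
    const std::size_t n = seqs.size();
    std::vector<std::vector<int>> scores(n, std::vector<int>(n, 0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            scores[i][j] = scores[j][i] = global_score(seqs[i], seqs[j], d);
        }
    }
    return scores;
}

// Indices (i < j) of the highest-scoring pair: the natural starting point
// for the later progressive steps.
std::pair<std::size_t, std::size_t> best_pair(const std::vector<std::vector<int>>& scores) {
    assert(scores.size() >= 2);
    std::pair<std::size_t, std::size_t> best(0, 1);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        for (std::size_t j = i + 1; j < scores.size(); ++j) {
            if (scores[i][j] > scores[best.first][best.second]) best = {i, j};
        }
    }
    return best;
}
```

The resulting score table shows which sequences are most similar to each other, but it does not yet align the sequences as a group.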
RESULTS
This section will display the results generated by each program.
DNA sequence alignment:
DNA Translation (Trans.cpp):
Protein sequence alignment (ProteinPairWiseAlign.cpp):
Multiple Protein Sequence Alignment
FUTURE WORK
The next logical step is to improve the multiple sequence alignment program. Currently
the program does not align the sequences as a group. From the output above it can be inferred
which sets of sequences are most similar, but similarities between all the proteins cannot be seen.
For Trans.cpp (the translation program), a more refined program would do a better job of
trimming the excess amino acids from the “best” proteins.
CONCLUSION
It is necessary to process raw DNA sequences in order for biologists to obtain meaningful
data from the various genomes that have been sequenced. Predicting the primary sequence can
assist in understanding the function of a gene. Sequence alignment is a useful tool in finding gene
or protein homologs. Hypothetical genes and proteins don’t always amount to real genes and
proteins. The Needleman-Wunsch and Smith-Waterman algorithms are effective at finding
homologous gene sequences and proteins. The genetic code and the protein code continue to
be both simple in structure and elusive in nature.
REFERENCES
Agarwal, P. K., Orlando, D. (2003). Lecture 15: Multiple Sequence Alignment.
CPS260/BGT204.1 Algorithms in Computational Biology.
http://www.cs.duke.edu/courses/cps260/fall03/notes/lecture15.pdf
Bhattacharya, D., Haque, R., Singh, U. (2005). Coding and Noncoding Genomic Regions of
Entamoeba histolytica Have Significantly Different Rates of Sequence Polymorphisms:
Implications for Epidemiological Studies. J. Clin. Microbiol. 43 (9): 4815-4819.
Champe, P. C., Harvey, R. A., Ferrier, D. R. (2005). Lippincott's Illustrated
Reviews: Biochemistry (3rd ed.). Lippincott Williams & Wilkins.
Needleman, S. B., Wunsch, C. D. (1970). A general method applicable to the search for
similarities in the amino acid sequences of two proteins. Journal of Molecular Biology 48: 443-453.
Smith, T. F., Waterman, M. S. (1981). Identification of Common Molecular Subsequences. Journal
of Molecular Biology 147: 195-197.
Waterman, M. S., Eggert, M. (1987). A new algorithm for best subsequence alignments with
applications to tRNA-rRNA comparisons. Journal of Molecular Biology 197: 723-728.