Walking Tree Heuristics for Biological String Alignment, Gene Location, and Phylogenies

P. Cull*, J. L. Holloway†, and J. D. Cavener‡
* Computer Science Dept., Oregon State University, Corvallis, OR 97331 USA, pc@cs.orst.edu
† Zymogenetics, 1201 Eastlake Av., Seattle, WA 98102 USA, holloway@zgi.com
‡ Computer Science Dept., Oregon State University, Corvallis, OR 97331 USA, cavenej@cs.orst.edu
Abstract. Basic biological information is stored in strings of nucleic acids (DNA, RNA) or amino acids
(proteins). Teasing out the meaning of these strings is a central problem of modern biology. Matching and
aligning strings brings out their shared characteristics. Although string matching is well-understood in the
edit-distance model, biological strings with transpositions and inversions violate this model's assumptions.
We propose a family of heuristics called walking trees to align biologically reasonable strings.
Both edit-distance and walking tree methods can locate specific genes within a large string when the
genes' sequences are given. When we attempt to match whole strings, the walking tree matches most
genes, while the edit-distance method fails. We also give examples in which the walking tree matches
substrings even if they have been moved or inverted. The edit-distance method was not designed to
handle these problems. We include an example in which the walking tree "discovered" a gene.
Calculating scores for whole genome matches gives a method for approximating evolutionary distance.
We show two evolutionary trees for the picornaviruses which were computed by the walking tree heuristic.
Both of these trees show great similarity to previously constructed trees. The point of this demonstration
is that WHOLE genomes can be matched and distances calculated. The first tree was created on a Sequent
parallel computer and demonstrates that the walking tree heuristic can be efficiently parallelized. The
second tree was created using a network of workstations and demonstrates that there is sufficient parallelism
in the phylogenetic tree calculation that the sequential walking tree can be used effectively on a network.
Keywords: sequence alignment, genome alignment, heuristic, Picornavirus, inversions, swaps, dynamic programming, translocations, walking tree, phylogeny
INTRODUCTION
The organization of the eukaryote genome is the major scientific enigma of our era. In the last decade
there have been major advances in determining the sequential structure of DNA. Massive amounts of data
on a variety of organisms are daily pouring out of sequencing labs. Partial genetic maps are available for
many organisms. Full maps have recently been produced for several bacteria [1]. Detailed maps of the
human genome are in the works and should be available in the next decade. This avalanche of data will
require fast computer methods for analysis and data retrieval. Present methods for string matching are
based on the edit-distance model [12]. These methods assume that differences between strings are due to
local changes, but there is evidence that larger scale changes can occur [4]. In particular, swaps, in which
large pieces of DNA are moved from one location to another, and inversion, in which a sequence is replaced
by its reversed complement, do not fit neatly into the edit-distance model. Recently, there have been some
attempts to extend the edit-distance model to include inversions [11], but the extension will only handle
one level of inversion and cannot handle inversions within inversions. Further, it seems unlikely that any
simple model will be able to capture the minimum biologically correct distance between two strings. In all
likelihood, finding the fewest operations that have to be applied to one string to obtain another string will
require trying all possible sequences of operations. Of course, trying all possible sequences will
be computationally intractable, that is, any computer program based on this sort of exhaustive search will
require millennia to process strings of biologically reasonable size. This intuition of intractability has been
confirmed by a recent proof that determining the minimum number of flips needed to sort a sequence is an
NP-complete problem [3].
Figure 1: Data structure used to align the pattern within the text (example pattern shown: ctatctactatgcggagcctagagtggcagtc). In this picture, each leaf node represents
8 characters of the pattern, each of the internal nodes represents 16 characters of the pattern, and the root
node represents the entire pattern. Each of the nodes contains the fields shown in the expanded node: best score,
best position, current position, left best score, left best pos, right best score, and right best pos.
WALKING TREE HEURISTICS
Basic Heuristic
Our problem is to create a heuristic that finds an approximate biologically reasonable alignment between
two strings, one called pattern, and the other called text. Our metaphor is to consider the data structure
for the basic heuristic as a walking tree [9] (see Figure 1) with $|P|$ leaves, one for each character in the
pattern. When the heuristic is considering position $l + 1$ of the text, the leaves of the tree are positioned
over the $|P|$ contiguous characters of the text up to and including character $l + 1$. The leaves also remember
some of the information for the best alignment within the first $l$ characters of the text. On the basis of this
remembered information and the comparisons of the leaves with the text characters under them, the leaves
update their information and pass this information to their parents. The data will percolate up to the root
where a new best score is calculated. The tree can then walk to the next position by moving each of its
leaves one character to the right. The whole text has been processed when the leftmost leaf of the walking
tree has processed the rightmost character of the text.
To define a scoring system that captures some biological intuitions, we currently use a function that gives
a positive contribution based on the similarity between aligned characters, and a negative contribution that
is related to the number and length of gaps, translocations, and inversions. A gap in an alignment occurs
when adjacent characters in the pattern are aligned with non-adjacent characters in the text. The length of
the gap is the number of characters between the non-adjacent characters in the text.
The computation at each leaf node makes use of two functions, MATCH and GAP. MATCH looks at the
current text character and compares it with the pattern character represented by the leaf. In the simplest
case one could use
$$\mathrm{MATCH}(P_i, T_j) = \begin{cases} c & \text{if } P_i = T_j \\ 0 & \text{if } P_i \neq T_j \end{cases}$$
For many of our examples we use c = 3. If one were matching amino acid strings then MATCH could return
a value depending on how similar the two compared amino acids are. There are standard methods like PAM
matrices for scoring this similarity.
If we only used the MATCH function, the leaves would simply remember if they had ever seen a matching
character in the text. We use the GAP function to penalize the match for being far from the current position
of the tree. So the leaf needs to remember both if it found a match and the position of the walking tree
when it found a match. For example, a simple GAP function could be:
$$\mathrm{GAP}(\mathit{currentpos}, \mathit{pos}) = |\mathit{currentpos} - \mathit{pos}|.$$
Then the leaf could compute
$$\mathrm{SCORE} := \max\bigl(\mathrm{MATCH}(P_i, T_j),\ \mathrm{SCORE} - \mathrm{GAP}(\mathit{currentpos}, \mathit{pos})\bigr)$$
and update pos to currentpos if MATCH is maximal. This means that a leaf will forget an actual match if
it occurred far from the current position.
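The leaf rule above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Leaf`, `best`, and `update` are hypothetical, and the sketch assumes the leaf stores the score it had when it last matched (`best`) together with that position (`pos`), so that the gap penalty is always measured from the remembered position.

```python
from dataclasses import dataclass

C = 3  # match reward; the text uses c = 3 in many of its examples


def match(p_char, t_char, c=C):
    """MATCH: c when the characters agree, 0 otherwise."""
    return c if p_char == t_char else 0


def gap(currentpos, pos):
    """The simple GAP function from the text: distance between positions."""
    return abs(currentpos - pos)


@dataclass
class Leaf:
    p_char: str    # pattern character represented by this leaf
    best: int = 0  # remembered score (assumed field name)
    pos: int = 0   # tree position at which `best` was recorded

    def update(self, t_char, currentpos):
        """Return max(MATCH, remembered score - GAP) and refresh the memory."""
        m = match(self.p_char, t_char)
        remembered = self.best - gap(currentpos, self.pos)
        if m >= remembered:
            # MATCH is maximal: remember the new score and position.
            self.best, self.pos = m, currentpos
            return m
        return remembered


leaf = Leaf("A")
scores = [leaf.update(ch, p) for p, ch in [(4, "A"), (5, "X"), (6, "X"), (7, "X")]]
```

In this run the remembered match decays by one per step until the leaf forgets it and resets its position, which is the forgetting behavior the paragraph describes.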
An internal node will only look at what it is remembering and at what its children have computed. Like
a leaf node, the internal node computes
$$\mathrm{SCORE} - \mathrm{GAP}(\mathit{currentpos}, \mathit{pos})$$
which depends on what the node is remembering. From its two children, the node computes
$$\mathrm{SCORE.left} + \mathrm{SCORE.right} - \mathrm{GAP}(\mathit{pos.left}, \mathit{pos.right}).$$
This will penalize the sum of the children's scores because the position for the two scores may be different.
But, the internal node also has to penalize this score because the left or right position may be far from the
current position. So it also subtracts
$$\min\bigl(\mathrm{GAP}(\mathit{currentpos}, \mathit{pos.left}),\ \mathrm{GAP}(\mathit{currentpos}, \mathit{pos.right})\bigr).$$
At this point we notice that there is an asymmetry that should be built into GAP, because having the left
position before the right position should be favored. Without going into details, we usually make GAP
asymmetric. The internal node will keep the better of the score from its remembered data and the score
computed from its children.
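The internal-node rule can be sketched as follows. This is only an illustration under assumptions: it uses the simple symmetric GAP (the asymmetric version is not specified in the text), the function name `internal_update` and the tie-breaking in favor of the children are hypothetical, and each node is represented as a (SCORE, BestSCORE, pos) triple.

```python
def gap(a, b):
    """Simple symmetric GAP; the asymmetric variant is not given in the text."""
    return abs(a - b)


def internal_update(best, pos, left, right, currentpos):
    """Combine a node's memory with its children's results.

    best, pos : the node's remembered score and the position where it was set
    left, right : (score, pos) pairs reported by the two children
    Returns (SCORE, BestSCORE, pos) for the current tree position.
    """
    l_score, l_pos = left
    r_score, r_pos = right
    from_memory = best - gap(currentpos, pos)
    from_children = (
        l_score + r_score
        - gap(l_pos, r_pos)  # children matched at different positions
        - min(gap(currentpos, l_pos), gap(currentpos, r_pos))  # far from here
    )
    if from_children >= from_memory:  # assumed tie-break: prefer fresh data
        return from_children, from_children, currentpos
    return from_memory, best, pos


# Memory wins: an old score of 6 at position 4 beats children that see nothing.
kept = internal_update(6, 4, (0, 6), (0, 6), currentpos=6)
# Children win: both children match at the current position.
fresh = internal_update(0, 5, (3, 6), (3, 6), currentpos=6)
```

Because the GAP used here is the simple symmetric one, the numbers are not meant to reproduce those of the paper's figures.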
The walking tree will be computing a score for the current position of the tree. It is possible that it could
forget a better score that was far from the present position. To avoid this problem, we add a special root
node which simply keeps track of the best score seen so far.
The following diagrams (see Figure 2) give a simplified picture of the operation of the walking tree. In
these pictures, each node has
$$\mathrm{SCORE},\ \mathrm{BestSCORE},\ \mathit{pos}.$$
The first diagram shows the tree at position 4. That is, the leaves are looking at the first 4 characters in
the text. The pattern is ABCD, and each character of this pattern is represented by one leaf. The text is
ABXXCD, and so at position 4 the tree is looking at ABXX. For this example,
$$\mathrm{MATCH}(P_i, T_j) = \begin{cases} 3 & \text{if } P_i = T_j \\ 0 & \text{if } P_i \neq T_j \end{cases}$$
and
$$\mathrm{GAP}(x, y) = \begin{cases} 0 & \text{if } x = y \\ |x - y| + 1 & \text{if } x \neq y. \end{cases}$$
The succeeding pictures show the walking tree in positions 5 and 6. Notice that in each position, the new
scores computed at the leaves percolate up through the tree and new scores are computed at each node as
necessary. For example, at position 6, the leaf node with A sees an X and computes 0 as its MATCH, and
since MATCH ≥ BestSCORE − GAP, this leaf changes pos to 6. On the other hand, the internal node
above this leaf does not change its pos because its remembered BestSCORE minus the GAP is larger than
the new value computed from the updated values in its children.
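The values quoted for the leaf with A and for the internal node above it can be checked by hand. The following worked arithmetic assumes that a node's current SCORE is its remembered BestSCORE minus the example's GAP between the current position and the remembered pos, which is how the triples in Figure 2 read:

```latex
\begin{align*}
\text{leaf A, position 5:}\quad & 3 - \mathrm{GAP}(5, 4) = 3 - (|5 - 4| + 1) = 1 \\
\text{leaf A, position 6:}\quad & 3 - \mathrm{GAP}(6, 4) = 3 - (|6 - 4| + 1) = 0 \le \mathrm{MATCH} = 0
  \ \Rightarrow\ \mathit{pos} := 6 \\
\text{node above A and B, position 5:}\quad & 6 - \mathrm{GAP}(5, 4) = 6 - 2 = 4 \\
\text{node above A and B, position 6:}\quad & 6 - \mathrm{GAP}(6, 4) = 6 - 3 = 3 > 0 + 0
  \ \Rightarrow\ \mathit{pos} \text{ stays } 4
\end{align*}
```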
Figure 2: Simplified walking tree for pattern ABCD at positions 4, 5, and 6 of the text ABXXCD. Node values (SCORE, BestSCORE, pos) as read from the three diagrams:

Position  Root     Node over A,B  Node over C,D  Leaf A   Leaf B   Leaf C   Leaf D
4         6, 6, 4  6, 6, 4        0, 0, 4        3, 3, 4  3, 3, 4  0, 0, 4  0, 0, 4
5         4, 6, 4  4, 6, 4        0, 0, 5        1, 3, 4  1, 3, 4  0, 0, 5  0, 0, 5
6         9, 9, 6  3, 6, 4        6, 6, 6        0, 0, 6  0, 0, 6  3, 3, 6  3, 3, 6

The walking tree finds an alignment $f$ so that
$$f : [1, \ldots, |P|] \to [1, \ldots, |T|].$$
The alignment found approximately maximizes
$$\sum_{i=1}^{|P|} \mathrm{MATCH}(P_i, T_{f(i)}) - \sum_{i=1}^{|P|-1} \mathrm{GAP}(f(i), f(i+1)).$$
The actual functional being maximized is best described by the heuristic itself and has no simple formula.
Among other reasons, the above formula is only an approximation because the heuristic tries to break strings
into substrings whose lengths are powers of 2 even when using other lengths would increase the value of the
formula.
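To make the approximate objective concrete, the sketch below evaluates it for a given alignment f. It is an illustration only: the function names are hypothetical, MATCH uses c = 3, and the gap term is assumed to be the gap length defined earlier (the number of text characters between two consecutively aligned characters), so that adjacent characters cost nothing.

```python
def match(p_char, t_char, c=3):
    """MATCH with c = 3, as in many of the text's examples."""
    return c if p_char == t_char else 0


def gap_length(x, y):
    """Assumed gap term: text characters skipped between positions x and y.

    Zero when y immediately follows x; grows with the size of the jump.
    """
    return abs(y - x - 1)


def alignment_value(pattern, text, f):
    """Sum of MATCH terms minus sum of gap terms for alignment f.

    f[i] is the 1-based text position aligned with pattern character i + 1.
    """
    matches = sum(match(p, text[j - 1]) for p, j in zip(pattern, f))
    gaps = sum(gap_length(f[i], f[i + 1]) for i in range(len(f) - 1))
    return matches - gaps


gapped = alignment_value("ABCD", "ABXXCD", [1, 2, 5, 6])      # skips the XX
contiguous = alignment_value("ABCD", "ABXXCD", [1, 2, 3, 4])  # no gap
```

Under these assumptions the gapped alignment scores 12 − 2 = 10 and the contiguous one scores 6, so the formula prefers opening the gap.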
Performance Characteristics
The basic heuristic computes the score of an alignment with no inversions. We modify this heuristic in
three ways: 1) to compute a better placement of gaps, 2) to construct the alignment, and 3) to use inversions
in the alignment. The extensions to the basic heuristic may be used individually or in combination.
We have shown the following resource usage results in [6] for the heuristic computing only the score of an
alignment with inversions.
• The heuristic will execute in time proportional to the product of the length of the text and the length
of the pattern.
• The work space used by the heuristic is proportional to the length of the pattern. The work space used
is independent of the length of the text.
• The heuristic underestimates the actual alignment score of a pattern at a given position in the text
by, at most, the sum of the gap penalties in the alignment at that position.
We have shown the following resource usage results in [6] for the heuristic constructing an alignment with
inversions.
• In the worst case, the heuristic to construct the alignment will run in $O(|T||P| \log |P|)$ time given a text
string $T$ and a pattern $P$. In practice (see Figure 3), alignment construction takes $O(|T||P|)$ time, as
the $\log |P|$ factor for constructing the alignment does not appear since a "better" alignment, requiring
that the best alignment be updated, is only rarely found as the walking tree moves along the text.
• Work space of $O(|P| \log |P|)$ is required by the heuristic to construct an alignment given a text string
$T$ and a pattern $P$.
• The heuristic never needs to go backwards in the text.
Adjusting Gaps
The basic heuristic places gaps close to their proper positions. If we use the heuristic to align the string
"ABCDEF" in the string "ABCXXDEF" the gap may be placed between 'B' and 'C', rather than between
'C' and 'D'. This is a result of the halving behavior of the basic heuristic. By searching in the vicinity of
the position that the basic heuristic places a gap we can find any increase in score that can be obtained by
sliding the gap to the left or right. The cost of finding better placements of the gaps is a factor of $\log |P|$
increase in running time since at each node we have to search a region of the text of length proportional to
the size of the substring represented by the node.
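The gap-sliding idea can be sketched for the "ABCDEF" example. This is a simplified illustration, not the heuristic's actual search: it assumes a single gap, fixed 0-based offsets for the two flanking pieces, and hypothetical function names. Sliding the split point leaves the gap length unchanged, so only the MATCH total needs to be compared.

```python
def match(p_char, t_char, c=3):
    """Reward c for equal characters, 0 otherwise (c = 3 as in the text)."""
    return c if p_char == t_char else 0


def split_score(pattern, text, left_offset, right_offset, split):
    """MATCH total when pattern[:split] uses left_offset and the rest right_offset.

    Offsets are 0-based: pattern[i] is compared with text[offset + i].
    The caller must choose offsets that keep every index inside the text.
    """
    left = sum(match(pattern[i], text[left_offset + i]) for i in range(split))
    right = sum(match(pattern[i], text[right_offset + i])
                for i in range(split, len(pattern)))
    return left + right


def slide_gap(pattern, text, left_offset, right_offset, split, window):
    """Try split points within `window` of the initial one; return the best."""
    lo = max(0, split - window)
    hi = min(len(pattern), split + window)
    return max(range(lo, hi + 1),
               key=lambda k: split_score(pattern, text, left_offset, right_offset, k))


# The gap starts between 'B' and 'C' (split = 2); the search moves it
# to between 'C' and 'D' (split = 3).
best_split = slide_gap("ABCDEF", "ABCXXDEF", 0, 2, split=2, window=2)
before = split_score("ABCDEF", "ABCXXDEF", 0, 2, 2)
after = split_score("ABCDEF", "ABCXXDEF", 0, 2, best_split)
```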
Figure 3: CPU time used by our heuristic to align pairs of equal length sequences on a Sun SPARC-10. (Plot axes: size of sequences aligned, 10 to 10000; time in CPU seconds, 0.001 to 10000.)
Including Inversions
An inversion occurs when a substring, $P_1 = p_i p_{i+1} \cdots p_j$, is matched with text that has the form
$\bar{p}_j \bar{p}_{j-1} \cdots \bar{p}_i$. We use $\bar{p}_i$ to indicate the complement of $p_i$. A translocation occurs when a substring $P_1$
is matched with a text substring $T_1$, and a substring $P_2$ is matched with a text substring $T_2$, but $P_1$ occurs
before $P_2$ in the pattern string, while $T_1$ occurs after $T_2$ in the text string. The basic heuristic can be modified
to find alignments when substrings of the pattern need to be inverted to match substrings of the text.
The idea is to invert the pattern and move, along the text, a walking tree of the inverted pattern in parallel
with the walking tree of the original pattern. When the match score of a region of the inverted pattern is
sufficiently higher than the match score of the corresponding region of the pattern, the region of the inverted
pattern is used to compute the alignment score. The introduction of an inversion can be penalized using a
function similar to the gap penalty function.
Note that inversions are incorporated into both the walking tree and the inverted walking tree so that it
is possible to have inversions nested within a larger inverted substring.
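The inverted pattern used by the second walking tree is the reverse complement of the original. A minimal sketch of that operation for DNA follows; the helper names are hypothetical, and the comparison at the end only illustrates why an inverted region scores higher against the inverted pattern than against the direct one.

```python
# Translation table for DNA complements (lower and upper case).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_matches(a, b):
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


pattern = "ctatctac"
text_region = reverse_complement(pattern)  # a text region holding an inverted copy

direct = count_matches(pattern, text_region)
inverted = count_matches(reverse_complement(pattern), text_region)
```

Here the inverted pattern matches the region at every position, while the direct pattern matches only by chance at some positions.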
Constructing the alignment
The basic heuristic will report the score and position of the alignment, but does not give enough information to construct the alignment.
Alignment construction introduces a factor of $\log |P|$ into theoretical time and space usage which does
not occur in practice. An additional field to hold the best alignment of the pattern substring represented by
a node is added to each node of the walking tree.
Note that we must save the entire alignment and not just pointers to the alignments of a node's children
because the maximum scoring alignments of the children may change without the maximum scoring alignment
of the node changing. When the modified heuristic completes, the alignment that produced the best score
will be stored in the root node of the walking tree.
Parallelization
We have implemented and optimized our heuristic on a single node of a Meiko CS-2 computer. A node
consists of one Texas Instruments SuperSPARC scalar processor and two Fujitsu VP vector processors.
In practice, we let each leaf of the walking tree represent more than a single character, typically between
30 and 100 characters. This does not decrease the number of character comparisons that the heuristic
performs since the walking tree is still moving one character at a time across the text, but it does decrease
the size of the walking tree. The smaller walking tree decreases the time required to align two sequences.
Using the vector-optimized heuristic on one node of the Meiko CS-2, two 8192-base sequences can be
aligned in less than 1.5 CPU minutes, and aligning a pair of sequences of length 32768 requires less than 25
minutes of CPU time. This agrees with the predicted run time which says that increasing both the pattern
and the text by a factor of 4 should increase the run time by a factor of 16.
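The factor of 16 follows directly from the run time being proportional to the product of the two lengths. As a worked check, taking the quoted 1.5 minutes as an upper bound for the 8192-base case:

```latex
\begin{align*}
t &\propto |P|\,|T| \\
\frac{t(32768)}{t(8192)} &= \left(\frac{32768}{8192}\right)^{2} = 4^{2} = 16 \\
t(32768) &\lesssim 16 \times 1.5\ \text{min} = 24\ \text{min}
\end{align*}
```

This is consistent with the reported figure of less than 25 minutes.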
Extrapolating, an alignment of sequences one million bases long may take two weeks, but fortunately our
parallel implementations have shown near-linear speedup. The table below shows speedup results on a 28
processor shared memory Sequent Balance 21000.
Processors   Speedup
1            1.0
2            1.97
4            3.83
8            7.26
12           10.5
16           13.6
20           16.7
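The figures above can be turned into two quick checks: the parallel efficiency (speedup divided by processor count) for each row of the table, and the quadratic extrapolation to million-base sequences. This is plain arithmetic on the numbers quoted in the text, with hypothetical variable names; it assumes the 1.5-minute figure for 8192 bases as the baseline.

```python
# Speedup values from the table (processors -> speedup).
speedup = {1: 1.0, 2: 1.97, 4: 3.83, 8: 7.26, 12: 10.5, 16: 13.6, 20: 16.7}

# Parallel efficiency: fraction of ideal linear speedup actually achieved.
efficiency = {p: s / p for p, s in speedup.items()}

# Quadratic extrapolation: time grows with the product of the two lengths.
baseline_minutes = 1.5  # upper bound quoted for two 8192-base sequences
scale = (1_000_000 / 8192) ** 2
estimated_days = baseline_minutes * scale / (60 * 24)

for p in sorted(efficiency):
    print(f"{p:>2} processors: efficiency {efficiency[p]:.3f}")
print(f"estimated single-node time for 10^6 bases: {estimated_days:.1f} days")
```

Efficiency stays above 0.8 through 20 processors, and the extrapolated single-node time is a little over 15 days, in line with the "two weeks" estimate.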
It may also be possible to construct a "smart" disk controller based on our walking tree heuristic. The
heuristic uses only a few simple operations and never needs to back up in the text. Each processor would
need a small, constant sized memory, and would need to communicate with at most four other processors.
Such a disk controller would allow database search heuristics to start with only the sequences that are similar
to the query sequence.
EXAMPLES
Phylogenetic Analysis
Picornaviridae is a family of single stranded RNA viruses that are 7.2 to 8.4 kb in length. It is composed
of the five genera Aphthovirus, Cardiovirus, Enterovirus, Hepatovirus, and Rhinovirus. The RNA typically
codes for four major polypeptides and several proteases.
We selected the sequences from GenBank release 81.0 (February 1994) with the keywords "Picornaviridae", "complete", "genome", and "sequence". From these GenBank entries we selected the twenty entries
that contained a complete Picornavirus genome sequence. Using the heuristic, we computed an alignment
score between each pair of sequences. The alignment score, $s$, is converted to a normalized "distance", $d$,
using
$$d = 1 - \frac{s}{\max_s}$$
where $\max_s$ is the maximum possible alignment score for the pattern sequence. We then used the Fitch-Margoliash
distance matrix method as implemented by Joe Felsenstein in the Phylip 3.53c package to construct the phylogenetic tree in Figure 4.
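The score-to-distance conversion can be sketched as below. The function names and the toy scores are hypothetical, and the layout (row i holds the scores obtained with sequence i as the pattern) is an assumption made for the example. Note that a matrix built this way need not be symmetric; how the original study handled that before passing it to Phylip is not stated here.

```python
def normalized_distance(score, max_score):
    """d = 1 - s / max_s, where max_s is the pattern's best possible score."""
    return 1.0 - score / max_score


def distance_matrix(scores, max_scores):
    """Convert pairwise alignment scores to normalized distances.

    scores[i][j]  : score of aligning sequence i (pattern) against sequence j
    max_scores[i] : maximum possible score for sequence i as the pattern
    """
    n = len(scores)
    return [[normalized_distance(scores[i][j], max_scores[i]) for j in range(n)]
            for i in range(n)]


# Toy data for two sequences; the values are made up for illustration.
toy_scores = [[30, 12],
              [15, 24]]
toy_max = [30, 24]
d = distance_matrix(toy_scores, toy_max)
```

A sequence aligned with itself reaches its maximum score and so has distance 0 on the diagonal.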
The phylogeny presented in Figure 4 is based on the complete viral genomes and is nearly identical to
the phylogeny presented by [13] based on the P1 (capsid-encoding) regions of each genome. The Hepatovirus genus (HPAACG, HPACG, HPA, HPAA) clusters tightly as expected since they are nearly identical
sequences. The Cardioviruses (EVCGAA, EMCDCG, EMCBCG, MNGPOLY, TMECG, TMEPP, TMEVCPLT, and TMEGDVCG) form three groups, the encephalomyocarditis viruses, a mengovirus, and the Theiler
murine encephalomyelitis viruses. The Enteroviruses (SVDG, CXB5CGA, CXB4S, CXA21CG, POLIOS1,
CXA24CG, and BEVVG527) cluster loosely. The Rhinoviruses (HRV) are not near any of the other Picornaviruses.
Figure 4: Phylogenetic tree of the Picornavirus constructed using distances between the complete genome
sequences as computed by our heuristic. (Scale bar: 0.10.)
Figure 5: Phylogenetic tree of thirty-eight Picornaviruses.
Figure 5 shows the resulting tree of a phylogenetic study on thirty-eight picornavirus genomes. The
program pairwiseAlignments.pl running under PVM configured with fifty workstations performed a total
of 1444 pairwise alignments on the genomes. The program align2distanceMatrix.pl constructed a distance
matrix based on the alignment files. Programs in the phylogenetic analysis package Phylip constructed
the tree from the distance matrix. Note that in this vertically oriented tree, it is the vertical component
of the branches which corresponds to the distance from putative ancestors at the branching points. This
tree contains the same clustering patterns as those of the tree of twenty picornaviruses. It appears that the
method clusters the sequences well according to their relatedness, and the walking tree may prove useful for
automating other such comparisons given raw sequence data.
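The task count above can be reproduced with a short sketch. The count 1444 equals 38 squared, which is consistent with aligning every ordered pair of genomes, including each genome with itself; that reading is an assumption, as is the round-robin split across workers. The sketch does not use PVM and does not reflect how pairwiseAlignments.pl actually schedules its work.

```python
from itertools import product


def pairwise_tasks(n_genomes):
    """All ordered (pattern, text) index pairs, self-pairs included."""
    return list(product(range(n_genomes), repeat=2))


def split_round_robin(tasks, n_workers):
    """Deal tasks out to workers one at a time, like dealing cards."""
    return [tasks[w::n_workers] for w in range(n_workers)]


tasks = pairwise_tasks(38)
shares = split_round_robin(tasks, 50)
sizes = [len(share) for share in shares]
```

Because the alignments are independent of one another, each workstation can run the sequential heuristic on its share without communicating with the others.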
The importance of studying methods that are capable of aligning genetic sequences that include inversions
and translocations is evidenced by the frequent mention of the rearrangement of the order and orientation of
genes between organisms in the literature, for example [7] [4] [8]. Further the known order and orientation
of genes on 16 mitochondrial genomes has recently been used to construct a phylogenetic tree [10]. With
walking tree heuristics we can construct such phylogenetic trees easily from either the DNA sequences or
from the gene positions and orientations.
Alignment and Gene Finding
To demonstrate the utility of our heuristic, we use it to align two pairs of sequences. The result of the
heuristics discussed in this paper is an alignment for each character of the pattern into the text. We are
developing tools to filter out the uninteresting regions of the alignment and leave only the interesting regions.
Currently we use two simple filters. The first filter selects regions that are aligned with no gaps. A minimum
length is specified and only regions of the alignment that are at least the minimum length with no gaps are
shown. The second filter selects only regions that have a percent identity greater than a specified minimum
percent identity. In Figures 6 and 7, the regions that pass through both filters are shown with red bars
connecting regions that align directly and blue bars connecting regions that are the inverse complement of
one another. The intensity of the color increases as the percent identity increases.
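The two filters can be sketched as follows. This is a simplified illustration with hypothetical function names: it assumes the alignment is given as a list f of 0-based text positions, one per pattern character, and it handles direct (non-inverted) regions only.

```python
def gap_free_regions(f, min_len):
    """First filter: maximal runs where consecutive pattern characters map
    to consecutive text positions, kept only if at least min_len long.

    Returns (start, end) pattern index ranges with end exclusive.
    """
    regions = []
    start = 0
    for i in range(1, len(f) + 1):
        # A run ends at the end of the alignment or where the text position jumps.
        if i == len(f) or f[i] != f[i - 1] + 1:
            if i - start >= min_len:
                regions.append((start, i))
            start = i
    return regions


def percent_identity(pattern, text, f, start, end):
    """Percentage of aligned positions in [start, end) with equal characters."""
    same = sum(pattern[i] == text[f[i]] for i in range(start, end))
    return 100.0 * same / (end - start)


def filter_regions(pattern, text, f, min_len, min_pct):
    """Apply both filters: gap-free, long enough, and identity above min_pct."""
    return [(s, e) for s, e in gap_free_regions(f, min_len)
            if percent_identity(pattern, text, f, s, e) > min_pct]


pattern = "ACGTACGT"
text = "ACGTTTTTACGA"
f = [0, 1, 2, 3, 8, 9, 10, 11]  # two gap-free blocks separated by a gap

blocks = gap_free_regions(f, min_len=4)
kept = filter_regions(pattern, text, f, min_len=4, min_pct=80.0)
```

In this toy case both blocks pass the length filter, but the second is only 75 percent identical and is dropped by the identity filter.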
Histone gene cluster of X. laevis
We use two histone gene clusters from Xenopus laevis to demonstrate aligning sequences with inversions.
The histone gene cluster from X. laevis with GenBank accession number X03017 is 14942 base pairs in
length and the histone gene cluster from X. laevis with GenBank accession number X03018 is 8592 base
pairs long [7]. The orientation of exons H2A and H3 in X03017 is inverted relative to the orientation of exons H2A
and H3 in X03018. Our heuristic constructed the alignment shown in Figure 6. The positions of the histone
genes H2A, H2B, H3, and H4 are marked on both sequences. Notice that the genes H3 and H4 appear twice
in the left sequence, X03017, in the figure and both copies are aligned with the single copy of the gene in the
right sequence, X03018. The alignment shows that the orientations of the H2A and H3 regions are reversed in
the two sequences. The regions of the two sequences that show the highest similarity are the corresponding
genes. The walking tree heuristic properly aligns the histone gene clusters from X. laevis because it is capable
of inverting and transposing subsequences in the alignment that it creates.
Mitochondrial DNA genomes
Our heuristic was used to align the mitochondrial genomes of Anopheles quadrimaculatus, GenBank
locus MSQNCATR, composed of 15455 bases and Schizosaccharomyces pombe, GenBank locus MISPCG
composed of 19431 bases. The resulting alignment is given in Figure 7. The two genomes are labeled with
the CDS features as given in the GenBank entries. The two entries have the cytochrome c oxidase 1 (COX-1),
cytochrome b (CYT-B), cytochrome c oxidase 2 (COX-2), ATPase-6, and ATPase-8 CDS features in
common. There are two striking features of this alignment. The first is that the introns that appear in the S.
pombe sequences for cytochrome c oxidase 1 and cytochrome b, but do not appear in the A. quadrimaculatus sequences,
are correctly represented in the alignment. The second striking feature of Figure 7 is the strong alignment
of the cytochrome c oxidase 3 (COX-3) region on the mitochondrial genome of A. quadrimaculatus with an
unlabeled region on the mitochondrial genome of S. pombe. We later found the cytochrome c oxidase subunit
3 from the mitochondria of S. pombe in GenBank with accession number X16868 [14] and it exactly matches
the region from bases 8959 to 9768 of the complete mitochondrial genome of S. pombe. This demonstrates
the capability of our heuristic to identify a previously unrecognized region of DNA by aligning similar regions
in other sequences. The heuristic correctly aligns each of these pairs of products with the exception of the
ATPase 8 product. ATPase 8 is a short region, 162 bases in the A. quadrimaculatus genome, and 147 bases
in the S. pombe genome. The simple filters that we currently use with the heuristic fail to identify this region
because it is short and not highly conserved relative to the rest of the genome. When the stringency
of the filters is relaxed to the level at which alignments appear for ATPase-8, the rest of the alignment becomes difficult
to see with simple filters due to the large number of short, less significant regions that align.
Figure 6: An alignment of two histone gene clusters from Xenopus laevis, GenBank loci XELHISH2 and
XELHISH3A. (Panels: X. laevis gene cluster X03017 and X. laevis gene cluster X03018, with gene positions marked.)
Figure 7: An alignment of the mitochondrial genomes of Anopheles quadrimaculatus, GenBank locus MSQNCATR, and Schizosaccharomyces pombe, GenBank locus MISPCG. This alignment was calculated using
about five minutes of computer time.
Figure 8: URL http://www.cs.orst.edu/~cavenej/chasmview.html provides on-line access to Walking Tree
software.
World Wide Web access
We have developed the web page shown in Figure 8, to allow others convenient use of the alignment
visualization software. The web page processes submitted DNA data and provides the user with a visualization of the alignment in PostScript format, suitable for viewing and high quality printing. The following
are examples of what can be produced with the web site.
Cytochrome C
Figure 9 shows a matching of the DNA sequences for genes for the protein Cytochrome C in two species.
Such comparisons reveal genetic relationships. It should be noted that other comparisons are possible,
particularly those of amino acid sequence comparison, which provide more information on protein function.
Figure 9: Cytochrome C - matching of DNA. (Sequences shown: D. melanogaster cytochrome C gene, M11381, and yeast (S. pombe) cytochrome c gene and flanks, J01318.)
Amino Acid Sequences
Figure 10 is a comparison of the amino acid sequences for the protein Cytochrome C in the two species,
based on matching the amino acids. Notice that the amino acid sequences seem to be much more highly
conserved than the underlying DNA sequences.
Mutated sequences
Figures 11 through 13 illustrate alignment capabilities of the Walking Tree with respect to inversions and
transpositions. Figure 11 shows the exact relationship between an artificial DNA sequence (on the left) and
the same sequence after repeated inversions and transpositions (right). Regions which are relatively inverted
are shown in red (bow-ties), and transposed regions are shown in green. Figure 12 shows a visualization
of raw results from a Walking Tree alignment for these two sequences. Inverse complementary matches
are shown in red, and direct matches are shown in green. Most of the features were detected. Figure 13
shows another visualization of the same Walking Tree alignment, with much of the noise filtered out by our
software.
CONCLUSION
Complete viral and organellar genomes have been sequenced and are in the biological sequence databases.
Complete sequences of bacterial and eukaryotic genomes are available. In the past, while aligning short
biological sequences, it was reasonable to model the evolution of biological sequences with the operations:
substitute one base for another, insert a base, and delete a base. Today, with the arrival of sequences of
complete genomes, the operations inversion and transposition need to be added in order to appropriately
model the evolution of these large biological sequences. In this paper we described a family of heuristics
designed to align biological sequences that may include inversions and transpositions. Our heuristics use time
approximately proportional to the product of the lengths of the sequences being aligned and use work space
approximately proportional to the length of the shorter of the sequences. As an example of the utility of
our heuristic, it was used successfully to align the complete mitochondrial genomes of Schizosaccharomyces
pombe and Anopheles quadrimaculatus. This alignment shows the previously unreported location of the
cytochrome c oxidase subunit 3 on the mitochondrial genome of S. pombe and was completed using about
five minutes of computer time.
Figure 10: Cytochrome C - matching of amino acids. (Sequences shown: CYTOCHROME C, 1169167, and cytochrome C, 157161.)
Figure 11: Artificial DNA sequence and mutated copy
Figure 12: A Walking Tree alignment (raw)
Figure 13: A Walking Tree alignment (filtered)
REFERENCES
[1] Blattner, F. et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277:1453-1474.
[2] Cavener, J. 1997. Visualization, Implementation, and Application of the Walking Tree Heuristics for Biological
String Matching. M.S. Thesis, Department of Computer Science, Oregon State University, Corvallis OR.
[3] Caprara, A. 1997. Sorting by reversals is difficult. RECOMB 97 75-83. ACM Press, New York.
[4] Devos, K. M., Atkinson, M. D., Chinoy, C. N., Francis, H. A., Harcourt, R. L., Koebner, R. M. D., Liu, C. J.,
Masojć, P., Xie, D. X., & Gale, M. D. 1993. Chromosomal rearrangements in the rye genome relative to that of
wheat. Theoretical and Applied Genetics 85:673-680.
[5] Holloway, J. L., and Cull, P. 1994. Aligning genomes with inversions and swaps. Second International Conference
on Intelligent Systems for Molecular Biology. Proceedings of ISMB'94 195-202. AAAI Press, Menlo Park CA.
[6] Holloway, J. L. 1992. String Matching Algorithms with Applications in Molecular Biology. Ph.D. Dissertation,
Oregon State University, Corvallis, OR.
[7] Perry, M., Thomsen, G. H. and Roeder, R. G. 1985. Genomic organization and nucleotide sequence of two distinct
histone gene clusters from Xenopus laevis. Journal of Molecular Biology 185:479-499.
[8] Prombona, A. and Subramanian, A. R. 1989. A new rearrangement of angiosperm chloroplast DNA in rye (Secale
cereale) involving translocation and duplication of the ribosomal rpS15 gene. The Journal of Biological Chemistry
264(32):19060-19065.
[9] Python, M. 1989. Just The Words. Methuen, London.
[10] Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F., & Cedergren, R. 1992. Gene order comparisons for
phylogenetic inference: Evolution of the mitochondrial genome. Proceedings of the National Academy of Sciences
89:6575-6579.
[11] Schöniger, M. and Waterman, M. S. 1992. A local algorithm for DNA sequence alignment with inversions. Bulletin
of Mathematical Biology 54:521-536.
[12] Setubal, J., Meidanis, J. 1996. Introduction to Computational Molecular Biology. PWS Publishing, Boston MA.
[13] Stanway, G. 1990. Structure, function, and evolution of picornaviruses. Journal of General Virology 87:2483-2501.
[14] Trinkl, H. and Wolf, K. 1989. Nucleotide sequence of the gene encoding subunit 3 of cytochrome c oxidase (cox3)
in the mitochondrial genome of Schizosaccharomyces pombe strain ef1. Nucleic Acids Research 17(23):10104.