Walking Tree Heuristics for Biological String Alignment, Gene Location, and Phylogenies P. Cull , J. L. Holloway y z and J. D. Cavener Computer Science Dept., Oregon State University, Corvallis, OR 97331 USA, pc@cs.orst.edu y Zymogenetics, 1201 Eastlake Av., Seattle, WA 98102 USA, holloway@zgi.com z Computer Science Dept., Oregon State University, Corvallis, OR 97331 USA, cavenej@cs.orst.edu Abstract. Basic biological information is stored in strings of nucleic acids (DNA, RNA) or amino acids (proteins). Teasing out the meaning of these strings is a central problem of modern biology. Matching and aligning strings brings out their shared characteristics. Although string matching is well-understood in the edit-distance model, biological strings with transpositions and inversions violate this model's assumptions. We propose a family of heuristics called walking trees to align biologically reasonable strings. Both edit-distance and walking tree methods can locate specic genes within a large string when the genes' sequences are given. When we attempt to match whole strings, the walking tree matches most genes, while the edit-distance method fails. We also give examples in which the walking tree matches substrings even if they have been moved or inverted. The edit-distance method was not designed to handle these problems. We include an example in which the walking tree \discovered" a gene. Calculating scores for whole genome matches gives a method for approximating evolutionary distance. We show two evolutionary trees for the picornaviruses which were computed by the walking tree heuristic. Both of these trees show great similarity to previously constructed trees. The point of this demonstration is that WHOLE genomes can be matched and distances calculated. The rst tree was created on a Sequent parallel computer and demonstrates that the walking tree heuristic can be eeciently parallelized. The second tree was created using a network of work stations and demonstrates that there is suent parallelism in the phylogenetic tree calculation that the sequential walking tree can be used eectively on a network. Keywords: sequence alignment, genome alignment, heuristic, Picornavirus, inversions, swaps, dynamic programming, translocations, walking tree, phylogeny INTRODUCTION The organization of the eukaryote genome is the major scientic enigma of our era. In the last decade there have been major advances in determining the sequential structure of DNA. Massive amounts of data on a variety of organisms are daily pouring out of sequencing labs. Partial genetic maps are available for many organisms. Full maps have recently been produced for several bacteria [1]. Detailed maps of the human genome are in the works and should be available in the next decade. This avalanche of data will require fast computer methods for analysis and data retrieval. Present methods for string matching are based on the edit-distance model [12]. These methods assume that dierences between strings are due to local changes, but there is evidence that larger scale changes can occur [4]. In particular, swaps, in which large pieces of DNA are moved from one location to another, and inversion, in which a sequence is replaced by its reversed complement, do not t neatly into the edit-distance model. Recently, there have been some attempts to extend the edit-distance model to include inversions [11], but the extension will only handle one level of inversion and cannot handle inversions within inversions. Further, it seems unlikely that any simple model will be able to capture the minimum biologically correct distance between two strings. In all likelihood nding the fewest operations that have to be applied to one string to obtain another string will probably require trying all possible sequences of operations. Of course, trying all possible sequences will ctatctactatgcggagcctagagtggcagtc best score best position current position left best score left best pos right best score right best pos Figure 1: Data structure used to align the pattern within the text. In this picture, each leaf node represents 8 characters of the pattern, each of the internal nodes represents 16 characters of the pattern, and the root node represents the entire pattern. Each of the nodes contains the elds shown in the expanded node. be computationally intractable, that is, any computer program based on this sort of exhaustive search will require millenia to process strings of biologically reasonable size. This intuition of intractability has been conrmed by a recent proof that determining the minimum number of ips needed to sort a sequence is an NP-complete problem [3]. WALKING TREE HEURISTICS Basic Heuristic Our problem is to create a heuristic that nds an approximate biologically reasonable alignment between two strings, one called pattern, and the other called text. Our metaphor is to consider the data structure for the basic heuristic as a walking tree [9] (see Figure 1) with jP j leaves, one for each character in the pattern. When the heuristic is considering position l + 1 of the text, the leaves of the tree are positioned over the jP j contiguous characters of the text up to and including character l + 1. The leaves also remember some of the information for the best alignment within the rst l characters of the text. On the basis of this remembered information and the comparisons of the leaves with the text characters under them, the leaves update their information and pass this information to their parents. The data will percolate up to the root where a new best score is calculated. The tree can then walk to the next position by moving each of its leaves one character to the right. The whole text has been processed when the leftmost leaf of the walking tree has processed the rightmost character of the text. To dene a scoring system that captures some biological intuitions, we currently use a function that gives a positive contribution based on the similarity between aligned characters, and a negative contribution that is related to the number and length of gaps, translocations, and inversions. A gap in an alignment occurs when adjacent characters in the pattern are aligned with non{adjacent characters in the text. The length of the gap is the number of characters between the non{adjacent characters in the text. The computation at each leaf node makes use of two functions, MATCH and GAP. MATCH looks at the current text character and compares it with the pattern character represented by the leaf. In the simplest case one could use Tj MATCH (Pi ; Tj ) = c0 ifif PPi = 6= T i j For many of our examples we use c = 3. If one were matching amino acid strings then MATCH could return a value depending on how similar the two compared amino acids are. There are standard methods like PAM matrices for scoring this similarity. If we only used the MATCH function, the leaves would simply remember if they had ever seen a matching character in the text. We use the GAP function to penalize the match for being far from the current position of the tree. So the leaf needs to remember both if it found a match and the position of the walking tree when it found a match. For example, a simple GAP function could be: GAP (currentpos; pos) = jcurrentpos , posj: Then the leaf could compute SCORE := max(MATCH (Pi ; Tj ); SCORE , GAP (currentpos; pos)) and update pos to currentpos if MATCH is maximal. This means that a leaf will forget an actual match if it occured far from the current position. An internal node will only look at what it is remembering and at what its children have computed. Like a leaf node, the internal node computes SCORE , GAP (currentpos; pos) which depends on what the node is remembering. From its two children, the node computes SCORE:left + SCORE:right , GAP (pos:left; pos:right): This will penalize the sum of the children's scores because the position for the two scores may be dierent. But, the internal node also has to penalize this score because the left or right position may be far from the current position. So it also subtracts min(GAP (currentpos; pos:left); GAP (currentpos; pos:right)): At this point we notice that there is an asymmetry that should be built into GAP, because having the left position before the right position should be favored. Without going into details, we usually make GAP asymmetric. The internal node will keep the better of the score from its remembered data and the score computed from its children. The walking tree will be computing a score for the current position of the tree. It is possible that it could forget a better score that was far from the present position. To avoid this problem, we add a special root node which simply keeps track of the best score seen so far. The following diagrams (see Figure 2) give a simplied picture of the operation of the walking tree. In these pictures, each node has SCORE; BestSCORE; pos: The rst diagram shows the tree at position 4. That is, the leaves are looking at the rst 4 characters in the text. The pattern is ABCD, and each character of this pattern is represented by one leaf. The text is ABXXCD, and so at position 4 the tree is looking at ABXX. For this example, Tj MATCH (Pi ; Tj ) = 30 ifif PPi = 6i = Tj and y GAP (x; y) = 0jx , yj + 1 ifif xx 6= =y : The succeeding pictures show the walking tree in positions 5 and 6. Notice that in each position, the new scores computed at the leaves percolate up through the tree and new scores are computed at each node as necessary. For example, at position 6, the leaf node with A sees an X and computes 0 as its MATCH, and since MATCH BestSCORE , GAP , this leaf changes pos to 6. On the other hand, the internal node above this leaf does not change its pos because its remembered BestSCORE minus the GAP is larger than the new value computed from the updated values in its children. The walking tree nds an alignment f so that f : [1; :::; jP j] ! [1; :::; jT j]: 6, 6, 4 6, 6, 4 0, 0, 4 3, 3, 4 3, 3, 4 0, 0, 4 0, 0, 4 A B C D A B X X C D 4, 6, 4 4, 6, 4 A 0, 0, 5 1, 3, 4 1, 3, 4 0, 0, 5 0, 0, 5 A B C D B X X C D 9, 9, 6 3, 6, 4 A B 6, 6, 6 0, 0, 6 0, 0, 6 3, 3, 6 3, 3, 6 A B C D X X C D Figure 2: Simplied walking tree for pattern ABCD at positions 4, 5, and 6 of the text ABXXCD. The alignment found approximately maximizes jP j X i=1 MATCH (Pi ; Tf (i)) , jPX j,1 i=1 GAP (f (i); f (i + 1)): The actual fuctional being maximized is best described by the heuristic itself and has no simple formula. Among other reasons, the above formula is only an approximation because the heuristic tries to break strings into substrings whose lengths are powers of 2 even when using other lengths would increase the value of the formula. Performance Characteristics The basic heuristic computes the score of an alignment with no inversions. We modify this heuristic in three ways: 1) to compute a better placement of gaps, 2) to construct the alignment, and 3) to use inversions in the alignment. The extensions to the basic heuristic may be used individually or in combination. We have shown the following resource usage results for the heuristic computing only the score of an alignment with inversions in [6]. The heuristic will execute in time proportional to the product of the length of the text and the length of the pattern. The work space used by the heuristic is proportional to the length of the pattern. The work space used is independent of the length of the text. The heuristic underestimates the actual alignment score of a pattern at a given position in the text by, at most, the sum of the gap penalties in the alignment at that position. We have shown the following resource usage results for the heuristic for constructing an alignment with inversions in [6]. In worst case, the heuristic to construct the alignment will run in O(jT jjP j log jP j) time given a text string T and a pattern P . In practice (see Figure 3), alignment construction takes O(jT jjP j) time, as the log jP j factor for constructing the alignment does not appear since a \better" alignment, requiring that the best alignment be updated, is only rarely found as the walking tree moves along the text. Work space of O(jP j log jP j) is required by the heuristic to construct an alignment given a text string T and a pattern P . The heuristic never needs to go backwards in the text. Adjusting Gaps The basic heuristic places gaps close to their proper positions. If we use the heuristic to align the string \ABCDEF" in the string \ABCXXDEF" the gap may be placed between `B' and `C', rather than between `C' and `D'. This is a result of the halving behavior of the basic heuristic. By searching in the vicinity of the position that the basic heuristic places a gap we can nd any increase in score that can be obtained by sliding the gap to the left or right. The cost of nding better placements of the gaps is a factor of log jP j increase in running time since at each node we have to search a region of the text of length proportional to the size of the substring represented by the node. 10000 Time in CPU seconds 1000 100 10 1 0.1 0.01 0.001 10 100 1000 Size of sequences aligned 10000 Figure 3: CPU time used by our heuristic to align pairs of equal length sequences on a Sun SPARC{10. Including Inversions An inversion occurs when a substring, P1 = pi pi+1 pj , is matched with text that has the form pj pj,1 pi . We use pi to indicate the complement of pi . A translocation occurs when a substring P1 is matched with a text substring T1, and a substring P2 is matched with a text substring T2 , but P1 occurs before P2 in the pattern string, while T1 occurs after T2 in the text string. The basic heuristic can be mod- ied to nd alignments when substrings of the pattern need to be inverted to match substrings of the text. The idea is to invert the pattern and move, along the text, a walking tree of the inverted pattern in parallel with the walking tree of the original pattern. When the match score of a region of the inverted pattern is suciently higher than the match score of the corresponding region of the pattern, the region of the inverted pattern is used to compute the alignment score. The introduction of an inversion can be penalized using a function similar to the gap penalty function. Note that inversions are incorporated into both the walking tree and the inverted walking tree so that it is possible to have inversions nested within a larger inverted substring. Constructing the alignment The basic heuristic will report the score and position of the alignment, but does not give enough information to construct the alignment. Alignment construction introduces a factor of log jP j into theoretical time and space usage which does not occur in practice. An additional eld to hold the best alignment of the pattern substring represented by a node is added to each node of the walking tree. Note that we must save the entire alignment and not just pointers to the alignments of a node's children because the maximum scoring alignments of the children may change without the maximum scoring alignment of the node changing. When the modied heuristic completes, the alignment that produced the best score will be stored in the root node of the walking tree. Parallelization We have implemented and optimized our heuristic on a single node of a Meiko CS{2 computer. A node consists of one Texas Instruments SuperSPARC scalar processor and two Fujitsu VP vector processors. In practice, we let each leaf of the walking tree represent more than a single character, typically between 30 and 100 characters. This does not decrease the number of character comparisons that the heuristic performs since the walking tree is still moving one character at a time across the text, but it does decrease the size of the walking tree. The smaller walking tree decreases the time required to align two sequences. Using the vector-optimized heuristic on one node of the Meiko CS{2, two 8192-base sequences can be aligned in less than 1.5 CPU minutes, and aligning a pair of sequences of length 32768 requires less than 25 minutes of CPU time. This agrees with the predicted run time which says that increasing both the pattern and the text by a factor of 4 should increase the run time by a factor of 16. Extrapolating, an alignment of sequences one million bases long may take two weeks, but fortunately our parallel implementations have shown near-linear speedup. The table below shows speedup results on a 28 processor shared memory Sequent Balance 21000. Processors Speedup 1 1.0 2 1.97 4 3.83 8 7.26 12 10.5 16 13.6 20 16.7 It may also be possible to construct a \smart" disk controller based on our walking tree heuristic. The heuristic uses only a few simple operations and never needs to back up in the text. Each processor would need a small, constant sized memory, and would need to communicate with at most four other processors. Such a disk controller would allow database search heuristics to start with only the sequences that are similar to the query sequence. EXAMPLES Phylogenetic Analysis Picornaviridae is a family of single stranded RNA viruses that are 7.2 to 8.4 kb in length. It is composed of the ve genera Aphthovirus, Cardiovirus, Enterovirus, Hepatovirus, and Rhinovirus. The RNA typically codes for four major polypeptides and several proteases. We selected the sequences from GenBank release 81.0 (February 1994) with the keywords \Picornaviridae", \complete", \genome", and \sequence". From these GenBank entries we selected the twenty entries that contained a complete Picornavirus genome sequence. Using the heuristic, we computed an alignment score between each pair of sequences. The alignment score, s, is converted to a normalized \distance", d, using s d = 1 , max s where maxs is the maximum possible alignment score for the pattern sequence. We then used the Fitch{ Margoliash distance matrix method as implemented by Joe Felsenstein in the Phylip 3.53c package to construct the phylogenetic tree in Figure 4. The phylogeny presented in Figure 4 is based on the complete viral genomes and is nearly identical to the phylogeny presented by [13] based on the P1 (capsid{encoding) regions of each genome. The Hepatovirus genera (HPAACG, HPACG, HPA, HPAA) cluster tightly as expected since they are nearly identical sequences. The Cardioviruses (EVCGAA, EMCDCG, EMCBCG, MNGPOLY, TMECG, TMEPP, TMEVCPLT, and TMEGDVCG) form three groups, the encephalomyocarditis viruses, a mengovirus, and the Theiler murine encephalomyelitis viruses. The Enteroviruses (SVDG, CXB5CGA, CXB4S, CXA21CG, POLIOS1, CXA24CG, and BEVVG527) cluster loosely. The Rhinoviruses (HRV) are not near any of the other Picornaviruses. EVCGAA EMCDCG EMCBCG TMECG TMEPP MNGPOLY TMEVCPLT TMEGDVCG HPAACG HPACG HPAA HRV HPA SVDG CXB5CGA BEVVG527 CXB4S CXA24CG CXA21CG POLIOS1 .10 Figure 4: Phylogenetic tree of the Picornavirus constructed using distances between the complete genome sequences as computed by our heuristic. Figure 5: Phylogenetic tree of thirty-eight Picornaviruses. Figure 5 shows the resulting tree of a phylogenetic study on thirty-eight picornavirus genomes. The program pairwiseAlignments.pl running under PVM congured with fty workstations performed a total of 1444 pairwise alignments on the genomes. The program align2distanceMatrix.pl constructed a distance matrix based on the alignment les. Programs in the phylogenetic analysis package Phylip constructed the tree from the distance matrix. Note that in this vertically oriented tree, it is the vertical component of the branches which corresponds to the distance from putative ancestors at the branching points. This tree contains the same clustering patterns as those of the tree of twenty picornavirus. It appears that the method clusters the sequences well according to their relatedness, and the walking tree may prove useful for automating other such comparisons given raw sequence data. The importance of studying methods that are capable of aligning genetic sequences that include inversions and translocations is evidenced by the frequent mention of the rearrangement of the order and orientation of genes between organisms in the literature, for example [7] [4] [8]. Further the known order and orientation of genes on 16 mitochondrial genomes has recently been used to construct a phylogenetic tree [10]. With walking tree heuristics we can construct such phylogenetic trees easily from either the DNA sequences or from the gene positions and orientations. Alignment and Gene Finding To demonstrate the utility of our heuristic, we use it to align two pairs of sequences. The result of the heuristics discussed in this paper is an alignment for each character of the pattern into the text. We are developing tools to lter out the uninteresting regions of the alignment and leave only the interesting regions. Currently we use two simple lters. The rst lter selects regions that are aligned with no gaps. A minimum length is specied and only regions of the alignment that are at least the minimum length with no gaps are shown. The second lter selects only regions that have a percent identity greater than a specied minimum percent identity. In Figures 6 and 7, the regions that pass through both lters are shown with red bars connecting regions that align directly and blue bars connecting regions that are the inverse complement of one another. The intensity of the color increases as the percent identity increases. Histone gene cluster of X. laevis We use two histone gene clusters from Xenopus laevis to demonstrate aligning sequences with inversions. The histone gene cluster from X. laevis with GenBank accession number X03017 is 14942 base pairs in length and the histone gene cluster from X. laevis with GenBank accession number X03018 is 8592 base pairs long [7]. The orientation of exons H2A and H3 in X03017 is inverted to the orientation of exons H2A and H3 in X03018. Our heuristic constructed the alignment shown in Figure 6. The position of the histone genes H2A, H2B, H3, and H4 are marked on both sequences. Notice that the genes H3 and H4 appear twice in the left sequence, X03017, in the gure and both copies are aligned with the single copy of the gene in the right sequence, X03018. The alignment shows that the orientation of H2A and H3 regions are reversed in the two sequences. The regions of the two sequences that show the highest similarity are the corresponding genes. The walking tree heuristic properly aligns the histone gene clusters from X. laevis because it is capable of inverting and transposing subsequences in the alignment that it creates. Mitochondrial DNA genomes Our heuristic was used to align the mitochondrial genomes of Anopheles quadrimaculatus, GenBank locus MSQNCATR, composed of 15455 bases and Schizosaccharomyces pombe, GenBank locus MISPCG composed of 19431 bases. The resulting alignment is given in Figure 7. The two genomes are labeled with the CDS features as given in the GenBank entries. The two entries have the cytochrome c oxidase 1 (COX{ 1), cytochrome b (CYT{B), cytochrome c oxidase 2 (COX{2), ATPase{6, and ATPase{8 CDS features in common. There are two striking features of this alignment. The rst is that the introns that appear in the S. H4 13750.0 H3 12500.0 H1B 11250.0 X. laevis gene cluster X03017 10000.0 8750.0 7500.0 6250.0 6250.0 5000.0 5000.0 3750.0 3750.0 2500.0 2500.0 1250.0 1250.0 H4 H3 H2B H2A H2B H4 H3 H2A 0.0 H1A X. laevis gene cluster X03018 7500.0 0.0 Figure 6: An alignment of two histone gene clusters from Xenopus laevis, GenBank loci XELHISH2 and XELHISH3A. 18750.0 COX-2 ATPase-9 17500.0 ATPase-8 ATPase-6 15000.0 15000.0 13750.0 13750.0 CYT-B 12500.0 12500.0 CYT-B 11250.0 11250.0 NADH-1 NADH-6 CYT-B 10000.0 10000.0 8750.0 8750.0 7500.0 7500.0 NADH-5 NADH-3 6250.0 6250.0 COX-3 5000.0 5000.0 ATPase-6 ATPase-8 COX-2 3750.0 3750.0 2500.0 2500.0 1250.0 1250.0 NADH-4L NADH-4 COX-1 COX-1 COX-1 COX-1 S. pombe mitochondrial genome MISPCG A. quadrimaculatus mitochondrial genome MSQNCATR 16250.0 URF-A NADH-2 0.0 0.0 Figure 7: An alignment of the mitochondrial genomes of Anopheles quadrimaculatus, GenBank locus MSQNCATR, and Schizosaccharomyces pombe, GenBank locus MISPCG. This alignment was calculated using about ve minutes of computer time. Figure 8: URL http://www.cs.orst.edu/~cavenej/chasmview.html provides on-line access to Walking Tree software . pombe sequences for cytochrome c oxidase 1 and cytochrome b, but do not appear in the A. quadrimaculatus, are correctly represented in the alignment. The second striking feature of Figure 7 is the strong alignment of the cytochrome c oxidase 3 (COX{3) region on the mitochondrial genome of A. quadrimaculatus with an unlabeled region on the mitochondrial genome of S. pombe. We later found the cytochrome c oxidase subunit 3 from the mitochondria of S. pombe in GenBank with accession number X16868 [14] and it exactly matches the region from bases 8959 to 9768 of the complete mitochondrial genome of S. pombe. This demonstrates the capability of our heuristic to identify a previously unrecognized region of DNA by aligning similar regions in other sequences. The heuristic correctly aligns each of these pairs of products with the exception of the ATPase 8 product. ATPase 8 is a short region, 162 bases in the A. quadrimaculatus genome, and 147 bases in the S. pombe genome. The simple lters that we currently use with the heutistic fail to identify this region because it is short and not highly conserved relative to the rest of the genome. When relaxing the stringency of the lters to the level that alignments appear for ATPase{8, the rest of the alignment becomes dicult to see with simple lters due to the large number of short, less signicant regions that align. World Wide Web access We have developed the web page shown in Figure 8, to allow others convenient use of the alignment visualization software. The web page processes submitted DNA data and provides the user with a visualization of the alignment in PostScript format, suitable for viewing and high quality printing. The following are examples of what can be produced with the web site. Cytochrome C Figure 9 shows a matching of the DNA sequences for genes for the protein Cytochrome C in two species. Such comparisons reveal genetic relationships. It should be noted that other comparisons are possible, particularly those of amino acid sequence comparison, which provide more information on protein function. 1021 1021 970 970 919 919 868 868 817 817 766 766 715 715 664 664 613 613 562 562 511 511 460 460 409 409 358 358 307 307 256 256 205 205 154 154 103 103 52 52 1 1 D.melanogaster cytochrome C gene. M11381 cytochrome C Yeast (S. pombe) cytochrome c gene and flanks. J01318 cytochrome c Figure 9: Cytochrome C - matching of DNA Amino Acid Sequences Figure 10 is a comparison of the amino acid sequences for the protein Cytochrome C in the two species, based on matching the amino acids. Notice that the amino acid sequences seem to be much more highly conserved than the underlying DNA sequences. Mutated sequences Figures 11 through 13 illustrate alignment capabilities of the Walking Tree with respect to inversions and transpositions. Figure 11 shows the exact relationship between an articial DNA sequence (on the left) and the same sequence after repeated inversions and transpositions (right). Regions which are relatively inverted are shown in red (bow-ties), and transposed regions are shown in green. Figure 12 shows a visualization of raw results from a Walking Tree alignment for these two sequences. Inverse complementary matches are shown in red, and direct matches are shown in green. Most of the features were detected. Figure 13 shows another visualization of the same Walking Tree alignment, with much of the noise ltered out by our software. CONCLUSION Complete viral and organellar genomes have been sequenced and are in the biological sequence databases. Complete sequences of bacterial and eukaryotic genomes are available. In the past, while aligning short biological sequences, it was reasonable to model the evolution of biological sequences with the operations: substitute one base for another, insert a base, and delete a base. Today, with the arrival of sequences of complete genomes, the operations inversion and transposition need to be added in order to appropriately model the evolution of these large biological sequences. In this paper we described a family of heuristics 127 121 121 115 115 109 109 103 103 97 97 91 91 85 85 79 79 73 73 67 67 61 61 55 55 49 49 43 43 37 37 31 31 25 25 19 19 13 13 7 7 1 1 CYTOCHROME C. 1169167 cytochrome C. 157161 ??? 127 Figure 10: Cytochrome C - matching of amino acids 1021 1021 970 970 919 919 868 868 817 817 766 766 715 715 664 664 613 613 562 562 511 511 460 460 409 409 358 358 307 307 256 256 205 205 154 154 103 103 52 52 1 1 Figure 11: Articial DNA sequence and mutated copy 1021 1021 970 970 919 919 868 868 817 817 766 766 715 715 664 664 613 613 562 562 511 511 460 460 409 409 358 358 307 307 256 256 205 205 154 154 103 103 52 52 1 1 Figure 12: A Walking Tree alignment (raw) 1021 1021 970 970 919 919 868 868 817 817 766 766 715 715 664 664 613 613 562 562 511 511 460 460 409 409 358 358 307 307 256 256 205 205 154 154 103 103 52 52 1 1 Figure 13: A Walking Tree alignment (ltered) designed to align biological sequences that may include inversions and transpositions. Our heuristics use time approximately proportional to the product of the lengths of the sequences being aligned and use work space approximately proportional to the length of the shorter of the sequences. As an example of the utility of our heuristic, it was used successfully to align the complete mitochondrial genomes of Schizosaccharomyces pombe and Anopheles quadrimaculatus. This alignment shows the previously unreported location of the cytochrome c oxidase subunit 3 on the mitochondrial genome of S. pombe and was completed using about ve minutes of computer time. REFERENCES [1] Blattner, F. et al.1997. The complete genome sequence of Escherichia coli K{12. Science 277:1453{1474. [2] Cavener, J. 1997. Visualization, Implementation, and Application of the Walking Tree Heuristics for Biological String Matching. M.S. Thesis, Department of Computer Science, Oregon State University, Corvallis OR. [3] Caprara, A. 1997. Sorting by reversals is dicult. RECOMB 97 75{83. ACM Press, New York. [4] Devos, K. M., Atkinson, M. D., Chinoy, C. N., Francis, H. A., Harcourt, R. L., Koebner, R. M. D., Liu, C. J., Masojc, P., Xie, D. X., & Gale, M. D. 1993. Chromosomal rearrangements in the rye genome relative to that of wheat. Theoretical and Applied Genetics 85:673{680. [5] Holloway, J. L., and Cull, P. 1994. Aligning genomes with inversions and swaps. Second International Conference on Intelligent Sytstems for Molecular Biology. Proceedings of ISMB'94 195{202. AAAI Press, Menlo Park CA 1994 [6] Holloway, J. L. 1992. String Matching Algorithms with Applications in Molecular Biology. Ph.D. Dissertation, Oregon State Universtiy, Corvallis, OR. [7] Perry, M., Thomsen, G. H. and Roeder, R. G. 1985. Genomic organization and nucleotide sequence of two distinct histone gene clusters from xenopus laevis, Journal of Molecular Biology 185: 479{499. [8] Prombona, A. and Subramanian, A. R. 1989. A new rearrangement of angiosperm chloroplast DNA in rye (secale cereale) involving translocation and duplication of the ribosomal rpS15 gene, The Journal of Biological Chemistry 264(32): 19060{19065. [9] Python, M. 1989. Just The Words. Methuen, London. [10] Sanko, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F., & Cedergren, R. 1992. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proceedings of the National Academy of Science 89:6875{6579. [11] Schoniger, M. and Waterman, M.S. 1992. A local algorithm for DNA sequence alignment with inversions. Bulletin of Mathematical Biology 54:521{536. [12] Setubal, J., Meidanis, J. 1996. Introduction to Computational Molecular Biology. PWS Publishing, Boston MA. [13] Stanway, G. 1990. Structure, function, and evolution of picornaviruses. Journal of General Virology 87:2483{ 2501. [14] Trinkl, H. and Wolf, K. 1989. Nucleotide sequence of the gene encoding subunit 3 of cytochrome c oxidase (cox3) in the mitochondrial genome of schizosaccharomyces pombe strain ef1, Nucleic Acids Research 17(23): 10104.