Randomized approach to Distance Matrix calculation for Multiple Sequence Alignment

Vishal Thapar (Vishal.Thapar@uconn.edu)
BME 300 – Bioinformatics
Instructor: Prof. Richard Simon
(December 3rd, 2003)

Abstract: Rigorous alignment of multiple sequences becomes impractical even with a modest number of sequences [1]. A solution to the multiple sequence alignment (MSA) problem is important for biological research. Because traditional MSA algorithms have high time complexity, even today's fast computers cannot solve the problem for a large number of sequences. This paper evaluates the possibility of using a randomized approach to calculate the distance matrix for a multiple sequence alignment algorithm. To reduce time complexity, we evaluate a small, randomly selected portion of each sequence and compare it with similar portions collected randomly from all other sequences. The initial idea of randomization was taken from [2].

1 Introduction

Sequence alignment is one of the fundamental operations performed in computational biology research [3]. It is often necessary to evaluate more than two sequences simultaneously in order to determine the function, structure and evolution of different organisms. The Human Genome Project uses this technique to map and organize DNA and protein sequences into groups for later use. Significant research has been done in this area because of the need to align many sequences of varying length. Algorithms for this problem range from simple comparison and dynamic programming procedures to complex ones that rely on the underlying biological meaning of the sequences to align them more accurately. Since multiple sequence alignment is an NP-hard problem, practical solutions rely on clever heuristics. These algorithms constantly balance accuracy against speed.
Accurate algorithms need more processing time and can usually compare only a small number of sequences, whereas fast, less accurate ones can analyze many sequences in a reasonable amount of time.

The dynamic programming algorithm for pairwise sequence alignment was first introduced by Needleman and Wunsch [4]. Feng and Doolittle [5] developed an algorithm for multiple sequence alignment using a modified version of [4]. There are more complicated algorithms such as CLUSTAL W [6], which relies on a scoring system that is adjusted based on the local homology of the sequences. Progressive algorithms suffer from a lack of computational speed because of their iterative approach. Accuracy is also compromised because the progressive strategy is greedy: it can settle on a local minimum of the distance matrix score rather than the global minimum. Algorithms that rely significantly on biological information may also be at a disadvantage in some domains. Often it is not necessary to find the most accurate alignment between sequences, and in those cases specialized algorithms such as CLUSTAL W may be more than is needed. These algorithms also require some human intervention while they are optimizing results. This intervention has to come from biologists who are very familiar with the data, which limits the user base for such algorithms.

One of the more important uses of MSA is phylogenetic analysis [11]. Phylogenetic trees are the basis for understanding evolutionary relationships between species. To build a phylogenetic tree, orthologous sequences are entered into the database, the sequences are aligned, pairwise phylogenetic distances are calculated, and a hierarchical tree is computed using a clustering algorithm, as shown in [8].

Many algorithms maximize accuracy and do not concern themselves with speed.
Few successful improvements in CPU time have been made since the Feng and Doolittle method [5] was proposed [7]. Our approach reduces CPU time by randomizing part of the multiple sequence alignment: it calculates the distance matrix for star alignment by randomly selecting small portions of the sequences and aligning them. Because the randomly selected portion is significantly shorter than the full sequence, the running time is reduced significantly.

2 Survey of Literature

This section summarizes the literature surveyed for this paper and lists some competing algorithms and applications that are in use today.

2.1 CLUSTAL W

CLUSTAL W improves on the progressive approach invented by Feng and Doolittle [5]. It improves the sensitivity of multiple sequence alignment without sacrificing speed and efficiency [6], where speed and efficiency are measured relative to the Feng and Doolittle [5] style of progressive algorithm. It will be shown that our algorithm is actually faster in theoretical running time than CLUSTAL W. CLUSTAL W differs from conventional algorithms in that it allows genetic information to be included in distance matrix calculations. In other words, it does not limit the match/mismatch scores to constants but allows them to change based on a number of criteria set by the user [6]. CLUSTAL W applies different weight matrices at each comparison step based on the homogeneity of the sequences being compared and their evolutionary distances. It is divided into three stages.

(1) A fast approximation algorithm is used to evaluate alignment scores. The idea is that errors made in alignment during this stage will be corrected in later stages by more accurate weights.

(2) Unrooted trees are calculated using the neighbor-joining method [6].
Each sequence is a branch in this tree. Each sequence gets a weight proportional to its distance from the root. It also receives a proportion of the weight from any other sequence with which it shares some similarity.

(3) This stage is called progressive alignment. The guide tree is used to combine sequences into larger and larger pairwise alignments. Sequences are selected starting from the tips of the tree and moving toward the root. At each stage a full dynamic programming algorithm is used to calculate the weight matrix and introduce gaps [6].

Proper weighting is achieved by giving one sequence a weight of 1.0 and the rest weights below that. Groups of closely related sequences receive lower weights and thus do not "over-influence" the final alignment. CLUSTAL W is highly accurate: it gives near-optimal results for a data set with more than 35% identical pairs. For divergent sequences it is difficult to find a proper weighting scheme, and the resulting alignment is not as good.

2.2 MSA using Hierarchical Clustering

Hierarchical clustering is a very interesting heuristic for MSA. It is a rather old approach in the fast-changing field of bioinformatics. It borrows a technique that is used in bioinformatics but is most common in data mining [9, 10]. This approach uses hierarchical clustering along with pairwise alignment to align similar sequences. The sequences are clustered hierarchically using a weight matrix. At each step, groups or clusters of sequences are aligned together into larger clusters until all of them form one group.

Distance matrix calculation is central to this approach. First, a distance matrix is calculated over every possible pairwise alignment of sequences. This step could use a fast pairwise alignment algorithm such as [2].
The two sequences $S_i$ and $S_j$ with the lowest alignment score are chosen from the matrix and aligned with each other in one cluster. The $n \times n$ matrix is then replaced with an $(n-1) \times (n-1)$ matrix by deleting row j and column j, and row i is replaced with the average score of i and j [8]. This process continues until all sequences are aligned and form one cluster. The algorithm takes $O(N(N-1)M^2)$ time, where N is the number of sequences and M is the length of the aligned sequences [8]. This is not nearly as fast as what we are trying to achieve. Since this algorithm also uses a distance matrix calculation, the algorithm proposed here could reduce its running time as well.

2.3 MAFFT: Fast Fourier Transform based approach

The fast Fourier transform (FFT) is used to determine homologous regions rapidly. Amino acid sequences are first converted into sequences of volume and polarity values [7]. MAFFT implements two FFT-based approaches: a progressive method and an iterative refinement method. In this method, the correlation between two amino acid sequences is calculated using the FFT. A high correlation value indicates that the sequences may have homologous regions [7]. The program also has a sophisticated scoring system for the similarity matrix and gap penalties. Like CLUSTAL W, it uses guide trees and similarity matrices.

The results presented in [7] indicate that FFT-based algorithms are significantly faster than CLUSTAL W and T-COFFEE. It is important to note that all of these algorithms are still polynomial-time algorithms and thus behave similarly on a log-scaled graph; the FFT-based approach differs only in having a lower coefficient. From a complexity point of view, therefore, it is not significantly better than the other approaches.

2.4 Other approaches to MSA

There are many other innovative approaches to MSA.
Stochastic processes are also used to perform MSA. Simulated annealing and genetic algorithms [11] are classic stochastic MSA algorithms. In these algorithms, two sequences are randomly aligned and their score is compared with the previous one [11]. If the new score is better it is kept; otherwise it is discarded.

Non-stochastic iterative algorithms are simple to understand. They rely on the logic that even a wrong alignment can be efficiently improved if it is realigned at a later stage. Berger and Munson's algorithm [1] is one such algorithm. It aligns the sequences randomly at first, then iteratively tries to find better results and updates the sequences until no further improvement can be achieved. Gotoh has described such an algorithm in [12]. It is a doubly nested iterative strategy with randomization that optimizes the weighted sum-of-pairs score with affine gap penalties [11].

There is also a relatively recent algorithm by Kececioglu, Lenhof, Mehlhorn, Mutzel, Reinert and Vingron [14], which treats the alignment problem as an integer linear program. With the polyhedral approach, variations of a basic problem can often be conveniently modeled by adding further constraints to the basic linear program [14]. This algorithm solves the MSA problem to optimality for non-trivial instances of 18 sequences or more.

3 Randomized Algorithm

The idea of randomized sampling for local alignment was proposed by Rajasekaran et al. [2]. As with other randomized algorithms, we try to show that instead of evaluating entire sequences of length N, we can achieve the same result by evaluating $N^{\epsilon}$ characters, where $0 < \epsilon < 1$. In theory, this procedure could come close to an order-of-magnitude reduction in running time.
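The sampling idea can be sketched in Java, the language of the implementation described in Section 4. This is a minimal illustration under stated assumptions, not the code used for the experiments: the class and method names (SampledScoreSketch, windowLength, globalScore, sampledScore) are hypothetical, the window length is assumed to be $N^{\epsilon}$ rounded up and clamped to [1, N], a linear gap penalty is assumed, and the scoring values (match +1, mismatch -1, gap -2) are the ones quoted in Section 4. Both sequences are cut at the same random start position and the two windows are scored with a standard Needleman-Wunsch recurrence.

```java
import java.util.Random;

/** Hypothetical sketch: score two sequences on one shared random window. */
final class SampledScoreSketch {

    /** Assumed window length: ceil(n^eps), clamped to the range [1, n]. */
    static int windowLength(int n, double eps) {
        int len = (int) Math.ceil(Math.pow(n, eps));
        return Math.max(1, Math.min(n, len));
    }

    /** Needleman-Wunsch global alignment score with a linear gap penalty. */
    static int globalScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap;                            // first row: gaps only
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;                            // first column: gaps only
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int best = prev[j - 1] + sub;             // align a[i-1] with b[j-1]
                best = Math.max(best, prev[j] + gap);     // a[i-1] against a gap
                best = Math.max(best, curr[j - 1] + gap); // b[j-1] against a gap
                curr[j] = best;
            }
            int[] tmp = prev;                             // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scores s and t on windows cut at the same random start position. */
    static int sampledScore(String s, String t, double eps, Random rng) {
        int n = Math.min(s.length(), t.length());
        if (n == 0) {
            throw new IllegalArgumentException("sequences must be non-empty");
        }
        int len = windowLength(n, eps);
        int start = rng.nextInt(n - len + 1);             // 0 .. n - len inclusive
        String sWin = s.substring(start, start + len);
        String tWin = t.substring(start, start + len);
        return globalScore(sWin, tWin, 1, -1, -2);        // scores quoted in Section 4
    }

    public static void main(String[] args) {
        String s = "ACGTACGTACGTACGT";
        String t = "ACGTTCGTACGAACGT";
        Random rng = new Random(42);
        System.out.println("full score:    " + sampledScore(s, t, 1.0, rng));
        System.out.println("sampled score: " + sampledScore(s, t, 0.5, rng));
    }
}
```

With ε = 1.0 the window covers the whole sequence and the method reduces to an ordinary global alignment score. If the window is instead meant to be a fixed fraction of the sequence length, as the description of ComputeScoreBetSeqIAndJ() in Section 4 suggests, only windowLength would need to change.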
Traditional algorithms take $O(M^2 N^2)$ time to create a distance matrix, where M is the number of sequences and N is the length of the aligned sequences. This follows because the traditional Needleman-Wunsch algorithm [4] requires $O(N^2)$ time to find the alignment score of any two sequences. There are M sequences, so scoring all pairwise combinations takes $M^2$ alignments. The total time taken by a Needleman-Wunsch type algorithm is therefore $O(M^2 N^2)$.

Our heuristic reduces the time spent on each pairwise alignment and, in effect, the overall time of any algorithm that requires distance matrix calculations. It selects a subsequence of length $N^{\epsilon}$ from sequence S, starting at a randomly selected position between $S_1$ and $S_{N - N^{\epsilon}}$. A subsequence of the same length, starting at the same position, is chosen from sequence T. These subsequences are aligned and the score is recorded. Since the length of the subsequences is $N^{\epsilon}$, the time to find a pairwise alignment is $O(N^{2\epsilon})$, which gives an overall time of $O(M^2 N^{2\epsilon})$. This is a significant reduction, provided the resulting distance matrix returns reliable and accurate scores.

Algorithm

Input: A file containing DNA or protein sequences separated by newline characters, and a value of ε.

Output: The distance matrix calculated for all of the sequences $T_1$ to $T_n$, and the total sum of distances for each sequence.

Algorithm:

(1) Read and store all sequences from the input file into an array.
(2) For all sequences $T_1$ to $T_n$ Do
    a. For all sequences $P_1$ to $P_n$ Do
        i. Select a random number R that serves as a starting point.
        ii. Select $|P_j|^{\epsilon}$ characters from $P_j$ starting at position $P_{jR}$.
        iii. Similarly, select the same number of characters from $T_i$ starting at position $T_{iR}$. Steps ii and iii result in two new sequences $P_j'$ and $T_i'$.
        iv. Use the Needleman-Wunsch algorithm to evaluate the pairwise alignment score of $P_j'$ and $T_i'$.
    b. Record the score from step a-iv in matrix M at $M(T_i, P_j)$.
    c. Increment j by 1.
(3) At the end of step 2, we have a complete matrix M with distance scores for each combination of sequences. Now sum the alignment scores row by row, where $\mathrm{Sum}_i = \sum_{j=1}^{n} M(T_i, P_j)$.
(4) Select the lowest score from $\mathrm{Sum}_i$ and use the corresponding sequence as the center of the star alignment.
(5) Repeat the same process for different values of ε.

Analysis

This algorithm is closely related to the Needleman-Wunsch algorithm for pairwise alignment. It requires a value of ε from the user along with an input file containing sequences of the same length. Step 1 reads the input from the input file. Step 2 loops over all possible combinations of sequences: it is repeated once for each of the n sequences, and step 2a also iterates through each of the n sequences, so step 2 performs $O(n^2)$ pairwise comparisons. After selecting a random number as a starting position, we select a subsequence from both sequences and align them using Needleman-Wunsch or any other pairwise alignment algorithm. For our purpose, step 2iv takes $O(|P_j|^{2\epsilon})$ time. The score is recorded in the appropriate cell of the distance matrix. Step 3 sums all pairwise alignment scores for a given sequence. The sequence with the lowest negative score or highest positive score is selected. The running time of the algorithm is $O(n^2 |P_j|^{2\epsilon})$.

4 Implementation

This section explains the implementation details of the algorithm on the Java platform. The algorithm uses a design from NeoBio [15]. The logic is simple and has been designed with future additions in mind. At present the algorithm uses a randomized form of the Needleman-Wunsch algorithm for alignment, but it can easily be extended to use any algorithm that globally aligns two sequences. The basic class framework is based on the NeoBio package [15]. The main classes are in the package TheMatrix. The classes are:

1. RandomMatrixCalculation.java: This class has the main method, which takes as input the file containing all the sequences to be aligned. The file can be in FASTA format or can simply be a sequence of characters. The scoring scheme is specified in this class, and the penalties for gap, match and mismatch can be set as desired. We used the standard convention of gap = -2, match = +1 and mismatch = -1 for our application; these values can be changed easily.

2. BasicScoringScheme.java: This class extends the abstract class ScoringScheme.java. It is used to set the scoring scheme, and the scheme can be made more sensitive by implementing the methods of the ScoringScheme class in whatever way the user requires. The use of abstract classes lets us change the scoring scheme, like the choice of alignment algorithm, dynamically based on user preference.

3. PairwiseAlignmentAlgorithm.java: This is also an abstract class. A variable of this type, "algorithm", is used throughout the program; at run time it is bound to an object of the algorithm chosen by the user (currently Needleman-Wunsch, but more can be added). Any class that extends this class implements loadAllsequenceFile(), which loads all the sequences from a file into memory, and computePairwiseAlignmentAll(), which contains the details of aligning all sequences in pairs. The implementations vary depending on which algorithm class extends this class.

4. CharFile.java: This class reads the sequences from disk into memory and stores them in the desired format. We store each sequence as a character array, and the arrays are stored in vectors (extendable arrays in Java).

5. IncompatibleScoringSchemeException.java and InvalidScoringMatrixException.java: These are taken from the NeoBio package [15]. They extend Java's Exception class and are used to display meaningful messages in case of errors.

6. NeedlemanWunsch.java: This is the major class. It extends PairwiseAlignmentAlgorithm and thus implements the methods described above in its own way. At run time, the variable of the abstract type PairwiseAlignmentAlgorithm is assigned an object of the NeedlemanWunsch class. Although methods are called on the PairwiseAlignmentAlgorithm type throughout the program, the methods actually executed at run time are those of this class. To add a new algorithm later, we can simply create one class that extends PairwiseAlignmentAlgorithm and implements the same methods in its own way, with no changes to the existing program. This is the basis for a flexible framework. The main methods implemented in this class are:
    a. ComputePairwiseAlignmentAll()
    b. ComputeScoreBetSeqIAndJ()

The first method reads the sequences one at a time and compares each to all the others by calling ComputeScoreBetSeqIAndJ() in a loop, recording the score for each comparison in the score matrix. The randomization step occurs in ComputeScoreBetSeqIAndJ(): based on a fixed value of ε between 0.0 and 1.0, the lengths of the two sequences to be compared are reduced, and then, starting from a random point, "n*ε" characters are taken from both sequences and compared using the standard Needleman-Wunsch algorithm. The output is recorded in a file, "Output.txt", along with the time elapsed for the computation of the matrix.

5 Results

We compare results from three different input files, given as Appendices A, B and C. For each input file we compare the lowest distance score sum for various values of ε. We also look at the time taken to evaluate the complete alignment (ε = 1.0) as opposed to ε < 1.0. Table 1 shows the sums of the distance scores for various values of ε.

Table 1: First run. Input in Appendix A; N = 9; |Si| = 600.

ε               1.00    0.90    0.80    0.70    0.60    0.50    0.40    0.30    0.20
S1              -738    -678    -635    -583    -486    -362    -304    -276    -230
S2              -656    -592    -553    -494    -424    -354    -287    -223    -206
S3              -980    -898    -796    -703    -627    -489    -387    -303    -219
S4              -914    -862    -740    -660    -576    -490    -432    -323    -225
S5             -1012    -913    -806    -721    -627    -532    -452    -302    -246
S6             -1194   -1080    -968    -894    -775    -608    -504    -382    -287
S7             -1076    -976    -840    -752    -676    -554    -433    -350    -304
S8             -1032    -951    -873    -797    -693    -578    -459    -367    -284
S9              -976    -860    -785    -730    -618    -525    -386    -286    -231
Run time (ms)   3687    3203    2360    1953    1516    1281     985    1157     609

The highlighted row (S2) in Table 1 shows that, for every value of ε, the lowest sum was consistently that of sequence S2. Even going as low as ε = 0.2 gave an accurate prediction of which sequence has the lowest sum. For ε = 0.2, the run time was only 1/6th of what it was for ε = 1.0. This gives a rough idea of the amount of time that could be saved with the randomized approach.

Table 2 shows the sums of the distance scores for the input in Appendix B. The highlighted entries for this input do not fall on a single sequence across all columns, which shows the kind of inaccuracy that can arise with the randomized approach. However, the majority of the values of ε gave the right result. It is not safe to take ε to be very low. For ε = 0.60, the right sequence was picked for the lowest sum, and the runtime reduction is a little more than ½ in this case.

Table 3 shows the distance matrix values for the input in Appendix C. The highlighted part of this table is S7 for all values of ε, which shows consistent results across different values of ε. For ε = 0.6, the runtime reduction is more than ½.
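The claims drawn from Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and method names (Table1Check, centerForColumn, runtimeFraction) are hypothetical, the numbers are copied from Table 1, the run-time row is assumed to belong to the Appendix A run, and the star center is taken to be the row whose sum is closest to zero (the maximum, since all sums are negative), which is one reading of the selection rule in the Analysis section.

```java
/** Hypothetical cross-check of Table 1: star-center choice and run-time ratios. */
final class Table1Check {

    static final double[] EPS = {1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20};

    // Rows S1..S9, one column per value of epsilon, copied from Table 1.
    static final int[][] SUMS = {
        {-738, -678, -635, -583, -486, -362, -304, -276, -230},
        {-656, -592, -553, -494, -424, -354, -287, -223, -206},
        {-980, -898, -796, -703, -627, -489, -387, -303, -219},
        {-914, -862, -740, -660, -576, -490, -432, -323, -225},
        {-1012, -913, -806, -721, -627, -532, -452, -302, -246},
        {-1194, -1080, -968, -894, -775, -608, -504, -382, -287},
        {-1076, -976, -840, -752, -676, -554, -433, -350, -304},
        {-1032, -951, -873, -797, -693, -578, -459, -367, -284},
        {-976, -860, -785, -730, -618, -525, -386, -286, -231}
    };

    // Run times in milliseconds, assumed to correspond to the columns above.
    static final int[] RUN_TIME_MS = {3687, 3203, 2360, 1953, 1516, 1281, 985, 1157, 609};

    /** Zero-based index of the row with the largest (least negative) sum. */
    static int centerForColumn(int col) {
        int best = 0;
        for (int row = 1; row < SUMS.length; row++) {
            if (SUMS[row][col] > SUMS[best][col]) {
                best = row;
            }
        }
        return best;
    }

    /** Run time of the given column as a fraction of the full (epsilon = 1.0) run. */
    static double runtimeFraction(int col) {
        return RUN_TIME_MS[col] / (double) RUN_TIME_MS[0];
    }

    public static void main(String[] args) {
        for (int col = 0; col < EPS.length; col++) {
            System.out.printf("eps=%.2f center=S%d time=%.2f of full run%n",
                    EPS[col], centerForColumn(col) + 1, runtimeFraction(col));
        }
    }
}
```

Under these assumptions every column selects S2, the run at ε = 0.2 takes about 0.17 of the full run time (roughly 1/6), and the run at ε = 0.6 takes about 0.41 of it.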
6 Conclusion

The implementation of the algorithm presented in this paper shows that with ε = 0.6 we obtain a reduction of more than 50% in the running time while maintaining accuracy. The implementation also supports our hypothesis that a randomized approach can improve distance matrix calculation. As expected, the results lose accuracy for very small values of ε; a proper choice of ε therefore yields a speedup while maintaining the accuracy of the algorithm.

7 Discussion

In this paper we have discussed various methods of multiple sequence alignment. We have also introduced a new approach that randomly samples sequences and aligns the samples to achieve the same result in terms of distance matrix calculation with a significant runtime improvement. We have backed up our claims of speedup and accuracy with empirical data and examples. Since most algorithms currently used for MSA compute a distance matrix as an initial step, this time reduction could be important.

8 Future Work

No significant work has been done in the area of randomized algorithms for MSA, which leaves many opportunities for future work. We plan to make several critical improvements to our algorithm. First, we would like to prove the theoretical complexity of this algorithm and also show that it is faster in practice. We would also like to show that randomization gives the same result with very high probability. At present we assume that all sequences are of the same length; we would like to extend our work so that sequences of uneven lengths can also be aligned using the random approach.
There is a possibility of taking this work further and implementing randomized portions for CLUSTAL W, MAFFT and other popular MSA packages in order to increase their speed. In our opinion, further speedup can be achieved by randomizing not just the pairwise alignment but also the sequence selection, but this hypothesis still needs further work.

References

[1] M. P. Berger, P. J. Munson. A novel randomized iterative strategy for aligning multiple protein sequences. Computer Applications in Biosciences, 7(4), 1991, pp. 479-484.
[2] S. Rajasekaran, H. Nick, P.M. Pardalos, S. Sahni, G. Shaw. Efficient algorithms for local alignment search. Journal of Combinatorial Optimization, 5(1), 2001, pp. 117-124.
[3] K. Charter, J. Schaeffer, D. Szafron. Sequence Alignment using FastLSA. International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences. 2000.
[4] S. Needleman, C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 1970, pp. 443-453.
[5] D. Feng, R. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25, 1987, pp. 351-360.
[6] J. Thompson, D. Higgins, T. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 1994, pp. 4673-4680.
[7] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30(14), 2002, pp. 3059-3066.
[8] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res., 16, November 1988, pp. 10881-10890.
[9] G. Karypis, S. Han, V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Technical report TR-99. University of Minnesota, Minneapolis, 1999.
[10] A. Szymkowiak, J. Larsen, L. Hansen. Hierarchical clustering for datamining. Fifth International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies. 2001.
[11] C. Notredame. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3(1), 2002.
[12] O. Gotoh. Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Computer Applications in Biosciences, 10(4), 1994, pp. 379-387.
[13] O. Gotoh. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Computer Applications in Biosciences, 9(3), 1993, pp. 361-370.
[14] J. Kececioglu, H. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, M. Vingron. A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics, 104, 2000, pp. 143-186.
[15] S. Anibal de Carvalho. http://neobio.sourceforge.net/. Department of Computer Science, King's College, London, UK.

Appendix A

Input 1

S1:AGGCTATACTTAAGTGGTCGTTATGGCCGTACACCGACCAGCGAGGAACGCATAACAGCGACCTACAT AAGTTTGTGGTGCATCAAGCTACCGCTTTGCTGATGGCGGACGAAACGCAATTGTTAGAAAGGGGGCGGCA CAGTACCGAACACGCGTTTCCACGGTCATATTCAGAGGTGCTGTTTTTCTCGTGTAACGCGGCACCTTCCA TGTCGCCGTTAGTGCGATGAGACTCCAGACCGTGCCCACACTTTGCTCATCGCGCACCAAGAGGAGACCCC TGTTATCAGGCGTCGCAGTTCCTAGGGGCGCTATCCCACCGTCGCATAACGCCCGACCAAAGGACCACCAA TCGTTCCGGCGCTGATTTGTCTGGCTCGAGGCGAGTGTCTGATCTGCACTGAGTAGCGGTCCCACTTGGTG CGCTATTACGGGACGCATGAGCCCTGCGTTTTCTCTCTAATAGTTAGAGAGTATCCTTCTATGCGTCATGC GAGAGGTTTCGCCTTAGACTAGGTTTTCGAGCTGCCCAGGGTTCCAGTGTGCTTAAGCCGCCATTTATGGT TTACTCAAGGGTAAAGGTGATCCCATGATTTGATA
S2:ACTCCCACACCACTACTACTAGCCGTTCTTTGCTGTAGAATTCGAAACACCTTTCAGACTGTACCCTG CCTGCAACTTATAGGGTGCTCATACCGACTCCTAGCCTGAGTCTGACTTGTCGGAAAAATACTGCGCTCGT ATGGAAAAGTACACCGAGATGCTGAGCCTGAGTTACAAATCAGGCAGTTTTTGGGTCTTATTACTAGGCCC ACGCTATCTTTGAACATATACTTCTCAGATAACGAGATTTATGTGCTAAGCGATACGTGGCTCAATCCCCG CTAGGATCTGCCACAACACCACGACTGTCACTCCTTATCAATGACACTCAGTTTTCCAAACGCGGCTGTAG
GTGGTTATTGGTTACGAACGCGACGAACTTACTGTCTTACCTATTGTCAAAGGCCTATAATGCCACACTCT AAAGCGAGCGGACAACTACCGTTTAAAGCGAATAATGTACCGACCCAAAAAGAACATTTCCCGGTCCCGTC AGTAGAGCTGGTCAAGAAGGTAGTCTGAATAACTCACGGAGGTATCTTTAGGCTAGGAGCTGAACAAACTT CAGAAATATAACGCCCCGCCGCCTGCACATGCGCA
S3:TGCTCTCAGTCTTTGTGTCGGCGTCTGAGTACCGTTGAGCGATCCGACAGTGGGGCCAGCCTGCGGAC CGTCACGAACGTCGTTACCTTGATGCGCATAGTTGCCGTTCTCGCCGAGGCTGGGTGTCCAAGGTGGTCTT TAGCGCCTGCTTTTCAAAGGTAGTAACCTGGTATAATCTGGGGCGATAGTGTCGCCAGTTCAAGGCGTTCA ACGAGTCGCGCACCTGCTATTACACTGGGAGTAACTATTCAATCAAGTATGAGGCTCAGAACCACAGGTAT TATTGATGATAAGCCAGACCTTCGAGGATCGTCTCTAGCACATGATCGTTTGATAGAAAGTGTGCAGCTGG TGAAGTTTTTAACATCCCGTGAGGACGTACACTGGCCTCTCTTGTGCCGGGCGTTAAACAATACCTTAAAG CATGCCACAATCGTACCGGGCATAGGATGCTGATTTATGCCTTCATAAAGGGACTCGGCCACGTTGTAAGG TGTGAATGCTAGATCTACCACGAAAGGGCCTGTTAGCACACATGCCGCCCTTGTCGCTAAAGGTTTTATAA TACGCGTACGCTCATGCCCCCGAAAGAAGACCATGAGTTGACATTCGCTCATAATACAGGTCAGGCATAGG TGGAGCTCGTGGATTTCTTATCGTTACAAACCATCGCAGAGCACCGTTCGATATACAATAGAGCTTCGGGC ACTACGCCTACGCGGGTGATTAGGAACCCGTTACAAGGCAAGGACTCAATGGTGTCCCGGAATTTACGCCA ACAACGGTTGTGAAGGGGATGCGGCGGACTATTGTTTAATGTGGTTGGATCCCACCGTGTGCAATCAGCCT AGGGGAAACGCAGGAGTCAGAGGCAGTTGGAGTCAGATTGTGCATTAATCAGTTCGTAAGCCTTCCACGGA GAGTAATCACAACGTCTCGGACAGAAGCTCCCTAGACGACTAGCTGAAAGTGCCCCCAAAGTGCTATGGCA TCAATCCCT
S4:GCCTATTCGGATGTACTCTCTCCGCCCAGAAGTGAAGGAGTCAGATAGGTCCTTGCTATAACAGCCGC AACACTCATCGTGCCGGCAGCCTAGCAGTTACCTGGATCCCAGATCTACCTTACCATTTCAGGCTAAATTT AGGCTCGGGTACAAAAAACATCGCCGGGCTTCAACCTTGCCGCCCTTAACACACGGTGTGACTTTATACAG GGAGATGGAGCATGGGCTGGCCTAGTGGGGTGTGGCGCTAATTTCCTCGCTAATGCTATGCGGAGCCCTGA AAGCTGACTGGAGGAGGCCGAGCCGACAATGTCTCGTGAGTGGCATTGCGTTTAAGGAAGACTTTTGTCCG ATCTACACCTTCCTCGAGTCTCCGCAGGGTTGTGCATAGTGGCTGTAGACAGAATCCAGCTGACAGGTCTG CATTTAGAAATAGCTTAGCGTCCGCCGGACCACTGTCAACTTTACTGTGGCTCTCGTCTGCTGACTTTGAT TATCTGAATGTGAGTCTCAGTAACTGACCTGGGCGTCTTCGGCGAAGGATCAATGAACGAATCAAAGAGGT GAAGGGGCTTTCCTGCTAAGACCGTGCATCAGTACTAGCCGGTCGAGTCCTTTGCACGTCCGCCGCAGCCG
TACAGTCGATTGATATAGTCTACCCTCGATCCTTTAGCAAGTGCATATGCAGCCGACCAACCTTGCGGCAT ACTCCAATCAACACTACCCAGATCCTAAGGTGACGGTTTCAGAGGATATACGAAGCGTATTGCACCGCGTA TGTATTTAAGAACGGTGGGTGTTATGTCAGACGCGTCCGGTTTTAACCCTTTATACAAATCGTCTCGACAC ACTACATCAATATATTACATGAAGGTGCATCACAGCCGGTCCACACCGGTT
S5:TCGGCTGTATTGGCGACCCAGGCGTGGGCTTAATGAATCAGAGACTCTGCAGCCAGGGAGTATGTATA GCAGTTCTTTAAACGGTCTGCGACGAGGAAGGTTTCGAGTGTGCAACGTGAGGCTATCGTAAAAGTGTTTC AACAGATGGGGGGCTATGAGCCGCTCGAACGTTACACACTGCACGCGGGGTCGACTAATGGAAGCTAACCT AAGCTAATTGCCCTATTCGTGAAGAAACATCTAATTCCTTCCTTGTATGTGTTCTCCCTACAGCACATATC GACAATAGGTTTTAGTGCTTTACCACAAGTAGCAAGTACAACTTGAATTGGGTAAGACTTGCACTTCATGT ATTTGAAATCGCTATCCCACGACTTGGTGTCAACCCCCGGCTCTTTATCACCTTGCATACCCAGCGGCATC AAGTGACCGACATATGATCTGGTAGTAGTTCAACCCTGAAGACTATCTTTAGCTCAGCGCGTTAAGTCCTT ATACACTCTAGCGAGTGGGAAGGATGGATCGGCCGGACATCGTACGTAATTTAGAACCCAGTACCGAGACG CGTTCGACAGTCCTAAGGCTCCATCAGAGTAGCTTACTACGTCACGAGTCAGGTAAAGCCGAGAGCGTCCG ATCCATCCTTGGTGGATCAGCGTTCTCTGTTGTTGAACGCGAGGTAAACGTTGGTAACTTTTTCAACAGCA GTAGAGTAGCGTGTAGTTACTCGGAGATCGACGTAACTGCGCGCCCTGCAACACTAAGCGCTGCGCTGTCT GCTGCGCAGACTCTATGAGAGTCGCTCGTCTCCGTCTGCTTAGGGGGCGTTAGCACACTAATCACGGCTCA AATATGTTAAAGAAGGAGCCCCATTTCCGTGACGTCAGTACGAGCAATTTACGATGGCAAAGAGAGCAAGA CCTTCGCGCAGGGTACGGACCTGACAGCATGGGTTATCAAGGCCCTTTCCAGGTAATAAATTTCAGATTTA GTACTTATCATGTAGATAAGTTGGAAACCTTGA
S6:GAAGACTCAGGGAGAGAAATTTTTCTTGATTCATTCTGCAGATTGGCTTACTACACATGCTCTTTTCC ATGAAGTTGCAAAATTGGATGTGGTGAAATTATTATACAATGAGCAGTTTGCTGTTCAAGGGTTGTTGAGA TACCATACATATGCAAGATTTGGCATTGAAATTCAAGTTCAGATAAACCCTACACCTTTCCAACAGGGGGG ATTGATCTGTGCTATGGTTCCTGGTGACCAGAGCTATGGTTCTATAGCATCATTGACTGTTTATCCTCATG GTTTGTTAAATTGCAATATTAACAATGTGGTTAGAATAAAGGTTCCATTTATTTACACAAGAGGTGCTTAC CACTTTAAAGATCCACAATACCCAGTTTGGGAATTGACAATTAGAGTTTGGTCAGAATTAAATATTGGGAC AGGAACTTCAGCTTATACTTCACTCAATGTTTTAGCTAGATTTACAGATTTGGAGTTGCATGGATTAACTC CTCTTTCTACACAAATGATGAGAAATGAATTTAGGGTCAGTACTACTGAGAATGTGGTGAATCTGTCAAAT TATGAAGATGCAAGAGCAAAGATGTCTTTTGCTTTGGATCAGGAAGATTGGAAATCTGATCCGTCCCAGGG
TGGTGGGATCAAAATTACTCATTTTACTACTTGGACATCTATTCCAACTTTGGCTGCTCAGTTTCCATTTA ATGCTTCAGACTCAGTTGGTCAACAAATTAAAGTTATTCCAGTTGACCCATATTTTTTCCAAATGACAAAT ACGAATCCTGACCAAAAATGTATAACTGCTTTGGCTTCTATTTGTCAGATGTTTTGTTTTTGGAGAGGAGA TCTTGTCTTTGATTTTCAAGTTTTTCCCACCAAATATCATTCAGGTAGATTACTGTTTTGTTTTGTTCCTG GCAATGAGCTAATAGATGTTTCTGGAATCACATTAAAGCAAGCAACTACTGCTCCTTGTGCAGTAATGGAT ATTACAGGAGTGCAGTCAAC
S7:CAGTGGCGATGACCCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGTGCCTGTTCGGTG GTCGCGGCGCTGATGGCGATGAATGAACACTGCGGCAAGCCGCTTAATGACACTCGTTTGCTGGCTTTGAT GGGCGAGCTGGAAGGCCGTATCTCCGGCAGCATTCATTACGACAACGTGGCACCGTGTTTTCTCGGTGGTA TGCAGTTGATGATCGAAGAAAACGACATCATCAGCCAGCAAGTGCCAGGGTTTGATGAGTGGCTGTGGGTG CTGGCGTATCCGGGGATTAAAGTCTCGACGGCAGAAGCCAGGGCTATTTTACCGGCGCAGTATCGCCGCCA GGATTGCATTGCGCACGGGCGACATCTGGCAGGCTTCATTCACGCCTGCTATTCCCGTCAGCCTGAGCTTG CCGCGAAGCTGATGAAAGATGTTATCGCTGAACCCTACCGTGAACGGTTACTGCCAGGCTTCCGGCAGGCG CGGCAGGCGGTCGCGGAAATCGGCGCGGTAGCGAGCGGTATCTCCGGCTCCGGCCCGACCTTGTTCGCTCT GTGTGACAAGCCGGAAACCGCCCAGCGCGTTGCCGACTGGTTGGGTAAGAACTACCTGCAAAATCAGGAAG GTTTTGTTCATATTTGCCGGCTGGATACGGCGGGCGCACGAGTACTGGAAAACTAAATGAAACTCTACAAT CTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACCCAGGGGTTGGGCAAAAATCAGGGGCT GTTTTTTCCGCACGACCTGCCGGAATTCAGCCTGACTGAAATTGATGAGATGCTGAAGCTGGATTTTGTCA CCCGCAGTGCGAAGATCCTCTCGGCGTTTATTGGTGATGAAATCCCACAGGAAATCCTGGAAGAGCGCGTG CGCGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAAGCGATGTCGGTTGTCTGGAATTGTTCCA CGGGCCAACGCTGGCATTTAAAGATTTCGGCGG
S8:AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCA GCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCA ATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGC CCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGT TCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGG CAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAA
AACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAA ATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTG CCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTA TCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACC CGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGA AAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTAC GCGCCGATTGTTGCGAGATTTGGACGGACGTTG
S9:ACCCATAACGGGCAATGATAAAAGGAGTAACCTGTGAAAAAGATGCAATCTATCGTACTCGCACTTTC CCTGGTTCTGGTCGCTCCCATGGCAGCAGAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGA TAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAA CATTATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGC TCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAAATGACAAATGCCGGGTAACAAT CCGGCATTCAGCGCCTGATGCGACGCTGGCGCGTCTTATCAGGCCTACGTTAATTCTGCAATATATTGAAT CTGCATGCTTTTGTAGGCAGGATAAGGCGTTCACGCCGCATCCGGCATTGACTGCAAACTTAACGCTGCTC GTAGCGTTTAAACACCAGTTCGCCATTGCTGGAGGAATCTTCATCAAAGAAGTAACCTTCGCTATTAAAAC CAGTCAGTTGCTCTGGTTTGGTCAGCCGATTTTCAATAATGAAACGACTCATCAGACCGCGTGCTTTCTTA GCGTAGAAGCTGATGATCTTAAATTTGCCGTTCTTCTCATCGAGGAACACCGGCTTGATAATCTCGGCATT CAATTTCTTCGGCTTCACCGATTTAAAATACTCATCTGACGCCAGATTAATCACCACATTATCGCCTTGTG CTGCGAGCGCCTCGTTCAGCTTGTTGGTGATGATATCTCCCCAGAATTGATACAGATCTTTCCCTCGGGCA TTCTCAAGACGGATCCCCATTTCCAGACGATAAGGCTGCATTAAATCGAGCGGGCGGAGTACGCCATACAA GCCGGAAAGCATTCGCAAATGCTGTTGGGCAAAATCGAAATCGTCTTCGCTGAAGGTTTCGGCCTGCAAGC CGGTGTAGACATCACCTTTAAACGCCAGAATCG

Appendix B

Input 2

Appendix C

Input 3