Randomized approach to Distance Matrix calculation for
Multiple Sequence Alignment
Vishal Thapar
(Vishal.Thapar@uconn.edu)
BME 300 – Bioinformatics
Instructor: Prof. Richard Simon
(December 3rd, 2003)
Abstract: Rigorous alignment of multiple sequences becomes impractical even with a
modest number of sequences [1]. A solution to the multiple sequence alignment (MSA)
problem is important for biological research. Because of the high time complexity of
traditional MSA algorithms, even today's fast computers cannot solve the problem for
a large number of sequences. In this paper we evaluate the possibility of using a
randomized approach to calculate the distance matrix for a multiple sequence
alignment algorithm. To reduce time complexity, we evaluate a small, randomly
selected portion of each sequence and compare it with similar portions selected
randomly from all other sequences. The initial idea of randomization was taken
from [2].
1 Introduction
Sequence alignment is one of the fundamental operations performed in
computational biology research [3]. Often it is necessary to evaluate more
than two sequences simultaneously in order to determine the functions, structure
and evolution of different organisms. The Human Genome Project uses this
technique to map and organize DNA and protein sequences into groups for later
use. There has been significant research in this area because of the need to
perform multiple sequence alignment on many sequences of varying length.
Algorithms dealing with this problem range from simple comparison and dynamic
programming procedures to complex ones that rely on the underlying biological
meaning of the sequences to align them more accurately. Since multiple sequence
V. Thapar
1
BME 300 - Bioinformatics
alignment is an NP-hard problem, practical solutions rely on clever heuristics to
do the job. There is a constant trade-off between accuracy and speed in these
algorithms. Accurate algorithms need more processing time and can usually
compare only a small number of sequences, whereas fast and less accurate ones
can analyze many sequences in a reasonable amount of time.
The dynamic programming algorithm was first introduced by Needleman and
Wunsch [4]. This algorithm is designed for pairwise sequence alignment. Feng and
Doolittle [5] developed an algorithm for multiple sequence alignment using a
modified version of [4]. There are more complicated algorithms such as
CLUSTAL W [6], which relies on a scoring system that is adjusted based on the
local homology of the sequences.
Progressive algorithms suffer from a lack of computational speed because of
their iterative approach. Accuracy is also compromised because the progressive
strategy is greedy: although each pairwise dynamic programming step is exact,
the overall procedure can settle on a local optimum of the alignment score
rather than the global optimum. Algorithms that rely significantly on biological
information may also be at a disadvantage in some domains. Often it is not
necessary to find the most accurate alignment between sequences. In those
cases, specialized algorithms such as CLUSTAL W may be more than is needed.
These algorithms also require some human intervention while they are
optimizing results. This intervention has to be done by biologists who are very
familiar with the data, and thus the user domain for such an algorithm is
limited.
One of the more important uses of MSA is in phylogenetic analysis [11].
Phylogenetic trees are the basis for understanding evolutionary relationships
between species. To build a phylogenetic tree, orthologous sequences have to be
entered into the database, the sequences have to be aligned, pairwise
phylogenetic distances have to be calculated, and a hierarchical tree has to be
computed using a clustering algorithm, as shown in [8].
There are many algorithms that maximize accuracy and do not concern
themselves with speed. Few successful improvements have been made to reduce
CPU time since the proposal of the Feng and Doolittle [5] method [7].
Our approach reduces CPU time by randomizing part of multiple sequence
alignment. It calculates the distance matrix for star alignment by randomly
selecting small portions of the sequences and aligning them. Since the randomly
selected portion of a sequence is significantly shorter than the full sequence,
this results in a significant reduction in running time.
2 Survey of Literature
In this section we survey the literature relevant to this paper. We also list
some competing algorithms and applications that are in use today.
2.1 CLUSTAL W
CLUSTAL W is an improvement on the progressive approach invented by Feng and
Doolittle [5]. CLUSTAL W improves the sensitivity of multiple sequence
alignment without sacrificing speed and efficiency [6]. Speed and efficiency in
this context refer to those of the Feng and Doolittle [5] style of progressive
algorithm. It will be shown that our algorithm is faster in theoretical running
time than CLUSTAL W. CLUSTAL W differs from conventional algorithms in that it
allows genetic information to be included in the distance matrix calculations.
In other words, it does not limit the match/mismatch scores to constants but
allows them to change based on a number of criteria set by the user [6].
CLUSTAL W takes into account different types of weight matrices at each
comparison step, based on the homogeneity of the sequences being compared and
their evolutionary distances. It is divided into three stages. (1) In the first
stage, a fast
approximation algorithm is used to evaluate alignment scores. The idea is that
errors made in alignment during this step will be corrected in later stages by
more accurate weights. (2) Unrooted trees are calculated using the
neighbor-joining method [6]. Each sequence is a branch in this tree. Each
sequence gets a weight proportional to its distance from the root. It also gets
a proportion of the weight from another sequence with which it shares some
similarities. (3) This step is called progressive alignment. In this step, the
guide tree is used to combine sequences into larger and larger pairwise
alignments. Sequences are selected from the tips of the tree, moving towards
the root. At each stage a full dynamic programming algorithm is used to
calculate the weight matrix and introduce gaps [6].
Proper weighting is achieved by giving one sequence a weight of 1.0 and the
rest less than that. Groups of closely related sequences receive lower weights
and thus do not over-influence the final alignment.
CLUSTAL W is highly accurate: it gives near-optimal results for data sets in
which pairs are more than 35% identical. For divergent sequences it is
difficult to find a proper weighting scheme, and so the result is not a good
alignment.
2.2 MSA using Hierarchical Clustering
Hierarchical clustering is a very interesting heuristic for MSA. It is a rather
old approach in the fast-changing field of bioinformatics. Clustering is used in
bioinformatics, but mostly in the field of data mining [9, 10]. This approach
uses hierarchical clustering along with pairwise alignment to align similar
sequences. Hierarchical clustering of the sequences is done using a weight
matrix. At each step, groups or clusters of sequences are aligned together into
larger clusters until all of them form one group.
Distance matrix calculation is the central theme in this approach. First, a
distance matrix is calculated over every possible pairwise alignment of
sequences. This
process can be carried out using a fast pairwise alignment algorithm such as [2].
The two sequences Si and Sj that have the lowest alignment score are chosen
from the matrix and aligned with each other in one cluster. The n x n matrix is
then replaced with an (n-1) x (n-1) matrix by deleting row j and column j.
Also, row i is replaced with the average score of i and j [8]. This process
continues until all sequences are aligned and form one cluster.
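The merge step described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the procedure of [8]: the class and method names are hypothetical, the matrix is assumed to be symmetric, and column i is averaged along with row i so that the reduced matrix stays symmetric (the text above mentions only row i).

```java
/** Illustrative sketch; names are hypothetical and not taken from [8]. */
final class ClusterMergeSketch {

    /** Returns {i, j} with i < j for the smallest off-diagonal entry (needs n >= 2). */
    static int[] lowestPair(double[][] d) {
        int bi = 0;
        int bj = 1;
        for (int i = 0; i < d.length; i++) {
            for (int j = i + 1; j < d.length; j++) {
                if (d[i][j] < d[bi][bj]) {
                    bi = i;
                    bj = j;
                }
            }
        }
        return new int[] {bi, bj};
    }

    /** Merges cluster j into cluster i and returns the (n-1) x (n-1) matrix. */
    static double[][] merge(double[][] d, int i, int j) {
        int n = d.length;
        double[][] out = new double[n - 1][n - 1];
        int outRow = 0;
        for (int r = 0; r < n; r++) {
            if (r == j) {
                continue; // row j is deleted
            }
            int outCol = 0;
            for (int c = 0; c < n; c++) {
                if (c == j) {
                    continue; // column j is deleted
                }
                double v;
                if (r == i && c == i) {
                    v = d[i][i];
                } else if (r == i) {
                    v = (d[i][c] + d[j][c]) / 2.0; // row i becomes the average of i and j
                } else if (c == i) {
                    v = (d[r][i] + d[r][j]) / 2.0; // assumption: column i is averaged too
                } else {
                    v = d[r][c];
                }
                out[outRow][outCol++] = v;
            }
            outRow++;
        }
        return out;
    }
}
```

Calling lowestPair and then merge repeatedly until a single row remains reproduces the clustering loop described in the text.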
This algorithm takes O(N(N-1)M^2) time, where N is the number of sequences and
M is the length of the sequences when aligned [8]. This is not nearly as fast
as what we are trying to achieve. Since this algorithm also uses a distance
matrix calculation, the algorithm proposed here could reduce its running time
as well.
2.3 MAFFT: Fast Fourier Transform based approach
The fast Fourier transform (FFT) is used to determine homologous regions
rapidly. For the FFT, each amino acid sequence is converted into a sequence of
volume and polarity values [7]. MAFFT implements two FFT-based approaches, a
progressive method and an iterative refinement method. The correlation between
two amino acid sequences is calculated using FFT formulas. A high correlation
value indicates that the sequences may have homologous regions [7]. The program
also has a sophisticated scoring system for the similarity matrix and gap
penalties. Just like CLUSTAL W, this approach uses guide trees and similarity
matrices.
From the results presented in [7], FFT-based algorithms are significantly
faster than the CLUSTAL W and T-COFFEE algorithms. It is important to notice
that all these algorithms are still polynomial-time algorithms and thus behave
similarly on a log-scaled graph. The only difference is that the FFT approach
has a lower coefficient. Thus, from a complexity point of view, the FFT
approach is not significantly better than the other approaches.
2.4 Other approaches to MSA
There are many other innovative approaches to MSA. Stochastic processes are
used to perform MSA: simulated annealing and genetic algorithms [11] are the
classic stochastic MSA algorithms. In these algorithms, two sequences are
randomly aligned and their score is compared with the previous score [11]. If
the new score is better it is kept; otherwise it is discarded.
Non-stochastic iterative algorithms are simple to understand. They rely on the
logic that even a wrong alignment can be efficiently improved if it is
realigned at a later stage. Berger and Munson's algorithm [1] is one such
algorithm. It aligns the sequences randomly at first. Then it iteratively tries
to find better results and updates the sequences until no further improvement
can be achieved. Gotoh has described such an algorithm in [12]. It is a
doubly nested iterative strategy with randomization that optimizes the weighted
sum-of-pairs score with affine gap penalties [11].
There is also a relatively recent algorithm by Kececioglu, Lenhof, Mehlhorn,
Mutzel, Reinert and Vingron [14], which treats the alignment problem as an
integer linear program. With the polyhedral approach, variations of a basic
problem can often be conveniently modeled by adding further constraints to the
basic linear program [14]. This algorithm solves the MSA problem to optimality
for nontrivial instances of 18 sequences or more.
3 Randomized Algorithm
The idea of randomized sampling for local alignment was proposed by
Rajasekaran et al. [2]. As with any other randomized algorithm, we try to show
that instead of evaluating entire sequences of length N, we can achieve the
same result by evaluating N^ε characters, where 0 < ε < 1. This procedure
has the potential, in theory, to reduce the running time by close to an order
of magnitude.
Traditional algorithms take O(M^2 * N^2) time to create a distance matrix,
where M is the number of sequences and N is the length of the aligned
sequences. This follows from the fact that the traditional Needleman-Wunsch [4]
algorithm requires O(N^2) time to find the alignment score of any two
sequences. There are M sequences, so aligning all M(M-1)/2 possible pairs takes
O(M^2) pairwise alignments. Thus, the total time taken by a Needleman-Wunsch
type algorithm is O(M^2 * N^2).
Our heuristic reduces the time spent on each pairwise alignment and, in effect,
the overall time of any algorithm that requires distance matrix calculations.
It selects a subsequence of length N^ε from sequence S, starting at a randomly
selected position between S_1 and S_(N - N^ε). A subsequence of the same
length, starting at the same position, is chosen from sequence T. These
subsequences are aligned and the score is recorded. Since the length of the
subsequences is N^ε, the time complexity of one pairwise alignment is
O(N^(2ε)). This results in an overall time of O(M^2 * N^(2ε)). This is a
significant reduction, provided the resulting distance matrix returns a
reliable and accurate score.
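The size of this saving can be written out explicitly. The derivation below is a sketch that assumes a constant cost c per dynamic-programming cell (c is not a symbol used in this paper) and counts only the alignment work. Its last line covers the alternative reading in which the window is a fraction εN of the sequence, which is how the Implementation section describes the sampled length.

```latex
\begin{align*}
T_{\text{full}} &= c\,M^{2}N^{2}, \\
T_{\text{rand}} &= c\,M^{2}\left(N^{\epsilon}\right)^{2} = c\,M^{2}N^{2\epsilon}, \\
\frac{T_{\text{rand}}}{T_{\text{full}}} &= N^{2\epsilon-2},
  \qquad \text{e.g. } N = 600,\ \epsilon = 0.5 \;\Rightarrow\; 600^{-1}, \\
\text{window } \epsilon N:\qquad
\frac{c\,M^{2}(\epsilon N)^{2}}{c\,M^{2}N^{2}} &= \epsilon^{2},
  \qquad \text{e.g. } \epsilon = 0.5 \;\Rightarrow\; 0.25 .
\end{align*}
```

Either ratio bounds only the dynamic-programming work; fixed costs such as reading the input are not included.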
Algorithm
Input: A file containing DNA or protein sequences separated by newline
characters, and a value of ε.
Output: The distance matrix calculated for all of the sequences T1 to Tn, and
the total sum of distances for each sequence.
Algorithm:
(1) Read and store all sequences from the input file into an array.
(2) For all sequences T1 to Tn Do
    a. For all sequences P1 to Pn Do
        i. Select a random number R that works as a starting point.
        ii. Select |Pj|^ε characters from Pj starting at position PjR.
        iii. Similarly select the same number of characters from Ti starting
             at position TiR. Steps ii and iii result in two new sequences
             Pj' and Ti'.
        iv. Use the Needleman-Wunsch algorithm to evaluate the pairwise
            alignment score of Pj' and Ti'.
    b. Record the score from step a-iv in matrix M at M(Ti, Pj).
    c. Increment j by 1.
(3) At the end of step 2, we have a complete matrix M with distance scores
    for each combination of sequences. Now sum the alignment scores in row
    order, where
        Sum_i = \sum_{j=1}^{n} M(T_i, P_j).
(4) Select the sequence whose Sum_i is the lowest negative (or highest
    positive) score and use it as the center of the star alignment.
(5) Repeat the same process for different values of ε.
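Steps 1 to 4 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the class and method names are hypothetical, the window length is taken as ε times the sequence length (as the Implementation section describes), all sequences are assumed to have the same length with 0 < ε <= 1, and the center is the row with the highest sum, following the selection rule given in the Analysis.

```java
import java.util.Random;

/** Illustrative sketch only; class and method names are hypothetical. */
final class RandomizedMatrixSketch {
    // Scoring values quoted in the Implementation section.
    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP = -2;

    /** Needleman-Wunsch global alignment score, keeping two rows of the table. */
    static int nwScore(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * GAP;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i * GAP;
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH;
                int best = prev[j - 1] + sub;            // align the two characters
                best = Math.max(best, prev[j] + GAP);    // gap in b
                best = Math.max(best, cur[j - 1] + GAP); // gap in a
                cur[j] = best;
            }
            int[] swap = prev;
            prev = cur;
            cur = swap;
        }
        return prev[b.length()];
    }

    /** Step 2: score one random window for every ordered pair (Ti, Pj). */
    static int[][] scoreMatrix(String[] seqs, double epsilon, Random rng) {
        int n = seqs.length;
        int len = seqs[0].length(); // assumption: all sequences share this length
        int window = Math.max(1, (int) (epsilon * len)); // assumption: epsilon * length
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int r = rng.nextInt(len - window + 1); // random start R
                String ti = seqs[i].substring(r, r + window);
                String pj = seqs[j].substring(r, r + window);
                m[i][j] = nwScore(ti, pj);
            }
        }
        return m;
    }

    /** Steps 3 and 4: sum each row and return the index with the highest sum. */
    static int starCenter(int[][] m) {
        int best = 0;
        long bestSum = Long.MIN_VALUE;
        for (int i = 0; i < m.length; i++) {
            long sum = 0;
            for (int v : m[i]) {
                sum += v;
            }
            if (sum > bestSum) {
                bestSum = sum;
                best = i;
            }
        }
        return best;
    }
}
```

Step 5 then amounts to calling scoreMatrix with several values of ε and comparing the index returned by starCenter with the one obtained for ε = 1.0.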
Analysis
This algorithm is closely related to the Needleman-Wunsch algorithm for
pairwise alignment. It requires a value of ε from the user along with an input
file containing sequences of the same length. Step 1 reads the input from the
input file. Step 2 loops over all possible combinations of sequences. Writing N
here for the number of sequences, step 2 is repeated once for each of the N
sequences, and step 2a also iterates through each of the N sequences. Thus,
step 2 performs O(N^2) pairwise comparisons. After selecting a random number as
a starting position, we select a subsequence from both sequences and align them
using Needleman-Wunsch or any other pairwise alignment algorithm. For our
purpose, step 2a-iv takes O(|Pj|^(2ε)) time. The score is recorded in the
appropriate column of the distance matrix. Step 3 sums all pairwise alignment
scores for a given sequence. The sequence with the lowest negative score or
highest positive score is selected. The running time of the algorithm is
O(N^2 * |Pj|^(2ε)).
4 Implementation
In this section, we explain the implementation details of this algorithm on the
Java platform. The design is based on NeoBio [15]. The logic of the algorithm
is simple and has been designed with future additions in mind. At present the
algorithm uses a randomized form of the Needleman-Wunsch algorithm for
alignment, but in future it can easily be extended to use any algorithm that
can globally align two sequences.
The basic class framework has been adapted from the NeoBio package [15].
The main classes in the algorithm are in the package TheMatrix. The classes are:
1. RandomMatrixCalculation.java: This class has the main method, which takes as
input the file that contains all the sequences to be aligned. The file can be
in FASTA format or it can be just a sequence of characters. The scoring scheme
is specified in this class, and the penalties for gap, match and mismatch can
be set as desired. We have used the standard convention of gap = -2,
match = +1 and mismatch = -1 for our application. They can be changed easily.
2. BasicScoringScheme.java: This class extends the abstract class
ScoringScheme.java. It sets the scoring scheme, and the scheme can be made more
sensitive by implementing the methods of the ScoringScheme class in whatever
way the user requires. The use of abstract classes lets the scoring scheme,
like the choice of algorithm, be selected dynamically based on user preference.
3. PairwiseAlignmentAlgorithm.java: This is also an abstract class. A variable
of this type, "algorithm", is used throughout the program; at run time it is
bound to an object of the algorithm chosen by the user (currently
Needleman-Wunsch, but more can be added). Any class that extends this class
implements the methods loadAllsequenceFile(), which loads all the sequences
from a file into memory, and computePairwiseAlignmentAll(), which aligns all
sequences in pairs. The implementations vary depending on which algorithm class
extends this class.
4. CharFile.java: This class reads the sequences from disk into memory and
stores them in the desired format. In our case each sequence is stored as a
character array, and the arrays are stored in vectors (extendable arrays in
Java).
5. IncompatibleScoringSchemeException.java and
InvalidScoringMatrixException.java: These have been taken from the NeoBio
package [15]. They extend the Exception class of Java and are used to display
meaningful messages in case of errors.
6. NeedlemanWunsch.java: This is the major class. It extends
PairwiseAlignmentAlgorithm and thus implements the methods described above in
its own way. At run time the variable of the abstract class
PairwiseAlignmentAlgorithm is assigned an object of the NeedlemanWunsch class.
Thus, even though the methods are called on the PairwiseAlignmentAlgorithm type
throughout the program, the methods actually executed at run time are those of
this class. Later, when we need to add a new algorithm, we can simply create
one class that extends PairwiseAlignmentAlgorithm and implements the same
methods in its own way, with no changes to the existing program. This is the
basis for a flexible framework. The main methods implemented in this class are:
a. ComputePairwiseAlignmentAll()
b. ComputeScoreBetSeqIAndJ()
The first method reads sequences one at a time and compares each to all the
others by calling ComputeScoreBetSeqIAndJ() in a loop, recording the score of
each comparison in the score matrix. The randomization step occurs in the
second method, ComputeScoreBetSeqIAndJ(): based on a fixed value of ε between
0.0 and 1.0, the lengths of the two sequences to be compared are reduced and
then, starting from a random point, segments of length n*ε are taken from both
sequences and compared using the standard Needleman-Wunsch algorithm.
The output is then recorded in a file, "Output.txt", along with the time
elapsed for the computation of the matrix.
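The pluggable design described in this section can be sketched as below. The class names are deliberately hypothetical (prefixed with Sketch) and the method signatures are assumptions made for illustration; they are not the NeoBio classes or the classes of the TheMatrix package. The sketch shows only the pattern: an abstract scoring scheme, an abstract pairwise algorithm, and a concrete global aligner that is bound to the abstract type at run time.

```java
/** Hypothetical stand-in for an abstract scoring scheme. */
abstract class SketchScoringScheme {
    abstract int substitution(char a, char b);

    abstract int gap();
}

/** Fixed match, mismatch and gap values. */
final class SketchBasicScheme extends SketchScoringScheme {
    private final int match;
    private final int mismatch;
    private final int gap;

    SketchBasicScheme(int match, int mismatch, int gap) {
        this.match = match;
        this.mismatch = mismatch;
        this.gap = gap;
    }

    @Override
    int substitution(char a, char b) {
        return a == b ? match : mismatch;
    }

    @Override
    int gap() {
        return gap;
    }
}

/** Hypothetical stand-in for the abstract algorithm type. */
abstract class SketchPairwiseAlgorithm {
    protected final SketchScoringScheme scheme;

    SketchPairwiseAlgorithm(SketchScoringScheme scheme) {
        this.scheme = scheme;
    }

    /** Each concrete algorithm supplies its own scoring routine. */
    abstract int score(String a, String b);
}

/** Global alignment score by dynamic programming over the full table. */
final class SketchGlobalAligner extends SketchPairwiseAlgorithm {
    SketchGlobalAligner(SketchScoringScheme scheme) {
        super(scheme);
    }

    @Override
    int score(String a, String b) {
        int[][] t = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            t[i][0] = i * scheme.gap();
        }
        for (int j = 1; j <= b.length(); j++) {
            t[0][j] = j * scheme.gap();
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = t[i - 1][j - 1] + scheme.substitution(a.charAt(i - 1), b.charAt(j - 1));
                int up = t[i - 1][j] + scheme.gap();
                int left = t[i][j - 1] + scheme.gap();
                t[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return t[a.length()][b.length()];
    }
}
```

A caller would hold only the abstract type, for example `SketchPairwiseAlgorithm algorithm = new SketchGlobalAligner(new SketchBasicScheme(1, -1, -2));`, so adding another algorithm means adding one subclass and changing that single line.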
5 Results
We are going to compare results from three different input files. Input files are
given as appendices A, B and C. We are going to compare actual results for
the lowest distance score sum for each input file for various values of ε. We
will also look at the time it took to evaluate the complete alignment (ε = 1.0)
as opposed to ε < 1.0.
Table 1 shows the sums of the distance scores for various values of ε.
FIRST RUN: input in Appendix A, N = 9, |Si| = 600

ε      S1     S2     S3     S4     S5      S6      S7      S8      S9     Run time
1.00   -738   -656   -980   -914   -1012   -1194   -1076   -1032   -976   3687 ms
0.90   -678   -592   -898   -862   -913    -1080   -976    -951    -860   3203 ms
0.80   -635   -553   -796   -740   -806    -968    -840    -873    -785   2360 ms
0.70   -583   -494   -703   -660   -721    -894    -752    -797    -730   1953 ms
0.60   -486   -424   -627   -576   -627    -775    -676    -693    -618   1516 ms
0.50   -362   -354   -489   -490   -532    -608    -554    -578    -525   1281 ms
0.40   -304   -287   -387   -432   -452    -504    -433    -459    -386   985 ms
0.30   -276   -223   -303   -323   -302    -382    -350    -367    -286   1157 ms
0.20   -230   -206   -219   -225   -246    -287    -304    -284    -231   609 ms

The S2 column of Table 1 (highlighted in the original) shows that for every
value of ε, the sum that is lowest in magnitude (least negative) was
consistently that of sequence S2. Even going as low as ε = 0.2 gave an accurate
prediction of which sequence has the lowest sum. For ε = 0.2, the run time was
only about 1/6 of what it was for ε = 1.0. This gives a rough idea of the
magnitude of time that could be saved with the randomized approach.
Table 2 shows the sums of the distance scores for the input in Appendix B. The
highlighted entries in this table fall in various columns, which shows the kind
of inaccuracy that can arise with the randomized approach. However, the
majority of the values of ε gave the right sequence. It is not safe to take ε
to be very low. For ε = 0.60, the right sequence was picked for the lowest sum.
The run-time reduction is a little more than 1/2 for this case.
Table 3 shows the distance matrix values for the input in Appendix C.
The highlighted part of this table is S7 for all values of ε. This shows
consistent results across the different values of ε. For ε = 0.6, the run-time
reduction is more than 1/2.
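The two observations made about Table 1 earlier in this section can be re-derived mechanically from the reported numbers. The sketch below copies the Table 1 values into arrays; the class and method names are hypothetical, and "best" is taken to mean the least negative sum, which is how the lowest sum is read in the discussion of S2.

```java
/** Illustrative check of the Table 1 figures; names are hypothetical. */
final class TableOneCheck {
    // Rows: epsilon = 1.00, 0.90, ..., 0.20. Columns: S1..S9 (copied from Table 1).
    static final int[][] SUMS = {
        {-738, -656, -980, -914, -1012, -1194, -1076, -1032, -976},
        {-678, -592, -898, -862, -913, -1080, -976, -951, -860},
        {-635, -553, -796, -740, -806, -968, -840, -873, -785},
        {-583, -494, -703, -660, -721, -894, -752, -797, -730},
        {-486, -424, -627, -576, -627, -775, -676, -693, -618},
        {-362, -354, -489, -490, -532, -608, -554, -578, -525},
        {-304, -287, -387, -432, -452, -504, -433, -459, -386},
        {-276, -223, -303, -323, -302, -382, -350, -367, -286},
        {-230, -206, -219, -225, -246, -287, -304, -284, -231},
    };

    // Run times in milliseconds for the same values of epsilon.
    static final int[] RUN_MS = {3687, 3203, 2360, 1953, 1516, 1281, 985, 1157, 609};

    /** Zero-based index of the least negative (highest) sum in a row. */
    static int bestColumn(int[] row) {
        int best = 0;
        for (int c = 1; c < row.length; c++) {
            if (row[c] > row[best]) {
                best = c;
            }
        }
        return best;
    }

    /** Ratio of the full run time to the sampled run time. */
    static double speedup(int fullMs, int sampledMs) {
        return (double) fullMs / sampledMs;
    }
}
```

For every row, bestColumn returns index 1, which is S2, and speedup(RUN_MS[0], RUN_MS[8]) is slightly above 6, in line with the 1/6 figure quoted for ε = 0.2.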
6 Conclusion
From the implementation of the algorithm presented in this paper, we conclude
that with ε = 0.6 the running time of the algorithm is reduced by more than 50%
while accuracy is maintained. The implementation also supports our hypothesis
about the improvement that the randomized approach can bring to distance
matrix calculation. As can be expected, for very small values of ε the results
lose their accuracy; choosing a proper value of ε therefore leads to a speedup
while maintaining the accuracy of the algorithm.
7 Discussion
In this paper, we have discussed various methods of multiple sequence
alignment. We have also introduced a new approach that randomly samples the
sequences and aligns the samples, with the aim of achieving the same result in
terms of distance matrix calculation together with a significant run-time
improvement.
We have backed up our claims of speedup and accuracy with empirical data and
examples. Since most algorithms currently used for MSA perform a distance
matrix calculation as an initial step, this time reduction could be of
importance.
8 Future Work
There has been no significant work done in the area of randomized algorithms
for MSA. This leaves many opportunities for future work. We plan to make
certain critical improvements to our algorithm. First of all, we would like to
prove the theoretical complexity of this algorithm and also show that it is
faster in practice. We would also like to show that randomization gives the
same result with very high probability. At this time, we have assumed that all
sequences are of the same length. We would like to extend our work so that
sequences of uneven lengths can also be aligned using the randomized approach.
There is a possibility of taking this work further and implementing randomized
portions for CLUSTAL W, MAFFT and other popular MSA packages in order to
increase their speed. In our opinion, further speedup can be achieved by
randomizing not just pairwise alignment but also sequence selection, but this
hypothesis still needs further work.
References
[1] M. P. Berger, P. J. Munson. A novel randomized iterative strategy for aligning multiple
protein sequences. Computer Applications in Biosciences, 7(4), 1991, pp. 479-484.
[2] S. Rajasekaran, H. Nick, P. M. Pardalos, S. Sahni, G. Shaw. Efficient algorithms for local
alignment search. Journal of Combinatorial Optimization, 5(1), 2001, pp. 117-124.
[3] K. Charter, J. Schaeffer, D. Szafron. Sequence Alignment using FastLSA. International
Conference on Mathematics and Engineering Techniques in Medicine and Biological
Sciences. 2000.
[4] S. Needleman, C. Wunsch. A general method applicable to the search for similarities in
the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 1970,
pp. 443-453.
[5] D. Feng, R. Doolittle. Progressive sequence alignment as a prerequisite to correct
phylogenetic trees. Journal of Molecular Evolution, 25, 1987, pp. 351-360.
[6] J. Thompson, D. Higgins, T. Gibson. CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res., 22, pp. 4673-4680.
[7] K. Katoh, K. Misawa, K. Kuma, T. Miyata. MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30(14),
pp. 3059-3066.
[8] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res.,
16, pp. 10881-10890. November 1988.
[9] G. Karypis, S. Han, V. Kumar. CHAMELEON: A hierarchical clustering algorithm
using dynamic modeling. Technical report TR-99. University of Minnesota,
Minneapolis, 1999.
[10] A. Szymkowiak, J. Larsen, L. Hansen. Hierarchical clustering for datamining. Fifth
International Conference on Knowledge-Based Intelligent Information Engineering
Systems & Allied Technologies. 2001.
[11] C. Notredame. Recent progress in multiple sequence alignment: a survey.
Pharmacogenomics, 3(1), 2002.
[12] O. Gotoh. Further improvement in methods of group-to-group sequence alignment with
generalized profile operations. Computer Applications in Biosciences, 10(4), 1994,
pp. 379-387.
[13] O. Gotoh. Optimal alignment between groups of sequences and its application to
multiple sequence alignment. Computer Applications in Biosciences, 9(3), 1993,
pp. 361-370.
[14] J. Kececioglu, H. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, M. Vingron. A polyhedral
approach to sequence alignment problems. Discrete Applied Mathematics, 104, 2000,
pp. 143-186.
[15] S. Anibal de Carvalho. http://neobio.sourceforge.net/. Department of Computer Science,
King's College, London, UK.
Appendix A
Input 1
S1:AGGCTATACTTAAGTGGTCGTTATGGCCGTACACCGACCAGCGAGGAACGCATAACAGCGACCTACAT
AAGTTTGTGGTGCATCAAGCTACCGCTTTGCTGATGGCGGACGAAACGCAATTGTTAGAAAGGGGGCGGCA
CAGTACCGAACACGCGTTTCCACGGTCATATTCAGAGGTGCTGTTTTTCTCGTGTAACGCGGCACCTTCCA
TGTCGCCGTTAGTGCGATGAGACTCCAGACCGTGCCCACACTTTGCTCATCGCGCACCAAGAGGAGACCCC
TGTTATCAGGCGTCGCAGTTCCTAGGGGCGCTATCCCACCGTCGCATAACGCCCGACCAAAGGACCACCAA
TCGTTCCGGCGCTGATTTGTCTGGCTCGAGGCGAGTGTCTGATCTGCACTGAGTAGCGGTCCCACTTGGTG
CGCTATTACGGGACGCATGAGCCCTGCGTTTTCTCTCTAATAGTTAGAGAGTATCCTTCTATGCGTCATGC
GAGAGGTTTCGCCTTAGACTAGGTTTTCGAGCTGCCCAGGGTTCCAGTGTGCTTAAGCCGCCATTTATGGT
TTACTCAAGGGTAAAGGTGATCCCATGATTTGATA
S2:ACTCCCACACCACTACTACTAGCCGTTCTTTGCTGTAGAATTCGAAACACCTTTCAGACTGTACCCTG
CCTGCAACTTATAGGGTGCTCATACCGACTCCTAGCCTGAGTCTGACTTGTCGGAAAAATACTGCGCTCGT
ATGGAAAAGTACACCGAGATGCTGAGCCTGAGTTACAAATCAGGCAGTTTTTGGGTCTTATTACTAGGCCC
ACGCTATCTTTGAACATATACTTCTCAGATAACGAGATTTATGTGCTAAGCGATACGTGGCTCAATCCCCG
CTAGGATCTGCCACAACACCACGACTGTCACTCCTTATCAATGACACTCAGTTTTCCAAACGCGGCTGTAG
GTGGTTATTGGTTACGAACGCGACGAACTTACTGTCTTACCTATTGTCAAAGGCCTATAATGCCACACTCT
AAAGCGAGCGGACAACTACCGTTTAAAGCGAATAATGTACCGACCCAAAAAGAACATTTCCCGGTCCCGTC
AGTAGAGCTGGTCAAGAAGGTAGTCTGAATAACTCACGGAGGTATCTTTAGGCTAGGAGCTGAACAAACTT
CAGAAATATAACGCCCCGCCGCCTGCACATGCGCA
S3:TGCTCTCAGTCTTTGTGTCGGCGTCTGAGTACCGTTGAGCGATCCGACAGTGGGGCCAGCCTGCGGAC
CGTCACGAACGTCGTTACCTTGATGCGCATAGTTGCCGTTCTCGCCGAGGCTGGGTGTCCAAGGTGGTCTT
TAGCGCCTGCTTTTCAAAGGTAGTAACCTGGTATAATCTGGGGCGATAGTGTCGCCAGTTCAAGGCGTTCA
ACGAGTCGCGCACCTGCTATTACACTGGGAGTAACTATTCAATCAAGTATGAGGCTCAGAACCACAGGTAT
TATTGATGATAAGCCAGACCTTCGAGGATCGTCTCTAGCACATGATCGTTTGATAGAAAGTGTGCAGCTGG
TGAAGTTTTTAACATCCCGTGAGGACGTACACTGGCCTCTCTTGTGCCGGGCGTTAAACAATACCTTAAAG
CATGCCACAATCGTACCGGGCATAGGATGCTGATTTATGCCTTCATAAAGGGACTCGGCCACGTTGTAAGG
TGTGAATGCTAGATCTACCACGAAAGGGCCTGTTAGCACACATGCCGCCCTTGTCGCTAAAGGTTTTATAA
TACGCGTACGCTCATGCCCCCGAAAGAAGACCATGAGTTGACATTCGCTCATAATACAGGTCAGGCATAGG
TGGAGCTCGTGGATTTCTTATCGTTACAAACCATCGCAGAGCACCGTTCGATATACAATAGAGCTTCGGGC
ACTACGCCTACGCGGGTGATTAGGAACCCGTTACAAGGCAAGGACTCAATGGTGTCCCGGAATTTACGCCA
ACAACGGTTGTGAAGGGGATGCGGCGGACTATTGTTTAATGTGGTTGGATCCCACCGTGTGCAATCAGCCT
AGGGGAAACGCAGGAGTCAGAGGCAGTTGGAGTCAGATTGTGCATTAATCAGTTCGTAAGCCTTCCACGGA
GAGTAATCACAACGTCTCGGACAGAAGCTCCCTAGACGACTAGCTGAAAGTGCCCCCAAAGTGCTATGGCA
TCAATCCCT
S4:GCCTATTCGGATGTACTCTCTCCGCCCAGAAGTGAAGGAGTCAGATAGGTCCTTGCTATAACAGCCGC
AACACTCATCGTGCCGGCAGCCTAGCAGTTACCTGGATCCCAGATCTACCTTACCATTTCAGGCTAAATTT
AGGCTCGGGTACAAAAAACATCGCCGGGCTTCAACCTTGCCGCCCTTAACACACGGTGTGACTTTATACAG
GGAGATGGAGCATGGGCTGGCCTAGTGGGGTGTGGCGCTAATTTCCTCGCTAATGCTATGCGGAGCCCTGA
AAGCTGACTGGAGGAGGCCGAGCCGACAATGTCTCGTGAGTGGCATTGCGTTTAAGGAAGACTTTTGTCCG
ATCTACACCTTCCTCGAGTCTCCGCAGGGTTGTGCATAGTGGCTGTAGACAGAATCCAGCTGACAGGTCTG
CATTTAGAAATAGCTTAGCGTCCGCCGGACCACTGTCAACTTTACTGTGGCTCTCGTCTGCTGACTTTGAT
TATCTGAATGTGAGTCTCAGTAACTGACCTGGGCGTCTTCGGCGAAGGATCAATGAACGAATCAAAGAGGT
GAAGGGGCTTTCCTGCTAAGACCGTGCATCAGTACTAGCCGGTCGAGTCCTTTGCACGTCCGCCGCAGCCG
TACAGTCGATTGATATAGTCTACCCTCGATCCTTTAGCAAGTGCATATGCAGCCGACCAACCTTGCGGCAT
ACTCCAATCAACACTACCCAGATCCTAAGGTGACGGTTTCAGAGGATATACGAAGCGTATTGCACCGCGTA
TGTATTTAAGAACGGTGGGTGTTATGTCAGACGCGTCCGGTTTTAACCCTTTATACAAATCGTCTCGACAC
ACTACATCAATATATTACATGAAGGTGCATCACAGCCGGTCCACACCGGTT
S5:TCGGCTGTATTGGCGACCCAGGCGTGGGCTTAATGAATCAGAGACTCTGCAGCCAGGGAGTATGTATA
GCAGTTCTTTAAACGGTCTGCGACGAGGAAGGTTTCGAGTGTGCAACGTGAGGCTATCGTAAAAGTGTTTC
AACAGATGGGGGGCTATGAGCCGCTCGAACGTTACACACTGCACGCGGGGTCGACTAATGGAAGCTAACCT
AAGCTAATTGCCCTATTCGTGAAGAAACATCTAATTCCTTCCTTGTATGTGTTCTCCCTACAGCACATATC
GACAATAGGTTTTAGTGCTTTACCACAAGTAGCAAGTACAACTTGAATTGGGTAAGACTTGCACTTCATGT
ATTTGAAATCGCTATCCCACGACTTGGTGTCAACCCCCGGCTCTTTATCACCTTGCATACCCAGCGGCATC
AAGTGACCGACATATGATCTGGTAGTAGTTCAACCCTGAAGACTATCTTTAGCTCAGCGCGTTAAGTCCTT
ATACACTCTAGCGAGTGGGAAGGATGGATCGGCCGGACATCGTACGTAATTTAGAACCCAGTACCGAGACG
CGTTCGACAGTCCTAAGGCTCCATCAGAGTAGCTTACTACGTCACGAGTCAGGTAAAGCCGAGAGCGTCCG
ATCCATCCTTGGTGGATCAGCGTTCTCTGTTGTTGAACGCGAGGTAAACGTTGGTAACTTTTTCAACAGCA
GTAGAGTAGCGTGTAGTTACTCGGAGATCGACGTAACTGCGCGCCCTGCAACACTAAGCGCTGCGCTGTCT
GCTGCGCAGACTCTATGAGAGTCGCTCGTCTCCGTCTGCTTAGGGGGCGTTAGCACACTAATCACGGCTCA
AATATGTTAAAGAAGGAGCCCCATTTCCGTGACGTCAGTACGAGCAATTTACGATGGCAAAGAGAGCAAGA
CCTTCGCGCAGGGTACGGACCTGACAGCATGGGTTATCAAGGCCCTTTCCAGGTAATAAATTTCAGATTTA
GTACTTATCATGTAGATAAGTTGGAAACCTTGA
S6:GAAGACTCAGGGAGAGAAATTTTTCTTGATTCATTCTGCAGATTGGCTTACTACACATGCTCTTTTCC
ATGAAGTTGCAAAATTGGATGTGGTGAAATTATTATACAATGAGCAGTTTGCTGTTCAAGGGTTGTTGAGA
TACCATACATATGCAAGATTTGGCATTGAAATTCAAGTTCAGATAAACCCTACACCTTTCCAACAGGGGGG
ATTGATCTGTGCTATGGTTCCTGGTGACCAGAGCTATGGTTCTATAGCATCATTGACTGTTTATCCTCATG
GTTTGTTAAATTGCAATATTAACAATGTGGTTAGAATAAAGGTTCCATTTATTTACACAAGAGGTGCTTAC
CACTTTAAAGATCCACAATACCCAGTTTGGGAATTGACAATTAGAGTTTGGTCAGAATTAAATATTGGGAC
AGGAACTTCAGCTTATACTTCACTCAATGTTTTAGCTAGATTTACAGATTTGGAGTTGCATGGATTAACTC
CTCTTTCTACACAAATGATGAGAAATGAATTTAGGGTCAGTACTACTGAGAATGTGGTGAATCTGTCAAAT
TATGAAGATGCAAGAGCAAAGATGTCTTTTGCTTTGGATCAGGAAGATTGGAAATCTGATCCGTCCCAGGG
TGGTGGGATCAAAATTACTCATTTTACTACTTGGACATCTATTCCAACTTTGGCTGCTCAGTTTCCATTTA
ATGCTTCAGACTCAGTTGGTCAACAAATTAAAGTTATTCCAGTTGACCCATATTTTTTCCAAATGACAAAT
ACGAATCCTGACCAAAAATGTATAACTGCTTTGGCTTCTATTTGTCAGATGTTTTGTTTTTGGAGAGGAGA
TCTTGTCTTTGATTTTCAAGTTTTTCCCACCAAATATCATTCAGGTAGATTACTGTTTTGTTTTGTTCCTG
GCAATGAGCTAATAGATGTTTCTGGAATCACATTAAAGCAAGCAACTACTGCTCCTTGTGCAGTAATGGAT
ATTACAGGAGTGCAGTCAAC
S7:CAGTGGCGATGACCCTGGAAAAGAATATGCCGATCGGTTCGGGCTTAGGCTCCAGTGCCTGTTCGGTG
GTCGCGGCGCTGATGGCGATGAATGAACACTGCGGCAAGCCGCTTAATGACACTCGTTTGCTGGCTTTGAT
GGGCGAGCTGGAAGGCCGTATCTCCGGCAGCATTCATTACGACAACGTGGCACCGTGTTTTCTCGGTGGTA
TGCAGTTGATGATCGAAGAAAACGACATCATCAGCCAGCAAGTGCCAGGGTTTGATGAGTGGCTGTGGGTG
CTGGCGTATCCGGGGATTAAAGTCTCGACGGCAGAAGCCAGGGCTATTTTACCGGCGCAGTATCGCCGCCA
GGATTGCATTGCGCACGGGCGACATCTGGCAGGCTTCATTCACGCCTGCTATTCCCGTCAGCCTGAGCTTG
CCGCGAAGCTGATGAAAGATGTTATCGCTGAACCCTACCGTGAACGGTTACTGCCAGGCTTCCGGCAGGCG
CGGCAGGCGGTCGCGGAAATCGGCGCGGTAGCGAGCGGTATCTCCGGCTCCGGCCCGACCTTGTTCGCTCT
GTGTGACAAGCCGGAAACCGCCCAGCGCGTTGCCGACTGGTTGGGTAAGAACTACCTGCAAAATCAGGAAG
GTTTTGTTCATATTTGCCGGCTGGATACGGCGGGCGCACGAGTACTGGAAAACTAAATGAAACTCTACAAT
CTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACCCAGGGGTTGGGCAAAAATCAGGGGCT
GTTTTTTCCGCACGACCTGCCGGAATTCAGCCTGACTGAAATTGATGAGATGCTGAAGCTGGATTTTGTCA
CCCGCAGTGCGAAGATCCTCTCGGCGTTTATTGGTGATGAAATCCCACAGGAAATCCTGGAAGAGCGCGTG
CGCGCGGCGTTTGCCTTCCCGGCTCCGGTCGCCAATGTTGAAAGCGATGTCGGTTGTCTGGAATTGTTCCA
CGGGCCAACGCTGGCATTTAAAGATTTCGGCGG
S8:AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCA
GCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCA
ATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGC
CCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGT
TCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGG
CAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAA
AACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG
GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAA
ATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTG
CCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGTTACTGTTA
TCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACC
CGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGA
AAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGCTGCCTGTTTAC
GCGCCGATTGTTGCGAGATTTGGACGGACGTTG
S9:ACCCATAACGGGCAATGATAAAAGGAGTAACCTGTGAAAAAGATGCAATCTATCGTACTCGCACTTTC
CCTGGTTCTGGTCGCTCCCATGGCAGCAGAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGA
TAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAA
CATTATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGC
TCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAAATGACAAATGCCGGGTAACAAT
CCGGCATTCAGCGCCTGATGCGACGCTGGCGCGTCTTATCAGGCCTACGTTAATTCTGCAATATATTGAAT
CTGCATGCTTTTGTAGGCAGGATAAGGCGTTCACGCCGCATCCGGCATTGACTGCAAACTTAACGCTGCTC
GTAGCGTTTAAACACCAGTTCGCCATTGCTGGAGGAATCTTCATCAAAGAAGTAACCTTCGCTATTAAAAC
CAGTCAGTTGCTCTGGTTTGGTCAGCCGATTTTCAATAATGAAACGACTCATCAGACCGCGTGCTTTCTTA
GCGTAGAAGCTGATGATCTTAAATTTGCCGTTCTTCTCATCGAGGAACACCGGCTTGATAATCTCGGCATT
CAATTTCTTCGGCTTCACCGATTTAAAATACTCATCTGACGCCAGATTAATCACCACATTATCGCCTTGTG
CTGCGAGCGCCTCGTTCAGCTTGTTGGTGATGATATCTCCCCAGAATTGATACAGATCTTTCCCTCGGGCA
TTCTCAAGACGGATCCCCATTTCCAGACGATAAGGCTGCATTAAATCGAGCGGGCGGAGTACGCCATACAA
GCCGGAAAGCATTCGCAAATGCTGTTGGGCAAAATCGAAATCGTCTTCGCTGAAGGTTTCGGCCTGCAAGC
CGGTGTAGACATCACCTTTAAACGCCAGAATCG
Appendix B
Input 2
Appendix C
Input 3