A Fast Algorithm

Name: FASTA (revised version)
Developed: Pearson & Lipman 1988, Chao & Miller 1995, Huang 1996, Altschul 1990
Function: identify similar regions between two sequences and produce an alignment for each similar region by means of a dynamic programming alignment technique.
Feature: time complexity is better than quadratic; space complexity is linear; suitable for comparing long, distantly related sequences.

The analytic techniques for biological sequences fall into three levels: dynamic programming, heuristic and probabilistic. Dynamic programming algorithms consider every possible match of letters, so they are guaranteed to find the optimal score under the specified scoring scheme. Their time complexity is quadratic. Heuristic algorithms search as small a fraction as possible of the cells in the dynamic programming matrix while still trying to visit all the high-scoring alignments. They may miss the best-scoring alignment, but they greatly reduce run time. Probabilistic algorithms do not directly consider any specific match of words; instead they use historical and input data to predict the most probable alignment by statistical means. The tradeoff is between sensitivity and speed.

PRINCIPLE AND ASSUMPTION

Definition:
Segment pair: an alignment between A and B without any gaps.
Segment score: the sum of the match and mismatch scores of the aligned pairs in a segment pair.
First antidiagonal: ants = astart + bstart of a segment pair.
Last antidiagonal: antid = aend + bend of a segment pair.
Chain: a sequence of segment pairs in increasing order of last antidiagonal.

Similarity can be found in an optimal chain. The optimal chain has the maximum score among the chains of segment pairs that are in increasing order of antidiagonal and do not intersect.
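These definitions can be made concrete with a short sketch. The class name SegmentPair, the helper names, and the match/mismatch scores of +1/-1 are assumptions chosen for illustration; coordinates are taken to be 0-based with inclusive ends.

```python
from dataclasses import dataclass

MATCH, MISMATCH = 1, -1  # assumed scoring scheme, for illustration only


@dataclass(frozen=True)
class SegmentPair:
    """Gap-free alignment of A[astart..aend] with B[bstart..bend]."""
    astart: int
    bstart: int
    length: int  # number of aligned letter pairs

    @property
    def aend(self):
        return self.astart + self.length - 1

    @property
    def bend(self):
        return self.bstart + self.length - 1

    @property
    def ants(self):
        # first antidiagonal
        return self.astart + self.bstart

    @property
    def antid(self):
        # last antidiagonal
        return self.aend + self.bend


def segment_score(a, b, sp):
    """Sum of match/mismatch scores over the aligned pairs of sp."""
    return sum(MATCH if a[sp.astart + k] == b[sp.bstart + k] else MISMATCH
               for k in range(sp.length))


def chain_order(pairs):
    """Order segment pairs by increasing last antidiagonal."""
    return sorted(pairs, key=lambda sp: sp.antid)
```

With A = GGATCGTTC and B = ATTGTCGGTTC, SegmentPair(2, 0, 2) is the word AT and has antid 4, and SegmentPair(3, 4, 2) is the word TC and has antid 9, so chain_order places AT before TC.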
For example:

A: GGATCGTTC
B: ATTGTCGGTTC

The segment pairs GG and AT intersect, so they will not appear together in an optimal chain. Overlap is allowed, for example between AT and TC.

Possible optimal chain: AT TC CG GT TT TC (we call it the equivalence chain for the chain class that starts with AT and ends with TC).

Chain   antid (aend + bend)
AT      3 + 1  = 4
TC      4 + 5  = 9
CG      5 + 6  = 11
GT      6 + 8  = 14
TT      7 + 9  = 16
TC      8 + 10 = 18

The chain does contain similarity information, for example

GGAT___CG_TTC
  ||   || |||
  ATTGTCGGTTC

and

GGA___TC_GTTC
  |   || ||||
  ATTGTCGGTTC

The next question is which one is the best alignment, and how we can find the detailed information within this diagonal band. Those problems are left to global alignment, such as the previously presented linear-space global alignment algorithm. The method is to select the regions of A and B whose start and end positions are given by the start segment pair AT and the end segment pair TC, apply the global alignment algorithm to them, and then extract all the information we need. When selecting the regions for global alignment, we may use the information in the segment pair list to shrink the original sequences. For example, we shrink the two sequences to:

ATCGTTC
ATTCGTTC

This reduces the problem size for global alignment, which is one of the beauties of this algorithm.

STEPS

1. Create hash or lookup table.

Arguments: sequence A and word length w.
Function: store in a hash table every word (character sequence) of length w that occurs in A. This preprocessing makes matching segment pairs between A and B faster.
Rule for deciding w: w affects the size of the hash table. Generally w should make the hash table size close to the length of A. The smaller w is, the closer the hash table size is to the length of A and the more similarity information is kept. However, a smaller w causes more operations and uses more memory.

Programming note:
1. Every entry contains a list of offsets (start values) in increasing order.
The entry also contains a state attribute, which records the index last used in the offset list. This state should be reset before the next search for segment pairs from B.
2. The implementation of a static hash table. For the static hash table we set a fixed table size corresponding to w, in which the hash value is the same as the hash index, so that no sorting or inserting is needed; however, it needs a large dedicated block of memory if sequence A is very long and w is small.

For the previous example:

A: GGATCGTTC
B: ATTGTCGGTTC

Entries are sorted by increasing hash value; word length w = 2; hash value = sum of the ASCII codes of the letters in the word.

Hash indices corresponding to hash values: … 136 … 138 … 142 … 149 … 151 … 155 … 168 …

Hash table entry (hash value, state, offset list)    Denotation
(136, 0, (1))                                         GA
(138, 0, (4))                                         CG
(142, 0, (0))                                         GG
(149, 0, (2))                                         AT
(151, 0, (3, 7))                                      TC
(155, 0, (5))                                         GT
(168, 0, (6))                                         TT

2. Computing high-scoring segment pairs list

Function: slide a window of length w along B from left to right to find every segment pair against A, either a standard-length pair or an extended pair, and then prepare the segment pair list.

Definition:
Cutoff d: threshold that decides whether a segment pair is deselected on the basis of its segment score. Its value is always positive.
Cutoff d3: threshold that decides when the extension of a match stops: extension stops once the score has dropped by at least d3.

Programming note:
1. astart (the offset) is obtained from the hash table according to the offset state stored in the hash table.
2. The extended pair is obtained by extending the segment pair in both directions until the score drops by at least d3.
3. The entity for a segment pair contains astart, bstart, length, score, ants and antid.

For the previous example:

All segment pairs (GA does not occur in B, so it has no B coordinates):

Pair   astart  bstart  aend  bend  ants  antid
GA     1       -       2     -     -     -
CG     4       5       5     6     9     11
GG     0       6       1     7     6     8
AT     2       0       3     1     2     4
TC1    3       4       4     5     7     9
GT     5       7       6     8     12    14
TT     6       8       7     9     14    16
TC2    7       9       8     10    16    18

List in increasing order of antid: AT GG TC1 CG GT TT TC2

3. Computing high-scoring chain of segment pairs

Function: Get the optimal chain list.
The score of the optimal chain indicates the similarity in the corresponding region.

Definition:
Closeness of segment pairs: two adjacent segment pairs S1, S2 in the sequence are close if they satisfy

ants(S2) - antid(S1) < d1
aend(S1) - astart(S2) < d2
bend(S1) - bstart(S2) < d2

This requirement ensures that two selected, adjacent segment pairs are not too far apart and do not overlap too much, as measured by the two positive cutoffs d1 and d2 respectively.

tscore: sum of scores of the longest portion of a segment pair that has no overlap with the adjacent segment pair.
cutoff ic: positive threshold that decides whether the latter of two adjacent segment pairs contributes to the chain; if their tscore does not exceed ic, the latter pair is deselected from the chain.
cutoff f: positive threshold that decides whether a chain class is counted. If a segment pair has no chain class, the pair is deselected.
gap(): gap penalty for two arbitrary segment pairs.
chain score: computed score for a chain c of K segment pairs s_1, ..., s_K, defined as:

score(c) = score(s_1) + \sum_{i=2}^{K} [ tscore(s_{i-1}, s_i) - gap(s_{i-1}, s_i) ]

In practice this equation is not evaluated directly in the fast algorithm; instead a so-called traceback technique is used to obtain an equivalent value. That is the key to improving speed.
chain class: a group of chains with the same start and end segment pairs.
equivalence class: the set of chains with the same end segment pair, each of which has the maximum chain score within its chain class.
optimal chain: a chain with the maximum chain score in the equivalence class with the same end segment pair.
Qc(si, sk): maximum score for a chain class:
Qc(si, sk) = max { score(chain) | chain starts at si and ends at sk }
Q(sk): maximum score for the equivalence class; the result is the chain score of the optimal chain:
Q(sk) = max { Qc(si, sk) | 1 <= i < k }
K(s): start segment pair of the optimal chain.

For example, for a list of segment pairs (s1, s2, s3, s4):

Chain class 1: { chain 1: [s1 s2 s3 s4] = 26, chain 2: [s1 s2 s4] = 20, chain 3: [s1 s3 s4] = 25, chain 4: [s1 s4] = 15 }; Qc(class 1) = 26
Chain class 2: { chain 5: [s2 s3 s4] = 30, chain 6: [s2 s4] = 16 }; Qc(class 2) = 30
Chain class 3: { chain 7: [s3 s4] = 28 }; Qc(class 3) = 28
Equivalence class = { chain 1, chain 5, chain 7 }

Therefore Q(s4) = Qc(class 2) = 30 and K(s4) = K(chain 5 | class 2) = s2.

Programming note:

[Figure: programming flow chart]

1. The chain class (programming class) object S contains the maximum score Q(s), the start segment pair K(s), and the segment pair itself. The output of this computation is a list of S objects. Each S represents an optimal chain, and K(s) is its start segment pair.
2. The key computation, the maximum score of a chain of segment pairs, is iterative. Let s1, s2, ... be a list of segment pairs; the maximum score of a chain ending at each pair is:

Q(s1) = score(s1)
Q(si) = max { score(si), Q(sj) + tscore(sj, si) - gap(sj, si) | 1 <= j < i, close(sj, si), and tscore(sj, si) > ic }   for i > 1

where
gap(s1, s2) = q + r * [ l(astart(s2) - aend(s1)) + l(bstart(s2) - bend(s1)) ]
l(x) = x if x > 0, and 0 otherwise.
3.
Effective computation of tscore using an array R of size d2. The formula for tscore is

tscore(s1, s2) = score(s2)                                              if aover(s1, s2) > 0 and bover(s1, s2) > 0
               = score(s2) - R(max{ -aover(s1, s2), -bover(s1, s2) })   otherwise

where
aover(s1, s2) = astart(s2) - aend(s1)
bover(s1, s2) = bstart(s2) - bend(s1)
R(t) = sum of scores of the first t+1 aligned pairs in s2   if there are at least t+1 aligned pairs in s2
     = score(s2)                                            otherwise
for 0 <= t < d2.

So it is best to compute the R array for si before computing Q(si); this keeps the whole computation smooth and efficient.

4. Computing the largest-scoring alignment over the band of diagonals.

Function: pick the high-scoring chain of segment pairs to form two sequences corresponding to A and B respectively. Then apply the linear-space alignment algorithm (or another efficient alignment algorithm) to these sequences to get the result.

SUMMARY

[Figure: programming flow chart]

Techniques Contributed To Fast Computation
1. Hash table to speed up search
2. Extended matching
3. Dynamic programming computation of the maximum chain score
4. Traceback (backward) technique to avoid huge memory consumption
5. Array operations to compute tscore
6. Time complexity: at least better than quadratic (proof not done yet)
7. Space complexity: memory use is proportional to the total number of segment pairs. This number is less than the length of the longer sequence, so the space complexity is linear.
8. Using the linear-space alignment technique to compute a largest-scoring alignment over a band of diagonals (optional)
9. All cutoffs used in this algorithm, including d, d1, d2, d3, ic and f, the sliding window size w, the match and mismatch scores of letters, and the penalties q for opening a gap and r for extending it, significantly influence the efficiency. They are case-based, so fine and skillful tuning will speed up the computation. (This raises the question of parameter estimation techniques and AI.)
10.
Comparing with the original in the reference, I found that the improvements of this revised version consist of the extended matching, the cutoff for chains whose score is less than f, and the method of computing tscore.

COMMENTS

1. The article did not consider the situation where a word occurs more than once in sequence A. This is why it did not mention that we need an offset state and an offset list in the hash table to handle this situation. The article also did not mention what kind of hash table the algorithm uses, static or dynamic. I checked the reference for the original and found that the original FASTA did use an offset list. It uses a static hash table. This seems more reasonable because memory consumption is less critical than search time.

Static or dynamic hash table? There are two ways to implement a hash table, statically and dynamically. For the static hash table, we fix the table size corresponding to w in advance; the hash value is the same as the hash index, so no sorting or inserting is needed, but a large dedicated block of memory is needed if sequence A is very long and w is small. The dynamic hash table needs extra space in every hash entry for the hash value. But since the memory is allocated dynamically, the hash table size is not a critical concern.

2. An instruction in step 2 about the extension of segment pairs is not accurate. The article says "If a word match is contained in a segment pair already considered, the match is not extended." This applies only to the situation where the offset of the new segment pair is less than the end of the previous segment pair. This is reasonable because we do not want to recompute it. But if the new segment pair has left the extended segment pair, we still need to consider the extension. See the illustration below.

[Figure: sequences A and B with word length w, marking start, end and offset; it shows the previous extended segment pair, a non-extended new segment pair, and a new extended segment pair]

I did not find how the original defined this.
I hope my understanding is right.

The following issues may exceed the scope of the algorithm discussion and relate to the modeling problem instead. But they may lead to further research on this algorithm.

3. When selecting a similar region for alignment analysis, we may exploit the information in the segment pair list to shrink the original sequences, and so reduce the problem size for global alignment. This is a beauty the author did not mention, and it deserves exploration.
4. It would be good to assign different matching scores based on knowledge, rather than a simple match and mismatch score. This may provide opportunities for AI to play a role, and would also enhance the role of the cutoff d3 defined in this algorithm.
5. Too many parameters need tuning for this algorithm to perform well. To some extent this may make the algorithm impractical, because we cannot expect users to have enough skill to do this well. Hopefully a probabilistic approach, such as maximum likelihood estimation of the parameters, can address this problem.
6. Although probabilistic models or algorithms, such as the Hidden Markov Model, are more suitable for comparing extra-long and distantly related sequences, they still need to rely on heuristic or dynamic programming techniques to get distribution information. Moreover, in an AI approach, the combination of a probabilistic model, highly efficient FASTA and machine learning may make biological sequence analysis more powerful and widely applicable.

REFERENCE

1. Jiang, Xu, and Zhang, Current Topics in Computational Molecular Biology.
2. Setubal and Meidanis, Introduction to Computational Molecular Biology.
   Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1998. Biological Sequence Analysis. Cambridge University Press.
3. Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA 85:2444-2448.
4.
http://kisac.cmb.ki.se/gcgmanual/fasta.html
5. http://www.iturls.com/English/TechHotspot/TH_Bioinformatics.asp
6. http://courses.cs.vt.edu/~algnbio/FASTA.php
7. http://newfish.mbl.edu/Course/Software/FASTA/
8. http://www.cs.ualberta.ca/~bioinfo/public/references.html
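
As a closing illustration, steps 1 to 3 can be sketched in a few dozen lines. This is a simplified sketch under stated assumptions, not the implementation from the references. All function and class names are hypothetical. A Python dict keyed by the word replaces the static ASCII-sum table, every offset of a repeated word is kept instead of tracking an offset state, and segment pairs are not extended (no d3) or filtered (no d or f). The chaining step scans all earlier pairs directly, so it does not use the traceback technique. R is taken over the later segment pair, coordinates are 0-based with inclusive ends, and every match scores +1 by default.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    """Gap-free segment pair; coordinates are 0-based and inclusive."""
    astart: int
    bstart: int
    scores: tuple  # score of each aligned letter pair, left to right

    @property
    def score(self):
        return sum(self.scores)

    @property
    def aend(self):
        return self.astart + len(self.scores) - 1

    @property
    def bend(self):
        return self.bstart + len(self.scores) - 1

    @property
    def ants(self):
        return self.astart + self.bstart

    @property
    def antid(self):
        return self.aend + self.bend


def word_index(a, w):
    """Step 1: map every length-w word of A to its increasing offset list."""
    index = defaultdict(list)
    for off in range(len(a) - w + 1):
        index[a[off:off + w]].append(off)
    return index


def seed_pairs(a, b, w, match=1):
    """Step 2 without extension: each exact word match becomes a segment pair."""
    index = word_index(a, w)
    return [Seg(off, j, (match,) * w)
            for j in range(len(b) - w + 1)
            for off in index.get(b[j:j + w], [])]


def prefix_sums(s, d2):
    """R[t] = score of the first t+1 aligned pairs of s, capped at score(s)."""
    R, running = [], 0
    for t in range(d2):
        if t < len(s.scores):
            running += s.scores[t]
        R.append(running)
    return R


def close(s1, s2, d1, d2):
    """Closeness test with the positive cutoffs d1 and d2."""
    return (s2.ants - s1.antid < d1
            and s1.aend - s2.astart < d2
            and s1.bend - s2.bstart < d2)


def tscore(s1, s2, R2):
    """Score of the part of s2 not overlapping s1; R2 = prefix_sums(s2, d2)."""
    aover = s2.astart - s1.aend
    bover = s2.bstart - s1.bend
    if aover > 0 and bover > 0:
        return s2.score
    return s2.score - R2[max(-aover, -bover)]


def gap(s1, s2, q, r):
    pos = lambda x: x if x > 0 else 0
    return q + r * (pos(s2.astart - s1.aend) + pos(s2.bstart - s1.bend))


def best_chains(segs, d1, d2, ic, q, r):
    """Step 3: Q[i] = best chain score ending at segs[i]; K[i] = index of its start."""
    segs = sorted(segs, key=lambda s: s.antid)
    Q, K = [], []
    for i, si in enumerate(segs):
        R = prefix_sums(si, d2)  # computed once per si, before Q(si)
        best, start = si.score, i
        for j in range(i):  # plain scan over all earlier pairs (simplification)
            sj = segs[j]
            if not close(sj, si, d1, d2):
                continue
            t = tscore(sj, si, R)
            if t > ic:
                cand = Q[j] + t - gap(sj, si, q, r)
                if cand > best:
                    best, start = cand, K[j]
        Q.append(best)
        K.append(start)
    return segs, Q, K
```

For the example sequences, word_index("GGATCGTTC", 2)["TC"] gives the offset list [3, 7], as in the hash table of step 1. seed_pairs("GGATCGTTC", "ATTGTCGGTTC", 2) returns 11 word matches, more than the seven pairs tabulated in step 2, because this sketch keeps every offset of a repeated word.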