A Fast Algorithm
Name: FASTA (revised version)
Developed: Pearson & Lipman 1988, Chao & Miller 1995, Huang 1996, Altschul 1990
Function: identify similar regions between two sequences and produce an alignment for each
similar region by means of a dynamic programming alignment technique.
Feature: time complexity is better than quadratic; space complexity is linear; suitable for
comparing long, distantly related sequences
Techniques for analyzing biological sequences fall into three classes: dynamic
programming, heuristic and probabilistic. Dynamic programming algorithms consider every
possible pairing of letters, so they are guaranteed to find the optimal score under the
specified scoring scheme. The time complexity of these algorithms is quadratic. Heuristic
algorithms examine as small a fraction as possible of the cells in the dynamic programming
matrix while still trying to visit all the high-scoring alignments. They may miss the
best-scoring alignment, but they greatly reduce the run time. Probabilistic algorithms do not
directly consider any specific match of words; instead they use historical and input data to
predict, in a statistical way, the alignment with the highest probability. The tradeoff is
between sensitivity and speed.
PRINCIPLE AND ASSUMPTION
Definition:
Segment pair: an alignment between a substring of A and a substring of B without any gaps.
Segment score: the sum of the match and mismatch scores of the aligned letter pairs in a segment pair
First antidiagonal: ants = astart + bstart of a segment pair
Last antidiagonal: antid = aend + bend of a segment pair
Chain: a sequence of segment pairs in increasing order of last antidiagonal
Similarity can be found in an optimal chain. The optimal chain has the maximum score among
the chains of segment pairs that are in increasing order of antidiagonal and do not intersect.
For example:
GGATCGTTC
ATTGTCGGTTC
GGATCGTTC
ATTGTCGGTTC
GG and AT intersect, so they will not appear together in an optimal chain
GGATCGTTC
ATTGTCGGTTC
Overlap is allowed, for example between AT and TC
A possible optimal chain:
AT TC CG GT TT TC (we call it the equivalence chain for the chain class starting with AT and
ending with TC)
Chain   Antd
AT      3 + 1  = 4
TC      4 + 5  = 9
CG      5 + 6  = 11
GT      6 + 8  = 14
TT      7 + 9  = 16
TC      8 + 10 = 18
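The antidiagonal values in the table can be reproduced with a small sketch. The `SegmentPair` class, its field names and the 0-based inclusive positions are illustrative assumptions, not part of the original program; treating "increasing order" as strictly increasing is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    astart: int  # 0-based start position in A
    bstart: int  # 0-based start position in B
    length: int  # number of aligned letter pairs (gap-free)

    @property
    def aend(self) -> int:
        return self.astart + self.length - 1

    @property
    def bend(self) -> int:
        return self.bstart + self.length - 1

    @property
    def ants(self) -> int:
        """First antidiagonal: astart + bstart."""
        return self.astart + self.bstart

    @property
    def antid(self) -> int:
        """Last antidiagonal: aend + bend."""
        return self.aend + self.bend


def is_chain(pairs):
    """True if the pairs are in strictly increasing order of last antidiagonal."""
    return all(p.antid < q.antid for p, q in zip(pairs, pairs[1:]))


# The chain AT TC CG GT TT TC from the example (A = GGATCGTTC, B = ATTGTCGGTTC).
chain = [SegmentPair(a, b, 2) for a, b in
         [(2, 0), (3, 4), (4, 5), (5, 7), (6, 8), (7, 9)]]
print([s.antid for s in chain])  # [4, 9, 11, 14, 16, 18]
print(is_chain(chain))           # True
```

Storing only the start positions and the length keeps the end positions and both antidiagonals derivable, so they cannot drift out of sync.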
The chain does contain similarity information, for example
GGAT___CG_TTC
  ||   || |||
  ATTGTCGGTTC
and
GGA___TC_GTTC
  |   || ||||
  ATTGTCGGTTC
The next questions are which alignment is the best and how to recover the detailed alignment
inside this diagonal band. Those problems are left to global alignment, such as the previously
presented linear-space global alignment algorithm. The method is to select the regions of
A and B whose start and end positions are given by the start segment pair AT and the end
segment pair TC, and to apply a global alignment algorithm to them to obtain all the
information we need.
When selecting the regions for global alignment, we may use the information in the segment
pair list to shrink the original sequences. For example, we can shrink the two sequences to:
ATCGTTC
ATTCGTTC
This reduces the problem size for global alignment, which is one of the beauties of this
algorithm.
STEPS
1. Create hash or lookup table. Arguments: sequence A and word length w
Function: store every word (character sequence) of length w that occurs in A in a hash
table. This preprocessing makes matching segment pairs between A and B faster.
Rule for choosing w: w affects the size of the hash table. Generally it should make the hash
table size close to the length of A. The smaller w is, the closer the hash table size is to the
length of A and the more similarity information is kept. However, a smaller w causes more
operations and uses more memory.
Programming note:
1. Every entry contains a list of offsets (start values) in increasing order. The entry also
contains a state attribute, which records the index in the offset list of the last offset found.
This state should be reset before the next search for segment pairs from B.
2. Implementation as a static hash table. For a static hash table we fix the table size in
advance according to w and use the hash value directly as the table index, so no sorting or
inserting is needed; the cost is a large dedicated block of memory if A is very long and
w is small.
For previous example:
A: GGATCGTTC
B: ATTGTCGGTTC
Entries sorted by increasing hash value; word length w = 2; hash value = sum of the ASCII
codes of the letters.

Hash index    Hash table entry                     Denotation
              (hash value, state, offset list)
136           (136,0,(1))                          GA
138           (138,0,(4))                          CG
142           (142,0,(0))                          GG
149           (149,0,(2))                          AT
151           (151,0,(3,7))                        TC
155           (155,0,(5))                          GT
168           (168,0,(6))                          TT

Hash indices between those listed are omitted.
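The lookup table for the example can be rebuilt with a few lines of Python. This is only a sketch: the function name `build_word_table` is hypothetical, and the table is keyed by the word itself rather than by the ASCII sum, because an ASCII sum cannot tell words such as GA and AG apart.

```python
from collections import defaultdict


def build_word_table(a, w):
    """Map every length-w word of `a` to its start offsets in increasing order."""
    table = defaultdict(list)
    for offset in range(len(a) - w + 1):
        table[a[offset:offset + w]].append(offset)
    return dict(table)


table = build_word_table("GGATCGTTC", 2)

# Print the entries in the same order as the example: increasing ASCII sum.
for word in sorted(table, key=lambda wd: sum(map(ord, wd))):
    print(sum(map(ord, word)), word, table[word])
```

Because the offsets are appended during a single left-to-right scan of A, each offset list is already in increasing order and needs no sorting.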
2. Computing high-scoring segment pairs list
Function: slide a window of length w along B from left to right to find every segment pair
against A, either a standard-length pair or an extended pair, and then build the segment
pair list.
Definition:
Cutoff d: threshold that decides whether a pair is deselected because its segment score is
too low. Its value is always positive.
Cutoff d3: threshold that decides when the extension of a match stops: extension ends once
the score has dropped by at least d3.
Programming note:
1. astart (the offset) is read from the hash table using the offset state stored in the table
entry.
2. An extended pair is obtained by extending a segment pair in both directions until the score
drops by at least d3.
3. The record for a segment pair contains astart, bstart, length, score, ants and antid.
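The scan-and-extend step can be sketched as follows. Several details are assumptions made for illustration only: the scores (match +1, mismatch -1), the function names `extend` and `segment_pairs`, the per-diagonal bookkeeping used to skip word hits that lie inside a pair already found, and the reading of the cutoff d as "keep a pair when its score is at least d".

```python
def extend(a, b, i, j, w, d3, match=1, mismatch=-1):
    """Extend the word hit a[i:i+w] == b[j:j+w] in both directions.

    In each direction the walk stops once the running score has dropped by
    at least d3 below the best score seen; the extension is cut back to
    that best point. Returns (astart, bstart, length, score).
    """
    def walk(ia, ib, step):
        best = run = gained = k = 0
        while 0 <= ia < len(a) and 0 <= ib < len(b):
            run += match if a[ia] == b[ib] else mismatch
            k += 1
            if run > best:
                best, gained = run, k
            elif best - run >= d3:
                break
            ia += step
            ib += step
        return gained, best

    right_len, right_score = walk(i + w, j + w, +1)
    left_len, left_score = walk(i - 1, j - 1, -1)
    return (i - left_len, j - left_len, w + left_len + right_len,
            w * match + left_score + right_score)


def segment_pairs(a, b, w, d, d3):
    """Return extended segment pairs with score >= d, sorted by aend + bend."""
    table = {}
    for i in range(len(a) - w + 1):
        table.setdefault(a[i:i + w], []).append(i)
    covered = {}  # diagonal (i - j) -> last position of A already covered
    pairs = []
    for j in range(len(b) - w + 1):
        for i in table.get(b[j:j + w], []):
            diag = i - j
            if covered.get(diag, -1) >= i + w - 1:
                continue  # hit lies inside a pair already found on this diagonal
            astart, bstart, length, score = extend(a, b, i, j, w, d3)
            covered[diag] = astart + length - 1
            if score >= d:
                pairs.append((astart, bstart, length, score))
    pairs.sort(key=lambda p: (p[0] + p[2] - 1) + (p[1] + p[2] - 1))
    return pairs


pairs = segment_pairs("GGATCGTTC", "ATTGTCGGTTC", w=2, d=2, d3=1)
print(pairs)
```

With d3 = 1 the first mismatch stops an extension, so every reported pair is an exact match; for instance the run GTTC (A positions 5 to 8, B positions 7 to 10) is reported as one extended pair of length 4 rather than as three separate word hits.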
For the previous example:
All segment pairs (GA does not occur in B, so it has no segment pair):

Word   astart   aend   bstart   bend   ants   antid
GA     1        2      -        -      -      -
CG     4        5      5        6      9      11
GG     0        1      6        7      6      8
AT     2        3      0        1      2      4
TC1    3        4      4        5      7      9
GT     5        6      7        8      12     14
TT     6        7      8        9      14     16
TC2    7        8      9        10     16     18

List in increasing order of antid:
AT GG TC1 CG GT TT TC2
3. Computing high-scoring chain of segment pairs
Function: get the list of optimal chains. The score of an optimal chain indicates the
similarity in the corresponding region.
Definition:
Closeness of segment pairs: two adjacent segment pairs S1, S2 in a chain are close if they
satisfy:
ants(S2) – antid(S1) < d1
aend(S1) – astart(S2) < d2
bend(S1) – bstart(S2) < d2
These requirements ensure that two selected adjacent segment pairs are not too far apart
and do not overlap too much, as measured by the two positive cutoffs d1 and d2 respectively.
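The three closeness conditions translate directly into a predicate. In this sketch the `Seg` tuple, its field names and the cutoff values d1 = 5 and d2 = 2 are illustrative assumptions; positions are 0-based and inclusive, taken from the segment pairs of the running example.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    astart: int
    aend: int
    bstart: int
    bend: int

    @property
    def ants(self):
        return self.astart + self.bstart  # first antidiagonal

    @property
    def antid(self):
        return self.aend + self.bend      # last antidiagonal


def close(s1, s2, d1, d2):
    """True if s2 may directly follow s1 in a chain."""
    not_too_far = s2.ants - s1.antid < d1
    small_overlap = (s1.aend - s2.astart < d2) and (s1.bend - s2.bstart < d2)
    return not_too_far and small_overlap


at = Seg(2, 3, 0, 1)
cg = Seg(4, 5, 5, 6)
gt = Seg(5, 6, 7, 8)
tc2 = Seg(7, 8, 9, 10)

print(close(cg, gt, d1=5, d2=2))   # True: antidiagonal distance 1, overlap 0 in A
print(close(at, tc2, d1=5, d2=2))  # False: antidiagonal distance 12 is not < 5
```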
tscore: the sum of the scores of the longest portion of a segment pair that does not overlap
the adjacent segment pair.
cutoff ic: positive threshold on tscore. If the tscore of two adjacent segment pairs does not
exceed ic, the latter pair's contribution to the chain is ignored, which means the latter pair
is deselected from the chain.
cutoff f: positive threshold that decides whether a chain class is counted (chains whose score
is less than f are cut off). If a segment pair has no chain class that is counted, the pair is
deselected.
gap(): gap penalty for two arbitrary segment pairs
chain score: computed score for a chain c = (s1, ..., sK) of segment pairs, defined as:
\[ score(c) = score(s_1) + \sum_{i=2}^{K} \left[ tscore(s_{i-1}, s_i) - gap(s_{i-1}, s_i) \right] \]
In practice the fast algorithm never evaluates this sum directly. Instead it obtains the
value through a so-called traceback technique, which is the key to the improved speed.
chain class: a group of chains with the same start and end segment pairs
equivalence class: the set of chains with the same end segment pair, each of which has the
maximum chain score within its own chain class
optimal chain: the chain with the maximum chain score in the equivalence class for a given
end segment pair
Qc(si, sk): maximum score for a chain class
Qc(si, sk) = Max { score(chain) | chain starts at si and ends at sk }
Q(sk): maximum score for an equivalence class; the result is the chain score of the optimal chain
Q(sk) = Max { Qc(si, sk) | 1 <= i < k }
K(s): start segment pair of the optimal chain ending at s
For example, a list of segment pairs (s1,s2,s3,s4)
Chain class 1: { chain 1: [s1 s2 s3 s4] = 26, chain 2: [s1 s2 s4] = 20, chain 3: [s1 s3 s4] = 25,
chain 4: [s1 s4] = 15}
Qc(class 1) = 26;
Chain class 2: { chain 5: [s2 s3 s4] = 30, chain 6: [s2 s4] = 16}
Qc(class 2) = 30;
Chain class 3: { chain 7: [s3 s4] = 28}
Qc(class 3) = 28;
Equivalence class = {chain 1, chain 5, chain 7}
Therefore Q(s4) = Qc(class 2) = 30;
K(s4) = K(chain 5 | class 2) = s2;
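The example can be checked mechanically. The sketch below simply enumerates the seven chains with the scores given above; the dictionary layout and the helper name `best_per_class` are assumptions for illustration, and the real algorithm avoids this enumeration.

```python
# Chain scores from the example, keyed by the segment pairs in each chain.
chains = {
    ("s1", "s2", "s3", "s4"): 26,
    ("s1", "s2", "s4"): 20,
    ("s1", "s3", "s4"): 25,
    ("s1", "s4"): 15,
    ("s2", "s3", "s4"): 30,
    ("s2", "s4"): 16,
    ("s3", "s4"): 28,
}


def best_per_class(chains, end):
    """Qc(si, end): best score for each start si among the chains ending at `end`."""
    qc = {}
    for chain, score in chains.items():
        if chain[-1] == end:
            start = chain[0]
            qc[start] = max(qc.get(start, score), score)
    return qc


qc = best_per_class(chains, "s4")
k = max(qc, key=qc.get)  # K(s4): start pair of the optimal chain
q = qc[k]                # Q(s4): score of the optimal chain

print(qc)    # {'s1': 26, 's2': 30, 's3': 28}
print(q, k)  # 30 s2
```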
Programming note:
[Figure: programming flow chart]
1. The chain class object S (a class in the programming sense) contains the maximum score
Q(s), the start segment pair K(s), and the segment pair s itself. The output of this
computation is a list of S objects, each describing an optimal chain with its start segment
pair K(s).
2. The key computation, the maximum score of a chain of segment pairs, is iterative:
Let s1, s2, ... be the list of segment pairs in increasing order of last antidiagonal. The
maximum score of a chain ending at si is:
Q(s1) = score(s1)
Q(si) = max { score(si), Q(sj) + tscore(sj,si) – gap(sj,si)
| 1 <= j < i, close(sj,si), and tscore(sj,si) > ic } for i > 1
where
gap(s1,s2) = q + r * [l(astart(s2) – aend(s1)) + l(bstart(s2) – bend(s1))]
l(x) = x if x > 0 and 0 otherwise.
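The recurrence can be sketched as a simple quadratic loop over pairs sorted by last antidiagonal. This is an illustration under stated assumptions, not the tuned implementation: the `Seg` tuple and function names are hypothetical, every aligned column is assumed to score `match` (exact-match pairs), a chain consisting of si alone is assumed to score score(si), and tscore is assumed to trim the overlapping leading columns from the later pair.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    astart: int
    aend: int
    bstart: int
    bend: int
    score: int


def pos(x):
    """l(x) in the text: x if x > 0, otherwise 0."""
    return x if x > 0 else 0


def gap(s1, s2, q, r):
    return q + r * (pos(s2.astart - s1.aend) + pos(s2.bstart - s1.bend))


def close(s1, s2, d1, d2):
    return ((s2.astart + s2.bstart) - (s1.aend + s1.bend) < d1
            and s1.aend - s2.astart < d2
            and s1.bend - s2.bstart < d2)


def tscore(s1, s2, match=1):
    aover = s2.astart - s1.aend
    bover = s2.bstart - s1.bend
    if aover > 0 and bover > 0:
        return s2.score
    t = max(-aover, -bover)  # t + 1 leading columns of s2 are trimmed
    length = s2.aend - s2.astart + 1
    return s2.score - match * min(t + 1, length)


def best_chains(pairs, d1, d2, ic, q, r):
    """Return (Q, K): best chain score ending at each pair and its start index."""
    Q, K = [], []
    for i, si in enumerate(pairs):
        best, start = si.score, i  # the chain that consists of si alone
        for j in range(i):
            sj = pairs[j]
            if not close(sj, si, d1, d2):
                continue
            t = tscore(sj, si)
            if t <= ic:
                continue
            cand = Q[j] + t - gap(sj, si, q, r)
            if cand > best:
                best, start = cand, K[j]
        Q.append(best)
        K.append(start)
    return Q, K


# Three hypothetical pairs, already sorted by last antidiagonal (10, 24, 31).
pairs = [Seg(0, 5, 0, 5, 6), Seg(7, 12, 7, 12, 6), Seg(12, 15, 13, 16, 4)]
Q, K = best_chains(pairs, d1=10, d2=2, ic=0, q=1, r=1)
print(Q, K)  # [6, 7, 8] [0, 0, 0]
```

Carrying K alongside Q is what lets the start of the optimal chain be read off without storing the chains themselves.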
3. Efficient computation of tscore using an array R of size d2
The formula for tscore can be written as
tscore(s1, s2) = score(s2), if aover(s1,s2) > 0 and bover(s1,s2) > 0
= score(s2) – R(max{-aover(s1,s2), -bover(s1,s2)}), otherwise
where
aover(s1,s2) = astart(s2) - aend(s1);
bover(s1,s2) = bstart(s2) - bend(s1);
R(t) = sum of the scores of the first t+1 aligned pairs in s2, if s2 has at least t+1 aligned pairs
= score(s2), otherwise
for 0 <= t < d2.
It is therefore best to compute the R array for si before computing Q(si); this keeps the
whole computation smooth and efficient.
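A minimal sketch of the R array as a table of prefix sums follows. It assumes that R is built from the per-column scores of the later pair (the one whose overlapping leading columns are trimmed); the function names and the sample column scores are hypothetical.

```python
from itertools import accumulate


def build_r(column_scores, d2):
    """R[t] = score of the first t+1 columns, or the whole score if the pair is shorter."""
    prefix = list(accumulate(column_scores))
    total = prefix[-1]
    return [prefix[t] if t < len(prefix) else total for t in range(d2)]


def tscore(score2, r2, aover, bover):
    """tscore of the later pair given its score, its R array and the two offsets."""
    if aover > 0 and bover > 0:
        return score2  # no overlap: the whole pair counts
    return score2 - r2[max(-aover, -bover)]


# A pair with five columns: match, match, mismatch, match, match.
columns = [1, 1, -1, 1, 1]
r = build_r(columns, d2=3)
print(r)                                   # [1, 2, 1]
print(tscore(sum(columns), r, -1, 2))      # 1: two leading columns trimmed
print(tscore(sum(columns), r, 1, 1))       # 3: no overlap
```

The index max{-aover, -bover} always falls inside R when the two pairs are close, because the closeness test bounds both overlaps below d2, so R needs only d2 entries and each tscore lookup takes constant time.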
4. Computing the largest-scoring alignment over the band of diagonals.
Function: use the high-scoring chain of segment pairs to select the two subsequences of A
and B that it spans. Then apply the linear-space alignment algorithm (or another efficient
alignment algorithm) to these subsequences to get the result.
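As an illustration of the final step, the sketch below computes a global alignment score while keeping only two rows of the dynamic programming matrix. It is a simplification under stated assumptions: it uses a single linear gap penalty instead of the q and r penalties, it does not restrict the computation to a diagonal band, and it returns only the score. Recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's, which is not shown.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b using O(len(b)) memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# The two shrunken sequences from the example.
print(global_score("ATCGTTC", "ATTCGTTC"))  # 6: seven matches and one gap
```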
SUMMARY
[Figure: programming flow chart]
Techniques Contributed To Fast Computation
1. Hash table to speed up the search
2. Extended matching
3. Dynamic programming computation of the maximum score of a chain
4. Traceback (backward) technique to avoid heavy memory use
5. Array operations to compute tscore
6. Time complexity: at least better than quadratic (proof not yet done)
7. Space complexity: memory use is proportional to the total number of segment pairs.
This number is less than the length of the longest sequence, so the space complexity is
linear.
8. Linear-space alignment technique to compute a largest-scoring alignment over a
band of diagonals (optional)
9. All cutoffs used in this algorithm (d, d1, d2, d3, ic, f), the sliding window size w, the
scores for letter match and mismatch, and the gap penalties q for opening a gap and r for
extending it significantly influence the efficiency. They are case-dependent, so careful,
skillful tuning will speed up the computation. (This raises the question of parameter
estimation techniques and AI.)
10. Comparing this revised version with the original in the reference, I found that the
improvements are the extended matching, the cutoff for chains whose score is less than f,
and the method of computing tscore.
COMMENTS
1. The article did not consider the situation where a word occurs more than once in sequence
A, which is why it did not mention that the hash table needs an offset state and an offset list
to handle this situation. The article also did not say what kind of hash table the algorithm
uses, static or dynamic.
I checked the reference for the original FASTA and found that it does use an offset list, and
that it uses a static hash table. This seems reasonable because memory use is less critical
than search time.
Static or dynamic hash table?
There are two ways to implement a hash table, statically and dynamically. For a static
hash table we fix the table size in advance according to w and use the hash value directly as
the table index, so no sorting or inserting is needed; the cost is a large dedicated block of
memory if A is very long and w is small. A dynamic hash table needs extra space in every
entry for the hash value, but since the memory is allocated dynamically, the table size is not
a critical concern.
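The difference between the two layouts can be made concrete with a short sketch. The base-4 word encoding used for the static table is an assumption chosen for illustration (the example earlier in this document uses the sum of ASCII codes), and both function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_index(word):
    """Encode a DNA word as a base-4 number, used directly as the table index."""
    idx = 0
    for ch in word:
        idx = idx * 4 + CODE[ch]
    return idx


def static_table(a, w):
    """Fixed-size table with 4**w slots: no keys stored, but many slots may stay empty."""
    table = [[] for _ in range(4 ** w)]
    for i in range(len(a) - w + 1):
        table[word_index(a[i:i + w])].append(i)
    return table


def dynamic_table(a, w):
    """Table that grows on demand: stores each key, but only for words that occur."""
    table = {}
    for i in range(len(a) - w + 1):
        table.setdefault(a[i:i + w], []).append(i)
    return table


a = "GGATCGTTC"
st = static_table(a, 2)
dy = dynamic_table(a, 2)
print(len(st), sum(1 for slot in st if slot))  # 16 slots, 7 of them used
print(len(dy))                                 # 7 entries
print(st[word_index("TC")], dy["TC"])          # [3, 7] [3, 7]
```

Under this encoding the static table has 4**w slots regardless of the length of A, while the dynamic table holds one entry per distinct word that actually occurs.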
2. The instruction in step 2 about extending a segment pair is not accurate. The article says
“If a word match is contained in a segment pair already considered, the match is not
extended.” This applies only when the offset of the new segment pair is less than the end of
the previous extended segment pair. That is reasonable because we do not want to recompute
the extension. But if the new segment pair lies beyond the previous extended segment pair,
we still need to consider extending it, as illustrated below.
[Figure: sequences A and B with a word of length w; labels: start, end, offset, extension,
previous extended segment pair, non-extended new segment pair, new extended segment pair]
I did not find how the original defined this; I hope my understanding is right.
The following issues may go beyond the scope of a discussion of the algorithm and relate
instead to modeling, but they may lead to further research on this algorithm.
3. When selecting a similar region for alignment analysis, we may exploit the information
in the segment pair list to shrink the original sequences and so reduce the problem size for
global alignment. This is a beauty the author did not mention, and it deserves exploration.
4. It would be better to assign matching scores based on knowledge rather than a simple
match and mismatch score. This may give AI a role to play, and it would also strengthen
the role of the cutoff d3 defined in this algorithm.
5. There are too many parameters to tune for this algorithm to perform well. To some extent
this may make the algorithm impractical, because we cannot expect users to have enough skill
to tune them well. Hopefully a probabilistic approach, such as maximum likelihood
estimation of the parameters, can address this problem.
6. Although probabilistic models and algorithms, such as the Hidden Markov Model, are more
suitable for comparing very long and distantly related sequences, they still rely on heuristic
or dynamic programming techniques to obtain distribution information. Moreover, in an AI
approach, combining a probabilistic model, a highly efficient FASTA and machine learning
may make biological sequence analysis more powerful and more widely applicable.
REFERENCE
1. Jiang, Xu, and Zhang, Current Topics in Computational Molecular Biology
2. Setubal and Meidanis, Introduction to Computational Molecular Biology
3. Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1998. Biological Sequence Analysis.
Cambridge University Press.
4. Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences of the USA 85:2444-2448.
5. http://kisac.cmb.ki.se/gcgmanual/fasta.html
6. http://www.iturls.com/English/TechHotspot/TH_Bioinformatics.asp
7. http://courses.cs.vt.edu/~algnbio/FASTA.php
8. http://newfish.mbl.edu/Course/Software/FASTA/
9. http://www.cs.ualberta.ca/~bioinfo/public/references.html