A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison
X. Chen, S. Kwong, M. Li, Genome Informatics
(GIW'99), Tokyo, Japan, pp.51-61, 1999
Compressibility of DNA
Relatedness between two DNA sequences
Can we compress DNA sequences?


DNA sequences consist of only 4 nucleotide
bases {A, G, C, T}, so 2 bits suffice for each base.
Standard general-purpose compression algorithms do not
work for DNA (they even expand the data).
Why?
Related Work

Biocompress and Biocompress-2, in the spirit of the Ziv and Lempel data
compression method (the second one also uses arithmetic
coding of order 2).
=> They search the previously processed part of the
sequence for repeats.
Related Work…

Cfact: same idea as Biocompress-2, but without
arithmetic coding of order 2.
1st phase: builds a suffix tree.
2nd phase: repetitions are coded with guaranteed
gain; otherwise, two-bits-per-base encoding is
used.
=> It searches the whole sequence for exact matches.
Related Work…


Some compression schemes are based on approximate
string matching, but they are lossy.
The paper is not interested in lossy compression.
“Relatedness” between two DNA sequences: some
approaches define such a distance d(x,y) by
conditional compression or transformation distance. Neither is
symmetric, hence they cannot be used in general. Some
authors tried (d(x,y)+d(y,x))/2, but such a
definition is not well founded.
Related Work…

Some biologists use methods such as counting the
number of shared genes in two genomes or
comparing the ordering of genes.
Encoding Edit Operations:



Replace: (R, p, char) means replacing the
character at position p by character char.
Insert: (I, p, char) means inserting character
char at p.
Delete: (D, p) means deleting the character at p.
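The three operations above can be sketched as a small interpreter. This is only an illustration: the function name `apply_edits` is hypothetical, and it assumes 0-based positions and that each operation is applied to the string as it stands after the previous one.

```python
def apply_edits(source, ops):
    """Apply edit operations in order; positions index the current string (0-based, assumed)."""
    chars = list(source)
    for op in ops:
        kind, pos = op[0], op[1]
        if kind == "R":      # (R, p, char): replace the character at p
            chars[pos] = op[2]
        elif kind == "I":    # (I, p, char): insert char at p
            chars.insert(pos, op[2])
        elif kind == "D":    # (D, p): delete the character at p
            del chars[pos]
        else:
            raise ValueError(f"unknown edit operation: {kind!r}")
    return "".join(chars)


# Both edit sequences turn "gaccttca" into "gaccgtca".
print(apply_edits("gaccttca", [("R", 4, "g")]))            # gaccgtca
print(apply_edits("gaccttca", [("I", 4, "g"), ("D", 6)]))  # gaccgtca
```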
Encoding Edit Operations:
1) Two Bits Encoding Method
2) Exact Matching Method
3) Approximate Matching Method
4) Approximate Matching Method with edit
operation sequence
1) Two Bits Encoding Method
With a = 00, c = 01, g = 10, t = 11,
10 00 01 01 10 11 01 00 encodes “gaccgtca”.
It needs 16 bits.
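A minimal sketch of the two-bit encoding follows. The code table is read off the example bit string; the function names are hypothetical.

```python
# Code table inferred from the example: g=10, a=00, c=01, t=11.
CODE = {"a": "00", "c": "01", "g": "10", "t": "11"}


def two_bit_encode(seq):
    """Encode a DNA string as space-separated 2-bit groups."""
    return " ".join(CODE[base] for base in seq.lower())


def two_bit_decode(bits):
    """Invert two_bit_encode."""
    inverse = {code: base for base, code in CODE.items()}
    return "".join(inverse[group] for group in bits.split())


encoded = two_bit_encode("gaccgtca")
print(encoded)                           # 10 00 01 01 10 11 01 00
print(len(encoded.replace(" ", "")))     # 16
print(two_bit_decode(encoded))           # gaccgtca
```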
2) Exact Matching Method
We can use (repeat position, repeat length) pairs.
One bit indicates whether the next item is a pair or a single base.
The string “gaccgtca” can be encoded using the
string “gaccttca” as
{(0,4),g,(5,3)}, which is encoded as
0 000 100 1 10 0 101 011, in 17 bits.
3) Approximate Matching Method
There can be more than one representation for a specific
sequence in this case.
“gaccgtca” using string “gaccttca”:
{(0,8),(R,4,g)} or
0 000 111 1 00 100 10, in 15 bits.
Here R is encoded by 00, I by 01, and
D by 11; the flag bit 0/1 indicates whether the next item
is a pair or a triple.
4) Approximate Matching Method with Edit Operation Sequence
If we use the edit operation sequence (I,4,g),(D,6),
then the string “gaccgtca” can be encoded as
{(0,8),(I,4,g),(D,6)} or
0 000 111 1 01 100 10 1 11 110, in total 21 bits.
In these examples, the approximate matching method (3) needs the
fewest bits.
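The bit counts of the four examples can be checked with a short sketch. The field widths are assumptions read off the example bit strings (1 flag bit, 3 bits for a position, 3 bits for a length, 2 bits for a base, 2 bits for an operation code); the names are hypothetical.

```python
# Field widths in bits, inferred from the examples (assumed, not from the paper).
FLAG, POS, LEN, BASE, OP = 1, 3, 3, 2, 2


def item_bits(item):
    """Bits needed for one item: a literal base, an edit operation, or a pair."""
    if isinstance(item, str):          # literal base such as "g"
        return FLAG + BASE
    if item[0] in ("R", "I"):          # (R, p, char) or (I, p, char)
        return FLAG + OP + POS + BASE
    if item[0] == "D":                 # (D, p)
        return FLAG + OP + POS
    return FLAG + POS + LEN            # (repeat position, repeat length)


def total_bits(items):
    return sum(item_bits(item) for item in items)


print(2 * len("gaccgtca"))                              # 16, two-bit encoding
print(total_bits([(0, 4), "g", (5, 3)]))                # 17, exact matching
print(total_bits([(0, 8), ("R", 4, "g")]))              # 15, approximate matching
print(total_bits([(0, 8), ("I", 4, "g"), ("D", 6)]))    # 21, with edit operation sequence
```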
GenCompress: an algorithm based on
approximate matching



GenCompress is a one-pass algorithm.
For input w, let v be the part already processed
and u the remaining part, i.e. w = vu.
It finds an “optimal prefix” of u such that it approximately
matches some substring in v and this prefix of u
can be encoded economically.
Condition C, and the Compression Gain
Algorithm


Condition C = (k, b): the number of edit operations within any length k
in prefix s of u is not larger than a threshold value b.
Experiments suggest (k, b) = (12, 3).
G(s, t, λ) = max {0, 2|s| - |(|s|, i)| - w_λ * |λ(s, t)| - c}
- 2|s|: cost of s under the two-bit encoding
- t: substring appearing at position i in v
- |(|s|, i)|: encoding size of (|s|, i)
- w_λ: weight (cost) of encoding an edit operation
- |λ(s, t)|: number of edit operations in λ(s, t)
- c: overhead proportional to the size of control bits

Optimal prefix: the s for which G(s, t, λ) is maximized over all λ and t.
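Condition C and the gain function can be sketched as below. This is an illustration under stated assumptions: condition C is read as "every window of k consecutive characters contains at most b edit operations", and the values used for w_λ and c in the example are made up, since the slides do not give them. The function names are hypothetical.

```python
def satisfies_condition_c(edit_positions, k=12, b=3):
    """True if no window of k consecutive positions holds more than b edit operations."""
    positions = sorted(edit_positions)
    # b + 1 operations whose first and last positions are less than k apart violate C.
    for i in range(len(positions) - b):
        if positions[i + b] - positions[i] < k:
            return False
    return True


def compression_gain(s_len, pair_bits, n_edits, w_edit, overhead):
    """G = max{0, 2|s| - |(|s|, i)| - w * |lambda(s, t)| - c}, all quantities in bits."""
    return max(0, 2 * s_len - pair_bits - w_edit * n_edits - overhead)


print(satisfies_condition_c([4]))               # True: a single edit operation
print(satisfies_condition_c([0, 2, 5, 9]))      # False: 4 operations within 12 characters
print(satisfies_condition_c([0, 5, 10, 20]))    # True: never more than 3 in any window

# Hypothetical costs: 6-bit pair, 7 bits per edit operation, 2 control bits.
print(compression_gain(8, 6, 1, w_edit=7, overhead=2))   # 1
print(compression_gain(4, 6, 1, w_edit=7, overhead=2))   # 0, no gain
```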
Optimal Prefix?
Lemma: An optimal prefix of u with gain G > 0
always ends right before a mismatch.
Lemma: Let λ be an optimal edit operation
sequence from x to y. If a character of y is copied from a character of x in λ
when converting x to y, then λ restricted to the prefixes of x and y
that end at those characters is
also an optimal edit operation sequence among
all the edit operation sequences between those prefixes.
Symmetry?






Symmetry of information:
the distance from sequence A to B equals the distance from
sequence B to A.
Based on Kolmogorov complexity,
in the sense that what B saves when compressing A is (approximately) what
A saves when compressing B: K(A) - K(A|B) ≈ K(B) - K(B|A).
Experiments also show that GenCompress is
symmetric in this sense.
Also detects approximate palindromes in gene
sequences..
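The symmetry claim can be tried out with any compressor. The sketch below uses zlib from the Python standard library as a stand-in; it is not GenCompress and compresses DNA far less well. It assumes K(x) is approximated by the compressed size C(x), and K(x|y) by C(yx) - C(y). The sequences are synthetic and the function names are hypothetical.

```python
import random
import zlib


def compressed_size(data):
    """Compressed size in bytes, a rough stand-in for K(data)."""
    return len(zlib.compress(data.encode("ascii"), 9))


def shared_information(x, y):
    """Estimate K(x) - K(x|y), with C(yx) - C(y) standing in for K(x|y)."""
    return compressed_size(x) - (compressed_size(y + x) - compressed_size(y))


rng = random.Random(0)
x = "".join(rng.choice("acgt") for _ in range(2000))

# y: a copy of x with 20 randomly chosen positions re-drawn (a related sequence).
y_chars = list(x)
for pos in rng.sample(range(len(x)), 20):
    y_chars[pos] = rng.choice("acgt")
y = "".join(y_chars)

# z: an independent random sequence (unrelated to x).
z = "".join(rng.choice("acgt") for _ in range(2000))

print("related:  ", shared_information(x, y), shared_information(y, x))
print("unrelated:", shared_information(x, z), shared_information(z, x))
```

With a related pair the two estimates should be large and close to each other; with an unrelated pair both should be near zero.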
Relatedness



It could be used for searching sequences.
Unlike other metrics (such as edit distance),
the trees generated using this compression
method fit evolutionary trees.
Questions?