A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison
X. Chen, S. Kwong, M. Li, Genome Informatics (GIW'99), Tokyo, Japan, pp. 51-61, 1999

- Compressibility of DNA
- Relatedness between two DNA sequences

Can we compress DNA sequences?
- DNA sequences consist of only 4 nucleotide bases {A, G, C, T}, so 2 bits per base suffice.
- Regular, general-purpose compression algorithms do not work for DNA (they even expand the data). Why?

Related Work (in the spirit of the Ziv and Lempel data compression method)
- Biocompress and Biocompress-2 (the second one uses order-2 arithmetic coding).
  => They search the previously processed part of the sequence for repeats.
- Cfact: same idea as Biocompress-2, but without order-2 arithmetic coding.
  1st phase: builds a suffix tree.
  2nd phase: repetitions are coded with guaranteed gain; otherwise, two-bit-per-base encoding is used.
  => It searches the whole sequence for exact matches.
- Some compression schemes are based on approximate string matching and are lossy. The paper is not concerned with lossy compression.

“Relatedness” between two DNA sequences
- Some approaches to defining such a distance d(x,y): conditional compression and transformation distance. Neither is symmetric, hence they cannot be used in general.
- Some authors tried (d(x,y)+d(y,x))/2. Such a definition is not well founded.
- Some biologists use methods such as counting the number of genes shared by two genomes or comparing the ordering of genes.

Encoding Edit Operations
- Replace: (R, p, char) means replacing the character at position p by the character char.
- Insert: (I, p, char) means inserting the character char at position p.
- Delete: (D, p) means deleting the character at position p.

Encoding Edit Operations (continued)
1) Two Bits Encoding Method
2) Exact Matching Method
3) Approximate Matching Method
4) Approximate Matching Method with edit operation sequence

1) Two Bits Encoding Method
- 10 00 01 01 10 11 01 00 encodes “gaccgtca”. It needs 16 bits.
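The two-bit encoding above can be illustrated with a minimal Python sketch. The base-to-bits mapping (a=00, c=01, g=10, t=11) is inferred from the slide's example string and is an assumption; the function names `two_bit_encode` and `two_bit_decode` are hypothetical and do not come from the paper.

```python
# Mapping inferred from the slide's example (assumption): g=10, a=00, c=01, t=11.
BASE_TO_BITS = {"a": "00", "c": "01", "g": "10", "t": "11"}


def two_bit_encode(seq):
    """Encode a DNA string as space-separated two-bit codes, one per base."""
    return " ".join(BASE_TO_BITS[base] for base in seq.lower())


def two_bit_decode(bits):
    """Invert two_bit_encode: map each two-bit code back to its base."""
    bits_to_base = {code: base for base, code in BASE_TO_BITS.items()}
    return "".join(bits_to_base[code] for code in bits.split())


encoded = two_bit_encode("gaccgtca")
print(encoded)                        # 10 00 01 01 10 11 01 00
print(len(encoded.replace(" ", "")))  # 16
```

Any other method has to beat this baseline of 2 bits per base to be worth using.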
Running example for methods 2 to 4: encode “gaccgtca” using the string “gaccttca”.

2) Exact Matching Method
- We can use (repeat position, repeat length) pairs.
- One flag bit indicates whether the next item is a pair or a literal base.
- “gaccgtca” can be encoded as {(0,4),g,(5,3)}, i.e. 0 000 100 1 10 0 100 011, in 17 bits.

3) Approximate Matching Method
- There can be more than one representation of a specific sequence in this case.
- “gaccgtca” can be encoded as {(0,8),(R,4,g)}, i.e. 0 000 111 1 00 100 10 (15 bits).
- Here R is encoded by 00, I by 01, and D by 11; a 0/1 flag bit indicates whether the next item is a pair or a triple.

4) Approximate Matching Method with edit operation sequence
- If we use the edit operation sequence (I,4,g),(D,6), then “gaccgtca” can be encoded as {(0,8),(I,4,g),(D,6)}, i.e. 0 000 111 1 01 100 10 1 10 110, 21 bits in total.

From the examples, approximate matching (method 3) needs the fewest bits.

GenCompress: an algorithm based on approximate matching
- GenCompress is a one-pass algorithm.
- For input w, let v be the part already processed and u the remaining part, i.e. w = vu.
- It finds an “optimal prefix” of u that approximately matches some substring of v, so that this prefix of u can be encoded economically.

Condition C and the Compression Gain
- Condition C = (k, b): the number of edit operations within a length-k stretch of the prefix s of u is not larger than a threshold value b.
- Experiments show that (k, b) = (12, 3).
- G(s, t, λ) = max {0, 2|s| - |(|s|, i)| - w_λ * |λ(s, t)| - c}, where
  - t: a substring of v, appearing at position i in v
  - |(|s|, i)|: encoding size of the pair (|s|, i)
  - w_λ: weight (cost) of encoding an edit operation
  - |λ(s, t)|: number of edit operations in λ(s, t)
  - c: overhead proportional to the size of the control bits
- Optimal prefix: the prefix s for which G(s, t, λ) is maximized over all λ and t.

Optimal Prefix?
- Lemma: An optimal prefix s with G(s, t, λ) > 0 always ends right before a mismatch.
- Lemma: Let λ(x_{0:l}, y_{0:l}) be the optimal edit operation sequence from x_{0:l} to y_{0:l}. If y_j is copied from x_i in λ, then λ restricted to (x_{0:i}, y_{0:j}) is also optimal among all edit operation sequences from x_{0:i} to y_{0:j}.

Symmetry?
- Symmetry of information: distance from sequence A to B = distance from sequence B to A.
- Based on Kolmogorov complexity, in the sense that compressing A based on B gives the same result as compressing B based on A.
- Experiments also show that GenCompress is symmetric.
- It also detects approximate palindromes in gene sequences.

Relatedness
- Could be used for searching sequences.
- Unlike other metrics (such as edit distance), trees generated with this compression method fit evolutionary trees.

Questions?
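As a supplement to the compression gain G(s, t, λ) from the Condition C slide, here is a minimal Python sketch of the formula. The function name and parameter names are hypothetical. The bit widths in the usage example are read off the slides' approximate-matching example (6 bits for the pair, 7 bits per edit operation, 2 flag bits as overhead), and treating them as w_λ and c is an assumption, not something the paper states.

```python
def compression_gain(prefix_len, pair_bits, num_edits, edit_weight, overhead):
    """G = max(0, 2|s| - |(|s|, i)| - w_lambda * |lambda(s, t)| - c).

    prefix_len  -- |s|, length of the candidate prefix
    pair_bits   -- |(|s|, i)|, bits needed to encode the (length, position) pair
    num_edits   -- |lambda(s, t)|, number of edit operations
    edit_weight -- w_lambda, cost in bits charged per edit operation
    overhead    -- c, cost in bits of the control bits
    """
    two_bit_cost = 2 * prefix_len  # cost of plain two-bit encoding of s
    return max(0, two_bit_cost - pair_bits - edit_weight * num_edits - overhead)


# Assumed widths from the example {(0,8),(R,4,g)}: 3-bit position + 3-bit length,
# 2-bit opcode + 3-bit position + 2-bit base, and two 1-bit flags.
gain = compression_gain(prefix_len=8, pair_bits=6, num_edits=1,
                        edit_weight=7, overhead=2)
print(gain)  # 1
```

Under these assumed widths the gain of 1 bit matches the difference between the 16-bit two-bit encoding and the 15-bit approximate-matching encoding of “gaccgtca”. A prefix whose encoding would cost more than two bits per base gets gain 0, so it would be left to the plain two-bit method.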