Error Control Coding, Cancer Development and Statistical

The Information Processing
Mechanism of DNA and Efficient
DNA Storage
Olgica Milenkovic
University of Colorado, Boulder
Joint work with B. Vasic
Outline











PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
Information Theory of Genetics: an emerging discipline
Error-Correction and Proofreading in genetic processes
What type of codes “operate” at the level of bio-chemical
processes of the Central Dogma?
Spin Glasses, Kauffman’s “NK” Model, Regulatory Network of Gene
Interactions and Low-Density Parity-Check (LDPC) Codes
Cancer, dysfunctional proofreading and chaos theory
PART II: HOW DOES ONE STORE DNA? (DNA COMPRESSION)
Structure of DNA: Statistics and Modeling
DNA Compression
Genome Compression
New Distance Measures and One-Way Communication
PART III: NEW CODING PROBLEMS IN GENETICS
Information theory of genetics
2003: 50th Anniversary of discovery: DNA has a double-helix structure!
(Crick, Watson, Franklin, Wilkins 1953)
2003: Completion of the Human Genome Project (98% HDNA sequenced)
Every day, an average of 15 new sequences are added to Swiss-Prot + GenBank
Vast amount of genetic data just starting to be analyzed!
DNA is a CODE, but very little is known about its
• exact information content
• nature of redundancy
• statistical properties
• secondary structure
• influence on disease development and control
• underlying error-correcting mechanism
Information Theory of DNA
Helps in understanding the
EVOLUTION of DNA
FUNCTIONALITY of DNA
DISEASE DEVELOPMENT
IT community still not involved in this area!
Signal Processing Community is just getting involved:
Special Issue of Signal Processing Journal devoted to Genetics, 2003.
The League of
Extraordinary Gentlemen…
I
How is information stored in a
genetic sequence?
What are the atoms of information?
The DNA Polymer…
[Figure: the SUGAR-PHOSPHATE BACKBONE, an alternating chain PO4 - Sugar - PO4 - Sugar - PO4]
[Figure: Deoxyribose (Sugar) ring, with carbons 1’ to 5’ labeled]
The Bases…
[Figure: the DOUBLE HELIX]
Purine Bases: Adenine (A); Guanine (G)
Pyrimidine Bases: Thymine (T); Cytosine (C)
The Pairing Rule…
A and T paired through TWO
hydrogen bonds
G and C paired through THREE
hydrogen bonds
The Genetic Code…
First                     Second Letter                       Third
Letter    U           C           A            G              Letter
  U       UUU phe     UCU ser     UAU tyr      UGU cys          U
          UUC phe     UCC ser     UAC tyr      UGC cys          C
          UUA leu     UCA ser     UAA stop     UGA stop         A
          UUG leu     UCG ser     UAG stop     UGG trp          G
  C       CUU leu     CCU pro     CAU his      CGU arg          U
          CUC leu     CCC pro     CAC his      CGC arg          C
          CUA leu     CCA pro     CAA gln      CGA arg          A
          CUG leu     CCG pro     CAG gln      CGG arg          G
  A       AUU ile     ACU thr     AAU asn      AGU ser          U
          AUC ile     ACC thr     AAC asn      AGC ser          C
          AUA ile     ACA thr     AAA lys      AGA arg          A
          AUG met     ACG thr     AAG lys      AGG arg          G
  G       GUU val     GCU ala     GAU asp      GGU gly          U
          GUC val     GCC ala     GAC asp      GGC gly          C
          GUA val     GCA ala     GAA glu      GGA gly          A
          GUG val     GCG ala     GAG glu      GGG gly          G
Abbreviations
ala = alanine
arg = arginine
asn = asparagine
asp = aspartic acid
cys = cysteine
gln = glutamine
glu = glutamic acid
gly = glycine
his = histidine
ile = isoleucine
leu = leucine
lys = lysine
met = methionine
phe = phenylalanine
pro = proline
ser = serine
thr = threonine
trp = tryptophan
tyr = tyrosine
val = valine
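The genetic code can be treated as a lookup table from codons to amino acids. A minimal Python sketch, assuming the standard genetic code; the one-letter amino-acid codes (F for phe, M for met, * for stop, and so on) and the function name `translate` are illustrative choices, not part of the slides:

```python
from itertools import product

BASES = "UCAG"
# Amino acids for the 64 codons in the order UUU, UUC, UUA, UUG, UCU, ..., GGG
# (first letter varies slowest, third letter fastest); '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGUUUUAA"))
```

The table also makes the redundancy of the code easy to count: for example, six different codons map to leucine.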
Genes, Exons, Introns (Junk DNA)…

Genes: Sequences of base pairs coding for chains of amino acids
Consist of exon (coding) and intron (non-coding) regions
Length: anywhere from several tens to several million base pairs
EXAMPLE: Among the most complex identified genes is
DYSTROPHIN
(2 million bps, more than 60 exons, codes for 4000 amino acids)
Escherichia coli: around 4000 genes; Humans: 35000-40000 genes

Junk DNA: “Disrespectful” name for introns

Significant fraction of DNA
Shown (last year) to be “somewhat” responsible for RNA coding
(Far from being “junk”, but function still not well understood…)
The Central Dogma…
DNA (Replication) -> Transcription -> mRNA -> Translation -> Proteins
A Communication Theory Perspective:
DNA sequence -> [Genetic Channel] -> DNA sequence, mRNA, Proteins
What kind of errors are introduced by the Genetic Channel?
Processing in the Genetic Channel: DNA REPLICATION
DNA within Chromosomes (tight packing):
• DNA wrapped around HISTONES (proteins)
• HISTONES are organized in NUCLEOSOMES
• NUCLEOSOMES → CHROMATIN, folded into CHROMOSOMES
Untying the knots: Topoisomerases
Unwinding the helix: Helicases
Getting it all started: Primers
Doing the hard work: Polymerases
Sealing the segments: Ligases
Helping to keep two sides apart: SSB
Replication: more details
Rules:
Replication always proceeds in 5’ to 3’ direction;
Replication is semi-conservative;
Replication is a parallel process for eukaryotes;
Polymerases can stitch together any combination of bases (“Ps are a little bit sloppy”)
Facts:
Timing for replication:
E. coli: 40 min
Humans (parallel): < 2 hours
Can be prolonged for proofreading purposes
Errors…
Combination of substitution, deletion, insertion (replication fork), shift, reversal, etc. errors
(Complete exon or intron deleted, or simple base pair deletions)
1. Tautomeric shifts (transition/transversion): *T-G, *G-T, *C-A, *A-C
2. Recombination between non-identical molecules (“HETERODUPLEX mismatches”)
3. Spontaneous DEAMINATION (C to U, C to T, C-G to T-A), METHYLATION (CpG), rare
4. APURINIC/APYRIMIDINIC SITES (due to HYDROLYSIS)
5. CROSS-LINKS
6. STRAND-BREAKAGE, OXIDATIVE DAMAGE ERRORS
7. LOSS OF 5000-10000 PURINE and 200-500 PYRIMIDINE bases (20 hours) due to
radiation
Replication Errors: Polymerase misinsertion probability between 10^-3 and 10^-5
[Figure: four replication error mechanisms on a primer/template pair: Miscoding, Slippage, Miscoding-Realignment, Slippage-Dislocation]
Bio-chemical mechanism responsible for error
correction?
Proofreading (Maroni, Molecular and Genetic Analysis of Human Traits):
Replication polymerases error rate ≈ 10^-3; human DNA with ≈ 3 × 10^9 bps, total on the order of 10^6 errors
Example:
C to U conversion causes presence of deoxyuridine, detected by uracil-DNA GLYCOSYLASE
Glycosylase process acts like erasure channel
1. Proofreading based on semi-conservative nature of replication
2. Excision Repair Mechanisms: Arrays of Exonucleases
Show large degree of pre-correction binding activity – correction performed by EXCISION
“Jumping’’ occurs between different genes !!! (Lin, Lloyd, Roberts, Nucleases)
Reduce error levels by an additional several orders of magnitude
Mismatch-specific post-replication enzymes
Total number of errors per human DNA replication: on average JUST ONE
Replication and Repair have been optimized for balancing spontaneous
mutational load:
Permitting evolution without threatening fitness or survival
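The error budget quoted above is simple arithmetic. A rough Python sketch, assuming the raw polymerase error rate of about 10^-3 and a genome of about 3 × 10^9 base pairs from the slide; the function name `expected_errors` is a hypothetical helper:

```python
import math

def expected_errors(error_rate, genome_length):
    """Expected number of raw misinsertions in one pass over the genome."""
    return error_rate * genome_length

raw = expected_errors(1e-3, 3e9)
# Orders of magnitude that proofreading and repair must remove
# to reach roughly one error per replication.
orders_needed = math.log10(raw / 1.0)
print(raw, round(orders_needed, 1))
```

Under these assumptions the repair machinery has to suppress errors by more than six orders of magnitude.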
Characteristics of DNA ECC:
Error-correction performed on different levels
Error correction performed in very short time
Extremely large number of very diverse errors corrected
Error correction tied to global structure of DNA
(not to consecutive base pairs)
Error correction also depends on DNA topology
Identify ECCs of DNA…
Error-Correcting Codes in DNA: Forsdyke (1981), Wolny (1983), Eigen (1993),
Liebovitch et al. (1996), Battail (1997), Rosen and Moore (2003), Mac Dónaill
(2003)
Theories:
• Non-coding regions are in-series error detecting sequences!
• Ordering of coding/non-coding regions responsible for error-correction!
• Complementary base pairing corresponds to error-detecting code!
Donor/Acceptor: hydrogen atom/lone electron pair

1 represents donor, 0 acceptor
Additionally, add 0 or 1 for purine and pyrimidine
Code:
A 1010
G 0110
T 0101
C 1001
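The error-detecting reading of this four-bit labelling can be checked directly. A small Python sketch using only the four codewords listed above; the helper names (`parity`, `complement`, `hamming`) are illustrative:

```python
# Four-bit labels of the bases, as listed on the slide.
CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}
PAIRS = [("A", "T"), ("G", "C")]

def parity(word):
    """Number of ones modulo two."""
    return word.count("1") % 2

def complement(word):
    """Bitwise complement of a binary string."""
    return "".join("1" if bit == "0" else "0" for bit in word)

def hamming(u, v):
    """Hamming distance between two equal-length binary strings."""
    return sum(a != b for a, b in zip(u, v))

all_even = all(parity(w) == 0 for w in CODE.values())
pairs_complementary = all(complement(CODE[x]) == CODE[y] for x, y in PAIRS)
min_distance = min(hamming(CODE[x], CODE[y]) for x in CODE for y in CODE if x < y)
print(all_even, pairs_complementary, min_distance)
```

All four labels have even parity and pairwise distance at least two, so any single bit flip is detectable, and Watson-Crick partners are exact bitwise complements.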
BEST ERROR CORRECTING MECHANISM: Deinococcus radiodurans
• Microbe with extreme radiation resistance
• Enabled to survive radiation doses thousands of times higher than would kill
most organisms, including humans.
• Surpasses the cockroach by orders of magnitude!
Why? Because of its remarkable DNA-repair mechanism!!!
D. radiodurans flawlessly regenerates its radiation-shattered genome in about
24 hours.
“Conan the Bacterium”
(to conquer the Red Planet !)
Something seemingly unrelated…
Spin Glasses, the Ising Model, Hopfield Networks or “Boltzmann Machines”:
State x of a spin glass with N spins that may take values in {-1,+1}
Energy of the state x: E, external field h
The Hamiltonian
1

E ( x;{J }i , j ,{h}i )     J m,n xm xn   hn xn 
n
 2 m ,n

Hamiltonian for Ising model
1

E ( x; J , h)    J  xm xn  h  xn 
n
 2 m ,n

Example: [Figure: interacting + and - spins illustrating “frustration”]
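Frustration can be seen in the smallest possible case. The Python sketch below evaluates the spin-glass energy for three spins; the triangle with all couplings equal to -1 and zero external field is an assumed toy example, not the configuration drawn on the slide:

```python
from itertools import product

def energy(x, J, h):
    """E(x) = -[ (1/2) * sum_{m,n} J[m][n] x_m x_n + sum_n h[n] x_n ]."""
    n = len(x)
    pair = sum(J[m][k] * x[m] * x[k] for m in range(n) for k in range(n))
    field = sum(h[k] * x[k] for k in range(n))
    return -(0.5 * pair + field)

# Assumed toy system: three spins on a triangle, every coupling antiferromagnetic.
J = [[0, -1, -1],
     [-1, 0, -1],
     [-1, -1, 0]]
h = [0, 0, 0]

energies = {x: energy(x, J, h) for x in product((-1, 1), repeat=3)}
ground = min(energies.values())
ground_states = [x for x, e in energies.items() if e == ground]
print(ground, len(ground_states))
```

No configuration satisfies all three bonds at once (that would give energy -3), so the minimum is -1 and it is shared by six states.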
Water exists as a gas, liquid or solid, but
all microscopic elements are H2O molecules
This is due to intermolecular interactions
depending on temperature, pressure etc.
Something seemingly unrelated…
Codes on graphs: the most powerful class of error correcting codes in information
theory, including Turbo, Low-Density Parity-Check (LDPC), Repeat-Accumulate
(RA) Codes
\[ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix} \]
Most important consequence of graphical
description: efficient iterative decoding
Variable nodes communicate to check nodes their reliability
Check nodes decide which variables are unreliable and “suppress” their
inputs
Number of edges in graph = density of H
Sparse = small complexity
[Figure: bipartite code graph with Variables on one side and Checks on the other]
Detrimental for convergence of decoder: presence of short cycle in code graph
Applications of LDPC codes: for cryptography, compression, distributed source
coding for sensor networks, error control coding in optical, wireless comm and
magnetic and optical storage…
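The parity-check matrix H shown on this slide can be explored by brute force. A short Python sketch; the function name `syndrome` and the exhaustive enumeration are illustrative choices that only work for very short codes:

```python
from itertools import product

# Parity-check matrix from the slide (rows = checks, columns = variables).
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

def syndrome(H, word):
    """Modulo-two sum of the bits participating in each check."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

num_edges = sum(map(sum, H))          # edges of the code graph = ones in H
codewords = [w for w in product((0, 1), repeat=7) if syndrome(H, w) == (0, 0, 0)]
min_weight = min(sum(w) for w in codewords if any(w))

received = (1, 0, 0, 0, 0, 0, 0)      # all-zero codeword with the first bit flipped
print(num_edges, len(codewords), min_weight, syndrome(H, received))
```

The syndrome of a single error equals the corresponding column of H, which is how the checks point at the unreliable variable.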
Gallager’s Decoding Algorithm A
Works for the Binary Symmetric Channel (BSC):
Each variable sends its channel reliability unless all incoming messages from
checks say “change”
Each check sends estimate of the bit based on modulo two sum of other bits
participating in the check
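The two message rules above can be sketched in a few lines of Python. This is a hedged illustration, not the author's implementation: the test matrix is an assumed example (the incidence matrix of the Fano plane, chosen because it has column weight three and no cycles of length four), and the final decision rule (flip a bit only if all incoming checks disagree with it) is one common choice.

```python
def gallager_a(H, received, iterations=10):
    """Hard-decision Gallager A decoding on the graph of parity-check matrix H."""
    m, n = len(H), len(H[0])
    edges = [(c, v) for c in range(m) for v in range(n) if H[c][v]]
    checks_of = {v: [c for c in range(m) if H[c][v]] for v in range(n)}
    vars_of = {c: [v for v in range(n) if H[c][v]] for c in range(m)}
    v2c = {(c, v): received[v] for c, v in edges}   # start with channel values
    c2v = {}
    for _ in range(iterations):
        # Check -> variable: modulo-two sum of the other bits in the check.
        for c, v in edges:
            c2v[(c, v)] = sum(v2c[(c, u)] for u in vars_of[c] if u != v) % 2
        # Variable -> check: channel value, unless all other checks say "change".
        for c, v in edges:
            others = [c2v[(d, v)] for d in checks_of[v] if d != c]
            flip = bool(others) and all(msg != received[v] for msg in others)
            v2c[(c, v)] = 1 - received[v] if flip else received[v]
    decoded = []
    for v in range(n):
        incoming = [c2v[(c, v)] for c in checks_of[v]]
        flip = bool(incoming) and all(msg != received[v] for msg in incoming)
        decoded.append(1 - received[v] if flip else received[v])
    return decoded

# Assumed test code: lines of the Fano plane as checks (not from the slides).
LINES = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
H_FANO = [[1 if v in line else 0 for v in range(7)] for line in LINES]

print(gallager_a(H_FANO, [0, 0, 1, 0, 0, 0, 0]))
```

On short codes with low-degree variables or four-cycles the same rules can converge to a wrong codeword, which is the sensitivity to short cycles noted earlier.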
Alternative view: Variables=Atoms; Binary Values=Spins;
Variables “align” or “misalign” according to interaction patterns
LDPC equivalent to diluted spin glasses
\[ H = -\sum J_{i_1 i_2 \ldots i_r} S_{i_1} S_{i_2} \cdots S_{i_r} \; \left( -\sum_i h_i S_i \right) \]
Ground state search for above Hamiltonian = maximum a posteriori (MAP) decoding
of codeword
Average magnetization at a site = MAP decision for individual variable
Something seemingly unrelated…
The regulatory Network of
Gene Interactions (RNGI)
Kauffman (1960s): “NK” Evolution
through Changing Interactions
between Genes
Life exists at the Edge of Chaos!
BASED ON SPIN GLASSES!
RANDOM BOOLEAN FUNCTION MODEL:
Evolution carried by genes, not base
pairs, and the way genes interact!
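[Figure: three-gene network with nodes G1, G2, G3]
State transition table:
T (G1 G2 G3)    T+1 (G1 G2 G3)
000             001
001             001
010             101
011             000
100             101
101             010
110             001
111             011
Boolean function tables (inputs, output):
G1G2  G1      G1G3  G2      G1G2  G3
00    0       00    0       00    1
01    1       01    0       01    1
10    0       10    0       10    0
11    0       11    1       11    1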
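The attractors of this three-gene example follow directly from the state transition table (time T to time T+1). A minimal Python sketch; the names `NEXT`, `attractor_of`, and `basins` are illustrative:

```python
# State transitions (G1 G2 G3 at time T -> time T+1), as listed on the slide.
NEXT = {"000": "001", "001": "001", "010": "101", "011": "000",
        "100": "101", "101": "010", "110": "001", "111": "011"}

def attractor_of(state):
    """Follow the dynamics until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = NEXT[state]
    cycle = seen[seen.index(state):]
    k = cycle.index(min(cycle))          # rotate to a canonical starting point
    return tuple(cycle[k:] + cycle[:k])

attractors = sorted({attractor_of(s) for s in NEXT})
basins = {a: sorted(s for s in NEXT if attractor_of(s) == a) for a in attractors}
print(attractors)
print(basins)
```

For this table the network has one point attractor and one periodic attractor of length two; every other state is transient.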
Chaos, Attractors, Connectivity
Boolean networks: dynamical systems
Characterized by network topology+
choices of Boolean node functions
[Figure: state transition diagram over the states 000, 001, 010, 011, 100, 101, 110, 111]
Attractors: point and periodic
Number and period lengths
MOST IMPORTANT topological factor:
CONNECTIVITY
KEY: Sparse connectivity allows enough variability for evolutionary processes and
produces self-organizing structures, but doesn’t allow the system
to “get trapped in” chaotic behavior
MOST IMPORTANT Boolean function factors:
BIAS (number of 1 outputs)
CANALIZATION (depends on number of inputs determining output)
[Figure: Kimatograph of the network]
The NK model and RNGI
N = number of genes; K = number of genes co-interacting with one given gene
K=2 critical value (mainly frozen states with islands of changing interaction)
Interaction between genes in regulatory network: very limited in scope
K ranges anywhere from 2-3 to 10-15: on careful inspection, logarithmic in
N, i.e. the number of genes
Between 2 and 3 for Escherichia coli (around a thousand genes)
Between 4 and 8 for higher metazoa (several thousand genes)
Can explain the process of cell differentiation: the genetic material of each cell is
the same, yet cells are functionally and morphologically very different
Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of the RNGI
Counting attractors for networks with N=40000 genes, K=2 gives ≈ 260
cell types (correct number 258).
KEY IDEA: LDPC Code with Given Decoding
Algorithm is a BOOLEAN NETWORK, SPIN
GLASS,…
Example: LDPC Code under Gallager’s A Algorithm
[Figure: LDPC Code: Variables and Checks]
[Figure: LDPC Code: The Control Graph, with nodes G1, G2, G3, G4]
In the Control Graph, edge (i,j) exists if the i-th bit controls the j-th bit (i.e. if i and j are
at distance exactly two)
Boolean function determined by decoding algorithm: for Gallager’s A algorithm, it
takes the form of a truncated/periodically repeated MORSE-THUE sequence
Morse-Thue:
n:        0 1 2  3  4   5   6   7 …
binary:   0 1 10 11 100 101 110 111 …
sequence: 0 1 1  0  1   0   0   1 …
Properties:
Self-Similar (fractal)
Results in unbiased Boolean functions
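The Morse-Thue sequence is the parity of the number of ones in the binary expansion of n, which is why it is both self-similar and unbiased. A short Python sketch; the function name `thue_morse` is an illustrative choice:

```python
def thue_morse(n):
    """n-th term: parity of the number of 1 bits in the binary expansion of n."""
    return bin(n).count("1") % 2

first8 = [thue_morse(n) for n in range(8)]
ones_in_first16 = sum(thue_morse(n) for n in range(16))
print(first8, ones_in_first16)
```

Self-similarity shows up as t(2n) = t(n) and t(2n+1) = 1 - t(n), and exactly half of the first 2^k terms are ones.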
Use Boolean Network Analysis for LDPC Codes



No cycles of length four, code regular: uniform choice for Boolean function
Cycles of length four: Boolean functions vary, many more attractors
In no case are the functions canalizing
\[ F_i = (G_i \oplus 1)\, N_{i_1} N_{i_2} \cdots N_{i_s} \oplus G_i \left( 1 + (1 + N_{i_1}) \cdots (1 + N_{i_s}) \right) \bmod 2 \]
where N_{i_1}, N_{i_2}, ..., N_{i_s} are the
modulo two sums of variable nodes connected to controls C_{i_1}, C_{i_2}, ..., C_{i_s}
Can use mean-field theorems to see when initial perturbations in the codewords
disappear in the limit: use the Boolean derivative, sensitivity analysis, iterative
Jacobian and Lyapunov exponent (as in Shmulevich et al.):
\[ f^{(j)} = \frac{\partial f}{\partial x_j} = f(x^{(j,0)}) \oplus f(x^{(j,1)}), \quad x^{(j,l)} = (x_1, \ldots, x_{j-1}, l, x_{j+1}, \ldots, x_N) \]
Jacobian F is an N \times N matrix with f_i^{(j)} in entry (i,j).
\[ d(t+1) = F(t) \otimes d(t) = H(F(t) \cdot d(t)), \quad H((g_1, \ldots, g_N)) = (H(g_1), \ldots, H(g_N)), \]
\[ H(g_i) = \begin{cases} 0, & g_i = 0 \\ 1, & g_i > 0 \end{cases} \]
Iterated Jacobian:
\[ IJ(t) = F(0) \otimes F(1) \otimes \cdots \otimes F(t) \]
Lyapunov exponent:
\[ \lambda(T) = \frac{1}{T} \sum_t \log \gamma(t), \quad \gamma(t) = |IJ(t+1)| / |IJ(t)|, \quad |M| = \frac{1}{N^2} \sum_{i,j}^{N} M_{i,j} \]
The influence of variable x_i on the Boolean function f(x_1, x_2, ..., x_N) is defined as
the expectation of the partial derivative, with respect to the distribution of the
variables
\[ I_i(f) = E[\partial f / \partial x_i] = P\{\partial f / \partial x_i = 1\} \]
Influence carries important information about frozen states, error susceptibility
etc.
\[ s(t) = c + \sum_{i}^{K} s(t)^{K-i} (1 - s(t))^{i} \]
iterative change of size of “stable core”
Control of the chaotic phase in a Boolean network by means of periodic
pulses (with period T) that “freeze” a fraction of nodes
LDPC Codes and Gallager’s A Decoding Algorithm
\[ f(z_1, z_2, z_3) = (1 \oplus z_1) z_2 z_3 \oplus z_1 (1 \oplus z_2)(1 \oplus z_3) \]
\[ \partial f / \partial z_1 = 1 \oplus z_2 \oplus z_3, \quad \partial f / \partial z_2 = z_1 \oplus z_3, \quad \partial f / \partial z_3 = z_1 \oplus z_2 \]
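The partial derivatives of this three-variable function can be verified by brute force, and the same routine gives the influence of each variable. A Python sketch, assuming uniformly distributed inputs for the influence computation; the helper names `derivative` and `influence` are illustrative:

```python
from itertools import product

def f(z1, z2, z3):
    """The Boolean function from the slide, with XOR as modulo-two addition."""
    return ((1 ^ z1) & z2 & z3) ^ (z1 & (1 ^ z2) & (1 ^ z3))

def derivative(func, j, z):
    """Boolean derivative: func with z_j = 0 XOR func with z_j = 1."""
    z0, z1 = list(z), list(z)
    z0[j], z1[j] = 0, 1
    return func(*z0) ^ func(*z1)

def influence(func, j, n=3):
    """Probability that the derivative equals 1 under uniform inputs (assumed)."""
    states = list(product((0, 1), repeat=n))
    return sum(derivative(func, j, z) for z in states) / len(states)

for z in product((0, 1), repeat=3):
    assert derivative(f, 0, z) == 1 ^ z[1] ^ z[2]
    assert derivative(f, 1, z) == z[0] ^ z[2]
    assert derivative(f, 2, z) == z[0] ^ z[1]

print([influence(f, j) for j in range(3)])
```

Each variable has influence 1/2 under this assumption, so a flipped input changes the output half of the time.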
[Figure: truth tables of the Boolean functions F1(A), F2(A), F3(A) for variable A with neighbors B, C, D under checks C1 = (B), C2 = (C), C3 = (D)]
New decoding methods for LDPC and other
Block Codes…
Work in Progress:
• Decoders that don’t operate on the frozen core
• Decoders that periodically freeze some variables to avoid chaotic behavior
• Iterative decoders that work for asymmetric channels and channels with
insertion/deletion errors
Bold Conjecture:
The ECC of DNA Replication operates on multiple levels
Carrier of information is gene, not base pair
The Global level involves Genes;
Local levels may involve exons or base pairs in general;
The Global Code is an LDPC Code!
Wigner observed that the same mathematical concepts turn up
in entirely unexpected connections in the whole of science…
(no explanation as of yet)
LDPC related to statistical physics (spin glasses) to neural networks to
self-organizing systems to …
R. Sole and B. Goodwin, Signs of Life: How Complexity Pervades Biology
The Corresponding LDPC Code
      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 1    1  1  1  0  0  0  0  0  0  0  0  0  0  0  0
 2    1  0  0  0  0  0  0  0  0  0  0  0  1  0  0
 3    0  0  1  1  1  1  0  0  0  0  0  0  0  0  0
 4    0  0  0  0  1  0  0  0  1  0  0  0  0  0  0
 5    0  0  0  0  0  1  1  1  1  0  0  0  0  0  0
 6    0  0  0  0  0  0  1  0  0  0  0  0  0  0  1
 7    0  0  0  1  0  0  0  0  1  0  0  0  0  0  0
 8    0  0  0  0  0  0  0  0  1  1  0  0  0  0  0
 9    0  0  0  0  0  0  0  0  0  1  0  0  1  0  1
10    0  0  0  0  0  0  0  1  0  0  1  1  1  0  0
11    0  0  0  0  0  0  0  0  0  0  0  0  1  1  1
12    0  0  0  0  0  0  0  0  0  0  1  1  0  0  1
Table 1: Example of 15-node regulatory network in
terms of gene controls
Gene   Controls        Control (after addition)
1      2,3,13          2,3,13
2      1,3             1,3
3      4,5,6           4,5,6,1,2
4      5,6             5,6,3
5      4,9,6           4,9,6
6      5,9,3,4,7,8     3,4,5,7,8,9
7      15,8,6          15,8,6
8      7,9             7,9
9      4,5,6           4,5,6
10     9,13,15         9,13,15
11     8,12,13         8,12,13
12     11,13,15        8,11,13,15
13     8,14,15         8,14,15,11,12
14     X               X
15     11,12           11,12,13,14,10,7
15-gene interaction example by Hashimoto
(Shmulevich, Anderson Cancer Center)
Need q-ary LDPC code corresponding to
different levels of interaction
Cancer: genetic disorder of somatic cells
Human cancer: INDUCED and SPONTANEOUS
• Accumulation of mutant (erroneous) genes that control cell cycle, maintain
genomic stability, and mediate apoptosis
• Causes of mutation: depurination and depyrimidination of DNA; proofreading
and mismatch errors during DNA replication
• Deamination of 5-methylcytosine to produce C to T base pair substitutions; and
damage to DNA and its replication imposed by products of metabolism (notably
oxidative damage caused by oxygen free radicals)
• Defective DNA excision-repair; low levels of antioxidants, antioxidant enzymes,
and nucleophiles that trap DNA-reactive electrophiles; and enzymes that
conjugate nucleophiles with DNA-damaging electrophiles
Cancer Research
To summarize: Various forms of cancer tightly linked to
malfunctioning of proofreading (ECC) mechanism
Cancer cells: correspond to a special type of attractor of the RNGI
(A cancer cell is “just another configuration” of RNGI)
(Shmulevich et al., Anderson Cancer Research Center)
This attractor has genes interacting in a way that results in
uncontrolled cell division
Key observation: C-Change in RNGI results in further weakening of
the proofreading system, and vice versa
Example 1: Cancer cells cheat the proofreading mechanism
regulating reduction in length of telomeres
Aging: during each cell division,
telomeres get shorter and shorter…
When they become too short, errors in
replication happen, leading to cancer
(a time bomb in our body)
Cancer cells “cheat” proofreading
mechanism and allow telomeres to
maintain constant length
Finding the error-control mechanism: classifying diseases accurately, curing
diseases (including cancer) by gene therapy, making telomere lengths constant
over long time…
Example 2: Breast cancer gene BRCA1 (a tumor suppressor) tightly linked to error-control of DNA and cell division regulation
How to obtain results practically? DNA Microarrays!
Figure taken from Shmulevich et al.
II
How can one efficiently store
DNA sequences?
DNA Storage: Compression








GenBank/Swiss-Prot: storage of large number of DNA and protein
sequences (17471 million sequences in GenBank, 2002)
Every day, an average of 15 new sequences added to database
DNA compression absolutely necessary to maintain banks
Fractal DNA structure to be exploited
Possible use of Tsallis entropy
Need novel compression algorithms
DNA sequences of related species differ in very small percent of base
pairs: need cross-reference compression
Need meaningful definition of DNA distance
-- major paradigm shift from base-pair
distance to chromosomal distance --
Statistical properties of DNA sequences
Bases within the human mitochondrion (length approximately 17000) appear
with the following frequencies:
             A      T      G      C
Frequency    0.31   0.13   0.25   0.31
while within different regions of human fetal globin gene:
             A      T      G      C
Introns      0.27   0.29   0.27   0.17
Exons        0.24   0.22   0.28   0.25
Parts of genetic sequences can be modeled
by Markov chains of given order
and transition probabilities; order 2-7
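Base frequencies translate directly into a zeroth-order bound on compressibility. A minimal Python sketch computing the Shannon entropy, in bits per base, of the mitochondrial frequencies quoted above; the function name `shannon_entropy` is illustrative, and the result does not depend on which frequency belongs to which base:

```python
import math

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

mito = [0.31, 0.13, 0.25, 0.31]       # frequencies from the slide
h_mito = shannon_entropy(mito)
h_uniform = shannon_entropy([0.25] * 4)
print(round(h_mito, 3), h_uniform)
```

The value is only slightly below the 2 bits per base of a uniform source, which is why a memoryless model alone gains little and higher-order structure (Markov dependence, repeats) has to be exploited.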
Regions of uniform distribution: isochores; can stretch in length up to hundreds of kbps
Repetitive patterns: tandem repeats (TR), random repeats (RR), short interspersed
repeat sequences (SINEs, 9% of DNA), long interspersed repeat sequences (LINEs).
Base pairs like CG have very small probability: the most notorious triplet repeats, related to
Huntington’s disease and Fragile-X mental retardation, consist of these very
unlikely “CG” pairs: (CGG)m, (CCG)m, m = number of repetitions;
Junk DNA seems to have long-range (fractal) characteristics.
A fractal pattern arises from the so-called DNA walk: a graphical
representation of the DNA sequence in which one moves up for C or T and
down for A or G.
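The one-dimensional DNA walk is a running sum over the sequence. A small Python sketch of the rule just stated (up for C or T, down for A or G); the names `STEP` and `dna_walk` are illustrative:

```python
from itertools import accumulate

# +1 for pyrimidines (C, T), -1 for purines (A, G), as described on the slide.
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}

def dna_walk(seq):
    """Heights of the DNA walk after each base of the sequence."""
    return list(accumulate(STEP[base] for base in seq.upper()))

print(dna_walk("CTAG"))
```

Plotting the returned heights against position gives the walk whose long-range fluctuations are analysed for fractal behavior.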
Can have two- or three-dimensional random walks: further differentiation A,G,C,T
[Figure: two-dimensional DNA walk with a separate direction for each of C, A, T, G]
Fractal dimension of the DNA
molecule:
0.85 for higher species, 1 for lower
Use lingual analysis of human
languages for exploring DNA
"language" (Zipf method)
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker http://athena.bioc.uvic.ca/pbr/walk/
DNA and Cantor Sets
Provata and Almirantis, 2003: Fractal Cantor pattern in DNA
Exons - filled regions
Introns - empty regions
Random, fractal, Cantor-like set
Implication: atom (carrier of information) exon/intron pairs
History-based random walk and DNA description in terms of urn models
Only introns in higher species have higher complexity than in lower species
Both coding and non-coding regions exhibit long range correlation, with spectral
density of introns 1/f^β
Known algorithms
GenCompress (Chen, ’97)
Biocompress (Grumbach/Tahi, ’94)
Fact (Rivals, ’00)
GenomeSequenceCompress (Sato et al., ’00)
Use characteristics of DNA like repeats,
reverse complements…
Compression rate is about 1.74 bits per
base (78% in compression ratio)
Two classes: statistical and grammar based compression algorithms
Huffman, Lempel-Ziv, Arithmetic Coding, Burrows-Wheeler,
Kieffer’s Grammar Based Schemes
(with DNA specific modifications)
No known algorithm specially suited for fractal nature of DNA, although 90% fractal!
FILE: Human Growth Hormone (HUMGHCSA)   2.00
METHOD           COMPRESSION RATE (ACHIEVABLE), bits per base
GZIP             2.065
ARITHM.          2.052
VPS2A            1.607
UNIX COMPRESS    2.19
BIOCOMPRESS      1.31
BWT              1.608
GTAC             1.1
Different Entropy Measures:

Shannon Entropy:
\[ H_S(p) = -\sum_i p_i \log p_i \]
Renyi Entropy:
\[ H_R(p) = \frac{1}{1-q} \log \sum_i p_i^q \]
Tsallis Entropy:
\[ H_T(p) = \left( \sum_i p_i^q - 1 \right) / (1 - q) \]
H_S(p) is derived from H_T(p) in the limit q → 1, using
\[ \log z = \lim_{n \to 0} \frac{z^n - 1}{n} \]
TE non-additive in the way that for two independent PS A, B
\[ H_T(A \times B) = H_T(A) + H_T(B) + (1 - q) H_T(A) H_T(B) \]

Hausdorff Dimension:
\[ \lim_{r \to 0} \frac{\log N}{\log (1/r)} \]
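The relations between these entropy measures can be checked numerically. A Python sketch using natural logarithms; the two test distributions and the value q = 2 are arbitrary assumptions chosen for illustration:

```python
import math

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi(p, q):
    return math.log(sum(x ** q for x in p)) / (1 - q)

def tsallis(p, q):
    return (sum(x ** q for x in p) - 1) / (1 - q)

a = [0.5, 0.5]
b = [0.2, 0.3, 0.5]
q = 2.0
joint = [x * y for x in a for y in b]   # independent systems: product distribution

lhs = tsallis(joint, q)
rhs = tsallis(a, q) + tsallis(b, q) + (1 - q) * tsallis(a, q) * tsallis(b, q)
print(lhs, rhs)
print(shannon(b), tsallis(b, 1.0001), renyi(b, 1.0001))
```

The first line confirms the non-additivity rule for independent systems; the second shows both generalized entropies approaching the Shannon entropy as q tends to 1.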
Approach: Use “Fractal Grammars”
Inference of context-free grammars from fractal data sets
Syntactic generation of fractals
Theory of formal languages can be used to state the problem of "syntactic fractal
pattern recognition"
Explore Connections with Wavelets
(ideas by Jacques Blanc-Talon)
Example: Heighway dragon and
Koch curve
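\[ G = \{\{a, b, c, d, e, f, g, h\}, \{\varphi_1, \varphi_2\}, \{a\}\} \]
\[ \varphi_1(a) = ac, \; \varphi_1(c) = ec, \; \varphi_1(e) = eg, \; \varphi_1(g) = ag, \]
\[ \varphi_1(x) = x, \; x \in \{b, d, f, h\} \]
\[ \varphi_2(a) = abha, \; \varphi_2(b) = bcab, \ldots, \varphi_2(h) = hagh \]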
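Syntactic generation of a fractal amounts to applying a rewriting rule to every symbol in parallel. A Python sketch using only the first rule of the grammar above, since the second is given only in part; the names `RULE1` and `rewrite` are illustrative:

```python
# First rewriting rule of the grammar; symbols b, d, f, h map to themselves.
RULE1 = {"a": "ac", "c": "ec", "e": "eg", "g": "ag"}

def rewrite(word, rule, steps):
    """Apply the rule to every symbol of the word in parallel, `steps` times."""
    for _ in range(steps):
        word = "".join(rule.get(symbol, symbol) for symbol in word)
    return word

for n in range(4):
    print(n, rewrite("a", RULE1, n))
```

Starting from the axiom a, the word doubles in length at every step; interpreting each symbol as a drawing direction would turn such strings into a curve.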
Barthel, Brandau, Hermesmeier, Heising: Fractal Prediction, 1997
Zerotree wavelet coding using fractal prediction
How does one compress sets of related DNA Sequences?
Distributed Source Coding Problem: Peculiar Correlation Patterns
Could explore Wavelet Based Compression
Distributed Source Coding with LDPC Codes…
Genomic Distance and One-Way Communication
Major paradigm shift in genetic distance measure:
From base-pair distance (involving deletion, insertion and substitution; Sankoff and
Kruskal, Time Warps, String Edits, and Macromolecules) to Chromosomal
Distance based on global arrangements of genes
Inversions are primary mechanism of genome rearrangement!
REVERSAL DISTANCE
The smallest number of inversions necessary to transform one genome into
another
Finding the minimum number of reversals needed to “sort” a permutation
Permutations are signed, indicating direction of transcription
Example: (+1 +3 +2) → (+1 -2 -3) → (+1 +2 -3) → (+1 +2 +3)
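For very small genomes the reversal distance can be found by exhaustive search. A Python sketch using breadth-first search over signed permutations; this brute-force approach is an illustrative assumption that only scales to a handful of genes, not an efficient sorting-by-reversals algorithm:

```python
from collections import deque

def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip the signs in it."""
    return perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]

def reversal_distance(start, target):
    """Smallest number of reversals turning `start` into `target` (BFS)."""
    n = len(start)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        if perm == target:
            return dist[perm]
        for i in range(n):
            for j in range(i, n):
                nxt = reversal(perm, i, j)
                if nxt not in dist:
                    dist[nxt] = dist[perm] + 1
                    queue.append(nxt)
    return None

print(reversal((1, 3, 2), 1, 2))
print(reversal_distance((1, 3, 2), (1, 2, 3)))
```

The search confirms that the three inversions in the example are optimal: no shorter sequence of reversals sorts (+1 +3 +2).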
How does one perform one-way communication (SENDING INFORMATION TO
A RECEIVER WHO POSSESSES CORRELATED INFORMATION) under the
reversal distance measure?
The other way around:
DNA compression methods increase network efficiency by up to 10 times
Peribit's SR-50 compressor


Uses molecular sequence reduction (MSR) algorithms similar to those used
to match patterns in the study of DNA.
The algorithms identify and eliminate repetitions previously undetected in
network traffic in wide area networks (WANs) to give compression ratios of
between 1.2:1 for voice and video and 5:1 for SQL traffic.
III
Additional Coding Problems in Genetics
DNA Computing
Codes with Constant GC Content and invariant under
Watson-Crick Inversion
Microarray Error Control Coding
Using design theory to reduce error rate of DNA array data
Use novel clustering algorithms for DNA Array Data
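The two constraints named for DNA computing codes can be stated as simple checks. A Python sketch, assuming that "invariant under Watson-Crick inversion" means the code is closed under reverse complementation; that reading, the example words, and all function names are assumptions for illustration:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Watson-Crick inversion: complement every base and reverse the word."""
    return word.translate(COMPLEMENT)[::-1]

def gc_content(word):
    """Number of G and C bases in the word."""
    return sum(base in "GC" for base in word)

def has_constant_gc(code):
    return len({gc_content(w) for w in code}) == 1

def is_closed_under_inversion(code):
    return all(reverse_complement(w) in code for w in code)

code = {"AACG", "CGTT", "ACGT"}        # hypothetical example code
print(has_constant_gc(code), is_closed_under_inversion(code))
```

Constant GC content keeps the melting temperatures of all codewords similar, which is the usual motivation for the first constraint.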
Conclusion
Genetics is the most exciting source of new ideas for coding theory
The atom of information is a gene, not a base pair or a triple of base pairs
The error control code of the genome is to be found operating on the level of
genes
Compression, phylogenetic tree construction: comparison of species has to be
performed on the level of genes first
Once the genes are compared, can move to local base pair comparisons