The Information Processing Mechanism of DNA and Efficient DNA Storage

Olgica Milenkovic
University of Colorado, Boulder
Joint work with B. Vasic

Outline

PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
• Information Theory of Genetics: an emerging discipline
• Error-Correction and Proofreading in genetic processes
• What type of codes "operate" at the level of bio-chemical processes of the Central Dogma?
• Spin Glasses, Kauffman's "NK" Model, Regulatory Network of Gene Interactions and Low-Density Parity-Check (LDPC) Codes
• Cancer, dysfunctional proofreading and chaos theory

PART II: HOW DOES ONE STORE DNA? (DNA COMPRESSION)
• Structure of DNA: Statistics and Modeling
• DNA Compression
• Genome Compression
• New Distance Measures and One-Way Communication

PART III: NEW CODING PROBLEMS IN GENETICS

Information theory of genetics

• 2003: 50th anniversary of the discovery that DNA has a double-helix structure (Crick, Watson, Franklin, Wilkins, 1953)
• 2003: Completion of the Human Genome Project (98% of human DNA sequenced)
• Every day an average of 15 new sequences is added to SwissProt + GenBank
• Vast amount of genetic data just starting to be analyzed!

DNA is a CODE, but very little is known about its
• exact information content
• nature of redundancy
• statistical properties
• secondary structure
• influence on disease development and control
• underlying error-correcting mechanism

Information Theory of DNA

Helps in understanding the
• EVOLUTION of DNA
• FUNCTIONALITY of DNA
• DISEASE DEVELOPMENT

The IT community is still not involved in this area!
The Signal Processing community is just getting involved: Special Issue of the Signal Processing journal devoted to Genetics, 2003.

The League of Extraordinary Gentlemen…

I How is information stored in a genetic sequence? What are the atoms of information?
The DNA Polymer…

[Figure: the sugar-phosphate backbone (alternating PO4 and sugar units with attached bases) and the structure of deoxyribose (sugar), with carbons 1' to 5' labeled]

The Bases… (DOUBLE HELIX)

Purine bases: Adenine (A); Guanine (G)
Pyrimidine bases: Thymine (T); Cytosine (C)

The Pairing Rule…

• A and T are paired through TWO hydrogen bonds
• G and C are paired through THREE hydrogen bonds

The Genetic Code…

First    Second letter                                   Third
letter   U          C          A           G            letter
U        UUU phe    UCU ser    UAU tyr     UGU cys      U
         UUC phe    UCC ser    UAC tyr     UGC cys      C
         UUA leu    UCA ser    UAA stop    UGA stop     A
         UUG leu    UCG ser    UAG stop    UGG trp      G
C        CUU leu    CCU pro    CAU his     CGU arg      U
         CUC leu    CCC pro    CAC his     CGC arg      C
         CUA leu    CCA pro    CAA gln     CGA arg      A
         CUG leu    CCG pro    CAG gln     CGG arg      G
A        AUU ile    ACU thr    AAU asn     AGU ser      U
         AUC ile    ACC thr    AAC asn     AGC ser      C
         AUA ile    ACA thr    AAA lys     AGA arg      A
         AUG met    ACG thr    AAG lys     AGG arg      G
G        GUU val    GCU ala    GAU asp     GGU gly      U
         GUC val    GCC ala    GAC asp     GGC gly      C
         GUA val    GCA ala    GAA glu     GGA gly      A
         GUG val    GCG ala    GAG glu     GGG gly      G

Abbreviations: ala = alanine, arg = arginine, asn = asparagine, asp = aspartic acid, cys = cysteine, gln = glutamine, glu = glutamic acid, gly = glycine, his = histidine, ile = isoleucine, leu = leucine, lys = lysine, met = methionine, phe = phenylalanine, pro = proline, ser = serine, thr = threonine, trp = tryptophan, tyr = tyrosine, val = valine

Genes, Exons, Introns (Junk DNA)…

Genes: sequences of base pairs coding for chains of amino acids
• Consist of exon (coding) and intron (non-coding) regions
• Length: anything from several tens up to several million base pairs
• EXAMPLE: among the most complex identified genes is DYSTROPHIN (2 million bps, more than 60 exons, codes for 4000 amino acids)
• Escherichia coli: around 4000 genes; humans: 35000-40000 genes

Junk DNA: "disrespectful" name for introns
• Significant fraction of DNA
• Shown (last year) to be "somewhat" responsible for RNA coding (far from being "junk", but function still not well understood…)

The Central Dogma…

DNA (replication) → mRNA (transcription) → proteins (translation)

A Communication Theory Perspective:

DNA sequence → Genetic Channel → DNA sequence, mRNA, proteins

What kind of errors are introduced by the Genetic Channel?
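The pairing rule, the codon table and the Central Dogma above can be tied together in a few lines. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, the codon table is the standard genetic code written with one-letter amino-acid symbols ("*" for stop), and transcription is modeled by the common shortcut of copying the coding strand with T replaced by U.

```python
# Standard genetic code packed in the conventional U, C, A, G order
# (first, second, third letter), one-letter amino acids, '*' = stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Pairing rule: A-T (two hydrogen bonds), G-C (three hydrogen bonds).
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def complement(dna):
    """Base-by-base complementary strand (not reversed)."""
    return "".join(PAIR[base] for base in dna)


def hydrogen_bonds(dna):
    """Total number of hydrogen bonds holding the duplex together."""
    return sum(H_BONDS[base] for base in dna)


def transcribe(coding_strand):
    """mRNA with the same sequence as the coding strand, T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[pos:pos + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

For instance, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC and UAA, that is met, ala and stop.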
Processing in the Genetic Channel: DNA REPLICATION

DNA within chromosomes (tight packing):
• DNA is wrapped around HISTONES (proteins)
• HISTONES are organized in NUCLEOSOMES
• NUCLEOSOMES form CHROMATIN, which is folded into CHROMOSOMES

Untying the knots: Topoisomerases
Unwinding the helix: Helicases
Getting it all started: Primers
Doing the hard work: Polymerases
Sealing the segments: Ligases
Helping to keep two sides apart: SSB

Replication: more details

Rules:
• Replication always proceeds in the 5' to 3' direction
• Replication is semi-conservative
• Replication is a parallel process for eukaryotes
• Polymerases can stitch together any combination of bases ("Ps are a little bit sloppy")

Facts (timing for replication):
• E. coli: 40 min
• Humans (parallel): < 2 hours
• Can be prolonged for proofreading purposes

Errors…

Combination of substitution, deletion, insertion (replication fork), shift, reversal, etc. errors (a complete exon or intron deleted, or simple base pair deletions)
1. Tautomeric shifts (transition/transversion): *T-G, *G-T, *C-A, *A-C
2. Recombination between non-identical molecules ("HETERODUPLEX mismatches")
3. Spontaneous DEAMINATION (C to U, C to T, C-G to T-A), METHYLATION (CpG), rare
4. APURINIC/APYRIMIDINIC SITES (due to HYDROLYSIS)
5. CROSS-LINKS
6. STRAND BREAKAGE, OXIDATIVE DAMAGE ERRORS
7. LOSS OF 5000-10000 PURINE and 200-500 PYRIMIDINE bases (20 hours) due to radiation

Replication Errors:

Polymerase misinsertion probability between 10^-3 and 10^-5

[Figure: four replication-error mechanisms shown on short primer/template examples: Miscoding; Slippage; Miscoding-Realignment; Slippage-Dislocation]

Bio-chemical mechanism responsible for error correction?

Proofreading (Maroni, Molecular and Genetic Analysis of Human Traits):
Replication polymerases have an error rate of 10^-3; for human DNA with 3 × 10^9 bps this gives on the order of 10^6 errors

Example: C to U conversion causes the presence of deoxyuridine, detected by uracil-DNA GLYCOSYLASE
The glycosylase process acts like an erasure channel

1. Proofreading based on the semi-conservative nature of replication
2. Excision repair mechanisms: arrays of exonucleases
   • Show a large degree of pre-correction binding activity; correction is performed by EXCISION
   • "Jumping" occurs between different genes! (Lin, Lloyd, Roberts, Nucleases)
   • Reduce error levels by an additional several orders of magnitude
   • Mismatch-specific post-replication enzymes

Total number of errors per human DNA replication: on average JUST ONE

Replication and repair have been optimized for balancing the spontaneous mutational load: permitting evolution without threatening fitness or survival

Characteristics of DNA ECC:
• Error correction is performed on different levels
• Error correction is performed in very short time
• An extremely large number of very diverse errors is corrected
• Error correction is tied to the global structure of DNA (not to consecutive base pairs)
• Error correction also depends on DNA topology

Identify ECCs of DNA…

Error-Correcting Codes in DNA: Forsdyke (1981), Wolny (1983), Eigen (1993), Liebovitch et al. (1996), Battail (1997), Rosen and Moore (2003), Mac Dónaill (2003)

Theories:
• Non-coding regions are in-series error-detecting sequences!
• The ordering of coding/non-coding regions is responsible for error correction!
• Complementary base pairing corresponds to an error-detecting code!
  Acceptor/donor: hydrogen atom/lone electrons; 1 represents a donor, 0 an acceptor
  Additionally, add 0 or 1 for purine and pyrimidine
  Code: A 1010, G 0110, T 0101, C 1001

BEST ERROR CORRECTING MECHANISM: Deinococcus radiodurans
• Microbe with extreme radiation resistance
• Able to survive radiation doses thousands of times higher than would kill most organisms, including humans
• Surpasses the cockroach by orders of magnitude!
Why? Because of its remarkable DNA-repair mechanism! D. radiodurans flawlessly regenerates its radiation-shattered genome in about 24 hours.
"Conan the Bacterium" (to conquer the Red Planet!)
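The four-bit nucleotide code listed on the slide (A 1010, G 0110, T 0101, C 1001) can be checked directly for the properties that make it error-detecting. The sketch below is illustrative only: the helper names are hypothetical, and the reading "a single bit flip never yields another valid nucleotide" is an interpretation of the slide's claim, verified here only for the four codewords as given.

```python
# Codewords exactly as listed on the slide: three donor/acceptor bits
# followed by one purine/pyrimidine bit.
CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}
WATSON_CRICK = {"A": "T", "T": "A", "G": "C", "C": "G"}


def parity(bits):
    """0 if the word contains an even number of ones, 1 otherwise."""
    return bits.count("1") % 2


def bitwise_complement(bits):
    """Swap donors and acceptors (and the purine/pyrimidine flag)."""
    return "".join("1" if b == "0" else "0" for b in bits)


def single_bit_flips(bits):
    """All words at Hamming distance one from the given word."""
    return [
        bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]
        for i in range(len(bits))
    ]


def is_codeword(bits):
    return bits in CODE.values()


# Every codeword has even parity, so any single bit error is detectable.
all_even = all(parity(word) == 0 for word in CODE.values())

# Paired bases carry complementary patterns: donor meets acceptor.
pairs_complementary = all(
    CODE[WATSON_CRICK[base]] == bitwise_complement(CODE[base]) for base in CODE
)
```

Under this encoding both `all_even` and `pairs_complementary` evaluate to `True`, which is one way to read the statement that complementary base pairing behaves like a parity-check (error-detecting) code.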
Something seemingly unrelated…

Spin Glasses, the Ising Model, Hopfield Networks or "Boltzmann Machines":

State x of a spin glass with N spins that may take values in {-1,+1}
Energy of the state x: E; external field: h

The Hamiltonian:
\[ E(x; \{J\}_{i,j}, \{h\}_i) = -\Big[ \frac{1}{2} \sum_{m,n} J_{m,n}\, x_m x_n + \sum_n h_n x_n \Big] \]

Hamiltonian for the Ising model (m, n neighboring spins):
\[ E(x; J, h) = -\Big[ \frac{1}{2} \sum_{m,n} J\, x_m x_n + \sum_n h\, x_n \Big] \]

[Figure: example of "frustration" in a small arrangement of + and - spins]

Water exists as a gas, liquid or solid, but all microscopic elements are H2O molecules. This is due to intermolecular interactions that depend on temperature, pressure etc.

Something seemingly unrelated…

Codes on graphs: the most powerful class of error-correcting codes in information theory, including Turbo, Low-Density Parity-Check (LDPC) and Repeat-Accumulate (RA) codes

\[ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix} \]

[Figure: bipartite code graph with variable nodes and check nodes]

Most important consequence of the graphical description: efficient iterative decoding
• Variable nodes communicate their reliability to check nodes
• Check nodes decide which variables are unreliable and "suppress" their inputs
• Number of edges in the graph = density of H; sparse = small complexity
• Detrimental for convergence of the decoder: presence of short cycles in the code graph

Applications of LDPC codes: cryptography, compression, distributed source coding for sensor networks, error control coding in optical and wireless communication and in magnetic and optical storage…

Gallager's Decoding Algorithm A

Works for the Binary Symmetric Channel (BSC):
• Each variable sends its channel value unless all incoming messages from the other checks say "change"
• Each check sends an estimate of the bit based on the modulo-two sum of the other bits participating in the check

Alternative view: Variables = Atoms; Binary Values = Spins; variables "align" or "misalign" according to interaction patterns

LDPC codes are equivalent to diluted spin glasses:
\[ H = -\sum_{i_1, i_2, \ldots, i_r} J_{i_1 i_2 \ldots i_r}\, S_{i_1} S_{i_2} \cdots S_{i_r} - \sum_i h_i S_i \]
• Ground state search for the above Hamiltonian = maximum a posteriori (MAP) decoding of the codeword
• Average magnetization at a site = MAP decision for an individual variable

Something seemingly unrelated…

The Regulatory Network of Gene Interactions (RNGI)

Kauffman (1960's): "NK" model, evolution through changing interactions between genes
Life exists at the Edge of Chaos!
BASED ON SPIN GLASSES!

RANDOM BOOLEAN FUNCTION MODEL: evolution is carried by genes, not base pairs, and by the way genes interact!

[Figure: three-gene network G1, G2, G3]

State transitions:

T (G1 G2 G3)    T+1 (G1 G2 G3)
000             001
001             001
010             101
011             000
100             101
101             010
110             001
111             011

Node functions:

G1G2  G1      G1G3  G2      G1G2  G3
00    0       00    0       00    1
01    1       01    0       01    1
10    0       10    0       10    0
11    0       11    1       11    1

Chaos, Attractors, Connectivity

Boolean networks are dynamical systems, characterized by the network topology and the choice of Boolean node functions

[Figure: state-transition diagram over the eight states 000, 001, 010, 011, 100, 101, 110, 111]

Attractors: point and periodic; number and period lengths

MOST IMPORTANT topological factor: CONNECTIVITY
KEY: sparse connectivity allows enough variability for evolutionary processes and produces self-organizing structures, but does not allow the system to "get trapped in" chaotic behavior

MOST IMPORTANT Boolean function factors:
• BIAS (number of 1 outputs)
• CANALIZATION (depends on the number of inputs determining the output)

[Figure: "Kimatograph" of the network]

The NK model and RNGI

N = number of genes; K = number of genes co-interacting with one given gene
• K = 2 is the critical value (mainly frozen states with islands of changing interaction)
• Interaction between genes in a regulatory network is very limited in scope
• K ranges anywhere from 2-3 to 10-15; on closer inspection it is logarithmic in N, i.e. in the number of genes:
  between 2 and 3 for Escherichia coli (around a thousand genes);
  between 4 and 8 for higher metazoa (several thousand genes)
• Can explain the process of cell differentiation: the genetic material of each cell is the same, yet cells are functionally and morphologically very different
• Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of the RNGI
• Counting attractors for networks with N = 40000 genes and K = 2 gives 260 cell types (correct number: 258)

KEY IDEA: an LDPC code with a given decoding algorithm is a BOOLEAN NETWORK, a SPIN GLASS,…

Example: LDPC code under Gallager's A algorithm

In the Control Graph, edge (i,j) exists if the i-th bit controls the j-th bit (i.e. if i and j are at distance exactly two)

[Figure: two views of the same LDPC code on nodes G1, G2, G3, G4: "Variables and Checks" and "The Control Graph"]

Morse-Thue:

The Boolean function is determined by the decoding algorithm. For Gallager's A algorithm it takes the form of a truncated/periodically repeated MORSE-THUE sequence:

n         0  1  2   3   4    5    6    7    …
binary    0  1  10  11  100  101  110  111  …
sequence  0  1  1   0   1    0    0    1    …

Properties:
• Self-similar (fractal)
• Results in unbiased Boolean functions

Use Boolean Network Analysis for LDPC Codes

• No cycles of length four, code regular: uniform choice for the Boolean function
• Cycles of length four: Boolean functions vary, many more attractors
• In no case are the functions canalizing

\[ F_i = (G_i \oplus 1)\, N_{i_1} N_{i_2} \cdots N_{i_s} \oplus G_i \big(1 \oplus (1 \oplus N_{i_1}) \cdots (1 \oplus N_{i_s})\big) \pmod 2 \]
where N_{i_1}, N_{i_2}, …, N_{i_s} are the modulo-two sums of the variable nodes connected to the controls C_{i_1}, C_{i_2}, …, C_{i_s}

Can use mean-field theorems to see when initial perturbations in the codewords disappear in the limit: use the Boolean derivative, sensitivity analysis, the iterated Jacobian and the Lyapunov exponent (as in Shmulevich et al.):
\[ \frac{\partial f}{\partial x_j} = f(x^{(j,0)}) \oplus f(x^{(j,1)}), \qquad x^{(j,l)} = (x_1, \ldots, x_{j-1}, l, x_{j+1}, \ldots, x_N) \]
The Jacobian F is an N × N matrix with \partial f_i / \partial x_j in entry (i,j).
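The Boolean derivative and the Jacobian defined above are easy to evaluate by brute force on small networks. The following sketch is an illustration under stated assumptions, not the authors' analysis code: the function names and the three-node example network (`parity3`, `and3`, `copy_first`) are hypothetical and were chosen only to contrast a maximally sensitive function with less sensitive ones.

```python
def boolean_derivative(f, x, j):
    """Partial derivative of f with respect to x_j at the point x:
    f(x with x_j = 0) XOR f(x with x_j = 1)."""
    x0 = x[:j] + (0,) + x[j + 1:]
    x1 = x[:j] + (1,) + x[j + 1:]
    return f(x0) ^ f(x1)


def jacobian(functions, x):
    """N x N matrix whose entry (i, j) is the derivative of f_i w.r.t. x_j."""
    n = len(x)
    return [[boolean_derivative(f, x, j) for j in range(n)] for f in functions]


# Hypothetical three-node network used only for illustration.
def parity3(x):
    return x[0] ^ x[1] ^ x[2]


def and3(x):
    return x[0] & x[1] & x[2]


def copy_first(x):
    return x[0]


network = [parity3, and3, copy_first]
J = jacobian(network, (1, 1, 0))
# J == [[1, 1, 1], [0, 0, 1], [1, 0, 0]]
```

The first row is all ones because flipping any input of a parity function always flips its output, whereas the AND node at the state (1, 1, 0) reacts only to its third input.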
\[ d(t+1) = F(t) \odot d(t) = H\big(F(t) \cdot d(t)\big), \qquad H\big((g_1, \ldots, g_N)\big) = \big(H(g_1), \ldots, H(g_N)\big), \qquad H(g_i) = \begin{cases} 0, & g_i = 0 \\ 1, & g_i \neq 0 \end{cases} \]

Use Boolean Network Analysis for LDPC Codes

Iterated Jacobian:
\[ IJ(t) = F(0) \odot F(1) \odot \cdots \odot F(t) \]

Lyapunov exponent:
\[ \lambda(T) = \frac{1}{T} \sum_{t} \log \gamma(t), \qquad \gamma(t) = \frac{|IJ(t+1)|}{|IJ(t)|}, \qquad |M| = \frac{1}{N} \sum_{i,j} M_{i,j} \]

The influence of variable x_i on the Boolean function f(x_1, x_2, …, x_N) is defined as the expectation of the partial derivative with respect to the distribution of the variables:
\[ I_i(f) = E\Big[\frac{\partial f}{\partial x_i}\Big] = P\Big\{\frac{\partial f}{\partial x_i} = 1\Big\} \]
Influence carries important information about frozen states, error susceptibility etc.

Iterative change of the size of the "stable core":
\[ s(t+1) = \sum_{i} c_i\, s(t)^{K-i} \big(1 - s(t)\big)^{i} \]

Control of the chaotic phase in a Boolean network by means of periodic pulses (with period T) that "freeze" a fraction of the nodes

LDPC Codes and Gallager's A Decoding Algorithm

\[ f(z_1, z_2, z_3) = (1 \oplus z_1)\, z_2 z_3 \oplus z_1 (1 \oplus z_2)(1 \oplus z_3) \]
\[ \frac{\partial f}{\partial z_1} = 1 \oplus z_2 \oplus z_3, \qquad \frac{\partial f}{\partial z_2} = z_1 \oplus z_3, \qquad \frac{\partial f}{\partial z_3} = z_1 \oplus z_2 \]

[Table: truth tables of the update functions F1(A), F2(A), F3(A) for bit A with checks C1 = (B), C2 = (C), C3 = (D)]

New decoding methods for LDPC and other Block Codes…

Work in progress:
• Decoders that don't operate on the frozen core
• Decoders that periodically freeze some variables to avoid chaotic behavior
• Iterative decoders that work for asymmetric channels and channels with insertion/deletion errors

Bold Conjecture:
• The ECC of DNA replication operates on multiple levels
• The carrier of information is the gene, not the base pair
• The global level involves genes; local levels may involve exons or base pairs in general
• The Global Code is an LDPC Code!

Wigner observed that the same mathematical concepts turn up in entirely unexpected connections throughout science… (no explanation as of yet)
LDPC codes are related to statistical physics (spin glasses), to neural networks, to self-organizing systems, to …
R. Sole and B. Goodwin, Signs of Life: How Complexity Pervades Biology

The Corresponding LDPC Code

Table 1: Example of a 15-node regulatory network in terms of gene controls

Gene   Controls        Controls (after addition)
1      2,3,13          2,3,13
2      1,3             1,3
3      4,5,6           4,5,6,1,2
4      5,6             5,6,3
5      4,9,6           4,9,6
6      5,9,3,4,7,8     3,4,5,7,8,9
7      15,8,6          15,8,6
8      7,9             7,9
9      4,5,6           4,5,6
10     9,13,15         9,13,15
11     8,12,13         8,12,13
12     11,13,15        8,11,13,15
13     8,14,15         8,14,15,11,12
14     X               X
15     11,12           11,12,13,14,10,7

Parity-check matrix (rows = checks, columns = genes 1 to 15):

      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 1    1  1  1  0  0  0  0  0  0  0  0  0  0  0  0
 2    1  0  0  0  0  0  0  0  0  0  0  0  1  0  0
 3    0  0  1  1  1  1  0  0  0  0  0  0  0  0  0
 4    0  0  0  0  1  0  0  0  1  0  0  0  0  0  0
 5    0  0  0  0  0  1  1  1  1  0  0  0  0  0  0
 6    0  0  0  0  0  0  1  0  0  0  0  0  0  0  1
 7    0  0  0  1  0  0  0  0  1  0  0  0  0  0  0
 8    0  0  0  0  0  0  0  0  1  1  0  0  0  0  0
 9    0  0  0  0  0  0  0  0  0  1  0  0  1  0  1
10    0  0  0  0  0  0  0  1  0  0  1  1  1  0  0
11    0  0  0  0  0  0  0  0  0  0  0  0  1  1  1
12    0  0  0  0  0  0  0  0  0  0  1  1  0  0  1

15-gene interaction example by Hashimoto (Shmulevich, Anderson Cancer Center)
Need a q-ary LDPC code corresponding to different levels of interaction

Cancer: genetic disorder of somatic cells

Human cancer: INDUCED and SPONTANEOUS
• Accumulation of mutant (erroneous) genes that control the cell cycle, maintain genomic stability, and mediate apoptosis
• Causes of mutation: depurination and depyrimidination of DNA; proofreading and mismatch errors during DNA replication
• Deamination of 5-methylcytosine to produce C to T base pair substitutions; and damage to DNA and its replication imposed by products of metabolism (notably oxidative damage caused by oxygen free radicals)
• Defective DNA excision-repair; low levels of antioxidants, antioxidant enzymes, and nucleophiles that trap DNA-reactive electrophiles; and enzymes that conjugate nucleophiles with DNA-damaging electrophiles
(Cancer Research)

To summarize:
• Various forms of cancer are tightly linked to malfunctioning of the proofreading (ECC) mechanism
• Cancer cells correspond to a special type of attractor of the RNGI (a cancer cell is "just another configuration" of the RNGI) (Shmulevich et al., Anderson Cancer Research Center)
• This attractor has genes interacting in a way that results in uncontrolled cell division
• Key observation: a C-Change in the RNGI results in further weakening of the proofreading system, and vice versa

Example 1: Cancer cells cheat the proofreading mechanism regulating the reduction in length of telomeres
• Aging: during each cell division, telomeres get shorter and shorter…
• When they become too short, errors in replication happen, leading to cancer (a time bomb in our body)
• Cancer cells "cheat" the proofreading mechanism and allow telomeres to maintain constant length
• Finding the error-control mechanism: classifying diseases accurately, curing diseases (including cancer) by gene therapy, keeping telomere lengths constant over a long time…

Example 2: Breast cancer susceptibility gene BRCA1 is tightly linked to error control of DNA and regulation of cell division

How to obtain results practically? DNA Microarrays!

[Figure taken from Shmulevich et al.]

II How can one efficiently store DNA sequences?
DNA Storage: Compression

• GenBank/Swiss-Prot: storage of a large number of DNA and protein sequences (17.471 million sequences in GenBank, 2002)
• Every day, an average of 15 new sequences is added to the database
• DNA compression is absolutely necessary to maintain the banks
• Fractal DNA structure to be exploited
• Possible use of Tsallis entropy
• Need novel compression algorithms
• DNA sequences of related species differ in a very small percentage of base pairs: need cross-reference compression
• Need a meaningful definition of DNA distance (major paradigm shift from base-pair distance to chromosomal distance)

Statistical properties of DNA sequences

Bases within the human mitochondrion (length approximately 17000) appear with the following frequencies:

A      T      G      C
0.31   0.13   0.25   0.31

while within different regions of the human fetal globin gene:

          A      T      G      C
Introns   0.27   0.29   0.27   0.17
Exons     0.24   0.22   0.28   0.25

• Parts of genetic sequences can be modeled by Markov chains of given order and transition probabilities; order 2-7
• Regions of uniform distribution: isochores; can stretch in length up to hundreds of Kbps
• Repetitive patterns: tandem repeats (TR), random repeats (RR), short interspersed repeat sequences (SINEs, 9% of DNA), long interspersed repeat sequences (LINEs)
• Some base pairs, like CG, have very small probability: the most notorious triplet repeats, related to Huntington's disease and Fragile-X mental retardation, consist of these very unlikely "CG" pairs: (CGG)m, (CCG)m, m = number of repetitions
• Junk DNA seems to have long-range (fractal) characteristics

A fractal pattern arises from the so-called DNA walk: a graphical representation of the DNA sequence in which one moves up for C or T and down for A or G.
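The DNA walk just described is a one-line cumulative sum. As a minimal sketch (the function name `dna_walk` and the step table are illustrative, following the slide's convention of +1 for the pyrimidines C, T and -1 for the purines A, G):

```python
# Step rule from the slide: up for C or T, down for A or G.
STEP = {"C": 1, "T": 1, "A": -1, "G": -1}


def dna_walk(sequence):
    """Return the height of the walk after each base of the sequence."""
    height = 0
    path = []
    for base in sequence.upper():
        height += STEP[base]
        path.append(height)
    return path


# Two pyrimidines followed by two purines: the walk rises, then returns.
example = dna_walk("CTAG")  # [1, 2, 1, 0]
```

Long-range correlations of the kind attributed to non-coding DNA would show up as walks that drift much further from zero than an uncorrelated sequence with the same base frequencies.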
Can have a two- or three-dimensional random walk: further differentiation between A, G, C, T

[Figure: two-dimensional DNA walk with one direction per base: C, A, T, G]

• Fractal dimension of the DNA molecule: 0.85 for higher species, 1 for lower
• Use linguistic analysis of human languages for exploring the DNA "language" (Zipf method)
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker: http://athena.bioc.uvic.ca/pbr/walk/

DNA and Cantor Sets

Provata and Almirantis, 2003: fractal Cantor pattern in DNA
• Exons: filled regions; introns: empty regions
• Random, fractal, Cantor-like set
• Implication: the atom (carrier of information) is the exon/intron pair
• History-based random walk and DNA description in terms of urn models
• Only introns in higher species have higher complexity than in lower species
• Both coding and non-coding regions exhibit long-range correlation, with spectral density of introns 1/f^β

Known algorithms

• GenCompress (Chen, '97)
• Biocompress (Grumbach/Tahi, '94)
• Fact (Rivals, '00)
• GenomeSequenceCompress (Sato et al., '00)
• Use characteristics of DNA like repeats, reverse complements…
• Compression rate is about 1.74 bits per base (78% in compression ratio)
• Two classes: statistical and grammar-based compression algorithms
• Huffman, Lempel-Ziv, Arithmetic Coding, Burrows-Wheeler, Kieffer's Grammar-Based Schemes (with DNA-specific modifications)
• No known algorithm is specially suited to the fractal nature of DNA, although it is 90% fractal!

FILE: Human Growth Hormone (HUMGHCSA), 2.00

Method           Compression rate (achievable)
GZIP             2.065
ARITHM.          2.052
VPS2A            1.607
UNIX COMPRESS    2.19
BIOCOMPRESS      1.31
BWT              1.608
GTAC             1.1

Different Entropy Measures:

Shannon entropy:
\[ H_S(p) = -\sum_i p_i \log p_i \]
Renyi entropy:
\[ H_R(p) = \frac{1}{1-q} \log \sum_i p_i^q \]
Tsallis entropy:
\[ H_T(p) = \frac{\sum_i p_i^q - 1}{1-q} \]
H_S(p) is derived from H_T(p) using
\[ \log z = \lim_{n \to 0} \frac{z^n - 1}{n} \]
The Tsallis entropy is non-additive in the sense that, for two independent systems A and B,
\[ H_T(A \cdot B) = H_T(A) + H_T(B) + (1-q)\, H_T(A)\, H_T(B) \]
Hausdorff dimension:
\[ \lim_{r \to 0} \frac{\log N}{\log (1/r)} \]

Approach: Use "Fractal Grammars"

• Inference of context-free grammars from fractal data sets
• Syntactic generation of fractals
• The theory of formal languages can be used to state the problem of "syntactic fractal pattern recognition"
• Explore connections with wavelets (ideas by Jacques Blanc-Talon)

Example: Heighway dragon and Koch curve
\[ G = \big(\{a, b, c, d, e, f, g, h\},\ \{\phi_1, \phi_2\},\ \{a\}\big) \]
\[ \phi_1(a) = ac,\quad \phi_1(c) = ec,\quad \phi_1(e) = eg,\quad \phi_1(g) = ag,\quad \phi_1(x) = x \ \text{for}\ x \in \{b, d, f, h\} \]
\[ \phi_2(a) = abha,\quad \phi_2(b) = bcab,\ \ldots,\ \phi_2(h) = hagh \]

Barthel, Brandau, Hermesmeier, Heising: Fractal Prediction, 1997
Zerotree wavelet coding using fractal prediction

How does one compress sets of related DNA sequences?

• Distributed source coding problem: peculiar correlation patterns
• Could explore wavelet-based compression
• Distributed source coding with LDPC codes…

Genomic Distance and One-Way Communication

Major paradigm shift in the genetic distance measure:
from base-pair distance, involving deletion, insertion and substitution (Sankoff, Kruskal: Time Warps, String Edits, and Macromolecules),
to Chromosomal Distance, based on global arrangements of genes

Inversions are the primary mechanism of genome rearrangement!
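The base-pair distance mentioned above counts deletions, insertions and substitutions. A minimal sketch of that classical measure follows, assuming unit cost for each of the three operations (the slides do not specify costs, and the function name `edit_distance` is illustrative); it uses the standard dynamic-programming recurrence with one row of memory.

```python
def edit_distance(a, b):
    """Minimum number of single-base insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, base_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if base_a == base_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Deleting the single base C turns ACGT into AGT.
one_deletion = edit_distance("ACGT", "AGT")  # 1
```

A measure of this kind sees an inverted gene block as a long run of unrelated substitutions, which is one motivation for moving to a distance defined on gene arrangements.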
REVERSAL DISTANCE

• The smallest number of inversions necessary to transform one genome into another
• Equivalent to finding the minimum number of reversals needed to "sort" a permutation
• Permutations are signed, indicating the direction of transcription

Example: (+1 +3 +2) → (+1 -2 -3) → (+1 +2 -3) → (+1 +2 +3)

How does one perform one-way communication (SENDING INFORMATION TO A RECEIVER WHO POSSESSES CORRELATED INFORMATION) under the reversal distance measure?

The other way around: DNA compression methods increase network efficiency by up to 10 times

Peribit's SR-50 compressor
• Uses molecular sequence reduction (MSR) algorithms similar to those used to match patterns in the study of DNA
• The algorithms identify and eliminate previously undetected repetitions in network traffic in wide area networks (WANs), giving compression ratios between 1.2:1 for voice and video and 5:1 for SQL traffic

III Additional Coding Problems in Genetics

DNA Computing
• Codes with constant GC content and invariant under Watson-Crick inversion

Microarray Error Control Coding
• Using design theory to reduce the error rate of DNA array data
• Use novel clustering algorithms for DNA array data

Conclusion

• Genetics is the most exciting source of new ideas for coding theory
• The atom of information is a gene, not a base pair or a triple of base pairs
• The error control code of the genome is to be found operating on the level of genes
• Compression, phylogenetic tree construction: comparison of species has to be performed on the level of genes first
• Once the genes are compared, one can move to local base-pair comparisons
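The signed-reversal example from Part II, (+1 +3 +2) to (+1 +2 +3), can be replayed and checked by exhaustive search. The code below is a brute-force sketch for very small genomes only: the function names are hypothetical, genomes are written as tuples of signed integers, and the breadth-first search visits every signed permutation reachable by reversals, so it is not meant as a practical algorithm.

```python
from collections import deque


def reverse_segment(perm, i, j):
    """Invert genes i..j (inclusive): reverse their order and flip signs."""
    flipped = tuple(-gene for gene in reversed(perm[i:j + 1]))
    return perm[:i] + flipped + perm[j + 1:]


def reversal_distance(start, target):
    """Smallest number of reversals turning start into target (BFS)."""
    start, target = tuple(start), tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for i in range(len(perm)):
            for j in range(i, len(perm)):
                nxt = reverse_segment(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached for permutations of the same gene set


# Replay the three reversals shown in the example.
step1 = reverse_segment((1, 3, 2), 1, 2)   # (1, -2, -3)
step2 = reverse_segment(step1, 1, 1)       # (1, 2, -3)
step3 = reverse_segment(step2, 2, 2)       # (1, 2, 3)
```

For this example the search confirms that the three-step path shown on the slide is optimal: `reversal_distance((1, 3, 2), (1, 2, 3))` returns 3.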