Bioinformation Technology: Case Studies in Bioinformatics and Biocomputing with DNA Chips
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT)
Seoul National University
btzhang@bi.snu.ac.kr
http://bi.snu.ac.kr/~btzhang

Outline
• Bioinformation Technology
• Bioinformatics
• DNA Chip Data Analysis: IT for BT
• DNA Computing: BT for IT
• DNA Computing with DNA Chips
• Outlook

Human Genome Project
Goals
• Identify the approximately 40,000 genes in human DNA
• Determine the sequences of the 3 billion bases that make up human DNA
• Store this information in databases
• Develop tools for data analysis
• Address the ethical, legal and social issues that arise from genome research
Genome health implications
• A new disease encyclopedia
• New genetic fingerprints
• New diagnostics
• New treatments

Bioinformatics vs. Biocomputing
• Bioinformatics: IT for BT
• Biocomputing: BT for IT

Bioinformatics

What is Bioinformatics?
• Bio: molecular biology
• Informatics: computer science
• Bioinformatics: solving problems arising from biology using methodology from computer science
• Bioinformatics vs. Computational Biology
• Bioinformatik (in German): biology-based computer science as well as bioinformatics (in the English sense)

Molecular Biology: Flow of Information
• DNA → RNA → Protein → Function
[Fig.] A DNA sequence fragment and the corresponding protein residues (Phe, Cys, Cys).

DNA (Gene) → RNA → Protein
[Fig.] Gene structure and expression: control statements (TATA, start, termination, stop) surround the gene; transcription (RNA polymerase) produces mRNA with a 5’ UTR (ribosome binding) and a 3’ UTR; translation (ribosome) produces the protein.

Nucleotide and Protein Sequence
DNA (Nucleotide) Sequence
SQ sequence 1344 BP; 291 A; 374 C; 401 G; 278 T; 0 other
aacctgcgga aggatcatta gcgggcccgc cgcttgtcgg cgcttgtcgg ccgagtgcgg gtcctttggg ccgccggggg ggcgcctctg ccccccgggc cccaacctcc catccgtgtc ccccccgggc ccgtgcccgc cggagacccc tattgtaccc tgttgcttcg aacctgcgga aggatcatta ctgtctgaaa gcgggcccgc cgcttgtcgg ccgagtgcgg gtcctttggg tgagttgatt ccgccggggg ggcgcctctg cccaacctcc catccgtgtc agttaaaact ccccccgggc ccgtgcccgc tattgtaccc tgttgcttcg gatctcttgg cggagacccc aacacgaaca gcgggcccgc cgcttgtcgg ccgagtgcgg ctgtctgaaa gcgtgcagtc agttaaaact ttcaacaatg cccaacctcc tgagttgatt gaatgcaatc gatctcttgg ttccggctgc tattgtaccc agttaaaact ttcaacaatg tattgtaccc tgttgcttcg gcgggcccgc gatctcttgg ttccggctgc gcgggcccgc cgcttgtcgg ccgccggggg tattgtaccc tgttgcttcg ccgccggggg ggcgcctctg agttaaaact gcgggcccgc cgcttgtcgg ccccccgggc ccgtgcccgc gatctcttgg ccgccggggg ggcgcctctg cggagacccc tgttgcttcg tattgtaccc ccccccgggc ccgtgcccgc gcgggcccgc cgcttgtcgg gcgggcccgc cggagacccc tgttgcttcg ccgccggggg cggagacccc ccgccggggg gcgggcccgc cgcttgtcgg gcgggcccgc cgcttgtcgg ccccccgggc ccgccggggg cggagacccc ccgccggggg ggcgcctctg cggagacccc ccgccggggg ccgtgcccgc aacacgaaca gcgtgcagtc gaatgcaatc ttcaacaatg aacctgcgga gtcctttggg catccgtgtc tgttgcttcg cgcttgtcgg ggcgcctctg ttcaacaatg ttccggctgc tgttgcttcg cgcttgtcgg ggcgcctctg ccgtgcccgc tgttgcttcg

Protein (Amino Acid) Sequence
CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 ..
1 MLNGENVDSR ARNNLQAGAK EKAKPQSPEP NPQLCSEFVN SILIDWLVQV KLQLVGVTSM RSMECNILRR AKYLMELTLP GTTLVHYSAY YSSAKFMNVS IMGKVATRAS KELVKAKRGM MDMSEINSAL DIYQYMRKLE HLRFHLLQET LIAAKYEEMY LDFSLGKPLC EYAFVPYDPS SEDHLMPIVQ TISALTSSTV SKGVKSTLGT RGALENISNV TKSKATSSLQ SVMGLNVEPM EAFSQNLLEG VEDIDKNDFD REFKVRTDYM TIQEITERMR LFLTIQILDR YLEVQPVSKN PPEIGDFVYI TDNAYTKAQI IHFLRRNSKA GGVDGQKHTM EIAAAALCLS SKILEPDMEW KMALVLKNAP TAKFQAVRKK MDLADQMC

Some Facts
• $10^{14}$ cells in the human body.
• $3 \times 10^{9}$ letters in the DNA code in every cell in your body.
• DNA differs between humans by 0.2% (1 in 500 bases).
• Human DNA is 98% identical to that of chimpanzees.
• 97% of DNA in the human genome has no known function.

Topics in Bioinformatics
• Sequence analysis
  - Sequence alignment
  - Structure and function prediction
  - Gene finding
• Structure analysis
  - Protein structure comparison
  - Protein structure prediction
  - RNA structure modeling
• Expression analysis
  - Gene expression analysis
  - Gene clustering
• Pathway analysis
  - Metabolic pathways
  - Regulatory networks

Extension of Bioinformatics Concept
• Genomics
  - Functional genomics
  - Structural genomics
• Proteomics: large-scale analysis of the proteins of an organism
• Pharmacogenomics: developing new drugs that will target a particular disease
• Microarray: DNA chip, protein chip

Applications of Bioinformatics
• Drug design
• Identification of genetic risk factors
• Gene therapy
• Genetic modification of food crops and animals
• Biological warfare, crime, etc.
• Personalized medicine? E-Doctor?

Bioinformatics as Information Technology
• Database: GenBank, SWISS-PROT
• Information retrieval: biomedical text analysis
• Hardware: supercomputing
• Algorithm: sequence alignment
• Agent: information filtering, monitoring agent
• Machine learning: clustering, rule discovery, pattern recognition

Background of Bioinformatics
• Biological information infrastructure
  - Biological information management systems
  - Analysis software tools
  - Communication networks for biological research
• Massive biological databases
  - DNA/RNA sequences
  - Protein sequences
  - Genetic map linkage data
  - Biochemical reactions and pathways
• These resources need to be integrated to model biological reality and to exploit the biological knowledge that is being gathered.

Areas and Workflow of Bioinformatics
[Fig.] Areas of bioinformatics: raw sequence data (AGCTAGTTCAGTACA ...), microarray (biochip), structural genomics, functional genomics, proteomics and pharmacogenomics, all built on the infrastructure of bioinformatics.

DNA Chip Data Analysis: IT for BT

cDNA Microarray
[Fig.] cDNA microarray workflow: cDNA clones (probes) are amplified by PCR, purified and printed (0.1 nl/spot); the mRNA target is hybridized to the microarray; the array is scanned with two lasers (excitation and emission); the images are overlaid and normalized for microarray analysis.

The Complete Microarray Bioinformatics Solution
• Databases
• Data management
• Cluster analysis
• Statistical analysis
• Data mining
• Image processing
• Automation

DNA Chip Applications
• Gene discovery: gene/mutated gene
  - Growth, behavior, homeostasis …
• Disease diagnosis
  - Cancer classification
• Drug discovery: pharmacogenomics
• Toxicological research: toxicogenomics

Disease Diagnosis: Cancer Classification with DNA Microarray
• cDNA microarray data of 6567 gene expression levels [Khan ’01].
  - Filter the genes that are correlated with the cancer classification, using PCA and ANN learning.
  - Hierarchical clustering of the DNA chip samples based on the 96 filtered genes.
  - Disease diagnosis based on the DNA chip.
[Fig.] Flowchart of the experimental procedure.
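
The gene-filtering idea on the cancer classification slide can be illustrated with a much simpler stand-in. The sketch below does not reproduce the PCA and ANN procedure of [Khan ’01]; it assumes a plain ratio of between-class to within-class variation as the relevance score of each gene. The function names (`gene_scores`, `top_genes`) and the toy data are hypothetical.

```python
from statistics import mean

def gene_scores(expr, labels):
    """Score each gene by between-class over within-class variation.

    expr   : list of samples, each a list of expression levels (one per gene)
    labels : class label of each sample
    NOTE: a simple stand-in score, not the PCA + ANN filter of the slides.
    """
    classes = sorted(set(labels))
    scores = []
    for g in range(len(expr[0])):
        overall = mean(row[g] for row in expr)
        between = 0.0
        within = 0.0
        for c in classes:
            members = [row[g] for row, lab in zip(expr, labels) if lab == c]
            m = mean(members)
            between += len(members) * (m - overall) ** 2
            within += sum((v - m) ** 2 for v in members)
        # small constant avoids division by zero for perfectly flat genes
        scores.append(between / (within + 1e-12))
    return scores

def top_genes(expr, labels, k):
    """Return the indices of the k highest-scoring genes."""
    scores = gene_scores(expr, labels)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]

# Toy data (hypothetical): 4 samples x 3 genes; only gene 1 separates A from B.
expr = [[1.0, 5.0, 2.0],
        [1.2, 5.2, 1.0],
        [0.9, 0.1, 2.1],
        [1.1, 0.3, 0.9]]
labels = ["A", "A", "B", "B"]
selected = top_genes(expr, labels, 1)
```

Keeping only the top-ranked genes before clustering or classification plays the same role as reducing 6567 expression levels to 96 in the study cited above, although the ranking criterion here is an assumption.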

Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels
• Hierarchical clustering of cancer by 96 gene expression levels.
  - The relation between gene expression and cancer category.
  - Four cancer diagnostic categories.
[Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: ...).

AI Methods for DNA Chip Data Analysis
• Classification and prediction
  - ANNs, support vector machines, etc.
  - Disease diagnosis
• Cluster analysis
  - Hierarchical clustering, probabilistic clustering, etc.
  - Functional genomics
• Genetic network analysis
  - Differential models, relevance networks, Bayesian networks, etc.
  - Functional genomics, drug design, etc.

Cluster Analysis
[Fig.] A DNA microarray dataset partitioned into gene clusters (Gene Cluster 1, Gene Cluster 2, Gene Cluster 3, ...).

Methods for Cluster Analysis
• Hierarchical clustering [Eisen ’98]
• Self-organizing maps [Tamayo ’99]
• Bayesian clustering [Barash ’01]
• Probabilistic clustering using latent variables [Shin ’00]
• Non-negative matrix factorization [Shin ’00]
• Generative topographic mapping [Shin ’00]

Clustering of Cell Cycle-regulated Genes in S. cerevisiae (Yeast)
• Identify cell cycle-regulated genes by cluster analysis.
  - 104 genes are already known to be cell cycle-regulated.
  - The known genes are clustered into 6 clusters.
• Cluster the 104 known genes and the other genes together.
  - Genes in the same cluster tend to fall into similar functional categories.
[Fig.] Expression levels of the 104 known genes over the cell cycle (row: time step, column: gene).
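
The first method in the list above, hierarchical clustering, can be sketched in a few lines. This is a minimal illustration, assuming average linkage with a Pearson-correlation distance and assuming that no expression profile is constant over time; the function names and the toy profiles are hypothetical and are not the yeast data discussed on the slides.

```python
from math import sqrt

def pearson_distance(a, b):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between members of the two clusters
                d = sum(pearson_distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy expression profiles (hypothetical): genes 0 and 1 peak mid-cycle,
# genes 2 and 3 dip mid-cycle.
profiles = [
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [2.0, 4.0, 6.0, 4.0, 2.0],
    [3.0, 2.0, 1.0, 2.0, 3.0],
    [6.0, 4.1, 2.0, 4.0, 6.0],
]
result = average_linkage(profiles, 2)
```

Because the distance is based on correlation, genes 0 and 1 end up together even though their absolute expression levels differ by a factor of two; only the shape of the profile over the cell cycle matters.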

Probabilistic Clustering Using Latent Variables
• $g_i$: $i$th gene; $z_k$: $k$th cluster; $t_j$: $j$th time step
• $p(g_i \mid z_k)$: generating probability of the $i$th gene given the $k$th cluster
• $v_k = p(t \mid z_k)$: prototype of the $k$th cluster
• Cluster membership by Bayes' rule:
  $$p(z_k \mid g_i) = \frac{p(g_i \mid z_k)\, p(z_k)}{p(g_i)}$$
• Similarity between a gene profile and a prototype:
  $$\mathrm{similarity}(x_i, v_k) = \sum_j x_{ij} v_{kj}$$
• Objective function (maximized by EM):
  $$f(g, t, z) = \sum_i \sum_j g_{ij} \log \Big( \sum_k p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k) \Big) \qquad (*)$$

Experimental Result: Identify Cell Cycle-Regulated Genes
[Table] Clustering result with α-factor arrest data.
• In 4 clusters, genes with a high probability of being cell cycle-regulated were found.

Experimental Result: Prototype Expression Levels of Found Clusters
• The genes in the same cluster show similar expression patterns during the cell cycle.
• The genes with similar expression patterns are likely to have correlated functions.
[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).

Clustering Using Non-negative Matrix Factorization (NMF)
• NMF (non-negative matrix factorization): $G \approx WH$
• NMF as a latent variable model:
  $$(G)_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$
• $G$: gene expression data matrix
• $W$: basis matrix (prototypes)
• $H$: encoding matrix (in low dimension)
• Column-wise, $g \approx Wh$, with hidden units $h_1, \dots, h_r$ and observed units $g_1, \dots, g_n$
• Non-negativity constraints: $G_{i\mu},\ W_{ia},\ H_{a\mu} \ge 0$

Experimental Result: Five Clusters Found by NMF
• 5 prototype expression levels during the cell cycle.
[Fig.] Prototype expression level (0 to 0.18) against time step in the cell cycle (1 to 18).

Clustering Using Generative Topographic Mapping (GTM)
• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.
[Fig.] Generation maps a grid in the latent space (x1, x2) through y(x;W) into the data space (t1, t2, t3); visualization goes from the data space back to the latent space.

Experimental Result: Clusters Found by GTM
Three cell cycle-regulated clusters found by GTM
Fields per phase: cluster center(s); no. of train data / no. in cluster; correct no. / test data; overall mean expression levels (Cln/b) of known genes.
• S/G2: 5 / 1 / 2; (.148 .184 -.367 -.044)
• S: center (0.111 -0.333); 5 / 5; 5 / 5 (100%); (1.075 1.482 -.233 -.375)
• M/G1 (c1, c2, c3): centers (0.111 0.333), (-0.111 -0.111), (0.323 0.1); 13 / 7 / 2 / 2; 1 / 6, 0 / 6, 0 / 6; (-.171 -.573 .091 .311)
• G2/M (c1, c2): centers (0.111 0.333), (0.111 0.111); 10 / 5 / 3; 0 / 5, 3 / 5 (80%); (-.616 -1.01 1.832 1.596)
• G1 (c1, c2): centers (-0.111 0.333), (-0.111 0.111); 35 / 18 / 7; 10 / 16 (62%), 0 / 16; (.894 .907 -.766 -.479)

Experimental Result: Comparison with Other Methods
Comparison of prototype expression levels
Phase | Cluster | No. of selected genes (GTM) | Mean expression levels (GTM) | No. of selected genes (Spellman) | Mean expression levels (Spellman)
S/G2 | | 92 | (.13 -.06 -.1 .01) | 121 | (.13 .05 -.16 .03)
S | | 25 | (.84 .81 -.42 -.33) | 71 | (.46 .47 -.43 -.18)
M/G1 | c1 | 120 | (.82 .65 -.65 -.38) | 113 | (-.21 -.61 -.04 .07)
M/G1 | c2 | 34 | (-.04 -.37 -.01 -.11) | |
M/G1 | c3 | 10 | (.32 .29 -.3 .05) | |
G2/M | c1 | 33 | (-.59 -.96 1.34 1.29) | 195 | (-.32 -.62 .49 .54)
G2/M | c2 | 60 | (.08 -.30 .51 .57) | |
G1 | c1 | 122 | (.92 .74 -.62 -.33) | 300 | (.66 .49 -.55 -.33)
G1 | c2 | 74 | (.79 .82 -.48 -.34) | |
Total | | 570 | | 800 |

Genetic Network Analysis
• Discover the complex regulatory interactions among genes.
  - Disease diagnosis, pharmacogenomics and toxicogenomics
• Models
  - Boolean networks
  - Differential equations
  - Relevance networks [Butte ’97]
  - Bayesian networks [Friedman ’00] [Hwang ’00]
[Fig.] Basin of attraction of a 12-gene Boolean genetic network model [Somogyi ’96].

Bayesian Networks
• Represent the joint probability distribution among random variables efficiently, using the concept of conditional independence.
• An edge denotes the possibility of a causal relationship between nodes.
[Fig.] Example Bayesian network over the nodes A, B, C, D and E.
• D is independent of A and C given B.
• Observing C induces a dependency between A and B.
• E is independent of A and B given C.
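
The independence statements above can be checked numerically on a small example. The sketch below assumes that the example network has the edges A→C, B→C, B→D and C→E, that all variables are binary, and that the conditional probability values are arbitrary; none of these numbers come from the slides, and the helper names (`bern`, `joint`, `prob`) are hypothetical.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
p_a = {1: 0.3, 0: 0.7}                                        # P(A)
p_b = {1: 0.6, 0: 0.4}                                        # P(B)
p_c1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}   # P(C=1 | A, B)
p_d1 = {0: 0.2, 1: 0.8}                                       # P(D=1 | B)
p_e1 = {0: 0.3, 1: 0.6}                                       # P(E=1 | C)

def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(a, b, c, d, e):
    """Joint probability as the product of the local conditional tables."""
    return (p_a[a] * p_b[b] * bern(p_c1[(a, b)], c)
            * bern(p_d1[b], d) * bern(p_e1[c], e))

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(a=1, c=1)."""
    total = 0.0
    for values in product((0, 1), repeat=5):
        assign = dict(zip("abcde", values))
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(*values)
    return total

total = prob()                                        # should sum to 1
d_given_ab = prob(d=1, a=1, b=1) / prob(a=1, b=1)     # P(D=1 | A=1, B=1)
d_given_b = prob(d=1, b=1) / prob(b=1)                # P(D=1 | B=1)
ab = prob(a=1, b=1)                                   # P(A=1, B=1)
ab_given_c = prob(a=1, b=1, c=1) / prob(c=1)          # P(A=1, B=1 | C=1)
a_given_c = prob(a=1, c=1) / prob(c=1)
b_given_c = prob(b=1, c=1) / prob(c=1)
```

With these tables, `d_given_ab` equals `d_given_b` (D ignores A once B is known) and `ab` equals `prob(a=1) * prob(b=1)`, while `ab_given_c` differs from `a_given_c * b_given_c`: A and B become dependent once C is observed. Five binary variables need only 1 + 1 + 4 + 2 + 2 = 10 parameters in this factored form, against 31 for an unrestricted joint table.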
Joint distribution of the example network:
$$P(A,B,C,D,E) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)\,P(E \mid A,B,C,D) \quad \text{(by the chain rule)}$$
$$= P(A)\,P(B)\,P(C \mid A,B)\,P(D \mid B)\,P(E \mid C) \quad \text{(by the example Bayes net)}$$

Bayesian Networks Learning
• Dependence analysis [Margaritis ’00]
  - Mutual information and $\chi^2$ test
• Score-based search
  $$p(D, S) = p(S)\,p(D \mid S) = p(S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$
  - $D$: data, $S$: Bayesian network structure
  - NP-hard problem
  - Greedy search
• Heuristics to find good massive network structures quickly (local to global search algorithm)

The Small Bayesian Network for Classification of Cancer
[Fig.] Bayesian network over the nodes Zyxin, Leukemia class, LTC4S, C-myb and MB-1.
• The Bayesian network was learned by full search using the BD (Bayesian Dirichlet) score with an uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification (http://waldo.wi.mit.edu/MPR/).
[Table] Comparison of the classification performance with other methods [Hwang ’00].
Method | Training error | Test error
Bayes nets | 0/38 | 2/34
Neural trees | 0/38 | 1/34
RBF networks | 0/38 | 1.3/34

Large-Scale Bayesian Network with 1171 Genes
• Genetic networks for understanding the regulatory interactions among genes and their derivatives
  - Pharmacogenomics and toxicogenomics
[Fig.] The Bayesian network structure constructed from DNA microarray data for cancer classification (partial view).

DNA Computing: BT for IT

DNA Computing: BioMolecules as Computer
[Fig.] Binary data (011001101010001) alongside a DNA sequence (ATGCTCGAAGCT).

Why DNA Computing?
• $6.022 \times 10^{23}$ molecules/mole
• Immense, brute-force search of all possibilities
  - Desktop: $10^{9}$ operations/sec
  - Supercomputer: $10^{12}$ operations/sec
  - 1 mol of DNA: $10^{26}$ reactions
• Favorable energetics: Gibbs free energy $\Delta G = -8\ \mathrm{kcal\ mol^{-1}}$
  - 1 J for $2 \times 10^{19}$ operations
• Storage capacity: 1 bit per cubic nanometer

Flow of DNA Computing
[Fig.] DNA computing for the Hamiltonian Path Problem (HPP) on a graph with nodes 0 to 6. Stages shown: encoding, ligation, PCR (polymerase chain reaction), affinity column, gel electrophoresis and decoding.
• Node encoding: Node 0: ACG; Node 1: CGA; Node 2: GCA; Node 3: TAA; Node 4: ATG; Node 5: TGC; Node 6: CGT
• Ligation produces many candidate strands, e.g. ACGCGAGCATAAATGTGCACGCGT, ACGCGAGCATAAATGCGATGCACGCGT, ACGGCATAAATGTGCACGCGT, ...
• Solution strand: ACGCGAGCATAAATGTGCCGT, i.e. the node codes ACG CGA GCA TAA ATG TGC CGT in the order 0, 1, 2, 3, 4, 5, 6

Biointelligence on a Chip?
• Computing models: the limit of conventional computing models
• Computing devices: the limit of silicon semiconductor technology
[Fig.] Diagram linking information technology, biotechnology, molecular electronics, the biological computer, bioinformation technology and the biointelligence chip.

Intelligent Biomolecular Information Processing
[Fig.] Biocomputing system diagram with the labels: theoretical models, input A, controller, reaction chamber (calculating), bio-memory, bio-processor, output, GFP, cytochrome c.

Evolvable Biomolecular Hardware
• Sequence-programmable and evolvable molecular systems have been constructed as cell-free chemical systems using biomolecules such as DNA and proteins.

DNA Computers vs. Conventional Computers
DNA-based computers | Microchip-based computers
slow at individual operations | fast at individual operations
can do billions of operations simultaneously | can do substantially fewer operations simultaneously
can provide huge memory in a small space | smaller memory
setting up a problem may involve considerable preparation | setting up only requires keyboard input
DNA is sensitive to chemical deterioration | electronic data are vulnerable but can be backed up easily

Molecular Operators for DNA Computing
• Hybridization: complementary pairing of two single-stranded polynucleotides
  5’-AGCATCCA-3’ + 3’-TCGTAGGT-5’ →
  5’-AGCATCCA-3’
  3’-TCGTAGGT-5’
• Ligation: attaching sticky ends to a blunt-ended molecule
  ATGCATGC       TGAC          ATGCATGCTGAC
  TACG      +    TACGACTG  →   TACGTACGACTG
  (the unpaired overhang of each fragment is its sticky end)

Research Groups
• MIT, Caltech, Princeton University, Bell Labs
• EMCC (European Molecular Computing Consortium), composed of national groups from 11 European countries
• BioMIP Institute (BioMolecular Information Processing) at the German National Research Center for Information Technology (GMD)
• Molecular Computer Project (MCP) in Japan
• Leiden Center for Natural Computation (LCNC)

Applications of Biomolecular Computing
• Massively parallel problem solving
• Combinatorial optimization
• Molecular nano-memory with fast associative search
• AI problem solving
• Medical diagnosis
• Cryptography
• Drug discovery
• Further impact in biology and medicine:
  - Wet biological databases
  - Processing of DNA labeled with digital data
  - Sequence comparison
  - Fingerprinting

NACST (Nucleic Acid Computing Simulation Toolkit)
[Fig.] NACST architecture: DNA sequence generator, DNA sequence optimizer (genetic algorithm), GUI, and the NACST engine with a controller and ligation, PCR, electrophoresis, affinity column and enzyme units.

NACST
[Fig.] NACST inputs and outputs.

Combinatorial Problem Solver
TSP (Traveling Salesman Problem)
[Fig.] Weighted graph with vertices 0 to 6 (tour label 01234560) and the sequence representations:
• Vertex 1 = AGCT TAGG (P1A P1B)
• Vertex 2 = ATGG CATG (P2A P2B)
• Edge 1-2 = ATCC ATCA TACC (P1B W12 P2A)
• Edge 1-3 = ATCC GCCT GCTA
  (P1B W13 P3A)
• Vertex 3 = CGAT CGAA (P3A P3B)

Combinatorial Problem Solver
Weight representation methods
1. Molecules with high G-C content tend to hybridize easily.
2. Molecules with high G-C content tend to denature at higher temperatures.
3. Molecules with a larger population in the tube have a higher probability of hybridizing.
[Fig.] Procedures: hybridization/ligation, PCR/gel electrophoresis, affinity chromatography, PCR/gel electrophoresis; temperature gradient gel electrophoresis; graduate PCR.

Experimental Results for 4-TSP
• Hybridization (37 °C)
• Ligation (16 °C, 15 hr)
• PCR (36 cycles)
• Gel electrophoresis (10% polyacrylamide gel)
[Fig.] Gel images of the ligation result and the final PCR result (140 bp), with lanes for the 50 bp marker and the oligomer mixture.

Molecular Theorem Prover
Resolution refutation method
• Problem under consideration: given $P \wedge Q \rightarrow R$, $S \wedge T \rightarrow Q$, $S$, $T$ and $P$, is $R$ true?
• Turn $A \rightarrow B$ into $\neg A \vee B$ and add $R$ as $\neg R$:
  $\neg P \vee \neg Q \vee R$, $\neg S \vee \neg T \vee Q$, $S$, $T$, $P$, $\neg R$
[Fig.] Resolution tree: resolving these clauses (intermediate resolvents $\neg Q \vee R$, $\neg T \vee Q$, $Q$, $R$) ends in nil, so $R$ is true.

Molecular Theorem Prover (Abstract Implementation)
[Fig.] Implementation 1 and Implementation 2: two ways of arranging the clause strands (¬S ¬T Q; ¬Q ¬P R; S; T; P; ¬R) so that complementary literals hybridize.

Molecular Theorem Prover (Experiments for Method 1)
Experimental procedure
I. Mix the molecules: 100 pmol each, total 20 µl
II. Denaturation (95 °C, 10 min)
III. Annealing: 95 °C for 1 min, then cooling to 15 °C at 1 °C/min
IV. Polyacrylamide gel electrophoresis (PAGE), 20%
V. Detection of solution: 75 bp ds DNA
Experimental results
[Fig.] Gel image with lanes 1 to 6 (mixture and reaction) and a 20 bp DNA marker (Talara); the 20 bp and 200 bp positions are marked.

Solving Logic Problems by Molecular Computing
• Satisfiability problem: find Boolean values for the variables that make the given formula true.
• 3-SAT problem: every NP problem can be seen as the search for a solution that simultaneously satisfies a number of logical clauses, each composed of three variables.
[Fig.] Example formulas in conjunctive form, e.g. (x1 or x2 or x3) AND (x4 or x5 or x6).

DNA Computing with DNA Chips

DNA Chips for DNA Computing
I. Make: oligomer synthesis
II. Attach (immobilize): 5’HS-C6-T15-CCTTvvvvvvvvTTCG-3’
III. Mark: hybridization
IV. Destroy: enzyme reaction (e.g. EcoRI)
V. Unmark
   * Removes all strands that do not satisfy the problem.
VI. Readout: if a solution remains after the last step of the N cycles, amplify it by PCR to confirm.

Variable Sequences and the Encoding Scheme

Three-dimensional Plot and Histogram of the Fluorescence
• Solutions:
  - S3: w=0, x=0, y=1, z=1
  - S7: w=0, x=1, y=1, z=1
  - S8: w=1, x=0, y=0, z=0
  - S9: w=1, x=0, y=0, z=1
• Clause check for a solution such as S3:
  - $(w \vee x \vee y)$: satisfied by y=1
  - $(w \vee \neg y \vee z)$: satisfied by z=1
  - $(\neg x \vee y)$: satisfied by x=0 or y=1
  - $(\neg w \vee \neg y)$: satisfied by w=0
• Four spots with high fluorescence intensity correspond to the four expected solutions.
• The DNA sequences are identified in the readout step via addressed array hybridization.

Outlook
• IT is gaining importance in the advancement of BT.
  - Bioinformatics
  - DNA microarray data mining
• IT can benefit much from BT.
  - Biocomputing and biochips
  - DNA computing (with DNA chips)
• Bioinformation technology (BIT) is essential as a next-generation information technology.
  - In silico biology vs. in vivo computing

References
[Barash ’01] Barash, Y. and Friedman, N., Context-specific Bayesian clustering for gene expression data, Proc. of RECOMB’01, 2001.
[Butte ’97] Butte, A.J. et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc. Natl Acad. Sci. USA, 94, 1997.
[Eisen ’98] Eisen, M.B. et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl Acad. Sci. USA, 95, 1998.
[Friedman ’00] Friedman, N. et al., Using Bayesian networks to analyze expression data, Proc. of RECOMB’00, 2000.
[Heckerman ’95] Heckerman, D. et al., Learning Bayesian networks: the combination of knowledge and statistical data, Machine Learning, 20(3), 1995.
[Hwang ’00] Hwang, K.-B. et al., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, CAMDA’00, 2000.
[Khan ’01] Khan, J. et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7(6), 2001.
[Margaritis ’00] Margaritis, D. and Thrun, S., Bayesian network induction via local neighborhoods, Proc. of NIPS’00, 2000.
[Shin ’00] Shin, H.-J. et al., Probabilistic models for clustering cell cycle-regulated genes in the yeast, CAMDA’00, 2000.
[Somogyi ’96] Somogyi, R. and Sniegoski, C.A., Modeling the complexity of genetic networks: understanding multigenic and pleiotropic regulation, Complexity, 1(6), 1996.
[Tamayo ’99] Tamayo, P. et al., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl Acad. Sci. USA, 96, 1999.

More information at http://cbit.snu.ac.kr/ and http://bi.snu.ac.kr/