Contents of this Talk
[Used as intro to Genome Databases Seminar, 2002]
  - Overview of bioinformatics
  - Motivations for genome databases
  - Analogy of virus reverse-eng to genome analysis
  - Questions to ask of a genome DB

Overview of Genome Databases
  Peter D. Karp, Ph.D.
  SRI International
  pkarp@ai.sri.com
  www-db.stanford.edu/dbseminar/seminar.html

Talk Overview
  - Definition of bioinformatics
  - Motivations for genome databases
  - Computer virus analogy
  - Issues in building genome databases

Definition of Bioinformatics
  - Computational techniques for management and analysis of biological data and knowledge
  - Methods for disseminating, archiving, interpreting, and mining scientific information
  - Computational theories of biology
  - Genome Databases is a subfield of bioinformatics

Motivations for Bioinformatics
  - Growth in molecular-biology knowledge (literature)
  - Genomics
    1. Study of genomes through DNA sequencing
    2. Industrial Biology

Example Genomics Datatypes
  - Genome sequences
    - DOE Joint Genome Institute
      - 511M bases in Dec 2001
      - 11.97G bases since Mar 1999
  - Gene and protein expression data
  - Protein-protein interaction data
  - Protein 3-D structures

Genome Databases
  - Experimental data
    - Archive experimental datasets
    - Retrieving past experimental results should be faster than repeating the experiment
    - Capture alternative analyses
    - Lots of data, simpler semantics
  - Computational symbolic theories
    - Complex theories become too large to be grasped by a single mind
    - The database is the theory
    - Biology is very much concerned with qualitative relationships
    - Less data, more complex semantics

Bioinformatics
  - Distinct intellectual field at the intersection of CS and molecular biology
    - Distinct field because researchers in the field should know CS, biology, and bioinformatics
    - Spectrum from CS research to biology service
  - Rich source of challenging CS problems
  - Large, noisy, complex data-sets and knowledge-sets
  - Biologists and funding agencies demand working solutions

Bioinformatics Research
  - algorithms + data structures = programs
  - algorithms + databases = discoveries
  - Combine sophisticated algorithms with the right content:
    - Properly structured
    - Carefully curated
    - Relevant data fields
    - Proper amount of data

Goals of Systems Biology
  - Catalog the molecular parts lists of cells
  - Understand the function(s) of each part
  - Understand how those parts interact to produce the behavior of a cell or organism
  - Understand the evolution of those molecular parts

Analogy: Genome Analysis and Virus Analysis
  - Given: Virus binary executable file for known machine architecture
  - Reverse engineer the program
    - Procedures
    - Call graph
    - Specifications for I/O behavior of the program and all procedures
  - Capture and publish an annotated analysis of the virus
  - Comparative analysis of related viruses

Genome Analysis Example
  - Given: M. tuberculosis genome
    - 4.4Mbp of DNA (genome)
  - Infer:
    - Molecular parts list of Mtb
    - A model of the biochemical machinery of Mtb cell
  - DNA is a blueprint for the program of life

Start
  - Virus: 4.4Mbyte binary program
  - Genome: 4.4Mbp DNA sequence

Step 1
  - Virus: Distinguish code from data segments; find procedure boundaries
  - Genome: Distinguish coding from non-coding regions (gene finding)

Step 2
  - Virus: Predict semantics of procedures
  - Genome: Predict gene functions
  [Figure: nodes A, B, C, D]

Step 3
  - Virus: Predict procedure call graph
  - Genome: Predict biochemical and gene networks
  [Figure: graphs over nodes A, B, C, D]

Step 4
  - Virus: Predict conditions under which procedures are invoked
  - Genome: Predict expression of network fragments
  [Figure: graph over nodes A, B, C, D, Q, R, S]

Step 5
  - Virus: Infer complete program specification
  - Genome: Formulate dynamic cellular simulation

Step 6
  - Virus: Internet publishing of structured program annotation with explanations, references, commentary
  - Genome: Internet publishing of structured genome annotation with explanations, references, commentary

Step 7
  - Virus: Comparative analysis of viruses; evolutionary relationships among viruses
  - Genome: Comparative analysis of genomes; evolutionary relationships among genomes

Step 8
  - Virus: Identify measures to disable virus or prevent its spread
  - Genome: Identify target proteins for anti-microbial drug discovery
  [Figure: graph over nodes A, B, C, D, Q, R, S]

Database of Viruses
  - Create a database that stores
    - Binaries for all viruses
    - All annotation of virus programs by different investigators
    - Comparative analyses
  - Support
    - Remote API access
    - Click-at-a-time browsing

Reference on Major Genome Databases
  - Nucleic Acids Research Database Issue
    - http://nar.oupjournals.org/content/vol30/issue1/
    - 112 databases

Questions to Ask of a New Genome Database

What are Database Goals and Requirements?
  - How many users?
  - What expertise do users have?
  - What problems will database be used to solve?

What is its Organizing Principle?
  - Different DBs partition the space of genome information in different dimensions
    - Experimental methods (Genbank, PDB)
    - Organism (EcoCyc, Flybase)

What is its Level of Interpretation?
  - Laboratory data
  - Primary literature (Genbank)
  - Review (SwissProt, MetaCyc)
  - Does DB model disagreement?

What are its Semantics and Content?
  - What entities and relationships does it model?
  - How does its content overlap with similar DBs?
  - How many entities of each type are present?
  - Sparseness of attributes and statistics on attribute values

What are Sources of its Data?
  - Potential information sources
    - Laboratory instruments
    - Scientific literature
      - Manual entry
      - Natural-language text mining
    - Direct submission from the scientific community (Genbank)
  - Modification policy
    - DB staff only
    - Submission of new entries by scientific community
    - Update access by scientific community

What DBMS is Employed?
  - None
  - Relational
  - Object oriented
  - Frame knowledge representation system

Distribution / User Access
  - Multiple distribution forms enhance access
  - Browsing access with visualization tools
  - API
  - Portability

What Validation Approaches are Employed?
  - None
  - Declarative consistency constraints
  - Programmatic consistency checking
  - Internal vs external consistency checking
  - What types of systematic errors might DB contain?
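
Example: declarative consistency constraints

The validation question above lists declarative consistency constraints as one option. The following is a minimal sketch of what that could look like in a relational DBMS, using Python's standard sqlite3 module. The organism and gene tables, their columns, and the gene names are hypothetical and are not taken from any database named in this talk.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical gene schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE organism (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE gene (
            id          INTEGER PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES organism(id),
            name        TEXT NOT NULL,
            start_pos   INTEGER NOT NULL CHECK (start_pos >= 1),
            end_pos     INTEGER NOT NULL,
            strand      TEXT NOT NULL CHECK (strand IN ('+', '-')),
            CHECK (end_pos > start_pos)
        );
    """)
    conn.execute("INSERT INTO organism (id, name) VALUES (1, 'M. tuberculosis')")
    conn.commit()
    return conn

def try_insert(conn, row):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO gene (organism_id, name, start_pos, end_pos, strand) "
                "VALUES (?, ?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False

if __name__ == "__main__":
    conn = make_db()
    print(try_insert(conn, (1, "geneA", 100, 400, "+")))   # well-formed row
    print(try_insert(conn, (1, "geneB", 500, 200, "+")))   # end before start
    print(try_insert(conn, (99, "geneC", 100, 400, "-")))  # unknown organism
```

The constraints are stated once in the schema and enforced by the DBMS on every insert, which is the contrast with programmatic checking, where each loading script has to remember to run the checks itself.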
Database Documentation
  - Schema and its semantics
  - Format
  - API
  - Data acquisition techniques
  - Validation techniques
  - Size of different classes
  - Coverage of subject matter
  - Sparseness of attributes
  - Error rates

Relationship of Database Field to Bioinformatics
  - Scientists generally ignorant of basic DB principles
    - Complex queries vs click-at-a-time access
    - Data model
    - Defined semantics for DB fields
    - Controlled vocabularies
    - Regular syntax for flatfiles
    - Automated consistency checking
  - Most biologists take one programming class
  - Evolution of typical genome database
  - Finer points of DB research off their radar screen
  - Handful of DB researchers work in bioinformatics

Database Field
  - For many years, the majority of bioinformatics DBs did not employ a DBMS
    - Flatfiles were the rule
    - Scientists want to see the data directly
    - Commercial DBMSs too expensive, too complex
    - DBAs too expensive
  - Most scientists do not understand
    - Differences between BA, MS, PhD in CS
    - CS research vs applications
    - Implications for project planning, funding, bioinformatics research

Recommendation
  - Teaching scientists programming is not enough
  - Teaching scientists how to build a DBMS is irrelevant
  - Teach scientists basic aspects of databases and symbolic computing
    - Database requirements analysis
    - Data models, schema design
    - Knowledge representation, ontologies
    - Formal grammars
    - Complex queries
    - Database interoperability
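
Example: a complex query vs click-at-a-time access

The slides above contrast complex queries with click-at-a-time access. The sketch below illustrates that contrast with Python's standard sqlite3 module: one declarative query answers a question that would otherwise mean visiting every reaction page and tallying genes by hand. The schema, table names, and all identifiers (g1, r1, pathway_X, and so on) are invented for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gene     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE reaction (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       pathway TEXT NOT NULL);
CREATE TABLE catalyzes (
    gene_id     INTEGER NOT NULL REFERENCES gene(id),
    reaction_id INTEGER NOT NULL REFERENCES reaction(id),
    PRIMARY KEY (gene_id, reaction_id)
);
"""

def build_example_db():
    """Load a tiny hypothetical dataset into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "g1"), (2, "g2"), (3, "g3")])
    conn.executemany("INSERT INTO reaction VALUES (?, ?, ?)",
                     [(1, "r1", "pathway_X"),
                      (2, "r2", "pathway_X"),
                      (3, "r3", "pathway_Y")])
    # (gene_id, reaction_id): the gene's product catalyzes the reaction.
    conn.executemany("INSERT INTO catalyzes VALUES (?, ?)",
                     [(1, 1), (2, 1), (2, 2), (3, 3)])
    conn.commit()
    return conn

def genes_per_pathway(conn):
    """Count distinct genes whose products catalyze a reaction in each pathway."""
    rows = conn.execute("""
        SELECT r.pathway, COUNT(DISTINCT c.gene_id)
        FROM reaction AS r
        JOIN catalyzes AS c ON c.reaction_id = r.id
        GROUP BY r.pathway
        ORDER BY r.pathway
    """).fetchall()
    return dict(rows)

if __name__ == "__main__":
    print(genes_per_pathway(build_example_db()))
```

The same question asked of a flatfile or a browse-only web interface requires custom parsing or many page visits, which is one reason the recommendation above includes complex queries among the database basics to teach.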