Development of the www.EcoliHub.org Information Resource

Barry L. Wanner
Purdue University
West Lafayette, IN 47907 USA
blwanner@purdue.edu

GeneSys Inaugural Meeting
National e-Science Centre, Edinburgh, UK
3 October 2008

Supported by NIH NIGMS U24 GM077905

Why E. coli?
• Bioinformatics
– Oldest molecular model system
  • Information is distributed among many established sites
  • E. coli information is frequently basal in other resources
  • Significant information predates electronic age (>100,000 literature articles)
– Information is deep
  • Majority of E. coli genes/proteins have known functions
  • Over half have known 3-D structures
  • Detailed enzymology
  • Genetic and physical interactions
  • Fundamental processes from DNA replication, transcription, translation, mutation, DNA repair, protein secretion, protein folding, protein repair, disulfide bond formation, cell division, DNA movement, protein movement
• Ca. 600 E. coli proteins are highly conserved in eucaryotes including humans.

Vision
1. Create biology-driven information resource that is comprehensive, accurate, and up-to-date for experimentalists and modelers.
2. Develop an integrated “one-stop-shopping” E. coli K-12 information resource to make full use of existing knowledge and to enable new discoveries leading to deeper understanding of life processes.
3. Implement web services (Web2) architecture that will interoperate across multiple resources via simple transparent interfaces, which will be broadly useful for development of other procaryote databases.
4. Develop bacterial database schema and a core database for data not now easily accessible.
5. Develop and nucleate a process for expert curation by members of the community (EcoliWiki).
6. Facilitate development of accurate and up-to-date annotation records for the K-12 group of organisms.

E. coli Information is deep
Data Overload
Wealth of detailed information on innumerable biochemical and molecular processes continues to accumulate.
High-throughput experimentation from DNA microarray, multiple E. coli genomes, ChIP-chip, proteomics, protein-protein interactions, metabolomics, experimental resources, genetic interactions are rapidly increasing.
• Part of what we don't know yet is how the things we do know fit together (Integration Tools)
• Finding the missing and inconsistent information is difficult (Conflict Detection)
• Requires facilities to compare and contrast information (What’s Different? What’s New?)
New ways are needed to make these data more accessible for reuse by others and for data integration.

Web 2.0 (web services)
• …the philosophy of mutually maximizing collective intelligence and added value for each participant by formalized and dynamic information sharing and creation

EcoliHub Goals
EcoliHub will not replace existing information resources.
Our goal is to add value to these resources by:
1. improving the ability to share information and computational Services among Resources,
2. improving the community’s ability to find information and resources (EcoliHub Websearch, Multi-site Search, Workbench),
3. providing new information and resources that 'fill in the gaps' between existing resources and improve the quality of information provided by all participating E. coli resources,
4. allowing resources to be combined (piped together) in new ways, without requiring additional development effort by the provider (Integration, under development).

Problems and approaches
Finding and sharing data from different resources
• EcoliHub - information from collaborating biological electronic data resources
Making data curation faster, cheaper, and better
• EcoliWiki - community annotation for E. coli K-12

Investigators
Walid G. Aref (Purdue)
Julio Collado-Vides (UNAM)
Tyrrell Conway (OU)
Michael R. Gribskov (Purdue)
Peter D. Karp (SRI)
Daisuke Kihara (Purdue)
James C. Hu (TAMU)
Hirotada Mori (NAIST)
Kenneth E. Rudd (Miami)
Debby Siegele (TAMU)
Todd J. Vision (UNC)
Barry L. Wanner (Principal, Purdue)

Management Team
Dawn R. Whitaker, Project Manager
Sara C. Ess, Assistant Project Manager
Deana L. Galema, Administrative Assistant
Ali Roumani, Lead Architect

EcoliHub People (current)
Shikha Agrawal, Lead Programmer, Purdue
Dave Clements, GMOD Help Desk, NESCent, UNC
Kirill A. Datsenko, Research Associate, Biology, Purdue
Hicham G. Emongui, Grad. R.A., Computer Science, Purdue
Joe Grissom, Lead Programmer, OU
James B. Hengenius, Grad. R.A., Bioinformatics, Purdue
Yi-Ju Hsieh, Grad. R.A., Biology, Purdue
Rajasekar Karthik, Programmer, Purdue
Yusuf Kaya, Post-doc Biocurator
Nathan Liles, Undergraduate Programmer, TAMU
Thomas McGrew, Programmer, Purdue
Brenley McIntosh, Post-doc, Bioinformatics, TAMU
Daniel Renfro, Lead Programmer, TAMU
Yasin Laura Silva, Grad. R.A., Computer Science, Purdue
Rikiya Takeuchi, Grad. R.A., Bioinformatics, NAIST
Matthew F. Traxler, Grad. R.A., OU
Anand Venkatraman, Grad. R.A., Bioinformatics, TAMU
Samuel D. Wehrspann, Web Developer, Purdue
John E. Wertz, Consultant, CGSC-Yale
Yifeng David Yang, Grad. R.A., Bioinformatics, Purdue
Jindan Zhou, Grad. R.A., Computer Science, Miami
Gregory R. Ziegler, Grad. R.A., Bioinformatics, Purdue
Adrienne Zweifel, Grad. R.A., Biochemistry, TAMU

Steering Committee
James J. Anderson (NIGMS)
Patricia C. Babbitt (UCSF)
Rex L. Chisholm (Northwestern)
Valentina di Francesco (NIAID)
Carol A. Gross (UCSF)
James C. Hu (TAMU)
Michael Hucka (Caltech)
Robert Landick (chair, Wisconsin)
Philip Matsumura (UIC)
Thomas J. Silhavy (Princeton)
Paul W. Sternberg (Caltech)
Barry L. Wanner (Purdue)
Owen R. White (Maryland)
Matthew E. Portnoy (ad hoc member; Program Director, NIGMS)

EcoliHub Goals: Providing Services and Resources
• Improving the community’s ability to find information and resources (EcoliHub Websearch, Multi-site Search, Workbench)
• Sharing and discussing information via a forum
• Training videos: help on how to use the resource

E. coli information pages indexed
• ASAP: A Systematic Annotation Package for Community Analysis of Genomes
• CGSC: The Coli Genetic Stock Center
• EcoCyc: Encyclopedia of Escherichia coli K-12 Genes and Metabolism
• EcoliWiki: EcoliHub's subsystem for community annotation
• EcoGene: Database of Escherichia coli Sequence and Function
• ECOR Collection: E. coli Reference Collection
• ecce: The E. coli Cell Envelope Protein Data Collection
• epd: E. coli protease database
• GenoBase: Functional Genomic Analysis of E. coli in Japan
• GIB: Genome Information Broker
• GTD: The Genomic Threading Database
• GtRNAdb: Genomic tRNA Database
• IS Finder: IS Database
• PEC: Profiling of E. coli Chromosome
• RegulonDB: a database on transcriptional regulation in Escherichia coli
• Rfam (Janelia): The Rfam database of RNA alignments and CMs
• RPG: Ribosomal Protein Gene Database
• TCDB: Transport Classification Database
• TransportDB: Genomic Comparisons of Membrane Transport Systems

EcoliHub Databases
• EcoliLiterature - a comprehensive database of all articles, book chapters, and books with basic information on E. coli, its phages, and plasmids
• EcoliPredict - computationally predicted and experimentally determined structures of proteins encoded by E. coli K-12
• EcoliWiki - community annotation system for EcoliHub
• GenExpDB - a comprehensive database of publicly deposited DNA microarray gene expression data on E. coli
• GenoBase - legacy E. coli database on comprehensive resources, e.g., Keio collection and ASKA ORFeome clone; now being further developed at EcoliHub

Participating Databases
• EcoCyc - professionally curated encyclopedic source of information on the genome, metabolic pathways, and regulatory network
• EcoGene - knowledgebase derived from extensive literature surveys and bioinformatics research that documents the functions of DNA, protein and RNA in E. coli K-12
• RegulonDB - source of highly curated knowledge on regulation of transcription initiation and operon organization, and regulatory networks

Making Resources Work Together
– Integration: bring all information to one place and centrally curate
  • Labor intensive
  • Puts the least expert people in charge of data
  • Does not scale
– Federation: leave data where it is and build a unified query system
  • Keeps experts in charge of data
  • Scales better
  • Superschema is logistically difficult to implement, and makes resources "rigid"
  • Difficult to extend (unanimity is essential)
– Interoperation: leave the data where it is, but provide well-defined low-level access
  • Keeps experts in charge
  • Scales
  • Freely extensible

EcoliHub Web Services
• Building a collaborative system
  – Community participation via interactive systems
  – Interoperation using web services where possible
  – Data warehousing or mirroring where necessary
• A framework for sharing
• Goal: a network of independent resources that easily make their respective information and computational services available to the community of E. coli resources
  – What can be shared?
  – Web services to share information (memes?)
  – Create bridge services to glue resources together
    • logical intersection
    • translation services
    • object transformations and mappers
    • portable displays
• Data (lookup and display)
• Annotation (lookup services)
  – Get gene information from gene ID
  – Find expression experiments from ID
• Calculations (computational services)
  – Sequence searches (e.g., BLAST)
  – Predictions (e.g., promoters, terminators, microRNAs, 3-D structure)
  – Pathway and network analysis (integration)
• Glue services
  – Translate from one ID to the corresponding ID: SwissProt to GenBank
  – Translate from one kind of object to another: DNA sequence to protein sequence
  – AND/OR/NOT/XOR of objects
  – Local storage and workspace
  – Provenance

Webservice/Workspace Integration

Web Services
• A simple transaction
[Diagram labels: Consumer (User), Keyword, IDfromKeyword, List of IDs, Vendor (EcoCyc, EcoGene)]

Web Services
Diesel fuel from E. coli
[Diagram labels: Hub and Spoke, Peer to peer, Workflows]
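
Example sketch: a simple transaction plus glue services

The "simple transaction" slide (a consumer sends a keyword to an IDfromKeyword service and gets back a list of IDs) and the glue services listed earlier (ID translation, AND/OR/NOT/XOR of objects) could be combined roughly as follows. This is only an illustrative sketch, not EcoliHub's actual interface: the `Vendor` class, the `id_from_keyword`, `translate_ids`, and `combine` functions, and every ID shown are hypothetical, and in-memory dictionaries stand in for remote web services. The two toy vendors play the role that EcoCyc and EcoGene play on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Vendor:
    """A resource answering keyword lookups (stand-in for a remote web service)."""
    name: str
    index: dict[str, set[str]] = field(default_factory=dict)  # keyword -> IDs

    def id_from_keyword(self, keyword: str) -> set[str]:
        # Hypothetical operation, named after the "IDfromKeyword" slide label.
        return set(self.index.get(keyword.lower(), set()))


def translate_ids(ids: set[str], mapping: dict[str, str]) -> set[str]:
    """Glue service: map IDs into another namespace, dropping unmapped IDs."""
    return {mapping[i] for i in ids if i in mapping}


def combine(a: set[str], b: set[str], op: str) -> set[str]:
    """Glue service: AND/OR/NOT/XOR of two ID lists (KeyError on unknown op)."""
    ops = {
        "AND": a & b,  # reported by both resources
        "OR": a | b,   # reported by either resource
        "NOT": a - b,  # in the first list but not the second
        "XOR": a ^ b,  # reported by exactly one resource
    }
    return ops[op]


# Toy data: all IDs below are invented placeholders, not real accessions.
vendor_a = Vendor("vendor_a", {"phosphate": {"A1", "A2", "A3"}})
vendor_b = Vendor("vendor_b", {"phosphate": {"B2", "B3", "B4"}})
A_TO_B = {"A1": "B1", "A2": "B2", "A3": "B3"}  # hypothetical ID mapping

# Consumer side: one keyword, two vendors, results brought into one namespace.
hits_a = translate_ids(vendor_a.id_from_keyword("Phosphate"), A_TO_B)
hits_b = vendor_b.id_from_keyword("phosphate")

print(sorted(combine(hits_a, hits_b, "AND")))  # ['B2', 'B3']
print(sorted(combine(hits_a, hits_b, "XOR")))  # ['B1', 'B4']
```

Because each vendor only exposes a small lookup operation, the consumer can pipe results through translation and set logic without the vendors changing anything, which is the point of the interoperation approach described above. The XOR result is also one simple way to surface entries that only one resource reports.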