High-throughput biology data management and data-intensive computing drivers
George Michaels

The Scope of the Problem
• A highly multidimensional world of complicated dynamic events
• Both synchronous and asynchronous processes
• Vast scales of time and space
• A hierarchy of simultaneous levels of activity
• Thousands of types of cells and environments

It's All About the Complexity
• The human genome has changed the way biologists approach scientific challenges.
• Biology is an information science.
• Biology applications are scaling at a rate that exceeds computing capability.
• GTL presents the opportunity to expand throughput by 5- to 50-fold per year.

Billions of Bases in GenBank
[Figure: billions of bases in GenBank by year, 1982 to 2002]
• According to the GOLD database, there are 146 published genomes, 344 ongoing prokaryotic genome projects, and 243 ongoing eukaryotic genome projects.
• DOE never supported a comprehensive and effective data management and curation program for GenBank.
• The Protein Data Bank (PDB) is a repeat of the same scenario.
• Both database efforts were ahead of the science that capitalized on the work.
• Curation and provenance strategies are still unsolved hard problems for these data.

Growth of Proteomic Data vs. Sequence Data
[Figure: log-scale plot of data volume in petabytes (0.00001 to 1000) against year (1988 to 2016), with series for proteomic data and GenBank]

From BERAC, December 2002

Computing Issues for GTL Facilities and Projects
Creating an Integrated Computational Biology Environment
6. Infrastructure
5. The Community Data Resource
4. Interpretation / Modeling / Simulation
3. Data Analysis / Reduction
2. Data Capture and Archiving
1. LIMS & Workflow Management

Central Role of GTL Facilities in Compute Planning
• The GTL Facilities will represent the cornerstone of the GTL enterprise and major sites for development of computing systems.
• They will generate massive amounts of data for use by the community and for constructing models of biology.
• The facilities will be the sites where experiment workflow must be facilitated, data must be analyzed, and systems biology data and models provided to the community.
• They are likely to contain integrated high-performance computing, shared suites of tools to analyze data, and massive data archives.
• Their combined and integrated output will become the major portion of the GTL community resource (GTL knowledge base).

Need New Data Handling and Computing Resources to Handle Data Tsunami
[Figure labels: DATA; Current data infrastructure; Help!]

Experiment Design Metadata Issues
• Experiment design context provides the most powerful context-dependent annotation for gene/protein activities.
• Experiment designs will evolve over time.
• Experiment designs should specify what data needs capturing.
• Statistical experiment designs should drive discovery activities.
• Flexible approaches are needed to adapt to new data collection modes and data types.
• Model-driven experimentation needs to include the prediction/hypothesis tested.
Experiments [samples, genetics, treatments, conditions, time, [quality measures]]
Samples [attributes, [measurements, [qc measures]]]

GTL Experiment Template
Experiment templates for a single microbe

Class of experiment              Time    Treatments  Conditions  Genetic   Biological   Total biological  Proteomics  Metabolite  Transcription
                                 points                          variants  replication  samples           data (TB)   data (TB)   data (TB)
simple (scratching the surface)  10      1           3           1         3            90                18.0        13.5        0.018
moderate                         25      3           5           1         3            1125              225.0       168.8       0.225
upper mid                        50      3           5           5         3            11250             2250.0      1687.5      2.25
complex                          20      5           5           20        3            30000             6000.0      4500.0      6
real interesting                 20      5           5           50        3            75000             15000.0     11250.0     15

Profiling methods and per-sample size assumptions:
• Proteomics: a possible 6000 proteins per microbe, assuming ~200 GB per sample
• Metabolites: a panel of 500-1000 different molecules, assuming ~150 GB per sample
• Transcription: 6000 genes and 2 arrays per sample, ~100 MB per array (the tabulated volumes correspond to ~200 MB per sample)

Typically, a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples.

The GTL Informatics Whole Picture
[Figure: architecture diagram; labels as extracted, grouped by apparent layer]
• "The GTL ORACLE": Protein Production DB; Protein Expression and Regulation DB; Protein Machines DB; Cell & Community Systems DB
• Shared Tools Libs: Expression Analysis Lib; Modeling & Simulation Tools Lib; Regulatory network modeling tools; Confocal Image analysis tools Lib; Protein Machine modeling tools; Mass spec analysis tools Lib; Molecular Dynamics Simulation Library; ...
• Large-scale shared bulk data archives: Expression Archive; Image Archive; MassSpec Archive; ...
• Facility x and Facility y (each): Output to community data resource; Modeling and Simulation; Data Analysis / Reduction; Data Capture and Archiving DBs; LIMS & Workflow Management
• Shared LIMS / Workflow

Community Data Resource: What's in the Knowledgebase?
1. Protein Production DB
- microbial baseline annotation, genes, proteins...
- catalog of proteins and reagents produced / inventory
- biophysical and biochemical characterizations of proteins
- protocols and methods
2. Protein Expression & Regulation DB
- protein expression data per condition per microbe
- regulatory networks based on expression data
- metabolite / metabolic network data
- protocols and methods
3. Protein Machines DB
- protein machines catalog
- protein machines models of organization / dynamics
- protein interaction network models and simulations
- protocols and methods
4. Cell and Community Systems DB
- in vivo cell measurements of expression / machines
- measurements of community interactions / metabolism
- integrated cell models (regulation, metabolism, signaling)
- integrated community models

Facility 1 Data Resources
- Microbial genome baseline annotation
- Proteins and reagents catalog
- Protein biophysical / biochemical data
- Protein production protocols / methods
Facility 2 Data Resources
- Protein expression DB
- Regulatory network models database
- Metabolic network models database
- Cell growth & methods & protocols
Facility 3 Data Resources
- Protein machines catalog
- Protein machines models & simulations
- Interaction network models database
- Protein machines protocols / methods DB
Facility 4 Data Resources
- Cell models and simulations
- Community models and simulations
- In vivo protein and machine expression / localization
- Community metabolism and interactions

Community Data Resource R&D Challenges
• Design and integration of the major databases
- Huge data volumes and great schema complexity: need for new types of databases (hardware and software)
- Database technologies: object-relational, graph DBs, …
- Data standards, representations, and ontologies for very complex objects
• User access systems for browsing, query, visualization, and running analyses or simulations
• Supporting simulation from DBs: how to allow users to utilize models and run simulations; how to link simulations to underlying data
• Integration
- Provide an integrated view of the biology
- With data from other community sources
• Community access to compute power to run long-timescale simulations
• IP issues and reward system
• How to represent incomplete, sparse, conflicting data
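
The sample counts and data volumes in the GTL Experiment Template table follow from multiplying the design factors together and scaling by the per-sample sizes quoted under the table. The Python sketch below reproduces that arithmetic. It makes two assumptions: units are decimal (1 TB = 1000 GB), and the transcription note means ~100 MB per array with 2 arrays per sample. The class, constant, and function names are illustrative and do not come from the slides.

```python
from dataclasses import dataclass

# Per-sample sizes in GB, from the notes under the template table.
# TRANSCRIPTION_GB is an assumption: ~100 MB per array x 2 arrays per sample.
PROTEOMICS_GB = 200.0
METABOLITE_GB = 150.0
TRANSCRIPTION_GB = 0.2
GB_PER_TB = 1000.0  # assumption: decimal terabytes


@dataclass(frozen=True)
class ExperimentTemplate:
    """One row of the experiment template table (hypothetical structure)."""
    name: str
    time_points: int
    treatments: int
    conditions: int
    genetic_variants: int
    replication: int

    def total_samples(self) -> int:
        # Full factorial design: every combination of factors is sampled.
        return (self.time_points * self.treatments * self.conditions
                * self.genetic_variants * self.replication)

    def volumes_tb(self) -> dict:
        # Data volume per profiling method, in TB.
        n = self.total_samples()
        return {
            "proteomics": n * PROTEOMICS_GB / GB_PER_TB,
            "metabolite": n * METABOLITE_GB / GB_PER_TB,
            "transcription": n * TRANSCRIPTION_GB / GB_PER_TB,
        }


# Design factors copied from the table rows.
TEMPLATES = [
    ExperimentTemplate("simple", 10, 1, 3, 1, 3),
    ExperimentTemplate("moderate", 25, 3, 5, 1, 3),
    ExperimentTemplate("upper mid", 50, 3, 5, 5, 3),
    ExperimentTemplate("complex", 20, 5, 5, 20, 3),
    ExperimentTemplate("real interesting", 20, 5, 5, 50, 3),
]

for t in TEMPLATES:
    v = t.volumes_tb()
    print(f"{t.name:18s} samples={t.total_samples():6d} "
          f"proteomics={v['proteomics']:.1f} TB "
          f"metabolite={v['metabolite']:.2f} TB "
          f"transcription={v['transcription']:.3f} TB")
```

Under these assumptions the computed sample counts match the table's "total biological samples" column. The metabolite volume for the "moderate" row comes out as 168.75 TB, which the table shows rounded to 168.8.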