DOE Data Workshop
View from Information-intensive Applications

H. Steven Wiley
Biomolecular Systems Initiative
Pacific Northwest National Laboratory (www.sysbio.org)
U.S. Department of Energy

Information Intensive Science
- Goals of IIS
  - Understanding systems versus individual phenomena
  - Strengthening/automating links between different types of data from different scales
- Examples
  - Biology: Cell Signaling
  - Biology: BIRN
  - Chemistry: CMCS
  - Homeland Defense
- Complexity of systems is becoming pervasive
- Challenges
  - Efficient federation, graph-based queries
  - Continuous data correlation
  - Managing complex experiments, data provenance using multiple independent data and analysis resources
- Priorities
  - High-performance federation, data mining, semantic query capabilities (software, hardware architecture)
  - Knowledge environments (lightweight, evolvable, powerful, …)
  - Organization and visualization of large-scale, complex information

Combustion is a Multi-scale Chemical Science Challenge
- A systems-science approach to address complex problems
  - New knowledge is assimilated from different data, tools, and disciplines at each scale
  - Real-time bi-directional information flow
  - Deep analysis across scales
  - Multiple applications for the same information
- Challenges
  - Data, provenance, annotation publication
  - Syntactic and semantic federation
  - Standardization versus innovation
- Examples:
  - IUPAC – update of radical thermochemistry reference values by global expert group
  - PrIMe – community developed optimized reaction mechanisms guiding experimental plans across scales, providing community resources for applied research

Homeland Security: Pulling insight out of information overload
- Communications, Shipping, Financial, Sensors, Immigration
- Is there a domestic terrorist plot?
- Can we detect and prevent a terrorist attack BEFORE it happens?
- Volume of data, orders of magnitude larger and at different levels of abstraction
- Complexity of information spaces extending into very high dimensions, with 200 the norm
- Information often out of context, incomplete, fuzzy
- Deception
- Information in all media types: text, imagery, video, voice, web, sensor data
- Time and temporal dynamics fundamentally change the approach
- Spatial, yet non-spatial abstract data
- Multiple ontologies, languages, cultures
- Privacy issues
For homeland security and science we now turn to data-intensive visual analytics

Systems Biology of Cells
- Cell function: death, proliferation, differentiation, migration, ...
- Molecular parameters: protein levels / states / locations / interactions / activities
- Ultimate aim: Understanding and prediction of effects of component properties

What, Where, Quantity, Quality?
To successfully model a complex biological system, one must minimally know the following information:
- What parts are being made? (identity)
- How is the regulatory network structured? (interactions)
- Where are the proteins located in the cell? (location)
- What are their levels? (quantity)
- How do they interact with their partners? (activity)
  - As a function of covalent modification
  - Contribution of steric restrictions
  - Forward and reverse rate constants

Cells as Input-Output Systems
- Biologists look at their experiments as input-output systems
- We start with a “defined” system to which we apply a stimulus (input: independent variable)
- We then look for a specific response (output: dependent variable)
- The relationship between the input and output provides insight into the workings of the system
[Diagram: Input -> System (unknown context) -> Output]
So unless we control the experimental context, we cannot interpret our experiments

The Two Greatest Challenges of Systems Biology
1. Working with indeterminate systems
2. Understanding context - what it is and how to control and capture it

Defining the composition of living systems is driving analytical technologies
- Genomics
- Proteomics
- Metabonomics
- Expression profiling
- Imaging
- Etc.
All of these technologies seek to rigorously define the composition of living systems

Global simultaneous quantitative proteome measurements
- Proteins identified and quantified using accurate mass and time (AMT) tags
- Capillary LC-FTICR 2-D display of peptides from a yeast soluble protein digest
- >160,000 isotopic distributions corresponding to >100,000 polypeptides detected
[Figure: 2-D display of detected peptides. Dimension one: separation time (LC elution time, min). Dimension two: accurate mass (MW; m/z).]

High Throughput Proteomics
- 1 experiment per hour
- 5000 spectra per experiment
- 4 MBytes per spectrum
- Per instrument: 20 GBytes per hour, 480 GBytes per day
These are based on today's technologies.
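The per-instrument rates above follow directly from the three quoted inputs. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative, and it assumes decimal units (1 GByte = 1000 MBytes), which is the convention that matches the slide's totals.

```python
# Figures quoted on the slide (per instrument).
EXPERIMENTS_PER_HOUR = 1
SPECTRA_PER_EXPERIMENT = 5000
MBYTES_PER_SPECTRUM = 4

# Assumption: decimal units, which reproduce the slide's 20 and 480 GByte totals.
MBYTES_PER_GBYTE = 1000


def gbytes_per_hour() -> float:
    """Raw spectrum volume produced by one instrument in one hour."""
    mbytes = EXPERIMENTS_PER_HOUR * SPECTRA_PER_EXPERIMENT * MBYTES_PER_SPECTRUM
    return mbytes / MBYTES_PER_GBYTE


def gbytes_per_day() -> float:
    """Raw spectrum volume for 24 hours of continuous operation."""
    return gbytes_per_hour() * 24


if __name__ == "__main__":
    print(f"{gbytes_per_hour():.0f} GBytes/hour, {gbytes_per_day():.0f} GBytes/day")
```

Changing any one constant shows how quickly the daily volume scales with spectrum size or instrument duty cycle.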
[Figure: 9.4 Tesla High Throughput Mass Spectrometer]
- Time to analyze offsite: 1 week
- Time to analyze onsite: 48 hours
- Time to analyze onsite with smart storage: 2 hours

Integrated, High-throughput Experiments will Generate Enormous Amounts of Data
Experiment templates for a single microbe

Class of experiment              simple   moderate   upper mid   complex   real interesting
Design factors (time points,         10         25          50        20                 20
  treatments, genetic variants,       1          3           3         5                  5
  conditions)                         3          5           5         5                  5
                                      1          1           5        20                 50
Biological replication                3          3           3         3                  3
Total biological samples             90       1125       11250     30000              75000
Proteomics data volume in TB        1.8       22.5       225.0     600.0             1500.0
Metabolite data in TB               1.4       16.9       168.8     450.0             1125.0
Transcription data in TB          0.009     0.1125       1.125         3                7.5
(simple: scratching the surface)

Profiling method
- Proteomics: looking at a possible 6000 proteins per microbe, assuming ~20 GB per sample
- Metabolites: looking at a panel of 500-1000 different molecules, assuming ~15 GB per sample
- Transcription: 6000 genes & 2 arrays per sample, ~100 MB
Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples

The Molecular Interaction Scaffold is Huge
Trey Ideker

Cell Imaging
New multispectral, multidimensional imaging techniques can generate enormous amounts of data

Cell Imaging Workflow
Complex set of metadata collected here

How Much Data From Imaging?
- Currently, a high quality image of a single cell field is 4 MB per image, obtained at 4 fps (16 MB/s)
- Following a cell through one cell cycle takes 24 h, or approximately 1.4 TB
- New hyperspectral microscopes analyzing only 10 wavelengths would generate 7 TB/day
- Characterizing dynamics of the most abundant set of genes (4000) would require 5.5 PB
This is for a single instrument and a single experiment using today’s technology

Understanding the influence of cell context is driving experimental and computational biology
- Cell signaling
- Developmental biology
- Cancer and growth control
- Host-pathogen interactions
- Dynamics of microbial communities
- Cellular responses to stress

Computational Modeling Approaches -- Diverse Spectrum
[Diagram: spectrum from SPECIFIED to ABSTRACTED. Labels: mechanisms (including structure), influences, relationships; differential equations, Markov chains, Boolean models, Bayesian networks, statistical mining]

Computer Models Allow Reconstruction of Processes Across Different Scales
[Figure: MODEL DATABASE diagram. Entities: Species, Organ, Tissue, Cell, Model, Data Set. Fields: Unique ID, Model Name, Model Descr., Default Par., Default Comp., Timestamp, Security, Equation, Compute Par., Initial Conditions, Solution Par., Chemical Par. (React. Rates, Concen. Val.), Geometric Par., Parameter Docs., References, Limits, Symbolic Source, Docs.]

Obstacles preventing scientists from utilizing available data
- Data is distributed across many repositories with various ontologies and data formats
- Analysis tools do not address integration of heterogeneous data sets
- Few informatics-based analysis tools support a systems biology approach
- Collaboration capabilities are too primitive to support shared knowledge among researchers

The Challenge for Data Handling is Two-fold
1. Managing the massive amounts of compositional data necessary to define all of the relevant experimental systems
2. Capturing all of the data on the relationships between context, composition and response
Integration of the analytical and experimental methodologies into a single system is necessary to link all of the data in a useful way

END

Understanding Living Cells
- Cell responses are multiphasic
- Different classes of stimulants (information) are processed at characteristic time scales
- Processing nodes within cells are spatially segregated
- Each cell responds independently depending on its specific context
- A response generally induces a reprogramming of the cell machinery
To create cell simulations, we must “abstract” this information to create a reference model which can then be modified
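
Appendix sketch: the data-volume estimates quoted earlier in the deck (the experiment templates for a single microbe, and the single-cell imaging figures) can be cross-checked with a few lines of Python. This is a rough sketch with hypothetical names, not the method used to prepare the slides. It assumes decimal units, takes the factor levels in the order they are listed in the templates table (the product does not depend on which factor each level belongs to), and makes no attempt to reproduce the 7 TB/day hyperspectral figure, whose inputs the slide does not state.

```python
from math import prod

# Factor levels per experiment class, in the order listed in the templates table.
TEMPLATES = {
    "simple": (10, 1, 3, 1),
    "moderate": (25, 3, 5, 1),
    "upper mid": (50, 3, 5, 5),
    "complex": (20, 5, 5, 20),
    "real interesting": (20, 5, 5, 50),
}
REPLICATES = 3  # biological replication, the same for every class

# Per-sample sizes quoted under "Profiling method", in GB (100 MB = 0.1 GB).
GB_PER_SAMPLE = {"proteomics": 20.0, "metabolites": 15.0, "transcription": 0.1}

# Imaging inputs quoted on the imaging slide.
MB_PER_IMAGE = 4
FRAMES_PER_SECOND = 4
SECONDS_PER_CELL_CYCLE = 24 * 3600
GENES = 4000


def total_samples(levels: tuple) -> int:
    """Full factorial design: product of factor levels times replicates."""
    return prod(levels) * REPLICATES


def volume_tb(samples: int, gb_per_sample: float) -> float:
    """Data volume in TB for a number of samples (assumes 1 TB = 1000 GB)."""
    return samples * gb_per_sample / 1000


def imaging_tb_per_cell_cycle() -> float:
    """Image data for one cell field over 24 h (assumes 1 TB = 1e6 MB)."""
    return MB_PER_IMAGE * FRAMES_PER_SECOND * SECONDS_PER_CELL_CYCLE / 1e6


def imaging_pb_for_gene_set() -> float:
    """One 24 h imaging run per gene, expressed in PB (1 PB = 1000 TB)."""
    return GENES * imaging_tb_per_cell_cycle() / 1000


if __name__ == "__main__":
    for name, levels in TEMPLATES.items():
        n = total_samples(levels)
        volumes = {k: volume_tb(n, gb) for k, gb in GB_PER_SAMPLE.items()}
        print(name, n, volumes)
    print(imaging_tb_per_cell_cycle(), "TB per cell cycle")
    print(imaging_pb_for_gene_set(), "PB for the gene set")
```

Under these assumptions the sample counts and the proteomics, metabolite, and transcription volumes agree with the table after rounding, as do the approximately 1.4 TB and 5.5 PB imaging figures.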