DOE Data Workshop View from Information-intensive Applications H. Steven Wiley Biomolecular Systems Initiative

DOE Data Workshop
View from Information-intensive Applications
H. Steven Wiley
Biomolecular Systems Initiative
Pacific Northwest National Laboratory
(www.sysbio.org)
U.S. Department of Energy
Information Intensive Science
Goals of IIS
Understanding systems versus individual phenomena
Strengthening/automating links between different types of data from different scales
Examples
Biology: Cell Signaling
Biology: BIRN
Chemistry: CMCS
Homeland Defense
Complexity of systems is becoming pervasive
Challenges
Efficient federation, graph-based queries
Continuous data correlation
Managing complex experiments, data provenance using multiple independent data and analysis resources
Priorities
High-performance federation, data mining, semantic query capabilities (software, hardware architecture)
Knowledge environments (lightweight, evolvable, powerful, …)
Organization and Visualization of large-scale, complex information
Combustion is a Multi-scale Chemical Science Challenge
A systems-science approach to address complex problems
New knowledge is assimilated from different data, tools, and disciplines at each scale
Real-time bi-directional information flow
Deep analysis across scales
Multiple applications for the same information
Challenges
Data, provenance, annotation publication
Syntactic and Semantic Federation
Standardization versus innovation
Examples:
IUPAC: update of radical thermochemistry reference values by global expert group
PrIMe: community developed optimized reaction mechanisms
guiding experimental plans across scales, providing community resources for applied research
Homeland Security: Pulling insight out of information overload
Communications, Shipping, Financial, Sensors, Immigration
Is there a domestic terrorist plot?
Can we detect and prevent a terrorist attack BEFORE it happens?
Volume of data, orders of magnitude larger and at different levels of abstraction
Information spaces extend into very high dimensions, with 200 the norm
Information often out of context, incomplete, fuzzy
Deception
Information in all media types: text, imagery, video, voice, web, sensor data
Time and temporal dynamics fundamentally change the approach
Spatial, yet non-spatial abstract data
Multiple ontologies, languages, cultures
Privacy Issues
For homeland security and science we now turn to data-intensive visual analytics
Systems Biology of Cells
Cell function: death, proliferation, differentiation, migration, ...
Molecular parameters: protein levels / states / locations / interactions / activities
Ultimate aim: Understanding and prediction of effects of component properties
What, Where, Quantity, Quality?
To successfully model a complex biological system, one must know, at a minimum, the following:
What parts are being made? (identity)
How is the regulatory network structured? (interactions)
Where are the proteins located in the cell? (location)
What are their levels? (quantity)
How do they interact with their partners? (activity)
As a function of covalent modification
Contribution of steric restrictions
Forward and reverse rate constants
Cells as Input-Output Systems
Biologists look at their experiments as input-output systems
We start with a “defined” system to which we apply a stimulus (input: independent variable)
We then look for a specific response (output: dependent variable)
The relationship between the input and output provides insight into the workings of the system
[Figure: Input, System, Output, with Unknown context]
So unless we control the experimental context, we cannot interpret our experiments
The Two Greatest Challenges of Systems Biology
1. Working with indeterminate systems
2. Understanding context - what it is and how to control and capture it
Defining the composition of living systems is driving analytical technologies
Genomics
Proteomics
Metabonomics
Expression profiling
Imaging
Etc.
All of these technologies seek to rigorously define the composition of living systems
Global simultaneous quantitative proteome measurements
Proteins identified and quantified using accurate mass and time (AMT) tags
Capillary LC-FTICR 2-D display of peptides from a yeast soluble protein digest
>160,000 isotopic distributions corresponding to >100,000 polypeptides detected
[Figure: 2-D display of detected peptides. Dimension one: separation time (LC elution time, min). Dimension two: accurate mass (MW, m/z).]
High Throughput Proteomics
1 experiment per hour
5000 spectra per experiment
4 MB per spectrum
Per instrument:
20 GB per hour
480 GB per day
These are based on today's technologies.
9.4 Tesla High Throughput Mass Spectrometer
Time to analyze offsite: 1 week
Time to analyze onsite: 48 hours
Time to analyze onsite with smart storage: 2 hours
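The per-instrument rates follow directly from the three figures on the slide. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB) and continuous 24-hour operation; the function and constant names are illustrative only:

```python
# Figures taken from the slide.
EXPERIMENTS_PER_HOUR = 1
SPECTRA_PER_EXPERIMENT = 5000
MB_PER_SPECTRUM = 4

# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000


def gb_per_hour() -> float:
    """Raw data produced by one instrument in one hour, in GB."""
    mb = EXPERIMENTS_PER_HOUR * SPECTRA_PER_EXPERIMENT * MB_PER_SPECTRUM
    return mb / MB_PER_GB


def gb_per_day(hours_of_operation: int = 24) -> float:
    """Raw data per day, assuming the instrument runs around the clock."""
    return gb_per_hour() * hours_of_operation


print(f"{gb_per_hour():.0f} GB per hour, {gb_per_day():.0f} GB per day")
```

Under these assumptions the sketch reproduces the 20 GB per hour and 480 GB per day quoted for a single instrument.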
Integrated, High-throughput Experiments will Generate Enormous Amounts of Data
Experiment templates for a single microbe
Class of experiment | Time points | Treatments | Conditions | Genetic variants | Biological replication | Total biological samples | Proteomics data volume (TB) | Metabolite data (TB) | Transcription data (TB)
simple (scratching the surface) | 10 | 1 | 3 | 1 | 3 | 90 | 1.8 | 1.4 | 0.009
moderate | 25 | 3 | 5 | 1 | 3 | 1125 | 22.5 | 16.9 | 0.1125
upper mid | 50 | 3 | 5 | 5 | 3 | 11250 | 225.0 | 168.8 | 1.125
complex | 20 | 5 | 5 | 20 | 3 | 30000 | 600.0 | 450.0 | 3
real interesting | 20 | 5 | 5 | 50 | 3 | 75000 | 1500.0 | 1125.0 | 7.5
Profiling method
Proteomics: Looking at a possible 6000 proteins per microbe, assuming ~20 GB per sample
Metabolites: Looking at a panel of 500-1000 different molecules, assuming ~15 GB per sample
Transcription: 6000 genes & 2 arrays per sample, ~100 MB
Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples
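The sample counts and data volumes in the experiment templates can be checked with a short calculation. This is a sketch under two assumptions: that the total number of biological samples is the product of the design factors (which matches the totals listed), and that volumes use decimal units (1 TB = 1000 GB). The names below are illustrative only.

```python
# Per-sample sizes quoted on the slide, in GB (100 MB = 0.1 GB).
GB_PER_SAMPLE = {"proteomics": 20.0, "metabolites": 15.0, "transcription": 0.1}

# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000


def total_samples(time_points, treatments, conditions, variants, replicates=3):
    """Assumed rule: total samples is the product of all design factors."""
    return time_points * treatments * conditions * variants * replicates


def volumes_tb(n_samples):
    """Data volume in TB for each profiling method."""
    return {name: n_samples * gb / GB_PER_TB for name, gb in GB_PER_SAMPLE.items()}


# The "simple" and "real interesting" templates from the slide.
for label, design in [("simple", (10, 1, 3, 1)), ("real interesting", (20, 5, 5, 50))]:
    n = total_samples(*design)
    print(label, n, volumes_tb(n))
```

Because the factors are simply multiplied, adding genetic variants or time points scales every data volume linearly, which is why the larger templates reach the petabyte range for proteomics alone.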
The Molecular Interaction Scaffold is Huge
Trey Ideker
Cell Imaging
New multispectral, multidimensional imaging techniques can generate enormous amounts of data
Cell Imaging Workflow
Complex set of metadata collected here
How Much Data From Imaging?
Currently, a high quality image of a single cell field is 4 MB per image, obtained at 4 fps (16 MB/s)
Following a cell through one cell cycle is 24 h, or approximately 1.4 TB
New hyperspectral microscopes analyzing only 10 wavelengths would generate 7 TB/day
Characterizing dynamics of the most abundant set of genes (4000) would require 5.5 PB
This is for a single instrument and a single experiment using today’s technology
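The single-cell-cycle and 4000-gene figures can be reproduced from the image size and frame rate. A minimal sketch, assuming decimal units and assuming that each gene requires one full 24-hour recording (an interpretation of the slide, not something it states); all names are illustrative:

```python
# Figures taken from the slide.
MB_PER_IMAGE = 4
FRAMES_PER_SECOND = 4
CELL_CYCLE_HOURS = 24
GENES = 4000

# Assumption: decimal units.
MB_PER_TB = 1_000_000
TB_PER_PB = 1000


def rate_mb_per_s() -> int:
    """Sustained acquisition rate in MB/s."""
    return MB_PER_IMAGE * FRAMES_PER_SECOND


def tb_per_cell_cycle(hours: int = CELL_CYCLE_HOURS) -> float:
    """Data volume for following one cell field through one cell cycle."""
    return rate_mb_per_s() * hours * 3600 / MB_PER_TB


def pb_for_genes(n_genes: int = GENES) -> float:
    """Assumed: one full cell-cycle recording per gene."""
    return tb_per_cell_cycle() * n_genes / TB_PER_PB


print(f"{rate_mb_per_s()} MB/s, {tb_per_cell_cycle():.1f} TB per cycle, "
      f"{pb_for_genes():.1f} PB for {GENES} genes")
```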
Understanding the influence of cell context is driving experimental and computational biology
Cell Signaling
Developmental biology
Cancer and growth control
Host-pathogen interactions
Dynamics of microbial communities
Cellular responses to stress
Computational Modeling Approaches -- Diverse Spectrum
[Figure: spectrum of modeling approaches running from SPECIFIED to ABSTRACTED]
Approaches: differential equations, Markov chains, Boolean models, Bayesian networks, statistical mining
What is modeled: mechanisms (including structure), influences, relationships
Computer Models Allow Reconstruction of Processes Across Different Scales
[Figure: MODEL DATABASE schema. Models are organized by Species, Organ, Tissue, and Cell. Each model record lists Unique ID, Model Name, Model Descr., Default Par., Default Comp., Timestamp, and Security Equation. Linked tables, keyed by Input_par ID or Input_fld ID, hold Solution Par., Chemical Par. (React. Rates, Concen. Val.), Geometric Par., Compute Par., Initial Conditions, Parameter Docs. (References, Limits), Docs., Symbolic Source, and Cell Data Sets.]
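One way to picture the kind of record such a model database holds is a small data structure. The sketch below is hypothetical: the field names loosely echo labels in the diagram (Unique ID, Model Name, Model Descr., Timestamp, Input_par ID), but the classes, methods, and example values are assumptions made for illustration and do not describe the actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Parameter:
    """One parameter entry, keyed by an input parameter ID (hypothetical layout)."""
    input_par_id: str
    value: float
    reference: str = ""


@dataclass
class ModelRecord:
    """A model entry attached to one biological scale (hypothetical layout)."""
    unique_id: str
    name: str
    description: str
    scale: str  # e.g. "Species", "Organ", "Tissue" or "Cell"
    timestamp: datetime = field(default_factory=datetime.now)
    # Parameter groups such as "chemical", "geometric" or "compute".
    parameters: Dict[str, List[Parameter]] = field(default_factory=dict)

    def add_parameter(self, group: str, parameter: Parameter) -> None:
        """Append a parameter to the named group, creating the group if needed."""
        self.parameters.setdefault(group, []).append(parameter)

    def find(self, group: str, input_par_id: str) -> Optional[Parameter]:
        """Return the matching parameter, or None if it is not recorded."""
        for parameter in self.parameters.get(group, []):
            if parameter.input_par_id == input_par_id:
                return parameter
        return None


# Placeholder values, not data from the presentation.
model = ModelRecord("m-001", "example cell model", "placeholder entry", scale="Cell")
model.add_parameter("chemical", Parameter("k_forward", 0.5, reference="hypothetical"))
print(model.find("chemical", "k_forward"))
```

Keeping parameters in named groups with their references mirrors the idea of linking each value back to its documentation, so a model at one scale can be reused or revised without losing provenance.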
Obstacles preventing scientists from utilizing available data
Data is distributed across many repositories with various ontologies and data formats
Analysis tools do not address integration of heterogeneous data sets
Few informatics-based analysis tools support a systems biology approach
Collaboration capabilities are too primitive to support shared knowledge among researchers
The Challenge for Data Handling is Two-fold
1. Managing the massive amounts of compositional data necessary to define all of the relevant experimental systems
2. Capturing all of the data on the relationships between context, composition and response
Integration of the analytical and experimental methodologies into a single system is necessary to link all of the data in a useful way
END
Understanding Living Cells
Cell responses are multiphasic
Different classes of stimulants (information) are processed at characteristic time scales
Processing nodes within cells are spatially segregated
Each cell responds independently depending on its specific context
A response generally induces a reprogramming of the cell machinery
To create cell simulations, we must “abstract” this information to create a reference model which can then be modified