High throughput biology: data management and data intensive computing drivers
George Michaels

The Scope of the Problem
A highly multidimensional world of complicated dynamic events
Both synchronous and asynchronous processes
Vast scales of time and space
A hierarchy of simultaneous levels of activity
Thousands of types of cells and environments
It’s all About the Complexity
The human genome has changed the way biologists approach scientific challenges.
Biology is an information science.
Biology applications are scaling at a rate that exceeds computing capability.
GTL presents the opportunity to expand throughput by 5- to 50-fold per year.
Billions of Bases in GenBank
According to the GOLD database, there are 146 published genomes, 344 ongoing prokaryotic genome projects, and 243 ongoing eukaryotic genome projects.
DOE never supported a comprehensive and effective data management and curation program for GenBank.
The Protein Data Bank (PDB) is a repeat of the same scenario.
Both database efforts were ahead of the science that capitalized on the work.
Curation and provenance strategies are still unsolved hard problems for these data.
[Figure: growth of GenBank in billions of bases, 1982 to 2002]
Growth of Proteomic Data vs. Sequence Data
[Figure: data volume in petabytes (log scale, 0.00001 to 1000) versus year, 1988 to 2016, with series for proteomic data and GenBank]
From BERAC – December 2002
Computing Issues for GTL Facilities and Projects
Creating an Integrated Computational Biology Environment
6. Infrastructure
5. The Community Data Resource
4. Interpretation / Modeling / Simulation
3. Data Analysis / Reduction
2. Data Capture and Archiving
1. LIMS & Workflow Management
Central Role of GTL Facilities in Compute Planning
The GTL Facilities will represent the cornerstone of the GTL enterprise and major sites for development of computing systems.
They will generate massive amounts of data for use by the community and for constructing models of biology.
The facilities will be the sites where experiment workflow must be facilitated, data must be analyzed, and systems biology data and models provided to the community.
They are likely to contain integrated high performance computing, shared suites of tools to analyze data, and massive data archives.
Their combined and integrated output will become the major portion of the GTL community resource (GTL knowledge base).
Need New Data Handling and Computing Resources to Handle Data Tsunami
[Figure labels: "DATA"; "Current data infrastructure"; "Help!"]
Experiment Design Metadata Issues
Experiment design context provides the most powerful context-dependent annotation for gene/protein activities
Experiment designs will evolve over time
Experiment designs should specify what data needs capturing
Statistical experiment designs should drive Discovery activities
Flexible approaches are needed to adapt to new data collection modes and data types
Model-driven experimentation needs to include the prediction/hypothesis tested
Experiments [samples, genetics, treatments, conditions, time, [quality measures]]
Samples [attributes, [measurements, [qc measures]]]
GTL Experiment Template
Experiment templates for a single microbe
Profiling methods: Proteomics, Metabolites, Transcription (data volumes in TB)

Class of experiment              Time    Treatments  Conditions  Genetic   Biological   Total biological  Proteomics  Metabolite  Transcription
                                 points                          variants  replication  samples           data (TB)   data (TB)   data (TB)
simple (scratching the surface)  10      1           3           1         3            90                18.0        13.5        0.018
moderate                         25      3           5           1         3            1125              225.0       168.8       0.225
upper mid                        50      3           5           5         3            11250             2250.0      1687.5      2.25
complex                          20      5           5           20        3            30000             6000.0      4500.0      6
real interesting                 20      5           5           50        3            75000             15000.0     11250.0     15

Proteomics: looking at a possible 6000 proteins per microbe, assuming ~200 GB per sample
Metabolites: looking at a panel of 500-1000 different molecules, assuming ~150 GB per sample
Transcription: 6000 genes & 2 arrays per sample, ~100 MB
Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples
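The arithmetic behind the template can be sketched as follows: the sample count is the product of the design factors, and each data volume is the sample count times a per-sample size. This is a reconstruction under stated assumptions, not the author's own calculation. In particular, the 0.2 GB per sample used for transcription (two arrays at ~100 MB each) is inferred from the table values, 1 TB is taken as 1000 GB, and the names `TEMPLATES`, `GB_PER_SAMPLE`, and `estimate` are hypothetical.

```python
from math import prod

# Per-sample data sizes in GB. Proteomics and metabolites are the slide's
# stated assumptions; the transcription figure is inferred from the table.
GB_PER_SAMPLE = {"proteomics": 200.0, "metabolites": 150.0, "transcription": 0.2}

# (time points, treatments, conditions, genetic variants, biological replication)
TEMPLATES = {
    "simple": (10, 1, 3, 1, 3),
    "moderate": (25, 3, 5, 1, 3),
    "upper mid": (50, 3, 5, 5, 3),
    "complex": (20, 5, 5, 20, 3),
    "real interesting": (20, 5, 5, 50, 3),
}


def estimate(factors):
    """Return (total samples, {method: data volume in TB}) for one template."""
    samples = prod(factors)
    volumes_tb = {m: samples * gb / 1000.0 for m, gb in GB_PER_SAMPLE.items()}
    return samples, volumes_tb


for name, factors in TEMPLATES.items():
    samples, volumes = estimate(factors)
    print(name, samples, {m: round(v, 3) for m, v in volumes.items()})
```

Under these assumptions the sample counts match the template exactly, and the volumes match it to the rounding shown there (for example, 168.75 TB appears as 168.8).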
The GTL Informatics Whole Picture
"The GTL ORACLE"
[Figure: architecture diagram for creating an integrated computational biology environment. Labels recoverable from the diagram:]
Databases: Protein Production DB; Protein Expression and Regulation DB; Protein Machines DB; Cell & Community Systems DB
Shared Tools Libs: Expression Analysis Lib; Modeling & simulation Tools Lib; Regulatory network modeling tools; Confocal Image analysis tools Lib; Protein Machine modeling tools; Mass spec analysis tools Lib; Molecular Dynamics Simulation Library; ...
Large-scale shared bulk data archives: Expression Archive; Image Archive; MassSpec Archive; ...
Facility x and Facility y, each with: LIMS & Workflow Management; Data Capture and Archiving DBs; Data Analysis / Reduction; Modeling and Simulation; Output to community data resource
Shared LIMS / Workflow
Community Data Resource
What’s in the Knowledgebase?
Facility 1 Data Resources
1. Protein Production DB
- microbial baseline annotation, genes, proteins...
- catalog of proteins and reagents produced / inventory
- biophysical and biochemical characterizations of proteins
- protocols and methods
[Diagram boxes: Microbial genome baseline annotation; Proteins and reagents catalog; Protein biophysical/biochemical data; Protein production protocols / methods]
Facility 2 Data Resources
2. Protein Expression & Regulation DB
- protein expression data per condition per microbe
- regulatory networks based on expression data
- metabolite / metabolic network data
- protocols and methods
[Diagram boxes: Protein expression DB; Regulatory network models database; Metabolic network models database; Cell growth & methods & protocols]
Facility 3 Data Resources
3. Protein Machines DB
- protein machines catalog
- protein machines models of organization / dynamics
- protein interaction network models and simulations
- protocols and methods
[Diagram boxes: Protein machines catalog; Protein machines models & simulations; Interaction network models database; Protein machines protocols / methods DB]
Facility 4 Data Resources
4. Cell and Community Systems DB
- in vivo cell measurements of expression / machines
- measurements of community interactions / metabolism
- integrated cell models (regulation, metabolism, signaling)
- integrated community models
[Diagram boxes: Cell models and simulations; Community models and simulations; In vivo protein and machine expression / localization; Community metabolism and interactions]
Community Data Resource
R & D Challenges
Design and Integration of the major databases
Huge data volumes, great schema complexity - need for new types of databases (hardware and software)
Database technologies – object-relational, graph DBs, …
Data standards, representations, ontologies for very complex objects
User Access Systems for browsing, query, visualization, and to run analysis or simulations
Supporting Simulation from DBs - how to allow users to utilize models and run simulations; how to link simulations to underlying data
Integration
- Provide integrated view of the biology
- With data from other community sources.
Community access to compute power to run long timescale simulations
IP issues and reward system
How to represent incomplete, sparse, conflicting data