In VL-e

advertisement
e-Science and Grid
The VL-e approach
L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group
Department of Computer Science
Universiteit van Amsterdam
bob@science.uva.nl
Content
• Developments in Grid
• Developments in e-Science
 Objectives
• Virtual Lab for e-Science
 Research philosophy
• Conclusions
Background
ICTpush developments
•
•
•
•
Processing power doubles every 18 month
Memory size doubles every 12 month
Network speed doubles every 9 month
Something has to be done to harness this
development
 Virtualization of ICT resources
 Internet
 Web
 Grid
Mbit/s
Internal versus external
bandwidth
100000
10000
Computer busses
1000
100
10
1
0.1
networks
0.01
0.001
1980
1990
1995
1997
1998
2001
2002
Web& Grid & Web/Grid Services
• Web services is a paradigm/way of using/accessing
information
 Web resources are mostly human-centered (understood by
humans, computers can read but can’t understand)
• Grid is about accessing & sharing computing
resources by virtualization
 Data & Information repositories
 Experimental facilities
• OGSA: Service oriented Grid standard based on Web
services
Web & Grid & Semantic Web/Grid
• Semantic Web resources should also be
understandable by computers
 This way, many complex tasks formulated and assigned by
humans can be automated and executed by agents working
on the Semantic Web
• But, both Grid and Web Services only focus on single
services.
• Semantic Web/Grid should be able to describe a
single service as well as the relationship between
services e.g. aggregate service using a number of
services so making knowledge explicit
Levels of Grid abstraction
Knowledge Web/Grid
Information Web/Grid
Data Grid
Computational Grid
Background information
experimental sciences
• There is a tendency to look ever deeper in:
 Matter e.g. Physics
 Universe e.g. Astronomy
 Life e.g. Life sciences
• Therefore experiments become increasingly more
complex
• Instrumental consequences are increase in detector:
 Resolution & sensitivity
 Automation & robotization
• Results for instance in life science in:
 So called high throughput methods
 Omics experimentation
 genome ===> genomics
New technologies in
Life Sciences research
cell
Methodology/
Technology
DNA
Genomics
RNA
Transcriptomics
protein
metabolites
Proteomics
Metabolomics
University of Amsterdam
Paradigm shift in Life science
• Past experiments where hypothesis
driven
Evaluate hypothesis
Complement existing knowledge
• Present experiments are data driven
Discover knowledge from large amounts
of data
 Apply statistical techniques
Background information
experimental sciences
• Experiments become increasingly more
complex
 Driven by detector developments
 Resolution increases
 Automation & robotization increases
• Results in an increase in amount and
complexity of data
The Application data crisis
• Scientific experiments start to generate
lots of data





medical imaging (fMRI):
Bio-informatics queries:
Satellite world imagery:
Current particle physics:
LHC physics (2007):
~ 1 GByte per measurement (day)
500 GByte per database
~ 5 TByte/year
1 PByte per year
10-30 PByte per year
• Data is often very distributed
Background information
experimental sciences
• Experiments become increasingly more
complex
 Driven by detector developments
 Resolution increases
 Automation & robotization increases
• Results in an increase in amount and
complexity of data
• Something has to be done to harness this
development
 Virtualization of experimental resources: e-Science
The what of e-Science
• e-Science is the application domain “Science” of Grid & Web
 More than only coping with data explosion
 A multi-disciplinary activity combining human expertise & knowledge
between:
 A particular domain scientist
 ICT scientist
• e-Science demands a different approach to experimentation because
computer is integrated part of experiment
 Consequence is a radical change in design for experimentation
• e-Science should apply and integrate Web/Grid methods where and
whenever possible
Grid and Web Services
Convergence
Grid
Started
far
apart in
apps &
tech
Have been
converging
WSRF
Web
Definition of Web Service Resource Framework(WSRF) makes explicit
distinction between “service” and stateful entities acting upon
service i.e. the resources
Means that Grid and Web communities can move forward on a
common base!!!
Ref: Foster
e-Science Objectives
•
•
It should enhance the scientific process by:
Stimulating collaboration by sharing data & information

Result is re-use of data & information
The data sharing potential for
Cognition
• Collaborative scientific
research
 Information sharing
 Metadata modeling
• Allows for experiment
validation
 Independent confirmation of
results
• Statistical methodologies
 Access to large collections
of data and metadata
• Training
 Train the next generation
using peer reviewed
publications and the
associated data
Electron tomography data
pipeline
Acquisition
Alignment
Reconstruction
Segmentation
Interpretation
Cell Centred Database
 NCIMR (San Diego)
 Maryann Martone and Mark Ellisman
e-Science Objectives
•
•
It should enhance the scientific process by:
Stimulating collaboration by sharing data & information
 Improve re-use of data & information
•
Combing data and information from different modalities
 Sensor data & information fusion
The anatomy of a painting:
Scrutinizing Rembrandt in multiple dimensions
The anatomy lesson of dr. Tulp, Rembrandt (1632), Mauritshuis, Inv nr. 146
VL chemical imaging data experiment
Experiment 1
Experiment 2
To metadata
database
FTIR
raw
data
MS
raw
data
Me ta data
generation
(PC A)
Me ta data
generation
(PC A)
VL Exp.1
Reduced
metadata
set
VL
Result 1
VL Exp.2
Reduced
metadata
set
Improved
metadata
set
VL
Result 3
VL
Result 4
VL
Result 2
Metadata
correlation
analysis (CCA)
Improved
metadata
set
VL Exp.3 + 4
EXP 1
300 mx3 00m
256 x256 pixels
3-D Da taset
Metadata
generation
VL EXP 1
Reconstructed
MS
400 mx4 00m
64x64 pixels
3-D Da taset
VL EXP 2
(+)
(+)
(-)
(-)
(+)
(+)
(-)
(-)
Score images
Score images
MS Im aging
FTIR Imaging
Reconstructed
FTIR spectra
EXP 1
Calculate Cross Correlation matrix C
using principal components 1 to 10
from each dataset :
C = S TTRIFT S FTIR
Canonical Correlation Analysis
VL EXP 3
VL EXP 4
Improved FTIR
spectra, S/N >100%
increase
Pos
CV1
EV 0.736
5.6 % var
CV1
EV 0.736
15.9 % var
R=0.858
Neg
Pos
CV2
EV 0.382
2.1 % var
CV2
EV 0.382
5.7 % var
R=0.618
Neg
Score images
Reconstructed
MS
Score images
Reconstructed
FTIR spectra
e-Science Objectives
•
•
It should enhance the scientific process by:
Stimulating collaboration by sharing data & information
 Improve re-use of data & information
 Combing data and information from different modalities

•
Sensor data & information fusion
Realize the combination of real life & (model based)
simulation experiments
Simulated Vascular Reconstruction
in a Virtual Operating Theatre
• An example of the combination of real life &
(model based) simulation experiments
•
•
• patient specific vascular geometry (from CT)
• Segmentation
• blood flow simulation (Latice Bolzmann)
Pre-operative planning (interaction)
Suitable for parallelization through
functional decomposition
Patient’s vascular geometry
(CTA)
Simulated “Fem-Fem”
bypass
Grid resources
e-Science Objectives
•
•
It should enhance the scientific process by:
Stimulating collaboration by sharing data & information
 Improve re-use of data & information
 Combing data and information from different modalities

•
•
Sensor data & information fusion
Realize the combination of real life & (model based)
simulation experiments
Modeling of dynamic systems
Bird behaviour in relation to weather and landscape
RADAR
Calibration and
Data assimilation
Dynamic
bird behaviour
MODELS
Bird distributions
Ensembles
Predictions
and on-line
warnings
e-Science Objectives
• It should result in :
• Computer aided support for rapid prototyping of ideas
 Stimulate the creativity process
• It should realize that by creating & applying:
 New ICT methodologies and a computing infrastructure stimulating
this
• From this ICT point of view it should support the following
application steps:




Design
Development & realization
Execution
Analysis & interpretation
• We try to realize e-Science and their applications via
the Virtual Lab for e-Science (VL-e) project
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research & the development of
related ICT infrastructure
• Generic application support
 Application cases are drivers for computer & computational
science and engineering research
VL-e project
Medical
Diagnosis &
Imaging
BioDiversity
BioInformatics
Data
Data intensive
Insive
sciences
Science/
LOFAR
Food
Informatics
VL-e
Management
Application
Oriented
of comm.
& Services
computing
Grid Services
Harness multi-domain distributed resources
Dutch
Telescience
Two sides of Bioinformatics
as an e-Science
• The scientific responsibility to develop the underlying
computational concepts and models to convert
complex biological data into useful biological and
chemical knowledge
• Technological responsibility to manage and integrate
huge amounts of heterogeneous data sources from
high throughput experimentation
Role of bioinformatics
Genomics
RNA
Transcriptomics
protein
metabolites
Proteomics
Metabolomics
Integrative/System Biology
Data usage/user interfacing
DNA
Data integration/fusion
methodology bioinformatics
Data generation/validation
cell
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research and development of
related ICT infrastructure
• Generic application support
 Application cases are drivers for computer & computational
science and engineering research
 Problem solving partly generic and partly specific
 Re-use of components via generic solutions whenever
possible
Potential Generic
part
Management
of comm. &
computing
Application
Specific
Part
Potential Generic
part
Virtual
Laboratory
Management
of comm.
& Services
Application
Oriented
computing
Application
Specific
Part
Potential Generic
part
Management
of comm. &
computing
Grid/ Web Services
Harness multi-domain distributed resources
Application pull
Application
Specific
Part
Generic e-Science aspects
•
•
Virtual Reality Visualization & user interfaces
Modeling & Simulation
 Interactive Problem Solving
•
Data & information management
 Data modeling
 dynamic work flow management
•
Content (knowledge) management
 Semantic aspects
 Meta data modeling
 Ontologies
•
•
Wrapper technology
Design for Experimentation
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research and development of
related ICT infrastructure
• Generic application support
 Application cases are drivers for computer & computational
science and engineering research
 Problem solving partly generic and partly specific
 Re-use of components via generic solutions whenever
possible
• Rationalization of experimental process
 Reproducible & comparable
Issues for a reproducible scientific
experiment
Parameter settings,
Calibrations,
Protocols
…
acquisition
experiment
sensors,amplifiers
imaging devices,
,…
parameters/settings,
algorithms,
intermediate results,
…
raw data
processing
conversion, filtering,
analyses, simulation, …
software packages,
algorithms
…
processed data
presentation
visualization, animation
interactive exploration, …
interpretation
Rationalization of the experiment and processes via protocols
Metadata
Much of this is lost when an
experiment is completed.
Scientific experiments & e-Science
Step1: designing
an experiment
Step2:
performing the
experiment
Step3:
analyzing the
experiment results
success
For complex experiments:
 contain complex processes
 require interdisciplinary expertise
 need large scale resource
Grid & high level
support
Components in a VL-e experiment
COMMENTATOR
(LINK)
Process-Flow Templates(PFT)
COMMENTATOR
(LINK)
isMadeBy
(LINK)
isMadeBy
(LINK)
EXPERIMENT
(LINK)
COMMENT
(COPY)
– Derived from ontologies
– Graphical representation of data elements
and processing steps in an experimental
procedure
– Information to support context-sensitive
assistance (semantics)
COMMENT
(COPY)
hasComments
(COPY)
hasComments
(COPY)
hasPrevExperiment
(NOREUSE)
PROJECT
(LINK)
hasNextStep
(NOREUSE)
ARRAY
MEASUREMENT
(COPY)
EXPERIMENT
(COPY)
hasExperiments
(NOREUSE)
hasOwner
LINK
isPerformedBy
(LINK)
ownsExperiments
(NOREUSE)
hasContributors
(LINK)
hasPerformed
(NOREUSE)
contributedExperiments
(NOREUSE)
OWNER
(LINK)
OWNER
(LINK)
CONTRIBUTOR
(LINK)
COMMENTATOR
(LINK)
COMMENTATOR
(LINK)
isMadeBy
(LINK)
isMadeBy
(LINK)
EXPERIMENT
(LINK)
COMMENT
(COPY)
COMMENT
(COPY)
hasComments
(COPY)
PROPERTY
(COPY)
hasNextExperiment
(NOREUSE)
hasComments
(COPY)
hasPrevExperiment
(NOREUSE)
hasProperties
(COPY)
hasSteps
(NOREUSE)
isPartOfProject
(NOREUSE)
PROJECT
(LINK)
hasNextStep
(NOREUSE)
ARRAY
MEASUREMENT
(COPY)
EXPERIMENT
(COPY)
hasExperiments
(NOREUSE)
hasOwner
LINK
hasContributors
(LINK)
ownsExperiments
(NOREUSE)
isPerformedBy
(LINK)
hasPerformed
(NOREUSE)
contributedExperiments
(NOREUSE)
OWNER
(LINK)
DATA ANALYSIS
(COPY)
hasPrevStep
(NOREUSE)
isPartOfExperiment
(NOREUSE)
OWNER
(LINK)
CONTRIBUTOR
(LINK)
Experiment Topology
Graphical representation of self-contained data
processing modules attached to each other in a workflow
DATA ANALYSIS
(COPY)
hasPrevStep
(NOREUSE)
isPartOfExperiment
(NOREUSE)
– Descriptions of experimental
–
hasProperties
(COPY)
hasSteps
(NOREUSE)
isPartOfProject
(NOREUSE)
Study
steps represented as an
instance of a PFT with
references to experiment
topologies
PROPERTY
(COPY)
hasNextExperiment
(NOREUSE)
e-Science environment
Step1: designing
an experiment
Step2: performing
the experiment
Step3: analyzing
the experiment results
Scientific Workflow Management
Systems
Taverna, Kepler, and Triana: model
processes for computing tasks.
Process Flow
Template
PFT in VLAMG
Process Flow
Template
PFT instances in VLAMG
In VL-e: model both computing tasks and human activity based processes, and
model them from the perspective of an entire lifecycle. It tries to support
• Collaboration in different stages
• Information sharing
• Reuse of experiment
success
Definition of experiment
protocols
Workflow definitions
Recreate complex experiments into process flows
Workflow execution
Maintain control over the experiment
Data processing execution
Topologies of data processing modules
Interpretation
Visualization of processed results to help intuition
Ontology definitions can help in obtaining
a well-structured definition of experiment,
data and metadata.
Virtual Lab for e-Science
research Philosophy
• Multidisciplinary research and development of related ICT
infrastructure
• Generic application support
 Application cases are drivers for computer & computational science and
engineering research
 Problem solving partly generic and partly specific
 Re-use of components via generic solutions whenever possible
• Rationalization of experimental process
 Reproducible & comparable
• Two research experimentation environments
 Proof of concept for application experimentation
 Rapid prototyping for computer & computational science experimentation
The VL-e infrastructure
Application
specific
service
Application
Potential
Generic service
&
Virtual
Lab. services
Grid
&
Network
Services
Telescience
Medical
Application
Bio
Informatics
Applications
Virtual Laboratory
Grid Middleware
Surfnet
VL-e Proof of Concept Environment
Test & Cert.
VL-software
Virtual Lab.
rapid prototyping
(interactive simulation)
Test & Cert.
Grid Middleware
Additional
Grid Services
(OGSA services)
Test & Cert.
Compatibility
Network Service
(lambda networking)
VL-e Certification
Environment
VL-e Experimental
Environment
Infrastructure for Applications
• Applications are a driving force of the
PoC
• Experience shows applications value
stability
• Foster two-way interaction to make
this happen
e-Science environment
Step1: designing
an experiment
Step2: performing
the experiment
Step3: analyzing
the experiment results
Research activity
Stable developments
to be developed in the to be used in VL-e in the Proof of
Rapid Prototyping
Concept environment
environment
success
Research activity
to be developed in the
Rapid Prototyping
environment
Scientific Workflow Management
Systems
PFT
Taverna, Kepler, Triana and VLAM.
SWMS
PFT
VL-e PoC environment
•
•
•
•
•
Latest certified stable software environment of core
grid and VL-e services
Core infrastructure built around clusters and storage
at SARA and NIKHEF (‘production’ quality)
Controlled extension to other platforms and
distributions
On the user end: install needed servers: user
interface systems, storage elements for data
disclosure, grid-secured DB access
Focus on stability and scalability
Hosted services for VL-e
• Key services and resources are offered centrally for
all applications in VL-e
• Mass data and number crunching on the large
resources at SARA
• Storage for data replication & distribution
• Persistent ‘strategic’ storage on tape
• Resource brokers, resource discovery, user group
management
Why such a complex scheme?
• “software is part of the infrastructure”
• stability of core software needed to
develop the new scientific applications
• enable distributed systems management
(who runs what version when?)
“the grid is one big error amplifier”
“computers make mistakes like humans,
only much, much faster”
What did we learn
• It is not enough to just transport current applications
to Grid or e-Science infrastructures
• To fully exploit the potential of e-Science
infrastructures one has to learn what is possible
• Therefore the full lifecycle of an experiment has to
be taken into account
 Workflow management is a first step
 We add semantic information via Process Flow Templates
• Application innovation such as for instance biobanking should be the aim
• Grids should be transparent for the end-user
Conclusions
• e-Science is a lot more more than trying to cope with
data explosion alone
• Implementation of e-Science systems requires
further rationalization and standardization of
experimentation process
• e-Science success demands the realization of an
environment allowing
 application driven experimentation &
 rapid dissemination of feed back of these new methods
• We try to do that via development of Proof of
Concept based on Grid
Electron tomography data
pipeline development
SARA - Amsterdam
1 Gbit/s
TOM (Matlab)
acquisition and
data storage with
the SRB
Grid-computing 3D reconstruction refinements
Currently for EMAN and later also for TOM (Matlab)
SRB Storage
With the CCDB
database for
retrieval of
experimental
data sets
Download