Astrophysics, Biology, Climate, Combustion, Fusion, Nanoscience Working Group on Simulation-Driven Applications

10 CS, 10 Sim, 1 VR
Workflows
• Critical Need: Enable (and Automate) Scientific Work Flows
– Data Storage
– Data Transfer
– Data Analysis
– Visualization
• An order of magnitude more time can be spent on
manually managing these work flows than on performing
the simulation science itself.
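As a minimal sketch of what automating such a work flow could look like, the four stages above can be chained so that no manual hand-off is needed between them. The Workflow class and the stage functions (store, transfer, analyze, visualize) are hypothetical placeholders, not tools named in this talk; real stages would call the storage, transfer, analysis, and visualization systems a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Stage = Callable[[Dict], Dict]


@dataclass
class Workflow:
    """Runs named stages in order, passing a shared state dict along."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, name: str, func: Stage) -> "Workflow":
        self.stages.append((name, func))
        return self

    def run(self, state: Dict) -> Dict:
        for name, func in self.stages:
            state = func(state)
            self.log.append(name)  # record which stage completed
        return state


# Hypothetical stand-ins for the real stages (assumptions for illustration).
def store(state: Dict) -> Dict:
    return {**state, "stored": True}


def transfer(state: Dict) -> Dict:
    return {**state, "location": "analysis-site"}


def analyze(state: Dict) -> Dict:
    values = state["values"]
    return {**state, "mean": sum(values) / len(values)}


def visualize(state: Dict) -> Dict:
    return {**state, "plot": f"mean={state['mean']:.2f}"}


def build_workflow() -> Workflow:
    return (Workflow()
            .add("data storage", store)
            .add("data transfer", transfer)
            .add("data analysis", analyze)
            .add("visualization", visualize))


if __name__ == "__main__":
    wf = build_workflow()
    result = wf.run({"values": [1.0, 2.0, 3.0]})
    print(wf.log)
    print(result["plot"])
```

Keeping each stage as a plain function makes it possible to swap in a team-specific implementation without touching the rest of the chain.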
5/17/2004
Chicago Meeting
DOE Data Management
Simulations
• Simulations run in batch mode.
• Remaining workflow interactive or “on demand.”
• Simulation and analyses performed by distributed teams
of research scientists.
– Need to access remote and distributed data, resources.
– Need for distributed collaborative environments.
• Some solutions will be team dependent.
– Example: Remote Viz. vs. Local Viz., Parallel HDF5 vs. Parallel
netCDF, …
Let thought be the bottleneck 
• Simulation Scientists generally have scripts to semiautomate this process.
• To expedite this process they need to:
– fully automate the workflow,
– remove the bottlenecks.
• Better visualization, better data analysis routines, will allow users to
decrease the interpretation time.
• Better routines to “find the needle in the haystack” will allow the
time spent in the thought process to be decreased.
• Faster turnaround time for simulations requires decreasing the code
runtimes:
– Better numerical algorithms.
– More scalable algorithms.
– Faster processors, faster networking, faster I/O.
– Better batch systems…
– More HPC systems.
Data Management (2)
• To expedite this process they need:
– A common data model to move data from simulation to analysis to
viz.
– Metadata, annotation, and provenance:
• Nature of Metadata
– Code versions.
– Simulation parameters.
– Model parameters.
– Information on simulation inputs (e.g., from experiments and/or other
simulations).
– Machine configuration.
– Compiler information.
– Tools to record provenance in databases.
• Additional provenance (beyond that provided by the metadata above)
needed to describe:
– Reliability of data.
– How the data arrived in the form in which it was accessed.
– Data ownership.
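One way to picture a tool that records provenance in a database is the sketch below, which stores the metadata items listed above in a single SQLite table. The schema, the column names, and the helpers open_provenance_db, record_run, and lineage are assumptions made for illustration; the talk does not prescribe a database or a layout. The derived_from column is one hypothetical way to capture how data arrived in its current form, and owner stands in for data ownership.

```python
import json
import sqlite3


def open_provenance_db(path=":memory:"):
    """Open (or create) a provenance database with one row per run."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               run_id       INTEGER PRIMARY KEY,
               code_version TEXT NOT NULL,
               sim_params   TEXT NOT NULL,   -- JSON
               model_params TEXT NOT NULL,   -- JSON
               inputs       TEXT NOT NULL,   -- JSON list of input sources
               machine      TEXT NOT NULL,
               compiler     TEXT NOT NULL,
               owner        TEXT NOT NULL,
               derived_from INTEGER REFERENCES runs(run_id)
           )"""
    )
    return conn


def record_run(conn, *, code_version, sim_params, model_params, inputs,
               machine, compiler, owner, derived_from=None):
    """Insert one run's metadata and return its run_id."""
    cur = conn.execute(
        "INSERT INTO runs (code_version, sim_params, model_params, inputs,"
        " machine, compiler, owner, derived_from)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (code_version,
         json.dumps(sim_params, sort_keys=True),
         json.dumps(model_params, sort_keys=True),
         json.dumps(inputs),
         machine, compiler, owner, derived_from),
    )
    conn.commit()
    return cur.lastrowid


def lineage(conn, run_id):
    """Follow derived_from links from a run back to its original source."""
    chain = []
    while run_id is not None:
        row = conn.execute(
            "SELECT derived_from FROM runs WHERE run_id = ?", (run_id,)
        ).fetchone()
        if row is None:
            raise KeyError(f"unknown run_id {run_id}")
        chain.append(run_id)
        run_id = row[0]
    return chain
```

Recording the metadata at the moment a run is launched, rather than afterwards, avoids relying on anyone's memory of which code version or compiler produced a given data set.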
Critical to develop a unified data model.
• Can we build analysis routines which can be used for
multiple codes? Multiple disciplines?
• Standards.
• Data Model must allow flexibility.
– We commonly add or remove variables used in the
simulations/analysis routines.
– Must deal with AMR calculations.
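A rough sketch of the flexibility asked for above: variables are held by name rather than in fixed fields, so they can be added or removed freely, and the data is organized as blocks that each carry a refinement level, as in AMR calculations. The Block and Dataset classes and the maximum routine are hypothetical illustrations, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """One AMR patch: a refinement level plus named variables over its cells."""
    level: int
    variables: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class Dataset:
    blocks: List[Block] = field(default_factory=list)

    def add_variable(self, name: str, compute: Callable[[Block], List[float]]) -> None:
        """Derive a new variable on every block."""
        for block in self.blocks:
            block.variables[name] = compute(block)

    def drop_variable(self, name: str) -> None:
        """Remove a variable wherever it is present."""
        for block in self.blocks:
            block.variables.pop(name, None)

    def variable_names(self) -> List[str]:
        names = set()
        for block in self.blocks:
            names.update(block.variables)
        return sorted(names)


def maximum(dataset: Dataset, name: str) -> float:
    """Analysis routine that depends only on the data model, not on the code
    that produced the data: maximum of a variable over all refinement levels."""
    return max(
        value
        for block in dataset.blocks
        if name in block.variables
        for value in block.variables[name]
    )
```

Because maximum only touches the Dataset interface, the same routine could in principle serve any simulation code that writes its output in this form.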
Biggest Bottleneck: Interpretation of Results
• This is the biggest bottleneck because:
– Babysitting
• Scientists spend their “real-time” babysitting computational
experiments (trying to interpret results, and move data, and
orchestrate the computational pipeline).
• Deciding if the analysis routines are working properly with this “new”
data.
– Non-scalable data analysis routines
• Looking for the “needle in the haystack”.
• Better analysis routines could mean less time in the thought process
and in the interpretation of the results.
Important Component: Parallel I/O
– Need for significant developments in parallel I/O.
• Need for a portable, efficient industry standard.
• Need for interoperability between parallel and non-parallel I/O.
– Degree of parallelism varies across the work flow.
• Important in multiple stages of many Work Flows:
– From: Output of simulation data.
– To: I/O for parallel rendering for end-product scientific
visualization.
– Need to cache, archive, replicate, subset, and distribute
large data sets.
• Archival storage required to store data that takes months to
produce.
• Data will be post-processed as it is produced, requiring that it
be cached/staged.
• Replication, subsetting, and distribution serve multiple
purposes (e.g., data staging for visualization).
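The subsetting and staging steps above can be illustrated with a small serial sketch: each time step is cut down to a region of interest as it is produced and written to a staging directory for a later reader such as a visualization tool. The helper names (subset, stage, stage_subsets), the JSON file format, and the file naming are assumptions for illustration only; a production work flow would use a parallel I/O library and a real archive or cache.

```python
import json
import tempfile
from pathlib import Path


def subset(field, rows, cols):
    """Return the rectangular region of a 2-D list selected by two slices."""
    return [row[cols] for row in field[rows]]


def stage(step, region, staging_dir):
    """Write one subset to the staging directory and return its path."""
    path = Path(staging_dir) / f"step_{step:05d}.json"
    path.write_text(json.dumps(region))
    return path


def stage_subsets(steps, rows, cols, staging_dir):
    """Subset each time step as it is produced and stage it for later use."""
    return [
        stage(index, subset(field, rows, cols), staging_dir)
        for index, field in enumerate(steps)
    ]


if __name__ == "__main__":
    # A tiny 4x4 field repeated for two time steps stands in for simulation output.
    field = [[r * 10 + c for c in range(4)] for r in range(4)]
    with tempfile.TemporaryDirectory() as staging_dir:
        for path in stage_subsets(iter([field, field]), slice(1, 3), slice(0, 2), staging_dir):
            print(path.name, json.loads(path.read_text()))
```

Accepting the time steps as an iterator means a step can be subset and staged as soon as it exists, without waiting for the whole run to finish.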
Needed Technologies
            Auto      Data Storage  Data      Data      Metadata  DB Access  Data
            Workflow  and Access    Movement  Analysis            and Query  Visualization
Astro       5 (1)     6 (1)         7 (1)     3 (1/2)   2         1          4 (1/2)
Fusion      6 (3/2)   5 (1/2)       7 (1/2)   4 (1)     2         1          3 (1/2)
Combustion  3         6 (1/2)       7 (1/2)   5 (2)     2         1          4 (1)
Climate     3 (2)     6             7         5         2         1          4 (2)
Nano        7 (1/2)   4 (1/2)       2         6 (1)     3 (1)     1 (1/2)    5 (1/2)
Biology     2         3             4         6 (1)     1         5 (1)      7 (2)