Working Group on Simulation-Driven Applications
Astrophysics, Biology, Climate, Combustion, Fusion, Nanoscience
10 CS, 10 Sim, 1 VR
5/17/2004 Chicago Meeting DOE Data Management

Workflows
• Critical Need: Enable (and Automate) Scientific Workflows
  – Data Storage
  – Data Transfer
  – Data Analysis
  – Visualization
• An order of magnitude more time can be spent on manually managing these workflows than on performing the simulation science itself.

Simulations
• Simulations run in batch mode.
• The remaining workflow is interactive or “on demand.”
• Simulations and analyses are performed by distributed teams of research scientists.
  – Need to access remote and distributed data and resources.
  – Need for distributed collaborative environments.
• Some solutions will be team dependent.
• Example: Remote Viz. vs. Local Viz., Parallel HDF5 vs. Parallel netCDF, …

Let thought be the bottleneck
• Simulation scientists generally have scripts to semi-automate this process.
• To expedite this process they need to:
  – fully automate the workflow,
  – remove the bottlenecks.
• Better visualization and better data analysis routines will allow users to decrease the interpretation time.
• Better routines to “find the needle in the haystack” will decrease the time spent in the thought process.
• Shorter code runtimes will give faster turnaround for simulations. This calls for:
  – Better numerical algorithms.
  – More scalable algorithms.
  – Faster processors, faster networking, faster I/O.
  – Better batch systems…
  – More HPC systems.

Data Management (2)
• To expedite this process they need to:
  – Have a common data model to move data from simulation to analysis to viz.
  – Need for metadata, annotation, and provenance:
    • Nature of Metadata
      – Code versions.
      – Simulation parameters.
      – Model parameters.
      – Information on simulation inputs (e.g., from experiments and/or other simulations).
      – Machine configuration.
      – Compiler information.
  – Need for tools to record provenance in databases.
    • Additional provenance (beyond that provided by the metadata above) needed to describe:
      – Reliability of data.
      – How the data arrived in the form in which it was accessed.
      – Data ownership.

Critical to develop a unified data model
• Can we build analysis routines that can be used for multiple codes? Multiple disciplines?
• Standards.
• Data model must allow flexibility.
  – We commonly add or remove variables used in the simulations and analysis routines.
  – Must deal with AMR calculations.

Biggest Bottleneck: Interpretation of Results
• This is the biggest bottleneck because:
  – Babysitting
    • Scientists spend their “real time” babysitting computational experiments (trying to interpret results, move data, and orchestrate the computational pipeline).
    • Deciding whether the analysis routines are working properly with this “new” data.
  – Non-scalable data analysis routines
    • Looking for the “needle in the haystack”.
    • Better analysis routines could mean less time in the thought process and in the interpretation of the results.

Important Component: Parallel I/O
  – Need for significant developments in parallel I/O.
    • Need for a portable, efficient industry standard.
    • Need for interoperability between parallel and non-parallel I/O.
      – Degree of parallelism varies across the workflow.
    • Important in multiple stages of many workflows:
      – From: Output of simulation data.
      – To: I/O for parallel rendering for end-product scientific visualization.
  – Need to cache, archive, replicate, subset, and distribute large data sets.
    • Archival storage required to store data that takes months to produce.
    • Data will be post-processed as it is produced, requiring that it be cached/staged.
    • Replication, subsetting, and distribution serve multiple purposes (e.g., data staging for visualization).

Needed Technologies

             Auto      Data Storage  Data      Data                DB Access  Data
             Workflow  and Access    Movement  Analysis  Metadata  and Query  Visualization
Astro        5 (1)     6 (1)         7 (1)     3 (1/2)   2         1          4 (1/2)
Fusion       6 (3/2)   5 (1/2)       7 (1/2)   4 (1)     2         1          3 (1/2)
Combustion   3         6 (1/2)       7 (1/2)   5 (2)     2         1          4 (1)
Climate      3 (2)     6             7         5         2         1          4 (2)
Nano         7 (1/2)   4 (1/2)       2         6 (1)     3 (1)     1 (1/2)    5 (1/2)
Biology      2         3             4         6 (1)     1         5 (1)      7 (2)
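The metadata items listed under Data Management (code versions, simulation and model parameters, simulation inputs, machine configuration, compiler information) together with the call for tools that record provenance in databases can be illustrated with a small sketch. This is only one possible shape, not a design proposed by the working group: the `RunMetadata` class, the `provenance` table, the helper names, and every example value are hypothetical, and SQLite stands in for whatever database a team would actually use.

```python
import json
import platform
import sqlite3
from dataclasses import dataclass, field, asdict


@dataclass
class RunMetadata:
    """Hypothetical record covering the metadata items named in the slides."""
    code_version: str
    simulation_parameters: dict
    model_parameters: dict
    input_sources: list            # e.g. experiments and/or other simulations
    compiler: str
    machine: str = field(default_factory=platform.node)  # machine configuration (host name only)
    owner: str = "unknown"         # data ownership
    history: list = field(default_factory=list)  # how the data reached its current form


def record_run(conn, run_id, meta):
    """Store one metadata record as JSON in a (hypothetical) provenance table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS provenance (run_id TEXT PRIMARY KEY, meta TEXT)"
    )
    conn.execute(
        "INSERT INTO provenance VALUES (?, ?)", (run_id, json.dumps(asdict(meta)))
    )
    conn.commit()


def load_run(conn, run_id):
    """Return the stored metadata for run_id as a dict, or None if it is absent."""
    row = conn.execute(
        "SELECT meta FROM provenance WHERE run_id = ?", (run_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0])


# Illustrative values only; none of these come from the slides.
conn = sqlite3.connect(":memory:")
meta = RunMetadata(
    code_version="v1.2.0",
    simulation_parameters={"resolution": 256, "timesteps": 1000},
    model_parameters={"viscosity": 0.01},
    input_sources=["experiment-A"],
    compiler="gcc -O2",
    owner="team-x",
)
meta.history.append("written by simulation")
meta.history.append("subset for visualization")
record_run(conn, "run-001", meta)
stored = load_run(conn, "run-001")
```

Keeping the record as a single JSON document leaves room for the flexibility the slides ask of the data model: variables can be added to or removed from the parameter dictionaries without changing the table layout.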