Earth systems data in real time applications: low latency, metadata, and preservation
Beth Plale, Director, Data To Insight Center, Pervasive Technologies Institute, School of Informatics and Computing, Indiana University Bloomington

The Data
• Sizes range from a couple of KB to a few GB
• Arrival rates range from 12/hr to 2/day
• "Anything older than 10 min isn't interesting"
• Data types: netCDF, ASCII text, "level 2"
• Delivery: NWS watches and warnings, Unidata Internet Data Dissemination (IDD) system (LDM), THREDDS, OPeNDAP

The Workflows (I)
North American Mesoscale (NAM) initialized forecast workflow.
[Figure: workflow DAG — TerrainPreProcessor (4 secs), WrfStatic (338 secs), LateralBoundaryInterpolator (146 secs), 3DInterpolator (88 secs), ARPS2WRF (78 secs), WRF (4570 secs on 16 processors); data products range from 0.2 MB to 2422 MB.]

Workflows (II)
[Figure: workflow DAG — PreInterproscan (30 secs), N=135 parallel Interproscan tasks (5400 secs), PostInterproscan (60 secs), Motif (3600 secs on 256 processors); data products range from 100 KB to 1432 MB.]

Workflow characterization (Ramakrishnan and Plale, under review)
Size
• Total number of tasks
• Number of parallel tasks – max number of parallel tasks (width)
• Longest chain – number of tasks in the longest chain
Resource usage
• Max task processor width – max concurrent number of processors required by the workflow
• Total computation time
• Data sizes – sizes of workflow inputs, outputs, and intermediate data products
Structural patterns
• Sequential – tasks follow one after another
• Parallel – multiple tasks run at the same time
• Parallel-split – one task's output feeds multiple tasks
• Parallel-merge – multiple tasks merge into one task
• Parallel-merge-split – combines parallel-merge and parallel-split
• Mesh – task dependencies are interleaved

Linked Environments for Atmospheric Discovery, LEAD I
• Framework for running the WRF and ARPS tool suites and IDV, using LDM streams and OU ADAS assimilation data
• Executes task sequences ("workflows") on the TeraGrid
• LEAD I ended Sept 2009.
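The size metrics above (total tasks, width, longest chain) can all be computed directly from a workflow's task graph. A minimal Python sketch follows; the DAG is illustrative, loosely modeled on the NAM forecast workflow above, and is not the actual graph from the paper.

```python
# Sketch: computing workflow "size" metrics from a task DAG.
# The DAG below is illustrative, not the actual NAM workflow graph.
from functools import lru_cache

# task -> list of downstream tasks
dag = {
    "TerrainPreProcessor": ["3DInterpolator"],
    "WrfStatic": ["ARPS2WRF"],
    "LateralBoundaryInterpolator": ["ARPS2WRF"],
    "3DInterpolator": ["ARPS2WRF"],
    "ARPS2WRF": ["WRF"],
    "WRF": [],
}

def total_tasks(dag):
    return len(dag)

def longest_chain(dag):
    """Number of tasks on the longest path through the DAG."""
    @lru_cache(maxsize=None)
    def depth(task):
        succs = dag[task]
        return 1 + (max(depth(s) for s in succs) if succs else 0)
    return max(depth(t) for t in dag)

def width(dag):
    """Max number of tasks at the same distance from the workflow start
    (a simple proxy for the max number of parallel tasks)."""
    preds = {t: [] for t in dag}
    for t, succs in dag.items():
        for s in succs:
            preds[s].append(t)
    @lru_cache(maxsize=None)
    def level(task):
        ps = preds[task]
        return 1 + (max(level(p) for p in ps) if ps else 0)
    counts = {}
    for t in dag:
        counts[level(t)] = counts.get(level(t), 0) + 1
    return max(counts.values())
```

For this sketch graph, `total_tasks` is 6, the longest chain (TerrainPreProcessor → 3DInterpolator → ARPS2WRF → WRF) has 4 tasks, and the width is 3, since the three preprocessing tasks can run concurrently.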
LEAD II, housed at IU.
[Figure: LEAD II architecture — real-time obs data inflow → analysis and forecast → postprocess/visualization, with data management and curation, over TeraGrid cyberinfrastructure.]

LEAD I: Science Gateway
• Single sign-on to the portal (Science Gateway) gives access to cloud storage and TeraGrid resources
• Overcame significant hurdles in using TeraGrid to provide a resource that can respond to severe weather events
• Pioneered web service wrappers to incorporate legacy code
• Pioneered large-scale service-oriented architecture (SOA)
– Modularity, a common set of standards, good performance
– Adopting an event-messaging mechanism in the SOA fostered research in provenance, metadata collection, and workflow monitoring

LEAD II: Science-in-a-Box
• The subsystem that carried out workflow orchestration and submission on TeraGrid was complex, and the code delicate
– TeraGrid may not be the right venue for a 24x7 production community resource
• Conversation with Microsoft in Summer 2009 on using the Trident Scientific Workflow Workbench for workflow execution and Windows HPC Server for application execution
– … but not all meteorology codes run on Windows HPC Server
• Support for WRF, ARPS tool suites, IDV, LDM streams, ADAS
• Execute workflows on a local Windows cluster and call out to cloud resources

SC09 in-a-box LEAD/HPC demo
• WRF ARW ideal case, Trident front end
• Work done by Dan Connors, John Michalakes, Tony Heller, Wen-Ming Ye
• Used data from the benchmark page: http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/

Demonstration workflow
• Namelist file configuration using the WRF Domain Wizard
• WRF ported to Windows
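Once ported, the model is launched on the Windows cluster through MSMPI's mpiexec. A minimal, language-neutral sketch of how a workflow activity might assemble that launch command (the helper name, executable path, and process count are illustrative assumptions, not the demo's actual code):

```python
# Sketch: assembling the mpiexec launch command a workflow activity might
# use for the Windows-ported WRF. Path and helper name are illustrative.
def wrf_launch_command(nprocs, wrf_exe=r"C:\WRF\run\wrf.exe"):
    # "mpiexec -n <N> <exe>" is the basic MSMPI launch form
    return ["mpiexec", "-n", str(nprocs), wrf_exe]

cmd = wrf_launch_command(16)
# subprocess.run(cmd, check=True) would start the 16-process WRF run
```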
(Uses MSMPI from the Windows HPC Pack)
• NCL scripts (Linux) running inside Cygwin
• Visualization using Vapor

Linux applications
• Many scientific applications need a Linux environment to execute
• Options for running Linux applications on Windows:
– Port the application to Windows
– Use a Linux emulator
• Cygwin (a Linux emulator) can run most Linux applications
• LEAD-in-the-Box demonstrated, for the first time at SC09, Trident-orchestrated workflow activities running Linux applications through Cygwin

Running a Linux application on Windows
[Figure: layering — the Linux application runs on Cygwin, which runs on Microsoft Windows; the workflow activity runs inside Cygwin.]

Vapor integration
• Visualize parameters extracted from WRF outputs – variations in temperature, pressure, precipitation, etc.
• Vapor scripts run inside workflow activities and convert the outputs of the NCL scripts to a compatible, viewable format

Current effort
• Integrating real-time data into Trident
• Important obs data live in remote data repositories (via http, ftp, OPeNDAP) and in web service data catalogs (WCS, WFS)
• Support Vortex 2 with 5 daily forecasts (with OU, UNC)

Issue I: handling real-time data in workflow systems
Scientific workflows are an accepted approach to executing sequences of tasks. Many geoscience workflows need to interact with sensors that produce large, continuous streams of data, but the programming models provided by scientific workflow systems are not equipped to handle real-time data streams.
Approach: tighter integration and expression of streams in the workflow engine.
Herath and Plale, "Streamflow Programming Model for Data Streaming in Scientific Workflows," CCGrid 2010, Melbourne, June 2010.

Issue 2: sharing and use of scientific data over the long term
"After you have captured the data, you need to curate it before you can start doing any kind of data analysis, and we lack good tools for both data curation and data analysis." "But curation is not cheap.
[…] This is why we need to automate the whole curation process."
– from Jim Gray's essay in The Fourth Paradigm: Data-Intensive Scientific Discovery

What to collect?
Information required to be preserved, for different levels of completeness, and what each level means in e-Science:

Level  Name                               What it means in e-Science
1      Intellectual & technical metadata  Ownership, intellectual property, copyright, and domain-specific attributes
2      Structural metadata                Data products and research objects; semantic information through, e.g., a controlled vocabulary (the CF vocabulary)
3      Provenance                         Lineage of data products as well as of processes
4      Rendering software                 Domain-specific applications & dependency libraries
5      Processing software                (Draw the line here: determines the scope of what we think we can collect)

XMC Cat metadata catalog
• Strength is adaptability to new community metadata schemas
• The schema is partitioned based on concepts: identification, citation, description, spatial data, keywords, distribution, contact, order process, …
• Metadata is "shredded" to relational tables
• Complex search …
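The "shredding" idea — decomposing an XML metadata record into per-concept relational rows so that concept-level queries hit simple tables — can be sketched as follows. The record, the concept names, and the single-table layout are illustrative, not XMC Cat's actual schema or design.

```python
# Sketch: "shredding" an XML metadata record into per-concept rows in a
# relational store. Concept names and table layout are illustrative.
import sqlite3
import xml.etree.ElementTree as ET

CONCEPTS = ["identification", "citation", "keywords"]  # subset, for illustration

def shred(xml_text, conn):
    """Extract each concept's elements from the record and store them as rows."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept (entry_id TEXT, name TEXT, value TEXT)")
    entry_id = root.get("id")
    for name in CONCEPTS:
        for elem in root.iter(name):
            conn.execute("INSERT INTO concept VALUES (?, ?, ?)",
                         (entry_id, name, (elem.text or "").strip()))
    conn.commit()

record = """<metadata id="ds-42">
  <identification>NAM 12km forecast</identification>
  <citation>Plale et al.</citation>
  <keywords>WRF</keywords>
  <keywords>mesoscale</keywords>
</metadata>"""

conn = sqlite3.connect(":memory:")
shred(record, conn)
# a concept-level search is now an ordinary relational query
rows = conn.execute(
    "SELECT value FROM concept WHERE entry_id = 'ds-42' AND name = 'keywords'").fetchall()
```

The point of the sketch: once shredded, a complex search over many concepts becomes a join over small relational tables rather than a scan of whole XML documents.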
• Build response from concepts; the query result is based on the community XML schema
• Concepts stored as XML
Scott Jensen, primary author, is present at the workshop.

Collecting metadata: role of the XMC Cat metadata catalog
[Figure: workflows and the portal communicate over a workflow message bus; the metadata catalog records workflow configuration and inputs, intermediate results, outputs, and workflow notifications; search results are returned to the portal.]

Automating metadata capture
[Figure: compute nodes register data products with the myLEAD agent; data is archived to the data repository with minimal source metadata recorded; registration events are added to a queue in the XMC Cat metadata catalog, where workers and plugins post-process the data registration events into the database; scientists query over complex metadata from the LEAD portal.]

Preservation packet – the research object
[Figure: interactions between the Preservation System for e-Science and the domain science CI —
1. Request to archive an experiment
2. Request to collect metadata for artifacts (XMC Cat metadata catalog)
3. Metadata for artifacts
4. Request to collect provenance (Karma Provenance Collection Service)
5. Provenance information
6. Request to transfer files with given IDs (Name Resolution & File Transfer Service)
7. Files
8. Request to collect optional service information (optional Service Registry & Code Repository)
9. Optional service information
Archive & preservation in FedoraCommons (future work).]

Dissemination: query & discovery using Preservation XMC Cat
1. Complex query over domain-specific metadata
2. IDs of entries that match the query
3. Fetch preservation packages based on the IDs
4. Preservation packages

Sun and Plale, Provenance for Preservation, under review.
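The dissemination exchange (query → matching IDs → fetch packages by ID) can be sketched as a pair of catalog/archive operations. The in-memory stores, entry IDs, and all function names below are illustrative assumptions, not the Preservation XMC Cat API.

```python
# Sketch of the dissemination sequence: (1) a complex query over
# domain-specific metadata, (2) IDs of matching entries, (3-4) preservation
# packages fetched by ID. Stores and names are illustrative.

CATALOG = {  # entry_id -> domain-specific metadata
    "exp-001": {"model": "WRF", "region": "CONUS 12km"},
    "exp-002": {"model": "ARPS", "region": "Oklahoma"},
}
ARCHIVE = {  # entry_id -> preservation package (the research object)
    "exp-001": {"provenance": ["Karma trace"], "files": ["wrfout.nc"]},
    "exp-002": {"provenance": ["Karma trace"], "files": ["arpsout.nc"]},
}

def query_metadata(**criteria):
    """Steps 1-2: return IDs of catalog entries matching all criteria."""
    return [eid for eid, md in CATALOG.items()
            if all(md.get(k) == v for k, v in criteria.items())]

def fetch_packages(entry_ids):
    """Steps 3-4: fetch preservation packages for the given IDs."""
    return [ARCHIVE[eid] for eid in entry_ids]

ids = query_metadata(model="WRF")
packages = fetch_packages(ids)
```

Separating the metadata query from package retrieval mirrors the figure: the catalog answers rich, domain-specific queries cheaply, and only the packages actually requested are pulled from the archive.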