Earth systems data in real time applications: low
latency, metadata, and preservation
Beth Plale
Director, Data To Insight Center of Pervasive
Technologies Institute
School of Informatics and Computing
Indiana University Bloomington
The Data
Ranges: a) sizes from a couple KB to a few GB; b) arrival rates from 12/hr to 2/day; c) “anything older than 10 min isn’t interesting”
Types: netCDF, ASCII text, “level 2”
Delivery: NWS watches and warnings, Unidata Internet Data Dissemination system (IDD) (LDM), THREDDS, OPeNDAP.
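The 10-minute rule quoted above can be read as a freshness filter applied as products arrive. The following is a minimal sketch, assuming each product carries a timezone-aware observation time; the name `is_fresh`, the product names, and the timestamps are illustrative and not part of LEAD.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "older than 10 min isn't interesting" is applied as a hard cutoff.
MAX_AGE = timedelta(minutes=10)

def is_fresh(observed_at, now, max_age=MAX_AGE):
    """Return True if the product is no older than max_age at time `now`."""
    return now - observed_at <= max_age

# Hypothetical arrivals, keyed by an illustrative product name.
now = datetime(2010, 6, 1, 18, 0, tzinfo=timezone.utc)
arrivals = {
    "radar_level2": now - timedelta(minutes=4),
    "surface_obs": now - timedelta(minutes=25),
}
fresh = sorted(name for name, t in arrivals.items() if is_fresh(t, now))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check reproducible when replaying archived streams.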
The Workflows (I)
[Figure: North American Mesoscale (NAM) initialized forecast workflow. Tasks shown: Terrain PreProcessor, WrfStatic, Lateral Boundary Interpolator, 3D Interpolator, ARPS2WRF, WRF. Task runtimes shown: 338 secs, 4 secs, 146 secs, 88 secs, 78 secs, and 4570 secs on 16 processors. Data sizes shown: 0.2 MB, 0.2 MB, 147 MB, 488 MB, 19 MB, 147 MB, 243 MB, 206 MB, 2422 MB, 13 MB.]
Workflows (II)
[Figure: Motif workflow. Labels shown: Pre Interproscan, Interproscan … Interproscan (N=135), Post Interproscan, Motif. Runtimes shown: 30 secs, 5400 secs, 3600 secs on 256 processors, 60 secs. Data sizes shown: 100 KB, 500 KB, 71 MB, 599 MB, 599 MB, 1432 MB.]
Size
• Total Number of Tasks
• Number of Parallel Tasks - max number of parallel tasks (width)
• Longest Chain - number of tasks in longest chain
Resource Usage
• Max task processor width - max concurrent number of processors required by workflow.
• Total Computation time.
• Data Sizes - sizes of workflow inputs, outputs and intermediate data products.
Structural pattern
• Sequential - tasks that follow one after another.
• Parallel - multiple tasks run at same time.
• Parallel-split - one task's output feeds to multiple tasks.
• Parallel-merge - multiple tasks merge into one task.
• Parallel-merge-split - parallel-merge and parallel-split.
• Mesh - task dependencies are interleaved.
Ramakrishnan and Plale, under review
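The three Size characteristics can be computed directly from a workflow's dependency graph. The sketch below is an illustration only, not the method of the cited paper: it assumes the workflow is a DAG given as a map from each task to the tasks it depends on, and it approximates width as the largest number of tasks at the same depth. The task names and the pre/scan/post shape are hypothetical.

```python
from collections import Counter
from graphlib import TopologicalSorter  # Python 3.9+

def workflow_size(preds):
    """preds maps each task to the set of tasks it depends on."""
    depth = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(preds).static_order():
        depth[task] = 1 + max((depth[p] for p in preds.get(task, ())), default=0)
    per_level = Counter(depth.values())
    return {
        "total_tasks": len(depth),
        # Assumption: tasks at the same depth can run concurrently.
        "max_parallel_tasks": max(per_level.values(), default=0),
        "longest_chain": max(depth.values(), default=0),
    }

# Hypothetical parallel-split / parallel-merge workflow with 135 middle tasks.
scans = [f"scan_{i}" for i in range(135)]
preds = {"pre": set(), "post": set(scans)}
preds.update({s: {"pre"} for s in scans})
metrics = workflow_size(preds)
```

For this shape the sketch reports 137 tasks, a width of 135, and a longest chain of 3.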
Linked Environments for Atmospheric Discovery, LEAD I
• Framework for running WRF, ARPS tool suites, IDV, using LDM
streams and OU ADAS assimilation data
• Execute task sequences “workflows” on Teragrid
• LEAD I ended Sept 09. LEAD II housed at IU.
[Figure: Cyberinfrastructure Model. Components shown: real time obs data inflow; analysis and forecast; data management and curation; postprocess / visualization; Teragrid.]
LEAD I : Science Gateway
• Single sign-on to portal (Science Gateway) gives access to
cloud storage and Teragrid resources
• Overcame significant hurdles in using Teragrid to provide
resources for responding to severe weather events.
• Pioneered web service wrapper to incorporate legacy code
• Pioneered large scale service oriented architecture (SOA)
– Modularity, common set of standards, good performance
– Adopting Event messaging mechanism in SOA fostered research in
provenance, metadata collection, and workflow monitoring
LEAD II : Science-in-a-Box
• Subsystem that carried out Workflow orchestration and
submission on Teragrid was complex and the code delicate.
– Teragrid may not be right venue for 24x7 production community
resource
• Conversation with Microsoft Summer 2009 on using Trident
Scientific Workflow Workbench for workflow execution and
Windows HPC Server for application execution.
– … but not all meteorology codes run on Windows HPC Server
• Support for WRF, ARPS tool suites, IDV, LDM stream, ADAS
• Execute workflows on local Windows cluster and call out to
cloud resources
SC09 in-a-box LEAD/HPC Demo
• WRF ARW Ideal Case, Trident front end
• Work done by Dan Connors, John Michalakes,
Tony Heller, Wen-Ming Ye.
• Used data from benchmark page:
http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3/
Demonstration workflow
• Namelist file configuration using WRF Domain Wizard
• WRF ported to Windows (uses MSMPI from Windows HPC Pack)
• NCL scripts (Linux) running inside Cygwin
• Visualization using Vapor
Linux applications
• Many scientific applications need a Linux
environment to execute
• Options to run Linux applications on Windows
are:
– Porting the application to Windows
– Use Linux emulator
• Cygwin (a Linux-like compatibility environment for Windows,
loosely an emulator) can run most Linux applications
• LEAD-in-the-box, demonstrated for the first time at SC09,
had Trident-orchestrated workflow activities running Linux
applications through Cygwin
Running Linux Application on Windows
[Figure: software stack, top to bottom: Linux Application, Cygwin, Microsoft Windows. Workflow runs inside Cygwin.]
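As an illustration of what a workflow activity has to do to launch a Linux tool through Cygwin, the sketch below builds the command line for Cygwin's `bash`. It is not Trident code (Trident activities are .NET), and the install path `C:\cygwin\bin\bash.exe`, the script location, and the helper names are assumptions. The `/cygdrive/<letter>/` mapping is Cygwin's default drive prefix.

```python
from pathlib import PureWindowsPath
import shlex

def to_cygwin_path(win_path):
    """Map C:\\dir\\file to /cygdrive/c/dir/file (Cygwin's default drive prefix)."""
    p = PureWindowsPath(win_path)
    drive = p.drive.rstrip(":").lower()
    return "/cygdrive/" + "/".join([drive, *p.parts[1:]])

def cygwin_command(program, win_paths, bash=r"C:\cygwin\bin\bash.exe"):
    """Build an argv list that runs `program` on the given files inside Cygwin bash."""
    # Assumption: `bash` points at the local Cygwin installation.
    words = [program, *map(to_cygwin_path, win_paths)]
    line = " ".join(shlex.quote(w) for w in words)
    return [bash, "--login", "-c", line]

# Hypothetical NCL script location on the Windows side.
cmd = cygwin_command("ncl", [r"C:\lead\scripts\plot_temp.ncl"])
```

On a Windows host with Cygwin installed, the resulting list could be handed to `subprocess.run(cmd)`.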
Vapor Integration
• Visualize parameters extracted
from WRF outputs
– Temperature, pressure,
precipitation, etc., variations
• Vapor scripts run inside
Workflow Activities and convert
outputs of the NCL scripts to a
compatible, viewable format
Current effort
• Integrating real time data into Trident
• Important obs data in remote data repositories via (http,
ftp, OPeNDAP), Web service data catalogs (WCS, WFS)
• Support Vortex 2 with 5 daily forecasts (with OU, UNC)
Issue I: handling real time data in workflow systems
Scientific workflows are an accepted approach to executing
sequences of tasks. Many geoscience workflows need to
interact with sensors that produce large continuous streams
of data, but the programming models provided by scientific
workflow systems are not equipped to handle real time data streams.
Approach: tighter integration and expression of
streams in workflow engine
Herath and Plale, Streamflow Programming Model for Data Streaming in
Scientific Workflows, CCGrid 2010, Melbourne, June 2010
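One common way to bridge a continuous stream and a task-oriented workflow is to cut the stream into windows and run a task once per window. The sketch below shows that general idea only; it is not the Streamflow programming model from the cited paper, and the function names and window size are illustrative.

```python
from itertools import islice

def tumbling_windows(stream, size):
    """Yield consecutive, non-overlapping lists of up to `size` events."""
    it = iter(stream)
    while True:
        window = list(islice(it, size))
        if not window:
            return
        yield window

def run_on_stream(stream, task, size):
    """Lazily apply a workflow task to each window of the stream."""
    for window in tumbling_windows(stream, size):
        yield task(window)

# Illustrative stream of five readings, summed two at a time.
results = list(run_on_stream(iter([1, 2, 3, 4, 5]), sum, size=2))
```

Because both functions are generators, the task fires as soon as a window fills, which is what an unbounded sensor feed requires.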
Mechanism
Issue II: Sharing and use of scientific
data over long term
“After you have captured the data, you need to
curate it before you can start doing any kind of
data analysis, and we lack good tools for both
data curation and data analysis.”
“But curation is not cheap. […] This is why we
need to automate the whole curation
process.”
From Jim Gray essay in The Fourth Paradigm: Data-Intensive Scientific Discovery
What to collect? Information required to be preserved for different
levels of completeness and what they mean in e-Science
Level | Name | What it means in e-Science
1 | Intellectual & Technical Metadata | Ownership, intellectual property, copyright, and domain-specific attributes
2 | Structural Metadata | Data products and research objects; semantic information through, e.g., a controlled vocabulary (CF vocabulary)
3 | Provenance | Lineage of data products as well as that of processes
4 | Rendering Software | Domain-specific applications & dependency libraries
5 | Processing Software |
Draw line here: determines scope of what we think we can collect
XMC Cat metadata catalog
strength is adaptability to new community schema
[Figure: XMC Cat architecture. Labels shown: metadata schema is partitioned based on concepts (identification, citation, description, spatial data, keywords, distribution, contact, order process, ...); concepts stored as XML; metadata “shredded” to relational tables; complex search; build response from concepts; query result based on community XML schema.]
Scott Jensen, primary author, is present at workshop
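The idea of keeping each concept as XML while also "shredding" its values into relational tables for search can be sketched as follows. This is an illustration under stated assumptions, not XMC Cat's actual schema or API: the table layout, the sample document, and the function names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata document; each child of the root is one "concept".
DOC = """<metadata>
  <citation><title>WRF forecast run</title><origin>LEAD</origin></citation>
  <keywords><theme>precipitation</theme><theme>temperature</theme></keywords>
</metadata>"""

def shred(conn, object_id, xml_text):
    """Store each concept as XML and shred its leaf values into a search table."""
    conn.execute("CREATE TABLE IF NOT EXISTS concept "
                 "(id INTEGER PRIMARY KEY, object_id TEXT, name TEXT, xml TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS element "
                 "(concept_id INTEGER, name TEXT, value TEXT)")
    for concept in ET.fromstring(xml_text):
        cur = conn.execute(
            "INSERT INTO concept (object_id, name, xml) VALUES (?, ?, ?)",
            (object_id, concept.tag, ET.tostring(concept, encoding="unicode")))
        for leaf in concept.iter():
            if len(leaf) == 0 and leaf.text and leaf.text.strip():
                conn.execute("INSERT INTO element VALUES (?, ?, ?)",
                             (cur.lastrowid, leaf.tag, leaf.text.strip()))

def find_objects(conn, element_name, value):
    """Search the shredded values and return the matching object IDs."""
    rows = conn.execute(
        "SELECT DISTINCT c.object_id FROM concept c "
        "JOIN element e ON e.concept_id = c.id "
        "WHERE e.name = ? AND e.value = ?", (element_name, value))
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
shred(conn, "run-1", DOC)
hits = find_objects(conn, "theme", "precipitation")
```

Searches run against the flat `element` table, while a response can be rebuilt from the XML kept in `concept`, so the result still follows the community schema.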
Collecting metadata: role of XMC Cat metadata catalog
[Figure: role of the metadata catalog among workflow components. Legible labels: Workflow, Portal, Metadata Catalog, Workflow Configuration, Outputs, Intermediate Results, Search Results, Notifications, Workflow Message Bus.]
Automating Metadata Capture
[Figure: automated metadata capture pipeline.]
• Nodes register data products (myLEAD Agent)
• Data products are archived to the data repository
• Minimal source metadata is recorded (XMC Cat Metadata Catalog, database)
• Registration events are added to the data registration event queue
• Post-processing of data registration events (workers, plugins)
• Scientists query over complex metadata (LEAD Portal)
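The queue-and-plugin arrangement above can be sketched in a few lines. This is a minimal single-process illustration, not the myLEAD Agent or XMC Cat implementation: the event fields, the two plugins, and the in-memory `catalog` dictionary are all hypothetical stand-ins.

```python
from queue import Queue, Empty

def register(event, catalog, events):
    """Record minimal source metadata right away and queue the event for later."""
    catalog[event["id"]] = {"name": event["name"]}
    events.put(event)

def format_plugin(event):
    # Hypothetical plugin: derive the file format from the extension.
    return {"format": event["name"].rsplit(".", 1)[-1]}

def size_plugin(event):
    # Hypothetical plugin: coarse size class from the byte count.
    return {"size_class": "large" if event["bytes"] > 10**9 else "small"}

def drain(events, catalog, plugins):
    """Worker loop: post-process queued registration events until the queue is empty."""
    while True:
        try:
            event = events.get_nowait()
        except Empty:
            return
        for plugin in plugins:
            catalog[event["id"]].update(plugin(event))

catalog, events = {}, Queue()
register({"id": "p1", "name": "wrfout_d01.nc", "bytes": 2_500_000_000}, catalog, events)
register({"id": "p2", "name": "namelist.txt", "bytes": 4_096}, catalog, events)
drain(events, catalog, [format_plugin, size_plugin])
```

Registration stays cheap because only the minimal record is written synchronously; the richer metadata is filled in by whichever plugins the workers run afterward.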
Preservation packet – the research object
[Figure: archiving an experiment as a preservation packet. Components shown: Preservation System for e-Science; XMC Cat Metadata Catalog for Domain Science CI; Karma Provenance Collection Service; Name Resolution & File Transfer Service; Optional Service Registry & Code Repository; FedoraCommons (Future Work). Functions shown: Archive & Preservation; Dissemination; Query & Discovery using Preservation XMC Cat.]
1. Request to archive an experiment
2. Request to collect metadata for artifacts
3. Metadata for artifacts
4. Request to collect provenance
5. Provenance information
6. Request to transfer files with given IDs
7. Files
8. Request to collect optional service information
9. Optional service information
Sun and Plale, Provenance for Preservation, under review
[Figure: query and discovery of preservation packages.]
1. Complex query over domain-specific metadata
2. IDs of entries that match the query
3. Fetch Preservation Packages based on the IDs
4. Preservation Packages
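The nine archiving steps amount to a coordinator that asks each service for its part and bundles the answers. The sketch below illustrates that flow under stated assumptions: the `PreservationPacket` fields, the function signature, and the stub services are hypothetical, and the real catalog, Karma, and file transfer interfaces are not shown in the source.

```python
from dataclasses import dataclass, field

@dataclass
class PreservationPacket:
    experiment_id: str
    metadata: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)
    files: dict = field(default_factory=dict)
    service_info: dict = field(default_factory=dict)

def archive_experiment(experiment_id, artifact_ids, catalog, provenance,
                       transfer, registry=None):
    """Assemble a packet; each service is passed in as a plain callable."""
    packet = PreservationPacket(experiment_id)                    # step 1
    packet.metadata = {a: catalog(a) for a in artifact_ids}       # steps 2-3
    packet.provenance = provenance(experiment_id)                 # steps 4-5
    packet.files = transfer(artifact_ids)                         # steps 6-7
    if registry is not None:                                      # steps 8-9 (optional)
        packet.service_info = registry(experiment_id)
    return packet

# Stub services standing in for the metadata, provenance, and transfer services.
packet = archive_experiment(
    "exp-42", ["a1", "a2"],
    catalog=lambda a: {"title": f"artifact {a}"},
    provenance=lambda e: [("a1", "wasGeneratedBy", "wrf-run")],
    transfer=lambda ids: {a: f"{a}.nc" for a in ids},
)
```

Leaving `registry` as an optional argument mirrors the diagram, where the service registry lookup is marked optional.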