Metadata, Provenance, and Search in e-Science Beth Plale

Director, Center for Data and Search Informatics
School of Informatics
Indiana University
May 29, 2007
Credits:
PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen.
Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure
Sept 17, 2007
Nature of Computational Science Discovery
• Extract data from heterogeneous databases,
• Execute task sequences (“workflows”) on your behalf,
• Mine data from sensors and instruments and respond to it,
• Try out new algorithms,
• Explore data through visualization, and
• Go back and repeat steps again: with new data, answering new questions, or with new algorithms.
How is this discovery process supported today?
• Through cyberinfrastructure.
CyberInfrastructure that supports
• On-demand knowledge discovery
• Automated experiment management (data and workflow)
• Data protection, and automated data product provenance tracking.
CyberInfrastructure: framework for discovery
• Plug and play data sources and analysis tools. Complex what-if scenarios. Through
  • User portal
  • Personal metadata catalog of data exploration results
  • Data product index/catalog
  • Data provenance service
  • Workflow engine and composition tools
• Tied together with Internet-scale event bus.
• Results publishable to digital library.
Cyberinfrastructure for computing: DSI DataCenter
Supports analysis, use, visualization and search research. Supports multiple datasets.
Distributed services provide functional capability
Vision for Data Handling
• Capturing metadata about data sets as generated is key
  • Syntactic: file size, date of creation
  • Semantic or domain specific: spatial region, logical time
• Context of file is key search parameter
• Provenance, or history of data product, needed to assess quality
• Volume of data used in computational science too large: manage on behalf of user
• Indexes help efficiency
The Realization in Software
[Figure: service architecture. Components shown: User’s Browser; Portal server; Workflow graph; Workflow Engine; Application services; App factory; Compute Engine; Event Notification Bus; MyLEAD Agent service; MyLEAD User Metadata catalog; Data Catalog service; Data Management Service; Provenance Collection service; Data Storage.]
Infrastructure is portal based - that is, all services are available through a web server
e-Science Gateway Architecture
[Figure: layered gateway architecture, top to bottom.]
• User’s Grid Desktop
• Grid Portal Server
• Gateway Services: Proxy Certificate Server (Vault); Application Deployment; Events & Messaging; Resource Registry; Workflow engine; Community & User Metadata Catalog; Resource Broker
• Core Grid Services: Security Services; Information Services; Self Management; Resource Management; Execution Management; Data Services
• Resource Virtualization (OGSA)
• Compute Resources; Data Resources; Instruments & Sensors
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al.; ICSOC, 2005
LEAD-CI Cyberinfrastructure
• Workflows run on the LEADgrid and on Teragrid.
• Portal and persistent back-end web services run on LEADgrid.
• Data storage resources for storing user-generated data products are provided by Indiana University.
Typical weather forecast runs as workflow
[Figure: forecast workflow graph in four stages: Pre-Processing, Assimilation, Forecast, Visualization.]
• Input data: terrain data files; surface data files; radar data (level II); radar data (level III); satellite data; surface, upper air mesonet & wind profiler data; ETA, RUC, GFS data
• Processing steps shown: arpstrn, arpssfc, 88d2arps, nids2arps, mci2arps, Ext2arps-ibc, Ext2arps-lbc, ADAS assimilation, arps2wrf, WRF, wrf2arps, arpsplot, IDV viz
~400 Data Products Consumed & Produced – transformed – during Workflow Lifecycle
To set up workflow experiment, we select a workflow (not shown) then set model parameters here
Supported community data collections
Data Integration
• Globally integrated view: Data Catalog Service
  • Crawler crawls catalogs;
  • Builds index of results;
  • Web service API;
  • Boolean search query with spatial/temporal support
  • Index: XMLDB native XML database and Lucene for index
  • List of results as LEAD Metadata Schema documents
• Local view: crosswalk point of presence supports crawling, publishes difference list as LEAD Metadata Schema (LMS) documents
• Crawled sources (crosswalks):
  • CASA radar collection, months (ftp), Oklahoma
  • Unidata IDD distribution, latest 3 days, Indiana (XML web server)
  • Level II and III radar, latest 3 days, Colorado (XML web server)
  • ETA, NCEP, NAM, METAR, etc., Colorado (XML web server)
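A Boolean query with spatial and temporal support can be sketched as a filter over catalog entries. This is only an illustration, not the Data Catalog Service API: the `Entry` record, the `search` function, its keyword arguments, and the sample entries are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical index entry built by a crawler for one data product."""
    doc_id: str
    keywords: frozenset
    bbox: tuple  # (south, west, north, east) in degrees
    start: str   # ISO 8601 timestamps of equal format compare as strings
    end: str


def boxes_overlap(a, b):
    """True when two (south, west, north, east) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(entries, all_of=(), any_of=(), none_of=(),
           bbox=None, start=None, end=None):
    """Boolean keyword query (AND / OR / NOT) with optional space and time limits."""
    hits = []
    for e in entries:
        if not set(all_of) <= e.keywords:
            continue  # AND terms: every one must be present
        if any_of and not (set(any_of) & e.keywords):
            continue  # OR terms: at least one must be present
        if set(none_of) & e.keywords:
            continue  # NOT terms: none may be present
        if bbox is not None and not boxes_overlap(e.bbox, bbox):
            continue
        if start is not None and e.end < start:
            continue
        if end is not None and e.start > end:
            continue
        hits.append(e.doc_id)
    return hits


# Illustrative entries only; identifiers and extents are made up.
CATALOG = [
    Entry("radar-ok", frozenset({"radar", "level2"}), (33, -103, 37, -94),
          "2007-03-27T13:00:00", "2007-03-27T18:00:00"),
    Entry("radar-co", frozenset({"radar", "level3"}), (37, -109, 41, -102),
          "2007-03-27T13:00:00", "2007-03-27T18:00:00"),
    Entry("metar-ok", frozenset({"metar", "surface"}), (33, -103, 37, -94),
          "2007-03-20T00:00:00", "2007-03-21T00:00:00"),
]

radar_hits = search(CATALOG, all_of=["radar"], none_of=["level3"],
                    bbox=(34, -100, 36, -96),
                    start="2007-03-27T12:00:00", end="2007-03-27T14:00:00")
other_hits = search(CATALOG, any_of=["metar", "level3"])
```

A real index (for example one backed by Lucene, as the slide mentions) would avoid the linear scan; the scan is used here only to keep the query semantics visible.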
LEAD Personal Workspace
• CyberInfrastructure extends user’s desktop to incorporate vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
• Portal provides ways to explore this data and search and discover it.
• Metadata about experiments is largely automatically generated, and highly searchable.
• Describes data object (the file) in application-rich terms, and provides URI to data service that can resolve an abstract unique identifier to real, on-line data “file”.
Searching for experiments using model configuration parameters: 2 attributes selected
Searching for experiments based on model parameters: 4 returned experiments; one displayed
How forecast model configuration parameters are stored in personal catalog
Forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment
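The shredding step can be illustrated with a short sketch. It assumes that "shredding" means flattening each leaf element of the configuration document into an (experiment, attribute path, value) row; the `shred` function, the row layout, and the element names in `CONFIG` are hypothetical and are not taken from the myLEAD plugin.

```python
import xml.etree.ElementTree as ET


def shred(xml_text, experiment_id):
    """Flatten every non-empty leaf element into (experiment_id, path, value)."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, path):
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                rows.append((experiment_id, path, text))
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return rows


# Hypothetical model configuration; element names are illustrative only.
CONFIG = """
<modelConfig>
  <grid><dx>1000</dx><dy>1000</dy></grid>
  <forecast><hours>6</hours></forecast>
</modelConfig>
"""

rows = shred(CONFIG, "exp-1")

# Once shredded, an attribute search is a plain filter over the rows.
matches = [r for r in rows if r[1] == "modelConfig/grid/dx" and r[2] == "1000"]
```

Storing one row per attribute is what makes a search on two or four model parameters, as in the portal screens, a simple conjunction of row lookups.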
What & Why of Provenance
• Derivation history of a data product
  • What (when, where) application created the data
  • Its parameters & configuration
  • Other input data used by application
[Figure: Data.In.1, Data.In.2, and Config.A feed Application A, which produces Data.Out.1]
Data Provenance::Data.Out.1
Process: Application_A
Timestamp: 2006-06-23T12:45:23
Host: tyr20.cs.indiana.edu …
Input: Data.In.1, Data.In.2
Config: Config.A
• Workflow is composed from building blocks like these. So provenance for data used in workflow gives workflow trace
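How per-product provenance yields a workflow trace can be sketched as follows. The record fields mirror the example above; the `ProvenanceRecord` class, the `workflow_trace` function, the second application `Application_B`, and the host name are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    product: str
    process: str
    timestamp: str
    host: str
    inputs: list = field(default_factory=list)
    config: str = ""


def workflow_trace(product, records):
    """Follow input links backwards; return processes in execution order."""
    trace, seen = [], set()

    def visit(name):
        rec = records.get(name)
        if rec is None or name in seen:
            return  # raw input with no recorded producer, or already visited
        seen.add(name)
        for parent in rec.inputs:
            visit(parent)
        if rec.process not in trace:
            trace.append(rec.process)

    visit(product)
    return trace


# Data.Out.1 follows the slide; Data.Out.2 and the host name are hypothetical.
RECORDS = {
    "Data.Out.1": ProvenanceRecord(
        "Data.Out.1", "Application_A", "2006-06-23T12:45:23",
        "host-a.example.org", ["Data.In.1", "Data.In.2"], "Config.A"),
    "Data.Out.2": ProvenanceRecord(
        "Data.Out.2", "Application_B", "2006-06-23T12:50:00",
        "host-a.example.org", ["Data.Out.1"], "Config.B"),
}

trace = workflow_trace("Data.Out.2", RECORDS)
```

Each record describes one building block, so chaining records through shared data products reconstructs the sequence of applications that led to a given output.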
The What & Why of Provenance
• Trace Workflow Execution
  • What services were used during workflow execution?
  • Validate if all steps of execution successful?
• Audit Trail
  • What resources were used during workflow execution?
• Data Quality & Reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
• Attribution
  • Who performed the experiment?
  • Who owns the workflow & data products?
• Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded
Collection Framework
[Figure: Karma provenance collection framework.]
• Workflow Instance: Workflow Engine invoking Service 1, Service 2, …, Service 9, Service 10; 10 Data Products Consumed (10C) & Produced (10P) by each Service
• Workflow Engine: Workflow–Started & –Finished Activities
• Services: Application–Started & –Finished, Data–Produced & –Consumed Activities
• Publish Provenance Activities as Notifications on the Message Bus (WS-Messenger Notification Broker, WS-Eventing Service API)
• Karma Provenance Service: Provenance Listener; Subscribe & Listen to Activity Notifications; Activity DB
• Provenance Browser Client: Query for Workflow, Process, & Data Provenance through Provenance Query API
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006
Generating Karma Provenance Activities
• Instrument applications to publish provenance
• Simple Java Library available to
  • Create provenance activities
  • Publish activities as messages
• Jython “wrapper” scripts use library to publish provenance & invoke application
• Generic Factory toolkit easily converts applications to web service
  • Built-in provenance instrumentation
Sample Sequence of Activities
appStarted(App1)
info(‘App1 starting’)
fileReceiveStarted(File1)
-- do gridftp get to stage input file File1 --
fileReceiveFinished(File1)
fileConsumed(File1)
computationStarted(Code1)
-- call Fortran code Code1 to process input files --
computationFinished(Code1)
fileProduced(File2)
fileSendStarted(File2)
-- do gridftp put to save output file File2 --
fileSendFinished(File2)
publishURL(File2)
appFinishedSuccess(App1, File2) | appFinishedFailed(App1, ERR)
flush()
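A wrapper script in the spirit of this sequence might look like the sketch below. It is not the Karma library: `RecordingNotifier` is a hypothetical stand-in that only buffers activities, and `run_wrapped` covers a subset of the activities listed above. The activity names are taken from the sequence; everything else is an assumption.

```python
class RecordingNotifier:
    """Stand-in publisher: buffers activities until flush() is called."""

    def __init__(self):
        self.pending = []
        self.published = []

    def publish(self, activity, *args):
        self.pending.append((activity, args))

    def flush(self):
        self.published.extend(self.pending)
        self.pending.clear()


def run_wrapped(notifier, app, in_file, out_file, compute):
    """Invoke compute(in_file, out_file), publishing activities around it."""
    notifier.publish("appStarted", app)
    try:
        notifier.publish("fileConsumed", in_file)
        notifier.publish("computationStarted", app)
        compute(in_file, out_file)
        notifier.publish("computationFinished", app)
        notifier.publish("fileProduced", out_file)
        notifier.publish("appFinishedSuccess", app, out_file)
        return True
    except Exception as err:
        notifier.publish("appFinishedFailed", app, str(err))
        return False
    finally:
        notifier.flush()  # always push out whatever was recorded


def broken(in_file, out_file):
    raise RuntimeError("ERR")


ok = RecordingNotifier()
ok_result = run_wrapped(ok, "App1", "File1", "File2", lambda i, o: None)

bad = RecordingNotifier()
bad_result = run_wrapped(bad, "App1", "File1", "File2", broken)
```

Keeping the flush in a `finally` clause means a failed application still leaves a complete provenance trail ending in `appFinishedFailed`.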
Performance perturbation
[Figure: chart of provenance overhead across the workflow. X-axis: Workflow Application Scripts Execution Sequence. Y-axes: Cumulative Time for Execution (Secs); Provenance Overhead for Each Script (Secs). Series: Cumulative Time w/o Provenance (Secs); Cumulative Time w/ Provenance (Secs); Provenance Overhead.]
Standalone tool for provenance collection and experience reuse: future direction
Forecast start time can also be set to occur on severe weather conditions (not shown here)
Weather triggered workflows
• Goal is cyberinfrastructure that allows scientists and students to run weather models dynamically and adaptively in response to weather events.
• Accomplished by coupling events processing and triggered forecast workflows
• Vijayakumar et al (2006) presented framework for this purpose
  • Events-processing system does temporal and spatial filtering.
  • Storm detection algorithm (SDA) detects storm events in remaining streams
  • SDA returns detected storm events
  • Events processing system generates trigger to workflow engine
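The four steps of this framework can be sketched end to end. The sketch makes several assumptions: `RadarEvent`, the bounding box, and the trigger dictionary are invented for the example, and `detect_storms` is a simple reflectivity threshold used as a placeholder, not the actual storm detection algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RadarEvent:
    time: float          # seconds since an arbitrary epoch
    lat: float
    lon: float
    reflectivity: float  # dBZ


def spatio_temporal_filter(events, box, t_start, t_end):
    """Step 1: keep events inside the region and the time window."""
    south, west, north, east = box
    return [e for e in events
            if t_start <= e.time <= t_end
            and south <= e.lat <= north
            and west <= e.lon <= east]


def detect_storms(events, threshold=50.0):
    """Steps 2-3: placeholder detector returning events above a threshold."""
    return [e for e in events if e.reflectivity >= threshold]


def make_triggers(events, box, t_start, t_end):
    """Step 4: turn each detected storm into a trigger for the workflow engine."""
    storms = detect_storms(spatio_temporal_filter(events, box, t_start, t_end))
    return [{"action": "start_forecast", "lat": s.lat, "lon": s.lon,
             "time": s.time} for s in storms]


BOX = (34.0, -100.0, 37.0, -94.0)  # hypothetical region of interest
EVENTS = [
    RadarEvent(10.0, 35.2, -97.4, 58.0),   # in region, in window, strong
    RadarEvent(20.0, 35.5, -97.0, 20.0),   # in region, too weak
    RadarEvent(30.0, 45.0, -97.0, 60.0),   # strong but outside region
    RadarEvent(500.0, 35.2, -97.4, 60.0),  # strong but outside window
]

triggers = make_triggers(EVENTS, BOX, t_start=0.0, t_end=100.0)
```

Filtering before detection keeps the detector's input small, which matches the slide's ordering of the steps.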
Continuous stream mining
• In stream mining of weather, events of interest are anomalies
• Event processing queries can be deployed to sites in the LEAD grid (rectangles)
• Data streams delivered to each site through Unidata Internet Data Distribution (IDD) system
• CEP enables real-time response to the weather
[Figure legend: query computation node; data generation source]
Example CEP query
• Scientists can set up a 6-hour weather forecast over a region of say a 700 sq. mile bounding box, and submit a workflow that will run sometime in the future
• CEP query detects severe storm conditions developing in the region
• The forecast workflow is started at a future point in time as determined by the CEP query
Stream Provenance Tracking
• Data stream provenance - derivation history of data product where data product is a derived time-bounded stream
• Stream provenance can establish correlations between significant events (e.g., storm occurrences)
• Anticipate resource needs by examining provenance data and discover trends in weather forecast model output
• Determine when next wave of users will arrive, and where their resources might need to be allocated
Stream processing as part of cyberinfrastructure
• SQL-based queries responding to input streams event-by-event within stream and concurrent across streams
• Each query generates time-bounded output stream
[Figure: service architecture extended with stream processing. Components shown: User’s Browser; Portal server; Workflow graph; Workflow Engine; Application services; App factory; Compute Engine; Event Notification Bus; MyLEAD Agent service; MyLEAD User Metadata catalog; Data Catalog service; Data Management Service; Data Storage; Calder Stream Mining Service; Mining queries; NEXRAD Streams; Doppler Radars.]
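The idea of a continuous query that reacts event by event and emits its own output stream can be sketched with a trailing time window. This is a plain Python illustration, not Calder's SQL dialect or execution engine; `windowed_max`, the window length, and the sample readings are assumptions.

```python
from collections import deque


def windowed_max(events, window_secs):
    """For each (timestamp, value) event, yield (timestamp, max over the
    trailing window of window_secs seconds). Input must be time-ordered."""
    window = deque()
    for ts, value in events:
        window.append((ts, value))
        # Evict events that have fallen out of the time-bounded window.
        while window and window[0][0] <= ts - window_secs:
            window.popleft()
        yield ts, max(v for _, v in window)


# Illustrative (timestamp in seconds, reading) pairs.
readings = [(0, 10), (5, 30), (12, 20), (30, 5)]
derived = list(windowed_max(readings, window_secs=10))
```

Because the function is a generator, each output element is produced as soon as its input event arrives, and the derived stream never holds more than one window of history.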
Provenance Service in Calder
[Figure: process flow and invocation among Calder services. Legend: Calder internal messaging; WS-Messenger notifications.]
• User Query
• Planner Service: obtain continuous query; distribute query; deploy queries; compile SQL to TCL query; set up buffer to aggregate results
• Queries: computational mesh executing query execution engines
• Rowset service aggregating derived streams: create ring buffer; results if any
• Provenance Service: receives query start / stop / distribution plan change and updates on stream rates, approximations, etc.; processes updates and stores them in DB
Provenance Update Handling Scalability
• Update processing time - time taken from instant user sends a notification to instant provenance service completes corresponding update
• Experiment
  • Bombard provenance service at different update rates by simulating many clients sending provenance updates simultaneously
  • Measure incoming rate at provenance service and overall time taken for handling each update.
  • Overhead includes time to create message, send and receive through WS-Messenger, process message and store it in DB
• Problem:
  • Severe weather can bring many storms over a local region of interest
  • It is infeasible and unnecessary to run weather model in response to each of them
• Solution:
  • Group storm events into spatial clusters
  • Trigger model runs in response to clusters of storms
Spatial Clustering: DBSCAN algorithm*
• DBSCAN is a density-based clustering algorithm; it can do spatial clustering when location parameters are treated as features.
• DBSCAN algorithm has two parameters
  • ε: radius within which a point is considered to be a neighbor of another point
  • minPt: minimum number of neighboring points that a point has to have to be considered as a core point.
• The two parameters determine the clustering result
* Mining work done by Xiang Li, University of Alabama Huntsville
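A minimal DBSCAN sketch shows how ε and minPt drive the result. It is a textbook-style implementation written for illustration, not the code used in the mining work credited above. One assumption to note: a point counts itself among its neighbors, which is a common convention; the sample coordinates are made up.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices within eps of point i (the point itself is included).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels


# Two tight groups of storm locations and one isolated event (made-up data).
storms = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(storms, eps=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not given in advance: it falls out of the density parameters, and isolated events are left as noise rather than forced into a cluster.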
Data
• WSR88D radar data on 3/27/2007
• Total of 134 radar sites covering CONUS
• The time period examined is from 1:00 pm to 6:00 pm EST.
• The 5-hour period is divided into 20 intervals of 15 min each. Storm events within the same interval are clustered
[Figure: Storm events detected at 1:00 pm – 1:15 pm]
* Mining work done by Xiang Li, University of Alabama Huntsville
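The interval split described above can be sketched as a simple binning step that runs before clustering. The start time, interval length, and count follow the slide; the `bin_events` function and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_events(events, start, interval, n_intervals):
    """Group (timestamp, payload) events by zero-based interval index."""
    bins = defaultdict(list)
    for ts, payload in events:
        idx = (ts - start) // interval  # timedelta // timedelta gives an int
        if 0 <= idx < n_intervals:
            bins[idx].append(payload)
    return dict(bins)


START = datetime(2007, 3, 27, 13, 0)   # 1:00 pm
INTERVAL = timedelta(minutes=15)
N_INTERVALS = 20                        # 5 hours / 15 minutes

# Made-up storm events: (detection time, label).
EVENTS = [
    (datetime(2007, 3, 27, 13, 5), "a"),
    (datetime(2007, 3, 27, 13, 14), "b"),
    (datetime(2007, 3, 27, 13, 15), "c"),
    (datetime(2007, 3, 27, 17, 59), "d"),
    (datetime(2007, 3, 27, 18, 0), "e"),  # falls outside the 20 intervals
]

bins = bin_events(EVENTS, START, INTERVAL, N_INTERVALS)
```

Each bin can then be clustered independently, so one set of clusters is produced per 15-minute interval.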
Algorithm comparison: DBSCAN and K-means
Time period: 1:00 pm – 1:15 pm
Number of clusters: 3
[Figure panels: DBSCAN result; K-means result]
Conclusion: DBSCAN algorithm performs better than k-means algorithm
Future Work
• Publication of provenance to digital library
• Generalized support for metadata systems
• Enhanced support for mining triggers
• Personal weather predictor
  • LEAD framework packaged into single 8-16 core multicore machine
  • Expands educational opportunities: suitable for small schools
  • Engage communities beyond meteorologists
Thank you for the interest.
Thanks to my many domain science and CS collaborators, to my students, and to the funding agents.
Please feel free to contact me at plale@indiana.edu