Metadata, Provenance, and Search in e-Science
Beth Plale
Director, Center for Data and Search Informatics
School of Informatics, Indiana University
May 29, 2007
Credits: PhD students Yogesh Simmhan, Nithya Vijayakumar, and Scott Jensen. Dennis Gannon, IU, key collaborator on discovery cyberinfrastructure.
Sept 17, 2007

Nature of Computational Science Discovery
• Extract data from heterogeneous databases,
• Execute task sequences (“workflows”) on your behalf,
• Mine data from sensors and instruments and respond to it,
• Try out new algorithms,
• Explore data through visualization, and
• Go back and repeat steps again: with new data, answering new questions, or with new algorithms.
How is this discovery process supported today?
• Through cyberinfrastructure.
• Cyberinfrastructure that supports
  • On-demand knowledge discovery
  • Automated experiment management (data and workflow)
  • Data protection, and automated data product provenance tracking.

CyberInfrastructure: framework for discovery
• Plug and play data sources and analysis tools. Complex what-if scenarios. Through
  • User portal
  • Personal metadata catalog of data exploration results
  • Data product index/catalog
  • Data provenance service
  • Workflow engine and composition tools
  • Tied together with Internet-scale event bus.
  • Results publishable to digital library.

Cyberinfrastructure for computing: DSI DataCenter
Supports analysis, use, visualization and search research. Supports multiple datasets.

Distributed services provide functional capability

Vision for Data Handling
• Capturing metadata about data sets as generated is key
  • Syntactic: file size, date of creation, and
  • Semantic or domain specific: spatial region, logical time
• Context of file is key search parameter
• Provenance, or history of data product, needed to assess quality
• Volume of data used in computational science too large: manage on behalf of user
• Indexes help efficiency

The Realization in Software
[Architecture diagram. Components: Workflow graph; Application services; Compute Engine; User’s Browser; Workflow Engine; App factory; Event Notification Bus; Portal server; MyLEAD Agent service; Data Catalog service; MyLEAD User Metadata catalog; Data Management Service; Provenance Collection service; Data Storage]

Infrastructure is portal based - that is, all services are available through a web server

e-Science Gateway Architecture
[Layered diagram, top to bottom:
  User’s Grid Desktop
  Grid Portal Server
  Gateway Services: Proxy Certificate Server (Vault); Application Deployment; Events & Messaging; Resource Registry; Workflow engine; Community & User Metadata Catalog; Resource Broker
  Core Grid Services: Security Services; Information Services; Self Management; Resource Management; Execution Management; Data Services
  Resource Virtualization (OGSA)
  Compute Resources; Data Resources; Instruments & Sensors]
[1] Service Oriented Architectures for Science Gateways on Grid Systems, Gannon, D., et al.; ICSOC, 2005

LEAD-CI Cyberinfrastructure
• Workflows run on the LEADgrid and on Teragrid.
• Portal and persistent back-end web services run on LEADgrid.
• Data storage resources for storing user-generated data products are provided by Indiana University.
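
Returning to the Vision for Data Handling slide: the split between syntactic and semantic metadata can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The `describe` function, the record layout, and the sample semantic values are hypothetical and are not the LEAD Metadata Schema. The file's modification time stands in for "date of creation", since a portable creation time is not available from `os.stat`.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe(path, semantic):
    """Build a metadata record for one data product (hypothetical layout)."""
    st = os.stat(path)
    return {
        # Syntactic metadata: can be read off the file itself.
        "syntactic": {
            "name": os.path.basename(path),
            "size_bytes": st.st_size,
            # Assumption: modification time approximates date of creation.
            "modified": datetime.fromtimestamp(
                st.st_mtime, tz=timezone.utc
            ).isoformat(),
        },
        # Semantic metadata: must be supplied by the generating application.
        "semantic": dict(semantic),
    }


# Usage: describe a small temporary file with made-up semantic values.
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as f:
    f.write(b"example")
    sample_path = f.name

record = describe(
    sample_path,
    {"bbox_deg": (-100.0, 33.0, -94.0, 37.0), "logical_time": "2007-03-27T13:00Z"},
)
os.unlink(sample_path)
print(record["syntactic"]["size_bytes"], record["semantic"]["logical_time"])
```

Capturing the record at the moment the file is generated, as the slide recommends, means the semantic fields come from the application that knows them, not from later guesswork.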

Typical weather forecast runs as workflow
[Workflow diagram.
  Stages: Pre-Processing; Assimilation; Forecast; Visualization
  Inputs: Terrain data files; Radar data (level II); Radar data (level III); ETA, RUC, GFS data; Surface data files; Surface, upper air mesonet & wind profiler data; Satellite data
  Programs: arpstrn; Ext2arps-ibc; Ext2arps-lbc; arpssfc; 88d2arps; nids2arps; mci2arps; ADAS assimilation; arps2wrf; WRF; wrf2arps; arpsplot; IDV viz]
~400 Data Products Consumed & Produced – transformed – during Workflow Lifecycle

[Screenshot] To set up workflow experiment, we select a workflow (not shown) then set model parameters here

Supported community data collections

Data Integration
Globally integrated view: Data Catalog Service
• Crawler crawls catalogs;
• Builds index of results;
• Web service API;
• Boolean search query with spatial/temporal support
• Index: XMLDB native XML database and Lucene for index
• Input: Boolean search query through the Web service API
• Output: list of results as LEAD Metadata Schema documents
Local view: crosswalk point of presence supports crawling, publishes difference list as LEAD Metadata Schema (LMS) documents
• CASA radar collection, months (ftp), Oklahoma
• Unidata IDD distribution, latest 3 days, Indiana (XML web server)
• Level II and III radar, latest 3 days, Colorado (XML web server)
• ETA, NCEP, NAM, METAR, etc., Colorado (XML web server)

LEAD Personal Workspace
• CyberInfrastructure extends user’s desktop to incorporate vast data analysis space.
• As users go about doing scientific experiments, the CI manages back-end storage and compute resources.
  • Portal provides ways to explore this data and search and discover it.
  • Metadata about experiments is largely automatically generated, and highly searchable.
    • Describes data object (the file) in application-rich terms, and provides URI to data service that can resolve an abstract unique identifier to real, on-line data “file”.
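
As a rough illustration of the Boolean search with spatial/temporal support mentioned on the Data Integration slide, the Python sketch below ANDs together attribute, bounding-box, and time-window predicates over an in-memory list of metadata records. Everything here is an assumption made for illustration: the `Record` fields, the `search` function, the `urn:example:` identifiers, and the sample values are hypothetical, and this is not the Data Catalog Service API or the LEAD Metadata Schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Record:
    uri: str          # abstract identifier that a data service would resolve
    attrs: dict       # domain attributes, e.g. model configuration parameters
    bbox: tuple       # (west, south, east, north), degrees
    start: datetime   # start of the time range the data covers
    end: datetime     # end of the time range the data covers


def boxes_intersect(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, attrs=None, bbox=None, start=None, end=None):
    """Return URIs of records matching ALL of the supplied predicates."""
    hits = []
    for r in records:
        if attrs and any(r.attrs.get(k) != v for k, v in attrs.items()):
            continue  # attribute predicate failed
        if bbox and not boxes_intersect(r.bbox, bbox):
            continue  # spatial predicate failed
        if start and r.end < start:
            continue  # record ends before the query window opens
        if end and r.start > end:
            continue  # record starts after the query window closes
        hits.append(r.uri)
    return hits


# Made-up catalog entries for the example.
catalog = [
    Record("urn:example:exp-1", {"model": "WRF", "dx": 4000},
           (-100.0, 33.0, -94.0, 37.0),
           datetime(2007, 3, 27, 13), datetime(2007, 3, 27, 19)),
    Record("urn:example:exp-2", {"model": "WRF", "dx": 2000},
           (-90.0, 38.0, -84.0, 42.0),
           datetime(2007, 3, 27, 13), datetime(2007, 3, 27, 19)),
    Record("urn:example:exp-3", {"model": "ARPS", "dx": 4000},
           (-100.0, 33.0, -94.0, 37.0),
           datetime(2007, 5, 1, 0), datetime(2007, 5, 1, 6)),
]

result = search(
    catalog,
    attrs={"model": "WRF"},
    bbox=(-98.0, 34.0, -96.0, 36.0),
    start=datetime(2007, 3, 27, 0),
    end=datetime(2007, 3, 28, 0),
)
print(result)
```

A linear scan is enough to show the query semantics. A catalog of real size would answer the same predicates from an index, which is the role the slide assigns to the XML database and Lucene.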

[Screenshot] Searching for experiments using model configuration parameters: 2 attributes selected

[Screenshot] Searching for experiments based on model parameters: 4 returned experiments; one displayed

How forecast model configuration parameters are stored in personal catalog
Forecast model configuration file is handed off to a plugin that shreds the XML document into queryable attributes associated with the experiment

What & Why of Provenance
• Derivation history of a data product
  • What (when, where) application created the data
  • Its parameters & configuration
  • Other input data used by application
[Diagram: Data.In.1, Data.In.2, Config.A → Application A → Data.Out.1]
  Data Provenance::Data.Out.1
  Process: Application_A
  Timestamp: 2006-06-23T12:45:23
  Host: tyr20.cs.indiana.edu
  …
  Input: Data.In.1, Data.In.2
  Config: Config.A
• Workflow is composed from building blocks like these. So provenance for data used in workflow gives workflow trace

The What & Why of Provenance
• Trace Workflow Execution
  • What services were used during workflow execution?
  • Were all steps of execution successful?
• Audit Trail
  • What resources were used during workflow execution?
• Data Quality & Reuse
  • What applications were used to derive data products?
  • Which workflows use a certain data product?
• Attribution
  • Who performed the experiment?
  • Who owns the workflow & data products?
• Discovery
  • Locate data generated by a workflow
  • Locate workflows containing App-X that succeeded

Collection Framework
[Diagram. Workflow Engine and Service 1, Service 2, … Service 9, Service 10 publish provenance activities as notifications to the message bus (WS-Messenger Notification Broker, WS-Eventing service API). The workflow engine publishes Workflow–Started & –Finished activities; services publish Application–Started & –Finished and Data–Produced & –Consumed activities. The Karma Provenance Service contains a Provenance Listener that subscribes and listens to activity notifications, an Activity DB, and a Provenance Query API. The Provenance Browser Client queries for workflow, process, & data provenance. In the workflow instance shown, 10 data products are consumed & produced by each service (labels 10P, 10C).]
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al., ICWS Conference, 2006

Generating Karma Provenance Activities
• Instrument applications to publish provenance
• Simple Java Library available to
  • Create provenance activities
  • Publish activities as messages
• Jython “wrapper” scripts use library to publish provenance & invoke application
• Generic Factory toolkit easily converts applications to web service
  • Built-in provenance instrumentation

Sample Sequence of Activities
  appStarted(App1)
  info(‘App1 starting’)
  fileReceiveStarted(File1)
  -- do gridftp get to stage input file File1 --
  fileReceiveFinished(File1)
  fileConsumed(File1)
  computationStarted(Code1)
  -- call Fortran code Code1 to process input files --
  computationFinished(Code1)
  fileProduced(File2)
  fileSendStarted(File2)
  -- do gridftp put to save output file File2 --
  fileSendFinished(File2)
  publishURL(File2)
  appFinishedSuccess(App1, File2) | appFinishedFailed(App1, ERR)
  flush()

Performance perturbation
[Chart. X axis: workflow application scripts in execution sequence. Series: Cumulative Time w/o Provenance (Secs); Cumulative Time w/ Provenance (Secs); Provenance Overhead for Each Script (Secs). Data values not recoverable from the extracted text.]

Standalone tool for provenance collection and experience reuse: future direction

[Screenshot] Forecast start time can also be set to occur on severe weather conditions (not shown here)

Weather triggered workflows
• Goal is cyberinfrastructure that allows scientists and students to run weather models dynamically and adaptively in response to weather events.
• Accomplished by coupling events processing and triggered forecast workflows
• Vijayakumar et al (2006) presented framework for this purpose
  • Events-processing system does temporal and spatial filtering.
  • Storm detection algorithm (SDA) detects storm events in remaining streams
  • SDA returns detected storm events
  • Events-processing system generates trigger to workflow engine

Continuous stream mining
• In stream mining of weather, events of interest are anomalies
• Event processing queries can be deployed to sites in the LEAD grid (rectangles)
• Data streams delivered to each site through Unidata Internet Data Distribution (IDD) system
• Complex event processing (CEP) enables real-time response to the weather
[Figure legend: query computation node, data generation source]

Example CEP query
• Scientists can set up a 6-hour weather forecast over a region of say a 700 sq.
mile bounding box, and submit a workflow that will run sometime in the future
• CEP query detects severe storm conditions developing in the region
• The forecast workflow is started at a future point in time as determined by the CEP query

Stream Provenance Tracking
• Data stream provenance: derivation history of a data product, where the data product is a derived time-bounded stream
• Stream provenance can establish correlations between significant events (e.g., storm occurrences)
• Anticipate resource needs by examining provenance data and discover trends in weather forecast model output
• Determine when next wave of users will arrive, and where their resources might need to be allocated

Stream processing as part of cyberinfrastructure
• SQL-based queries responding to input streams event-by-event within stream and concurrent across streams
• Each query generates time-bounded output stream
[Architecture diagram. Components: Workflow graph; Application services; Compute Engine; User’s Browser; Workflow Engine; App factory; Event Notification Bus; Portal server; MyLEAD Agent service; Data Catalog service; MyLEAD User Metadata catalog; Data Management Service; Data Storage; Calder Stream Mining Service; Mining queries; NEXRAD Streams; Doppler Radars]

Provenance Service in Calder
[Diagram of process flow / invocation.
  User
  Query Planner Service: obtain continuous query; distribute query; deploy queries; compile SQL to Tcl query; set up buffer to aggregate results (create ring buffer)
  Computational mesh executing query execution engines
  Rowset service aggregating derived streams; returns results if any
  Provenance Service: receives query start / stop / distribution plan change and updates on stream rates, approximations, etc.; processes updates and stores them in DB
  Legend: Calder internal messaging; WS-Messenger notifications]

Provenance Update Handling Scalability
• Update processing time: time taken from instant user sends a notification to instant provenance service completes corresponding update
• Experiment
  • Bombard
provenance service at different update rates by simulating many clients sending provenance updates simultaneously
  • Measure incoming rate at provenance service and overall time taken for handling each update.
  • Overhead includes time to create message, send and receive through WS-Messenger, process message and store it in DB

• Problem:
  • Severe weather can bring many storms over a local region of interest
  • It is infeasible and unnecessary to run weather model in response to each of them
• Solution:
  • Group storm events into spatial clusters
  • Trigger model runs in response to clusters of storms

Spatial Clustering: DBSCAN algorithm*
• DBSCAN is a density-based clustering algorithm; it can do spatial clustering when location parameters are treated as features.
• DBSCAN algorithm has two parameters
  • ε: radius within which a point is considered to be a neighbor of another point
  • minPt: minimum number of neighboring points that a point has to have to be considered as a core point.
• The two parameters determine the clustering result
* Mining work done by Xiang Li, University of Alabama Huntsville

Data
• WSR-88D radar data on 3/27/2007
• Total of 134 radar sites covering CONUS
• The time period examined is between 1:00 pm and 6:00 pm EST.
• The 5-hour time period is divided into 20 time intervals of 15 min each.
Storm events within the same time interval are clustered.
[Figure: Storm events detected at 1:00 pm – 1:15 pm]

Algorithm comparison: DBSCAN and K-means
Time period: 1:00 pm – 1:15 pm
Number of clusters: 3
[Figures: DBSCAN result; K-means result]
Conclusion: DBSCAN algorithm performs better than k-means algorithm

Future Work
• Publication of provenance to digital library
• Generalized support for metadata systems
• Enhanced support for mining triggers
• Personal weather predictor
  • LEAD framework packaged into single 8-16 core multicore machine
  • Expands educational opportunities: suitable for small schools
  • Engage communities beyond meteorologists

Thank you for the interest. Thanks to my many domain science and CS collaborators, to my students, and to the funding agencies. Please feel free to contact me at plale@indiana.edu
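
Appendix sketch: the DBSCAN procedure described on the Spatial Clustering slide can be written compactly in plain Python. This is a minimal textbook-style sketch, not the code used in the mining work credited above. The function names, the sample coordinates, and the parameter values are made up for illustration. Two simplifying assumptions apply: distances are planar Euclidean distances on (longitude, latitude) pairs, and `min_pts` counts a point as part of its own neighborhood.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: expand
    return labels


# Made-up storm locations as (lon, lat): two tight groups and one stray event.
storms = [
    (-97.0, 35.0), (-97.1, 35.1), (-96.9, 35.0),
    (-88.0, 41.0), (-88.1, 41.1), (-87.9, 40.9),
    (-110.0, 45.0),
]
labels = dbscan(storms, eps=0.5, min_pts=3)
print(labels)
```

In the triggering scheme from the Problem/Solution slide, each non-negative label would correspond to one cluster of storms and hence one candidate model run, while events labeled -1 would not trigger a run on their own. Unlike k-means, the number of clusters is not supplied in advance; it follows from ε and minPt.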