Barbara Minsker
Director, Environmental Engineering, Science, & Hydrology Group,
National Center for Supercomputing Applications;
Professor, Dept of Civil & Environ. Engineering;
University of Illinois, Urbana, IL, USA
January 9, 2007
Background
• NSF Office of Cyberinfrastructure is funding NCSA and SDSC to:
– Work with leading-edge communities to develop cyberinfrastructure to support science and engineering
– Incorporate successful prototypes into a persistent cyberinfrastructure
• NCSA runs the CLEANER Project Office, which is leading planning for the WATERS Network, one of 3 NSF-proposed environmental observatories
– Co-Directors: Barbara Minsker, Jerald Schnoor (U of Iowa), Chuck Haas (Drexel U)
• To support WATERS planning, NCSA's Environmental CyberInfrastructure Demonstrator (ECID) project is creating a prototype CI
– Driven by requirements gathering and close community collaborations
WATERS Network
Joint collaboration between the CLEANER Project Office and CUAHSI, Inc., sponsored by the ENG & GEO Directorates at the National Science Foundation (NSF)
CLEANER = Collaborative Large-Scale Engineering Analysis Network for Environmental Research
CUAHSI = Consortium of Universities for the Advancement of Hydrologic Science
Planning is underway to build a nationwide environmental observatory network using NSF's Major Research Equipment and Facilities Construction (MREFC) funding
Target construction date: 2011
Target operation date: 2015
The WATERS Network will transform our understanding of the Earth's water and related biogeochemical cycles across multiple spatial and temporal scales to enable forecasting and management of critical water processes affected by human activities.
• To detect the interactions of human activities and natural perturbations with the quantity, distribution and quality of water in real time.
• To predict the patterns and variability of processes affecting the quantity and quality of water at scales from local to continental.
• To achieve optimal management of water resources through the use of institutional and economic instruments.
Enable multi-scale, dynamic predictive modeling for water, sediment, and water quality (flux, flow paths, rates), including:
Near-real-time assimilation of data
Feedback for observatory design
Point- to national-scale prediction
Network provides data sets and framework to test:
Sufficiency of the data
Alternative model conceptualizations
Master Design Variables:
Scale
Climate (arid vs humid)
Coastal vs inland
Land use, land cover, population density
Nested (where appropriate) Observatories over Range of Scales:
Point
Plot (100 m²)
Subcatchment (2 km²)
Catchment (10 km²) – single land use
Watershed (100–10,000 km²) – mixed use
Basin (10,000–100,000 km²)
Continental
• Interviews at conferences and meetings (Tom Finholt and staff, U. of Michigan)
• Usability studies (NCSA, Wentling group)
• Community survey (Finholt group)
– AEESP and CUAHSI surveyed in 2006 as proxies for the environmental engineering and hydrology communities
– 313 responses out of 600 surveys mailed (52.2% response rate)
– Key findings are driving ECID cyberenvironment development
What is the single most important obstacle to using data from different sources?
[Bar chart of survey responses (N=278), in percent. Response options: learning how to quality control the data; non-standard data formats; processing the data from raw form into variables that can be used by other tools; existence of metadata; consistency of metadata; investigator who collected the data is unknown to me; unknown or inconsistent units; irregular or different time steps; non-standard spatial scales; other; not applicable. Responses cluster into three groups: nonstandard/inconsistent units and formats, metadata problems, and other obstacles.]
55% concerned about insufficient credit for shared data
What three software packages do you use most frequently in your work?
[Bar chart of AEESP and CUAHSI responses, in percent, in descending order of use: Excel, Other*, ArcGIS, MATLAB, SAS, MS Access, SPSS, SQL Server.]
Majority are not using high-end computational tools.
*Other:
• MS Word
• MS PowerPoint
• Statistics applications (e.g., Stata, R, S-Plus)
• SigmaPlot
• PHREEQC
• MathCAD
• FORTRAN compiler
• Mathematica
• GRASS GIS
• Groundwater models
• MODFLOW
Factors influencing technology adoption
[Bar chart of AEESP and CUAHSI responses, in percent. Factors rated: clarity of interface/ease of use; ability to do things I cannot do with current software/hardware; professional technical support; fulfillment of my current research needs; stability of software for long-term use; compatibility with existing tools that I use; necessity of learning new tools; speed of loading the CyberCollaboratory pages on my computer; upgrades for long-term use; ability to access and modify source code (e.g., for models or workflows); security of my personal information; having to install software on my personal computer (rather than accessing everything through a Web browser); necessity of creating an account.]
Ease of use, good support, and new capabilities are essential.
What are the three most compelling factors that would lead you to collaborate with another person in your field?
[Bar chart of AEESP and CUAHSI responses, in percent. Factors rated: complementary areas of expertise; access to another's expertise; shared interests; opportunity to brainstorm ideas with others; trusting the person; access to equipment (e.g., sensors, computers); access to data; shared values; leveraging funding by combining budgets; access to models; proximity of the person; preference for working with others; career advancement; shared methods; other.]
Community seeks collaborations to gain different expertise.
• Clearly, the first requirement for observatory CI is that the community must gain access to observatory data
• However, simply delivering the data through a Web portal will not allow the observatories to reach their full potential and meet the community's requirements
• Understanding data quality and getting credit for data sharing require an integrated provenance system to track what has been done with the data
• Enabling users who do not have strong computational skills to work with the flood of environmental data requires:
– Easy-to-use tools for manipulating large data sets, analyzing them, and assimilating them into models
– Workflow integrators that allow users to integrate their tools and models with real-time streaming environmental data
• The vast community of observatory users and the resources they generate create a need for knowledge-networking tools to help them find collaborators, data, workflows, publications, etc.
• To address these requirements, cyberenvironments are needed
Environmental CI Architecture
[Diagram: the research process (create hypothesis → obtain data → analyze data and/or assimilate into model(s) → link and/or run analyses and/or model(s) → discuss results → publish) runs on supporting technology: data services, workflows & model services, meta-workflows, knowledge services, collaboration services, and a digital library. The ECID project focuses on the cyberenvironments layer; the HIS project focuses on data services.]
• Cyberenvironments couple traditional desktop computing environments with the resources and capabilities of a national cyberinfrastructure
• They provide unprecedented ability to access, integrate, automate, and manage complex, collaborative projects across disciplinary and geographical boundaries
• ECID is demonstrating how cyberenvironments can:
– Support observatory sensor and event management, workflow and scientific analyses, and knowledge networking, including provenance information to track data from creation to publication
– Provide collaborative environments where scientists, educators, and practitioners can acquire, share, and discuss data and information
• The cyberenvironments are designed with a flexible, service-oriented architecture, so that different components can be substituted with ease
ECID CyberEnvironment Components
• CyberCollaboratory: collaborative portal
• CI-KNOW: network browser/recommender
• CyberIntegrator: exploratory workflow integration
• Tupelo: metadata services
• CUAHSI HIS data services
• Single sign-on (SSO) security (coming)
• Community event management/processing
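Tupelo manages metadata and provenance as semantic (RDF) content. Purely as an illustration of the kind of provenance triple involved, here is a minimal sketch written with Apache Jena rather than Tupelo's own API (which is not shown in these slides); the namespace, resource names, and contributor are made up:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

/**
 * Minimal sketch of an RDF provenance assertion of the kind a semantic
 * metadata store like Tupelo manages. Written with Apache Jena, not
 * Tupelo's actual API; all URIs and names here are illustrative.
 */
public class ProvenanceSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/ecid#";  // hypothetical vocabulary namespace

        Resource dataset = model.createResource(ns + "ccbay-sonde-2006-07");
        Resource workflow = model.createResource(ns + "anomaly-detector-run-42");

        // "This derived dataset was generated by that workflow run,
        //  and this person gets credit for the contribution."
        dataset.addProperty(model.createProperty(ns, "generatedBy"), workflow);
        dataset.addProperty(model.createProperty(ns, "contributor"), "Jane Doe");

        model.write(System.out, "TURTLE");
    }
}

Tracking such triples from data creation through publication is what lets the system both support QA/QC and credit contributors automatically.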
• Studying complex environmental systems requires:
– Coupling analyses and models
– Real-time, automated updating of analyses and modeling with diverse tools
• CyberIntegrator is a prototype workflow-executor technology to support exploratory modeling and analysis of complex systems. It integrates the following tools to date (a sketch of what a pluggable executor contract might look like follows this list):
– Excel
– IM2Learn image processing and mining tools, including ArcGIS image loading
– D2K data mining
– Java codes, including event management tools
• MATLAB & Fortran codes are to be added soon; additional tools will be included based on high-priority needs of beta users.
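The per-tool executors named in the Copano Bay example below (an Excel executor and an Im2Learn executor) suggest a pluggable design. As a rough Java illustration only, and not CyberIntegrator's actual API, such a tool-executor contract might look like:

import java.util.Map;

/**
 * Hypothetical sketch of a pluggable tool-executor interface, illustrating
 * how a workflow engine like CyberIntegrator might wrap heterogeneous tools
 * (Excel, Im2Learn, D2K, Java codes) behind one contract. All names here
 * are illustrative, not CyberIntegrator's real API.
 */
public interface ToolExecutor {
    /** Human-readable name of the wrapped tool, e.g. "Excel". */
    String toolName();

    /** True if this executor can run the given workflow step type. */
    boolean supports(String stepType);

    /**
     * Run one step: named inputs in, named outputs back.
     * Implementations translate between the engine's data model
     * and the tool's native formats.
     */
    Map<String, Object> execute(String stepType, Map<String, Object> inputs)
            throws Exception;
}

A contract like this is what lets new tools (MATLAB, Fortran) be added without changing the engine itself.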
CyberIntegrator Architecture
Example of CyberIntegrator use:
Carrie Gibson created a fecal coliform prediction model in ArcGIS, using Model Builder, that predicts annual average concentrations.
Ernest To rewrote the model as a macro in Excel to perform Monte Carlo simulation to predict median and 90th-percentile values.
CyberIntegrator's goal: reduce the manual labor in linking these tools, visualizing the results, and updating in real time.
Real-Time Simulation of Copano Bay TMDL with CyberIntegrator
[Workflow diagram, run in CyberIntegrator via its Excel and Im2Learn executors:
1. Streamflows to distributions (Excel), fed by USGS daily streamflows (web services)
2. Fecal coliform concentrations model (Excel)
3. Load shapefiles for Copano Bay (Im2Learn)
4. Geo-reference and visualize results (Im2Learn)]
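To give a concrete feel for step 1, here is a minimal Java sketch of pulling USGS daily streamflows over a web service. It uses today's public USGS Water Services REST endpoint, which postdates this work, and an illustrative site number (08189500, the Mission River gage in the Copano Bay watershed); the slides do not show the original workflow's actual service calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of fetching USGS daily streamflow over a web service,
 * as in step 1 of the Copano Bay workflow. Endpoint, site number, and
 * date range are illustrative assumptions.
 */
public class UsgsStreamflowFetch {
    public static void main(String[] args) throws Exception {
        String url = "https://waterservices.usgs.gov/nwis/dv/"
                + "?format=json&sites=08189500"
                + "&parameterCd=00060"             // 00060 = discharge, cubic ft/s
                + "&startDT=2006-01-01&endDT=2006-12-31";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        // Downstream steps would parse this JSON time series and fit the
        // flow distributions consumed by the Excel fecal-coliform model.
        System.out.println(body.substring(0, Math.min(500, body.length())));
    }
}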
[Diagram of an event-driven monitoring scenario:
1. The user subscribes to anomaly-detector workflows for the CCBay sensor map.
2. The Event Manager listens for data events and creates an event when Anomaly Detector 1 or 2 discovers an anomaly.
3. The Dashboard alerts the user to the anomaly detection, along with other events (logged-in users, new documents, etc.).
4. From the CCBay sensor monitor page, CyberIntegrator loads the recommended workflow, and the user adjusts its parameters to the CCBay sensor.
5. The sensor map shows nearby related sensors so the user can check their data.
6. When the anomaly detector proves faulty, the CI-KNOW network recommends an alternate anomaly detector from the Chesapeake Bay observatory.]
[Architecture diagram: a JMS broker (ActiveMQ 4.0.1) routes raw data, data and anomaly subscriptions, and anomaly publications among components. The CyberDashboard desktop application subscribes to data and anomaly events over JMS; a workflow service runs CyberIntegrator workflows against data subscriptions; the CyberCollaboratory sensor page and CyberIntegrator exchange workflow references by URL; CI-KNOW's recommender network and workflow publication/retrieval are reached as SOAP Web services; Tupelo holds provenance and semantic content for ECID-managed data/metadata; an RDBMS stores event topics, user subscriptions, and workflow templates; persistent stores hold metadata, data, and anomalies.]
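As a minimal sketch of the publish side of this bus, the following Java snippet sends one anomaly event to a JMS topic on an ActiveMQ broker. The broker URL, topic name, sensor id, and message fields are illustrative assumptions, not ECID's actual configuration:

import javax.jms.Connection;
import javax.jms.MapMessage;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

/**
 * Minimal sketch of an anomaly detector publishing an event to a JMS
 * topic on ActiveMQ, as in the architecture above. Broker URL, topic,
 * and field names are illustrative.
 */
public class AnomalyPublisher {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("ccbay.anomalies"); // hypothetical topic name
        MessageProducer producer = session.createProducer(topic);

        // One anomaly event: which sensor, when, and the suspect reading.
        MapMessage event = session.createMapMessage();
        event.setString("sensorId", "ccbay-do-01");   // illustrative sensor id
        event.setLong("timestamp", System.currentTimeMillis());
        event.setDouble("value", 0.3);                 // e.g., dissolved oxygen, mg/L
        producer.send(event);

        session.close();
        connection.close();
    }
}

Subscribers such as the CyberDashboard and the workflow service would register listeners on the same topic, which is what lets an anomaly trigger both a user alert and an automated workflow run.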
ECID & Corpus Christi Bay (CCBay) WATERS Observatory Testbed
• The CCBay WATERS Observatory Testbed is one of 10 observatory testbeds recently funded by NSF
– Collaboration of environmental engineering, hydrology, biology, and information technology researchers
• Goals of the testbed:
– Integrate ECID and HIS technology to create an end-to-end environmental information system
– Use the technology to study hypoxia in CCBay
• Use real-time data streams from diverse monitoring systems to predict hypoxia one day ahead
• Mobilize manual sampling crews when conditions are right
[Map of CCBay monitoring assets and hypoxic regions: NCDC station, TCEQ stations, TCOON stations, Montagna stations, USGS gages, and SERF stations. National datasets (national HIS): USGS, NCDC. Regional datasets (workgroup HIS): TCOON, Dr. Paul Montagna, TCEQ, SERF.]
CCBay Environmental Information System
[Diagram: CCBay sensors drive event-driven research through event-triggered workflow execution: an anomaly detector, a hypoxia predictor, dashboard alerts, and storage for later research, with CyberIntegrator producing forecasts and the CyberCollaboratory used to contact collaborators. Data paths: TCOON data are pulled from a Web server by a webscraper, with some data hosted by another regional research agency; regional data are stored on the CRWR workgroup server in the ODM schema and exposed through ODM Web services; Dr. Paul Montagna's data, SERF data, and TCEQ data arrive via Web services.]
CCBay Near-Real-Time Hypoxia Prediction
[Pipeline diagram spanning a sensor net, C++ code, D2K workflows, Fortran numerical models, and IM2Learn workflows: sensor data pass through anomaly detection and replacement or removal of errors, then update boundary-condition models; a data archive feeds hypoxia machine-learning models and the hydrodynamic and water quality models; a hypoxia model integrator combines the results; outputs visualize hydrodynamics and hypoxia risk.]
• Automating QA/QC in a real-time network (a sketch of the simplest detector family follows this list)
– David Hill is creating sensor anomaly detectors using statistical models (autoregressive models using naïve, clustering, perceptron, and artificial neural network approaches; and multi-sensor models using dynamic Bayesian networks)
– While statistical models can identify anomalies, it is sometimes difficult to differentiate sensor errors from unusual environmental phenomena
• Getting access to the data, which are collected by different groups and stored in multiple formats in different locations
– The project is defining a common data dictionary and units and will build Web services to translate among them
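To make the simplest detector family above concrete, here is a minimal Java sketch of a naïve autoregressive detector that flags a reading when its one-step prediction error exceeds k standard deviations of recent errors. The window size and threshold are illustrative; the project's clustering, perceptron, ANN, and dynamic-Bayesian-network detectors are more sophisticated than this.

/**
 * Minimal sketch of a naive autoregressive anomaly detector: predict the
 * next value as the last value, and flag a reading when the one-step
 * prediction error is more than k standard deviations from the mean of
 * recent errors. Illustrative only.
 */
public class NaiveArAnomalyDetector {
    private final double k;          // anomaly threshold, in std devs
    private final double[] errors;   // ring buffer of recent prediction errors
    private int count = 0;
    private double last = Double.NaN;

    public NaiveArAnomalyDetector(int window, double k) {
        this.errors = new double[window];
        this.k = k;
    }

    /** Feed one sensor reading; returns true if it looks anomalous. */
    public boolean observe(double value) {
        boolean anomalous = false;
        if (!Double.isNaN(last)) {
            double err = value - last;   // naive AR prediction: next == last
            if (count >= errors.length) {
                double mean = 0, var = 0;
                for (double e : errors) mean += e;
                mean /= errors.length;
                for (double e : errors) var += (e - mean) * (e - mean);
                double std = Math.sqrt(var / errors.length);
                anomalous = std > 0 && Math.abs(err - mean) > k * std;
            }
            if (!anomalous) {            // keep anomalies out of the statistics
                errors[count % errors.length] = err;
                count++;
            }
        }
        last = value;
        return anomalous;
    }
}

Note that a detector like this illustrates the second challenge above as well: a flagged reading could equally be a sensor fault or a genuine environmental excursion, and the statistics alone cannot tell the two apart.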
• Integrating data into diverse models
– Calibration uses historical data and is typically done by hand
– Near-real-time updating needs automated approaches
– Models are complex, and derivative-based calibration approaches would be difficult to implement
• Model integration (a sketch of a simple grid transformer follows this list)
– Grids change from one type of model to another; the project is defining a common coarse grid, with finer grids overlaid where needed
– Data transformers must be built between models
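As a minimal illustration of what a data transformer between grids involves, here is bilinear resampling of one model's regular grid at another grid's point locations. It assumes regular grids in a shared coordinate system, which real transformers between the hydrodynamic and water quality models cannot assume; they must also handle projections, masks, and unit conversions.

/**
 * Minimal sketch of a grid-to-grid data transformer: bilinear interpolation
 * of a regular source grid at an arbitrary point. Assumes both grids share
 * a coordinate system; illustrative only.
 */
public class GridTransformer {
    /**
     * Sample src (regular grid with origin (x0,y0) and spacing dx,dy)
     * at point (x,y) using bilinear interpolation.
     */
    public static double bilinear(double[][] src, double x0, double y0,
                                  double dx, double dy, double x, double y) {
        double gx = (x - x0) / dx, gy = (y - y0) / dy;   // grid coordinates
        int i = (int) Math.floor(gx), j = (int) Math.floor(gy);
        i = Math.max(0, Math.min(i, src.length - 2));     // clamp to grid interior
        j = Math.max(0, Math.min(j, src[0].length - 2));
        double fx = gx - i, fy = gy - j;                  // fractional offsets
        return (1 - fx) * (1 - fy) * src[i][j]
             + fx * (1 - fy) * src[i + 1][j]
             + (1 - fx) * fy * src[i][j + 1]
             + fx * fy * src[i + 1][j + 1];
    }
}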
• Creating CI for environmental data is challenging, but the benefits in enabling larger-scale, near-real-time research will be enormous
• The ECID Cyberenvironment demonstrates the benefits of end-to-end integration of cyberinfrastructure and desktop tools, including:
– HIS-type data services
– Workflow
– Event management
– Provenance and knowledge management, and
– Collaboration for supporting environmental researchers, educators, and outreach partners
• This creates a powerful system for linking observatory operations with flexible, investigator-driven research in a community framework (i.e., the national network)
– Workflow and knowledge management support testing hypotheses across observatories
– Provenance supports QA/QC and rewards for community contributions in an automated fashion
• Contributors:
– NCSA ECID team (Peter Bajcsy, Noshir Contractor, Steve Downey, Joe Futrelle, Hank Green, Rob Kooper, Yong Liu, Luigi Marini, Jim Myers, Mary Pietrowicz, Tim Wentling, York Yao, Inna Zharnitsky)
– Corpus Christi Bay Testbed team (PIs: Jim Bonner, Ben Hodges, David Maidment, Barbara Minsker, Paul Montagna)
• Funding sources:
– NSF grants BES-0414259, BES-0533513, and SCI-0525308
– Office of Naval Research grant N00014-04-1-0437