Cyberinfrastructure Challenges for Environmental Observatories



Barbara Minsker

Director, Environmental Engineering, Science, & Hydrology Group, National Center for Supercomputing Applications;

Professor, Dept. of Civil & Environmental Engineering, University of Illinois, Urbana, IL, USA

January 9, 2007

University of Illinois at Urbana-Champaign National Center for Supercomputing Applications

Background

NSF Office of Cyberinfrastructure is funding NCSA and SDSC to:

– Work with leading edge communities to develop cyberinfrastructure to support science and engineering

– Incorporate successful prototypes into a persistent cyberinfrastructure

NCSA runs the CLEANER Project Office, which is leading planning for the WATERS Network, one of 3 NSF proposed environmental observatories

– Co-Directors: Barbara Minsker, Jerald Schnoor (U. of Iowa), Chuck Haas (Drexel U.)

• To support WATERS planning, NCSA's Environmental CyberInfrastructure Demonstrator (ECID) project is creating a prototype CI

– Driven by requirements gathering and close community collaborations


WATERS Network

WATERS = WATer and Environmental Research Systems Network

 Joint collaboration between the CLEANER Project Office and CUAHSI, Inc., sponsored by the ENG & GEO Directorates at the National Science Foundation (NSF)

CLEANER = Collaborative Large-Scale Engineering Analysis Network for Environmental Research

CUAHSI = Consortium of Universities for the Advancement of Hydrologic Science

 Planning underway to build a nationwide environmental observatory network using NSF's Major Research Equipment and Facility Construction (MREFC) funding

Target construction date: 2011

Target operation date: 2015

WATERS DRAFT VISION

The WATERS Network will transform our understanding of the Earth's water and related biogeochemical cycles across multiple spatial and temporal scales to enable forecasting and management of critical water processes affected by human activities.

WATERS DRAFT GRAND CHALLENGES

• To detect the interactions of human activities and natural perturbations with the quantity, distribution and quality of water in real time.

• To predict the patterns and variability of processes affecting the quantity and quality of water at scales from local to continental.

• To achieve optimal management of water resources through the use of institutional and economic instruments.

Network Design Principles:

Enable multi-scale, dynamic predictive modeling for water, sediment, and water quality (flux, flow paths, rates), including:

Near-real-time assimilation of data

Feedback for observatory design

Point- to national-scale prediction

Network provides data sets and framework to test:

Sufficiency of the data

Alternative model conceptualizations

Master Design Variables:

Scale

Climate (arid vs humid)

Coastal vs inland

Land use, land cover, population density

Nested (where appropriate) Observatories over Range of Scales:

Point

Plot (100 m²)

Subcatchment (2 km²)

Catchment (10 km²) – single land use

Watershed (100–10,000 km²) – mixed use

Basin (10,000–100,000 km²)

Continental

CI Requirements Gathering

Interviews at conferences and meetings (Tom Finholt and staff, U. of Michigan)

Usability studies (NCSA, Wentling group)

Community survey (Finholt group)

– AEESP and CUAHSI surveyed in 2006 as proxies for environmental engineering and hydrology communities

– 313 responses out of 600 surveys mailed (52.2% response rate)

– Key findings are driving ECID cyberenvironment development

National Center for Supercomputing Applications

What is the single most important obstacle to using data from different sources?

[Chart, N = 278: the most-cited obstacles were learning how to quality-control the data, non-standard data formats, and processing the data from raw form into variables that can be used by other tools; existence and consistency of metadata, unknown or inconsistent units, irregular or different time steps, and non-standard spatial scales were cited less often. Legend groups: nonstandard/inconsistent units and formats; metadata problems; other obstacles.]

55% concerned about insufficient credit for shared data


What three software packages do you use most frequently in your work?

[Chart, AEESP vs. CUAHSI respondents: Excel is by far the most frequently used package, followed by "Other," ArcGIS, MATLAB, SAS, MS Access, SPSS, and SQL Server.]

Majority are not using high-end computational tools.

*Other: MS Word, MS PowerPoint, statistics applications (e.g., Stata, R, S-Plus), SigmaPlot, PHREEQC, MathCAD, FORTRAN compilers, Mathematica, GRASS GIS, groundwater models, and Modflow.


Factors influencing technology adoption

[Chart, AEESP vs. CUAHSI respondents. Leading factors: clarity of interface/ease of use; ability to do things not possible with current software/hardware; professional technical support; fulfillment of current research needs. Lesser factors: stability of software for long-term use; compatibility with existing tools; necessity of learning new tools; speed of loading the CyberCollaboratory pages; upgrades for long-term use; ability to access and modify source code (e.g., for models or workflows); security of personal information; having to install software locally rather than accessing everything through the Web; necessity of creating an account.]

Ease of use, good support, and new capabilities are essential.


What are the three most compelling factors that would lead you to collaborate with another person in your field?

[Chart, AEESP vs. CUAHSI respondents. Leading factors: complementary areas of expertise; access to another's expertise; shared interests; opportunity to brainstorm ideas with others; trusting the person. Lesser factors: access to equipment (e.g., sensors, computers); access to data; shared values; leveraging funding by combining budgets; access to models; proximity of the person; preference for working with others; career advancement; shared methods.]

Community seeks collaborations to gain different expertise.


WATERS CI Challenges

Clearly, the first requirement for observatory CI is that the community must gain access to observatory data

However, simply delivering the data through a Web portal is not going to allow the observatories to reach their full potential and meet the community's requirements


WATERS CI Challenges, Cont’d.

Understanding data quality and getting credit for data sharing requires an integrated provenance system to track what has been done with the data

Enabling users who do not have strong computational skills to work with the flood of environmental data requires:

– Easy-to-use tools for manipulating large data sets, analyzing them, and assimilating them into models

– Workflow integrators that allow users to integrate their tools and models with real-time streaming environmental data

The vast community of observatory users & the resources they generate create a need for knowledge networking tools to help them find collaborators, data, workflows, publications, etc.

To address these requirements, cyberenvironments are needed
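A minimal sketch of what an integrated provenance system like the one described above might record, in Python. The class and method names (`ProvenanceLog`, `record`, `lineage`) and the example artifacts are illustrative assumptions, not ECID's actual API:

```python
# Minimal provenance sketch: each operation on a dataset is recorded as a
# step (actor, action, source, result), so data can be traced from creation
# to publication and contributors can be credited. Illustrative names only.
from dataclasses import dataclass, field

@dataclass
class ProvenanceLog:
    steps: list = field(default_factory=list)

    def record(self, actor, action, source, result):
        # Append one provenance step: who did what to which dataset.
        self.steps.append({"actor": actor, "action": action,
                           "source": source, "result": result})

    def lineage(self, artifact):
        # Walk backwards from an artifact to its original sources.
        chain, current = [], artifact
        while True:
            step = next((s for s in self.steps if s["result"] == current), None)
            if step is None:
                break
            chain.append(step)
            current = step["source"]
        return chain

log = ProvenanceLog()
log.record("USGS", "collect", "sensor:site01", "raw_flows.csv")
log.record("B. Minsker", "qa_qc", "raw_flows.csv", "clean_flows.csv")
log.record("workflow:42", "assimilate", "clean_flows.csv", "model_run_7")
# The lineage of the model run traces back to the original sensor data,
# so both the QA/QC step and the data collector can be identified.
history = log.lineage("model_run_7")
```

Because lineage is recoverable from the log, both data-quality checks and credit for shared data fall out of the same record.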


Environmental CI Architecture

[Diagram: the research process (create hypothesis; obtain data; analyze data and/or assimilate into model(s); link and/or run analyses and/or model(s); discuss results; publish) is supported by an integrated CI of research services: data services, workflows & model services, meta-workflows, knowledge services, collaboration services, and a digital library. The ECID project focuses on cyberenvironments and supporting technology; the HIS project focuses on data services.]


Cyberenvironments

Couple traditional desktop computing environments with the resources and capabilities of a national cyberinfrastructure

Provide unprecedented ability to access, integrate, automate, and manage complex, collaborative projects across disciplinary and geographical boundaries.

ECID is demonstrating how cyberenvironments can:

– Support observatory sensor and event management, workflow and scientific analyses, and knowledge networking, including provenance information to track data from creation to publication.

– Provide collaborative environments where scientists, educators, and practitioners can acquire, share, and discuss data and information.

The cyberenvironments are designed with a flexible, service-oriented architecture, so that different components can be substituted with ease


ECID CyberEnvironment Components

CyberCollaboratory: Collaborative Portal

CI-KNOW: Network Browser/Recommender

CyberIntegrator: Exploratory Workflow Integration

Tupelo: Metadata Services

CUAHSI HIS Data Services

Single Sign-On (SSO) Security (coming)

Community Event Management/Processing


CyberIntegrator

Studying complex environmental systems requires:

– Coupling analyses and models

– Real-time, automated updating of analyses and modeling with diverse tools

CyberIntegrator is a prototype workflow-executor technology to support exploratory modeling and analysis of complex systems. It integrates the following tools to date:

– Excel

– IM2Learn image processing and mining tools, including ArcGIS image loading

– D2K data mining

– Java codes, including event management tools

Matlab & Fortran codes to be added soon. Additional tools will be included based on high-priority needs of beta users.


CyberIntegrator Architecture

Example of CyberIntegrator Use:

Carrie Gibson created a fecal coliform prediction model in ArcGIS using Model Builder that predicts annual average concentrations.

Ernest To rewrote the model as a macro in Excel to perform Monte Carlo simulation to predict median and 90th percentile values.

CyberIntegrator’s goal: Reduce manual labor in linking these tools, visualizing the results, and updating in real time.

National Center for Supercomputing Applications

Real-Time Simulation of Copano Bay TMDL with CyberIntegrator

[Diagram: USGS daily streamflows (via web services) and shapefiles for Copano Bay feed a CyberIntegrator workflow run by the Excel and Im2Learn executors:
1. Streamflows to distributions (Excel)
2. Fecal coliform concentrations model (Excel)
3. Load shapefiles (Im2Learn)
4. Geo-reference and visualize results (Im2Learn)]
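The chained steps above can be sketched as a minimal workflow executor in Python. The step functions are toy stand-ins for the Excel and Im2Learn tools, and the loading coefficient is invented for illustration:

```python
# Sketch of how a workflow executor like CyberIntegrator might chain
# heterogeneous steps: each tool is wrapped as a callable and the executor
# passes each step's output downstream. All step bodies are stand-ins.
def fit_distribution(flows):
    # Step 1: summarize daily streamflows (stand-in for the Excel step).
    return {"mean": sum(flows) / len(flows), "max": max(flows)}

def predict_coliform(dist):
    # Step 2: toy fecal-coliform model; 120.0 is a hypothetical coefficient.
    return {"cfu_per_100ml": 120.0 * dist["mean"]}

def georeference(conc):
    # Steps 3-4: attach results to a (hypothetical) shapefile region.
    return {"region": "Copano Bay", **conc}

def run_workflow(steps, data):
    # Minimal executor: apply each step to the previous step's output.
    for step in steps:
        data = step(data)
    return data

result = run_workflow([fit_distribution, predict_coliform, georeference],
                      [2.0, 3.0, 4.0])
```

Because each tool is behind a uniform callable interface, swapping an Excel step for a Matlab or Fortran step changes only the wrapper, not the executor.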


Sensor Anomaly Detection Scenario

[Diagram, sensor anomaly detection scenario:
1. The user subscribes to anomaly-detector workflows. The event manager listens for data events from the CCBay sensor map and creates an event when an anomaly is discovered.
2. The dashboard alerts the user to the anomaly detection, along with other events (logged-in users, new documents, etc.).
3. From the CCBay sensor monitor page, CyberIntegrator loads the recommended workflow; the user adjusts its parameters to the CCBay sensor, and the sensor map shows nearby related sensors so the user can check their data.
4. The anomaly detector is faulty, so CI-KNOW recommends an alternate anomaly detector from the Chesapeake Bay observatory via the CI-KNOW network.]


Cyberenvironment Technologies

[Diagram: a JMS broker (ActiveMQ 4.0.1) routes raw data, data subscriptions, and anomaly publications/subscriptions among the components. The CyberDashboard desktop application holds data and anomaly subscriptions; a workflow service executes CyberIntegrator workflows; the CyberCollaboratory sensor page and CyberIntegrator exchange workflow references by URL; CI-KNOW provides a recommender-network Web service and workflow publication/retrieval Web services over SOAP; Tupelo manages provenance and semantic content for ECID-managed data/metadata; and an RDBMS stores event topics, user subscriptions, workflow templates, metadata, data, and anomalies.]
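The topic-based publish/subscribe routing at the heart of this architecture can be mimicked with a small in-process sketch. The real system uses ActiveMQ over JMS; this Python analogue only shows the pattern, and the topic names and payloads are invented:

```python
# In-process analogue of a JMS topic broker: publishers send messages to
# named topics, and every subscriber registered on a topic receives them.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to each callback subscribed to this topic.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
alerts = []
# The dashboard subscribes to the anomaly topic, as in the scenario above.
broker.subscribe("ccbay.anomalies", alerts.append)
broker.publish("ccbay.data", {"sensor": "SERF-1", "do_mg_l": 6.8})  # no listener
broker.publish("ccbay.anomalies", {"sensor": "SERF-1", "reason": "spike"})
```

Decoupling producers (sensors, workflows) from consumers (dashboard, archive) this way is what lets new detectors or displays be added without touching existing components.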

ECID & Corpus Christi Bay (CCBay) WATERS Observatory Testbed

CCBay WATERS Observatory Testbed is one of 10 observatory testbeds recently funded by NSF

– Collaboration of environmental engineering, hydrology, biology, and information technology researchers

Goal of the testbed:

– Integrate ECID and HIS technology to create an end-to-end environmental information system

– Use the technology to study hypoxia in CCBay

Use real-time data streams from diverse monitoring systems to predict hypoxia one day ahead

Mobilize manual sampling crews when conditions are right


Sensors in Corpus Christi Bay

[Map of sensors: NCDC station, TCEQ stations, TCOON stations, Montagna stations, USGS gages, and SERF stations, with hypoxic regions marked.]

National datasets (National HIS): USGS, NCDC

Regional datasets (Workgroup HIS): TCOON, Dr. Paul Montagna, TCEQ, SERF


CCBay Environmental Information System

[Diagram: CCBay sensor data arrive through several paths: ODM Web services from the CRWR workgroup server, which stores regional data in the ODM schema; a Web scraper for the TCOON Web server, whose data are hosted by another regional research agency; and Web services for Dr. Paul Montagna's data, SERF, and TCEQ. Incoming data drive event-triggered workflow execution for event-driven research: an anomaly detector, a hypoxia predictor (CyberIntegrator forecast), a dashboard alert, storage for later research, and the CyberCollaboratory for contacting collaborators.]


CCBay Near-Real-Time Hypoxia Prediction

[Diagram: sensor-net data pass through anomaly detection (C++ code); errors are replaced or removed and boundary-condition models are updated (D2K workflows); the data archive feeds hypoxia machine-learning models as well as the hydrodynamic and water-quality models (Fortran numerical models); a hypoxia model integrator combines their outputs; and hypoxia risk and hydrodynamics are visualized (IM2Learn workflows).]


CCBay CI Challenges

Automating QA/QC in a real-time network

– David Hill is creating sensor anomaly detectors using statistical models (autoregressive models using naïve, clustering, perceptron, and artificial neural network approaches; and multi-sensor models using dynamic Bayesian networks)

– While statistical models can identify anomalies, it is sometimes difficult to differentiate sensor errors from unusual environmental phenomena
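The naive autoregressive approach mentioned above can be sketched in a few lines. The one-step forecast, the window length, and the threshold multiplier are assumptions for illustration, not the values in David Hill's detectors:

```python
# Naive AR(1) anomaly detector sketch: predict each reading from the previous
# one and flag readings whose forecast error is large relative to recent
# variability. Window size and multiplier k are illustrative assumptions.
from statistics import mean, stdev

def detect_anomalies(series, window=5, k=3.0):
    flags = []
    for i in range(1, len(series)):
        # Naive forecast: the next value equals the previous value.
        error = abs(series[i] - series[i - 1])
        recent = [abs(series[j] - series[j - 1])
                  for j in range(max(1, i - window), i)]
        # Flag when the error exceeds the recent mean error by k deviations.
        if len(recent) >= 2 and error > k * (stdev(recent) + 1e-9) + mean(recent):
            flags.append(i)
    return flags

flags = detect_anomalies([5.0] * 6 + [50.0] + [5.0] * 3)
```

This also illustrates the limitation stated above: a genuine environmental excursion produces exactly the same forecast error as a sensor fault, which is why multi-sensor models are also being explored.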

Getting access to the data, which are collected by different groups and stored in multiple formats in different locations

– The project is defining a common data dictionary and units, and will build Web services to translate among them
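The common-data-dictionary idea can be sketched as a lookup that maps each provider's variable names and units onto shared ones before data enter the workflows. The entries and conversions here are illustrative, not the project's actual schema:

```python
# Sketch of a common data dictionary: (provider, local variable) maps to a
# shared variable name, a shared unit, and a conversion function.
# Dictionary contents are invented for illustration.
DICTIONARY = {
    ("TCOON", "wtemp_f"): ("water_temperature", "degC",
                           lambda v: (v - 32.0) * 5.0 / 9.0),
    ("SERF", "do_ppm"):   ("dissolved_oxygen", "mg/L", lambda v: v),
}

def translate(provider, variable, value):
    # Normalize one observation into the shared vocabulary and units.
    name, unit, convert = DICTIONARY[(provider, variable)]
    return {"variable": name, "unit": unit, "value": convert(value)}

obs = translate("TCOON", "wtemp_f", 86.0)
```

A translation Web service would wrap exactly this lookup, so downstream tools see one vocabulary regardless of which group collected the data.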


CCBay CI Challenges, Cont'd.

Integrating data into diverse models

– Calibration uses historical data, typically done by hand

– Near-real-time updating needs automated approaches

– Models are complex and derivative-based calibration approaches would be difficult to implement
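Because derivative-based calibration is hard for these models, derivative-free search is one alternative. This sketch calibrates a single parameter by seeded random search against a historical record; the model function, bounds, and trial count are hypothetical:

```python
# Derivative-free calibration sketch: sample candidate parameter values and
# keep the one with the smallest sum-of-squares misfit to historical data.
# No gradients of the model are needed.
import random

def calibrate(model, observed, bounds, trials=2000, seed=0):
    rng = random.Random(seed)
    best_param, best_error = None, float("inf")
    for _ in range(trials):
        p = rng.uniform(*bounds)
        # Sum-of-squares misfit against the historical record.
        error = sum((model(p, t) - obs) ** 2 for t, obs in observed)
        if error < best_error:
            best_param, best_error = p, error
    return best_param

# Toy model: linear response with one unknown rate parameter (true value 2.5).
toy_model = lambda p, t: p * t
observed = [(t, 2.5 * t) for t in range(1, 6)]
best = calibrate(toy_model, observed, bounds=(0.0, 10.0))
```

For near-real-time updating, the same search can be re-run automatically each time new observations arrive, which is the automation the bullet above calls for.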

Model integration

– Grids change from one type of model to another; a common coarse grid is being defined, with finer grids overlaid where needed

– Data transformers must be built between models
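A data transformer between grids can be sketched for the simplest case: averaging fine-grid cells into a common coarse grid. This assumes the fine resolution divides the coarse one evenly; real transformers must also handle interpolation and irregular meshes:

```python
# Regridding sketch: aggregate a fine 2-D grid onto a coarser one by
# averaging each block of fine cells into one coarse cell.
def fine_to_coarse(grid, factor):
    coarse = []
    for i in range(0, len(grid), factor):
        row = []
        for j in range(0, len(grid[0]), factor):
            # Collect the factor x factor block of fine cells.
            block = [grid[a][b]
                     for a in range(i, i + factor)
                     for b in range(j, j + factor)]
            row.append(sum(block) / len(block))  # cell average
        coarse.append(row)
    return coarse

fine = [[1.0, 3.0], [5.0, 7.0]]
coarse = fine_to_coarse(fine, 2)   # one coarse cell covering all four
```

The reverse direction (coarse to fine) needs an interpolation rule rather than an average, which is why transformers must be built per model pair.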


Conclusions

Creating CI for environmental data is challenging, but the benefits in enabling larger-scale, near-real-time research will be enormous

The ECID Cyberenvironment demonstrates the benefits of end-to-end integration of cyberinfrastructure and desktop tools, including:

– HIS-type data services

– Workflow

– Event management

– Provenance and knowledge management, and

– Collaboration for supporting environmental researchers, educators, and outreach partners

This creates a powerful system for linking observatory operations with flexible, investigator-driven research in a community framework (i.e., the national network).

– Workflow and knowledge management support testing hypotheses across observatories

– Provenance supports QA/QC and rewards for community contributions in an automated fashion.


Acknowledgments

Contributors:

– NCSA ECID team (Peter Bajcsy, Noshir Contractor, Steve Downey, Joe Futrelle, Hank Green, Rob Kooper, Yong Liu, Luigi Marini, Jim Myers, Mary Pietrowicz, Tim Wentling, York Yao, Inna Zharnitsky)

– Corpus Christi Bay Testbed team (PIs: Jim Bonner, Ben Hodges, David Maidment, Barbara Minsker, Paul Montagna)

Funding sources:

– NSF grants BES-0414259, BES-0533513, and SCI-0525308

– Office of Naval Research grant N00014-04-1-0437

