Trigger service - Pegasus Workflow Management System

Monitoring the
Earth System Grid
with MDS4
Ann Chervenak
USC Information Sciences Institute
Jennifer M. Schopf, Laura Pearlman,
Mei-Hui Su, Shishir Bharathi, Luca Cinquini,
Mike D’Arcy, Neill Miller, David Bernholdt
Talk Outline




Overview of the Earth System Grid
Overview of Monitoring in the Globus
Toolkit
Globus Monitoring Services in ESG

Monitoring and Discovery System

Trigger Service
Summary
The Earth System Grid: Turning
Climate Datasets into Community
Resources
www.earthsystemgrid.org
The growing importance of
climate simulation data

DOE invests broadly in
climate change research:


Development of climate
models

Climate change simulation

Model intercomparisons

Observational programs
Results from the Parallel Climate Model (PCM) depicting wind
vectors, surface pressure, sea surface temperature, and sea
ice concentration. Prepared from data published in the ESG
using the FERRET analysis tool by Gary Strand, NCAR.
Climate change research is
increasingly data-intensive:


Analysis and intercomparison
of simulation and observations
from many sources
Data used by model
developers, impacts analysts,
policymakers
4 Bernholdt_ESG_SC07
Slide Courtesy of Dave Bernholdt, ORNL
Earth System Grid objectives
To support the infrastructural needs of the national and international
climate community, ESG is providing crucial technology to securely
access, monitor, catalog, transport, and distribute data in today’s grid
computing environment.
HPC
hardware running
climate models
ESG Portal
ESG
Sites
Slide Courtesy of Dave Bernholdt, ORNL
ESG facts and figures
Main ESG Portal
146 TB of data at four locations



1,059 datasets
958,072 files
Includes the past 6 years of joint DOE/NSF
climate modeling experiments
IPCC AR4 ESG Portal
35 TB of data at one location



77,400 files
Generated by a modeling campaign coordinated by the
Intergovernmental Panel on Climate Change
Model data from 13 countries
Main ESG Portal: 4,910 registered users
IPCC AR4 ESG Portal: 1,245 registered analysis projects
Downloads to date (Main ESG Portal)
30 TB
106,572 files
Downloads to date (IPCC AR4 ESG Portal)
245 TB
914,400 files
500 GB/day (average)
Worldwide ESG user base
IPCC Daily Downloads (through
7/2/07)
> 300 scientific papers published to
date based on analysis of IPCC AR4
data
Slide Courtesy of Dave Bernholdt, ORNL
ESG architecture and
underlying technologies
Climate data tools
• Metadata catalog
• NcML (metadata schema)
• OPeNDAP-G (aggregation and subsetting)
Data management
• Data Mover Lite
• Storage Resource Manager
Globus toolkit
• Globus Security Infrastructure
• GridFTP
• Monitoring and Discovery Services
• Replica Location Service
Security
• Access control
• MyProxy
• User registration
[Figure: First Generation ESG Architecture. A data provider (publish) and data users (search, browse, download; web browser, DML) reach the ESG Web Portal, which offers user registration, catalog browsing, access control, climate metadata, data download, data subsetting, data publishing, data search, usage metrics, and monitoring services. The portal connects to RLS, SRM, MyProxy, and disk cache components at ORNL, LANL, NERSC, and NCAR.]
MSS, HPSS: Tertiary data storage systems
Slide Courtesy of Dave Bernholdt, ORNL
Evolving ESG to petascale
ESG Data System Evolution
2006: Central database
• Centralized curated data archive
• Time aggregation
• Distribution by file transport
• No ESG responsibility for analysis
• Shopping-cart-oriented web portal
Early 2009: Testbed data sharing
• Federated metadata
• Federated portals
• Unified user interface
• Selected server-side analysis
• Location independence
• Distributed aggregation
• Manual data sharing
• Manual publishing
2011: Full data sharing (add to testbed…)
• Synchronized federation – metadata, data
• Full suite of server-side analysis
• Model/observation integration
• ESG embedded into desktop productivity tools
• GIS integration
• Model intercomparison metrics
• User support, life cycle maintenance
ESG Data Archive
Terabytes: CCSM, IPCC
Petabytes: CCSM, IPCC, satellite, In situ biogeochemistry, ecosystems
Slide Courtesy of Dave Bernholdt, ORNL
Architecture of the
next-generation ESG

• Petascale data archives
• Broader geographical distribution of archives
– across the United States
– around the world
• Easy federation of sites
• Increased flexibility and robustness
[Figure: Second Generation ESG Architecture. Browser clients and remote application clients; web portals; local, remote, and web services interfaces; applications (CDAT, NCL, Ferret, GIS, Publishing, OPeNDAP, DML, Modeling, etc.); components (data transfer, data publishing, search, analysis, visualization, post-processing, computation); workflow & orchestration; cross-cutting concerns (security, logging, monitoring); distribution, online data, deep archives, CPU.]
[Figure: Federated ESG Deployment. ESG Gateways (CCSM, CCES, IPCC) with web portals, interfaces, applications, and data & metadata holdings, connected to multiple ESG Nodes.]
Slide Courtesy of Dave Bernholdt, ORNL
The team and sponsors
National Oceanic
& Atmospheric
Administration/PMEL
Lawrence Berkeley
National Laboratory
National Center for
Atmospheric Research
Lawrence Livermore
National Laboratory/
PCMDI
USC Information
Sciences Institute
Los Alamos
National Laboratory
Argonne
National Laboratory
Oak Ridge
National Laboratory
Climate Data Repository and
ESG participant
ESG participant
Slide Courtesy of Dave Bernholdt, ORNL
Monitoring ESG


ESG consists of heterogeneous components
deployed across multiple administrative domains
The climate community has come to depend on the
ESG infrastructure as a critical resource




Failures of ESG components or services can disrupt
the work of many scientists
Need to minimize infrastructure downtime
Monitoring components to determine their current
state and detect failures is essential
Monitoring systems:


Collect, aggregate, and sometimes act upon data
describing system state
Monitoring can help users make resource selection
decisions and help administrators detect problems
GT4 Monitoring and Discovery System

A Web service adhering to the Web Services Resource
Framework (WSRF) standards
Consists of two higher-level services:
• Index service collects and publishes aggregated
information about Grid resources


Trigger service collects resource information from the
Index Service and performs actions when certain trigger
conditions are met
Information about resources is obtained from external
components called information providers

Currently in ESG, these are simple scripts and programs
that check the status of services
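An information provider of the kind described above can be sketched as a small script that probes a service and reports the result as XML. This is only an illustration: the XML element names and the `check_tcp_service` helper are assumptions made for the example, not the MDS4 schema or the actual ESG scripts (which run service-specific clients rather than a plain TCP probe).

```python
import socket
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def check_tcp_service(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical probe)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def status_xml(service: str, site: str, is_up: bool) -> str:
    """Wrap a probe result in a small XML document (element names are made up)."""
    root = ET.Element("ServiceStatus")
    ET.SubElement(root, "Service").text = service
    ET.SubElement(root, "Site").text = site
    ET.SubElement(root, "Status").text = "up" if is_up else "down"
    ET.SubElement(root, "Checked").text = datetime.now(timezone.utc).isoformat()
    return ET.tostring(root, encoding="unicode")
```

A provider script would call `check_tcp_service` with the host and port of the monitored service and print `status_xml(...)` so that the aggregating service can read it.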
[Figure: Monitoring Host at NCAR. A Web Services Container hosts the MDS4 Index Service (aggregates information from sources, answers queries) and the MDS4 Trigger Service (pulls data from Index, checks trigger conditions, takes actions). Two Aggregator Sources (each wraps status with additional information) sit above two Information Providers: one runs an RLS client to check the status of the RLS Server at Site A, the other runs an SRM client to check SRM and storage system status for the Storage Resource Manager and Hierarchical Storage System at Site B. Arrows labeled poll, status (XML), and resource status connect the components.]
ESG Services Currently Monitored







GridFTP server: NCAR
OPeNDAP server: NCAR
Web Portal: NCAR
HTTP Dataserver: LANL, NCAR
RLS servers: LANL, LBNL, NCAR, ORNL
Storage Resource Managers: LBNL, NCAR, ORNL
Hierarchical Mass Storage Systems: LBNL, NCAR, ORNL
Monitoring Overall System Status







Monitored data are collected in
MDS4 Index service
Information providers check
resource status at a configured
frequency
Currently, every 10 minutes
Report status to Index Service
This resource information in
Index Service is queried by the
ESG Web portal
Used to generate overall picture
of state of ESG resources
Displayed on ESG Web portal
page
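The portal's overall status picture amounts to grouping the latest status reports by service and site. The sketch below assumes a simple record format (`service`, `site`, `status`) standing in for whatever an Index Service query returns; the records and field names are invented for illustration.

```python
from collections import defaultdict

POLL_INTERVAL_SECONDS = 10 * 60  # the slides state a 10-minute check frequency


def summarize(entries):
    """Group (service, site, status) records into a per-service overview."""
    overview = defaultdict(dict)
    for entry in entries:
        overview[entry["service"]][entry["site"]] = entry["status"]
    return dict(overview)


def unavailable(overview):
    """List (service, site) pairs whose latest status is not 'up'."""
    return sorted(
        (service, site)
        for service, sites in overview.items()
        for site, status in sites.items()
        if status != "up"
    )


# Made-up sample records standing in for an Index Service query result.
sample = [
    {"service": "RLS", "site": "NCAR", "status": "up"},
    {"service": "RLS", "site": "ORNL", "status": "down"},
    {"service": "SRM", "site": "LBNL", "status": "up"},
]
overview = summarize(sample)
print(unavailable(overview))  # [('RLS', 'ORNL')]
```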
Trigger Actions Based on
Monitoring Information


MDS4 Trigger service periodically polls Index Service
Based on the current resource status, Trigger service
determines whether specified trigger rules and
conditions are satisfied
If so, performs specified action for each trigger
Current action: Trigger service sends email to system
administrators when services fail
Ideally, system failures can be detected and corrected by
administrators before they affect larger ESG community
Future plans: include richer recovery operations as
trigger actions, e.g., automatic restart of failed services
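The poll, check, and act cycle described on this slide can be sketched as a list of rules, each pairing a condition with an action. The `Trigger` class, the record fields, and the `notify_admins` stand-in are hypothetical and do not reflect the MDS4 Trigger Service configuration format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    name: str
    condition: Callable[[Dict[str, str]], bool]  # True when the rule fires
    action: Callable[[Dict[str, str]], str]      # returns a description of what was done


def run_triggers(triggers: List[Trigger], resources: List[Dict[str, str]]) -> List[str]:
    """Evaluate every trigger against every resource record from one poll."""
    results = []
    for resource in resources:
        for trigger in triggers:
            if trigger.condition(resource):
                results.append(trigger.action(resource))
    return results


def notify_admins(resource: Dict[str, str]) -> str:
    # Stand-in for sending email; a real action might use smtplib.
    return f"email: {resource['service']} at {resource['site']} is {resource['status']}"


service_down = Trigger(
    name="service-down",
    condition=lambda r: r["status"] != "up",
    action=notify_admins,
)

# Made-up records standing in for one poll of the Index Service.
polled = [
    {"service": "GridFTP", "site": "NCAR", "status": "up"},
    {"service": "SRM", "site": "ORNL", "status": "down"},
]
for line in run_triggers([service_down], polled):
    print(line)  # email: SRM at ORNL is down
```

Keeping the action as a plain callable means a richer recovery step, such as restarting a service, could be added as another `Trigger` without changing the evaluation loop.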
Example Monitoring Information
Total error messages for May 2006: 47
Messages related to certificate and configuration problems at LANL: 38
Failure messages due to brief interruption in network service at ORNL on 5/13: 2
HTTP data server failure at NCAR 5/17: 1
RLS failure at LLNL 5/22: 1
Simultaneous error messages for SRM services at NCAR, ORNL, LBNL on 5/23: 3
RLS failure at ORNL 5/24: 1
RLS failure at LBNL 5/31: 1
Successes and Lessons Learned
in ESG Monitoring


Overview of current system state for users and
system administrators
• ESG portal displays an overall picture of the current
status of the ESG infrastructure
• Gives users and administrators an understanding at
a glance of which resources and services are
currently available
Failure notification
• Failure messages from Trigger service have helped
system administrators to identify and quickly address
failed components and services
• Before the monitoring system was deployed, services
would fail and might not be detected until a user
tried to access an ESG dataset
• MDS4 deployment has enabled a unified interface
and notification system across ESG resources
Successes and Lessons Learned
in ESG Monitoring (cont.)

More information was needed on failure types
• An enhancement to MDS4 based on our experience:
include additional information about location and
type of failed service in subject line of trigger
notification email messages
• Allow message recipients to filter these messages
and quickly identify which services need attention.
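Putting the service type and site in the subject line lets recipients filter on the subject alone. The sketch below uses Python's standard `email.message.EmailMessage`; the subject format, the addresses, and the `matches` helper are assumptions for illustration, not the format used by the ESG deployment.

```python
from email.message import EmailMessage


def failure_notice(service: str, site: str, detail: str) -> EmailMessage:
    """Build a notification whose subject names the failed service and its site."""
    msg = EmailMessage()
    msg["Subject"] = f"[ESG monitor] {service} failure at {site}"  # assumed format
    msg["From"] = "monitor@example.org"   # placeholder address
    msg["To"] = "admins@example.org"      # placeholder address
    msg.set_content(detail)
    return msg


def matches(msg: EmailMessage, service: str = "", site: str = "") -> bool:
    """Filter on the subject line alone, as a mail client rule would."""
    subject = str(msg["Subject"])
    return f"{service} failure" in subject and f"at {site}" in subject


notices = [
    failure_notice("RLS", "ORNL", "RLS client query timed out."),
    failure_notice("SRM", "NCAR", "SRM ping failed."),
]
ornl = [str(m["Subject"]) for m in notices if matches(m, site="ORNL")]
print(ornl)  # ['[ESG monitor] RLS failure at ORNL']
```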
Successes and Lessons Learned
in ESG Monitoring (cont.)

Validation of new deployments







Sometimes make significant changes to the Grid
infrastructure
E.g., Modification of service configurations or deployment
of a new component version
May encounter a series of failure messages for particular
classes of components over a period of days or weeks due
to these changes
Example: pattern of failure messages for RLS servers that
corresponded to a configuration problem related to
updates among the services
Example: series of SRM failure messages relating to a new
feature that had unexpected behavior
Monitoring messages helped to identify problems with
these newly deployed or reconfigured services
Absence of failure messages can in part validate a new
configuration or deployment
Successes and Lessons Learned
in ESG Monitoring (cont.)

Failure deduction




The monitoring system can be used to deduce the reason
for complex failures.
Example: we used MDS4 to gain insights into why the ESG
portal crashed occasionally due to a lack of available file
descriptors
• Used monitoring infrastructure to check file descriptor
usage by different services running on ESG portal
Example: Failure messages indicated that SRMs at three
different locations had failed simultaneously. Such
simultaneous independent failures are highly unlikely. We
investigated and found a problem with a query expression
in our monitoring software.


System-wide monitoring can be used to detect a pattern of
failures that occur close together in time
Deduce a problem at a different level of the system
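Detecting failures that occur close together in time can be sketched as clustering failure events by timestamp and flagging clusters in which the same service type failed at several sites. The 10-minute window, the three-site threshold, and the times of day in the sample data are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def clusters(failures, window=timedelta(minutes=10)):
    """Group failure events whose timestamps fall within `window` of the previous one."""
    groups, current = [], []
    for event in sorted(failures, key=lambda e: e["time"]):
        if current and event["time"] - current[-1]["time"] > window:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    return groups


def suspicious(groups, min_sites=3):
    """Keep clusters in which one service type failed at several sites at once."""
    flagged = []
    for group in groups:
        services = {e["service"] for e in group}
        sites = {e["site"] for e in group}
        if len(services) == 1 and len(sites) >= min_sites:
            flagged.append((services.pop(), sorted(sites)))
    return flagged


# Dates follow the May 2006 examples; the times of day are invented.
events = [
    {"service": "SRM", "site": "NCAR", "time": datetime(2006, 5, 23, 9, 0)},
    {"service": "SRM", "site": "ORNL", "time": datetime(2006, 5, 23, 9, 1)},
    {"service": "SRM", "site": "LBNL", "time": datetime(2006, 5, 23, 9, 2)},
    {"service": "RLS", "site": "ORNL", "time": datetime(2006, 5, 24, 14, 30)},
]
print(suspicious(clusters(events)))  # [('SRM', ['LBNL', 'NCAR', 'ORNL'])]
```

A flagged cluster does not say what went wrong; it only suggests looking for a shared cause, such as the monitoring software itself, rather than three independent faults.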
Successes and Lessons Learned
in ESG Monitoring (cont.)

Warn of Certificate Problems and Imminent Expirations
• All ESG services at the LANL site reported failures
simultaneously
• Problem was expiration of the host certificate for the ESG
node at that site
• Downtime resulted while the problem was diagnosed and
administrators requested and installed a new host
certificate
• To avoid such downtime in the future, we implemented
additional information providers and triggers that check
the expiration date of host certificates on services where
this information can be queried
• Trigger Service informs system administrators
when certificate expiration is imminent
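The expiration check reduces to comparing a certificate's expiration date with the current time plus a warning margin. In the sketch below the 14-day margin, the host names, and the dates are assumptions; obtaining the expiration date from a real certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(not_after, now, warn_within=timedelta(days=14)):
    """Classify a certificate by how close its expiration date is."""
    if not_after <= now:
        return "expired"
    if not_after - now <= warn_within:
        return "expiring-soon"
    return "ok"


now = datetime(2007, 7, 2, tzinfo=timezone.utc)  # arbitrary reference time
certs = {  # hypothetical hosts and expiration dates
    "host-a": datetime(2007, 7, 10, tzinfo=timezone.utc),
    "host-b": datetime(2008, 1, 1, tzinfo=timezone.utc),
    "host-c": datetime(2007, 6, 30, tzinfo=timezone.utc),
}
report = {host: expiry_status(not_after, now) for host, not_after in certs.items()}
print(report)
# {'host-a': 'expiring-soon', 'host-b': 'ok', 'host-c': 'expired'}
```

A trigger condition would then fire on any status other than `ok`, giving administrators time to request a new certificate before services start failing.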
Successes and Lessons Learned
in ESG Monitoring (cont.)

Scheduled Downtime



When a particular site has scheduled downtime for site
maintenance, it is not necessary to send failure messages
to system administrators
Developed a simple mechanism that disables particular
triggers for the specified downtime period
Monitoring infrastructure still collects information about
service state during this period, but failure conditions do
not trigger actions by the Trigger Service
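The downtime mechanism can be sketched as a calendar of maintenance windows consulted before any action is taken, while status collection continues unchanged. The `DowntimeCalendar` class, the record fields, and the sample dates are hypothetical.

```python
from datetime import datetime


class DowntimeCalendar:
    """Records maintenance windows per site (a hypothetical structure)."""

    def __init__(self):
        self._windows = []  # list of (site, start, end)

    def schedule(self, site, start, end):
        if end <= start:
            raise ValueError("downtime must end after it starts")
        self._windows.append((site, start, end))

    def suppressed(self, site, when):
        """True if triggers for `site` are disabled at time `when`."""
        return any(s == site and start <= when < end for s, start, end in self._windows)


def handle(record, calendar, log, actions):
    log.append(record)  # status is always collected
    if record["status"] != "up" and not calendar.suppressed(record["site"], record["time"]):
        actions.append(f"notify: {record['service']} at {record['site']}")


calendar = DowntimeCalendar()
calendar.schedule("ORNL", datetime(2006, 5, 13, 0, 0), datetime(2006, 5, 13, 6, 0))

log, actions = [], []
records = [  # made-up status reports
    {"service": "SRM", "site": "ORNL", "status": "down", "time": datetime(2006, 5, 13, 3, 0)},
    {"service": "RLS", "site": "NCAR", "status": "down", "time": datetime(2006, 5, 13, 3, 0)},
    {"service": "SRM", "site": "ORNL", "status": "down", "time": datetime(2006, 5, 13, 7, 0)},
]
for record in records:
    handle(record, calendar, log, actions)
print(len(log), actions)  # 3 ['notify: RLS at NCAR', 'notify: SRM at ORNL']
```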
Acknowledgements




ESG is funded by the US Department of Energy under the
Scientific Discovery Through Advanced Computing Program
MDS is funded by the US National Science Foundation under
the Office of Cyberinfrastructure
ESG Team includes:
• National Center for Atmospheric Research: Don Middleton,
Luca Cinquini, Rob Markel, Peter Fox, Jose Garcia, others
• Lawrence Livermore National Laboratory: Dean Williams,
Bob Drach and others
• Argonne National Laboratory: Veronika Nefedova, Ian
Foster, Rachana Ananthakrishnan, Frank Siebenlist, others
• Lawrence Berkeley National Laboratory: Arie Shoshani,
Alex Sim and others
• Oak Ridge National Laboratory: David Bernholdt, Meili
Chen and others
• Los Alamos National Laboratory: Phillip Jones and others
• USC Information Sciences Institute: Ann Chervenak,
Robert Schuler, Shishir Bharathi, Mei-Hui Su
MDS Team includes:


Argonne National Laboratory: Jen Schopf, Neill Miller
USC ISI: Laura Pearlman, Mike D’Arcy
More on Metadata
Metadata Services


Metadata is information that describes data
Metadata services allow scientists to






Record information about the creation, transformation,
meaning and quality of data items
Query for data items based on these descriptive attributes
Accurate identification of desired data items is essential for
correct analysis of experimental and simulation results.
In the past, scientists have largely relied on ad hoc
methods (descriptive file and directory names, lab
notebooks, etc.) to record information about data items
However, these methods do not scale to terabyte and
petabyte data sets consisting of millions of data items.
Extensible, reliable, high performance metadata services
are required to support registration and query of metadata
information
Presentation from SC2003 talk by
Gurmeet Singh
Example: ESG Collection Level
Metadata
Class Definitions
Project






A project is an organized activity that produces data. The
scope and duration of a project may vary, from a few
datasets generated over several weeks or months, to a
multi-year project generating many terabytes. Typically a
project will have one or more principal investigators and a
single funding source.
A project may be associated with multiple ensembles,
campaigns, and/or investigations. A project may be a
subproject of another project.
Examples:
CMIP (Coupled Model Intercomparison Project)
CCSM (Community Climate System Model)
PCM (Parallel Climate Model)
ESG Collection Level Metadata (cont.)

Ensemble


An ensemble calculation is a set of simulations that are
closely related, in that typically all aspects of the model
configuration and boundary conditions are held constant,
while the initial conditions and/or external forcing are
varied in a prescribed manner. Each set of initial conditions
generates one or more datasets.
Campaign
A campaign is a set of observational activities that share a
common goal (e.g., observation of the ozone layer during
the winter/spring months), and are related geographically
(e.g., a campaign at the South Pole) and/or
temporally (e.g., measurements of rainfall at several
observation stations during December 2003).

Investigation

An investigation is an activity, within a project, that
produces data. The scope of the investigation is narrower
and more focused than for the project. An investigation
may be a simulation, experiment, observation, or analysis.
Example: ESG Collection
Level Metadata
Other Classes

Simulation

Experiment

Observation

Analysis

Dataset

Service
Attributes of classes

Project

Id: a unique identifier for the project.








Name: a brief name for the project intended for display in a browser, etc.
Topics: one or more keywords, qualified by an optional encoding, intended to be used by specialized search and discovery engines. See, for example, http://gcmd.gsfc.nasa.gov/Resources/valids/gcmd_parameters.html
Persons: project participants and their respective roles.
Description: a textual description of the project, intended to provide more in-depth information than the Name.
Notes: additional, ad-hoc information about the project.
References: links or references to additional project information: web pages, publications, etc.
Funding: funding agencies or sources.
Rights: description of the ownership and access conditions to the data holdings of the project.

Ensemble







Id: a unique identifier for this ensemble.
Name: name for this ensemble.
Description: a textual description of the ensemble, intended to provide more in-depth information than the Name.
Notes: additional, ad-hoc information about the ensemble.
Persons: those responsible for the ensemble data.
References: optional links or references to additional project information: web pages, publications, etc.
Rights: optional description of the ownership and access conditions to the data holdings of the ensemble, if different from the project.
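The Project and Ensemble attributes listed above map naturally onto record types. The sketch below is one possible rendering as Python dataclasses; the field types, the `Person` class, the `effective_rights` helper, and the sample values are assumptions, not the ESG schema itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    name: str
    role: str


@dataclass
class Ensemble:
    id: str
    name: str
    description: str = ""
    notes: str = ""
    persons: List[Person] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    rights: Optional[str] = None  # None means "same as the project"


@dataclass
class Project:
    id: str
    name: str
    topics: List[str] = field(default_factory=list)
    persons: List[Person] = field(default_factory=list)
    description: str = ""
    notes: str = ""
    references: List[str] = field(default_factory=list)
    funding: List[str] = field(default_factory=list)
    rights: str = ""
    ensembles: List[Ensemble] = field(default_factory=list)
    parent: Optional["Project"] = None  # a project may be a subproject of another

    def effective_rights(self, ensemble: Ensemble) -> str:
        """Ensemble rights override project rights only when given."""
        return ensemble.rights if ensemble.rights is not None else self.rights


# Illustrative values only; identifiers and rights text are invented.
run_a = Ensemble(id="ens-1", name="run-a")
run_b = Ensemble(id="ens-2", name="run-b", rights="restricted")
pcm = Project(id="pcm", name="PCM (Parallel Climate Model)",
              rights="open to registered users", ensembles=[run_a, run_b])
print(pcm.effective_rights(run_a))  # open to registered users
print(pcm.effective_rights(run_b))  # restricted
```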





















A standard name is a description of a scientific quantity
generated by a model run
Follows the CF standard name table, and is hierarchical
For example, the standard name ‘atmosphere’ is a standard
name category that includes more specific quantities such as
‘air pressure’
- Atmosphere
    - Air Pressure
    - …
- Carbon Cycle
    - Biomass Burning Carbon Flux
    - …
- Cloud
    - Air Pressure at Cloud Base
    - …
- Hydrology
    - Atmosphere Water Content
    - …
- Ocean
    - Baroclinic Eastward Sea Water Velocity
    - …
- Radiation
    - Atmosphere Net Rate of Absorption of Longwave Energy
    - …
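The two-level hierarchy shown here can be held in a simple mapping from category to standard names. The sketch below includes only the entries visible on the slide (the elided ones are omitted), and the `category_of` and `search` helpers are illustrative, not part of any ESG or CF tooling.

```python
STANDARD_NAMES = {
    "Atmosphere": ["Air Pressure"],
    "Carbon Cycle": ["Biomass Burning Carbon Flux"],
    "Cloud": ["Air Pressure at Cloud Base"],
    "Hydrology": ["Atmosphere Water Content"],
    "Ocean": ["Baroclinic Eastward Sea Water Velocity"],
    "Radiation": ["Atmosphere Net Rate of Absorption of Longwave Energy"],
}


def category_of(name):
    """Return the category that lists `name`, or None if it is not present."""
    for category, names in STANDARD_NAMES.items():
        if name in names:
            return category
    return None


def search(term):
    """Case-insensitive substring search over the specific standard names."""
    term = term.lower()
    return sorted(
        f"{category} / {name}"
        for category, names in STANDARD_NAMES.items()
        for name in names
        if term in name.lower()
    )


print(category_of("Air Pressure at Cloud Base"))  # Cloud
print(search("air pressure"))
# ['Atmosphere / Air Pressure', 'Cloud / Air Pressure at Cloud Base']
```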
Metadata Services in Practice…

Generic metadata services have not proven to be
very useful
• MCS used in Pegasus workflow system to manage its
metadata, provenance, etc.
• Not widely used in science deployments
Virtual organizations (scientists) agree on
appropriate metadata schema to describe data
Typically deploy a specialized metadata service
• Relational database with indexes on domain-specific
attributes to support common queries
• RDF tuple services
• Provide faster, more targeted queries on agreed
metadata than a generic catalog
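A specialized metadata service of the relational kind can be sketched with Python's built-in `sqlite3` module: a table of datasets with an index on the attributes that common queries filter on. The table layout, column names, and sample rows are invented for illustration and are not the ESG catalog schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset (
        id        TEXT PRIMARY KEY,
        project   TEXT NOT NULL,
        ensemble  TEXT,
        variable  TEXT NOT NULL   -- e.g. a standard name
    );
    -- Index the attributes that common queries filter on.
    CREATE INDEX idx_dataset_project_variable ON dataset (project, variable);
    """
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [  # made-up rows
        ("ds-001", "PCM", "run-a", "Air Pressure"),
        ("ds-002", "PCM", "run-a", "Atmosphere Water Content"),
        ("ds-003", "CCSM", "run-b", "Air Pressure"),
    ],
)


def find(project, variable):
    """Return dataset ids for one project and one variable, in id order."""
    rows = conn.execute(
        "SELECT id FROM dataset WHERE project = ? AND variable = ? ORDER BY id",
        (project, variable),
    )
    return [row[0] for row in rows]


print(find("PCM", "Air Pressure"))  # ['ds-001']
```

Because the community agrees on the schema in advance, the index can be chosen to match the queries scientists actually run, which is the advantage the slide claims over a generic catalog.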