CZO_Poster_2010 - San Diego Supercomputer Center

Building an integrated information system for publishing
heterogeneous Critical Zone Observatory data
Thomas Whitenack (1), Mark Williams (2), David Tarboton (3), Ilya Zaslavsky (1), Matej Durcik (4), Ryan Lucas (5), Charles Dow (6), Xiande Meng (5),
Brian Bills (7), Miguel Leon (8), Chi Yang (2), Melanie Arnold (6), Anthony Aufdenkampe (6), Kim Schreuders (3), Otto Alvarez (5)
(1) San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, United States; (2) University of Colorado, Boulder, Boulder, CO, United States; (3) Utah State University, Logan, UT, United States; (4) Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ, United States;
(5) University of California, Merced, Merced, CA, United States; (6) Stroud Water Research Center, Chester, PA, United States; (7) Penn State University, State College, PA, United States; (8) University of Pennsylvania, Philadelphia, PA, United States.
Abstract
The Critical Zone Observatory (CZO) program is a collaborative effort to
advance scientific understanding of multi-scale environmental
interactions in the critical zone from bedrock to the atmospheric
boundary layer. CZO sites use a mix of established and novel data
collection methods to examine hydrogeochemical and physical
processes in the critical zone. Publishing, analyzing and archiving these
data in a consistent and integrated manner across all CZO sites is
challenging due to the inherent heterogeneity in data collection and
processing techniques.
A goal of the CZO program is an integrated data management model
across all CZO sites that can be used to discover, browse, retrieve and
unambiguously interpret CZO data. This paper describes the ongoing
development of such a system by a team comprising data managers
from each CZO site and cyberinfrastructure researchers. The design
follows a uniform, standards-based approach that draws on experience
and software developed in related NSF-supported cyberinfrastructure
projects (LTER, CUAHSI HIS, EarthChem, etc.).
We describe the information model used (adapted from CUAHSI's
Observations Data Model), present a web-based mechanism for
publishing CZO point observations data, and discuss potential
extensions of the publication model to other types of data. In this
system each CZO site maintains its own data management system,
and generates human- and computer-readable “display files” that follow
an agreed-upon format and contain the necessary metadata. The display
files reside on individual CZO web sites, but are harvested by a
centralized CZO application that parses the files, identifies and
validates new data and related metadata, and ultimately loads the new
data into a CZO-modified version of the CUAHSI Observations Data
Model. The ingested data values are then published using web services
that follow the Water Markup Language (WaterML) specification, and
can be retrieved and analyzed by WaterML-compatible client
applications such as HydroDesktop.
System design
CZO follows a service-oriented architecture design.
Data in standard CZO formats are harvested into
a community data repository, and presented as
standard pre-registered CZO services for each
site. The services are used to harvest metadata
and add it to the CZO metadata catalog.
The CZO data publication system is designed to be extensible to different
types of data collected at CZO sites: point time series, geochemical
samples, geophysical and biological data, spatial data, etc. Each type of
data will be available as web services following OGC service interface
specifications and community standards for data exchange.
Using desktop and online tools, users can
discover and retrieve the data, and create
derived CZO data products, which are in turn
registered at the CZO Central.
We also plan to collaborate with DataNet on long-term
preservation and register CZO services in
various domain registries (e.g. CUAHSI
HydroCatalog, EarthChem Portal).
Display file describes components of measurement:
where (location), when (datetime), what (attribute),
how (method), who (investigator) + value
• \doc (title, abstract, investigator, var names, etc.)
• \header
  DEFAULT_PARAMETER (pertains to entire file unless overridden)
  Column headers (define each column, i.e. time series or group of time series):
  COL4. label=VariableName, value=pH, units=pH units, missing value indicator=-9999
• \data
  GREEN LAKE 4,820311,,6.4,18,88.51,0.40,,114.77,…
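To make the three-part layout concrete, the following is a minimal Python sketch of how a file with \doc, \header and \data sections might be read. It is an illustration only: the poster shows just fragments of the format, so the exact syntax assumed here (one section marker per line, "COLn." lines holding comma-separated key=value pairs, comma-separated data rows) and the sample content are assumptions, and the function names are hypothetical rather than part of the CZO software.

```python
import csv

# Hypothetical sample, loosely modeled on the fragments shown on the poster.
SAMPLE = r"""
\doc
Example stream chemistry display file (hypothetical content)
\header
COL1. label=Site
COL2. label=Date
COL3. label=VariableName, value=pH, units=pH units, missing value indicator=-9999
\data
GREEN LAKE 4,820311,6.4
GREEN LAKE 4,820318,-9999
"""


def parse_column(line):
    """Turn 'COL3. label=..., units=...' into a dict (assumed syntax)."""
    name, _, rest = line.partition(".")
    attrs = {"column": name.strip()}
    for part in rest.split(","):
        key, sep, value = part.partition("=")
        if sep:  # ignore fragments without a key=value shape
            attrs[key.strip()] = value.strip()
    return attrs


def parse_display_file(text):
    """Split the text into doc lines, column definitions and data rows."""
    sections = {"doc": [], "header": [], "data": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            current = line[1:].lower()
            if current not in sections:
                raise ValueError(f"unknown section marker: {line}")
            continue
        if current is None:
            raise ValueError("content found before the first section marker")
        sections[current].append(line)
    columns = [parse_column(l) for l in sections["header"]
               if l.upper().startswith("COL")]
    rows = list(csv.reader(sections["data"]))
    return sections["doc"], columns, rows


def clean_values(columns, rows):
    """Replace each column's missing-value indicator with None."""
    cleaned = []
    for row in rows:
        out = []
        for col, cell in zip(columns, row):
            missing = col.get("missing value indicator")
            out.append(None if cell == missing else cell)
        cleaned.append(out)
    return cleaned


doc, columns, rows = parse_display_file(SAMPLE)
values = clean_values(columns, rows)
```

Keeping the column definitions separate from the rows mirrors the idea that the header carries the metadata needed to interpret each value; a real parser would also have to handle DEFAULT_PARAMETER lines and per-column overrides, which this sketch leaves out.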
CZO community data registry and repository
Accessing CZO services
• A CZO scheduler application checks all CZO web sites
for new display files at regular intervals.
• New or updated data are retrieved and parsed by the CZO
Data Interpreter, and validated against shared
vocabularies.
• After the Data Interpreter loads data into the respective
ODM databases, the services are updated, and the CZO
Central harvester uses the services to retrieve metadata
and populate the CZO Central metadata catalog.
• The catalog is searchable from client applications via
search web services.
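The first two steps listed above (spotting new or updated display files, then validating them against shared vocabularies) could be sketched roughly as follows. The poster does not say how the scheduler recognizes changed files, so comparing content hashes is an assumption made for illustration; the URL, vocabulary terms and function names are likewise hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used here to tell whether a file has changed."""
    return hashlib.sha256(content).hexdigest()


def find_changed(seen: dict, fetched: dict) -> list:
    """Return URLs whose content is new or differs from the last harvest.

    `seen` maps URL -> last known hash and is updated in place;
    `fetched` maps URL -> file content retrieved in this run.
    """
    changed = []
    for url, content in fetched.items():
        digest = fingerprint(content)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


def unknown_terms(variable_names, vocabulary):
    """Return variable names that are absent from the shared vocabulary."""
    known = {term.lower() for term in vocabulary}
    return [name for name in variable_names if name.lower() not in known]


# Hypothetical harvest runs against a single display file.
url = "http://example.org/czo/site_a/stream_chemistry.txt"
seen_hashes = {}
first_run = find_changed(seen_hashes, {url: b"version 1"})
second_run = find_changed(seen_hashes, {url: b"version 1"})
third_run = find_changed(seen_hashes, {url: b"version 2"})

# Hypothetical shared vocabulary check.
rejected = unknown_terms(["pH", "Nitrate", "pHH"], ["pH", "nitrate"])
```

In this sketch an unchanged file is skipped on the second run, and the misspelled variable name is reported for correction rather than loaded; how the actual Data Interpreter reports validation failures is not described in the poster.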
The display file format incorporates information model enhancements
such as multiple types of named vertical and horizontal offsets and data
loggers collecting information from multiple sampling locations within a
single ‘site’. Additionally, the display file provides a tiered approach to
publishing environmental data. Harvesting these data via a centralized
system provides continuity to the data collected across all CZO sites
that ultimately facilitates cross-site data exploration and analysis.
Towards CZO web services
CZO data publication model
Once a CZO web service is updated and registered in CZO Central, it can
be discovered in HydroDesktop (CZODesktop), an open-source
application with rich mapping and time series analysis capabilities.
The services can also be accessed by other clients, including MATLAB, R,
Excel and ArcGIS.
We are working with the Open Geospatial Consortium towards WaterML 2.0 as an
OGC standard.
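Because the services return WaterML, any client that can parse XML can consume them. The Python sketch below reads a small hand-written fragment modeled on a WaterML 1.1 response. The element names and namespace reflect our reading of that schema and should be checked against the specification; the site name and values are invented, and no actual CZO service is called.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace assumed for WaterML 1.1 responses.
NS = {"wml": "http://www.cuahsi.org/waterML/1.1/"}

# Hand-written, heavily trimmed fragment (hypothetical content).
SAMPLE_RESPONSE = """<timeSeriesResponse xmlns="http://www.cuahsi.org/waterML/1.1/">
  <timeSeries>
    <sourceInfo><siteName>Hypothetical Site</siteName></sourceInfo>
    <variable>
      <variableName>pH</variableName>
      <noDataValue>-9999</noDataValue>
    </variable>
    <values>
      <value dateTime="2010-06-01T12:00:00">6.4</value>
      <value dateTime="2010-06-02T12:00:00">-9999</value>
      <value dateTime="2010-06-03T12:00:00">6.6</value>
    </values>
  </timeSeries>
</timeSeriesResponse>"""


def read_series(xml_text):
    """Return (variable name, [(datetime, float), ...]) for the first series."""
    root = ET.fromstring(xml_text)
    series = root.find("wml:timeSeries", NS)
    name = series.findtext("wml:variable/wml:variableName", namespaces=NS)
    no_data = series.findtext("wml:variable/wml:noDataValue", namespaces=NS)
    points = []
    for node in series.findall("wml:values/wml:value", NS):
        text = (node.text or "").strip()
        if text == no_data:
            continue  # skip values flagged as missing
        stamp = datetime.fromisoformat(node.get("dateTime"))
        points.append((stamp, float(text)))
    return name, points


variable_name, points = read_series(SAMPLE_RESPONSE)
```

A real response carries considerably more metadata (site codes, units, methods, sources) than this fragment, which is what allows clients such as HydroDesktop to interpret the values without consulting the originating site.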
HydroDesktop, showing one of 31 newly ingested time series from Boulder Creek CZO
OGC/WMO Hydrology Domain Working Group
http://external.opengis.org/twiki_public/bin/view/HydrologyDWG/WebHome
The CZO Central service registry
Leveraging earth science projects
Synthesizing information management experience
and software from CZO partners and neighboring
earth science projects into a standards-based system
for publishing environmental data to emphasize the
“critical zone” nature of our shared data sets
A water data service page for the Jemez River Basin CZO
Conclusion
The CZO integrated data publishing system enables CZO participants to share data in standard formats via web services. The display file
format is flexible enough to accommodate information model enhancements and extensible to other types of CZO data beyond point time series.
This CZO Central user interface is used to associate
CZO variables with terms in a hydrologic concept
hierarchy, to support concept-based search
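As a rough illustration of why tagging variables with concepts helps search, the Python sketch below finds every variable tagged with a concept or with any concept beneath it in a hierarchy. The concept names, variable identifiers and functions are hypothetical and are not taken from the actual CZO Central catalog or the hydrologic ontology it uses.

```python
# Hypothetical concept hierarchy, stored as child -> parent.
PARENT = {
    "Water chemistry": "Hydrosphere",
    "pH": "Water chemistry",
    "Nitrate": "Water chemistry",
    "Streamflow": "Hydrosphere",
}

# Hypothetical tagging of site-specific variables with concepts.
VARIABLE_CONCEPTS = {
    "site_a:pH_lab": "pH",
    "site_b:NO3": "Nitrate",
    "site_b:Q": "Streamflow",
}


def descendants(concept, parent):
    """Return the concept together with everything below it."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in found and child not in found:
                found.add(child)
                changed = True
    return found


def search(concept, variable_concepts, parent):
    """Return variables tagged with the concept or any narrower concept."""
    wanted = descendants(concept, parent)
    return sorted(v for v, c in variable_concepts.items() if c in wanted)


chemistry_hits = search("Water chemistry", VARIABLE_CONCEPTS, PARENT)
all_hits = search("Hydrosphere", VARIABLE_CONCEPTS, PARENT)
```

The point of the sketch is that a user searching for a broad concept retrieves variables that different sites named differently, as long as each was associated with a term in the shared hierarchy.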
A water data service page for the Boulder Creek CZO. The
page is used by data managers to present general
metadata about their observations (including an abstract
and recommended citation), test the services and associate
variable names with hydrologic concepts
The CZO data publication model provides an attractive option for publishing environmental data: a) all current data are available from
individual CZO web sites in human-readable format, b) CZOs maintain their own data systems and are not required to install or maintain
additional servers, and c) the data are harvested, validated, versioned and archived at a central site (eventually hosted in the cloud), which
is responsible for making them available as standards-based web services and for evolving the web service formats as new environmental standards
are adopted.