Building an integrated information system for publishing heterogeneous Critical Zone Observatory data

Thomas Whitenack (1), Mark Williams (2), David Tarboton (3), Ilya Zaslavsky (1), Matej Durcik (4), Ryan Lucas (5), Charles Dow (6), Xiande Meng (5), Brian Bills (7), Miguel Leon (8), Chi Yang (2), Melanie Arnold (6), Anthony Aufdenkampe (6), Kim Schreuders (3), Otto Alvarez (5)

(1) San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, United States
(2) University of Colorado, Boulder, Boulder, CO, United States
(3) Utah State University, Logan, UT, United States
(4) Department of Hydrology and Water Resources, University of Arizona, Tucson, AZ, United States
(5) University of California, Merced, Merced, CA, United States
(6) Stroud Water Research Center, Chester, PA, United States
(7) Penn State University, State College, PA, United States
(8) University of Pennsylvania, Philadelphia, PA, United States

Abstract
The Critical Zone Observatory (CZO) program is a collaborative effort to advance scientific understanding of multi-scale environmental interactions in the critical zone from bedrock to the atmospheric boundary layer. CZO sites use a mix of established and novel data collection methods to examine hydrogeochemical and physical processes in the critical zone. Publishing, analyzing and archiving these data in a consistent and integrated manner across all CZO sites is challenging due to the inherent heterogeneity in data collection and processing techniques. A goal of the CZO program is an integrated data management model across all CZO sites that can be used to discover, browse, retrieve and unambiguously interpret CZO data. This paper describes the ongoing development of such a system by a team comprising data managers from each CZO site and cyberinfrastructure researchers. The design follows a uniform, standards-based approach that draws on experience and software developed in related NSF-supported cyberinfrastructure projects (LTER, CUAHSI HIS, EarthChem, etc.).
We describe the information model used (adapted from CUAHSI's Observations Data Model), present a web-based mechanism for publishing CZO point observations data, and discuss potential extensions of the publication model to other types of data. In this system each CZO site maintains its own data management system and generates human- and computer-readable “display files” that follow an agreed-upon format and contain the necessary metadata. The display files reside on individual CZO web sites, but are harvested by a centralized CZO application that parses the files, identifies and validates new data and related metadata, and ultimately loads the new data into a CZO-modified version of the CUAHSI Observations Data Model (ODM). The ingested data values are then published using web services that follow the Water Markup Language (WaterML) specification, and can be retrieved and analyzed by WaterML-compatible client applications such as HydroDesktop.

System design
CZO follows a service-oriented architecture design. Data in standard CZO formats are harvested into a community data repository and presented as standard, pre-registered CZO services for each site. The services are used to harvest metadata and add it to the CZO metadata catalog. Using desktop and online tools, users can discover and retrieve the data and create derived CZO data products, which are in turn registered at CZO Central.

The CZO data publication system is designed to be extensible to the different types of data collected at CZO sites: point time series, geochemical samples, geophysical and biological data, spatial data, etc. Each type of data will be available as web services following OGC service interface specifications and community standards for data exchange. We also plan to collaborate with DataNet on long-term preservation and to register CZO services in various domain registries (e.g. CUAHSI HydroCatalog, EarthChem Portal).

A display file describes the components of a measurement: where (location), when (datetime), what (attribute), how (method) and who (investigator), plus the value. It is organized into three sections:
• \doc (title, abstract, investigator, variable names, etc.)
• \header
  DEFAULT_PARAMETER (pertains to the entire file unless overridden)
  Column headers (define each column, i.e. a time series or group of time series), for example:
  COL4. label=VariableName, value=pH, units=pH units, missing value indicator=-9999
• \data, for example:
  GREEN LAKE 4,820311,,6.4,18,88.51,0.40,,114.77,…

CZO community data registry and repository
• A CZO scheduler application checks all CZO web sites for new display files at regular intervals.
• New or updated data are retrieved and parsed by the CZO Data Interpreter, and validated against shared vocabularies.
• After the Data Interpreter loads the data into the respective ODM databases, the services are updated, and the CZO Central harvester uses the services to retrieve metadata and populate the CZO Central metadata catalog.
• The catalog is searchable from client applications via search web services.

CZO data publication model
The display file format incorporates information model enhancements such as multiple types of named vertical and horizontal offsets, and data loggers that collect information from multiple sampling locations within a single ‘site’. Additionally, the display file provides a tiered approach to publishing environmental data. Harvesting these data via a centralized system provides continuity across the data collected at all CZO sites, which ultimately facilitates cross-site data exploration and analysis.

Accessing CZO services
Once a CZO web service is updated and registered in CZO Central, it can be discovered in HydroDesktop (CZODesktop), an open source application with rich mapping and time series analysis capabilities. The services can also be accessed by other clients, including Matlab, R, Excel and ArcGIS.

Towards CZO web services model
We are working with the Open Geospatial Consortium (OGC) to advance WaterML 2.0 as an OGC standard.
OGC/WMO Hydrology Domain Working Group: http://external.opengis.org/twiki_public/bin/view/HydrologyDWG/WebHome

[Figure: HydroDesktop, showing one of 31 newly ingested time series from Boulder Creek CZO]

[Figure: The CZO Central service registry]

Leveraging earth science projects
We synthesize information management experience and software from CZO partners and neighboring earth science projects into a standards-based system for publishing environmental data, one that emphasizes the “critical zone” nature of our shared data sets.

[Figure: A water data service page for the Jemez River Basin CZO]

[Figure: A water data service page for the Boulder Creek CZO. The page is used by data managers to present general metadata about their observations (including an abstract and recommended citation), test the services, and associate variable names with hydrologic concepts.]

[Figure: This CZO Central user interface is used to associate CZO variables with terms in a hydrologic concept hierarchy, to support concept-based search.]

Conclusion
The CZO integrated data publishing system enables CZO participants to share data in standard formats via web services. The display file format is flexible enough to accommodate information model enhancements and is extensible to other types of CZO data beyond point time series.

The CZO data publication model provides an attractive option for publishing environmental data: a) all current data are available from individual CZO web sites in human-readable format; b) CZOs maintain their own data systems and are not required to install or maintain additional servers; and c) the data are harvested, validated, versioned and archived at a central site (eventually hosted in the cloud), which is responsible for making them available as standards-based web services and for evolving the web service formats as new environmental standards are adopted.
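
Illustrative sketch of display file parsing
The display file layout (\doc, \header and \data sections, with COLn column headers and a missing value indicator) and the step of validating harvested data against shared vocabularies could be sketched roughly as follows. This is not the CZO Data Interpreter: the sample file contents, the function names, the assumption that the first data column holds the site name, and the one-entry vocabulary are all hypothetical, and the real display file specification contains details not shown on this poster.

```python
import csv

# Hypothetical display file; only the COL3 line mirrors the example in the poster.
SAMPLE = r"""
\doc
TITLE. Hypothetical example file
\header
COL1. label=SiteName
COL2. label=Date
COL3. label=VariableName, value=pH, units=pH units, missing value indicator=-9999
\data
GREEN LAKE 4,820311,6.4
GREEN LAKE 4,820318,-9999
"""


def split_sections(text):
    """Group non-empty lines under the most recent backslash-prefixed section name."""
    sections, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            current = line[1:].lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def parse_column_header(line):
    """Turn 'COL3. key=value, key=value' into (zero-based index, attribute dict)."""
    name, _, rest = line.partition(".")
    attrs = {}
    for part in rest.split(","):
        key, sep, value = part.partition("=")
        if sep:
            attrs[key.strip()] = value.strip()
    return int(name.strip()[3:]) - 1, attrs


def load_observations(text, vocabulary):
    """Parse a display file and return (site, {variable: value or None}) tuples."""
    sections = split_sections(text)
    columns = dict(
        parse_column_header(line)
        for line in sections.get("header", [])
        if line.upper().startswith("COL")
    )
    # Keep only columns that declare a variable, and check each name
    # against the (assumed) shared vocabulary before loading any data.
    variables = {i: a for i, a in columns.items() if a.get("label") == "VariableName"}
    for attrs in variables.values():
        if attrs.get("value") not in vocabulary:
            raise ValueError(f"unknown variable name: {attrs.get('value')!r}")

    rows = []
    for fields in csv.reader(sections.get("data", [])):
        record = {}
        for index, attrs in variables.items():
            raw = fields[index].strip() if index < len(fields) else ""
            missing = attrs.get("missing value indicator")
            # Empty cells and the declared missing value indicator both become None.
            record[attrs["value"]] = None if raw in ("", missing) else float(raw)
        rows.append((fields[0], record))  # assumption: first column is the site name
    return rows


if __name__ == "__main__":
    for site, record in load_observations(SAMPLE, {"pH"}):
        print(site, record)
```

Rejecting an unknown variable name before any rows are read mirrors the idea that validation precedes loading into the ODM databases; a production harvester would presumably report such problems to the site data manager rather than raise an exception.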