CZO Integrated Data Management
Data Model and Metadata
David Tarboton
Based on CUAHSI HIS

CUAHSI HIS: Sharing hydrologic data
An Internet-based system to support the sharing of hydrologic data, comprising hydrologic databases and servers connected through web services, together with software for data publication, discovery and access.
[Figure labels: HIS Central (Data Discovery and Integration); HydroServer (Data Publication); HydroDesktop (Data Synthesis and Research); Metadata; Data; Analysis; ODM; WaterML; GML; OGC Services; Geo Data]
Support: EAR 0622374

Data System Overview
[Figure labels: CZO Desktop; CZO Central Data Store; WaterOneFlow Web Service (GetSites, GetSiteInfo, GetVariableInfo, GetValues; WaterML); Harvester; ASCII text; Standardized web based display; CZO Servers: Boulder, Shale, Sierra, Luquillo, Jemez, Christina]

Requirements
• Sufficient metadata for published CZO data to be unambiguously interpreted and used
• Each CZO operates its own local data management system
• The format used to present data and metadata should be identical across CZOs and should support heterogeneous local systems
• Local systems are autonomous, with local control over the release and publication of data

Access
• Users are required to agree to CZO data use policies
• Same data use agreement for all CZOs
• Data freely accessible to registered users who have agreed to the policy

Information Hierarchy
• National CZO
• Experimental Watershed
• Sites
• Variables
• Series
• Data values

Abstract data model
• (where) location, object or platform identifier
• (when) date and time
• (what) attribute (or identifier of attribute)
• THE VALUE
• (how) method (or identifier of method)
• (who) creator (or identifier of creator or data source)

Data series
• Used as an organizing construct
• Logical grouping of data values (usually from a column in a table)
• Commonly, but not limited to, time series (e.g. type series with depth)
• Properties we control become identifying series-level attributes
• Properties we measure become variables or variable-level attributes

Why an Observations Data Model
• Syntactic consistency (file types and formats)
• Semantic consistency
  – Language for observation attributes (structural)
  – Language to encode observation attribute values (contextual)
• Publishing and sharing research data
• Metadata to facilitate unambiguous interpretation
• Enhance analysis capability
What are the basic attributes to be associated with each single data value, and how can these best be organized?

Community Design Requirements (from comments of 22 reviewers)
• Incorporate sufficient metadata to identify provenance and give an exact definition of the data for unambiguous interpretation
• Spatial location of measurements
• Scale of measurements (support, spacing, extent)
• Depth/offset information
• Censored data
• Classification of data type to guide appropriate interpretation
  – Continuous
  – Indication of gaps
• Indicate data quality
http://www.neng.usu.edu/cee/faculty/dtarb/HydroObsDataModelReview.pdf

Observations Data Model
[Figure labels: Streamflow; Precipitation & Climate; Water Quality; Groundwater levels; Soil moisture data; Flux tower data]
• A relational database at the single observation level
• Common persistence model for observations data
• Metadata for unambiguous interpretation
• Traceable heritage from raw measurements to usable information
• Promote syntactic and semantic consistency
• Cross dimension retrieval and analysis
Horsburgh et al., 2008, WRR 44: W05406

CUAHSI Observations Data Model
http://his.cuahsi.org/odmdatabases.html
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment and I. Zaslavsky, (2008), A Relational Model for Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.
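
The abstract data model and the data series construct above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names DataValue and group_into_series are hypothetical, the field names simply mirror the where/when/what/value/how/who list, and nothing here reproduces the actual ODM relational schema. The grouping key reflects the idea that properties we control identify a series, while the measured value and its time vary within it.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataValue:
    """One observation (hypothetical structure, not the ODM schema)."""
    where: str      # location, object or platform identifier
    when: datetime  # date and time of the observation
    what: str       # attribute (variable) identifier
    value: float    # THE VALUE
    how: str        # method identifier
    who: str        # creator or data source identifier


def group_into_series(values):
    """Group values into series keyed by the controlled properties.

    Assumption: a series is identified by (where, what, how, who);
    values inside each series are ordered by time.
    """
    series = defaultdict(list)
    for v in values:
        series[(v.where, v.what, v.how, v.who)].append(v)
    for members in series.values():
        members.sort(key=lambda v: v.when)
    return dict(series)


# Made-up sample values for illustration only.
observations = [
    DataValue("SITE_A", datetime(2010, 5, 1, 13), "StreamFlow", 1.9, "method1", "czo"),
    DataValue("SITE_A", datetime(2010, 5, 1, 12), "StreamFlow", 1.7, "method1", "czo"),
    DataValue("SITE_A", datetime(2010, 5, 1, 12), "pH", 7.2, "method2", "czo"),
]
for key, members in group_into_series(observations).items():
    print(key, [m.value for m in members])
```

Keeping the series key separate from the value and time is what allows a value-level property that turns out to be constant to be promoted to the series level later without changing the records themselves.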

Stage and Streamflow Example
[Figure: example slide]

Water Chemistry from a profile in a lake
[Figure: example slide]

Water Chemistry from Laboratory Sample
[Figure: example slide]

CUAHSI Observations Data Model
[Figure: schema diagram with numbered callouts 1 to 7, annotated "Work from Out to In", "At last …" and "And don't forget …"]
http://www.cuahsi.org/his/odm.html

HydroServer - A Platform for Managing and Publishing Experimental Watershed Data
[Figure labels: HydroServer Website; Map Server; Time Series Analyst; HydroServer Database Configuration Tool; HydroServer Capabilities Web Service; HydroServer Database; ODM Databases and WaterOneFlow Web Services; WaterOneFlow Services; Spatial Services; ArcGIS Server Spatial Data Services]
http://hydroserver.codeplex.com/

Dynamic shared vocabulary moderation system
[Figure labels: ODM Data Manager; ODM Website; ODM Tools; XML; Local ODM Database; Local Server; ODM Shared Vocabulary Moderator; ODM Shared Vocabulary Web Services; Master ODM Shared Vocabulary]
http://his.cuahsi.org/mastercvreg.html
From Jeff Horsburgh

CUAHSI HIS – looking ahead
• A "data sharing/social networking" site for hydrologic data (and possibly models)
• Simple and easy to use
• Find, create, share, connect, integrate, work together online. Collaborate
• Hydro value added

CZO web based file format
• Time series display files
  – The data: time series in columns
• Methods files
  – A single file listing the methods used by the CZO
• Measurement location files (the term agreed for what used to be called a site; other names considered were station, node, monitoring point, platform)
  – A single file listing the measurement locations at which measurements are made by the CZO
  – Need a concept of spatial grouping for locations
  – Identify the groups that locations belong to; this implies a need for a location groups file.
(Measurement groups)

Note: The slides from this one onward contain edits made during the presentation, e.g. the change from "site" to "measurement location". As a result they may not be entirely consistent, but they are as we left things at the end of the meeting.

Time series display file
• Header
  – Doc group
  – Default parameter group
  – Column header group
• Data
  – Columns of data

Doc group
Attributes (name: description)
Title: A title for the set of data series in the file
Abstract: Description of the data
Investigator contact information: Name and contact information for the investigator responsible for the data
Keywords: Keywords useful for discovery of the data series
Variable names: Names for variables for the data series
Citation: Text string that gives the citation to be used when the data are referenced
Publications: Publications related to this data
Comments: Additional comments related to interpretation and use of this data

Default parameters
Default parameters pertain to all data in the file except when overridden by a specific column header (to encourage specification only once).
Examples
DEFAULT_PARAMETER. site ="GREEN LAKE 4"
DEFAULT_PARAMETER. offset_value ="2", offsetUnits = "meters", offset_description= "this is vertical offset from the ground level down"
DEFAULT_PARAMETER. quality_control_level ="0"
DEFAULT_PARAMETER. missing_value_indicator ="9999"

Column headers
Examples
COL1. label=ValueAttribute, value=DateTime, UTCOffset=-7, Timezone=MST, format="YYYYMMDD hh:mm"
COL2. label=VariableName, value=StreamFlow, units=m3/s, TimeSupport= 3, TimeSupportUnits=hr, NoDataValue=-9999, SampleMedium=water, method=method1, Offsetvalue = 3, OffsetValueUnits=m , offsetDescription = "Depth below surface"
COL3. label=VariableName, value=pH, units=pH units, missing value indicator=-9999
COL4. label=VariableName, value=conductance, units=uS/cm @ 25 degrees C

Series level attributes
• Required metadata for each data value in a CZO time series display file
SiteCode, Units, Method, OffsetValue, OffsetDescription, OffSetUnits, SampleType, VariableName, SampleMedium, ValueType, TimeSupport, TimeSupportUnits, DataType, DataLevel, NoDataValue, UTCOffset, TimeZone, CensorCode

Series level attribute definitions 1
Attributes (name: description)
LocationCode: Code used to identify the Measurement Location (refers to the Measurement locations file)
Units: The units associated with a data value
Method: Identifier pointing to a record in the methods file
OffsetValue: The value of a measurement offset if constant. (Optional)
OffsetDescription: Full text description of the offset value. (Optional, but required if OffsetValue is given)
VariableName: Name of the variable from the variables preferred value table.
SampleMedium: The medium of the sample or where the measurement is made. This should be from the SampleMediumPV preferred vocabulary table.
ValueType: Text value indicating what type of data value is being recorded. This should be from the ValueTypeCV controlled vocabulary table. (e.g. Field measurement, modeled, derived)
TimeSupport: Numerical value that indicates the temporal footprint of the data values. 0 is used to indicate data values that are instantaneous. Other values indicate the time over which the data values are implicitly or explicitly averaged or aggregated.

Series level attribute definitions 2
Attributes (name: description)
TimeSupportUnits: Units of the time support value, from the Units PV table.
DataType: Text value that identifies the data as one of several types (e.g. min, max, average). PV
DataLevel: Level used to identify the level of quality control to which data values have been subjected. Ameriflux is the starting point. Quality control and processing. DOI.
Version: A version is associated with a publication or specific release for a specific analysis purpose.
NoDataValue: The value used to encode no data
UTCOffset: Offset in hours from UTC time of the corresponding LocalDateTime value.
TimeZone: Time zone where the observation site is located (e.g. Mountain time)
OffSetUnits: Units in which the offset value is measured (Units PV)
CensorCode: Text indication of whether the data value is censored, from the CensorCodeCV controlled vocabulary. See USGS document that Anthony knows about

Value level attributes
Attributes (name: description)
DateTime: The date and time at which the value was observed
OffsetValue: The value of a measurement offset. (Optional). [Note that OffsetValue may be either a series level or a value level attribute for any data series, depending upon whether it is a controlled or measured property.]
SampleNumber (then put sample attributes in a separate file associated with sample numbers; a cross reference to SESAR): Type of sample, e.g. grab, from groundwater, from leaf. From sample type preferred value table. Collection method. Need a more general concept of sample attributes. Also need sample number.
Spatial Support Horizontal: Optional
Spatial Support Vertical: Optional
ValueAccuracy: Specify as absolute
Any value level attribute that is the same for an entire series may be promoted to a series level attribute and go in the column header.

Measurement Locations file
Attribute labels (name: description)
SiteCode: Code used by the organization that collects the data to identify the site
SiteName: Full name of the sampling site.
Latitude: Latitude in decimal degrees.
Longitude: Longitude in decimal degrees. East positive, West negative.
LatLongDatum: The Spatial Reference System of the latitude and longitude coordinates in the SpatialReferences table.
Elevation: Elevation of site (in m – or do we want a separate item to give units).
VerticalDatum: Vertical datum of the elevation. Controlled Vocabulary from VerticalDatumCV.
LocalX: Local Projection X coordinate. (Optional)
LocalY: Local Projection Y coordinate. (Optional)
LocalZ: Local elevation coordinate
LocalProjection: Identifier that references the Spatial Reference System of the local coordinates. (Optional)
PosAccuracy: X, Y and Z. Value giving the accuracy with which the positional information is specified. (Optional)
Comments: Comments related to the site. (Optional)
Sampling feature refers to feature of interest.

Methods file
Attributes (name: description)
Method: Description of each method.
Link: Hyperlink to external reference on the method (Optional)
Is further subdivision needed to elicit specific method elements?

Shared vocabularies
• Variable names (grouped into categories, with a keyword list associated with each name; a field for keywords and categories needs to be added to the present CUAHSI HIS system) (e.g. Precipitation, Streamflow, Nitrogen, Soil moisture)
• Units (extended from CUAHSI HIS) (e.g. m, g/L)
• Value type (from CUAHSI HIS) (e.g. Field observation, derived value, model output)
• Sample type (from CUAHSI HIS) (e.g. stream water, ground water, rock, soil)
• Data type (from CUAHSI HIS) (e.g. average over interval, cumulative, continuous, sporadic)
• Data level (based on Ameriflux list) (e.g. level 0 = raw data, level 4 = fully infilled and quality controlled)
• Spatial references (extensible, based on EPSG) (e.g. NAD 1983, WGS84, UTM zone 11)
• Censor code (from CUAHSI HIS) (e.g. less than, not-censored, non detect)
• Qualifier code (in CUAHSI HIS qualifiers are not a PV; a CZO specific set of qualifiers will need to be developed)
• Vertical datum (from CUAHSI HIS) (e.g. Mean Sea Level, NGVD29)

Ilya’s Unresolved issues
• Policies and best practices for generating display files and setting up data folders, and how we detect what is new
• Update frequency
• Semantic tagging (how automated)
• How shall we handle situations when data are removed/overwritten?
• Need more examples and test cases
• What information in log files is needed
• How to present data use agreements in services
• How to deal with different types of data

Other issues
• Other data types
  – Maps, GIS data (OGC web services?)
  – Geophysical data, images, geochemistry data, geological data, soil profile data
• Simple capability to store and share arbitrary digital objects with metadata using e.g. Catalog Services for the web
• LIDAR data (just use SDSC Open Topography or NCALM)
• Archiving
• Questions, additional needs (wishes)
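
Returning to the time series display file header: the rule that default parameters apply to every column unless a column header overrides them could be sketched as below. This is a minimal sketch, not an agreed parser. The function names are hypothetical, the comma-separated key=value syntax is inferred only from the DEFAULT_PARAMETER and COL examples, and the key normalization (lowercasing and dropping spaces and underscores so that missing_value_indicator and "missing value indicator" match) is an assumption made here because the example spellings vary.

```python
import re

# key = value pairs separated by commas; values may be double-quoted.
PAIR = re.compile(r'([^=,]+?)\s*=\s*("[^"]*"|[^,]*)')


def normalize(key):
    """Assumed convention: ignore case, spaces and underscores in keys."""
    return re.sub(r"[\s_]+", "", key).lower()


def parse_pairs(text):
    """Parse 'a=1, b="two words"' into a dict with normalized keys."""
    return {normalize(k): v.strip().strip('"') for k, v in PAIR.findall(text)}


def parse_header(lines):
    """Return {column number: attributes}, with defaults filled in.

    Attributes given in a COLn line take precedence over defaults.
    """
    defaults, columns = {}, {}
    for line in lines:
        prefix, _, rest = line.partition(".")
        prefix = prefix.strip()
        if prefix == "DEFAULT_PARAMETER":
            defaults.update(parse_pairs(rest))
        elif prefix.startswith("COL"):
            columns[int(prefix[3:])] = parse_pairs(rest)
    return {n: {**defaults, **attrs} for n, attrs in columns.items()}


header = [
    'DEFAULT_PARAMETER. site ="GREEN LAKE 4"',
    'DEFAULT_PARAMETER. missing_value_indicator ="9999"',
    "COL2. label=VariableName, value=StreamFlow, units=m3/s, TimeSupport= 3",
    "COL3. label=VariableName, value=pH, units=pH units, missing value indicator=-9999",
]
for number, attrs in sorted(parse_header(header).items()):
    print(number, attrs)
```

With this merge order, column 2 inherits the site and missing value indicator from the defaults, while column 3 keeps its own missing value indicator. The attribute names themselves (for example offsetUnits versus OffsetValueUnits in the examples) would still need to be standardized before a real harvester could rely on them.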