Data Model and Metadata

CZO Integrated Data
Management
Data Model and Metadata
David Tarboton
Based on CUAHSI HIS
CUAHSI HIS: Sharing hydrologic data
An Internet-based system to support the sharing of hydrologic data, comprising hydrologic databases and servers connected through web services, and software for data publication, discovery and access.
[Diagram: HydroServer (Data Publication; ODM, Geo Data), HIS Central (Data Discovery and Integration; Metadata) and HydroDesktop (Data Synthesis and Research; Data, Analysis), linked by WaterML, GML and OGC Services]
Support: EAR 0622374
Data System Overview
[Diagram: CZO Servers (Boulder, Shale, Sierra, Luquillo, Jemez, Christina) provide ASCII text with a standardized web based display; a Harvester feeds the CZO Central Data Store; a WaterOneFlow Web Service (GetSites, GetSiteInfo, GetVariableInfo, GetValues) delivers WaterML to CZO Desktop]
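The four operation names in the overview (GetSites, GetSiteInfo, GetVariableInfo, GetValues) suggest a discovery loop: list the sites, list what is measured at each site, look up the variable, then fetch the values. The sketch below illustrates that loop only. The Python method names, their arguments and return types, the in-memory stand-in, and the sample value are all assumptions made for illustration; they are not the WaterOneFlow signatures or real CZO data.

```python
from typing import Protocol


class SeriesCatalog(Protocol):
    """Hypothetical client interface; only the four operation names come from the slide."""

    def get_sites(self) -> list[str]: ...
    def get_site_info(self, site_code: str) -> list[str]: ...
    def get_variable_info(self, variable_code: str) -> dict[str, str]: ...
    def get_values(self, site_code: str, variable_code: str) -> list[tuple[str, float]]: ...


class InMemoryCatalog:
    """Stand-in that answers from dictionaries instead of calling a web service."""

    def __init__(self, store, variables):
        self._store = store          # {(site, variable): [(datetime text, value), ...]}
        self._variables = variables  # {variable: {"units": ...}}

    def get_sites(self):
        return sorted({site for site, _ in self._store})

    def get_site_info(self, site_code):
        # Here "site info" is reduced to the variables measured at the site.
        return sorted(var for site, var in self._store if site == site_code)

    def get_variable_info(self, variable_code):
        return self._variables[variable_code]

    def get_values(self, site_code, variable_code):
        return self._store[(site_code, variable_code)]


def walk_catalog(client: SeriesCatalog):
    """Discover sites, then variables at each site, then fetch the values."""
    rows = []
    for site in client.get_sites():
        for var in client.get_site_info(site):
            units = client.get_variable_info(var)["units"]
            for when, value in client.get_values(site, var):
                rows.append((site, var, units, when, value))
    return rows


# Made-up sample content, for illustration only.
catalog = InMemoryCatalog(
    store={("GREEN LAKE 4", "StreamFlow"): [("2010-06-01 12:00", 0.42)]},
    variables={"StreamFlow": {"units": "m3/s"}},
)
rows = walk_catalog(catalog)
```

Keeping the loop behind an interface means the same code could be pointed at a real service client later, if one with matching methods were written.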
Requirements
• Sufficient metadata for published CZO data to
be unambiguously interpreted and used
• Each CZO operates its own local data management
system
• Format used to present data and metadata
should be identical across CZOs and should
support heterogeneous local systems
• Local systems are autonomous with local
control on the release and publication of data
Access
• Users required to agree to CZO data use
policies
• Same data use agreement for all CZOs
• Data accessible freely to registered users who
have agreed to policy
Information Hierarchy
• National CZO
• Experimental
Watershed
• Sites
• Variables
• Series
• Data values
Abstract data model
• (where) location, object or platform identifier
• (when) date and time
• (what) attribute (or identifier of attribute)
• THE VALUE
• (how) method (or identifier of method)
• (who) creator (or identifier of creator or data source)
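One way to picture the abstract data model is as a record with one field per question. This is a minimal sketch, not the ODM schema: the class name, field types and sample content are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataValue:
    """Hypothetical record type: one field per element of the abstract model."""

    where: str      # location, object or platform identifier
    when: datetime  # date and time
    what: str       # attribute (or identifier of attribute)
    value: float    # THE VALUE
    how: str        # method (or identifier of method)
    who: str        # creator (or identifier of creator or data source)


# Made-up sample content, for illustration only.
obs = DataValue(
    where="GREEN LAKE 4",
    when=datetime(2010, 6, 1, 12, 0),
    what="StreamFlow",
    value=0.42,
    how="method1",
    who="example-investigator",
)
```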
Data series
• used as an organizing construct
• logical grouping of data values (usually from a
column in a table)
• commonly, but not limited to time series (e.g.
type series with depth)
• Properties we control become identifying
series-level attributes
• Properties we measure become variables or
variable level attributes
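The idea that controlled properties identify a series can be sketched as a grouping step: values that share the same controlled properties fall into the same series. The choice of location, variable and method as the controlled properties, the dictionary layout and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def group_into_series(records, controlled=("location", "variable", "method")):
    """Group data values into series keyed by their controlled properties."""
    series = defaultdict(list)
    for rec in records:
        key = tuple(rec[name] for name in controlled)
        series[key].append((rec["when"], rec["value"]))
    return dict(series)


# Made-up sample content, for illustration only.
records = [
    {"location": "A", "variable": "pH", "method": "m1", "when": "2010-06-01", "value": 7.1},
    {"location": "A", "variable": "pH", "method": "m1", "when": "2010-06-02", "value": 7.3},
    {"location": "B", "variable": "pH", "method": "m1", "when": "2010-06-01", "value": 6.9},
]
series = group_into_series(records)
```

Replacing "when" with a depth field would give a series with depth rather than a time series; the grouping itself does not change.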
Why an Observations Data Model
• Syntactic consistency (File types and formats)
• Semantic consistency
– Language for observation attributes (structural)
– Language to encode observation attribute values
(contextual)
• Publishing and sharing research data
• Metadata to facilitate unambiguous
interpretation
• Enhance analysis capability
What are the basic attributes to be
associated with each single data value
and how can these best be organized?
Community Design Requirements
(from comments of 22 reviewers)
• Incorporate sufficient metadata to identify provenance and
give exact definition of data for unambiguous interpretation
• Spatial location of measurements
• Scale of measurements (support, spacing, extent)
• Depth/Offset Information
• Censored data
• Classification of data type to guide appropriate interpretation
– Continuous
– Indication of gaps
• Indicate data quality
http://www.neng.usu.edu/cee/faculty/dtarb/HydroObsDataModelReview.pdf
Observations Data Model
[Diagram: data types brought together in the ODM: streamflow, precipitation and climate, water quality, groundwater levels, soil moisture data, flux tower data]
• A relational database at the single observation level
• Common persistence model for observations data
• Metadata for unambiguous interpretation
• Traceable heritage from raw measurements to usable information
• Promote syntactic and semantic consistency
• Cross dimension retrieval and analysis
Horsburgh et al., 2008, WRR 44: W05406
CUAHSI Observations Data Model http://his.cuahsi.org/odmdatabases.html
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment and I. Zaslavsky, (2008), A Relational Model for
Environmental and Water Resources Data, Water Resour. Res., 44: W05406, doi:10.1029/2007WR006392.
Stage and Streamflow Example
Water Chemistry from a profile in a lake
Water Chemistry from Laboratory Sample
[Figure annotations: steps numbered 1 to 7, with the notes "Work from Out to In", "At last …" and "And don’t forget …"]
CUAHSI Observations Data Model
http://www.cuahsi.org/his/odm.html
HydroServer - A Platform for Managing and Publishing Experimental Watershed Data
[Diagram components: HydroServer Website, Map Server, Time Series Analyst, HydroServer Database Configuration Tool, HydroServer Database, HydroServer Capabilities Web Service, ODM Databases and WaterOneFlow Web Services, ArcGIS Server Spatial Data Services]
http://hydroserver.codeplex.com/
Dynamic shared vocabulary moderation system
[Diagram components: Local Server, Local ODM Database, ODM Data Manager, ODM Tools, ODM Website, XML, ODM Shared Vocabulary Moderator, ODM Shared Vocabulary Web Services, Master ODM Shared Vocabulary]
http://his.cuahsi.org/mastercvreg.html
From Jeff Horsburgh
CUAHSI HIS – looking ahead
• A “data sharing/social networking” site for
hydrologic data (and possibly models)
• Simple and easy to use
• Find, create, share, connect, integrate, work
together online. Collaborate
• Hydro value added
CZO web based file format
• Time series display files
– The data – time series in columns
• Methods files
– A single file listing the methods used by the CZO
• Measurement location files (the term agreed for what used to
be called a site. Other names considered were station, node,
monitoring point, platform)
– A single file listing the measurement locations at which
measurements are made by the CZO
– Need a concept of spatial grouping for locations
– Identify the groups that locations belong to – implies a
need for a location groups file. (Measurement groups)
The slides from this one onward contain edits made during the
presentation, e.g. the change from “site” to “measurement
location”. As a result they may not be entirely consistent, but they
are as we left things at the end of the meeting.
Time series display file
• Header
– Doc group
– Default parameter group
– Column header group
• Data
– Columns of data
Doc group
Doc attributes and their descriptions:
• Title: A title for the set of data series in the file
• Abstract: Description of the data
• Investigator contact information: Name and contact information for the investigator responsible for the data
• Keywords: Keywords useful for discovery of the data series
• Variable names: Names of the variables for the data series
• Citation: Text string that gives the citation to be used when the data are referenced
• Publications: Publications related to these data
• Comments: Additional comments related to interpretation and use of these data
Default parameters pertain to all data in file
except when overridden by a specific column
header (to encourage specification only once)
Examples
DEFAULT_PARAMETER. site ="GREEN LAKE 4"
DEFAULT_PARAMETER. offset_value ="2", offsetUnits =
"meters", offset_description= "this is vertical offset
from the ground level down"
DEFAULT_PARAMETER. quality_control_level ="0"
DEFAULT_PARAMETER. missing_value_indicator ="9999"
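Reading the default parameter lines could look like the sketch below. It assumes that each logical entry sits on one line, that values are always wrapped in straight double quotes, and that a later entry for the same key replaces an earlier one. The function name is hypothetical, and the format was still under discussion at the meeting.

```python
import re

# key = "value" pairs; keys are word characters, values are double-quoted.
PAIR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_default_parameters(lines):
    """Collect key/value pairs from DEFAULT_PARAMETER lines into one dict."""
    defaults = {}
    for line in lines:
        if not line.startswith("DEFAULT_PARAMETER."):
            continue  # not a default parameter line
        for key, value in PAIR.findall(line):
            defaults[key] = value
    return defaults


header = [
    'DEFAULT_PARAMETER. site ="GREEN LAKE 4"',
    'DEFAULT_PARAMETER. offset_value ="2", offsetUnits = "meters"',
    'DEFAULT_PARAMETER. quality_control_level ="0"',
    'DEFAULT_PARAMETER. missing_value_indicator ="9999"',
]
defaults = parse_default_parameters(header)
```

Values are kept as text; converting "2" or "9999" to numbers is left to whoever knows the intended type of each parameter.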
Column headers
Examples
COL1. label=ValueAttribute, value=DateTime, UTCOffset=-7,
Timezone=MST, format="YYYYMMDD hh:mm"
COL2. label=VariableName, value=StreamFlow, units=m3/s,
TimeSupport= 3, TimeSupportUnits=hr, NoDataValue=-9999,
SampleMedium=water, method=method1, Offsetvalue = 3,
OffsetValueUnits=m , offsetDescription = "Depth below
surface"
COL3. label=VariableName, value=pH, units=pH units, missing
value indicator=-9999
COL4. label=VariableName, value=conductance, units=uS/cm @
25 degrees C
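Column headers override the file-level defaults, which can be sketched as a parse followed by a dictionary merge. The sketch assumes one header per line, comma-separated key=value fields with optional double quotes, and that a default and a column attribute use the same key when they mean the same thing (the examples above do not yet do this consistently). The function names are hypothetical.

```python
import re

COL = re.compile(r"^COL(\d+)\.\s*(.*)$")
# Split on commas that are not inside a pair of double quotes.
FIELD_SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def parse_column_header(line):
    """Return (column number, attribute dict) for one COLn. header line."""
    match = COL.match(line.strip())
    if match is None:
        raise ValueError(f"not a column header: {line!r}")
    attrs = {}
    for field in FIELD_SPLIT.split(match.group(2)):
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip().strip('"')
    return int(match.group(1)), attrs


def effective_attributes(defaults, column_attrs):
    """Column attributes win over file-level defaults with the same key."""
    return {**defaults, **column_attrs}


line = ('COL2. label=VariableName, value=StreamFlow, units=m3/s, '
        'NoDataValue=-9999, offsetDescription = "Depth below surface, in m"')
index, attrs = parse_column_header(line)
merged = effective_attributes({"site": "GREEN LAKE 4", "NoDataValue": "9999"}, attrs)
```

Here the column keeps the site from the defaults but replaces the default NoDataValue with its own.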
Series level attributes
• Required metadata for each data value in a
CZO time series display file
SiteCode
Units
Method
OffsetValue
OffsetDescription
SampleType
VariableName
SampleMedium
ValueType
TimeSupport
TimeSupportUnits
DataType
DataLevel
NoDataValue
UTCOffset
TimeZone
OffsetValue
OffsetDescription
OffSetUnits
CensorCode
Series level attribute definitions 1
Attributes and their descriptions:
• LocationCode: Code used to identify the Measurement Location (refers to Measurement locations file)
• Units: The units associated with a data value
• Method: Identifier to point to a record in the methods file
• OffsetValue: The value of a measurement offset if constant. (Optional)
• OffsetDescription: Full text description of the offset value. (Optional, but required if OffsetValue is given)
• VariableName: Name of the variable from the variables preferred value table.
• SampleMedium: The medium of the sample or where the measurement is made. This should be from the SampleMediumPV preferred vocabulary table.
• ValueType: Text value indicating what type of data value is being recorded. This should be from the ValueTypeCV controlled vocabulary table. (e.g. Field measurement, modeled, derived)
• TimeSupport: Numerical value that indicates the temporal footprint of the data values. 0 is used to indicate data values that are instantaneous. Other values indicate the time over which the data values are implicitly or explicitly averaged or aggregated.
Series level attribute definitions 2
Attributes and their descriptions:
• TimeSupportUnits: Units of time support value from Units PV table.
• DataType: Text value that identifies the data as one of several types (e.g. min, max, average). PV
• DataLevel: Level used to identify the level of quality control to which data values have been subjected. Ameriflux is the starting point. Quality control and processing.
• Version: DOI. A version is associated with a publication or specific release for a specific analysis purpose.
• NoDataValue: The value used to encode no data
• UTCOffset: Offset in hours from UTC time of the corresponding LocalDateTime value.
• TimeZone: Time zone where the observation site is located (e.g. Mountain time)
• OffSetUnits: Units with which the offset value is measured (Units PV)
• CensorCode: Text indication of whether the data value is censored from the CensorCodeCV controlled vocabulary. See USGS document that Anthony knows about
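Because UTCOffset is given in hours, a local DateTime value can be placed on the UTC timeline without a time zone database. The sketch below assumes that the "YYYYMMDD hh:mm" format from the column header example corresponds to the strptime pattern "%Y%m%d %H:%M"; the function name and the sample timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_text, utc_offset_hours, fmt="%Y%m%d %H:%M"):
    """Interpret local_text at a fixed UTC offset and convert it to UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_dt = datetime.strptime(local_text, fmt).replace(tzinfo=local_zone)
    return local_dt.astimezone(timezone.utc)


# Noon at UTCOffset=-7 (as in the COL1 example) is 19:00 UTC.
stamp = to_utc("20100601 12:00", -7)
```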
Value level attributes
Attributes and their descriptions:
• DateTime: The date and time at which the value was observed
• OffsetValue: The value of a measurement offset. (Optional). [Note that OffsetValue may be either a series level, or value level attribute for any data series, depending upon whether it is a controlled or measured property.]
• SampleNumber (then put sample attributes in a separate file associated with sample numbers; a cross reference to SESAR): Type of sample, e.g. grab, from groundwater, from leaf. From sample type preferred value table. Collection method. Need a more general concept of sample attributes. Also need sample number.
• Spatial Support Horizontal: Optional
• Spatial Support Vertical: Optional
• ValueAccuracy: Specify as absolute
Any value level attribute that is the same for an entire series may be
promoted to a series level attribute and go in the column header
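The promotion rule can be sketched as a check for attributes that are present on every value and never change. The function name, the row layout and the sample values are assumptions for illustration; attribute values are assumed to be hashable (text or numbers).

```python
def promote_constant_attributes(rows, candidates):
    """Split attributes into series level (constant) and value level (varying)."""
    series_level = {}
    for name in candidates:
        values = {row[name] for row in rows if name in row}
        # Promote only if every row carries the attribute with one shared value.
        if len(values) == 1 and all(name in row for row in rows):
            series_level[name] = values.pop()
    value_level = [
        {k: v for k, v in row.items() if k not in series_level} for row in rows
    ]
    return series_level, value_level


# Made-up sample content, for illustration only.
rows = [
    {"DateTime": "20100601 12:00", "OffsetValue": 3, "ValueAccuracy": 0.1, "value": 7.1},
    {"DateTime": "20100601 13:00", "OffsetValue": 3, "ValueAccuracy": 0.2, "value": 7.3},
]
series_level, value_level = promote_constant_attributes(
    rows, ["OffsetValue", "ValueAccuracy"]
)
```

In this example OffsetValue is constant and moves to the column header, while ValueAccuracy varies and stays with each value.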
Measurement Locations file
Measurement Location File attribute labels and their descriptions:
• SiteCode: Code used by organization that collects the data to identify the site
• SiteName: Full name of the sampling site.
• Latitude: Latitude in decimal degrees.
• Longitude: Longitude in decimal degrees. East positive, West negative.
• LatLongDatum: The Spatial Reference System of the latitude and longitude coordinates in the SpatialReferences table.
• Elevation: Elevation of site (in m – or do we want a separate item to give units).
• VerticalDatum: Vertical datum of the elevation. Controlled Vocabulary from VerticalDatumCV.
• LocalX: Local Projection X coordinate. (Optional)
• LocalY: Local Projection Y Coordinate. (Optional)
• Local Z: Local elevation coordinate
• LocalProjection: Identifier that references the Spatial Reference System of the local coordinates. (Optional) X, Y and Z
• PosAccuracy: Value giving the accuracy with which the positional information is specified. (Optional)
• Comments: Comments related to the site. (Optional)
Sampling feature refers to feature of interest.
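A harvester would probably want to sanity-check each measurement location record before accepting it. The sketch below checks only what the descriptions state directly: that a SiteCode is present, and that Latitude and Longitude are decimal degrees, with West negative implying a longitude range of -180 to 180. The function name, the record layout and the coordinates are assumptions for illustration.

```python
def check_location(record):
    """Return a list of problems found in one measurement location record."""
    problems = []
    if not record.get("SiteCode"):
        problems.append("SiteCode is missing")
    for name, limit in (("Latitude", 90.0), ("Longitude", 180.0)):
        try:
            value = float(record[name])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{name} is missing or not a number")
            continue
        if not -limit <= value <= limit:
            problems.append(f"{name} {value} is outside +/-{limit} decimal degrees")
    return problems


# Made-up coordinates, for illustration only.
good = {"SiteCode": "GL4", "SiteName": "GREEN LAKE 4",
        "Latitude": "40.05", "Longitude": "-105.62"}
bad = {"SiteCode": "", "Latitude": "40.05", "Longitude": "254.38"}
good_problems = check_location(good)
bad_problems = check_location(bad)
```

The second record shows a common slip: a longitude written on a 0 to 360 scale instead of East positive, West negative.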
Methods file
Attributes and their descriptions:
• Method: Description of each method.
• Link: Hyperlink to external reference on the method (Optional)
Is further subdivision needed to elicit specific method elements?
Shared vocabularies
• Variable names (grouped into categories with a keyword list associated with
each name. Need a field for keywords and categories to be added to present
CUAHSI HIS system) (e.g. Precipitation, Streamflow, Nitrogen, Soil moisture)
• Units (extended from CUAHSI HIS) (e.g. m, g/L)
• Value type (from CUAHSI HIS) (e.g. Field observation, derived value, model
output)
• Sample type (from CUAHSI HIS) (e.g. stream water, ground water, rock, soil)
• Data type (from CUAHSI HIS) (e.g. average over interval, cumulative,
continuous, sporadic)
• Data level (based on Ameriflux list) (e.g. level 0=raw data, level 4 = fully
infilled and quality controlled)
• Spatial references ( extensible based on EPSG) (e.g. NAD 1983, WGS84, UTM
zone 11)
• Censor code (from CUAHSI HIS) (e.g. less than, not-censored, non detect)
• Qualifier code (in CUAHSI HIS qualifiers are not a PV. A CZO specific set of
qualifiers will need to be developed)
• Vertical datum (from CUAHSI HIS) (e.g. Mean Sea Level, NGVD29)
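Checking header attributes against the shared vocabularies could be sketched as a lookup that reports any term not in the agreed list. The term lists below contain only the examples named on this slide and are not the full vocabularies; the attribute keys, the function name and the case-insensitive matching are assumptions.

```python
# Example terms taken from the slide; the real vocabularies are larger.
VOCABULARIES = {
    "ValueType": {"Field observation", "Derived value", "Model output"},
    "CensorCode": {"less than", "not-censored", "non detect"},
}


def unknown_terms(attrs, vocabularies=VOCABULARIES):
    """Return the attributes whose value is not in the matching vocabulary."""
    unknown = {}
    for name, term in attrs.items():
        if name not in vocabularies:
            continue  # attribute is not governed by a vocabulary here
        allowed = {t.casefold() for t in vocabularies[name]}
        if term.casefold() not in allowed:
            unknown[name] = term
    return unknown


flagged = unknown_terms(
    {"ValueType": "field observation", "CensorCode": "lt", "units": "m"}
)
```

Terms that are flagged would be candidates either for correction or for submission as new vocabulary entries.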
Ilya’s Unresolved issues
• Policies and best practices for generating display
files and setting up data folders, and how we
detect what is new
• Update frequency
• Semantic tagging (how automated)
• How shall we handle situations when data are
removed/overwritten?
• Need more examples and test cases
• What information in log files is needed
• How to present data use agreements in services
• How to deal with different types of data
Other issues
• Other data types
– Maps, GIS data (OGC web services?)
– Geophysical data, images, geochemistry data,
geological data, soil profile data
• Simple capability to store and share arbitrary
digital objects with metadata using e.g.
Catalog Services for the web
• LIDAR data (just use SDSC Open Topography or
NCALM)
• Archiving
• Questions, additional needs (wishes)