CUAHSI Community Observations Data Model (ODM) - CUAHSI-HIS

advertisement
CUAHSI Community Observations Data Model (ODM)
Version 1.1
Design Specifications
May 2008
David G. Tarboton1, Jeffery S. Horsburgh1, David R. Maidment2
Abstract
The CUAHSI Hydrologic Information System project is developing information technology
infrastructure to support hydrologic science. One aspect of this is a data model for the storage
and retrieval of hydrologic observations in a relational database. The purpose for such a
database is to store hydrologic observations data in a system designed to optimize data retrieval
for integrated analysis of information collected by multiple investigators. It is intended to
provide a standard format to aid in the effective sharing of information between investigators and
to allow analysis of information from disparate sources both within a single study area or
hydrologic observatory and across hydrologic observatories and regions. The observations data
model is designed to store hydrologic observations and sufficient ancillary information
(metadata) about the data values to provide traceable heritage from raw measurements to usable
information allowing them to be unambiguously interpreted and used. A relational database
format is used to provide querying capability to allow data retrieval supporting diverse analyses.
A generic template for the observations database is presented. This is referred to as the
Observations Data Model (ODM).
Introduction
The Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) is
an organization representing more than 100 universities and is sponsored by the National Science
Foundation to provide infrastructure and services to advance the development of hydrologic
science and education in the United States. The CUAHSI Hydrologic Information System (HIS)
is being developed as a geographically distributed network of hydrologic data sources and
functions that are integrated using web services so that they function as a connected whole. One
aspect of the CUAHSI HIS is the development of a standard database schema for use in the
storage of point observations in a relational database. This is referred to as the point
Observations Data Model (ODM) and is intended to allow for comprehensive analysis of
information collected by multiple investigators for varying purposes. It is intended to expand the
ability for data analysis by providing a standard format to share data among investigators and to
facilitate analysis of information from disparate sources both within a single study area or
hydrologic observatory and across hydrologic observatories and regions. The ODM is designed
to store hydrologic observations with sufficient ancillary information (metadata) about the data
values to provide traceable heritage from raw measurements to usable information allowing them
to be unambiguously interpreted and used. Although designed specifically with hydrologic
observation data in mind, this data model has a simple and general structure that will also
1
2
Utah Water Research Laboratory, Utah State University
Center for Research in Water Resources, University of Texas at Austin
1
accommodate a wide range of other data, such as from other environmental observatories or
observing networks.
ODM uses a relational database format to allow for ease in querying and data retrieval in support
of a diverse range of analyses. Reliance on databases and tables within databases also provides
the capability to have the model scalable from the observations of a single investigator in a single
project through the multiple investigator communities associated with a hydrologic observatory
and ultimately to the entire set of observations available to the CUAHSI community. ODM is
focused on observations made at a point. A relational database model with individual
observations recorded as individual records (an atomic model) has been chosen to provide
maximum flexibility in data analysis through the ability to query and select individual
observation records. This approach carries the burden of record level metadata, so it is not
appropriate for all variables that might be observed. For example, individual pixel values in
large remotely sensed images or grids are inappropriate for this model.
This data model is presented as a generic template for a point observations database, without
reference to the specific implementation in a database management system. This is done so that
the general design is not limited to any specific proprietary software, although we expect that
implementations will take advantage of capabilities of specific software. It should be possible to
implement ODM in a variety of relational database management systems, or even in a set of text
tables or variable arrays in a computer program. However, to take full advantage of the
relationships between data elements, the querying capability of a relational database system is
required. By presenting the design at a general conceptual level, we also avoid implementation
specific detail on the format of how information is represented. See the discussion of Dates and
Times under ODM features below for an example of the distinction between general concepts
and implementation specific details.
Version Information
ODM has evolved from an initial design presented at a CUAHSI workshop held in Austin during
March, 2005 (Maidment, 2005) that was then widely reviewed with comments being received
from 22 individuals (Tarboton, 2005). These reviews served as the basis for a redesign that was
presented at a CUAHSI workshop in Duke during July, 2005 and presented as part of the
CUAHSI HIS status report (Horsburgh et al., 2005). Following this presentation of the design,
the data model was reviewed and commented on by a number of others, including the
CLEANER (Collaborative Large-scale Engineering Analysis Network for Environmental
Research) cyberinfrastructure committee. Further versions of the Observations Data Model were
circulated in April, June and October 2006. These documented changes made in the evolution of
this design. The fundamental design, however, has not changed since the status report
presentation of the model (Horsburgh et al., 2005) but many table and field names have been
changed. Tables have also been added to give spatial reference information, metadata
information, and to define controlled vocabularies. Version 1.0 of ODM, which was the first
release version of ODM, has been implemented and tested within the WATERS network of test
bed sites and was documented in Water Resources Research (Horsburgh et al., 2008). This
document describes the second release version of the data model design, which has been named
ODM Release Version 1.1, and has been so named to correspond to the Version 1.1 release of
the CUAHSI HIS. This document supersedes the previous documents.
2
In general, the following changes have been made for Version 1.1:
•
•
•
•
•
•
•
•
•
•
•
All integer IDs serving as the primary key for tables in ODM have been changed to auto
number/identity fields.
Text field lengths have been relaxed in some cases and have been standardized according
to the following scheme: codes = 50 characters, terms = 255 characters, links = 500
characters, definitions/explanations = unlimited.
Check constraints have been defined for the Latitude and Longitude fields in the Sites
table.
Check constraints have been added to many of the fields in ODM to constrain the
characters that are valid for those fields (see Appendix A for details).
Relationships have been added between controlled vocabulary tables and the tables that
contain the fields that they define. This was done to more rigorously enforce the ODM
controlled vocabularies.
Unique constraints were placed on both SiteCode in the Sites table and VariableCode in
the Variables table.
The controlled vocabulary was relaxed on the QualityControlLevels table to allow more
detailed versioning of data series. A QualityControlLevelCode was also added to this
table to facilitate this.
A Citation field was added to the Sources table to provide a place for a formal citation for
data in the database.
A Speciation field was added to the Variables table. This provides a place to store
information about the speciation of chemistry observations. A SpeciationCV controlled
vocabulary table was added to define this field.
An ODMVersion table was added to store the version number of the database.
The SeriesCatalog table has been updated based on the addition of the above fields.
3
Hydrologic Observations
Many organizations and individuals measure hydrologic variables such as streamflow, water
quality, groundwater levels, and precipitation. National databases such as USGS’ National
Water Information System (NWIS) and USEPA’s data Storage and Retrieval (STORET) system
contain a wealth of data, but, in general, these national data repositories have different data
formats, storage, and retrieval systems, and combining data from disparate sources can be
difficult. The problem is compounded when multiple investigators are involved (as would be the
case at proposed CUAHSI Hydrologic Observatories) because everyone has their own way of
storing and manipulating observational data. There is a need within the hydrologic community
for an observations database structure that presents observations from many different sources and
of many different types in a consistent format.
Hydrologic observations are identified by the following fundamental characteristics:
•
•
•
The location at which the observations were made (space)
The date and time at which the observations were made (time)
The type of variable that was observed, such as streamflow, water surface elevation,
water quality concentration, etc. (variable)
These three fundamental characteristics may be represented as a data cube (Figure 1), where a
particular observed data value (D) is located as a function of where it was observed (L), its time
of observation (T), and what kind of variable it is (V), thus forming D(L,T,V).
Figure 1. A measured data value (D) is indexed by its spatial location (L), its time of
measurement (T), and what kind of variable it is (V).
In addition to these fundamental characteristics, there are many other distinguishing attributes
that accompany observational data. Many of these secondary attributes provide more
information about the three fundamental characteristics mentioned above. For example, the
4
location of an observation can be expressed as a text string (i.e., “Bear River Near Logan, UT”),
or as latitude and longitude coordinates that accurately delineate the location of the observation.
Other attributes can provide important context in interpreting the observational data. These
include data qualifying comments and information about the organization that collected the data.
The fundamental design decisions associated with the ODM involve choices as to how much
supporting information to include in the database and whether to store (and potentially repeat)
this information with each observation or save this information in separate tables with key fields
used to logically associate observation records with the associated information in the ancillary
tables. Table 1 presents the general attributes associated with a point observation that we judged
should be included in the generic ODM design.
Table 1. ODM attributes associated with an observation
Attribute
Definition
Data Value
The observation value itself
Accuracy
Quantification of the measurement accuracy associated with the observation value
Date and Time
The date and time of the observation (including time zone offset relative to UTC and daylight savings
time factor)
Variable Name
The name of the physical, chemical, or biological quantity that the data value represents (e.g.
streamflow, precipitation, temperature)
Speciation
For concentration measurements, the species in which the concentration is expressed (e.g., as N, or as
NO3, or as NH4)
Location
The location at which the observation was made (e.g. latitude and longitude)
Units
The units (e.g. m or m3/s) and unit type (e.g. length or volume/time) associated with the variable
Interval
The interval over which each observation was collected or implicitly averaged by the measurement
method and whether the observations are regularly recorded on that interval
Offset
Distance from a reference point to the location at which the observation was made (e.g. 5 meters
below water surface)
Offset Type/
Reference Point
The reference point from which the offset to the measurement location was measured (e.g. water
surface, stream bank, snow surface)
Data Type
An indication of the kind of quantity being measured (e.g. a continuous, minimum, maximum, or
cumulative measurement)
Organization
The organization or entity providing the measurement
Censoring
An indication of whether the observation is censored or not
Data Qualifying
Comments
Comments accompanying the data that can affect the way the data is used or interpreted (e.g. holding
time exceeded, sample contaminated, provisional data subject to change, etc.)
Analysis Procedure/
Method
An indication of what method was used to collect the observation (e.g. dissolved oxygen by field
probe or dissolved oxygen by Winkler Titration) including quality control and assurance that it has
been subject to
Source
Information on the original source of the observation (e.g. from a specific organization, agency, or
investigator 3rd party database)
Sample Medium
The medium in which the sample was collected (e.g. water, air, sediment, etc.)
Value Category
An indication of whether the data value represents an actual measurement, a calculated value, or is
the result of a model simulation
5
Observations Data Model
The schema of the Observations Data Model is given in Figure 2. Appendix A gives details of
each table and each field in this generic data model schema. Appendix A serves as the data
dictionary for the data model and documents specific database constraints, data types, examples,
and best practices. The primary table that stores point observation values is the DataValues table
at the center of the schema in Figure 2. Logical relationships between fields in the data model
are shown and serve to establish the connectivity between the observation values and associated
ancillary information. Details of the relationships are given in Table 2. Figure 2 shows each of
the controlled vocabulary tables and their relationships to the table containing the field that they
define. Controlled vocabulary tables are highlighted with red headers. In Figure 2, each of the
mandatory fields is shown in bold text, whereas optional fields are shown in regular text.
6
Figure 2. Observations Data Model schema.
Table 2. Observations Data Model Logical Relationships
Relationships that define ancillary information about data values
Table
DataValues
DataValues
DataValues
DataValues
DataValues
DataValues
DataValues
DataValues
Field
SiteID
VariableID
OffsetTypeID
QualifierID
MethodID
SourceID
SampleID
QualityControlLevelID
Type
* <-> 1
* <-> 1
* <-> 1
* <-> 1
* <-> 1
* <-> 1
* <-> 1
* <-> 1
Field
SiteID
VariableID
OffsetTypeID
QualifierID
MethodID
SourceID
SampleID
QualityControlLevelID
Table
Sites
Variables
OffsetTypes
Qualifiers
Methods
Sources
Samples
QualityControlLevels
Relationships that define derived from groups
Table
Field
DataValues
DataValues
DerivedFromID
ValueID
Type
* <-> *
1 <-> *
Field
Table
DerivedFromID
ValueID
DerivedFrom
DerivedFrom
Relationships that define groups
Table
Field
DataValues
GroupDescriptions
ValueID
GroupID
Type
1 <-> *
1 <-> *
Field
Table
ValueID
GroupID
Groups
Groups
Relationships used to define categories for categorical data
Table
Field
Variables
DataValues
VariableID
DataValue
Type
1 <-> *
* <-> 1
Field
Table
VariableID
DataValue
Categories
Categories
Relationships used to define the Units
Table
Field
Type
Field
Table
Units
Units
Units
UnitsID
UnitsID
UnitsID
1<->*
1<->*
1<->*
VariableUnitsID
TimeUnitsID
OffsetUnitsID
Variables
Variables
OffsetTypes
Relationship used to define the Sample Laboratory Methods
Table
Field
Type
Field
Table
LabMethods
LabMethodID
1<->*
LabMethodID
Samples
Relationships used to define the Spatial References
Table
Field
Type
Field
Table
SpatialReferences
SpatialReferences
SpatialReferenceID
SpatialReferenceID
1<->*
1<->*
LatLongDatumID
LocalProjectionID
Sites
Sites
Relationship used to define the ISOMetaData
Table
Field
Type
Field
Table
IsoMetaData
MetadataID
1<->*
Sources
MetadataID
8
Relationships used to define Controlled Vocabularies
Table
Field
Type
Field
Table
VerticalDatumCV
SampleTypeCV
VariableNameCV
ValueTypeCV
DataTypeCV
SampleMediumCV
SpeciationCV
GeneralCategoryCV
TopicCategoryCV
CensorCodeCV
Term
Term
Term
Term
Term
Term
Term
Term
Term
Term
1<->*
1<->*
1<->*
1<->*
1<->*
1<->*
1<->*
1<->*
1<->*
1<->*
VerticalDatum
SampleType
VariableName
ValueType
DataType
SampleMedium
Speciation
GeneralCategory
TopicCategory
CensorCode
Sites
Samples
Variables
Variables
Variables
Variables
Variables
Variables
ISOMetadata
DataValues
Relationship type is indicated as One to One (1<->1), One to Many (1<->*), Many to One (*<>1) and Many to Many (*<->*). The first set of relationships defines the links to tables that
contain ancillary information. They are used so that only compact (integer) identifiers are stored
with each data value and thus repeated many times while the more voluminous ancillary
information is stored to the side and not repeated. The second set of relationships defines
derived from groupings used to specify data values that have been used to derive other data
values. The third set of relationships defines logical groupings of data values. The fourth set of
relationships is used to specify the categories associated with categorical variables. The fifth set
of relationships is used to define the units. The sixth set of relationships associates laboratory
methods with samples. The seventh set of relationships associates sites with the Spatial
Reference System used to define the location. The eigth set of relationships associates project
and dataset level metadata with each data source. The last set of relationships defines the linkage
between the controlled vocabulary fields and the tables that stored the acceptable terms for those
fields. Details of how these relationships work are given in the discussion of features of the data
model design below.
Features of the Observations Data Model Design
Geography
ODM is intended to be independent of the geographical representation of the site locations. The
geographic location of sites is specified through the Latitude, Longitude, and Elevation
information in the Sites table, and optionally local coordinates, which may be in a standard
geographic projection for the study area or in a locally defined coordinate system specific to a
study area. Each site also has a unique identifier, SiteID, which can be logically linked to one or
more objects in a Geographic Information System (GIS) data model. For example, Figure 3
depicts a one-to-one relationship between sites within ODM and HydroPoints within the Arc
Hydro Framework Data Model (Maidment, 2002) used to represent objects in a digital
watershed. In simple implementations, SiteID may have the same integer value as the identifier
for the associated GIS object, HydroID in this case. In more complex implementations, and
especially when multiple databases are merged into a single ODM, it may not be possible to
preserve the simple one-to-one relationship between SiteID and HydroID with each of these
fields holding the same integer identifier values. In these cases, where SiteID and HydroID are
9
not the same, a coupling table would be used to associate the ODM SiteIDs used to identify sites
with HydroIDs in the Arc Hydro data model.
SiteID must be unique within an instance of ODM. This could, for example, be achieved by
assigning SiteIDs from a master table. The linkage between SiteIDs and GIS object IDs is
intended to be generic and suitable for use with any geographic data model that includes
information specifying the location of sites. For example, a linear referencing system on a river
network, such as the National Hydrography Dataset, might be used to specify the location of a
site on a river network. Addressing relative to specific hydrologic objects through the SiteID
field provides direct and specific location information necessary for proper interpretation of data
values. Information from direct addressing relative to hydrologic objects is often of greater
value to a user than the simple Latitude and Longitude information stored in the ODM Sites
table. For example, it is more useful to know that a stream gage is on such and such a stream
rather than simply its latitude and longitude.
Figure 3. Arc Hydro Framework Data Model and Observations Data Model related through
SiteID field in the Sites table.
Series Catalog
A "data series" is an organizing principle used in ODM. A data series consists of all the data
values associated with a unique site, variable, method, source, and quality control level
combination in the DataValues table. The SeriesCatalog table lists data series, identifying each
by a unique series identifier, SeriesID. This table is essentially a summary of many of the tables
10
in the ODM and is not required to maintain the integrity of the data. However, it serves to
provide a listing of all the distinct series of data values of a specific variable at a specific site.
By doing so, this table provides a means by which users can execute most common data
discovery queries (i.e., which variables have data at a site, etc.) without the overhead of querying
the entire DataValues table, which can become quite large.
The SeriesCatalog table is also intended to support CUAHSI Web Service method queries such
as GetSiteInfo, which returns information about a monitoring site within an instance of the ODM
including the variables that have been measured at that site. It should be noted that data series,
as they are defined here, do not distinguish between different series of the same variable at the
same site but measured with different offsets. If for example temperature was measured at two
different offsets by two different sensors at one site, both sets of data would fall into one data
series for the purposes of the SeriesCatalog table. In these cases, interpretation or analysis
software will need to specifically examine and parse the offsets by examining the offset
associated with each data value. The SeriesCatalog table does not do this because its principal
purpose is data discovery, which we did not want to be overly complicated. The SeriesCatalog
table should be programmatically generated and modified as data are added to the database.
Accuracy
Each data value in the DataValues table has an associated attribute called ValueAccuracy. This
is a numeric value that quantifies the total measurement accuracy defined as the nearness of a
measurement to the true or standard value. Since the true value is not known, the
ValueAccuracy is estimated based on knowledge of the instrument accuracy, measurement
method, and operational environment. The ValueAccuracy, which is also called the uncertainty
of the measurement, compounds the estimates of both bias and precision errors. Bias errors are
generally fixed or systematic and cannot be determined statistically, while precision errors are
random, being generated by the variability in the measurement system and operational
environment. Figure 4 illustrates the effects of these errors on a sample of measurements. Bias
errors are usually estimated through specially designed experiments (calibrations). The precision
errors are determined using statistical analysis by quantifying the measurement scatter, which is
proportional to the standard deviation of the sample of repeated measurements. The total error is
obtained by the root-sum-square of the estimates for bias and precision errors involved in the
measurement. Figure 5 gives another illustration of the ValueAccuracy concept based on the
analogy of a target, where the bulls eye at the center represents the true value.
ValueAccuracy is a data value level attribute because it can change with each measurement,
dependent on the instrument or measurement protocol. For example, if streamflow is measured
using a V-notch weir, it is actually the stage that is measured, with accuracy limited by the
precision and bias of the depth recording instrument. The conversion to discharge through the
stage-discharge relationship results in greater absolute error for larger discharges. Inclusion of
the ValueAccuracy attribute, which will be blank for many historic datasets because historically
accuracy has not been recorded, adds to the size of data in the ODM, but provides a way for
factoring the accuracy associated with measurements into data analysis and interpretation, a
practice that should be encouraged.
11
Figure 4. Illustration of measurement error effect (Source: AIAA, 1995).
Bias
Figure 5. Illustration of Accuracy versus Precision (adapted from Wikipedia
http://en.wikipedia.org/wiki/Accuracy).
In designing ODM, consideration was given to the suggestion by some reviewers to record bias
and precision separately, in addition to ValueAccuracy for each data value. This has not been
done at this release in the interest of parsimony and also because quantifying these separate
components of the error is difficult. We suggest that for most measurements there should be the
presumption that they are unbiased and that ValueAccuracy quantifies the precision and accuracy
in the judgment of the investigator responsible for collecting the data. For cases where there is
specific bias and precision information to complement the ValueAccuracy attribute, this could be
recorded in the ODM as a separate variable, e.g. discharge precision, or temperature bias. The
groups and derived from features (see below) could be used to associate these variables with
their related observations. For measurements that are known to be biased, we suggest that the
bias could be quantified by other reference measurements that should also be placed in the
12
database and that a new set of corrected measurements that have had the bias removed should be
added to the database at a higher quality control level. These new measurements should have a
lower ValueAccuracy value to reflect the improvement in accuracy by removal of the bias. The
method and derived from information for these corrected measurements should give the bias
removal method and refer to the data used to quantify and remove the bias.
Offset
Each record in the DataValues table has two optional fields OffsetValue and OffsetTypeID.
These are used to record the location of an observation relative to an appropriate datum, such as
“depth below the water surface” or “depth below or above the ground.” The OffsetTypeID
references an OffsetValue into an OffsetTypes table that gives units and definition associated
with the OffsetValue. This design only has the capability to represent one offset for each data
value. In cases (which we expect to be rare) when there are multiple offsets (e.g. distance in
from a stream bank and depth below the surface) one of the offsets will need to be distinguished
as a separate variable.
Spatial Reference and Positional Accuracy
Unambiguous specification of the location of an observation site requires that the horizontal and
vertical datum used for latitude, longitude, and elevation be specified. The SpatialReferences
table is provided for this purpose to record the name and EPSG code of each Spatial Reference
System used. EPSG codes are numeric codes associated with coordinate system definitions
published by the OGP Surveying and Positioning Committee (http://www.epsg.org/). A nonstandard Spatial Reference System, such as, for example, a local grid at an experimental
watershed, may be defined in the SpatialReferences table Notes field. The accuracy with which
the location of a monitoring site is known is quantified using the PosAccuracy_m field in the
Sites table. This is a numeric value intended to specify the uncertainty (as a standard deviation
or root mean square error) in the spatial location information (latitude and longitude or local
coordinates) in meters. Using a large number for PosAccuracy_m (e.g. 2000 m) accommodates
entry of data collected for a study area where the precise location where the observation was
recorded is not known.
Groups and Derived from Associations
The DerivedFrom and Groups tables fulfill the function of grouping data values for different
purposes. These are tables where the same identifier (DerivedFromID or GroupID) can appear
multiple times in the table associated with different ValueIDs, thereby defining an associated
group of records. In the DerivedFrom table this is the sole purpose of the table, and each group
so defined is associated with a record in the DataValues table (through the DerivedFromID field
in that table). This record would have been derived from the data values identified by the group.
The method of derivation would be given through the methods table associated with the data
value. This construct is useful, for example, to identify the 96 15-minute unit streamflow values
that go into the estimate of the mean daily streamflow. Note that there is no limit to how many
groups a data value may be associated with, and data values that are derived from other data
values may themselves belong to groups used to derive other data values (e.g. the daily minimum
flow over a month derived from daily values derived from 15 minute unit values). Note also that
a derived from group may have as few as one data value for the case where a data value is
derived from a single more primitive data value (e.g., discharge from stage). Through this
13
construct the ODM has the capability to store raw observation values and information derived
from raw observations, while preserving the connection of each data value to its more primitive
raw measurement.
The GroupID relationship that appears in Table 2 is designated as one-to-many because there
will be many records in the Groups table that have the same GroupID, but different ValueID, that
serve to define the group. In Figure 1, the Group relationship is labeled 1..*, at the DataValues
table and 0..* at the Groups table. This indicates that a group may comprise one or more data
values and that a data value may be included in 0 or more groups. Similarly, there will be many
records in the DerivedFrom table that have the same DerivedFromID, but different ValueID that
serve to define the group of data values from which a data value is derived. Logically a data
value should not be in a DerivedFrom group upon which it is derived from. If this can be
programmatically checked by the system, then this sort of circularity error could be prevented.
The method description in the Methods table associated with a data value that has a
DerivedFromID should describe the method used for deriving the particular data value from
other data values (e.g. calculating discharge from a number of velocity measurements across a
stream). The relationship between the DataValues table DerivedFromID field and DerivedFrom
table DerivedFromID field is many-to-many (*<->*) because it can occur that the same group of
data values is used to derive more than one derived data value. In Figure 1, the AreDerivedFrom
relationship between the data values and DerivedFrom table actually depicts both relationships
between these tables listed in Table 2. The AreDerivedFrom relationship is labeled 1..* at the
DataValues table and 0..* at the DerivedFrom table to indicate that a derived from group may
comprise 1 or more data values and that a data value may be a member of 0 or more derived
from groups.
Dates and Times
Unambiguous interpretation of date and time information requires specification of the time zone
or offset from universal time (UTC). A UTCOffset field is included in the DataValues table to
ensure that local times recorded in the database can be referenced to standard time and to enable
comparison of results across databases that may store data values collected in different time
zones (e.g. compare data values from one hydrologic observatory to those collected at another
hydrologic observatory located across the country). A design choice here was to have
UTCOffset as a record level qualifier because even though the time zone, and hence offset, is
likely the same for all measurements at a site, the offset may change due to daylight savings.
Some investigators may run data loggers on UTC time, while others may use local time adjusting
for daylight saving time. To avoid the necessity to keep track of the system used, or impose a
system that might be cumbersome and lead to errors, we decided that if the offset was always
recorded, the precise time would be unambiguous and would reduce the chance for interpretation
errors. A field DateTimeUTC is also included as a record level attribute associated with each
data value. This provides a consistent time for querying and sorting data values. There is a level
of redundancy between LocalDateTime, UTCOffset and DateTimeUTC. Only two are required
to calculate the third. For simplicity and clarity we retain all three. A specific database
implementation may choose to retain only two and calculate the third on the fly. ODM data
loaders should only require two of the quantities to be input and should then calculate the third.
14
The separation of the date and time specification into two variables, LocalDateTime and
UTCOffset, in the generic conceptual model may be handled differently within specific
implementations. In one specific implementation these may be grouped in one text field in
standard (e.g. ISO 8601) format such as YYYY-MM-DDhh:mm:ss.sss:UTCOffset (e.g. 2006-032516:19:56.232:-7), while in another format the date and time may be specified as the number of
fractional days from an origin (e.g. Excel represents the above date as the following number
38801.6805 and allows the user to specify the format for display) with UTCOffset as a separate
attribute. In general we expect specific implementations to take advantage of the representation
of date time objects provided by the implementation software, but to expose the LocalDateTime
and UTCOffset to users so that time may be unambiguously interpreted. In the SeriesCatalog
table, begin and end times for each data series are represented by the attributes BeginDateTime,
EndDateTime, BeginDateTimeUTC, and EndDateTimeUTC. The UTC offset may be derived
from the difference between the UTC and local times. Because local time may change (e.g. with
daylight savings) it is important during the derivation of the SeriesCatalog table that
identification of the first and last records be based on UTC time and that local times be read from
the corresponding records, rather than using a min or a max function on local times which can
result in an error.
Support Scale
In interpreting data values that comprise a time series it is important to know the scale
information associated with the data values. Blöschl and Sivapalan (1995) review the important
issues. Any set of data values is quantified by a scale triplet comprising support, spacing, and
extent as illustrated in Figure 6.
quantity
quantity
quantity
length or time
(c) Support
(b) Spacing
(a) Extent
length or time
length or time
Figure 6. The scale triplet of measurements (a) extent, (b) spacing, (c) support (from Blöschl,
1996).
Extent is the full range over which the measurements occur, spacing is the spacing between
measurements, and support is the averaging interval or footprint implicit in any measurement. In
ODM, extent and spacing are properties of multiple measurements and are defined by the
LocalDateTime or DateTimeUTC associated with data values. We have included a field called
TimeSupport in the Variables table to explicitly quantify support. Figure 7 shows some of the
implications associated with support, spacing, and extent in the interpretation of time series data
values.
15
(a) spacing too large – noise (aliasing)
(b) extent too small – trend
(c) support too large – smoothing out
Figure 7. The effect of sampling for measurement scales not commensurate with the process
scale: (a) spacing larger than the process scale causes aliasing in the data; (b) extent smaller than
the process scale causes a trend in the data; (c) support larger than the process scale causes
excessive smoothing in the data (adapted from Blöschl, 1996).
The concepts of scale described here apply in spatial as well as time dimensions. However,
TimeSupport is only used to quantify support in the time dimension. The spatial support
associated with a specific measurement method needs to be given or implied in the methods
description in the Methods table. The next section indicates how time support should be
specified for the different types of data.
Data Types
In the ODM, the following data types are defined. These are specified by the DataType field in
the Variables table.
16
1. Continuous data – the phenomenon, such as streamflow, Q(t) is specified at a particular
instant in time and measured with sufficient frequency (small spacing) to be interpreted
as a continuous record of the phenomenon. Time support may be specified as 0 if the
measurements are instantaneous, or given a value that represents the time averaging
inherent in the measurement method or device.
2. Sporadic data – the phenomenon is sampled at a particular instant in time but with a
frequency that is too coarse for interpreting the record as continuous. This would be the
case when the spacing is significantly larger than the support and the time scale of
fluctuation of the phenomenon, such as for example infrequent water quality samples. As
for continuous data, time support may be specified as 0 if the measurements are
instantaneous, or given a value that represents the time averaging inherent in the
measurement method or device.
3. Cumulative data – the data represents the cumulative value of a variable measured or
calculated up to a given instant of time, such as cumulative volume of flow or cumulative
t
precipitation: V( t ) = ∫ Q( τ)dτ , where τ represents time in the integration over the
0
interval [0,t]. To unambiguously interpret cumulative data one needs to know the time
origin. In the ODM we adopt the convention of using a cumulative record with a value of
zero to initialize or reset cumulative data. With this convention, cumulative data should
be interpreted as the accumulation over the time interval between the date and time of the
zero record and the current record at the same site position. Site position is defined by a
unique combination of SiteID, VariableID, OffsetValue and OffsetType. All four of
these quantities comprise the unambiguous description of the position of an observation
value and there may be multiple time series associated with multiple observation
positions (e.g. redundant rain gauges with different offsets) at a location. The time
support for a cumulative value should be specified as 0 if the measurement of the
cumulative quantity is instantaneous, or given a value that represents the time averaging
inherent in the measurement of the cumulative value at the end of the period of
accumulation.
4. Incremental data – the data value represents the incremental value of a variable over a
time interval Δt such as the incremental volume of flow, or incremental precipitation:
t + Δt
ΔV (t ) =
∫ Q(τ )dτ .
As for cumulative data, unambiguous interpretation requires
t
knowledge of the time increment. In the ODM we adopt the convention of using
TimeSupport to specify the interval Δt, or the time interval to the next data value at the
same position if TimeSupport is 0. This accommodates incremental type precipitation
data that is only reported when the data value is non-zero, such as NCDC data. Such
NCDC data is irregular, with the interpretation that precipitation is 0 if not reported
unless qualifying comments designate otherwise. See example E.4 below for an
illustration of how NCDC precipitation data is accommodated in the ODM.
5. Average data – the data value represents the average over a time interval, such as daily
ΔV (t )
. The averaging interval is
mean discharge or daily mean temperature: Q (t ) =
Δt
quantified by TimeSupport in the case of regular data (as quantified by the IsRegular
17
6.
7.
8.
9.
field) and by the time interval from the previous data value at the same position for
irregular data.
Maximum data – the data value is the maximum value occurring at some time during a
time interval, such as annual maximum discharge or a daily maximum air temperature.
Again unambiguous interpretation requires knowledge of the time interval. The ODM
adopts the convention that the time interval is the TimeSupport for regular data and the
time interval from the previous data value at the same position for irregular data.
Minimum data – the data value is the minimum value occurring at some time during a
time interval, such as 7-day low flow for a year, or the daily minimum temperature. The
time interval is defined similarly to Maximum data.
Constant over interval data – the data value is a quantity that can be interpreted as
constant over the time interval to the next measurement.
Categorical data – the data value is a categorical rather than continuous valued quantity.
Mapping from data values to categories is through the Categories table.
We anticipate that additional data types such as median, standard deviation, variance, and others
may need to be added as users work with ODM.
Beginning of Interval Reporting Time for Interval Data Values
Data types 4 to 8 above apply to data values that occur over an interval of time. The date and
time reported and entered in to the ODM database associated with each interval data value is the
beginning time of the observation interval. This convention was adopted to be consistent with
the way dates and times are represented in most common database management systems. It
should be noted that using the beginning of the interval is not consistent with the time a data
logger would log an observation value. Care should be exercised in adding data to the ODM to
ensure that the beginning of interval convention is followed.
Time Series Data
A considerable portion of hydrologic observations data is in the form of time series. This was
why the initial model was based on the Arc Hydro Time Series Data Model. The ODM design
has not specifically highlighted time series capabilities; nevertheless, the data model has
inherited the key components from the Arc Hydro Time Series Data Model to give it time series
capability. In particular one variable DataType is “Continuous,” which is designed to indicate
that the data values are collected with sufficient frequency as to be interpreted as a smooth time
series. The IsRegular field also facilitates time series analysis because certain time series
operations (e.g., Fourier Analysis) are predisposed to regularly sampled data. At first glance it
may appear that there is redundancy between the IsRegular field and the DataType
“Continuous,” but we chose to keep these separate because there are regularly sampled quantities
for which it is not reasonable to interpret the data values as “Continuous.” For example, monthly
grab samples of water quality are not continuous, but are better categorized as having DataType
“Sporadic.” Note that ODM does not explicitly store the time interval between measurements,
nor does it indicate where a continuous series has data gaps. Both of these are required for time
series analysis, but are inherently not properties of single measurements. The time interval is the
time difference between sequential regular measurements, something that can be easily
computed from date and time values by analysis tools. The inference of measurement gaps (and
18
what to do about them) from date and time values we also regard as analysis functionality left for
a Hydrologic Analysis System to handle.
Categorical Variables
In ODM, categorical or ordinal variables are stored in the same table as continuous valued ‘real’
variables through a numerical encoding of the categorical data value as a ‘real’ data value. The
Categories table then associates, for each variable, a data value with an associated category
description. This is a somewhat cumbersome construct because real valued quantities are being
used as database keys. We do not see this as a significant shortcoming though, because
typically, in our judgment, only a small fraction of hydrologic observations will be categorical.
The Categories table stores the categories associated with categorical data values. If a Variable
has a DataType that is “Categorical” then the VariableID must match one or more VariableIDs in
Categories that define the mapping between DataValues and Categories. The
CategoryDescription field in the Categories table defines the category.
Samples and Methods
At first glance there may appear to be redundancy between the information in the Samples table
and Methods table. However, the samples table is intended to only be used where data values
are derived from a physical sample that is later analyzed in a laboratory (e.g., a water chemistry
sample or biological sample). The SampleID that links into the Samples table provides tracking
of the specific physical sample used to derive each measurement and, by reference to
information in the LabMethods table, the laboratory methods and protocols followed. The
Methods table refers to the method of field data collection, which may specify “how” a physical
observation was made or collected (e.g., from an automated sampler or collected manually), but
is also used to specify the measurement method associated with an in-situ measurement
instrument such as a weir, turbidity sensor, dissolved oxygen sensor, humidity sensor, or
temperature sensor.
Data Qualifiers
Each record in the DataValues table has an attribute called QualifierID that references the
Qualifiers table. Each QualifierID in the Qualifiers table has attributes QualifierCode and
QualifierDescription that provide qualifying information that can note anything unusual or
problematic about individual observations such as, for example, "holding time for analysis
exceeded" or "incomplete or inexact daily total." Specification of a QualifierID in the
DataValues table is optional, with the inference that if a QualifierID is not specified then the
corresponding data value is not qualified.
Quality Control Level Encoding
Each data value in the DataValues table has an attribute called QualityControlLevelID that
references the QualityControlLevels table and is designed to record the level of quality control
processing that the data value has been subjected to at the level of data series. Quality control
level is one of the attributes (together with site, variable, method, and source) used to uniquely
identify data series. Each quality control level is uniquely identified by its
QualityControlLevelID; however, each level also has a text QualityControlLevelCode that, along
with a Definition and Explanation, provides a more descriptive encoding of the quality control
level. The default quality control level system used by ODM applies integer values between 0
19
and 4 (converted to text strings) as the QualityControlLevelCodes. Other custom systems for
QualityControlLevelCodes can be used (e.g., 0.1, 0.2 to represent raw data that is progressing
through a quality control work sequence, or text strings such as “Raw” or “Processed”). The
following 0 – 4 QualityControlLevelCode definitions are adapted from those used by other
similar systems, such as NASA, Earthscope and Ameriflux (e.g.
http://ilrs.gsfc.nasa.gov/reports/ilrs_reports/9809_attach7a.html,
http://public.ornl.gov/ameriflux/available.shtml accessed 3/6/2007) and are suggested so that
CUAHSI ODM is consistent with the practice of other data systems:
-
QualityControlLevelCode = “0” - Raw Data
Raw data is defined as unprocessed data and data products that have not undergone quality
control. Depending on the data type and data transmission system, raw data may be available
within seconds or minutes after real-time. Examples include real time precipitation,
streamflow and water quality measurements.
-
QualityControlLevelCode = “1” – Quality Controlled Data
Quality controlled data have passed quality assurance procedures such as routine estimation
of timing and sensor calibration or visual inspection and removal of obvious errors. An
example is USGS published streamflow records following parsing through USGS quality
control procedures.
-
QualityControlLevelCode = “2” –Derived Products
Derived products require scientific and technical interpretation and include multiple-sensor
data. An example might be basin average precipitation derived from rain gages using an
interpolation procedure.
-
QualityControlLevelCode = “3” –Interpreted Products
These products require researcher (PI) driven analysis and interpretation, model-based
interpretation using other data and/or strong prior assumptions. An example is basin average
precipitation derived from the combination of rain gages and radar return data.
-
QualityControlLevelCode = “4” –Knowledge Products
These products require researcher (PI) driven scientific interpretation and multidisciplinary
data integration and include model-based interpretation using other data and/or strong prior
assumptions. An example is percentages of old or new water in a hydrograph inferred from
an isotope analysis.
These definitions for quality control level are stored in the QualityControlLevels table. These
definitions are recommended for use, but users can define their own quality control level system.
The QualityControlLevels table is not a controlled vocabulary, but specification of a quality
control level for each data value is required. Appendix B of this document provides a discussion
of how to handle data versioning in terms of quality control levels (using the levels defined
above), data series editing, and data series creation.
20
Metadata
ODM has been designed to contain all the core elements of the CUAHSI HIS metadata system
(http://www.cuahsi.org/his/metadata.html) required for compliance with evolving standards such
as the draft ISO 19115. In its design, the ODM embodies much record, variable, and site level
metadata. Dataset and project level metadata required by these standards, such as
TopicCategory, Title, and Abstract are included in a table called ISOMetaData linked to each
data source.
Reference Documents
The Methods, Sources, LabMethods and ISOMetaData tables contain fields that can be used to
store links to source or reference information. At the general conceptual level of the ODM we
do not specify how, or in what form these links to references or sources should be implemented.
Options include using URLs or storing entire documents in the database. If external URLs are
used it will be important as the database grows and is used over time to ensure that links or
URLs included are stable. An alternative approach to external links is to exploit the capability of
modern databases to store entire digital documents, such as an html or xml page, PDF document,
or raw data file, within a field in the database. The capability therefore exists to instead have
these links refer to a separate table that would actually contain this metadata information, instead
of housing it in a separate digital library. There is some merit in this because then any data
exported in ODM format could take with it the associated metadata required to completely define
it as well as the raw data upon which it is derived. However, this has the disadvantage of
increasing (perhaps substantially) the size of database file containing the data and being
distributed to users.
Controlled Vocabularies
The following tables in the ODM are tables where controlled vocabularies for the fields are
required to maintain consistency and avoid the use of synonyms that can lead to ambiguity:
•
•
•
•
•
•
•
•
•
•
•
•
CensorCodeCV
DataTypeCV
GeneralCategoryCV
SampleMediumCV
SampleTypeCV
SpatialReferences
SpeciationCV
TopicCategoryCV
Units
ValueTypeCV
VariableNameCV
VerticalDatumCV
The initial contents of these controlled vocabularies are specified in the Microsoft SQL Server
2005 blank schema for the ODM. However, the ODM controlled vocabularies are dynamic. A
central repository of current ODM controlled vocabulary terms is maintained on the ODM
Website at http://water.usu.edu/cuahsi/odm/, together with the most recent version of the ODM
21
SQL Server 2005 blank schema, this design specifications document, and other tools for working
with ODM. Users can submit new terms for the controlled vocabularies and can request changes
to existing terms using functionality available on the ODM website
(http://water.usu.edu/cuahsi/odm/). Functionality for updating local controlled vocabulary tables
with new terms from the central ODM controlled vocabulary repository is provided in the ODM
Tools software application, which is also available from the ODM website. The CUAHSI HIS
team welcomes input on the controlled vocabularies.
Examples
The following examples show the capability of ODM to store different types of point
observations. It is not possible in examples such as these to present all of the field values for all
the tables. Because of this, the examples present selected fields and tables chosen to illustrate
key capabilities of the data model. Refer to Appendix A for the complete definition of table and
field contents.
Streamflow - Gage Height and Discharge
Figure E.1 illustrates how both stream gage height measurements and the associated discharge
estimates derived from the gage height measurements can be stored in the ODM. Note that gage
height in feet and discharge in cubic feet per second are both in the same data table but with
different VariableIDs that reference the Variables table, which specifies the VariableName,
Units, and other quantities associated with these data values. The link between VariableID in the
DataValues table and Variables table is shown. In this example, discharge measurements are
derived from gage height (stage) measurements through a rating curve. The MethodID
associated with each discharge record references into the Methods table that describes this and
provides a URL that should contain metadata details for this method. The DerivedFromID in the
DataValues table references into the DerivedFrom table that references back to the
corresponding gage height in the DataValues table from which the discharge was derived.
22
Figure E.1. Excerpts from tables illustrating the population of ODM with streamflow gage height
(stage) and discharge data.
Streamflow - Daily Average Discharge
Daily average streamflow is reported as an average of continuous 15 minute interval data values.
Figure E.2 shows excerpts from tables illustrating the population of ODM with both the
continuous discharge values and derived daily averages. The record giving the single daily
average discharge with a value of 722 ft3/s in the DataValues table has a DerivedFromID of 100.
This refers to multiple records in the DerivedFrom table, with associated ValueIDs 97, 98, 99, …
113 shown. These refer to the specific 15 minute discharge values in the DataValues table used
to derive the average daily discharge. VariableID in the DataValues table identifies the
appropriate record in the Variables table specifying that this is a daily average discharge with
units of ft3/s from UnitsID referencing in to the Units table. MethodID in the DataValues table
identifies the appropriate record in the Methods table specifying that the method used to obtain
this data value was daily averaging.
23
Figure E.2. Excerpts from tables illustrating the population of ODM with daily average discharge
derived from 15 minute discharge values.
Water Chemistry from a Profile in a Lake
Reservoir profile measurements provide an example of the logical grouping of data values and
data values that have an offset in relationship to the location of the monitoring site. These
measurements may be made simultaneously (by multiple instruments in the water column) or
over a short time period (one instrument that is lowered from top to bottom). Figure E.3 shows
an example of how these data would be stored in ODM. The OffsetTypes table and OffsetValue
attribute is used to quantify the depth offset associated with each measurement. Each of the data
values shown has an OffsetTypeID that references into the OffsetTypes table. The OffsetTypes
table indicates that for this OffsetType the offset is “Depth below water surface.” The
OffsetTypes table references into the Units table indicating that the OffsetUnits are meters, so
OffsetValue in the DataValues table is in units of meters depth below the water surface. Each of
the data values shown also has a VariableID that in the Variables table indicates that the variable
measured was dissolved oxygen concentration in units of mg/L. Each of the data values shown
also has a MethodID that in the Methods table indicates that dissolved oxygen was measured
with a Hydrolab multiprobe. The data values shown are part of a logical group of data values
representing the water chemistry profile in a lake. This is represented using the Groups table and
GroupDescriptions table. The Groups table associates GroupID 1 with each of the ValueIDs of
the data values belonging to the group. A description of this group is given in the
GroupDescriptions table.
24
Figure E.3. Excerpts from tables illustrating the population of ODM with water chemistry data.
NCDC Precipitation Data
Figure E.4 illustrates the representation of NCDC 15 minute precipitation data by ODM. The
data includes 15 minute incremental data values as well as daily totals. Separate records in the
Variables table are used for the 15 minute or daily total values. These data are reported at
irregular intervals and only logged for time periods for which precipitation is non zero. This is
accommodated by setting the IsRegular attribute associated with the variable to “False” and
specifying the TimeSupport value as 15 or 24 and the TimeUnits as “Minutes” or “Hours”. The
DataType of “Incremental” is used to indicate that these are incremental data values defined over
the TimeSupport interval. The convention for incremental data (see above) is that when the time
support is specified, it specifies the increment for irregular incremental data. When time support
is specified as 0 it means the increment is from the previous data value at the same site position.
Data qualifiers indicate periods where the data is missing. The method associated with each
precipitation variable documents the convention that zero precipitation periods are not logged in
this data acquired from NCDC. A data qualifier is also used to flag days where the precipitation
total is incomplete due to the record being missing during part of the day.
25
Figure E.4. Excerpts from tables illustrating the population of the ODM with NCDC
Precipitation Data.
Groundwater Level Data
The following is an example of how groundwater level data can be stored in ODM. In this
example, the data values are the water table level relative to the ground surface reported as
negative values. This example shows multiple data values of a single variable at a single site
made by a single source that have been quality controlled as indicated by the
QualityControlLevelID field in the QualityControlLevels table. The SiteID field in the
DataValues table indicates the site in the Sites table that gives the location information about the
monitoring site. In this case, the elevation is with respect to the NGVD29 datum as indicated in
the VerticalDatum field, and latitude and longitude are with respect to the NAD27 datum as
indicated in the SpatialReferences table. The VariableID field in the DataValues table references
the appropriate record in the Variables table indicating information about the variable. The
SourceID field in the DataValues table references the appropriate record in the Sources table
giving information about the source of the data.
26
Figure E.5. Excerpts from tables illustrating the population of the ODM with irregularly sampled
groundwater level data.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant
Nos. EAR 0412975 and 0413265. Any opinions, findings and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily reflect the views of the
National Science Foundation (NSF).
27
References
AIAA, (1995), Assessment of Wind Tunnel Data Uncertainty, American Institute of Aeronautics
and Astronautics: AIAA S-071-1995.
Blöschl, G., (1996), Scale and Scaling in Hydrology, Habilitationsschrift, Weiner Mitteilungen
Wasser Abwasser Gewasser, Wien, 346 p.
Blöschl, G. and M. Sivapalan, (1995), "Scale Issues in Hydrological Modelling: A Review,"
Hydrological Processes, 9(1995): 251-290.
Horsburgh, J. S., D. G. Tarboton and D. R. Maidment, (2005), "A Community Data Model for
Hydrologic Observations, Chapter 6," in Hydrologic Information System Status Report,
Version 1, Edited by D. R. Maidment, p.102-135,
http://www.cuahsi.org/his/docs/HISStatusSept15.pdf.
Horsburgh, J. S., D. G. Tarboton, D. R. Maidment, and I. Zaslavsky, (2008), A relational model
for environmental and water resources data, Water Resources Research, Vol. 44,
W05406, doi:10.1029/2007WR006392.
Maidment, D. R., ed. (2002), Arc Hydro GIS for Water Resources, ESRI Press, Redlands, CA,
203 p.
Maidment, D. R., (2005), "A Data Model for Hydrologic Observations." Paper prepared for
presentation at the CUAHSI Hydrologic Information Systems Symposium, University of
Texas at Austin. March 7, 2005.
Tarboton, D. G., (2005), "Review of Proposed CUAHSI Hydrologic Information System
Hydrologic Observations Data Model." Utah State University. May 5, 2005.
28
Appendix A. Observations Data Model Table and Field Structure
The following is a description of the tables in the observations data model, a listing of the fields
contained in each table, a description of the data contained in each field and its data type,
examples of the information to be stored in each field where appropriate, specific constraints
imposed on each field, and discussion on how each field should be populated. Values in the
example column should not be considered to be inclusive of all potential values, especially in the
case of fields that require a controlled vocabulary. We anticipate that these controlled
vocabularies will need to be extended and adjusted. Tables appear in alphabetical order.
Each table below includes a “Constraint” column. The value in this column designates each field
in the table as one of the following:
Mandatory (M) – A value in this field is mandatory and cannot be NULL.
Optional (O) – A value in this field is optional and can be NULL.
Programmatically derived (P) – Inherits from the source field. The value in this field should be
automatically populated as the result of a query and is not required to be input by the user.
Additional constraints are documented where appropriate in the Constraint column. In addition,
where appropriate, each table contains a “Default Value” column. The value in this column is
the default value for the associated field. The default value specifies the convention that should
be followed when a value for the field is not specified. Below each table is a discussion of the
rules and best practices that should be used in populating each table within ODM.
Table: Categories
The Categories table defines the categories for categorical variables. Records are required for
variables where DataType is specified as "Categorical." Multiple entries for each VariableID,
with different DataValues provide the mapping from DataValue to category description.
Field Name
VariableID
DataType
Integer
DataValue
CategoryDescription
Real
Text (Unlimited)
Description
Integer identifier that references the
Variables record of a categorical
variable.
Numeric value that defines the category
Definition of categorical variable value
Examples
45
1.0
“Cloudy”
Constraint
M
Foreign
key
M
M
The following rules and best practices should be used in populating this table:
1. Although all of the fields in this table are mandatory, they need only be populated if
categorical data are entered into the database. If there are no categorical data in the
DataValues table, this table will be empty.
2. This table should be populated before categorical data values are added to the DataValues
table.
29
Table: CensorCodeCV
The CensorCodeCV table contains the controlled vocabulary for censor codes. Only values from
the Term field in this table can be used to populate the CensorCode field of the DataValues table.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for CensorCode.
Examples
“lt”, “gt”,
“nc”
Definition
Text (unlimited)
Definition of CensorCode controlled
vocabulary term. The definition is
optional if the term is self explanatory.
“less than”,
“greater
than”, “not
censored”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: DataTypeCV
The DataTypeCV table contains the controlled vocabulary for data types. Only values from the
Term field in this table can be used to populate the DataType field in the Variables table.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for DataType.
Examples
“Continuous”
Definition
Text (unlimited)
Definition of DataType controlled
vocabulary term. The definition is
optional if the term is self explanatory.
“A quantity
specified at a
particular
instant in time
measured with
sufficient
frequency
(small
spacing) to be
interpreted as
a continuous
record of the
phenomenon.”
30
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: DataValues
The DataValues table contains the actual data values.
Field Name
Data Type
Description
Example
Constraint
ValueID
Integer
Identity
Unique integer identifier for each
data value.
43
DataValue
Real
34.5
ValueAccuracy
Real
4
O
LocalDateTime
Date/Time
9/4/2003
7:00:00
AM
M
UTCOffset
Real
-7
M
DateTimeUTC
Date/Time
9/4/2003
2:00:00
PM
M
SiteID
Integer
3
M
Foreign key
VariableID
Integer
5
M
Foreign key
OffsetValue
Real
2.1
O
NULL =
No Offset
OffsetTypeID
Integer
3
O
Foreign key
NULL =
No Offset
CensorCode
Text (50)
The numeric value of the
observation. For Categorical
variables, a number is stored here.
The Variables table has DataType
as Categorical and the Categories
table maps from the DataValue
onto Category Description.
Numeric value that describes the
measurement accuracy of the data
value. If not given, it is interpreted
as unknown.
Local date and time at which the
data value was observed.
Represented in an implementation
specific format.
Offset in hours from UTC time of
the corresponding LocalDateTime
value.
Universal UTC date and time at
which the data value was observed.
Represented in an implementation
specific format.
Integer identifier that references the
site at which the observation was
measured. This links data values to
their locations in the Sites table.
Integer identifier that references the
variable that was measured. This
links data values to their variable in
the Variables table.
Distance from a datum or control
point to the point at which a data
value was observed. If not given
the OffsetValue is inferred to be 0,
or not relevant/necessary.
Integer identifier that references the
measurement offset type in the
OffsetTypes table.
Text indication of whether the data
value is censored from the
CensorCodeCV controlled
vocabulary.
M
Unique
Primary key
M
“nc”
M
Foreign key
“nc” = Not
Censored
31
Default
Value
NULL
Field Name
Data Type
Description
Example
Constraint
Default
Value
NULL
QualifierID
Integer
4
O
Foreign key
MethodID
Integer
3
M
Foreign key
SourceID
Integer
5
M
Foreign key
SampleID
Integer
7
O
Foreign key
NULL
DerivedFromID
Integer
5
O
NULL
QualityControlLevelID
Integer
Integer identifier that references the
Qualifiers table. If Null, the data
value is inferred to not be qualified.
Integer identifier that references
method used to generate the data
value in the Methods table.
Integer identifier that references the
record in the Sources table giving
the source of the data value.
Integer identifier that references
into the Samples table. This is
required only if the data value
resulted from a physical sample
processed in a lab.
Integer identifier for the derived
from group of data values that the
current data value is derived from.
This refers to a group of derived
from records in the DerivedFrom
table. If NULL, the data value is
inferred to not be derived from
another data value.
Integer identifier giving the level of
quality control that the value has
been subjected to. This references
the QualityControlLevels table.
1
M
Foreign key
-9999 =
Unknown
0 = No
method
specified
The following rules and best practices should be used in populating this table:
1. ValueID is the primary key, is mandatory, and cannot be NULL. This field should be
implemented as an autonumber/identity field. When data values are added to this table, a
unique integer ValueID should be assigned to each data value by the database software
such that the primary key constraint is not violated.
2. Each record in this table must be unique. This is enforced by a unique constraint across
all of the fields in this table (excluding ValueID) so that duplicate records are avoided.
3. The LocalDateTime, UTCOffset, and DateTimeUTC must all be populated. Care must
be taken to ensure that the correct UTCOffset is used, especially in areas that observe
daylight saving time. If LocalDateTime and DateTimeUTC are given, the UTCOffset
can be calculated as the difference between the two dates. If LocalDateTime and
UTCOffset are given, DateTimeUTC can be calculated.
4. SiteID must correspond to a valid SiteID from the Sites table. When adding data for a
new site to the ODM, the Sites table should be populated prior to adding data values to
the DataValues table.
5. VariableID must correspond to a valid VariableID from the Variables table. When
adding data for a new variable to the ODM, the Variables table should be populated prior
to adding data values for the new variable to the DataValues table.
6. OffsetValue and OffsetTypeID are optional because not all data values have an offset.
Where no offset is used, both of these fields should be set to NULL indicating that the
data values do not have an offset. Where an OffsetValue is specified, an OffsetTypeID
32
must also be specified and it must refer to a valid OffsetTypeID in the OffsetTypes table.
The OffsetTypes table should be populated prior to adding data values with a particular
OffsetTypeID to the DataValues table.
7. CensorCode is mandatory and cannot be NULL. A default value of “nc” is used for this
field. Only Terms from the CensorCodeCV table should be used to populate this field.
8. The QualifierID field is optional because not all data values have qualifiers. Where no
qualifier applies, this field should be set to NULL. When a QualifierID is specified in
this field it must refer to a valid QualifierID in the Qualifiers table. The Qualifiers table
should be populated prior to adding data values with a particular QualifierID to the
DataValues Table.
9. MethodID must correspond to a valid MethodID from the Methods table and cannot be
NULL. A default value of 0 is used in the case where no method is specified or the
method used to create the observation is unknown. The Methods table should be
populated prior to adding data values with a particular MethodID to the DataValues table.
10. SourceID must correspond to a valid SourceID from the Sources table and cannot be
NULL. The Sources table should be populated prior to adding data values with a
particular SourceID to the DataValues table.
11. SampleID is optional and should only be populated if the data value was generated from
a physical sample that was sent to a laboratory for analysis. The SampleID must
correspond to a valid SampleID in the Samples table, and the Samples table should be
populated prior to adding data values with a particular SampleID to the DataValues table.
12. DerivedFromID is optional and should only be populated if the data value was derived
from other data values that are also stored in the ODM database.
13. QualityControlLevelID is mandatory, cannot be NULL, and must correspond to a valid
QualityControlLevelID in the QualityControlLevels table. A default value of -9999 is
used for this field in the event that the QualityControlLevelID is unknown. The
QualityControlLevels table should be populated prior to adding data values with a
particular QualityControlLevelID to the DataValues table.
Table: DerivedFrom
The DerivedFrom table contains the linkage between derived data values and the data values that
they were derived from.
Field Name
DerivedFromID
Data Type
Integer
ValueID
Integer
Description
Integer identifying the group of data
values from which a quantity is derived.
Integer identifier referencing data values
that comprise a group from which a
quantity is derived. This corresponds to
ValueID in the DataValues table.
Examples
3
Constraint
M
1,2,3,4,5
M
The following rules and best practices should be used in populating this table:
1. Although all of the fields in this table are mandatory, they need only be populated if
derived data values and the data values that they were derived from are entered into the
database. If there are no derived data in the DataValues table, this table will be empty.
33
Table: GeneralCategoryCV
The GeneralCategoryCV table contains the controlled vocabulary for the general categories
associated with Variables. The GeneralCategory field in the Variables table can only be
populated with values from the Term field of this controlled vocabulary table.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for
GeneralCategory.
Examples
“Hydrology”
Definition
Text (unlimited)
Definition of GeneralCategory
controlled vocabulary term. The
definition is optional if the term is self
explanatory.
“Data
associated
with
hydrologic
variables or
processes.”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: GroupDescriptions
The GroupDescriptions table lists the descriptions for each of the groups of data values that have
been formed.
Field Name
GroupID
Data Type
Integer
Identity
GroupDescription
Text (unlimited)
Description
Unique integer identifier for each group
of data values that has been formed.
This also references to GroupID in the
Groups table.
Text description of the group.
Example
4
Constraint
M
Unique
Primary key
“Echo
Reservoir
Profile
7/7/2005”
O
The following rules and best practices should be used in populating this table:
1. This table will only be populated if groups of data values have been created in the ODM
database.
2. The GroupID field is the primary key, must be a unique integer, and cannot be NULL. It
should be implemented as an auto number/identity field.
3. The GroupDescription can be any text string that describes the group of observations.
34
Table: Groups
The Groups table lists the groups of data values that have been created and the data values that
are within each group.
Field Name
GroupID
Data Type
Integer
ValueID
Integer
Description
Integer ID for each group of data values
that has been formed.
Integer identifier for each data value that
belongs to a group. This corresponds to
ValueID in the DataValues table
Example
4
2,3,4
Constraint
M
Foreign key
M
Foreign key
The following rules and best practices should be used in populating this table:
1. This table will only be populated if groups of data values have been created in the ODM
database.
2. The GroupID field must reference a valid GroupID from the GroupDescriptions table,
and the GroupDescriptions table should be populated for a group prior to populating the
Groups table.
Table: ISOMetadata
The ISOMetadata table contains dataset and project level metadata required by the CUAHSI HIS
metadata system (http://www.cuahsi.org/his/documentation.html) for compliance with standards
such as the draft ISO 19115 or ISO 8601. The mandatory fields in this table must be populated
to provide a complete set of ISO compliant metadata in the database.
Field Name
MetadataID
Data Type
Integer
Identity
Description
Unique integer ID for each
metadata record.
Example
4
TopicCategory
Text (255)
“inlandWaters”
Title
Text (255)
Topic category keyword that
gives the broad ISO19115
metadata topic category for data
from this source. The controlled
vocabulary of topic category
keywords is given in the
TopicCategoryCV table.
Title of data from a specific data
source.
Abstract
Text
(unlimited)
Abstract of data from a specific
data source.
35
Constraint
M
Unique
Primary key
M
Foreign key
Default Value
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
“Unknown”
“Unknown”
“Unknown”
Field Name
ProfileVersion
Data Type
Text (255)
Description
Name of metadata profile used
by the data source
MetadataLink
Text (500)
Link to additional metadata
reference material.
Example
“ISO8601”
Constraint
M
Cannot
contain tab,
line feed, or
carriage
return
characters
O
Default Value
“Unknown”
NULL
The following rules and best practices should be used in populating this table:
1. The MetadataID field is the primary key, must be a unique integer, and cannot be NULL.
This field should be implemented as an auto number/identity field.
2. All of the fields in this table are mandatory and cannot be NULL except for the
MetadataLink field.
3. The TopicCategory field should only be populated with terms from the
TopicCategoryCV table. The default controlled vocabulary term is “Unknown.”
4. The Title field should be populated with a brief text description of what the referenced
data represent. This field can be populated with “Unknown” if there is no title for the
data.
5. The Abstract field should be populated with a more complete text description of the data
that the metadata record references. This field can be populated with “Unknown” if there
is no abstract for the data.
6. The ProfileVersion field should be populated with the version of the ISO metadata profile
that is being used. This field can be populated with “Unknown” if there is no profile
version for the data.
7. One record with a MetadataID = 0 should exist in this table with TopicCategory, Title,
Abstract, and ProfileVersion = “Unknown” and MetadataLink = NULL. This record
should be the default value for sources with unknown/unspecified metadata.
Table: LabMethods
The LabMethods table contains descriptions of the laboratory methods used to analyze physical
samples for specific constituents.
Field Name
LabMethodID
Data Type
Integer
Identity
Description
Unique integer identifier
for each laboratory
method. This is the key
used by the Samples table
to reference a laboratory
method.
36
Example
6
Constraint
M
Unique
Primary
key
Default Value
LabName
Text (255)
Name of the laboratory
responsible for processing
the sample.
“USGS
Atlanta Field
Office”
LabOrganization
Text (255)
Organization responsible
for sample analysis.
“USGS”
LabMethodName
Text (255)
Name of the method and
protocols used for sample
analysis.
“USEPA365.1”
LabMethodDescription
Text
(unlimited)
Description of the method
and protocols used for
sample analysis.
“Processed
through Model
*** Mass
Spectrometer”
LabMethodLink
Text (500)
Link to additional
reference material on the
analysis method.
M
Cannot
contain tab,
line feed,
or carriage
return
characters
M
Cannot
contain tab,
line feed,
or carriage
return
characters
M
Cannot
contain tab,
line feed,
or carriage
return
characters
M
“Unknown”
O
NULL
“Unknown”
“Unknown”
“Unknown”
The following rules and best practices should be used when populating this table:
1. The LabMethodID field is the primary key, must be a unique integer, and cannot be
NULL. It should be implemented as an auto number/identity field.
2. All of the fields in this table are required and cannot be null except for the
LabMethodLink.
3. The default value for all of the required fields except for the LabMethodID is
“Unknown.”
4. A single record should exist in this table where the LabMethodID = 0 and the LabName,
LabOrganization, LabMethdodName, and LabMethodDescription fields are equal to
“Unknown” and the LabMethodLink = NULL. This record should be used to identify
samples in the Samples table for which nothing is known about the laboratory method
used to analyze the sample.
37
Table: Methods
The Methods table lists the methods used to collect the data and any additional information about
the method.
Field Name
MethodID
Data Type
Integer
Identity
Description
Unique integer ID for
each method.
Example
5
MethodDescription
Text
(unlimited)
Text description of each
method.
“Specific
conductance
measured using a
Hydrolab” or
"Streamflow
measured using a V
notch weir with
dimensions xxx"
MethodLink
Text (500)
Link to additional
reference material on
the method.
Constraint
M
Unique
Primary key
M
Default Value
O
NULL
The following rules and best practices should be used when populating this table:
1. The MethodID field is the primary key, must be a unique integer, and cannot be NULL.
2. There is no default value for the MethodDescription field in this table. Rather, this table
should contain a record with MethodID = 0, MethodDescription = “Unknown”, and
MethodLink = NULL. A MethodID of 0 should be used as the MethodID for any data
values for which the method used to create the value is unknown (i.e., the default value
for the MethodID field in the DataValues table is 0).
3. Methods should describe the manner in which the observation was collected (i.e.,
collected manually, or collected using an automated sampler) or measured (i.e., measured
using a temperature sensor or measured using a turbidity sensor). Details about the
specific sensor models and manufacturers can be included in the MethodDescription.
Table: ODM Version
The ODM Version table has a single record that records the version of the ODM database. This
table must contain a valid ODM version number. This table will be pre-populated and should
not be edited.
Field Name
VersionNumber
Data Type
Text (50)
Description
String that lists the version of the ODM
database.
38
Example
“1.1”
Constraint
M
Cannot
contain tab,
line feed, or
carriage
return
characters
Table: OffsetTypes
The OffsetTypes table lists full descriptive information for each of the measurement offsets.
Field Name
OffsetTypeID
Data Type
Integer
Identity
Description
Unique integer identifier that identifies
the type of measurement offset.
Example
2
OffsetUnitsID
Integer
1
OffsetDescription
Text (unlimited)
Integer identifier that references the
record in the Units table giving the Units
of the OffsetValue.
Full text description of the offset type.
“Below water
surface”
“Above
Ground Level”
Constraint
M
Unique
Primary key
M
Foreign key
M
The following rules and best practices should be followed when populating this table:
1. Although all three fields in this table are mandatory, this table will only be populated if
data values measured at an offset have been entered into the ODM database.
2. The OffsetTypeID field is the primary key, must be a unique integer, and cannot be
NULL. This field should be implemented as an auto number/identity field.
3. The OffsetUnitsID field should reference a valid ID from the UnitsID field in the Units
table. Because the Units table is a controlled vocabulary, only units that already exist in
the Units table can be used as the units of the offset.
4. The OffsetDescription field should be filled in with a complete text description of the
offset that provides enough information to interpret the type of offset being used. For
example, “Distance from stream bank” is ambiguous because it is not known which bank
is being referred to.
Table: Qualifiers
The Qualifiers table contains data qualifying comments that accompany the data.
Field Name
QualifierID
Data Type
Integer
Identity
Description
Unique integer identifying the
data qualifier.
Example
3
QualifierCode
Text (50)
Text code used by
organization that collects the
data.
“e” (for
estimated) or
“a” (for
approved) or
“p” (for
provisional)
QualifierDescription
Text
(unlimited)
Text of the data qualifying
comment.
“Holding
time for
sample
analysis
exceeded”
39
Constraint
M
Unique
Primary key
O
Cannot
contain
space, tab,
line feed, or
carriage
return
characters
M
Default Value
NULL
This table will only be populated if data values that have data qualifying comments have been
added to the ODM database. The following rules and best practices should be used when
populating this table:
1. The QualifierID field is the primary key, must be a unique integer, and cannot be NULL.
This field should be implemented as an auto number/identity field.
Table: QualityControlLevels
The QualityControlLevels table contains the quality control levels that are used for versioning
data within the database.
Field Name
QualityControlLevelID
Data Type
Integer
Identity
Description
Unique integer identifying the
quality control level.
Example
0, 1, 2, 3, 4, 5
QualityControlLevelCode
Text (50)
Code used to identify the level
of quality control to which data
values have been subjected.
“1”, “1.1”, “Raw”,
“QC Checked”
Definition
Text (255)
Definition of Quality Control
Level.
“Raw Data”,
“Quality Controlled
Data”
Explanation
Text
(unlimited)
Explanation of Quality Control
Level
“Raw data is defined
as unprocessed data
and data products
that have not
undergone quality
control.”
Constraint
M
Unique
Primary key
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
This table is pre-populated with quality control levels 0 through 4 within the ODM. The
following rules and best practices should be used when populating this table:
1. The QualityControlLevelID field is the primary key, must be a unique integer, and cannot
be NULL. This field should be implemented as an auto number/identity field.
2. It is suggested that the pre-populated system of quality control level codes (i.e.,
QualityControlLevelCodes 0 – 4) be used. If the pre-populated list is not sufficient, new
quality control levels can be defined. A quality control level code of -9999 is suggested
for data whose quality control level is unknown.
40
Table: SampleMediumCV
The SampleMediumCV table contains the controlled vocabulary for sample media.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for
sample media.
Examples
“Surface Water”
Definition
Text (unlimited)
Definition of sample media
controlled vocabulary term. The
definition is optional if the term
is self explanatory.
“Sample taken from
surface water such as a
stream, river, lake,
pond, reservoir, ocean,
etc.”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: Samples
The Samples table gives information about physical samples analyzed in a laboratory.
Field Name
SampleID
Data Type
Integer
Identity
Description
Unique integer identifier that
identifies each physical sample.
Example
3
SampleType
Text (255)
Controlled vocabulary specifying
the sample type from the
SampleTypeCV table.
LabSampleCode
Text (50)
Code or label used to identify and
track lab sample or sample
container (e.g. bottle) during lab
analysis.
“FD”,
“PB”,
“SW”,
“Grab
Sample”
“AB-123”
LabMethodID
Integer
Unique identifier for the
laboratory method used to process
the sample. This references the
LabMethods table.
4
Constraint
M
Unique
Primary key
M
Foreign key
M
Unique
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Foreign key
The following rules and best practices should be followed when populating this table:
41
Default Value
“Unknown”
0 = Nothing
known about
lab method
1. This table will only be populated if data values associated with physical samples are
added to the ODM database.
2. The SamplID field is the primary key, must be a unique integer, and cannot be NULL.
This field should be implemented as an auto number/identity field.
3. The SampleType field should be populated using terms from the SampleTypeCV table.
Where the sample type is unknown, a default value of “Unknown” can be used.
4. The LabSampleCode should be a unique text code used by the laboratory to identify the
sample. This field is an alternate key for this table and should be unique.
5. The LabMethodID must reference a valid LabMethodID from the LabMethods table.
The LabMethods table should be populated with the appropriate laboratory method
information prior to adding records to this table that reference that laboratory method. A
default value of 0 for this field indicates that nothing is known about the laboratory
method used to analyze the sample.
Table: SampleTypeCV
The SampleTypeCV table contains the controlled vocabulary for sample type.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for sample
type.
Examples
“FD”, “PB”, “Grab
Sample”
Definition
Text
(unlimited)
Definition of sample type controlled
vocabulary term. The definition is
optional if the term is self
explanatory.
“Foliage Digestion”,
“Precipitation Bulk”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: SeriesCatalog
The SeriesCatalog table lists each separate data series in the database for the purposes of
identifying or displaying what data are available at each site and to speed simple queries without
querying the main DataValues table. Unique site/variable combinations are defined by unique
combinations of SiteID, VariableID, MethodID, SourceID, and QualityControlLevelID.
This entire table should be programmatically derived and should be updated every time data is
added to the database. Constraints on each field in the SeriesCatalog table are dependent upon
the constraints on the fields in the table from which those fields originated.
42
Field Name
SeriesID
SiteID
Data Type
Integer
Identity
Integer
SiteCode
Text (50)
SiteName
Text (255)
VariableID
Integer
VariableCode
Text (50)
VariableName
Text (255)
Speciation
Text (255)
VariableUnitsID
Integer
VariableUnitsName
Text (255)
SampleMedium
Text (255)
ValueType
Text (255)
TimeSupport
Real
Description
Unique integer identifier for
each data series.
Site identifier from the Sites
table.
Site code used by organization
that collects the data.
Full text name of sampling
site.
Integer identifier for each
Variable that references the
Variables table.
Variable code used by the
organization that collects the
data.
Name of the variable from the
variables table.
Code used to identify how the
data value is expressed (i.e.,
total phosphorus concentration
expressed as P). This should
be from the SpeciationCV
controlled vocabulary table.
Integer identifier that
references the record in the
Units table giving the Units of
the data value.
Full text name of the variable
units from the UnitsName
field in the Units table.
The medium of the sample.
This should be from the
SampleMediumCV controlled
vocabulary table.
Text value indicating what
type of data value is being
recorded. This should be from
the ValueTypeCV controlled
vocabulary table.
Numerical value that indicates
the time support (or temporal
footprint) of the data values. 0
is used to indicate data values
that are instantaneous. Other
values indicate the time over
which the data values are
implicitly or explicitly
averaged or aggregated.
43
Example
5
7
Constraint
P
Primary key
P
“1002000”
P
“Logan River”
P
4
P
“00060”
P
“Temperature”
P
“P”, “N”, “NO3”
P
5
P
“milligrams per
liter”
P
“Surface Water”
P
“Field Observation”
P
0, 24
P
Field Name
TimeUnitsID
Data Type
Integer
TimeUnitsName
Text (255)
DataType
Text (255)
GeneralCategory
Text (255)
MethodID
Integer
MethodDescription
Text
(unlimited)
SourceID
Integer
Organization
Text (255)
SourceDescription
Text
(unlimited)
Description
Integer identifier that
references the record in the
Units table giving the Units of
the time support. If
TimeSupport is 0, indicating
an instantaneous observation,
a unit needs to still be given
for completeness, although it
is somewhat arbitrary.
Full text name of the time
support units from the
UnitsName field in the Units
table.
Text value that identifies the
data as one of several types
from the DataTypeCV
controlled vocabulary table.
General category of the
variable from the
GeneralCategoryCV table.
Integer identifier that
identifies the method used to
generate the data values and
references the Methods table.
Full text description of the
method used to generate the
data values.
Integer identifier that
identifies the source of the
data values and references the
Sources table.
Text description of the source
organization from the Sources
table.
Text description of the data
source from the Sources table.
44
Example
4
Constraint
P
“hours”
P
“Continuous”
“Instantaneous”
“Cumulative”
“Incremental”
“Average”
“Minimum”
“Maximum”
“Constant Over
Interval”
“Categorical”
“Water Quality”
P
P
2
P
“Specific
conductance
measured using a
Hydrolab” or
"Streamflow
measured using a V
notch weir with
dimensions xxx"
5
P
“USGS”
P
“Text file retrieved
from the EPA
STORET system
indicating data
originally from Utah
Division of Water
Quality”
P
P
Field Name
Citation
Data Type
Text
(unlimited)
Description
Text string that give the
citation to be used when the
data from each source are
referenced.
QualityControlLevelID
Integer
QualityControlLevelCode
Text (50)
BeginDateTime
Date/Time
EndDateTime
Date/Time
BeginDateTimeUTC
Date/Time
EndDateTimeUTC
Date/Time
ValueCount
Integer
Integer identifier that indicates
the level of quality control that
the data values have been
subjected to.
Code used to identify the level
of quality control to which
data values have been
subjected.
Date of the first data value in
the series. To be
programmatically updated if
new records are added.
Date of the last data value in
the series. To be
programmatically updated if
new records are added.
Date of the first data value in
the series in UTC. To be
programmatically updated if
new records are added.
Date of the last data value in
the series in UTC. To be
programmatically updated if
new records are added.
The number of data values in
the series identified by the
combination of the SiteID,
VariableID, MethodID,
SourceID and
QualityControlLevelID fields.
To be programmatically
updated if new records are
added.
45
Example
“Slaughter, C. W.,
D. Marks, G. N.
Flerchinger, S. S.
Van Vactor and M.
Burgess, (2001),
"Thirty-five years of
research data
collection at the
Reynolds Creek
Experimental
Watershed, Idaho,
United States,"
Water Resources
Research, 37(11):
2819-2823.”
0,1,2,3,4
Constraint
P
P
“1”, “1.1”, “Raw”,
“QC Checked”
P
9/4/2003 7:00:00
AM
P
9/4/2005 7:00:00
AM
P
9/4/2003 2:00 PM
P
9/4/2003 2:00 PM
P
50
P
Table: Sites
The Sites table provides information giving the spatial location at which data values have been
collected.
Field Name
SiteID
Data Type
Integer
Identity
Description
Unique identifier for each
sampling location.
Example
37
SiteCode
Text (50)
Code used by organization that
collects the data to identify the
site
“10109000”
(USGS Gage
number)
SiteName
Text (255)
Full name of the sampling site.
“LOGAN
RIVER
ABOVE
STATE DAM,
NEAR
LOGAN,UT”
Latitude
Real
Latitude in decimal degrees.
45.32
Longitude
Real
Longitude in decimal degrees.
East positive, West negative.
-100.47
LatLongDatumID
Integer
1
Elevation_m
Real
VerticalDatum
Text (255)
LocalX
LocalY
Real
Real
Identifier that references the
Spatial Reference System of
the latitude and longitude
coordinates in the
SpatialReferences table.
Elevation of sampling location
(in m). If this is not provided it
needs to be obtained
programmatically from a DEM
based on location information.
Vertical datum of the elevation.
Controlled Vocabulary from
VerticalDatumCV.
Local Projection X coordinate.
Local Projection Y Coordinate.
46
Constraint
M
Unique
Primary
key
M
Unique
Allows
only
characters
in the
range of AZ (case
insensitive)
, 0-9, and
“.”, “-“,
and “_”.
M
Cannot
contain
tab, line
feed, or
carriage
return
characters
M
(>= -90
AND
<= 90)
M
(>= -180
AND <=
360)
M
Foreign
key
Default Value
1432
O
NULL
“NAVD88”
O
Foreign
key
O
O
NULL
456700
232000
0 = Unknown
NULL
NULL
Field Name
LocalProjectionID
Data Type
Integer
Description
Identifier that references the
Spatial Reference System of
the local coordinates in the
SpatialReferences table. This
field is required if local
coordinates are given.
Value giving the accuracy with
which the positional
information is specified in
meters.
Name of state in which the
monitoring site is located.
Example
7
Constraint
O
Foreign
key
Default Value
NULL
PosAccuracy_m
Real
100
O
NULL
State
Text (255)
“Utah”
NULL
Text (255)
Name of county in which the
monitoring site is located.
“Cache”
Text
(unlimited)
Comments related to the site.
O
Cannot
contain
tab, line
feed, or
carriage
return
characters
O
Cannot
contain
tab, line
feed, or
carriage
return
characters
O
County
Comments
NULL
NULL
The following rules and best practices should be followed when populating this table:
1. The SiteID field is the primary key, must be a unique integer, and cannot be NULL. This
field should be implemented as an auto number/identity field.
2. The SiteCode field must contain a text code that uniquely identifies each site. The values
in this field should be unique and can be an alternate key for the table. SiteCodes cannot
contain any characters other than A-Z (case insensitive), 0-9, period “.”, dash “-“, and
underscore “_”.
3. The LatLongDatumID must reference a valid SpatialReferenceID from the
SpatialReferences controlled vocabulary table. If the datum is unknown, a default value
of 0 is used.
4. If the Elevation_m field is populated with a numeric value, a value must be specified in
the VerticalDatum field. The VerticalDatum field can only be populated using terms
from the VerticalDatumCV table. If the vertical datum is unknown, a value of
“Unknown” is used.
5. If the LocalX and LocalY fields are populated with numeric values, a value must be
specified in the LocalProjectionID field. The LocalProjectionID must reference a valid
SpatialReferenceID from the SpatialReferences controlled vocabulary table. If the spatial
reference system of the local coordinates is unknown, a default value of 0 is used.
47
Table: Sources
The Sources table lists the original sources of the data, providing information sufficient to
retrieve and reconstruct the data value from the original data files if necessary.
Field Name
SourceID
Data Type
Integer
Identity
Description
Unique integer identifier that
identifies each data source.
Example
5
Organization
Text (255)
“Utah Division
of Water
Quality”
SourceDescription
Text
(unlimited)
Name of the organization that
collected the data. This should
be the agency or organization
that collected the data, even if
it came out of a database
consolidated from many
sources such as STORET.
Full text description of the
source of the data.
SourceLink
Text (500)
ContactName
Text (255)
Phone
Text (255)
Phone number for the contact
person.
“435-797-0000”
Email
Text (255)
Email address for the contact
person.
“Jane.Adams@
dwq.ut”
Link that can be pointed at the
original data file and/or
associated metadata stored in
the digital library or URL of
data source.
Name of the contact person for
the data source.
48
“Text file
retrieved from
the EPA
STORET
system
indicating data
originally from
Utah Division
of Water
Quality”
“Jane Adams”
Constraint
M
Unique
Primary key
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Default Value
O
NULL
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
“Unknown”
“Unknown”
“Unknown”
Field Name
Address
Data Type
Text (255)
Description
Street address for the contact
person.
Example
“45 Main
Street”
City
Text (255)
City in which the contact
person is located.
“Salt Lake City”
State
Text (255)
State in which the contact
person is located. Use two
letter abbreviations for US.
For other countries give the full
country name.
“UT”
ZipCode
Text (255)
US Zip Code or country postal
code.
“82323”
Citation
Text
(unlimited)
Text string that give the
citation to be used when the
data from each source are
referenced.
MetadataID
Integer
Integer identifier referencing
the record in the ISOMetadata
table for this source.
“Data collected
by USU as part
of the Little
Bear River Test
Bed Project”
5
Constraint
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Default Value
“Unknown”
M
Foreign key
0 = Unknown
or uninitialized
metadata
“Unknown”
“Unknown”
“Unknown”
“Unknown”
The following rules and best practices should be followed when populating this table:
1. The SourceID field is the primary key, must be a unique integer, and cannot be NULL.
This field should be implemented as an auto number/identity field.
2. The Organization field should contain a text description of the agency or organization
that created the data.
3. The SourceDescription field should contain a more detailed description of where the data
was actually obtained.
4. A default value of “Unknown” may be used for the source contact information fields in
the event that this information is not known.
5. Each source must be associated with a metadata record in the ISOMetadata table. As
such, the MetadataID must reference a valid MetadataID from the ISOMetadata table.
The ISOMetatadata table should be populated with an appropriate record prior to adding
49
a source to the Sources table. A default MetadataID of 0 can be used for a source with
unknown or uninitialized metadata.
6. Use the Citation field to record the text that you would like others to use when they are
referencing your data. Where available, journal citations are encouraged to promote the
correct crediting for use of data.
Table: SpatialReferences
The SpatialReferences table provides information about the Spatial Reference Systems used for
latitude and longitude as well as local coordinate systems in the Sites table. This table is a
controlled vocabulary.
Field Name
SpatialReferenceID
Data Type
Integer
Identity
Description
Unique integer identifier for each Spatial
Reference System.
Example
37
SRSID
Integer
4269
SRSName
Text (255)
Integer identifier for the Spatial
Reference System from
http://www.epsg.org/.
Name of the Spatial Reference System.
IsGeographic
Boolean
“True”,
“False”
Notes
Text
(unlimited)
Value that indicates whether the spatial
reference system uses geographic
coordinates (i.e. latitude and longitude)
or not.
Descriptive information about the Spatial
Reference System. This field would be
used to define a non-standard study area
specific system if necessary and would
contain a description of the local
projection information. Where possible,
this should refer to a standard projection,
in which case latitude and longitude can
be determined from local projection
information. If the local grid system is
non-standard then latitude and longitude
need to be included too.
“NAD83”
Constraint
M
Unique
Primary key
O
M
Cannot contain
tab, line feed, or
carriage return
characters
O
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
50
Table: SpeciationCV
The SpeciationCV table contains the controlled vocabulary for the Speciation field in the
Variables table.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for Speciation.
Examples
“P”
Definition
Text
(unlimited)
Definition of Speciation controlled
vocabulary term. The definition is
optional if the term is self explanatory.
“Expressed as
phosphorus”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: TopicCategoryCV
The TopicCategoryCV table contains the controlled vocabulary for the ISOMetaData topic
categories.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for
TopicCategory.
Examples
“InlandWaters”
Definition
Text
(unlimited)
Definition of TopicCategory controlled
vocabulary term. The definition is
optional if the term is self explanatory.
“Data
associated with
inland waters”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
51
Table: Units
The Units table gives the Units and UnitsType associated with variables, time support, and
offsets. This is a controlled vocabulary table.
Field Name
UnitsID
Data Type
Integer
Identity
Description
Unique integer identifier that identifies
each unit.
Example
6
UnitsName
Text (255)
Full text name of the units.
“Milligrams
Per Liter”
UnitsType
Text (255)
Text value that specifies the dimensions
of the units.
“Length”
“Time”
“Mass”
UnitsAbbreviation
Text (255)
Text abbreviation for the units.
“mg/L”
Constraint
M
Unique
Primary key
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
M
Cannot
contain tab,
line feed, or
carriage
return
characters
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: ValueTypeCV
The ValueTypeCV table contains the controlled vocabulary for the ValueType field in the
Variables and SeriesCatalog tables.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for ValueType.
Examples
“Field
Observation”
Definition
Text (unlimited)
Definition of the ValueType controlled
vocabulary term. The definition is
optional if the term is self explanatory.
“Observation
of a variable
using a field
instrument”
52
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: VariableNameCV
The VariableName CV table contains the controlled vocabulary for the VariableName field in
the Variables and SeriesCatalog tables.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for Variable
names.
Definition
Text (unlimited)
Definition of the VariableName
controlled vocabulary term. The
definition is optional if the term is self
explanatory.
Examples
"Temperature",
"Discharge",
"Precipitation"
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
Table: Variables
The Variables table lists the full descriptive information about what variables have been
measured.
Field Name
VariableID
Data Type
Integer
Identity
Description
Unique integer identifier for
each variable.
Example
6
VariableCode
Text (50)
Text code used by the
organization that collects the
data to identify the variable.
“00060” used
by USGS for
discharge
VariableName
Text (255)
Full text name of the variable
that was measured, observed,
modeled, etc. This should be
from the VariableNameCV
controlled vocabulary table.
“Discharge”
53
Constraint
M
Unique
Primary key
M
Unique
Allows only
characters in
the range of
A-Z (case
insensitive),
0-9, and “.”,
“-“, and “_”.
M
Foreign key
Default Value
Field Name
Speciation
Data Type
Text (255)
VariableUnitsID
Integer
SampleMedium
Text (255)
ValueType
Text (255)
IsRegular
Boolean
TimeSupport
Real
TimeUnitsID
Integer
DataType
Text (255)
Description
Text code used to identify how
the data value is expressed (i.e.,
total phosphorus concentration
expressed as P). This should be
from the SpeciationCV
controlled vocabulary table.
Integer identifier that references
the record in the Units table
giving the units of the data
values associated with the
variable.
The medium in which the
sample or observation was taken
or made. This should be from
the SampleMediumCV
controlled vocabulary table.
Text value indicating what type
of data value is being recorded.
This should be from the
ValueTypeCV controlled
vocabulary table.
Value that indicates whether the
data values are from a regularly
sampled time series.
Numerical value that indicates
the time support (or temporal
footprint) of the data values. 0 is
used to indicate data values that
are instantaneous. Other values
indicate the time over which the
data values are implicitly or
explicitly averaged or
aggregated.
Integer identifier that references
the record in the Units table
giving the Units of the time
support. If TimeSupport is 0,
indicating an instantaneous
observation, a unit needs to still
be given for completeness,
although it is somewhat
arbitrary.
Text value that identifies the
data values as one of several
types from the DataTypeCV
controlled vocabulary table.
54
Example
“P”, “N”,
“NO3”
Constraint
M
Foreign key
Default Value
“Not
Applicable”
4
M
Foreign key
“Surface
Water”
“Sediment”
“Fish Tissue”
M
Foreign key
“Unknown”
“Field
Observation”
“Laboratory
Observation”
“Model
Simulation
Results”
“True”
“False”
M
Foreign key
“Unknown”
M
“False”
0, 24
M
0 = Assumes
instantaneous
samples where
no other
information is
available
4
M
Foreign key
103 = hours
“Continuous”
“Sporadic”
“Cumulative”
“Incremental”
“Average”
“Minimum”
“Maximum”
“Constant
Over Interval”
“Categorical”
M
Foreign key
“Unknown”
Field Name
GeneralCategory
Data Type
Text (255)
Description
General category of the data
values from the
GeneralCategoryCV controlled
vocabulary table.
NoDataValue
Real
Numeric value used to encode
no data values for this variable.
Example
“Climate”
“Water
Quality”
“Groundwater
Quality”
-9999
Constraint
M
Foreign key
Default Value
“Unknown”
M
-9999
The following rules and best practices should be followed when populating this table:
1. The VariableID field is the primary key, must be a unique integer, and cannot be NULL.
This field should be implemented as an auto number/identity field.
2. The VariableCode field must be unique and serves as an alternate key for this table.
Variable codes can be arbitrary, or they can use an organized system. VaraibleCodes
cannot contain any characters other than A-Z (case insensitive), 0-9, period “.”, dash “-“,
and underscore “_”.
3. The VariableName field must reference a valid Term from the VariableNameCV
controlled vocabulary table.
4. The Speciation field must reference a valid Term from the SpeciationCV controlled
vocabulary table. A default value of “Not Applicable” is used where speciation does not
apply. If the speciation is unknown, a value of “Unknown” can be used.
5. The VariableUnitsID field must reference a valid UnitsID from the UnitsTable controlled
vocabulary table.
6. Only terms from the SampleMediumCV table can be used to populate the
SampleMedium field. A default value of “Unknown” is used where the sample medium
is unknown.
7. Only terms from the ValueTypeCV table can be used to populate the ValueType field. A
default value of “Unknown” is used where the value type is unknown.
8. The default for the TimeSupport field is 0. This corresponds to instantaneous values. If
the TimeSupport field is set to a value other than 0, an appropriate TimeUnitsID must be
specified. The TimeUnitsID field can only reference valid UnitsID values from the Units
controlled vocabulary table. If the TimeSupport field is set to 0, any time units can be
used (i.e., seconds, minutes, hours, etc.), however a default value of 103 has been used,
which corresponds with hours.
9. Only terms from the DataTypeCV table can be used to populated the DataType field. A
default value of “Unknown” can be used where the data type is unknown.
10. Only terms from the GeneralCategoryCV table can be used to populate the
GeneralCategory field. A default value of “Unknown” can be used where the general
category is unknown.
11. The NoDataValue should be set such that it will never conflict with a real observation
value. For example a NoDataValue of -9999 is valid for water temperature because we
would never expect to measure a water temperature of -9999. The default value for this
field is -9999.
55
Table: VerticalDatumCV
The VerticalDatumCV table contains the controlled vocabulary for the VerticalDatum field in
the Sites table.
Field Name
Term
Data Type
Text (255)
Description
Controlled vocabulary for
VerticalDatum.
Examples
“NAVD88”
Definition
Text (unlimited)
Definition of the VerticalDatum
controlled vocabulary. The definition is
optional if the term is self explanatory.
“North
American
Vertical
Datum of
1988”
Constraint
M
Unique
Primary key
Cannot
contain tab,
line feed, or
carriage
return
characters
O
This table is pre-populated within the ODM. Changes to this controlled vocabulary can be
requested at http://water.usu.edu/cuahsi/odm/.
56
Appendix B. Data Versioning Within ODM
The main text of this document focuses on how ODM is structured to store observations data. It
does not address how to manage editing data stored within ODM. Software applications based
on ODM will have functionality that will allow data managers and database administrators to
modify, delete, change, or otherwise make edits to data stored within ODM. In addition, these
software tools will provide functionality to create derived datasets, or datasets that are calculated
or derived from data already stored in ODM (i.e., calculate a time series of discharge from a time
series of stage, or calculate a time series of daily average temperature from a time series of
hourly observations). The purpose of this appendix is to clarify how data editing and versioning
can be managed within the ODM schema.
Data Series Defined
In order to fully grasp the concepts that follow, the idea of a “data series” in the context of ODM
must be clarified. A “data series” is an organizing principle that is present in the ODM. A data
series consists of all of the data values associated with a unique site, variable, method, source,
and quality control level combination. An example of the full specification for a data series is:
“all of the raw unchecked (QualityControlLevel) water temperature (Variable) values measured
in the Logan River near Logan, UT (Site) using a field temperature sensor (method) by Utah
State University (Source).” Each record in the SeriesCatalog table of ODM represents a unique
data series.
Rules for Editing and Deriving Data Series in ODM
The following rules are suggested so that versioning of and edits to data series can be managed
within the ODM schema. Software applications that work with ODM should follow these rules.
These rules are based on the default set of Quality Control Levels that are distributed with the
ODM blank schema.
1. Data versioning should be done at the data series level – Within ODM, the concept of
data versioning is related to the quality control level. Quality control level is a data series
level attribute, and as such, changes to the quality control level should occur at the data
series level rather than at the individual value level. For example, if an investigator
wished to create a quality controlled Level 1 data series from a raw Level 0 data series,
he/she should first make a copy of the raw Level 0 data series and then perform any edits
and adjustments required in the quality control process to the copy. The edited copy then
becomes the Level 1 data series, and the Level 0 data series is preserved intact.
2. Data series with a QualityControlLevelCode of 0 cannot be edited – Level 0 data series
represent raw data from sensors (i.e., stage measured by a water level recorder) or other
products derived from raw data (i.e., discharge that is programmatically derived from
stage before the stage values have been quality controlled). By definition, Level 0 data
have not been quality controlled and may contain significant errors and bad values.
However, Level 0 data series represent the source from which all other derived data
series are based, and as such should remain intact for archive purposes. Level 0 data
series should not be used for analysis unless no other adequate options are available, and
57
3.
4.
5.
6.
only if the user is aware that the data are raw. Level 0 data series can be removed
entirely from the database, but only by removing the entire data series.
Only one QualityControlLevel 0 data series can exist for a Site, Variable, and Method
combination – Only one raw data series for a Site, Variable, and Method combination can
exist within an ODM database. If multiple sensors are measuring the same variable at the
same site, the method description would have to distinguish between the two.
Only one QualityControlLevel 1 data series can exist for each Site, Variable, and Method
combination – Once a Level 0 data series has been loaded to the database, a Level 1 data
series can be “derived” from that Level 0 data series. This is done by first making a copy
of the Level 0 data series, second changing the QualityControlLevel of the copy to 1, and
last doing any necessary filtering or editing required so that the Level 1 data series is
acceptable as quality controlled. In most cases, the majority of the values within a Level
0 data series and its corresponding Level 1 data series will remain the same. However,
where instruments malfunction or other conditions are present that affect the raw data
values, Level 0 values may be deleted, adjusted, or otherwise edited in creating the Level
1 data series.
Any edits to a data series are saved to that data series – Level 0 data cannot be edited.
With Levels 1 or higher, however, software applications should be allowed to edit and
delete values. Each time an edit is made, the result should overwrite the previous value
within a data series. In other words, edits should not create new data series, they should
modify an existing one. This will be true even where edits are done within multiple
editing sessions. The editing software should record the method or basis for any data
edits in appropriate method records.
Data series of Level 2 or higher can only be created from data series of Level 1 or higher
– Derived data series of Level 2 or higher can only be created from data series of Level 1
or higher. If a user wishes to create a derived data series from a Level 0 data series (such
as discharge from raw, unchecked stage values) that derived data series would also be
Level 0.
58
Download