The Role of Data and Metadata Archives in
Environmental Monitoring and Research
Programs 1
William K. Michener2
Abstract: Environmental monitoring and research programs depend on being able to store, retrieve, and understand data that are
collected over long periods. However, long-term accessibility of
high quality data is hindered by "data entropy," the natural tendency for data to degrade in information content over time. Two
approaches are necessary to impede data entropy: developing comprehensive documentation (i.e., metadata) for the data and contributing both the data and metadata to secure data archives. In
this chapter, I discuss data archives, identify metadata content and
format standards relevant to spatial and non-geospatial data, and
present examples of several North American data archives. The
long-term, integrated inventory and monitoring of forest ecosystem
resources depend on the availability of high quality metadata and
archives for supporting data and metadata storage and retrieval.
Environmental monitoring and research programs are
increasingly attempting to address questions that require
long-term, broad-scale, and multi-disciplinary data. The
ultimate success of such programs depends on being able
to store, retrieve, and understand data that are collected
over long time periods (Michener et al. 1994). Long-term
accessibility of high quality data is hindered by "data entropy," the natural tendency for data to degrade in information content over time (Michener et al. 1997). Two complementary approaches are necessary to impede data entropy:
developing comprehensive documentation (i.e., metadata)
for the data and contributing both the data and metadata to
secure data archives. In this chapter, I identify metadata
content and format standards relevant to spatial and non-geospatial data, discuss the role of data archives, and present
examples of several North American data archives.
Metadata
Metadata are the information necessary to understand
and effectively use data, including documentation of the
data set contents, context, quality, structure, and accessibility. Ideally, metadata describe the who, what, when, where,
and how about every aspect of the data. For instance,
metadata should address the five basic questions that
normally arise when a scientist attempts to acquire and use
a specific data set: (1) What relevant data exist?; (2) Why
were the data collected and are they suitable for a particular
need?; (3) How can the data be obtained?; (4) How are the
data organized?; and (5) What additional information is
available to facilitate data use and interpretation?
Investing adequate time and energy into metadata development is beneficial for several reasons (Michener et al.
1997). First, data entropy is delayed and data longevity
increases. Second, data sharing is facilitated. Scientists
often find that data that were previously collected for a
particular purpose can later be 'mined' to answer new
questions that are posed by themselves and others. Metadata
that describe sampling and analytical procedures, data
quality, and data set structure facilitate subsequent analyses and data interpretation. Furthermore, well-documented
data can be used to expand the scale of scientific inquiry.
Many of the questions now being addressed by environmental scientists require more data than could be realistically
collected, managed, and analyzed by a single individual.
Consequently, scientists increasingly rely upon data that
were collected by other scientists for different purposes.
Metadata are receiving increased attention from the scientific community. For example, considerable resources
have been devoted to developing, adopting, and implementing spatial metadata standards. In 1994, the U.S. Federal
Geographic Data Committee completed the Content Standards for Digital Geospatial Metadata (FGDC 1994). These
standards were developed as part of the National Biological
Information Infrastructure (NBII) in the United States and
attempt to standardize geographical data collected by the
Federal Government. The Content Standards contain more
than 200 metadata fields that are categorized into seven
classes of metadata descriptors: identification, data quality, spatial data organization, spatial reference, entity and
attribute, distribution, and metadata. Additional extensions to the Content Standards relevant to vegetation classification, cultural, demographic, and other types of data
are in development. International (ISO) geospatial metadata standards, based on the Content Standards, are also
Metadata standards for non-geospatial data typically
do not exist in any accepted format beyond individual
studies, projects, or organizations. Environmental studies
often require large amounts of diverse data related to the
chemical and physical environment, as well as the organisms, populations, communities, and ecosystems comprising
the biotic portion of the environment. It is, therefore, unlikely that a single metadata standard, no matter how
comprehensive, could encompass all environmental data.
Consequently, a generic set of non-geospatial metadata
descriptors for environmental data was recently introduced
(Michener et al. 1997). The metadata descriptors were proposed as a template that could serve as the basis for more
refined discipline-specific metadata guidelines. Five classes
of metadata descriptors were included:
1. Data set descriptors: basic attributes of the data set
(e.g., data set title, scientists, abstract, and keywords)
2. Research origin descriptors: descriptions of the
research leading to a particular data set (e.g., hypotheses, site characteristics, experimental design, and
research methods)
3. Data set status and accessibility descriptors: the status and accessibility of the data set and associated metadata
4. Data structural descriptors: all attributes related
to the physical structure of the data
5. Supplemental descriptors: related information that
may be necessary for data interpretation, publishing
the data set, or supporting an audit of the data set.
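The five descriptor classes above can be pictured as a simple record template. The following is a minimal sketch in Python, not a format defined by Michener et al. (1997): the class name `MetadataRecord`, its attribute names, the helper `empty_classes`, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MetadataRecord:
    """Hypothetical template: one dictionary per descriptor class."""
    data_set: dict = field(default_factory=dict)                  # title, scientists, abstract, keywords
    research_origin: dict = field(default_factory=dict)           # hypotheses, site, design, methods
    status_and_accessibility: dict = field(default_factory=dict)  # status and access conditions
    data_structure: dict = field(default_factory=dict)            # physical structure of the data
    supplemental: dict = field(default_factory=dict)              # related information, audit support


def empty_classes(record):
    """Return the names of descriptor classes that have no entries yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative (invented) record with only two classes filled in.
record = MetadataRecord(
    data_set={"title": "Example forest plot survey", "keywords": ["forest", "inventory"]},
    research_origin={"methods": "Fixed-radius plots"},
)
print(empty_classes(record))
```

A check of this kind could flag which descriptor classes still need attention before a data set is contributed to an archive; a discipline-specific guideline would presumably refine each dictionary into named fields.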
The lack of software tools that can facilitate metadata
entry has been a major impediment to metadata implementation. Metadata development is frequently an ad hoc
process performed by the data originators, although some
organizations have independently developed word processing or database forms that can be filled in to meet in-house
requirements. One notable exception is the NBII
MetaMaker, a PC-based program that was developed to
support geospatial metadata generation in a format conforming to Federal Geographic Data Committee (FGDC
1994) guidelines. For the most recent version of this program, visit the NBII web site. Geographic information system software for spatial
metadata, as well as other software tools that support non-geospatial metadata generation, should be forthcoming.
Data Archives
Data Sources and the Global Change Master Directory
The Global Change Master Directory (GCMD) is a comprehensive directory of data set
descriptions that are particularly relevant to global change
research. The GCMD is maintained by the American node of
the Committee on Earth Observation Satellites' International Directory Network (CEOS IDN) and is operated by
NASA's Global Change Data Center at NASA/Goddard
Space Flight Center. The GCMD includes descriptions of
data sets covering climate change, the biosphere (e.g., land
use and forest cover), hydrosphere and oceans (e.g., water
quality, sea surface temperature), geology (e.g., soils), geography, and human dimensions of global change (e.g., disease
outbreaks, resource inventories) (Table 1).
Table 1. Types of data in the Global Change Master Directory.
General classes     Data types
[label missing]     Aerosols, Air quality, Altitude, Chemistry, Phenomena,
                    Pressure, Temperature, Water vapor, Winds,
                    Clouds, Precipitation, Radiation budget
[label missing]     Aquatic habitat, Ecological dynamics, Fungi, Microbiota,
                    Terrestrial habitat, Vegetation, Wetlands, Zoology
[label missing]     Snow, Ice, Sea ice
[label missing]     Ground water, Snow, Ice, Surface water, Water quality
Land Surface        Erosion, Sedimentation, Landscape, Temperature, Land
                    use/land cover, Soils, Radiative properties, Topography
Human Dimensions    Attitudes, Preferences, Behaviors, Boundaries,
                    Economic resources, Environmental effects,
                    Population, Food resources, Human health,
[label missing]     Coastal processes, Marine geophysics, Salinity, Marine
                    sediments, Chemistry, Circulation, Winds, Heat
                    budget, Tides, Pressure, Temperature, Waves
[label missing]     Geologic time, Ice core records, Land records, Ocean
                    records, Lake records
[label missing]     Gamma ray, Infrared, Microwave, Radar, Radio,
                    Ultraviolet, Visible, X-ray
Solar Physics       Solar activity, Solar energetic particles
Solid Earth         Geochemistry, Geodetics, Geomagnetism, Geothermal,
                    Natural resources, Rocks/minerals, Seismology,
The GCMD employs a user-friendly search interface that
allows individuals to easily search for specific data. Resulting metadata records provide information on the nature of
the data (e.g., parameters measured, geographic location)
and where the data are stored. The actual data and metadata "pointed to" in the GCMD are stored by universities,
government agencies, and other organizations in locations
that are variously referred to as data repositories, digital
libraries, data clearinghouses, data centers, or data archives. Some data sources that may be relevant to forest
scientists are presented in Table 2. Especially comprehensive data sources include the distributed active archive
centers (DAACs) that are funded as part of the U.S. National
Aeronautics and Space Administration's (NASA) Earth
Observing System (EOS) program.
Data Archives
A data archive is a collection of digital data sets and
metadata that are organized and stored so that users can
easily locate, acquire, and use data that meet a particular
objective. Furthermore, data in an archive are generally
stored in multiple locations so they are secure against
natural and anthropogenic disasters. With the primary
exception of the Global Change Master Directory, which acts
as a pointer to data that reside elsewhere, the online data
centers listed in Table 2 qualify as data archives. Many of the
data archives focus on a restricted set of themes. One
example of a data archive with a specific focus is the National
Climatic Data Center (NCDC), which was established in
1951 by the U.S. National Oceanic and Atmospheric
Administration. The NCDC has one of the largest environmental data archives in relation to length (data date to the
1800s), volume (55 gigabytes are added daily), and users
(more than 170,000 requests annually).
A major objective for every data archive, as well as institutional data centers, is the secure storage of the data and
metadata. Many different approaches to data storage
can be taken depending on the size of the data holdings,
anticipated rate of access by users, and other factors. For
example, a relatively small volume of data can be stored on
a single hard disk and made available via the World Wide
Web. In this case, some form of manual or automated
backup scheme will be required. Moderate data holdings
(e.g., >10-100 gigabytes) may be stored online in a series of
disks (e.g., redundant array of independent disks, RAID)
that can be configured to support various levels of redundancy. Extremely large data holdings, such as those maintained by the NCDC in the previous example, may be stored
in a large mass storage system consisting of multiple RAID
units and automated tape libraries. In this latter case, data
are either online or near-line, and data backup may be
performed offsite (at a mirror location).
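The storage options described above can be summarized as a rough decision rule. The sketch below is only illustrative: the function name `suggest_storage` is hypothetical, the cutoffs at 10 and 100 gigabytes are assumptions drawn from the "moderate" range quoted in the text, and a real decision would also weigh the anticipated rate of access and other factors.

```python
def suggest_storage(volume_gb: float) -> str:
    """Map a data volume in gigabytes to one of the storage approaches in the text.

    The 10 GB and 100 GB cutoffs are assumptions for illustration only.
    """
    if volume_gb < 0:
        raise ValueError("volume_gb must be non-negative")
    if volume_gb <= 10:
        # Small holdings: one disk served over the Web, plus a backup scheme.
        return "single hard disk with manual or automated backup"
    if volume_gb <= 100:
        # Moderate holdings: online disk array configured for redundancy.
        return "online RAID configured for redundancy"
    # Very large holdings: online or near-line storage with offsite backup.
    return "mass storage system (RAID units and tape libraries) with offsite mirror"


for volume in (2, 50, 5000):
    print(volume, "GB:", suggest_storage(volume))
```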
One of the primary benefits of focusing on a particular
theme is the ability of the data archive to easily add value to
data. For instance, the Carbon Dioxide Information Analysis Center (CDIAC), which is funded by the U.S. Department
of Energy, emphasizes the value-added component of data
sets that results from the participation of scientists and
users in metadata preparation, rigorous QA/QC processing,
peer-review of data and metadata, beta testing of data sets
prior to general release, and incorporation of user feedback
into its data packages. In addition to extensive metadata,
CDIAC data packages contain examples of data applications and copies of important associated literature. By
focusing on a few types of data, a data archive can establish
and maintain high standards (e.g., QA/QC) and develop the
requisite pool of experts for data and metadata peer-review.
The process of data submission will vary from one data
archive to another. Different data archives may have different data structure, QA/QC, and metadata content standards that must be adhered to. Frequently, data and metadata
are reviewed by data archive staff for internal consistency
and completeness. In some cases, the quality assurance
procedures that are documented in the metadata will be
reviewed. After going through the review process, data and
metadata may then be incorporated into the archive database and made publicly accessible.
Table 2. Online sources for environmental data.
Data center                                  Data center URL                 Data center focus
Alaska SAR Facility-DAAC                                                     Synthetic aperture radar data.
Carbon Dioxide Information Analysis Center   http://www.cdiac.esd.ornl.gov   Greenhouse gases and climate change.
EROS Data Center-DAAC                                                        Land processes and characteristics data.
Langley Research Center-DAAC                                                 Radiation budget, clouds, aerosols, and
                                                                             tropospheric chemistry.
Marshall Space Flight Center-DAAC                                            Global hydrology and climate data.
National Climatic Data Center                                                Climate data.
National Snow and Ice Data Center-DAAC                                       Snow and ice data.
Oak Ridge National Lab-DAAC                                                  Biogeochemical dynamics (i.e., biological,
                                                                             geological, and chemical components of
                                                                             the Earth's environment) data.
The Consortium for International                                             Integrated social and natural science data.
Earth Science Information
Network - Socioeconomic Data
and Applications Center
National Oceanic and Atmospheric                                             Real-time and historical satellite data.
Administration - Satellite Active Archive
Following submission to an archive, the availability of a
data set may be "publicized" in a data directory like the
Global Change Master Directory (GCMD). Listing a data set
in the GCMD is relatively straightforward and primarily
requires filling out a form, known as a DIF, that describes
the data. A DIF, an acronym for Directory Interchange
Format, consists of several fields that describe the data
and essentially represent a subset of the metadata for that
particular data set (Table 3). Several fields are required
including the entry id, title, parameters, originating center,
datacenter, and summary. Other fields that provide location
keywords and describe the spatial and temporal coverage of
the data are deemed critical for data set selection (i.e.,
searching) and user understanding of the data. The GCMD
provides a number of guidelines to facilitate DIF writing
such as recommended keywords and definitions of what
constitutes a good title and summary. Furthermore, there
are several free, downloadable tools that can be used to
author DIFs for UNIX (using the Emacs editor), PCs (both
DOS and Microsoft Windows), and the World Wide Web.
Once a DIF has been entered, the GCMD can perform very effective
searches to match users with data that meet their particular objectives.
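The required DIF fields named above lend themselves to a simple completeness check before submission. The following is a minimal sketch that assumes a DIF draft is held as a Python dictionary; the key spellings, the helper names `is_blank` and `missing_required_fields`, and the example values are hypothetical and do not reproduce the actual DIF syntax or any GCMD authoring tool.

```python
# Required fields as listed in the text; the key spellings are assumptions.
REQUIRED_DIF_FIELDS = (
    "entry_id",
    "title",
    "parameters",
    "originating_center",
    "data_center",
    "summary",
)


def is_blank(value):
    """Treat None, whitespace-only strings, and empty collections as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_required_fields(dif):
    """Return the required field names that are absent or blank in a draft."""
    return [name for name in REQUIRED_DIF_FIELDS if is_blank(dif.get(name))]


# Invented draft with two fields filled, one empty list, and one blank string.
draft = {
    "entry_id": "EXAMPLE_FOREST_01",
    "title": "Example forest inventory plots",
    "parameters": [],
    "summary": "   ",
}
print(missing_required_fields(draft))
```

The same pattern could be extended to the fields described as critical for searching, such as location keywords and spatial and temporal coverage, by reporting them as warnings rather than errors.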
Table 3. Summary of Global Change Master Directory (GCMD)
online data set description.
Data set title
Short description of the data set.
General attribute information
Data source (i.e., digitized from paper maps,
derived from satellite data, etc.)
Temporal coverage
• start and stop date
Geographic coverage
• southwest and northeast extent
Location keywords
• category
• topic
• term
• variable
• detailed variable
General keywords
Entry ID/Originating center
Data center contact
Storage media
Technical contact
Directory Interchange Format (DIF) author
Data set citation
International Directory Network (IDN) node
Revision date
Review date
Conclusions
The long-term, integrated inventory and monitoring of
forest ecosystem resources will depend on adopting or developing relevant metadata standards and developing new
federated approaches to data storage and retrieval. In
many cases, organizations will benefit from adopting existing metadata standards (e.g., the geospatial metadata
content standard developed by the FGDC (1994) or the
related ISO standard). For other types of data (i.e., non-geospatial), new metadata content standards must be developed. Ideally, these standardization efforts would benefit
from international collaboration so that data and metadata
can be shared across political borders to address common
problems. New national and international data archives
are also required to support the long-term, integrated inventory and monitoring of forest ecosystem resources. Typically, the data that are required to address critical questions
related to forest ecosystem resources at regional and continental scales are either not available in digital form (e.g.,
due to lack of data archives) or are inadequate in their
present form (e.g., different data structures, lack of quality
assurance, insufficient metadata).
