The Role of Data and Metadata Archives in Environmental Monitoring and Research Programs¹

William K. Michener²

Abstract: Environmental monitoring and research programs depend on being able to store, retrieve, and understand data that are collected over long periods. However, long-term accessibility of high quality data is hindered by "data entropy," the natural tendency for data to degrade in information content over time. Two approaches are necessary to impede data entropy: developing comprehensive documentation (i.e., metadata) for the data and contributing both the data and metadata to secure data archives. In this chapter, I discuss data archives, identify metadata content and format standards relevant to spatial and non-geospatial data, and present examples of several North American data archives. The long-term, integrated inventory and monitoring of forest ecosystem resources depend on the availability of high quality metadata and archives for supporting data and metadata storage and retrieval.

Environmental monitoring and research programs increasingly attempt to address questions that require long-term, broad-scale, and multi-disciplinary data. The ultimate success of such programs depends on being able to store, retrieve, and understand data that are collected over long time periods (Michener et al. 1994). Long-term accessibility of high quality data is hindered by "data entropy," the natural tendency for data to degrade in information content over time (Michener et al. 1997). Two complementary approaches are necessary to impede data entropy: developing comprehensive documentation (i.e., metadata) for the data and contributing both the data and metadata to secure data archives.
In this chapter, I identify metadata content and format standards relevant to spatial and non-geospatial data, discuss the role of data archives, and present examples of several North American data archives.

¹ Paper presented at the North American Science Symposium: Toward a Unified Framework for Inventorying and Monitoring Forest Ecosystem Resources, Guadalajara, Mexico, November 1-6, 1998.
² William Michener is Associate Scientist, Joseph W. Jones Ecological Research Center, located at Newton, GA. For correspondence: Dr. William K. Michener, JW Jones Ecological Research Center, Route 2, Box 2324, Newton, Georgia 31770, USA. Telephone: 912-734-4706. Facsimile: 912-734-4707. E-mail address: wmichene@jonesctr.org

USDA Forest Service Proceedings RMRS-P-12. 1999

Metadata
-----------------------------------

Metadata are the information necessary to understand and effectively use data, including documentation of the data set contents, context, quality, structure, and accessibility. Ideally, metadata describe the who, what, when, where, and how of every aspect of the data. For instance, metadata should address the five basic questions that normally arise when a scientist attempts to acquire and use a specific data set: (1) What relevant data exist? (2) Why were the data collected and are they suitable for a particular need? (3) How can the data be obtained? (4) How are the data organized? (5) What additional information is available to facilitate data use and interpretation?

Investing adequate time and energy in metadata development is beneficial for several reasons (Michener et al. 1997). First, data entropy is delayed and data longevity increases. Second, data sharing is facilitated. Scientists often find that data that were previously collected for a particular purpose can later be "mined" to answer new questions that are posed by themselves and others.
Metadata that describe sampling and analytical procedures, data quality, and data set structure facilitate subsequent analyses and data interpretation. Furthermore, well-documented data can be used to expand the scale of scientific inquiry. Many of the questions now being addressed by environmental scientists require more data than could be realistically collected, managed, and analyzed by a single individual. Consequently, scientists increasingly rely upon data that were collected by other scientists for different purposes.

Metadata are receiving increased attention from the scientific community. For example, considerable resources have been devoted to developing, adopting, and implementing spatial metadata standards. In 1994, the U.S. Federal Geographic Data Committee completed the Content Standards for Digital Geospatial Metadata (FGDC 1994). These standards were developed as part of the National Biological Information Infrastructure (NBII) in the United States and attempt to standardize geographical data collected by the Federal Government. The Content Standards contain more than 200 metadata fields that are categorized into seven classes of metadata descriptors: identification, data quality, spatial data organization, spatial reference, entity and attribute, distribution, and metadata reference. Additional extensions to the Content Standards relevant to vegetation classification, cultural, demographic, and other types of data are in development. International (ISO) geospatial metadata standards, based on the Content Standards, are also forthcoming.

Metadata standards for non-geospatial data typically do not exist in any accepted format beyond individual studies, projects, or organizations. Environmental studies often require large amounts of diverse data related to the chemical and physical environment, as well as the organisms, populations, communities, and ecosystems comprising the biotic portion of the environment.
It is, therefore, unlikely that a single metadata standard, no matter how comprehensive, could encompass all environmental data. Consequently, a generic set of non-geospatial metadata descriptors for environmental data was recently introduced (Michener et al. 1997). The metadata descriptors were proposed as a template that could serve as the basis for more refined discipline-specific metadata guidelines. Five classes of metadata descriptors were included:

1. Data set descriptors: basic attributes of the data set (e.g., data set title, scientists, abstract, and keywords)
2. Research origin descriptors: descriptions of the research leading to a particular data set (e.g., hypotheses, site characteristics, experimental design, and research methods)
3. Data set status and accessibility descriptors: the status and accessibility of the data set and associated metadata
4. Data structural descriptors: all attributes related to the physical structure of the data
5. Supplemental descriptors: related information that may be necessary for data interpretation, publishing the data set, or supporting an audit of the data set.

The lack of software tools that can facilitate metadata entry has been a major impediment to metadata implementation. Metadata development is frequently an ad hoc process performed by the data originators, although some organizations have independently developed word processing or database forms that can be filled in to meet in-house requirements. One notable exception is the NBII MetaMaker, a PC-based program that was developed to support geospatial metadata generation in a format conforming to Federal Geographic Data Committee (FGDC 1994) guidelines. For the most recent version of this program, visit the NBII web site (http://www.nbii.gov). Commercial geographic information system software for spatial metadata, as well as other software tools that support non-geospatial metadata generation, should be forthcoming.
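
As an illustration only, the five descriptor classes listed above could be held in a small template that reports which classes have not yet been documented. The sketch below is not a schema from Michener et al. (1997) or from any metadata tool; the class name `MetadataRecord`, its attribute names, and the example values are all assumptions made for this example.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class MetadataRecord:
    """Hypothetical container with one slot per descriptor class.

    The attribute names are invented for this sketch; they only mirror
    the five classes named in the text.
    """
    data_set: dict = field(default_factory=dict)              # title, scientists, abstract, keywords
    research_origin: dict = field(default_factory=dict)       # hypotheses, site, design, methods
    status_accessibility: dict = field(default_factory=dict)  # status and access conditions
    data_structure: dict = field(default_factory=dict)        # physical structure of the data
    supplemental: dict = field(default_factory=dict)          # anything else needed to interpret or audit

    def missing_classes(self) -> list:
        """Return the names of descriptor classes that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Invented example values: only two of the five classes are filled in.
record = MetadataRecord(
    data_set={"title": "Example forest plot inventory", "keywords": ["forest", "inventory"]},
    data_structure={"file_format": "ASCII, comma-delimited", "variables": ["plot_id", "dbh_cm"]},
)

print(record.missing_classes())
# ['research_origin', 'status_accessibility', 'supplemental']
```

A check of this kind does not judge the quality of the documentation; it only flags descriptor classes that were skipped entirely.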

Data Archives
-----------------------------------

Data Sources and the Global Change Master Directory

The Global Change Master Directory (GCMD; http://gcmd.gsfc.nasa.gov) is a comprehensive directory of data set descriptions that are particularly relevant to global change research. The GCMD is maintained by the American node of the Committee on Earth Observation Satellites' International Directory Network (CEOS IDN) and is operated by NASA's Global Change Data Center at NASA/Goddard Space Flight Center. The GCMD includes descriptions of data sets covering climate change, the biosphere (e.g., land use and forest cover), hydrosphere and oceans (e.g., water quality, sea surface temperature), geology (e.g., soils), geography, and human dimensions of global change (e.g., disease outbreaks, resource inventories) (Table 1).

The GCMD employs a user-friendly search interface that allows individuals to easily search for specific data. Resulting metadata records provide information on the nature of the data (e.g., parameters measured, geographic location) and where the data are stored.

Table 1. Types of data in the Global Change Master Directory.

General class     Data types
----------------  ------------------------------------------------------------
Atmosphere        Aerosols, Air quality, Altitude, Chemistry, Phenomena,
                  Pressure, Temperature, Water vapor, Winds, Clouds,
                  Precipitation, Radiation budget
Biosphere         Aquatic habitat, Ecological dynamics, Fungi, Microbiota,
                  Terrestrial habitat, Vegetation, Wetlands, Zoology
Cryosphere        Snow, Ice, Sea ice
Hydrosphere       Ground water, Snow, Ice, Surface water, Water quality
Land Surface      Erosion, Sedimentation, Landscape, Temperature,
                  Land use/land cover, Soils, Radiative properties, Topography
Human Dimensions  Attitudes, Preferences, Behaviors, Boundaries,
                  Economic resources, Environmental effects, Population,
                  Food resources, Human health, Infrastructure
Oceans            Coastal processes, Marine geophysics, Salinity,
                  Marine sediments, Chemistry, Circulation, Winds,
                  Heat budget, Tides, Pressure, Temperature, Waves
Paleoclimate      Geologic time, Ice core records, Land records,
                  Ocean records, Lake records
Radiance          Gamma ray, Infrared, Microwave, Radar, Radio, Ultraviolet,
                  Visible, X-ray
Solar Physics     Solar activity, Solar energetic particles
Solid Earth       Geochemistry, Geodetics, Geomagnetism, Geothermal,
                  Natural resources, Rocks/minerals, Seismology, Tectonics

The actual data and metadata "pointed to" in the GCMD are stored by universities, government agencies, and other organizations in locations that are variously referred to as data repositories, digital libraries, data clearinghouses, data centers, or data archives. Some data sources that may be relevant to forest scientists are presented in Table 2. Especially comprehensive data sources include the distributed active archive centers (DAACs) that are funded as part of the U.S. National Aeronautics and Space Administration's (NASA) Earth Observing System (EOS) program.

Data Archives

A data archive is a collection of digital data sets and metadata that are organized and stored so that users can easily locate, acquire, and use data that meet a particular objective. Furthermore, data in an archive are generally stored in multiple locations so they are secure against natural and anthropogenic disasters. With the primary exception of the Global Change Master Directory, which acts as a pointer to data that reside elsewhere, the online data centers listed in Table 2 qualify as data archives.

Many of the data archives focus on a restricted set of themes. One example of a data archive with a specific focus is the National Climatic Data Center (NCDC), which was established in 1951 by the U.S. National Oceanic and Atmospheric Administration. The NCDC has one of the largest environmental data archives in relation to length (data date to the 1800s), volume (55 gigabytes are added daily), and users (more than 170,000 requests annually).

A major objective for every data archive, as well as institutional data centers, is the secure storage of the data and metadata. Many different approaches to data storage can be taken depending on the size of the data holdings, anticipated rate of access by users, and other factors. For example, a relatively small volume of data can be stored on a single hard disk and made available via the World Wide Web. In this case, some form of manual or automated backup scheme will be required. Moderate data holdings (e.g., >10-100 gigabytes) may be stored online in a series of disks (e.g., redundant array of independent disks, RAID) that can be configured to support various levels of redundancy. Extremely large data holdings, such as those maintained by the NCDC in the previous example, may be stored in a large mass storage system consisting of multiple RAID units and automated tape libraries.
In this latter case, data are either online or near-line, and data backup may be performed offsite (at a mirror location).

One of the primary benefits of focusing on a particular theme is the ability of the data archive to easily add value to data. For instance, the Carbon Dioxide Information Analysis Center (CDIAC), which is funded by the U.S. Department of Energy, emphasizes the value-added component of data sets that results from the participation of scientists and users in metadata preparation, rigorous QA/QC processing, peer-review of data and metadata, beta testing of data sets prior to general release, and incorporation of user feedback into its data packages. In addition to extensive metadata, CDIAC data packages contain examples of data applications and copies of important associated literature. By focusing on a few types of data, a data archive can establish and maintain high standards (e.g., QA/QC) and develop the requisite pool of experts for data and metadata peer-review.

The process of data submission will vary from one data archive to another. Different data archives may have different data structure, QA/QC, and metadata content standards that must be adhered to. Frequently, data and metadata are reviewed by data archive staff for internal consistency and completeness. In some cases, the quality assurance procedures that are documented in the metadata will be reviewed. After going through the review process, data and metadata may then be incorporated into the archive database and made publicly accessible.

Table 2. Online sources for environmental data.

Data center                               Data center URL                                   Data center focus
----------------------------------------  ------------------------------------------------  ----------------------------------------
Alaska SAR Facility-DAAC                  http://www.asf.alaska.edu                         Synthetic aperture radar data.
Carbon Dioxide Information Analysis       http://www.cdiac.esd.ornl.gov                     Greenhouse gases and climate change.
  Center
EROS Data Center-DAAC                     http://epcwww.cr.usgs.gov/landdaac/landdaac.html  Land processes and characteristics data.
Langley Research Center-DAAC              http://eosweb.larc.nasa.gov                       Radiation budget, clouds, aerosols, and
                                                                                            tropospheric chemistry.
Marshall Space Flight Center-DAAC         http://ghrc.msfc.nasa.gov                         Global hydrology and climate data.
National Climatic Data Center             http://www.ncdc.noaa.gov                          Climate data.
National Snow and Ice Data Center-DAAC    http://eosims.colorado.edu                        Snow and ice data.
Oak Ridge National Lab-DAAC               http://www-eosdis.ornl.gov                        Biogeochemical dynamics (i.e., biological,
                                                                                            geological, and chemical components of
                                                                                            the Earth's environment) data.
The Consortium for International Earth    http://sedac.ciesin.org                           Integrated social and natural science
  Science Information Network -                                                             data.
  Socioeconomic Data and Applications
  Center
National Oceanic and Atmospheric          http://www.saa.noaa.gov/lndex3.html               Real-time and historical satellite data.
  Administration - Satellite Active
  Archive

Following submission to an archive, the availability of a data set may be "publicized" in a data directory like the Global Change Master Directory (GCMD). Listing a data set in the GCMD is relatively straightforward and primarily requires filling out a form, known as a DIF, that describes the data. A DIF, an acronym for Directory Interchange Format, consists of several fields that describe the data and essentially represent a subset of the metadata for that particular data set (Table 3). Several fields are required, including the entry ID, title, parameters, originating center, data center, and summary. Other fields that provide location keywords and describe the spatial and temporal coverage of the data are deemed critical for data set selection (i.e., searching) and user understanding of the data. The GCMD provides a number of guidelines to facilitate DIF writing, such as recommended keywords and definitions of what constitutes a good title and summary.
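
Because the required DIF fields are few, a draft description can be checked for completeness before it is submitted. The following is a minimal sketch under stated assumptions: the dictionary keys are hypothetical spellings of the six required fields named in the text, not the GCMD's actual DIF syntax, and the draft record is invented.

```python
# The six required fields named in the text. The key spellings are
# assumptions made for this sketch, not official DIF field names.
REQUIRED_DIF_FIELDS = (
    "entry_id",
    "title",
    "parameters",
    "originating_center",
    "data_center",
    "summary",
)


def missing_dif_fields(dif: dict) -> list:
    """Return the required fields that are absent or blank in a draft DIF."""
    missing = []
    for name in REQUIRED_DIF_FIELDS:
        value = dif.get(name)
        # Treat None, empty strings/lists, and whitespace-only strings as missing.
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Invented draft: two fields are absent and the summary is blank.
draft = {
    "entry_id": "EXAMPLE_FOREST_PLOTS_01",
    "title": "Example forest plot inventory",
    "parameters": ["Biosphere > Vegetation"],
    "summary": "   ",
}

print(missing_dif_fields(draft))
# ['originating_center', 'data_center', 'summary']
```

The same pattern could be extended to the coverage and location keyword fields, which the text describes as critical for searching but does not list as required.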
Furthermore, there are several free, downloadable tools that can be used to author DIFs for UNIX (using the Emacs editor), PCs (both DOS and Microsoft Windows), and the World Wide Web. Once a DIF has been entered, the GCMD can perform very effective searches to match users with data that meet their particular objectives.

Table 3. Summary of Global Change Master Directory (GCMD) online data set description.

Description                    Contents
-----------------------------  ------------------------------------------------------------
Data set title                 Title
Summary                        Short description of the data set
General attribute information  Resolution
                               Data source (i.e., digitized from paper maps, derived from
                               satellite data, etc.)
Coverage                       Temporal coverage: start and stop date
                               Geographic coverage: southwest and northeast extent
                               Location keywords
Attributes                     Parameters: category, topic, term, variable, detailed variable
                               Discipline/Sub-discipline
                               General keywords
                               Entry ID/Originating center
Distribution                   Data center contact
                               Storage media
Personnel                      Technical contact
                               Directory Interchange Format (DIF) author
Reference                      Data set citation
                               International Directory Network (IDN) node
                               Revision date
                               Review date

Conclusions
-----------------------------------

The long-term, integrated inventory and monitoring of forest ecosystem resources will depend on adopting or developing relevant metadata standards and developing new federated approaches to data storage and retrieval. In many cases, organizations will benefit from adopting existing metadata standards (e.g., the geospatial metadata content standard developed by the FGDC (1994) or the related ISO standard). For other types of data (i.e., non-geospatial), new metadata content standards must be developed. Ideally, these standardization efforts would benefit from international collaboration so that data and metadata can be shared across political borders to address common problems.

New national and international data archives are also required to support the long-term, integrated inventory and monitoring of forest ecosystem resources. Typically, the data that are required to address critical questions related to forest ecosystem resources at regional and continental scales are either not available in digital form (e.g., due to lack of data archives) or are inadequate in their present form (e.g., different data structures, lack of quality assurance, insufficient metadata).

Literature Cited

Federal Geographic Data Committee. 1994. Content standards for digital geospatial metadata. Federal Geographic Data Committee, Washington, DC. http://geochange.er.usgs.gov/pub/tools/metadata/standard/metadata.html.

Michener, W.K., J.W. Brunt, and S.G. Stafford. 1994. Environmental Information Management and Analysis: Ecosystem to Global Scales. Taylor and Francis, Ltd., London, England.

Michener, W.K., J.W. Brunt, J. Helly, T.B. Kirchner, and S.G. Stafford. 1997. Non-geospatial metadata for the ecological sciences. Ecological Applications 7:330-342.