090404_CatalogMetadata

Abstract:
The GEOSS Air Quality Community of Practice (CoP) is developing a metadata system
that will facilitate the discovery, understanding and usage of distributed air quality data.
This collaborative effort consists of preparing a community catalog, GEO portals and
community portals. The community catalog will be harvested through the GEOSS
Common Infrastructure (GCI) and the GEO and community portals will facilitate finding
and distributing AQ datasets. The catalog records contain three types of metadata: (1)
metadata fields common for all datasets, (2) extended dataset specific metadata using ISO
19115 and (3) a link to DataSpaces for additional unstructured, community-contributed
metadata. The common fields for data discovery will be extracted from the OGC
WMS/WCS GetCapabilities file. The ISO 19115 Geographic Information Metadata
standard will be used to extend the metadata to include lineage and information for better
understanding the data. DataSpaces are provided to incorporate the structured metadata
described above as well as further extend the metadata to incorporate user feedback,
discussion, free tags and harvest flexible community-contributed content like papers and
web applications related to the dataset. The value of the DataSpaces comes from the
ability to connect the dataset community: users, mediators and providers. As DataSpaces
evolve and are used more by the community, additional functionality will emerge.
Introduction
Air quality data is normally created for a particular mandated end user. The format of the
data is specific to the use and the metadata that describes the data is focused on a certain
application. Within GEOSS, the idea is growing that a ‘dataset can be used for
many applications and a single application needs many datasets’ (Ref GEOSS).
Standardized data access services allow interoperability in binding to a dataset; however,
the first two steps, publishing and finding a data access service, are still problematic.
Two problems arise in completing these steps: (1) how does one describe the data access
service so that it can be discovered by users, and (2) where does one publish the data access
service so that it can be found? At the start of this project there was no identified set of
queryable fields needed in order to discover Earth observations. This paper describes the
process that was taken to enhance the publishing and finding of Earth observation data
access services. While the paper focuses on air quality datasets, at this point the only thing
that is specific to air quality is the set of test datasets that we have used. This method could be
applied to data access services in any other societal benefit area within GEOSS.
Method:
The development of the metadata records and publication through GEOSS Common
Infrastructure (GCI) was done collaboratively through the Architecture Implementation
Pilot-II of GEOSS. The pilot work was broken into two dimensions (Fig. 1): societal
benefit areas (AQ IT Cast) and transverse technologies (GEOSS AIP Workgroups).
Fig. 1. GEOSS Workgroups
The societal benefit work groups used their domain to test the different steps of the
publish-find-bind process facilitated by GCI. The transverse workgroups worked on
particular technologies and tools specific to publishing, finding or binding (PILOT Ref).
The main transverse workgroup responsible for the publishing and finding of data access
service metadata was the Catalog, Clearinghouse, Registries and Metadata workgroup
(CCRM Ref). This group worked closely with the air quality workgroup in order to create
standard metadata records that could be published through a web accessible folder,
harvested by the clearinghouses and found by AQ users (Fig. 1). The communication was
mostly through weekly telecons, an active list-serv and the Google Sites workspace for
Pilot activities. Currently there are three clearinghouses under development through
USGS, ESRI and Compusult. The coordination of the work with the three clearinghouses
was through a Google Doc spreadsheet that mapped the metadata record to queryable
fields in each clearinghouse.
ISO 19115 Metadata Records
The Air Quality group used the ISO 19115 geographic information metadata standard to
create the metadata records because it was recognized by GEOSS (Ref 10yr Pl) in the
GEOSS Standard Registry (Ref Std. Registry), it is internationally recognized, and the
U.S. is migrating toward the ISO 19115 standard from FGDC with a North American
Profile. The ISO 19115 standard allows not only for discovery metadata to be included, but
also for data lineage, usage and other types of metadata for understanding multiple
applications. The initial focus of metadata creation was on discovery metadata; future
work will include the extension to usage, lineage, and other components of the metadata.
The discovery metadata record is the minimum information a user needs to find the data
and bind to it.
To identify the fields necessary for discovery of the metadata record we started with the
ISO 19115 Core Metadata fields (REF) and the fields that were needed for a
CSW:Record. This fulfilled two goals: to create valid ISO 19115 records, and
to be able to retrieve the records from the clearinghouse, which is expected to provide
a CSW 2.0.2 interface. We then compared this list of required fields to the GEOSS
Clearinghouses' queryable and returnable fields. It was found that none of the
clearinghouses or the GEOSS service registry allowed all of the CSW:Record fields to be
searched. It was also found that the three clearinghouses queried over different fields,
that there was no standard nomenclature for field names, and that key fields for Earth
observation data, such as the temporal extent of the datasets, were missing from at least
one clearinghouse.
https://sites.google.com/site/geosspilot2/air-quality-and-health-working-group/geoclearinghouse-comparison
The key set of queryable fields for all Earth observations seems to be: dataset title,
abstract, geolocation, temporal extent, keywords, service type, metadata file ID and
associated component (catalog information). This allows for text, spatial and
temporal searches. Additional keyword search allows search by measurement platform or
observed phenomena if a common vocabulary is used. Information about the service type
allows filtering by specific service standard, and the associated component links the
metadata record of a given dataset to a larger community catalog. These queryable fields
can be in any metadata format; what matters is that the community agrees this set is
needed to find any Earth observation service.
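The queryable fields listed above can be pictured as a small record type. The following is a minimal sketch only: the class name, attribute names and the `missing_fields` helper are hypothetical and are not taken from ISO 19115, CSW or any clearinghouse.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


@dataclass
class DiscoveryRecord:
    """Illustrative container for the queryable discovery fields (names are assumptions)."""
    title: str = ""
    abstract: str = ""
    # Geolocation as (west, south, east, north) in decimal degrees.
    bounding_box: Optional[Tuple[float, float, float, float]] = None
    # Temporal extent as (begin, end) date strings.
    temporal_extent: Optional[Tuple[str, str]] = None
    keywords: List[str] = field(default_factory=list)
    service_type: str = ""
    file_identifier: str = ""
    # Link to the community catalog the record belongs to.
    associated_component: str = ""


def missing_fields(record: DiscoveryRecord) -> List[str]:
    """Return the names of discovery fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


record = DiscoveryRecord(
    title="Example surface ozone dataset",  # hypothetical dataset
    keywords=["ozone"],
    service_type="WCS",
)
print(missing_fields(record))
```

A check like this could be run before a record is saved, so that records lacking, for example, a temporal extent are flagged before they are harvested.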
With the Earth observation queryable fields plus the ISO 19115 Core Metadata fields we
had a metadata record that could be used for discovery. The next step in the process was
to create the metadata record. We used two tools for this process. First, we used the
INSPIRE metadata generator to create an ISO 19115 record. Then, for the additional ISO
19115 fields that we wanted to include, we used CatMDEdit software to generate the ISO
19139 XML structure needed and inserted that portion of metadata into our growing
ISO 19115 record. As the record grew, we validated it against the ISO 19115
schema.
Publish: Initially it was thought that “publishing” data on the web in any format
was enough for sharing. Then it became apparent that integrating datasets
benefits from standard data access formats, and publication came to mean
exposing a standard data access service on the web. Through the Pilot it became
apparent that just exposing the data access service with a GetCapabilities document
does not provide enough metadata for discovery; additional standard metadata needs to
be published by the provider or distributor for discovery of the service.
The services we are registering are OGC WMS and WCS services. Each has a
GetCapabilities document that includes some metadata and can be
mapped to ISO 19115 fields (Stefano Ref). To further improve the use of GetCapabilities
documents for metadata creation, we organized our data access services by dataset, so
that the general metadata information was dataset specific and each coverage was a
dataset parameter. The table below shows the mapping from the OGC GetCapabilities
documents to the ISO 19115 metadata record. Only a handful of additional fields are
needed to create a valid metadata record, and these can be hard coded or entered at the
time of registration.
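As a rough illustration of pulling discovery fields out of a GetCapabilities document, the sketch below parses a small WMS 1.3.0 capabilities fragment with the Python standard library. The sample XML, its values and the output dictionary keys are invented for the example; the actual mapping used in this work is the one given in the table, which this sketch does not reproduce.

```python
import xml.etree.ElementTree as ET

# Namespace used by WMS 1.3.0 capabilities documents.
WMS_NS = {"wms": "http://www.opengis.net/wms"}

# Hypothetical, heavily trimmed capabilities document for illustration only.
SAMPLE_CAPABILITIES = """
<WMS_Capabilities version="1.3.0" xmlns="http://www.opengis.net/wms">
  <Service>
    <Name>WMS</Name>
    <Title>Example AQ dataset</Title>
    <Abstract>Hypothetical surface ozone observations.</Abstract>
    <KeywordList>
      <Keyword>ozone</Keyword>
      <Keyword>surface</Keyword>
    </KeywordList>
  </Service>
  <Capability>
    <Layer>
      <Title>Example AQ dataset</Title>
      <EX_GeographicBoundingBox>
        <westBoundLongitude>-125.0</westBoundLongitude>
        <eastBoundLongitude>-65.0</eastBoundLongitude>
        <southBoundLatitude>25.0</southBoundLatitude>
        <northBoundLatitude>50.0</northBoundLatitude>
      </EX_GeographicBoundingBox>
    </Layer>
  </Capability>
</WMS_Capabilities>
""".strip()


def capabilities_to_discovery_fields(xml_text):
    """Extract title, abstract, keywords and bounding box from WMS 1.3.0 capabilities."""
    root = ET.fromstring(xml_text)
    service = root.find("wms:Service", WMS_NS)
    bbox = root.find("wms:Capability/wms:Layer/wms:EX_GeographicBoundingBox", WMS_NS)
    corners = ("westBoundLongitude", "southBoundLatitude",
               "eastBoundLongitude", "northBoundLatitude")
    return {
        "title": service.findtext("wms:Title", default="", namespaces=WMS_NS),
        "abstract": service.findtext("wms:Abstract", default="", namespaces=WMS_NS),
        "keywords": [k.text for k in service.findall("wms:KeywordList/wms:Keyword", WMS_NS)],
        # Ordered as (west, south, east, north).
        "bounding_box": tuple(
            float(bbox.findtext("wms:" + name, namespaces=WMS_NS)) for name in corners
        ),
    }


discovery = capabilities_to_discovery_fields(SAMPLE_CAPABILITIES)
print(discovery)
```

Fields that the capabilities document does not carry would still have to be hard coded or supplied at registration, as noted above.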
Figure 2 shows the flow of metadata. Starting with the service provider, the
GetCapabilities document is used to create an ISO 19115 metadata record for the data
access service. The XML document is saved into the community catalog.
Figure 2. Publish-Find-Bind process for data access services from service provider to
user
Community Catalog: The community catalog is a component and service that is
registered in the GEOSS Component and Service Registry. The GEOSS AIP group
considered several options when creating the catalog: Catalog Service for the
Web (CSW), Z39.50 and Web Accessible Folder (WAF). The WAF is the simplest
option since the only criterion is that there is a folder accessible on a server. WAFs are
“agnostic” to what metadata format is used. The Renewable Energy work group in the
AIP-2 and NOAA both used the WAF method for publishing metadata and were
successfully harvested by the ESRI Clearinghouse. Being aware of what
these other groups were doing, through the AIP Plenary telecons and the Google Sites Pilot
workspace, allowed the AQ community to learn faster from others, and we ultimately
implemented this option for our initial community catalog.
The WAF is not a standard service interface, however, and had to be registered as a special
arrangement in the GEOSS Standard Registry. Testing the GEOSS
architecture through the AIP-2 identified the need for the WAF, and the special arrangement
was based on a group agreement within the Pilot.
The drawback of the WAF versus the CSW or Z39.50 interfaces is that the WAF must
be harvested by the clearinghouses and cannot be dynamically queried like the catalog
service interfaces. Since it is harvested, there is a delay before updated metadata is
available for access in the clearinghouse.
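To make the harvesting idea concrete, here is a minimal sketch of how a harvester might enumerate metadata records in a WAF, assuming the folder is exposed as a plain HTML directory listing with one link per XML file. The listing, the base URL and the function name are hypothetical; the clearinghouses' actual harvesters are not described in this paper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_waf_records(listing_html, base_url):
    """Return absolute URLs of the .xml files linked from a directory listing."""
    parser = _LinkCollector()
    parser.feed(listing_html)
    return [urljoin(base_url, href) for href in parser.hrefs
            if href.lower().endswith(".xml")]


# Hypothetical directory listing, as a web server might render a folder.
SAMPLE_LISTING = """
<html><body>
<a href="../">Parent Directory</a>
<a href="dataset_a.xml">dataset_a.xml</a>
<a href="dataset_b.xml">dataset_b.xml</a>
<a href="readme.txt">readme.txt</a>
</body></html>
"""

record_urls = list_waf_records(SAMPLE_LISTING, "http://example.org/waf/")
print(record_urls)
```

Because the harvester only sees the folder when it visits, any record added or changed after a visit stays invisible in the clearinghouse until the next harvest.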
Once the community catalog is registered in the GEOSS CSR, the GEOSS
Clearinghouses find the catalog service and harvest the WAF (Figure 2, Harvest).
After the harvest each of the three clearinghouses extracts some of the discovery
metadata and allows users to find data access services. The value that the community
catalog provides to the service provider is that through one registration in GEOSS of the
catalog, many data access services collected by the catalog can be harvested and exposed
through the clearinghouse.
Find: In order to find the data access services in the clearinghouse one has to know what
to search for. The clearinghouse queryable fields and metadata fields were mapped in a
crosswalk to clarify how information was extracted from the harvested metadata records
(see Table 2). The key queries we have been interested in so far are finding our own
information through the parent identifier and searching for services by type and keyword.
The ESRI and USGS clearinghouses expose a search API, which enabled the AQ group to
create more customized search interfaces. One example interface that we have set up
uses the USGS API and full-text search to find WMS or WCS services (Ref
link).
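The three key queries can be sketched as a simple filter over already-harvested records. This is an in-memory illustration only: the record layout, field names, identifiers and the `find_services` function are hypothetical, and it does not call or imitate the ESRI or USGS search APIs.

```python
def find_services(records, parent_identifier=None, service_type=None, keyword=None):
    """Filter records by parent identifier, service type and keyword (all optional)."""
    matches = []
    for record in records:
        if parent_identifier is not None and record.get("parent_identifier") != parent_identifier:
            continue
        if service_type is not None and record.get("service_type") != service_type:
            continue
        if keyword is not None:
            record_keywords = [k.lower() for k in record.get("keywords", [])]
            if keyword.lower() not in record_keywords:
                continue
        matches.append(record)
    return matches


# Hypothetical harvested records.
HARVESTED = [
    {"title": "Example ozone map", "parent_identifier": "aq-community-catalog",
     "service_type": "WMS", "keywords": ["ozone"]},
    {"title": "Example ozone coverage", "parent_identifier": "aq-community-catalog",
     "service_type": "WCS", "keywords": ["ozone", "surface"]},
    {"title": "Example solar map", "parent_identifier": "other-catalog",
     "service_type": "WMS", "keywords": ["solar"]},
]

own_wms = find_services(HARVESTED, parent_identifier="aq-community-catalog", service_type="WMS")
print([r["title"] for r in own_wms])
```

A query like this only works if the clearinghouse actually indexes the parent identifier, service type and keyword fields, which is why the crosswalk between metadata fields and queryable fields matters.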
Bind: Once the user has found the dataset of interest, the next step is to bind to the dataset
(Fig. 2) and display the data access service through the tool of choice. Each service is
described in the metadata record with its GetCapabilities URL. Through preliminary
testing with WMS services, the Compusult Clearinghouse is able to bind to the
GetCapabilities URL and display an instance of the map.
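For readers unfamiliar with binding to a WMS, the sketch below builds the two request URLs a client would typically issue: GetCapabilities, then GetMap. It assumes a WMS 1.1.1 endpoint with a layer in EPSG:4326; the endpoint URL, layer name and function names are hypothetical, and this is not how the Compusult Clearinghouse is implemented.

```python
from urllib.parse import urlencode


def get_capabilities_url(endpoint, service="WMS"):
    """Build a GetCapabilities request URL for an OGC service endpoint."""
    return endpoint + "?" + urlencode({"SERVICE": service, "REQUEST": "GetCapabilities"})


def get_map_url(endpoint, layer, bbox, width=512, height=256):
    """Build a WMS 1.1.1 GetMap URL; bbox is (west, south, east, north) in degrees."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",           # empty value requests the default style
        "SRS": "EPSG:4326",     # WMS 1.1.1 uses SRS; 1.3.0 uses CRS instead
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)


ENDPOINT = "http://example.org/wms"  # hypothetical service endpoint
print(get_capabilities_url(ENDPOINT))
print(get_map_url(ENDPOINT, "ozone", (-125.0, 25.0, -65.0, 50.0)))
```

The layer name and bounding box passed to `get_map_url` would normally come from the capabilities document retrieved in the first step.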
Results and Discussion: Through the community efforts of AIP-2 the Air Quality
Community has had limited success in establishing the flow of metadata and data through
the GCI. Data access services can be published to the WAF as ISO 19115 metadata
records. When the WAF is registered as a component and service in the GEOSS CSR,
the WAF is harvested and the data access services are exposed through the
clearinghouses.
Interactions between the clearinghouse and WAF are still changing. There are no formal
procedures set up to schedule harvesting other than through e-mail requests and informal
arrangements. There are also issues with deletion of records in the WAF and how that
propagates to the clearinghouse. In one instance with USGS, every time the WAF is
harvested all of the old records from that catalog are deleted and replaced with the new
records. Finally, even though the metadata has the fields needed for discovery, the
clearinghouses still do not search over all of these fields, making it
difficult to find records.
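The delete-and-replace harvesting behavior described for USGS can be sketched as follows. This is only a model of the behavior as described above, using plain dictionaries; the function name, record identifiers and catalog names are hypothetical, and it is not the clearinghouse's code.

```python
def harvest_replace_all(clearinghouse, catalog_id, waf_records):
    """Drop every record previously harvested from catalog_id, then load the WAF's current records."""
    refreshed = {record_id: record for record_id, record in clearinghouse.items()
                 if record["catalog"] != catalog_id}
    for record_id, metadata in waf_records.items():
        refreshed[record_id] = {"catalog": catalog_id, "metadata": metadata}
    return refreshed


# Hypothetical clearinghouse state before the harvest.
clearinghouse = {
    "aq-001": {"catalog": "aq-waf", "metadata": "old ozone record"},
    "aq-002": {"catalog": "aq-waf", "metadata": "old PM2.5 record"},
    "re-001": {"catalog": "other-waf", "metadata": "solar record"},
}

# The provider has since removed aq-002 from the WAF and updated aq-001.
current_waf = {"aq-001": "updated ozone record"}

clearinghouse = harvest_replace_all(clearinghouse, "aq-waf", current_waf)
print(sorted(clearinghouse))
```

Under this model a deletion in the WAF does propagate, but only at the next harvest, and records from other catalogs are left untouched.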
Conclusion:
This work is still evolving. The metadata for discovery may continue to change as others
add their content and the fields need to be revised. Additionally, discovery is only one
aspect of the data life cycle, and the next steps will be to further close the loop between
provider and user through more flexible DataSpaces. DataSpaces will incorporate the
structured ISO 19115 metadata described above as well as further extend the metadata to
incorporate user feedback, discussion, free tags and harvest flexible community-contributed
content like papers and web applications related to the dataset.
References: