Abstract: The GEOSS Air Quality Community of Practice (CoP) is developing a metadata system that will facilitate the discovery, understanding and usage of distributed air quality data. This collaborative effort consists of preparing a community catalog, GEO portals and community portals. The community catalog will be harvested through the GEOSS Common Infrastructure (GCI), and the GEO and community portals will facilitate finding and distributing air quality (AQ) datasets. The catalog records contain three types of metadata: (1) metadata fields common to all datasets, (2) extended dataset-specific metadata using ISO 19115 and (3) a link to DataSpaces for additional unstructured, community-contributed metadata. The common fields for data discovery will be extracted from the OGC WMS/WCS GetCapabilities file. The ISO 19115 Geographic Information Metadata standard will be used to extend the metadata to include lineage and other information for better understanding the data. DataSpaces are provided to incorporate the structured metadata described above and to extend it with user feedback, discussion, free tags and harvested community-contributed content, such as papers and web applications related to the dataset. The value of DataSpaces comes from the ability to connect the dataset community: users, mediators and providers. As DataSpaces evolve and are used more by the community, additional functionality will emerge.

Introduction: Air quality data is normally created for a particular mandated end user. The format of the data is specific to that use, and the metadata that describes the data is focused on a certain application. The GEOSS idea that a ‘dataset can be used for many applications and a single application needs many datasets’ (Ref GEOSS) is gaining ground. Standardized data access services allow interoperability in binding to a dataset; however, the first two parts of the publish-find-bind process, publishing and finding a data access service, are still problematic.
Two problems arise in completing this task: (1) how does one describe the data access service so that it can be discovered by users, and (2) where does one publish the data access service so that it can be found? At the start of this project there was no identified set of queryable fields needed to discover Earth observations. This paper describes the process that was taken to enhance the publishing and finding of Earth observation data access services. While the paper focuses on air quality datasets, at this point the only thing specific to air quality is the set of test datasets that we have used. This method could be applied to data access services in any other societal benefit area within GEOSS.

Method: The development of the metadata records and their publication through the GEOSS Common Infrastructure (GCI) was done collaboratively through the Architecture Implementation Pilot-II (AIP-2) of GEOSS. The pilot work was broken into two dimensions (Fig. 1): societal benefit areas (AQ IT Cast) and transverse technologies (GEOSS AIP Workgroups).

Fig. 1. GEOSS Workgroups

The societal benefit workgroups used their domain to test the different steps of the publish-find-bind process facilitated by GCI. The transverse workgroups worked on particular technologies and tools specific to publishing, finding or binding (PILOT Ref). The main transverse workgroup responsible for the publishing and finding of data access service metadata was the Catalog, Clearinghouse, Registries and Metadata workgroup (CCRM Ref). This group worked closely with the air quality workgroup to create standard metadata records that could be published through a web accessible folder, harvested by the clearinghouses and found by AQ users (Fig. 1). Communication was mostly through weekly telecons, an active listserv and the Google Sites workspace for Pilot activities. Currently there are three clearinghouses under development, by USGS, ESRI and Compusult.
The work with the three clearinghouses was coordinated through a Google Doc spreadsheet that mapped the metadata record to the queryable fields in each clearinghouse.

ISO 19115 Metadata Records: The Air Quality group used the ISO 19115 geographic information metadata standard to create the metadata records for three reasons: it is recognized by GEOSS (Ref 10yr Pl) in the GEOSS Standard Registry (Ref Std. Registry), it is internationally recognized, and the U.S. is migrating from FGDC toward ISO 19115 with a North American Profile. The ISO 19115 standard accommodates not only discovery metadata but also data lineage, usage and other types of metadata for understanding multiple applications. The initial focus of metadata creation was on discovery metadata; future work will extend the records to usage, lineage and other components of the metadata.

The discovery metadata record is the minimum information a user needs to find the data and bind to it. To identify the fields necessary for discovery, we started with the ISO 19115 Core Metadata fields (REF) and the fields needed for a CSW:Record. This fulfilled two goals: to create valid ISO 19115 records, and to be able to retrieve the records from the clearinghouse, which is supposed to have a CSW 2.0.2 interface. We then compared this list of required fields to the queryable and returnable fields of the GEOSS Clearinghouses. None of the clearinghouses, nor the GEOSS service registry, allowed all of the CSW:Record fields to be searched. The three clearinghouses also queried over different fields, there was no standard nomenclature for field names, and key fields for Earth observation data, such as the temporal extent of the datasets, were missing from at least one clearinghouse.
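The comparison of required fields against each clearinghouse's queryable fields amounts to a set difference. The following Python sketch is illustrative only: the field labels, the clearinghouse names and their field lists are placeholders invented for the example, and do not reproduce the actual results of the Pilot comparison.

```python
# Hypothetical sketch: report which discovery fields each clearinghouse
# cannot query. All names and field lists below are placeholders.

# Discovery fields the record should support. These are illustrative labels
# chosen for this sketch, not official ISO 19115 or CSW element names.
REQUIRED_FIELDS = {"title", "abstract", "bounding_box", "temporal_extent", "keywords"}


def missing_queryables(required, clearinghouses):
    """Return {clearinghouse name: sorted required fields it cannot query}."""
    return {
        name: sorted(required - set(fields))
        for name, fields in clearinghouses.items()
    }


def common_queryables(clearinghouses):
    """Return the sorted fields that every clearinghouse can query."""
    field_sets = [set(fields) for fields in clearinghouses.values()]
    return sorted(set.intersection(*field_sets)) if field_sets else []


# Placeholder inputs: invented names and field lists, not the Pilot's findings.
example = {
    "clearinghouse_a": ["title", "abstract", "bounding_box", "keywords"],
    "clearinghouse_b": ["title", "abstract", "temporal_extent"],
}

if __name__ == "__main__":
    for name, gaps in missing_queryables(REQUIRED_FIELDS, example).items():
        print(name, "cannot query:", gaps)
    print("queryable everywhere:", common_queryables(example))
```

Keeping the required fields and the per-clearinghouse fields as plain sets makes it easy to rerun the comparison whenever a clearinghouse changes its search interface.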
The clearinghouse comparison is documented at https://sites.google.com/site/geosspilot2/air-quality-and-health-working-group/geoclearinghouse-comparison

The key set of queryable fields for all Earth observations seems to be: dataset title, abstract, geolocation, temporal extent, keywords, service type, metadata file ID and associated component (catalog information). This set allows for text, spatial and temporal searches. Keyword search additionally allows searching by measurement platform or observed phenomenon if a common vocabulary is used. Information about the service type allows filtering by a specific service standard, and the associated component links the metadata record of a given dataset to a larger community catalog. These queryable fields can be in any metadata format; what matters is that the community agrees that this set is needed to find any Earth observation service.

With the Earth observation queryable fields plus the ISO 19115 Core Metadata fields, we had the set of fields for a metadata record that could be used for discovery. The next step in the process was to create the metadata record. We used two tools for this process. First, we used the INSPIRE metadata generator to create an ISO 19115 record. Then, for the additional ISO 19115 fields that we wanted to include, we used the CatMDEdit software to generate the required ISO 19139 XML structure and inserted that portion of metadata into our growing ISO 19115 record. As the record grew, we validated it against the ISO 19115 schema.

Publish: Initially it was thought that “publishing” data on the web in any format was enough for sharing. It then became apparent that standard data access formats were useful for integrating datasets, and publication came to mean exposing a standard data access service on the web.
Through the Pilot it became apparent that just exposing the data access service with a GetCapabilities document does not provide enough metadata for discovery; additional standard metadata needs to be published by the provider or distributor for the service to be discovered. The services we are registering are OGC WMS and WCS services. Their GetCapabilities documents include some metadata that can be mapped to ISO 19115 fields (Stefano Ref). To further improve the use of GetCapabilities documents for metadata creation, we organized our data access services by dataset, so that the general metadata information is dataset-specific and each coverage is a dataset parameter. The table below shows the mapping from the OGC GetCapabilities documents to the ISO 19115 metadata record. Only a handful of additional fields are needed to create a valid metadata record, and these can be hard-coded or entered at the time of registration.

Figure 2 shows the flow of metadata. Starting with the service provider, the GetCapabilities document is used to create an ISO 19115 metadata record for the data access service. The XML document is saved into the community catalog.

Figure 2. Publish-Find-Bind process for data access services from service provider to user

Community Catalog: The community catalog is a component and service that is registered in the GEOSS Component and Service Registry (CSR). The GEOSS AIP group considered several options when creating the catalog: Catalog Service for the Web (CSW), Z39.50 and Web Accessible Folder (WAF). The WAF is the simplest option since the only criterion is that there is a folder accessible on a server. WAFs are “agnostic” to the metadata format used. The Renewable Energy workgroup in AIP-2 and NOAA both used the WAF method for publishing metadata and were successfully harvested by the ESRI Clearinghouse.
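To illustrate why the WAF is the simplest option, the sketch below shows how a harvester might enumerate the metadata records in such a folder: it reads the folder's HTML index page and keeps the links that end in .xml. This is a minimal sketch under stated assumptions, not the procedure used by any of the GEOSS clearinghouses. The index page, the file names and the example.org URL are hypothetical, and the sketch assumes the server exposes the folder as a plain HTML listing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_waf_records(index_html, base_url):
    """Return absolute URLs of the .xml files linked from a WAF index page."""
    collector = _LinkCollector()
    collector.feed(index_html)
    return [
        urljoin(base_url, href)
        for href in collector.hrefs
        if href.lower().endswith(".xml")
    ]


# Hypothetical index page, as a web server might generate for a folder.
SAMPLE_INDEX = """
<html><body>
<a href="../">Parent Directory</a>
<a href="dataset_one.xml">dataset_one.xml</a>
<a href="dataset_two.xml">dataset_two.xml</a>
<a href="readme.txt">readme.txt</a>
</body></html>
"""

if __name__ == "__main__":
    for url in list_waf_records(SAMPLE_INDEX, "http://example.org/waf/"):
        print(url)
```

Because the harvester only follows links, the folder can hold records in any metadata format; interpreting each downloaded file is left to the clearinghouse.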
Being aware of what these other groups were doing, through the AIP Plenary telecons and the Google Sites Pilot workspace, allowed the AQ community to learn faster from others, and we ultimately implemented the WAF option for our initial community catalog. The WAF is not a standard service interface, however, and had to be registered as a special arrangement in the GEOSS Standard Registry. Testing the GEOSS architecture through AIP-2 identified the need for the WAF, and the special arrangement was based on a group agreement within the Pilot. The drawback of the WAF compared with the CSW or Z39.50 interfaces is that the WAF must be harvested by the clearinghouses and cannot be dynamically queried like the catalog service interfaces. Because it is harvested, there is a delay before updated metadata becomes available in the clearinghouse.

Once the community catalog is registered in the GEOSS CSR, the GEOSS Clearinghouses find the catalog service and harvest the WAF (Figure 2, Harvest). After the harvest, each of the three clearinghouses extracts some of the discovery metadata and allows users to find data access services. The value that the community catalog provides to the service provider is that, through a single registration of the catalog in GEOSS, the many data access services collected by the catalog can be harvested and exposed through the clearinghouses.

Find: To find the data access services in the clearinghouse, one has to know what to search for. The clearinghouse queryable fields and metadata fields were mapped in a crosswalk to clarify how information was extracted from the harvested metadata records (see Table 2). The key queries we have been interested in so far are finding our own records through the parent identifier and searching for services by type and keyword. The ESRI and USGS clearinghouses expose a search API, which enabled the AQ group to create more customized search interfaces.
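The three key queries named above (parent identifier, service type and keyword) can be sketched as filters over a list of harvested records. The example below is a local, in-memory illustration only. It does not call the ESRI or USGS search APIs, and the record structure, attribute names and sample records are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Subset of discovery metadata; attribute names are illustrative."""
    file_id: str
    title: str
    service_type: str            # e.g. "WMS" or "WCS"
    parent_id: str               # identifier of the community catalog
    keywords: list = field(default_factory=list)


def find_records(records, parent_id=None, service_type=None, keyword=None):
    """Return the records that match every criterion that is not None."""
    matches = []
    for rec in records:
        if parent_id is not None and rec.parent_id != parent_id:
            continue
        if service_type is not None and rec.service_type.upper() != service_type.upper():
            continue
        if keyword is not None and keyword.lower() not in (k.lower() for k in rec.keywords):
            continue
        matches.append(rec)
    return matches


# Invented records for illustration only.
CATALOG = [
    Record("rec-1", "Example ozone map", "WMS", "aq-catalog", ["ozone", "surface"]),
    Record("rec-2", "Example ozone grid", "WCS", "aq-catalog", ["ozone", "model"]),
    Record("rec-3", "Example solar map", "WMS", "other-catalog", ["solar"]),
]

if __name__ == "__main__":
    for rec in find_records(CATALOG, parent_id="aq-catalog", service_type="wms"):
        print(rec.file_id, rec.title)
```

A search of this kind only works if the clearinghouse actually indexes the corresponding fields, which is why the crosswalk between metadata fields and queryable fields matters.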
One example interface that we have set up uses the USGS API with a full-text search to find WMS or WCS services (Ref link).

Bind: Once the user has found the dataset of interest, the next step is to bind to the dataset (Fig. 2) and display the data access service through the tool of choice. Each service is described in the metadata record with its GetCapabilities URL. In preliminary testing with WMS services, the Compusult Clearinghouse was able to bind to the GetCapabilities URL and display an instance of the map.

Results and Discussion: Through the community efforts of AIP-2, the Air Quality Community has had limited success in establishing the flow of metadata and data through the GCI. Data access services can be published to the WAF as ISO 19115 metadata records. When the WAF is registered as a component and service in the GEOSS CSR, the WAF is harvested and the data access services are exposed through the clearinghouses. Interactions between the clearinghouses and the WAF are still changing. There are no formal procedures for scheduling harvests other than e-mail requests and informal arrangements. There are also issues with the deletion of records in the WAF and how deletions propagate to the clearinghouse. In one instance with USGS, every time the WAF is harvested all of the old records from that catalog are deleted and replaced with the new records. Finally, even though the metadata has the fields needed for discovery, the clearinghouses still do not search over all of these fields, making it difficult to find records.

Conclusion: This work is still evolving. The metadata for discovery may continue to change as others add their content and the fields need to be revised. Additionally, discovery is only one aspect of the data life cycle, and the next steps will be to further close the loop between provider and user through more flexible DataSpaces.
DataSpaces will incorporate the structured ISO 19115 metadata described above and extend it with user feedback, discussion, free tags and harvested community-contributed content, such as papers and web applications related to the dataset.

References: