A specialised metadata approach to discovery and use of
data in the NERC DataGrid
Kevin O'Neill (1), Ray Cramer (3), Marta Gutierrez (2),
Kerstin Kleese van Dam (1), Siva Kondapalli (3), Susan Latham (2),
Bryan Lawrence (2), Roy Lowry (3), Andrew Woolf (1)
(1) CCLRC e-Science Centre
(2) British Atmospheric Data Centre
(3) British Oceanographic Data Centre
1. INTRODUCTION
The Natural Environment Research Council (NERC) has a wide range of data holdings, held in
technologies from flat files to relational databases. These holdings are relevant to a wide range of
scientific disciplines, despite often having been collected on behalf of quite narrow specialised
disciplines. The data holdings are stored across a wide range of archives, ranging from specialist
professional data curators and archivists, such as the British Atmospheric Data Centre (BADC)
and the British Oceanographic Data Centre (BODC) to files held on the hard disc of an individual
scientist's PC.
The NDG vision is for the user to see these data resources as one entity, thus improving the ability
of scientists to find data, and to provide a framework for the integration of data manipulation and
visualisation services to improve the usability of the data. As a by-product, it is hoped that it will
then be easier for scientists to contribute to and help maintain managed data holdings.
Key requirements are that the NDG should:

allow discovery of and access to data without needing detailed a priori knowledge of
storage characteristics, values or parameters;

be discipline specific, but provide functionality for users beyond that community;

allow discovery of and access to relevant data by scientists outside the discipline for
which it was collected;

hide the heterogeneity of the data sources being queried, and combine the results into a
single, consistent, result set;

allow the specification of pre-presentation processing, such as sub-querying,
transformation, and consolidation, particularly where the data may be spread across
several data sources;

deliver data to the desired place in the desired format, aiming at hiding the original format
of the data without losing data values or its semantic content;

allow (limited) server-side processing of the data.
Given that the NDG is going to be built on pre-existing data holdings, the NDG needs to provide
mechanisms to query metadata about the datasets and collate the results, along with the means to
declare metadata models into which a data holding can map its local schema to allow
cross-holding queries and data processing.
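The idea of mapping local schemas onto a declared common model can be illustrated with a minimal sketch. Everything named here is hypothetical: the two archives, their local field names (`id`, `variable`, `cruise_ref`, `param_name`) and the three keys of the common model are invented for illustration and are not taken from the NDG.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mappers: each holding declares how its local schema
# maps onto the shared keys "dataset_id", "parameter" and "holder".
def from_atmos_archive(row: dict) -> dict:
    return {"dataset_id": row["id"],
            "parameter": row["variable"].lower(),
            "holder": "atmos"}

def from_ocean_archive(row: dict) -> dict:
    return {"dataset_id": row["cruise_ref"],
            "parameter": row["param_name"].lower(),
            "holder": "ocean"}

# Illustrative holdings: a mapper plus records in the local schema.
HOLDINGS: Dict[str, Tuple[Callable[[dict], dict], List[dict]]] = {
    "atmos": (from_atmos_archive,
              [{"id": "a1", "variable": "Temperature"},
               {"id": "a2", "variable": "Ozone"}]),
    "ocean": (from_ocean_archive,
              [{"cruise_ref": "o7", "param_name": "TEMPERATURE"}]),
}

def query_all(parameter: str) -> List[dict]:
    """Query every holding through its mapper and collate one result set."""
    wanted = parameter.lower()
    results = []
    for mapper, rows in HOLDINGS.values():
        for row in rows:
            record = mapper(row)
            if record["parameter"] == wanted:
                results.append(record)
    # Sort so the collated result does not depend on holding order.
    return sorted(results, key=lambda r: r["dataset_id"])
```

Calling `query_all("temperature")` returns records from both illustrative holdings in the same shape, so the caller never sees the differing local field names.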
It is intended to do this by providing a decoupled data and metadata infrastructure that will bring
together developed versions of tools that either already exist or are under development within
e-Science or the worldwide earth science community. Initially, Atmospheric and Oceanographic
data held in the BODC and BADC will be made available, with data from other disciplines
funded by NERC being added in due course.
2. OVERVIEW OF NDG METADATA
Usually, metadata models have tried to cover discovery and use within a single structure. In trying
to capture the entire metadata chain from discovery to use for the NDG, it was found that either
this structure would be far too large to be easily managed or understood, or we would have to
make a pragmatic decision regarding the perspective to be emphasised. Also, metadata values
were found to have multiple semantics that may not sit together easily in the longer run,
especially where each viewpoint required attributes to be maintained that were not of interest to
the other viewpoint.
The above problems led to the development of a metadata taxonomy that identified metadata
specialisations and related them. In brief, the key elements of the metadata include (but are not
limited to):
A [Archive]    format and usage metadata.
B [Browse]     superset of discovery and contextual metadata.
C [Comment]    annotations, documentation and other supporting material.
D [Discovery]  metadata used primarily to locate datasets.
The key types are the “Type A” metadata, which is directly concerned with the use of the data,
the “Type D” metadata, which is used directly by discovery services, and the “Type B” core
metadata.
Type B is a superset of the Discovery metadata. It will be used to generate different “D type”
discovery formats from a single corpus of metadata, e.g. GCMD DIF, the FGDC Z39.50 “GEO”
profile, or Dublin Core. It is generally referred to as the NDG Metadata Model.
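Generating several discovery formats from one core record might look like the following sketch. The core record and its keys are hypothetical, and the output element names are a simplified subset loosely modelled on Dublin Core and DIF element names; the real formats are richer and structured differently, and this is not the NDG's actual mapping.

```python
# Hypothetical "Type B" record; key names are illustrative only.
b_record = {
    "id": "example-dataset-001",
    "title": "Example sea surface temperature dataset",
    "abstract": "Illustrative record, not a real NDG entry.",
    "parameters": ["sea surface temperature"],
    "project": "Example project",  # contextual detail kept only in B
}

def to_dublin_core(b: dict) -> dict:
    """Project the core record onto a few Dublin Core-style elements."""
    return {"identifier": b["id"],
            "title": b["title"],
            "description": b["abstract"],
            "subject": list(b["parameters"])}

def to_dif(b: dict) -> dict:
    """Project the core record onto a few DIF-style elements (simplified)."""
    return {"Entry_ID": b["id"],
            "Entry_Title": b["title"],
            "Summary": b["abstract"],
            "Parameters": list(b["parameters"])}

# One exporter per "D type" format, all fed from the same corpus.
EXPORTERS = {"DC": to_dublin_core, "DIF": to_dif}

def export(b: dict, fmt: str) -> dict:
    return EXPORTERS[fmt](b)
```

Because each exporter is a projection of the same record, adding a further discovery format means adding one function rather than maintaining a second corpus.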
“Type A” is more directly concerned with the use of the data, and is the basis of work on the
semantics contained inside the data itself and how this can help in the realisation of the semantic
grid 1. This is referred to as the NDG Data Model, emphasising its inclination towards the data
itself.
1 See abstract submitted to GGF10 by Andrew Woolf regarding “Data Modelling and Metadata in the Semantic Grid”
[Figure 1 diagram labels: Data (A); Core Metadata (B); Ancillary metadata (C); Discovery formats D(DIF), D(DC), D(...); E. Link labels: Data summary; Data granule IDs; Discovery Information; Use.]
Figure 1 - Relation of major metadata types identified
This categorisation has brought benefits by giving a clear split between discovery and use. Many
disciplines have widely used, almost standard, data formats encapsulating discipline semantics to
one degree or another. Separation allows the discovery metadata model to be plugged into
different data models in a manner that means that the underlying data model is transparent to the
user; and the reverse is also true as a single data model may be used in different disciplines. It also
means that each model can tune the detail kept in it to that necessary to perform its task. For
example, the data model must keep track of the actual data values and sufficient information to
deliver the data to the user, if necessary transforming it from the original format to another,
whereas the metadata model needs only a summary of the data values, but must hold detail of how
and why the data was gathered. Thus, some data values are kept in both the data and metadata
models, but their intended usages or the detail required is very different.
[Figure 2 diagram labels: Discovery; Data Use. Shared elements: Creating project; Parameters measured; Data size; Parameter value ranges.]
Figure 2 - Examples of metadata elements needed by both discovery and data use
It is vital that the “A” and “B”, and hence the “D”, metadata be able to cross-reference each other.
An identifier generated by the data model links the data and metadata models. Once the data of
interest is identified, by searching the “D type” metadata, the IDs of the data granules are passed
to data browsing software that will interact with the “Type A” metadata allowing the user to
identify and process the actual portion(s) of data of interest. This processing could include
subsetting and aggregation of the data, in some cases producing new data granules that will be
registered in the NDG, in others the result will be a temporary data set that will be discarded after
use.
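The hand-over from discovery to data use through granule IDs can be sketched as follows. The identifiers, the two in-memory stores and the function names are all hypothetical, chosen only to show how an ID found by searching discovery metadata is then used against the data itself.

```python
# Discovery side: searchable summaries keyed by granule ID.
discovery_index = {
    "granule-001": {"parameters": {"temperature"}},
    "granule-002": {"parameters": {"salinity", "temperature"}},
    "granule-003": {"parameters": {"ozone"}},
}

# Data side: the actual values, keyed by the same granule IDs.
data_store = {
    "granule-001": {"temperature": [10.0, 11.5, 12.0]},
    "granule-002": {"salinity": [35.1, 35.3], "temperature": [8.0, 8.2]},
    "granule-003": {"ozone": [300.0, 310.0]},
}

def discover(parameter: str) -> list:
    """Search discovery metadata and return matching granule IDs."""
    return sorted(gid for gid, meta in discovery_index.items()
                  if parameter in meta["parameters"])

def subset(granule_id: str, parameter: str, start: int, stop: int) -> list:
    """Extract part of one granule's values for a parameter."""
    return data_store[granule_id][parameter][start:stop]

def aggregate(granule_ids: list, parameter: str) -> list:
    """Concatenate a parameter's values across several granules."""
    values = []
    for gid in granule_ids:
        values.extend(data_store[gid][parameter])
    return values
```

Here `aggregate(discover("temperature"), "temperature")` produces a temporary combined data set; registering such a result as a new granule would mean adding it to both stores under a fresh ID.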
Also, a summary of the data contents is passed to the “B” metadata. This contains details that are
of use to the discovery service, such as parameters represented in the data, but that are dealt with
in more detail in the “A” data use metadata.
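A data summary of this kind could be derived from the data along the following lines. This is a sketch under stated assumptions: the summary fields (parameters, value ranges, size) follow the elements the text names as shared between discovery and use, while the function name, the record layout and the use of a value count for size are hypothetical.

```python
def summarise(granule_id: str, data: dict) -> dict:
    """Build a discovery-level summary of one data granule.

    `data` maps a parameter name to a non-empty list of numeric values.
    """
    return {
        "granule_id": granule_id,
        # Which parameters the granule contains.
        "parameters": sorted(data),
        # Minimum and maximum per parameter, not the values themselves.
        "value_ranges": {name: (min(vals), max(vals))
                         for name, vals in data.items()},
        # Size expressed here as a simple count of stored values.
        "size": sum(len(vals) for vals in data.values()),
    }
```

The summary keeps only what a search needs, while the full values stay with the data model and are reached through the granule ID.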
[Figure 3 diagram labels: Metadata Model; Data Model. Links between them: Data granule ID; Data summary.]
Figure 3 - Relating the Metadata and Data Models
3. FUTURE DIRECTIONS
Currently, the NDG has produced a prototype in which the “B” metadata was used to generate a
“D type” discovery format that then allowed the user to access data in raw or virtualised format.
This has pointed out the need for more work in a number of areas. These include:

engagement with existing earth science disciplines’ taxonomies and controlled
vocabularies to extend and clarify them;

extension to discovery terms to map into, and cross-index with, “foreign” discipline
terminologies;

a need to provide semantic descriptions of data structures, such as a “marine section”,
that can be recognised by characteristics such as dimensionality and the value range of
some dimensions, and that can encapsulate representations in DFDL or similar;

convergence with emerging standards, such as those from the Open GIS
Consortium (OGC) and ISO’s TC211 activity.
In the meantime, a series of systems providing real functionality to earth scientists is being
produced or planned. These use the evolving NDG framework, coupled with a service-oriented
implementation strategy, to allow developments to be incorporated without excessive
disruption or delay. Also, the EcoGRID project has been started to bring the technologies and
concepts of the NDG to the field of ecology.