NERC ECOLOGICAL DATA GRID (ECOGRID)

Neil Bennett1, Rod Scott2, Mike Brown2, Mandy Lane2, Kerstin Kleese-van Dam1, Kevin O’Neill1, Andrew Woolf1, John Watkins2
1 CCLRC – Daresbury Laboratory, Daresbury, Warrington, Cheshire, WA4 4AD, UK.
2 Centre for Ecology & Hydrology, Lancaster Environment Centre, Library Avenue, Bailrigg, Lancaster, LA1 4AP, UK
Abstract: The Centre for Ecology & Hydrology
(CEH) holds various databases collectively
representing a valuable environmental research
resource. However, their use inside and outside
CEH is constrained by lack of data accessibility
and interoperability. This project has focused on
three test-bed datasets held at the Lancaster
Environment Centre which form a good example
of the diversity of CEH terrestrial and freshwater
data. Metadata systems and access tools have
been constructed in collaboration with the NERC
Data Grid (NDG) to provide users with Grid
services linking data discovery to dataset delivery.
A data dictionary is also being developed to
describe environmental data as this will enable
ontological interoperability.
1. Introduction
CEH is the leading UK body for research,
survey and monitoring in terrestrial and
freshwater environments. It has eight sites in
the UK specialising in different areas. The
Lancaster site has a mix of freshwater and
terrestrial datasets and so its data represents a
good example of the diversity of CEH’s data.
There are three test-bed datasets within
the EcoGrid project: the Countryside Survey
vegetation database (CS); the Environmental
Change Network database (ECN); and the
Lakes database.
The ECN data is a time-series that started
in 1993. ECN has a network of collaborators
at monitoring sites across the UK studying
water chemistry, vegetation and animals.
CS is a time series which started in 1978
and was repeated in 1984, 1990 and 1998.
Each new survey repeats areas and indicators
included previously but also introduces new
ones. A new database is produced for each
successive survey, with data from previous
surveys included via comparison fields.
Currently, researchers tackling complex
environmental problems are unaware of the
breadth and depth of relevant data from CEH.
They need better tools for data discovery,
exploration and delivery that enable
exploitation of this resource. These tools need
to be compatible with other data networks so
that researchers can easily combine data
across the network. Without a networked
approach, tool development tends to duplicate
effort and the potential of combined data
resources remains unexploited.
Part of this problem is being addressed by
NDG. This is establishing an infrastructure
for accessing the data of various UK research
institutions. NDG is an ongoing project but
ultimately will provide a portal allowing
scientists to search for datasets via the
corresponding metadata records. NDG also
takes care of security issues such as
authentication and authorisation of users.
NDG security was developed using the
CCLRC Data Portal Authorisation
Architecture [3] as a starting point. Thus,
being part of the NDG will open up CEH’s
data to other researchers.
CEH is also looking to collaborate with
ecologists worldwide. In the USA, the
Knowledge Network for Biocomplexity is
looking to solve similar problems and has
developed Ecological Metadata Language
(EML) to describe ecological data [4]. If
CEH can produce EML records for its data,
that data will become visible to many more
ecologists.
1.1. Overview
The main aim of this project is to make a
diverse subset of CEH’s data more accessible
to scientists by integration into the NDG.
NDG has defined a detailed metadata
model [1], MOLES (Metadata Objects for
Links in Environmental Science) and a data
model [2] known as CSML (Climate Science
Modelling Language). Both models are
standards-based and are represented as XML
schemas. The models were produced by
close collaboration with research
organisations across various scientific
domains. They are intentionally generic so
that they can accommodate data from a wide
range of scientific disciplines.
MOLES has been designed to allow
production of the various ‘industry standard’
discovery formats, such as DIF, ISO 19115,
FGDC/GEO and SensorML. It summarises
the key points of the data and adds elements
and relations that do not appear in the data.
MOLES allows a smooth link from data
browse to data usage.
CSML provides information about the
data that processing and visualisation services
need. It describes the data in semantic terms
and virtualises it, removing the need to know
the actual format of the data.
Integrating the EcoGrid data into NDG
requires mapping that data to the NDG
models. The test-bed datasets used for
EcoGrid are diverse in nature, and mapping
directly from them to NDG would be
difficult for the following reasons.
An individual EcoGrid database table
often does not constitute a dataset. A dataset
may consist of several tables joined together.
Also, current datasets are not always
adequately described; certain metadata is
implicit rather than being included explicitly.
There are also some inconsistencies in how
databases are used to document the data, both
within each dataset and across datasets.
Thus, it seems sensible to ‘clean up’ the
EcoGrid datasets first before mapping to
NDG. When one considers that EcoGrid may
ultimately incorporate all CEH data and not
just three test-bed datasets, a sensible option
is to create an intermediate layer that
describes all CEH data in a consistent way.
This intermediate layer will resolve conflicts
in the base data by concentrating on
semantics. Thus, data with different names
but similar meaning will map to the same
intermediate element, and data with similar
names but different meaning will map to
different intermediate elements.
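
The semantic mapping described above can be pictured as a lookup keyed on both the source dataset and the source field name. The following is only a minimal sketch under stated assumptions: the dataset labels, field names and intermediate element names are hypothetical and are not taken from the EcoGrid databases or the intermediate layer schema.

```python
# Hypothetical lookup: (source dataset, source field name) -> intermediate element.
# All names below are invented for illustration.
FIELD_MAP = {
    ("dataset_a", "WTEMP"): "water_temperature",
    ("dataset_b", "TEMP"): "water_temperature",  # different name, same meaning
    ("dataset_c", "TEMP"): "air_temperature",    # same name as dataset_b, different meaning
}


def to_intermediate(dataset: str, field: str) -> str:
    """Return the intermediate element for a source field, or raise if unmapped."""
    try:
        return FIELD_MAP[(dataset, field)]
    except KeyError:
        raise KeyError(f"no intermediate mapping for {dataset}.{field}") from None
```

Keying on the pair rather than on the field name alone is what lets two fields that share a name resolve to different intermediate elements.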
The NDG data model and metadata
model between them describe data and its
origin from a number of angles, including:
the activity (project, etc.) that the data relates
to; data production tools used to obtain the
data; observation stations of which these tools
are a part; the data itself; and ‘usage’
metadata that concentrates on the semantics
of the data. As these models provide a
comprehensive description of the data and its
origin, and as the intermediate layer
ultimately has to be mapped to NDG and thus
must be ‘compatible’, it made sense to draw
heavily upon the NDG models in the
construction of the intermediate layer.
Once intermediate layer records are
created for all EcoGrid datasets, these can be
mapped to the NDG. If one wishes to create
EML records (for instance) from EcoGrid in
the future, this can be done relatively easily,
either from the intermediate layer or the NDG
records depending on whether it is preferable
to map from a hierarchical (XML) or a
relational model.
2. Architecture
The architecture is illustrated in the
figure. The EcoGrid data is held in the form
of databases. The first stage is to convert this
data into the intermediate layer format which
takes the form of an Oracle database. Thus,
mappings are required between the base
EcoGrid data and the intermediate layer.
Initially, the intermediate layer is
populated automatically via SAS scripts.
However, there may be instances where this
is not possible or the intermediate layer is
populated incorrectly. Thus, a web interface
will be set up where intermediate records can
be manually edited.
The intermediate layer records are then
mapped into the NDG models using Oracle
XML View. This transforms the data from a
relational format to a hierarchical format.
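
To illustrate what a relational-to-hierarchical transformation involves, the sketch below nests child rows under their parent row using Python's standard library. It does not use Oracle XML View, and the table contents and element names are hypothetical; they are not MOLES or CSML elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational rows: a parent table and a child table linked by dataset_id.
datasets = [{"dataset_id": 1, "title": "Example dataset"}]
parameters = [
    {"dataset_id": 1, "name": "pH"},
    {"dataset_id": 1, "name": "alkalinity"},
]


def rows_to_xml(datasets, parameters):
    """Nest child rows under their parent row, turning a foreign key into containment."""
    root = ET.Element("datasets")
    for ds in datasets:
        ds_el = ET.SubElement(root, "dataset", id=str(ds["dataset_id"]))
        ET.SubElement(ds_el, "title").text = ds["title"]
        for p in parameters:
            if p["dataset_id"] == ds["dataset_id"]:
                ET.SubElement(ds_el, "parameter").text = p["name"]
    return root


xml_text = ET.tostring(rows_to_xml(datasets, parameters), encoding="unicode")
```

The foreign-key relationship between the two tables becomes parent-child containment in the XML, which is the essential difference between the two representations.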
Obviously, it is important to ensure that
the base EcoGrid data is mapped to the NDG
models (via the intermediate layer)
appropriately. Often the distinction between
data and metadata is blurred. Theoretically,
metadata are descriptive data whereas data
are physical measurements. However, in
reality, measurements can be part of a
descriptive coding system. For example, to
search datasets by species of interest requires
the individual species codes recorded in each
dataset to be in the discovery metadata.
Thus, species will feature in both MOLES
records (from which discovery metadata is
generated) and CSML records.
For discovery metadata such as species, it
is important that EcoGrid recognises
synonyms. Otherwise, a user searching the
portal may not retrieve all relevant records.
EcoGrid is collaborating with producers of
relevant data dictionaries holding defined
vocabularies to solve this problem. The
Biological Records Centre (based at CEH
Monks Wood) is working with the Natural
History Museum to agree a common
taxonomic classification, and EcoGrid
follows developments there closely.
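
The effect of synonym recognition on a species search can be sketched as follows. This is an illustrative example only: the synonym table, record identifiers and record contents are hypothetical, and a real system would take its synonyms from an agreed taxonomic dictionary rather than a hard-coded table.

```python
# Hypothetical synonym table: each known synonym maps to one accepted name.
SYNONYMS = {
    "endymion non-scriptus": "hyacinthoides non-scripta",
    "scilla non-scripta": "hyacinthoides non-scripta",
}

# Hypothetical discovery records: record id -> species names as recorded.
RECORDS = {
    "rec1": ["Hyacinthoides non-scripta"],
    "rec2": ["Endymion non-scriptus"],
    "rec3": ["Quercus robur"],
}


def accepted(name: str) -> str:
    """Normalise case and whitespace, then resolve a synonym to its accepted name."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, key)


def search(species: str) -> list[str]:
    """Return ids of records that mention the species under any known synonym."""
    target = accepted(species)
    return sorted(
        rid
        for rid, names in RECORDS.items()
        if any(accepted(n) == target for n in names)
    )
```

Here a search for any one of the three bluebell names returns both rec1 and rec2, whereas a plain string match would return at most one of them.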
Also, a significant proportion of EcoGrid
data relates to freshwater chemistry where
various parameters are measured and values
recorded in appropriate units. To help
describe datasets fully and aid comparison
with other datasets both inside CEH and
outside, it is important to reference units and
parameters to standardised definitions in
widely accepted dictionaries. CSML
incorporates references to unit and
‘phenomenon’ dictionaries so EcoGrid’s use
of CSML will help with the issues of
accessibility and interoperability.
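
As a rough illustration of why dictionary references aid comparison, the sketch below resolves locally recorded parameter and unit labels to shared identifiers. The dictionaries, identifiers and conversion step are assumptions made for this example; they do not reproduce CSML or any published unit or phenomenon dictionary.

```python
# Hypothetical unit dictionary: local label -> (dictionary id, factor to mg/l).
UNITS = {
    "mg/l": ("milligram_per_litre", 1.0),
    "ug/l": ("microgram_per_litre", 0.001),
}

# Hypothetical phenomenon dictionary: local parameter label -> dictionary id.
PHENOMENA = {
    "NO3": "nitrate_concentration",
    "nitrate": "nitrate_concentration",
}


def describe(param: str, value: float, unit: str) -> dict:
    """Attach dictionary references to a measurement and express it in mg/l."""
    unit_id, factor = UNITS[unit]
    return {
        "phenomenon": PHENOMENA[param],
        "unit": unit_id,
        "value_mg_per_l": value * factor,
    }
```

Two measurements recorded under different local labels and units, such as describe("NO3", 2.0, "mg/l") and describe("nitrate", 2000.0, "ug/l"), then resolve to the same phenomenon and to comparable values.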
3. Future
EcoGrid will be taking account of new
technologies and changes to existing ones.
Further work and research will be undertaken
with other projects where possible.
It would be desirable to extend EcoGrid
to cover all CEH data and also incorporate
data held by the National Biodiversity
Network which focuses on Sites of Special
Scientific Interest and thus is complementary
to EcoGrid’s data.
UK ecologists would like to collaborate
with ecologists worldwide. For instance,
creation of EML records should open up
CEH’s data to a much wider audience.
Environmental data always has a spatial
component. The current version of the NDG,
with which EcoGrid is closely associated,
will allow spatial searching of datasets.
However, nothing more complex is permitted
and no use is made of geographical
information systems. The second stage of the
NDG is due to begin shortly. Amongst other
things, this will expand NDG’s spatial
capabilities with consequent benefits for
EcoGrid.
4. References
[1] A specialised metadata approach to discovery
and use of data in the NERC Data Grid. K
O’Neill et al., Proceedings of the UK e-Science
All Hands Meeting 2004.
(http://www.allhands.org.uk/2004/proceedings)
[2] Climate Science Modelling Language:
Standards-based markup for metocean data. 85th
meeting of American Meteorological Society, San
Diego, Jan 2005.
[3] Grid Authorisation Framework for the CCLRC
Data Portal. A Manandhar et al., Proceedings of
the UK e-Science All Hands Meeting 2003.
[4] http://knb.ecoinformatics.org/software/eml/