NERC ECOLOGICAL DATA GRID (ECOGRID)

Neil Bennett1, Rod Scott2, Mike Brown2, Mandy Lane2, Kerstin Kleese-van Dam1, Kevin O’Neill1, Andrew Woolf1, John Watkins2

1 CCLRC, Daresbury Laboratory, Daresbury, Warrington, Cheshire, WA4 4AD, UK
2 Centre for Ecology & Hydrology, Lancaster Environment Centre, Library Avenue, Bailrigg, Lancaster, LA1 4AP, UK

Abstract: The Centre for Ecology & Hydrology (CEH) holds various databases that collectively represent a valuable environmental research resource. However, their use inside and outside CEH is constrained by a lack of data accessibility and interoperability. This project has focused on three test-bed datasets held at the Lancaster Environment Centre, which together form a good example of the diversity of CEH terrestrial and freshwater data. Metadata systems and access tools have been constructed in collaboration with the NERC Data Grid (NDG) to provide users with Grid services linking data discovery to dataset delivery. A data dictionary describing environmental data is also being developed to enable ontological interoperability.

1. Introduction

CEH is the leading UK body for research, survey and monitoring in terrestrial and freshwater environments. It has eight sites in the UK specialising in different areas. The Lancaster site holds a mix of freshwater and terrestrial datasets, so its data is a good example of the diversity of CEH’s data.

There are three test-bed datasets within the EcoGrid project: the Countryside Survey vegetation database (CS), the Environmental Change Network database (ECN) and the Lakes database. The ECN data is a time series that started in 1993. ECN has a network of collaborators at monitoring sites across the UK studying water chemistry, vegetation and animals. CS is a time series that started in 1978 and was repeated in 1984, 1990 and 1998. Each new survey repeats areas and indicators included previously but also introduces new ones.
A new database is produced for each successive survey, with data from previous surveys included via comparison fields.

Currently, researchers tackling complex environmental problems are unaware of the breadth and depth of relevant data from CEH. They need better tools for data discovery, exploration and delivery that enable exploitation of this resource. These tools need to be compatible with other data networks so that researchers can easily combine data across the network. Without a networked approach, tool development tends to duplicate effort and the potential of combined data resources remains unexploited.

NDG is addressing part of this problem by establishing an infrastructure for accessing the data of various UK research institutions. NDG is an ongoing project, but it will ultimately provide a portal allowing scientists to search for datasets via the corresponding metadata records. NDG also takes care of security issues such as authentication and authorisation of users. NDG security was developed using the CCLRC Data Portal Authorisation Architecture [3] as a starting point. Thus, being part of the NDG will open up CEH’s data to other researchers.

CEH is also looking to collaborate with ecologists worldwide. In the USA, the Knowledge Network for Biocomplexity is looking to solve similar problems and has developed Ecological Metadata Language (EML) to describe ecological data [4]. If CEH can produce EML records for its data, that data will become visible to many more ecologists.

1.1. Overview

The main aim of this project is to make a diverse subset of CEH’s data more accessible to scientists by integrating it into the NDG. NDG has defined a detailed metadata model [1], MOLES (Metadata Objects for Links in Environmental Science), and a data model [2], CSML (Climate Science Modelling Language). Both models are standards-based and are represented as XML schemas.
The models were produced in close collaboration with research organisations across various scientific domains. They are intentionally generic so that they can accommodate data from a wide range of scientific disciplines.

MOLES has been designed to allow production of the various ‘industry standard’ discovery formats, such as DIF, ISO 19115, FGDC/GEO and SensorML. It summarises the key points of the data and adds elements and relations that do not appear in the data. MOLES allows a smooth link from data browse to data usage. CSML provides the information about the data that processing and visualisation services need. It describes the data in semantic terms and virtualises it, removing the need to know the actual format of the data.

Integrating the EcoGrid data into NDG requires a mapping from that data to the NDG models. The test-bed datasets used for EcoGrid are diverse in nature, and mapping directly from them to NDG would be difficult for several reasons. First, an individual EcoGrid database table often does not constitute a dataset; a dataset may consist of several tables joined together. Second, current datasets are not always adequately described: certain metadata is implicit rather than being included explicitly. Third, there are some inconsistencies in how databases are used to document the data, both within each dataset and across datasets. Thus, it seems sensible to ‘clean up’ the EcoGrid datasets before mapping to NDG.

Given that EcoGrid may ultimately incorporate all CEH data and not just the three test-bed datasets, a sensible option is to create an intermediate layer that describes all CEH data in a consistent way. This intermediate layer will resolve conflicts in the base data by concentrating on semantics. Thus, data with different names but similar meaning will map to the same intermediate element, and data with similar names but different meaning will map to different intermediate elements.
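
The semantic resolution described above can be sketched as a lookup keyed on both dataset and source field name. This is only an illustration of the idea, not the EcoGrid implementation: every field name and intermediate element name below is hypothetical, and the real intermediate layer is a database rather than an in-memory table.

```python
# Hypothetical mapping table: (dataset, source field) -> intermediate element.
# None of these field or element names are taken from the EcoGrid databases.
SEMANTIC_MAP = {
    ("ECN", "PH_VALUE"): "water_ph",
    ("Lakes", "ACIDITY_PH"): "water_ph",  # different name, same meaning
    ("Lakes", "DEPTH"): "water_depth",
    ("CS", "DEPTH"): "soil_depth",        # same name, different meaning
}


def to_intermediate(dataset: str, field: str) -> str:
    """Return the intermediate element for a source field in a given dataset."""
    key = (dataset, field.upper())
    if key not in SEMANTIC_MAP:
        # Unmapped fields are reported rather than guessed at.
        raise KeyError(f"no intermediate element defined for {dataset}.{field}")
    return SEMANTIC_MAP[key]


def group_by_element(mapping: dict) -> dict:
    """List, for each intermediate element, the source fields that map to it."""
    groups = {}
    for (dataset, field), element in mapping.items():
        groups.setdefault(element, []).append(f"{dataset}.{field}")
    return groups


if __name__ == "__main__":
    for element, sources in group_by_element(SEMANTIC_MAP).items():
        print(element, "<-", ", ".join(sources))
```

Keying on the dataset as well as the field name is what lets two fields called DEPTH resolve to different elements, while differently named pH fields resolve to the same one.
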
Between them, the NDG data model and metadata model describe data and its origin from a number of angles, including: the activity (project, etc.) that the data relates to; the data production tools used to obtain the data; the observation stations of which these tools are a part; the data itself; and ‘usage’ metadata that concentrates on the semantics of the data. Because these models provide a comprehensive description of the data and its origin, and because the intermediate layer ultimately has to be mapped to NDG and thus must be ‘compatible’ with it, it made sense to draw heavily upon the NDG models in constructing the intermediate layer.

Once intermediate layer records have been created for all EcoGrid datasets, they can be mapped to the NDG. If one wishes to create EML records (for instance) from EcoGrid in the future, this can be done relatively easily, either from the intermediate layer (relational) or from the NDG records (hierarchical XML), depending on which model is preferable to map from.

2. Architecture

The architecture is illustrated in the figure. The EcoGrid data is held in the form of databases. The first stage is to convert this data into the intermediate layer format, which takes the form of an Oracle database. Thus, mappings are required between the base EcoGrid data and the intermediate layer. Initially, the intermediate layer is populated automatically via SAS scripts. However, there may be instances where this is not possible or where the intermediate layer is populated incorrectly. Thus, a web interface will be set up through which intermediate records can be manually edited. The intermediate layer records are then mapped into the NDG models using Oracle XML View. This transforms the data from a relational format to a hierarchical format.

It is important to ensure that the base EcoGrid data is mapped to the NDG models (via the intermediate layer) appropriately. Often the distinction between data and metadata is blurred.
In theory, metadata are descriptive data whereas data are physical measurements. In practice, however, measurements can be part of a descriptive coding system. For example, searching datasets by species of interest requires the individual species codes recorded in each dataset to be in the discovery metadata. Thus, species will feature in both MOLES records (from which discovery metadata is generated) and CSML records.

For discovery metadata such as species, it is important that EcoGrid recognises synonyms. Otherwise, a user searching the portal may not retrieve all relevant records. EcoGrid is collaborating with producers of relevant data dictionaries holding defined vocabularies to solve this problem. The Biological Records Centre (based at CEH Monks Wood) is working with the Natural History Museum to agree a common taxonomic classification, and EcoGrid follows developments there closely.

Also, a significant proportion of EcoGrid data relates to freshwater chemistry, where various parameters are measured and values recorded in appropriate units. To help describe datasets fully and aid comparison with other datasets both inside and outside CEH, it is important to reference units and parameters to standardised definitions in widely accepted dictionaries. CSML incorporates references to unit and ‘phenomenon’ dictionaries, so EcoGrid’s use of CSML will help with the issues of accessibility and interoperability.

3. Future

EcoGrid will take account of new technologies and changes to existing ones. Further work and research will be undertaken with other projects where possible. It would be desirable to extend EcoGrid to cover all CEH data and also to incorporate data held by the National Biodiversity Network, which focuses on Sites of Special Scientific Interest and is thus complementary to EcoGrid’s data. UK ecologists would like to collaborate with ecologists worldwide. For instance, creation of EML records should open up CEH’s data to a much wider audience.
Environmental data always has a spatial component. The current version of the NDG, with which EcoGrid is closely associated, will allow spatial searching of datasets. However, nothing more complex is permitted and no use is made of geographical information systems. The second stage of the NDG is due to begin shortly. Amongst other things, this will expand NDG’s spatial capabilities, with consequent benefits for EcoGrid.

4. References

[1] A specialised metadata approach to discovery and use of data in the NERC Data Grid. K O’Neill et al., Proceedings of the UK e-Science All Hands Meeting 2004. (http://www.allhands.org.uk/2004/proceedings)
[2] Climate Science Modelling Language: Standards-based markup for metocean data. 85th meeting of the American Meteorological Society, San Diego, Jan 2005.
[3] Grid Authorisation Framework for the CCLRC Data Portal. A Manandhar et al., Proceedings of the UK e-Science All Hands Meeting 2003.
[4] http://knb.ecoinformatics.org/software/eml/