CERIF for Datasets
D2.1
Metadata Ontology
Filename: D2.1 Metadata Ontology
Circulation: Restricted (PP)
Version: 2.0
Date: 21 February 2012
Stage: Final
Authors:
Sheila Garfield and Albert Bokma
Partners:
University of Sunderland
University of Glasgow
University of St Andrews
Engineering and Physical Sciences Research Council
Natural Environment Research Council
EuroCRIS
C4D is a project funded under JISC's Managing Research Data Programme
C4D
D2.1 Metadata Ontology
COPYRIGHT
© Copyright 2011 The C4D Consortium.
All rights reserved.
This document may not be copied, reproduced, or modified in whole or in part for any purpose without written
permission from the C4D Consortium. In addition to such written permission to copy, reproduce, or modify this
document in whole or in part, an acknowledgement of the authors of the document and all applicable portions
of the copyright notice must be clearly referenced.
This document may change without notice.
DOCUMENT HISTORY

Version  Issue Date   Stage  Content and changes
V1.0     14 Dec 2011  Draft  Table of Contents
V1.1     20 Jan 2012  Draft  Section revision following consultation with project partners
V2.0     20 Feb 2012  Final  Full revision following consultation with project partners
V2.1     29 Feb 2012         Updated version following inclusion of CSMD
Version 1.1 of 14 Jan 2012
D2.1 Metadata Ontology.doc
Page 2/x
1 BACKGROUND AND CONTEXT
The goal of the CERIF for Datasets (C4D) project is to extend CERIF to deal effectively with research datasets. CERIF is a standard for modelling and managing research information, with a current focus on recording information about people, projects, organisations, publications, patents, products, funding, events, facilities, and equipment. One aspect of increasing interest is research data that may be generated or used in the course of research activity, in the form of datasets. While other aspects of output and impact are already supported explicitly by CERIF, datasets are currently not fully supported in the same way. In some disciplines, considerable quantities of datasets are generated as an important output; these may also serve other researchers as input to their own research. This can significantly speed up the rate of discovery of new knowledge and results for future research, and be in itself a useful indicator of impact.
Despite similarities and overlaps with other forms of research output, datasets are sufficiently different to require special treatment. In this report we examine current developments in standards for datasets and metadata standards in order to explore:
• What data needs to be stored about datasets in terms of metadata
• How CERIF can be extended to support the storage of metadata about datasets
• How this metadata can be made discoverable, that is, how the metadata used to describe the dataset can be created in such a way that users can easily find out what datasets are available to meet their needs, based on typical retrieval criteria
In the following sections we look in turn at a number of important aspects. In section 2 we briefly examine the nature of datasets, before considering in section 3 the stakeholders, their expectations and common needs. In section 4 we examine existing metadata standards, followed by an exploration of existing datasets in section 5. Section 6 looks at common classification schemes, before we discuss possible extensions of CERIF. In section 7 we look at aspects of a possible ontology to provide a discovery mechanism, before rounding up our discussion and recommendations in section 8.
2 WHAT ARE DATASETS?
The term dataset was frequently used in relation to mainframe computing, where it referred to the input or output data of some computation. It has since been used to denote collections of data in other contexts, where the data is structured and can readily be put into tabular form, each row representing a record with a number of fields that hold data on various attributes. For the purpose of exchange, datasets can also be put into files or data streams using a mark-up language such as XML, or in a more basic form through formats such as CSV. In some cases a dataset may contain a single record; in most cases, however, there are at least several records in a dataset.
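To make the exchange forms concrete, the sketch below serialises the same small two-record dataset both as CSV and as simple XML mark-up; the field names and values are purely illustrative:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A hypothetical tabular dataset: each record has the same fields.
records = [
    {"id": "1", "species": "Halichoerus grypus", "count": "12"},
    {"id": "2", "species": "Phoca vitulina", "count": "7"},
]

# CSV: a basic, row-oriented exchange form.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "species", "count"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# XML: the same records marked up as elements, one <record> per row.
root = ET.Element("dataset")
for rec in records:
    row = ET.SubElement(root, "record")
    for name, value in rec.items():
        ET.SubElement(row, name).text = value
xml_text = ET.tostring(root, encoding="unicode")
```

Both forms carry the same row/field structure; the XML form is self-describing at the cost of verbosity, while CSV relies on the header row alone.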
One can distinguish different types of dataset. There are historic datasets derived from historic records (e.g. 17th-century ship logs). Alternatively, datasets can be based on recorded observations, whether through simple observation or as a result of some form of experiment (e.g. sea mammal studies using data recorders attached to the mammals). Finally, datasets can be generated as output from modelling and simulation, and there can be datasets that are purely synthetic, generated by some automaton and/or the application of randomisation. Some datasets are complete, with no subsequent updates to be expected, while others grow continuously as new data becomes available. Some datasets are part of a family of related datasets, perhaps where staged processing of raw data yields more refined results.
Finally, ownership and the associated rights and duties can be complex issues where datasets were not generated by researchers in the course of their own original research. If they were generated through collaborative efforts, there may be joint ownership or obligations and restrictions that apply (some SMRU datasets, for example, are jointly owned).
Datasets are very common in the sciences, but they also occur in a variety of other disciplines. The
list below shows some of these subject areas but is by no means exhaustive:
• Physics
• Geography
• Biology
• Medicine
• Social Sciences
• History
It should also be noted that there are datasets from library sources, national archives, and government departments which will not have arisen from academic research activities but may nevertheless be used by researchers in their research (e.g. ancient ship logs being used for climatological research, or datasets in national archives such as demographic data, going back as far as the Domesday Book).
In summary, a vast variety of datasets exists, and these are becoming increasingly available for further research. Akin to the open-source revolution in information technology, with its public licences, similar efforts are becoming evident in the sciences: public licences (such as Creative Commons) and the sharing of datasets through a variety of portals are becoming more popular.
3 STAKEHOLDERS
If CERIF is to provide a comprehensive method of efficiently sharing research information incorporating a variety of research outputs, it ought to cater for datasets. At present it does not have specific support for this, even if many aspects of CERIF can be reused. CERIF does not contain the actual research outputs, such as publications, patents or products, but records the metadata to describe, identify and locate them; likewise, it would not hold datasets themselves but rather the metadata to describe, identify and locate them.
Increasingly, governments expect that datasets generated from publicly funded research are made available to the public via e-government open data portals or via the media, such as science journalists, effectively making citizens potential stakeholders. In order to examine the implications
and specific needs for supporting datasets in CERIF, we first ought to consider the stakeholders and their respective perspectives:
• Researchers who own datasets
• Researchers who want to access existing datasets for their work
• Research organisations that jointly own or are responsible for publishing and curating datasets
• Funding bodies who fund projects where datasets are generated/used
Figure 1: Datasets and their Stakeholders (diagram relating Datasets to Researchers, Research Institutions and Funding Bodies)
3.1 Researchers who own datasets
In an age of increased information sharing, researchers are increasingly prepared to share the datasets they have used for their own research, for several reasons. They may hope that a readier sharing of these resources in the research community will enable them to use other researchers' datasets in return. They may also like to share datasets in order to create publicity, to show impact as part of esteem factors such as standing in their research community, and to enhance their personal rating during national research assessment exercises. While there are motivations to share datasets, there are also possible reservations: making datasets available is sometimes potentially detrimental to the work of the researcher, as it gives competitors a chance to catch up, or putting datasets into the public domain could endanger the work itself or the subject of the work (e.g. marine mammal movement patterns not being disclosed to commercial concerns that might hunt them). Furthermore, making datasets available, and by implication generating the metadata to describe them, can be cumbersome, and thus researchers can be reluctant to share their data. Another issue that can be difficult to decide is the possible need to select appropriate licensing arrangements under which a third party can make use of the datasets whilst protecting the Intellectual Property Rights (IPR) of the researcher that generated them, and ensuring that, at the very least, their authorship is duly acknowledged when the datasets are used.
3.2 Researchers who want to access datasets
Most researchers need access to data in order to progress their lines of research. Such data can be
difficult to come by, and researchers would probably welcome access to data that would improve
and speed up their research efforts. The main issue for them is to discover suitable datasets that
they can make use of. Researchers are expected to know what is going on in their field, their key
peers, and the work that has been done to date; with ever increasing amounts of research this can
be difficult to keep track of. Therefore it might also be useful to see the relations between
researchers, institutions, publications and datasets. This might be especially useful when a
researcher is branching into a new field. If all this information were available and navigable, it would be likely to attract interest from the research community. Given the fragmentation of information across different places, and the fact that researchers usually have to look in several places to piece together all the information they need, the comprehensiveness of such a collection and the resulting utility would be significant. Since datasets of interest may sometimes have been generated in a different discipline from that of the researcher who wants to use them, it will be important that such information is properly classified to make discovery across disciplines possible.
3.3 Research Organisations that are jointly responsible for datasets
While in many cases individual researchers will be responsible for their datasets, there is also an institutional role. The datasets in question may have been generated by researchers in their employ, and these organisations either feel duty bound or are contractually obliged to safeguard and make available the datasets in question. From an institutional perspective, this requires that datasets are suitably stored and backed up for curation purposes; that researchers take appropriate steps to implement their own curation or hand datasets over to the institution for curation; and that the correct metadata has been generated and published in the appropriate way and form. It is also important that appropriate licensing is applied, to ensure that the organisation has protected its IPR where appropriate, that work is suitably acknowledged, and that data is released in accordance with funder terms and conditions.
3.4 Funding bodies who fund projects where datasets are generated/used
Funding bodies include public, charitable and private bodies. They have to demonstrate the results and impact of their funding through the outcomes generated by the projects they fund. With increasing requirements of transparency on spending and impact, these bodies are interested in recording datasets alongside other forms of research outcome, and in showing the impact of the work they have funded to the public or their investors. Funders are increasingly taking note of datasets and may require the projects they fund to publish information about them, including the generation of metadata to record them in appropriate stores. It is therefore important that the correct metadata has been generated, that the impact of the work and the datasets can be assessed, and that the datasets, where appropriate, have been made available for further work or to allow others to reassess the work that was published about them. The C4D project includes the EPSRC and NERC research councils, which are forerunners among the UK research councils in the management of datasets. EPSRC currently maintains 2 data centres to curate datasets collected from the projects that produced them; NERC currently has 7 data centres for the same purpose. One key consideration, beyond wanting to show impact and make results available for further research, is legal requirements such as the EU INSPIRE Directive, which requires geospatial data to be curated and made available. Consequently, there are now contractual
requirements on fund holders to curate datasets within their own organisations or hand them over
to the funding body for curation as is currently the case with NERC.
3.5 Conclusions
Datasets are an important kind of research output and also an important input. Significant numbers of datasets are being generated and need to be managed for a number of reasons: preserving important data for the future, potentially enabling better research and accelerating the rate of discoveries, and assessing and demonstrating impact. The researchers interviewed by the project from within its member organisations were generally willing to share datasets, given suitable safeguards, but do not have a lot of time to generate quality metadata about them or to curate and administer them properly; they do want to be assured that appropriate use is being made of their data and that its use is acknowledged. Researchers would also like to be able to discover and access suitable datasets for their research activities. Consequently, there is a need for recording datasets with easy-to-use tools and publishing them in a way that facilitates their discovery. Organisations and funders (e.g. NERC and EPSRC) are also interested in having datasets properly published and managed, from the point of view of managing access and curation.
From the analysis of stakeholders a number of important considerations arise:
• The need for tools to make the generation of metadata easy
• The need for data management plans so that access restrictions can be implemented for sensitive datasets
• In some cases, the ability to publish only derivations that anonymise the data to protect the object of the dataset
• The ability to deal with families of datasets where these represent successive processing of data
• The ability to navigate the relationships between datasets, authors and publications, especially where datasets are evidence for claims made in publications
• The ability to show the impact of datasets in terms of researchers unconnected to the datasets or the projects out of which they arose
• The ability to discover datasets from within disciplines as well as across disciplines (e.g. marine mammals generating climatological data)
• The requirements of grant awarding bodies and their grant holders to publish outputs, and in particular datasets, effectively and to make them available and discoverable
4 EXISTING METADATA STANDARDS
According to NISO, "Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource" (Understanding Metadata, NISO Press, ISBN 1-880124-62-9). A typical example of metadata is the catalogue found in any library, which records key information about the collection, such as author, title and subject, as well as the location in the library.
Different types of metadata can be distinguished:
• Descriptive metadata to describe the resource for identification and retrieval purposes:
  o Description
  o Topic
  o Keywords
  o Geospatial location
  o Taxa
• Structural metadata to describe the structure of the resource and potentially relationships between elements of the resource:
  o Record structure
  o Fields
  o Units
  o Instrumentation used
• Administrative metadata to help manage the resource, such as:
  o Version
  o When the resource was added to the archive
  o Licensing
  o Encoding (e.g. XML, CSV, flat file)
  o Language
  o Data management plan
  o Conformity (standard name and version)
Metadata elements grouped into sets are called metadata schemes; the elements are the descriptor types available to be applied to the data. For every element in the scheme, the name and the semantics are specified; some schemes also specify the syntax in which the elements must be encoded, while others do not. The encoding allows the metadata to be processed by a computer program, and many current schemes use XML to specify their syntax.
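As a sketch of how a scheme's named elements and encoding work together, the following defines a tiny illustrative element set (not any particular standard) and encodes a record as XML so that a program can process it:

```python
import xml.etree.ElementTree as ET

# An illustrative metadata scheme: element names and whether each is required.
# These names are assumptions for the sketch, not a real standard's elements.
SCHEME = {
    "title": True,      # required
    "creator": True,    # required
    "subject": False,   # optional
    "language": False,  # optional
}

def encode_record(values):
    """Encode a metadata record as XML, checking required elements first."""
    missing = [name for name, required in SCHEME.items()
               if required and name not in values]
    if missing:
        raise ValueError(f"missing required elements: {missing}")
    root = ET.Element("metadata")
    for name, text in values.items():
        if name in SCHEME:  # ignore elements the scheme does not define
            ET.SubElement(root, name).text = text
    return ET.tostring(root, encoding="unicode")

record = encode_record({"title": "Sea mammal tracks", "creator": "SMRU"})
```

The scheme supplies the element names and obligations; the encoding step makes the record machine-processable, which is exactly the role XML plays for many of the standards surveyed below.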
Metadata schemes are developed and maintained by standards organisations (such as ISO) or by organisations that have taken on such responsibility (such as the Dublin Core Metadata Initiative). Many different metadata schemes have been developed, some of them designed for use across disciplines, while others are designed for specific subject areas, dataset types or archival purposes:
Standard    | Domain                  | Description
DDI         | Social science          | The Data Documentation Initiative is a standard for technical documentation describing social science data.
EAD         | Archiving               | Encoded Archival Description – a standard for archives and manuscript repositories.
CDWA        | Arts                    | Categories for the Description of Works of Art is a framework for describing and accessing information about works of art, architecture, and cultural resources.
VRA Core    | Arts                    | The Visual Resources Association Core is a standard for the description of works of visual culture as well as the images that document them.
Darwin Core | Biology                 | Darwin Core is a standard for the occurrence of species and the existence of specimens in collections.
ONIX        | Book industry           | Online Information Exchange – international standard for representing and communicating book industry product information in electronic form.
EML         | Ecology                 | Ecological Metadata Language is a standard developed for ecology.
IEEE LOM    | Education               | Learning Object Metadata – specifies the syntax and semantics of learning object metadata.
CSDGM       | Geographic data         | Content Standard for Digital Geospatial Metadata, maintained by the Federal Geographic Data Committee (FGDC).
ISO 19115   | Geographic data         | A geographic metadata standard that defines how to describe geographic information and associated services, including contents, spatial-temporal aspects, data quality, and access and rights to use.
e-GMS       | Government              | The e-Government Metadata Standard defines the metadata elements for information resources across public sector organisations in the UK.
TEI         | Humanities, social sciences and linguistics | Text Encoding Initiative – a standard for the representation of texts in digital form, chiefly in the humanities, social sciences and linguistics.
NISO MIX    | Images                  | The NISO Metadata for Images standard defines the technical data elements required to manage digital image collections.
MARC        | Librarianship           | MAchine Readable Cataloging – standards for the representation and communication of bibliographic and related information in machine-readable form.
MEDIN       | Marine data             | MEDIN is a partnership of UK organisations committed to improving access to marine data. MEDIN is a marine-themed standard to record information about datasets, compliant with international standards.
METS        | Librarianship           | Metadata Encoding and Transmission Standard – an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
MODS        | Librarianship           | Metadata Object Description Schema – a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
Dublin Core | Networked resources     | Dublin Core – interoperable online metadata standard focused on networked resources.
DOI         | Networked resources     | Digital Object Identifier – provides a system for the identification, and hence management, of information ("content") on digital networks, providing persistence and semantic interoperability.
DIF         | Scientific data sets    | Directory Interchange Format – a descriptive and standardized format for exchanging information about scientific data sets.
In the following we explore some of these standards in more detail:
• Marine Environmental Data Information Network (MEDIN) Discovery Metadata Standard
The MEDIN metadata schema is based on the ISO 19115 standard and includes all core INSPIRE metadata elements. It also complies with the UK GEMINI2.1 metadata standard. The MEDIN standard should therefore be considered an interpretation of GEMINI2.1 in which MEDIN has specified the use of certain term lists for specific elements (for example, to describe the spatial extent of the resource it is strongly recommended to use the SeaVox Sea Areas salt and freshwater body gazetteer) or, in some cases, has changed the obligation of an element (for example, from Conditional to Mandatory). In all cases, however, a MEDIN discovery metadata record is compliant with the GEMINI2.1 standard. The generated XML serialisations – that is, the conversion of an object into an XML stream that can be transported across processes and machines – conform to the ISO 19139 standard for XML implementations.
• Dublin Core (DC)
The Dublin Core (DC) Metadata Element Set was developed initially to describe web-based resources sufficiently to allow their discovery by search engines; it is an element set for describing a wide range of networked resources, focusing on bibliographic needs. It is a basic standard which can be easily understood and implemented, and it comprises 15 core elements encompassing the descriptive, administrative and technical elements required to uniquely identify a digital object. The elements are broad, however, and may therefore be refined by qualifiers to limit their semantic range and to provide further levels of detail. There are no cataloguing rules; this means that the content of the elements may not be uniform across applications, potentially reducing the re-usability of the content and making automatic searching difficult. Additionally, DC is limited in its ability to handle the geospatial aspects of data, that is, data which has an explicit or implicit geographic area and is associated with or referenced to some position or location on the surface of the earth.
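A minimal sketch of what a DC record looks like in XML, using the standard Dublin Core element-set namespace; the element values are hypothetical:

```python
import xml.etree.ElementTree as ET

# Namespace of the (unqualified) Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(fields):
    """Build a simple Dublin Core record using unqualified elements only."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

xml_text = dc_record({
    "title": "Grey seal telemetry 2010",    # hypothetical dataset
    "creator": "Sea Mammal Research Unit",
    "subject": "marine biology",
    "language": "en",
})
```

Note that nothing constrains the element content, which illustrates the point above: without cataloguing rules, two applications can fill the same element quite differently.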
• ISO 19115 Metadata Standard for Geographic Information
ISO 19115 defines the conceptual model required for describing geographic information and is widely used as the basis for geospatial metadata services. It provides a comprehensive set of metadata elements and describes all aspects of geospatial metadata, providing information about the identification, extent, quality, spatial and temporal elements, spatial reference, and distribution of digital geographic data, such as Geographic Information System (GIS) files, geospatial databases, earth imagery and geospatial resources including data catalogues, mapping applications, data models and related websites. It details which elements should be present, what information is required within each field, and whether the field is mandatory, optional or conditional. This metadata standard can be used for describing digital or physical objects or datasets which have a spatial dimension, that is, data directly or indirectly referenced to a location on the surface of the earth. However, because of the large number of metadata elements and the complexity of its data model, it can be difficult to implement.
• UK GEMINI
The UK Geospatial Metadata Interoperability Initiative (GEMINI) was originally developed through collaboration between the Association for Geographic Information (AGI), the e-Government Unit (eGU) and the UK Data Archive. The UK GEMINI Discovery Metadata Standard is a defined element set for describing geospatial, discovery-level metadata within the UK, designed to be compatible with the core elements of ISO 19115; Version 2.1 complies with the INSPIRE metadata set.
• UK GEMINI2
UK GEMINI2 is designed for use in a geospatial discovery metadata service. It specifies a core set
of metadata elements for describing geospatial data resources for discovery. The data resources
may be datasets, dataset series, services delivering geographic data, or any other information
resource with a geospatial content. Included in this are datasets that relate to a limited
geographic area and which may be graphical or textual, hardcopy or digital. GEMINI2 is a revised
version of GEMINI and is compatible with the requirements of the INSPIRE metadata
Implementing Rules (IR), conforming to the international metadata standard for geographic
information, ISO 19115.
• Infrastructure for Spatial Information in the European Community (INSPIRE)
INSPIRE provides a general framework for a Spatial Data Infrastructure (SDI). The INSPIRE metadata Implementing Rules define the minimum set of metadata elements necessary to comply with the INSPIRE Directive, which satisfy the need for the discovery of resources. These elements are: Resource title, Resource abstract, Resource type, Resource locator, Unique resource identifier, Resource language, Topic category, Keyword, Geographic bounding box, Temporal reference, Lineage, Spatial resolution, Conformity, Conditions for access and use, Limitations on public access, Responsible organisation, Metadata point of contact, Metadata date, and Metadata language. In essence it is a profile of ISO 19115 for discovery purposes, and it allows a variety of possible implementations.
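Because INSPIRE fixes a minimum element set, a discovery record can be checked for completeness mechanically; the sketch below paraphrases the 19 elements listed above as dictionary keys (the key names are our own, not normative identifiers):

```python
# The INSPIRE discovery metadata elements listed above, paraphrased as keys.
INSPIRE_ELEMENTS = [
    "resource_title", "resource_abstract", "resource_type", "resource_locator",
    "unique_resource_identifier", "resource_language", "topic_category",
    "keyword", "geographic_bounding_box", "temporal_reference", "lineage",
    "spatial_resolution", "conformity", "conditions_for_access_and_use",
    "limitations_on_public_access", "responsible_organisation",
    "metadata_point_of_contact", "metadata_date", "metadata_language",
]

def missing_elements(record):
    """Return the INSPIRE discovery elements absent (or empty) in a record."""
    return [e for e in INSPIRE_ELEMENTS if not record.get(e)]

# A deliberately incomplete record, to show the check in action.
record = {"resource_title": "Seabed survey", "keyword": "bathymetry"}
gaps = missing_elements(record)
```

A real implementation would validate against the ISO 19139 XML encoding rather than a dictionary, but the completeness logic is the same.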
• Darwin Core
The Darwin Core is primarily based on taxa, an ordered system that indicates relationships
among organisms, their occurrence in nature as documented by observations, specimens, and
samples, and related information. It is a content specification designed for data about the
geographical occurrences of species (i.e. the observation or collection of a particular species or
other taxonomic group at a particular location). The Darwin Core is based on the standards
developed by the Dublin Core Metadata Initiative (DCMI) and should be viewed as an extension
of the Dublin Core for biodiversity information.
The metadata standards listed above provide for the needs of a variety of disciplines. Some are more generic, such as Dublin Core, which originated from the domain of librarianship but has been widely used on its own or had features absorbed into other metadata standards; it provides a number of useful basic facilities for describing resources. Other standards provide more focused support for specific disciplines, such as Darwin Core for biology or INSPIRE, GEMINI and MEDIN for geospatial applications. The diverging needs, even just among the science disciplines, present a dilemma, as it will be difficult to find a single format that fits all these needs; at the least, optionality with respect to different elements is needed to accommodate key aspects. A variety of basic descriptions of the dataset will be the same, such as name, description, authorship, location, language, topic and keywords, while other aspects are discipline-specific, such as geospatial bounding boxes for the geospatial sciences, taxa for biology, chemical structures, etc.; these will be important not just for correct classification but also for eventual discovery.
It is also important to note that there are metadata standards at different levels of abstraction, with some designed more for the generation and management of dataset repositories. As repositories in some cases hold a variety of datasets from different disciplines (such as national archives and libraries), there are metadata standards that are focused more on archiving and librarianship, such as EAD, MARC, METS and MODS, as well as Dublin Core. At the other end of the scale there are also metadata standards for particular types of standardised datasets, which focus more on the detailed content of these datasets, such as IMMA for ship records:
• International Maritime Meteorological Archive (IMMA) Format
IMMA (Woodruff, 2003) is a flexible and extensible ASCII format designed for the archival and
exchange of both historical and contemporary marine data. The fields presently composing
IMMA have been organized into two different types of format components: the “Core” and
“attachments” (attm). The Core contains the most universal and commonly used marine data
elements (e.g., reported time and location, temperatures, wind, pressure, cloudiness, and sea
state). The Core is divided into two sections:
o “location” section: for report time/space location and identification elements, and other
key metadata
o “regular” section: for standardised data elements and types of data that are frequently
used for climate and other research
The Core forms the common front-end for all IMMA data and by itself forms a relatively concise
“record type”. The “attachments” contain additional, less universal data elements. Additional
record types are constructed by appending to the Core one or more attachments.
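The core-plus-attachments construction described above can be sketched as follows; the field names are illustrative stand-ins, not the actual IMMA element codes:

```python
# Illustrative sketch of IMMA-style record assembly: a common Core
# (location + regular sections) with zero or more attachments appended.
core = {
    "location": {"time": "1850-03-12T06:00", "lat": 48.5, "lon": -8.2,
                 "id": "SHIP42"},  # report time/space location and identification
    "regular": {"air_temp": 11.2, "pressure": 1013.4, "wind_speed": 7.0},
}

def build_record(core, *attachments):
    """Form a record type by appending attachments to the common Core."""
    return {"core": core, "attachments": list(attachments)}

# A concise record type: Core only.
basic = build_record(core)
# A richer record type: Core plus an (illustrative) extra attachment.
extended = build_record(core, {"attm": "icing", "ice_accretion": 0.3})
```

The design choice mirrors the standard's intent: every record shares the same front-end, so software that only understands the Core can still read every record.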
• Core Scientific Metadata Model (CSMD)
The Core Scientific Metadata Model is a model for the representation of scientific study metadata, developed within the Science and Technology Facilities Council (STFC) to represent the data generated from scientific facilities. It captures high-level information about scientific studies and the data that they produce. The model defines a hierarchical structure of scientific research around studies and investigations, with their associated information, and also a generic model of the organisation of datasets into collections and files. Specific datasets can be associated with the appropriate experimental parameters and with details of the files holding the actual data, including their location for linking. This provides a detailed description of the study. There are nine core entities defined for a study: Investigation, Investigator, Topic and Keyword, Publication, Sample, Dataset, Datafile, Parameter and Authorisation.
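The hierarchical organisation of studies, datasets and files can be sketched with a few of the core entities named above; the attributes shown are illustrative assumptions rather than the actual CSMD schema:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of the CSMD hierarchy: an investigation groups
# datasets, which in turn group the files holding the actual data.

@dataclass
class Datafile:
    name: str
    location: str  # where the actual data file can be found, for linking

@dataclass
class Dataset:
    name: str
    parameters: dict = field(default_factory=dict)  # experimental parameters
    datafiles: List[Datafile] = field(default_factory=list)

@dataclass
class Investigation:
    title: str
    investigators: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)

inv = Investigation(
    title="Neutron scattering run 7",  # hypothetical facility study
    investigators=["A. Researcher"],
    datasets=[Dataset("run7-raw",
                      parameters={"temperature_K": 293},
                      datafiles=[Datafile("run7.nxs", "/archive/run7.nxs")])],
)
```

Navigating from investigation to dataset to file is exactly the traversal a facility catalogue built on this model supports.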
In conclusion, supporting a variety of disciplines, not limited to the sciences and perhaps extending to the humanities, will require a flexible approach that can capture the key aspects of different disciplines that are important for correct classification, and thus for discovery of datasets along typical search criteria. Geospatial aspects will be needed in some disciplines, as will taxa, classifications of compounds/elements/molecules, or even extra-geospatial locations in others. The discipline-specific aspects are largely to do with what the dataset or resource is about and its context. At the same time there is also a considerable number of common features, fairly uniform across the disciplines, to do with what the resource is, what it is called, where it is located, how it is encoded, who owns it, and what conditions are attached to it.
The metadata standards used at NERC currently follow the MEDIN standard, as do INSPIRE and
GEMINI, so MEDIN would appear to be a logical starting point for a suitable standard for C4D. The
caveat, however, is that not all disciplines have geospatial references such as the bounding box
specified in MEDIN, and some have other requirements more specific to their respective
disciplines. This would require that the geospatial references become optional rather than
obligatory, and that other key aspects of other disciplines be introduced alongside, again as
optional, to allow a more meaningful representation for a variety of disciplines:
• geographic/political regions, place names and localities (e.g. occurrence of certain species)
• elevation (for some biological/geological datasets, for example)
• taxa for the biological sciences
• extra-spatial references for astronomy
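Read as a data structure, the requirement is a discovery record in which the bounding box and each discipline-specific descriptor are optional, so a record carries only what applies. A minimal sketch under that assumption (the field names are illustrative, not taken from MEDIN or any other standard):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BoundingBox:
    west: float
    east: float
    north: float
    south: float

@dataclass
class DatasetDiscoveryRecord:
    # Common administrative core, shared across disciplines
    title: str
    abstract: str
    responsible_party: str
    # Geospatial reference: optional rather than obligatory
    bounding_box: Optional[BoundingBox] = None
    # Discipline-specific aspects, introduced alongside as optional
    place_names: Optional[List[str]] = None    # e.g. species occurrence localities
    elevation_m: Optional[float] = None        # some biological/geological datasets
    taxa: Optional[List[str]] = None           # biological sciences
    celestial_reference: Optional[str] = None  # extra-spatial reference, astronomy

# A biological dataset populates taxa but may omit the bounding box;
# an oceanographic dataset would typically do the reverse.
seal_data = DatasetDiscoveryRecord(
    title="Grey seal dive profiles",
    abstract="Hydrographic data from animal-borne sensors",
    responsible_party="SMRU",
    taxa=["Halichoerus grypus"],
)
```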
Version 1.1 of 14 Jan 2012
D2.1 Metadata Ontology.doc
Page 12/x
C4D
D2.1 Metadata Ontology
MEDIN has considerable overlap with other standards in the administrative aspects, for example
with Dublin Core; in fact many standards have virtually identical provision for administrative
aspects but differ in the descriptive aspects. Consequently, a MEDIN-based standard would be a
useful starting point, while needing to be extended to cover key discipline-specific aspects such
as those described above.
What can be concluded from these considerations of metadata standards is a number of lessons:
• There are different metadata standards that have evolved to serve the needs of their respective
communities, and a way has to be found either to develop a more generic scheme that can work
in a variety of disciplines or to work with these existing standards to develop a higher-level
metadata scheme that can subsume them. Given the sheer complexity of the available
standards, it will however be more pragmatic to start with a widely used standard such as MEDIN
and assess the need to extend it to incorporate key criteria of other disciplines, such as taxa
for biologists.
• Another important outcome is that not all standards operate at the same level. Some focus on
the format and categories of data in the dataset, as in the case of IMMA; these contrast with
higher-level standards, such as MEDIN, that are less descriptive but represent more
administrative aspects; and these in turn contrast with much more general standards focused on
the management of repositories, like METS and MODS, or on general-purpose discovery, such as
Dublin Core.
• Consequently, it is important to recognise these differences and find a way of managing the
tension between the need for specific support of formats and disciplines and the need for more
general discoverability. In the end there may be a need to support a variety of low- and
medium-level standards in repositories (because of legacy issues) while still facilitating
discovery. This cross-disciplinary discoverability can be important, as the case of SMRU shows,
where biological datasets are used for climatological research. CSMD proposes a Core Scientific
Metadata Model with a three-layer architecture: metadata at the data level, a contextual level
above it, and a discovery level above that. We discuss this in more detail in section 7.
In conclusion, there is a need to support a variety of disciplines and to be mindful of the level
of metadata, distinguishing the three levels proposed in the CSMD initiative. This ensures a
clear separation between levels and enables eventual integration between the repositories holding
datasets and those holding research information for administrative purposes, with the capability
of discoverability layered over both for the purposes of impact analysis and of making data
retrievable and reusable.
5  EXISTING DATASETS OF C4D PARTNERS
To investigate the needs for metadata support for datasets in CERIF and to develop a concept
demonstrator in the course of C4D, the project will consider a range of datasets already
available from the project partners:
• The University of Sunderland currently co-owns or has access to several suitable research
datasets. Specifically, these include the Climatological Database for the World’s Oceans
(CLIWOC) and the JISC-funded UK Colonial Registers and Royal Navy Logbooks digitisation
project (CORRAL), providing historical and marine climatology datasets.
• At the University of St Andrews, the Sea Mammal Research Unit (SMRU) is a NERC
collaborative centre.
• At the University of Glasgow, the School of Geographical and Earth Sciences currently has
access to marine climatological datasets, primarily for the North Atlantic, extending back up
to 650 years.
Climatological Database for the World’s Oceans (CLIWOC): 1750 to 1850
The European Union funded CLIWOC project was the first ever to examine the scientific utility of
early, pre-instrumental ship’s logbook information. The database contains climatic information
contained in ships’ logbooks for the period 1750 to 1850, and more specifically climatological
information for the North and South Atlantic, the Indian and the Pacific Oceans. Entries are
chronological, day by day, and include identification by vessel and geographic location.
Information on wind direction, wind force and other recorded weather elements, such as
precipitation, fog, ice cover, and the state of sea and sky, is also included. Data were
abstracted, verified, and calibrated before inclusion in the database. In total, 1,624 logbooks
were digitised, comprising 273,269 observations from over 5,000 logbooks. The observations were
put into the database in the IMMA standard format (Woodruff, 2003).
UK Colonial Registers and Royal Navy Logbooks (CORRAL)
The Joint Information Systems Committee (JISC) funded CORRAL project is an imaging and
digitising project: it images ships’ logbooks of particular historic and scientific value and
digitises the meteorological observations in those logbooks. Digitising Royal Navy ships’
logbooks (from ships on voyages of scientific discovery and those in the service of the
Hydrographic Survey) and the coastal and island records contained in UK Colonial documents
provides meteorological recordings from marine sites back to the 18th century. The instrumental data
(mostly of air pressure and temperature) are of singular importance in representing marine
conditions.
The Sea Mammal Research Unit (SMRU), University of St Andrews
The Sea Mammal Research Unit (SMRU) carries out research on marine mammals using animal-borne
instruments that have been designed and implemented to provide in situ hydrographic
data from parts of the oceans where little or no other data are currently available. An important
aspect of these deployments is the provision of unique, linked biological and physical datasets
which can be used by marine biologists, who study these animals and also by biological
oceanographers. Incorporating animal-borne sensors into ocean observing systems provides
information about global ocean circulation and enhances our understanding of climate and the
corresponding heat and salt transports, and at the same time increases our knowledge about
the life history of the ocean’s top predators and their sensitivity to climate change.
School of Geographical and Earth Sciences, University of Glasgow
The School of Geographical and Earth Sciences currently has access to marine climatological
datasets. Modelling and measurements show that Atlantic marine temperatures are rising;
however, the low temporal resolution of models and restricted spatial resolution of
measurements (i) mask regional details critical for determining the rate and extent of climate
variability, and (ii) prevent robust determination of climatic impacts on marine ecosystems.
Historic sea water temperatures were obtained from algae-derived in situ temperature time-series
developed during the current investigation. Projected Sea Surface Temperatures (SSTs) from
2000–2040 for 57°07’53’’N 08°12’60’’W were obtained from Stendel et al. (2004).
The datasets represented by the project partners, whilst not being exhaustive, nevertheless
represent a good cross-section of disciplines and variety:
• Biological datasets
• Climatological datasets
• Maritime vessels datasets
The datasets also involve cross-disciplinary use cases, as with the marine mammal datasets being
of climatological interest. Supporting these in the C4D reference architecture should lead to
results that are relevant for a variety of disciplines and use cases, and that should in
principle be extensible to the remaining specialist areas.
6  CLASSIFICATION SCHEMES
In any discipline, the organisation and provision of relevant information is a significant challenge. It is
essential that users are able to access and assimilate a variety of information resources in order to
be able to carry out their tasks.
Classification is the placing of subjects into categories and provides a system for organising and
categorising knowledge. The purpose of classification is to bring together related items in a useful
sequence from the general to the specific enabling users to browse collections on a topic, either in
person or online.
Subject classification schemes describe resources by their subject and are a means of organising
knowledge in libraries and other information environments. Classification schemes differ from
other subject indexing systems, such as subject headings and thesauri, in that they try to create
collections of related resources in a hierarchical structure. They can thus aid information
retrieval by providing browsing structures for subject-based information: a user-friendly,
directory-type browsing structure makes finding and retrieving resources easier. However, the
usefulness of any browsing structure depends on the accuracy of the classification. Additionally,
updating classification schemes takes a long time, and because of their size they tend not to be
updated frequently.
Classification schemes have been developed and used in a variety of contexts and vary in scope
and methodology, but can be broadly divided into universal, national general, subject-specific
and home-grown schemes:
• Universal schemes offer coverage of all subject areas and are designed for worldwide usage,
e.g., the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and
the Library of Congress Classification (LCC)
• National general schemes offer coverage of all subject areas; however, they are generally not
well known outside their place of origin and have usually been designed for use in a single
country or language community, e.g., the Nederlandse Basisclassificatie (BC) - Dutch, and the
Sveriges Allmänna Biblioteksförening (SAB) scheme - Swedish
• Subject-specific schemes have been devised with a particular (international or national)
user group or subject community in mind, e.g., the National Library of Medicine (NLM)
Classification and the Engineering Information (EI) Classification Codes
• Home-grown schemes are devised for use in a particular service, which organises knowledge by
devising its own classification scheme, e.g., Yahoo
There are a number of classification schemes in use with a marked focus on library and information
science, which are designed to organise and classify the world of knowledge and its contents. The
most widely-used universal classification schemes are: Dewey Decimal Classification (DDC),
Universal Decimal Classification (UDC), and the Library of Congress Classification (LCC).
• Dewey Decimal Classification (DDC)
The DDC system is a universal classification scheme for organising general knowledge
that is continuously updated to keep pace with new developments and changes in
knowledge, and it is the most widely used classification system in the world (Mitchell
& Vizine-Goetz, 2009; OCLC, 2003). The DDC is used by libraries in more than 135
countries to organise and provide access to their collections, and DDC numbers feature
in the national bibliographies of more than 60 countries (Mitchell & Vizine-Goetz,
2009; OCLC, 2003). DDC is also used for other purposes, e.g., as a browsing
mechanism for resources on the web (OCLC, 2003).
The basic DDC classes are organised by academic disciplines or fields of study, divided into 10
main classes (see Table 1 below), which together cover the entire world of knowledge. Each
main class is further subdivided into 10 divisions, and each division is further subdivided into
10 sections (OCLC, 2003).
000  Computer science, information & general works
100  Philosophy & psychology
200  Religion
300  Social sciences
400  Language
500  Science
600  Technology
700  Arts & recreation
800  Literature
900  History & geography
Table 1: The 10 main classes (OCLC, 2003, p.7)
The DDC system distributes subjects according to context, an approach referred to as “aspect
classification”. As a result the same subject may be classed in more than one place in the
scheme, as shown by the example below. The DDC system is based on the ideas of relative
location, decimal notation, a relative index and detailed classification, with its main
principles being classification by discipline, a hierarchical structure and practicality.
Relative location
Subjects are ordered in a sequence and assigned a number; the books are marked with this
number and not the shelves.
Relative index
The relative index shows the relationship between subjects and the disciplines or in some
cases, the various aspects within disciplines in which they appear (Mitchell & Vizine-Goetz,
2009). For example, the relative index entries for Garlic are as follows:
Garlic                 641.3526
Garlic - botany        584.33
Garlic - cooking       641.6526
Garlic - food          641.3526
Garlic - garden crop   635.26
Garlic - pharmacology  615.32433
Within 641 (food and drink), garlic appears in food (641.3526) and cooking (641.6526). Garlic
also appears in botany (584); as a garden crop in agriculture (635.26); and in a subfield of
medicine, pharmacology (615.32433) (Mitchell & Vizine-Goetz, 2009).
Decimal notation
The decimal notation refers to the principle of dividing each class into ten sub-divisions and
each of these sub-divisions into another ten sub-divisions and so on. The first digit in each
three-digit number represents the main class. For example, the 600 class represents
technology. The second digit in each three-digit number indicates the division. For example,
600 is used for general works on technology, 610 for medicine and health, 620 for
engineering, 630 for agriculture, 640 for home economics and family living. The third digit in
each three-digit number indicates the section. Thus, 610 is used for general works on
medicine and health, 611 for human anatomy, 612 for human physiology, 613 for personal
health and safety.
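The positional reading of a three-digit number described above can be expressed as a small helper (a toy illustration of the notation only):

```python
def ddc_hierarchy(number: str) -> tuple:
    """Return the (main class, division, section) codes implied by a
    three-digit DDC number, reading one digit at a time."""
    assert len(number) == 3 and number.isdigit()
    main_class = number[0] + "00"  # first digit: main class, e.g. 600 technology
    division = number[:2] + "0"    # second digit: division, e.g. 610 medicine
    section = number               # third digit: section, e.g. 612 human physiology
    return main_class, division, section

# 612 (human physiology) sits in division 610 (medicine and health)
# of main class 600 (technology):
print(ddc_hierarchy("612"))  # ('600', '610', '612')
```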
• Library of Congress Classification (LCC)
The LCC system organises knowledge into 21 basic classes. Each class is identified by a single
letter of the alphabet (labelled A-Z except I, O, W, X, and Y). Most of these 21 classes are
further divided into more specific subclasses by adding one or two additional letters and a
set of numbers. Within each subclass, topics that are relevant to the subclass are arranged
from the general to the more specific. Relationships among topics are shown by indenting
subtopics under the larger topics rather than by the numbers assigned to them; in this respect
LCC differs from the more strictly hierarchical Dewey Decimal Classification system, where the
hierarchical relationships among topics are shown by numbers that can be continuously
subdivided.
• Universal Decimal Classification (UDC)
The UDC system is adapted from Melvil Dewey’s Decimal Classification. It is used
worldwide and is the world leader in multilingual classification schemes for all branches of
human knowledge. It has undergone extensive revision and development, resulting in a
flexible and efficient system for organising bibliographic records. The UDC system is suitable
for use with all branches of human knowledge in any kind of medium and is well suited to
multi-media information collections.
UDC has a hierarchical structure which divides knowledge into 10 classes (see Table 2
below); each class is further subdivided into its relevant parts, and each subdivision is
subdivided again, continuing in this way. UDC uses decimal notation, and the longer the
number, the more detailed the subdivision it represents. Consequently UDC is able to
represent not just straightforward subjects but also relations between subjects, as all
recorded knowledge is treated as a coherent whole built of related parts.
0  Science and Knowledge. Organisation. Computer Science. Information. Documentation. Librarianship. Institutions. Publications
1  Philosophy. Psychology
2  Religion. Theology
3  Social Sciences
5  Mathematics. Natural Sciences
6  Applied Sciences. Medicine. Technology
7  The Arts. Recreation. Entertainment. Sport
8  Language. Linguistics. Literature
9  Geography. Biography. History
Table 2: The 10 classes (UDC Consortium)
• UNESCO/SPINES thesaurus
The Science and Technology Policies Information Exchange System (SPINES) thesaurus was
developed by the United Nations Educational, Scientific and Cultural Organisation (UNESCO).
The SPINES thesaurus is a controlled and structured vocabulary for information processing in
the field of science and technology for policy-making, management and development and
permits the indexing of:
o The literature dealing with all aspects of science and technology policies
o Documents dealing with research and experimental development projects
o Literature and projects dealing with development in general, and more particularly
those which call heavily on the application of science and technology
• Frascati
The Frascati Manual is not only a standard for Research and Development (R&D) surveys in
the Organisation for Economic Co-operation and Development (OECD) member countries; it
has become a standard for R&D surveys worldwide as a result of initiatives by the OECD,
UNESCO, the European Union and various regional organisations. The Manual was first
issued nearly 40 years ago and deals exclusively with the measurement of human and
financial resources devoted to research and experimental development (R&D). The
Manual presents recommendations and guidelines on the collection and interpretation of
established R&D data. Additionally, it contains eleven annexes, which interpret
and expand upon the basic principles outlined in the Manual in order to provide additional
guidelines for R&D surveys or to deal with topics relevant to them.
For the purpose of C4D, it will be important to find a suitable classification scheme that
enhances the ability to classify datasets correctly and reliably so as to aid their discovery.
The Dewey Decimal Classification, Universal Decimal Classification, and Library of Congress
Classification schemes mentioned above, while universal, have a library bias: they are also
concerned with correctly shelving and finding items in a library, and they evolved in a period
when some disciplines did not yet exist. This gives rise to oddities such as different aspects of
computer science being organised into two locations at opposite ends of the classification
system under the Dewey system, for example. Another problem is insufficient specificity with
respect to sub-areas of different disciplines. At the same time, no universal and detailed system
exists that can be applied immediately, which may leave only the option of collecting
subject-specific classification schemes for specific disciplines. There are more generic schemes
with a universal focus that could be applied, such as PAIS, but these are again very coarse for
each specific subject area.
As far as the UK Research Councils are concerned, they already use classification systems for
the purpose of managing their activities, and in particular awards. NERC currently uses the
scheme at http://www.nerc.ac.uk/funding/application/topics.asp. Amongst the prevailing schemes
there are also the RCUK Subject Classification scheme and JACS from HESA
(http://www.hesa.ac.uk/dox/jacs/JACS_complete.pdf).
6.1  Implications for CERIF
Metadata describes the “who, what, where, when, why and how” of a dataset’s collection; it may
also describe the “quality” of the data. Consistent use of standard terminology for metadata
descriptors, identifiers or fields will help the understanding, integration, discovery and use of
datasets. Among existing metadata content standards there is potentially some overlap in the
terminology used. Thus a mapping (see Table 3 below) between selected standards from
different disciplines and potential CERIF entities was carried out in order to compare
terminology and its usage. The selected standards are:
• MEDIN – Marine Environmental Data Information Network Discovery Metadata Standard
• GEMINI2 – used in a geospatial discovery metadata service
• Dublin Core – developed initially to describe web-based resources
• Darwin Core – primarily based on taxa
• EML – Ecological Metadata Language, a metadata specification developed by, and for, the
ecology discipline
• DIF – Directory Interchange Format, a metadata format used to create directory entries
that describe scientific data
• DDI – Data Documentation Initiative, a descriptive metadata standard for social science data
• CSMD – the Core Scientific Metadata Model
The initial aim of the mapping is to identify the metadata content needs, as there may be content
details and elements from multiple standards that can be included to help users understand the
data. A secondary aim is to produce a minimum subset of terms for classification, administration
and discovery. Several of the standards, e.g. Darwin Core and EML, have very detailed
documentation, which for this first cut is too vast to study in any great detail.
These standards were chosen for comparison because they represent a wide range of disciplines
and will thus allow us to develop a generic solution usable across at least the majority of the
sciences. Some, such as Dublin Core, have already established wide acceptance. Finding a
solution that covers the core aspects of these standards and can map equivalent elements would
also make it easier to import metadata that has already been written in these standards into the
independent CERIF format and CERIF-compliant metadata repositories.
In order to take this further and create a generic framework in which to develop a CERIF mapping
there needs to be agreement on which core information is required for data users to discover, use
and understand the data. Following on from this, it is essential that a formal definition for each term
is agreed upon.
It can be seen from the table below that, with the exception of Darwin Core, all the selected
standards incorporate a term that relates to “why” the dataset was collected; e.g. MEDIN uses
‘Resource abstract’, Dublin Core uses ‘Description’ and DIF uses ‘Summary’. Likewise, all
standards incorporate details about “who”: the person and/or organisation responsible for
collecting and processing the data, and who should be contacted if there are questions about the
data, e.g. using terms such as ‘Responsible party/organisation’ (MEDIN/GEMINI2), ‘Personnel’ and
‘Originating Center’ (DIF) and ‘Publisher’ (Dublin Core). Most of the selected standards also
cover the “quality” aspect of the dataset as well as the “where”: the geographical location
and/or a description of the spatial characteristics of the data.
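These equivalences can be recorded as a simple crosswalk from each standard’s term to a common descriptor. The sketch below covers only the correspondences quoted in this section, and the common descriptor names on the right-hand side are our own, not drawn from any of the standards.

```python
# Crosswalk from (standard, element) pairs to a shared descriptor,
# limited to the equivalences discussed in the text.
CROSSWALK = {
    ("MEDIN", "Resource abstract"): "description",
    ("Dublin Core", "Description"): "description",
    ("DIF", "Summary"): "description",
    ("MEDIN", "Responsible party"): "responsible_party",
    ("GEMINI2", "Responsible organisation"): "responsible_party",
    ("DIF", "Personnel"): "responsible_party",
    ("DIF", "Originating Center"): "responsible_party",
    ("Dublin Core", "Publisher"): "responsible_party",
}

def common_term(standard: str, element: str) -> str:
    """Map a standard-specific element name onto the shared descriptor."""
    return CROSSWALK.get((standard, element), "unmapped")

print(common_term("DIF", "Summary"))      # description
print(common_term("Darwin Core", "Why"))  # unmapped
```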
Therefore, based on this initial mapping, it is possible to identify metadata elements that are
common among the chosen standards. These elements provide coverage of the necessary descriptive and
structural metadata elements required to aid discoverability of datasets. Some of the administrative
metadata elements e.g., metadata name, metadata version and metadata language, are primarily
incorporated only in the MEDIN, GEMINI2, Darwin Core, EML, and DIF standards and, at this time,
the CERIF result entity ‘Product’ is the most likely CERIF entity to which these can be mapped.
[Table 3: Mapping of metadata elements from the selected standards (MEDIN, GEMINI2, Dublin Core,
Darwin Core, EML, DIF, DDI and CSMD) to candidate CERIF entities and link entities, e.g.
cfResultProduct (cfResultProductName, cfResultProductDescription, cfResultProductKeywords),
cfDublinCore, cfGeographicBoundingBox, cfLanguage and cfMedium. Obligation keys: MEDIN and
GEMINI2 - M Mandatory, C Conditional, O Optional; EML - R Required, O Optional; DIF - R Required,
HR Highly Recommended, RC Recommended. The CSMD column carries metadata-category comments from
Keith Jeffery (MR - Metadata Record).]
CERIF is already able to store and communicate some metadata about datasets through the existing
cfResultProduct entity, as shown below, but this entity needs to be extended to cover all the
necessary metadata elements representing the key administrative and descriptive aspects, and
perhaps to a lesser degree the structural aspects:
Figure 2: CERIF CFResultProduct Entity Relationships
Conclusion:
Based on the mapping there are two prime CERIF entities that are likely candidates for storing
metadata – cfResultProduct and cfDublinCore – with other additional entities for specific aspects
of the metadata, e.g., cfGeographicBoundingBox, which appears to be ideally suited to providing
data about the extent of the resource. The entities cfResultProduct and cfDublinCore provide
good coverage of most of the descriptive and structural metadata elements required, and either
one or both would need to be extended to include the required administrative metadata elements
or, alternatively, a new CERIF entity needs to be created.
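As a rough sketch of this conclusion, a dataset’s metadata could be split across the two candidate entities plus the bounding-box entity. The attribute names below are loosely modelled on the entity names used in the mapping, not taken from the CERIF specification, and the example values are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Loosely modelled on the entity names in the mapping table;
# not the actual CERIF attribute definitions.
@dataclass
class CfResultProduct:
    name: str                       # cf. cfResultProductName
    description: str                # cf. cfResultProductDescription
    keywords: Optional[str] = None  # cf. cfResultProductKeywords

@dataclass
class CfDublinCore:
    resource_type: Optional[str] = None
    language: Optional[str] = None
    rights: Optional[str] = None

@dataclass
class DatasetMetadata:
    """Descriptive metadata split across the two candidate entities, with the
    geographic extent held separately as (west, east, north, south)."""
    product: CfResultProduct
    dublin_core: CfDublinCore
    bounding_box: Optional[tuple] = None

cliwoc = DatasetMetadata(
    product=CfResultProduct(name="CLIWOC",
                            description="Ships' logbook climate data, 1750-1850"),
    dublin_core=CfDublinCore(resource_type="dataset", language="en"),
)
```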
The C4D project will endeavour to build a solution that reuses existing CERIF entities where possible and adds new ones to support the metadata associated with datasets. The pilot implementation and the experiences gained will then be fed back into the EuroCRIS standardisation process, which can then consider the proposed extensions to CERIF.
6.2
Related Activities
With respect to the aim of building an infrastructure for the management of research information covering researchers, publications, projects and outcomes such as datasets, there are two current activities C4D should take note of:
• the Australian National Data Service (ANDS) and its Australian Research Data Commons (ARDC)
• the Research Councils UK (RCUK) and the Research Outcomes System (ROS)
Australian National Data Service (ANDS)
The Australian government has already started to implement such an infrastructure, which will be a useful model to learn from, given that it covers the metadata about these activities as well as access to resources such as datasets. The Australian initiative is known as the Australian National Data Service (ANDS); ANDS is leading the creation of a cohesive national collection of research resources and a richer data environment that will:
• Make better use of Australia’s research outputs
• Enable Australian researchers to easily publish, discover, access and use data
• Enable new and more efficient research
ANDS is concerned with data that is produced by researchers as well as data that is used by and
accessible to them. Thus, ANDS is enabling the transformation of data that are unmanaged,
disconnected, invisible and single use into structured collections that are managed, connected,
discoverable and reusable so that researchers can easily publish, discover, access and use research
data. To enable Australia’s research data to be transformed, ANDS is:
• Creating partnerships with research and data producing organisations through funded projects and collaborative engagements
• Delivering national services such as Research Data Australia and Cite My Data
• Providing guides and advice on managing, producing and reusing data
• Building a research data community of practice
• Building the Australian Research Data Commons (ARDC) – a cohesive collection of research resources, made available for community use, from all research institutions, to make better use of Australia's research outputs
Therefore the task of ANDS is to create the infrastructure to provide greater access to Australia’s
research data assets in forms that support easier and more effective data use and reuse, thereby
enabling researchers to benefit from the Australian Research Data Commons. The infrastructure:
• makes available feeds of data collection descriptions from a range of public sector agencies
• federates and makes visible the Data Commons
• enables data/metadata management and sharing for research producing institutions
• enables capture of data and metadata from research instruments, and
• allows users to fully exploit the data held in the commons
The ANDS approach is to engage in partnerships with research institutions to improve local data management, so that structured collections can be created and published. ANDS then connects those collections so that they can be found and used. These connected collections, together with the infrastructure, form the Australian Research Data Commons (see Figure 3 below):
Figure 3: Australian Research Data Commons (ARDC)
The Australian Research Data Commons (ARDC) is being created as a “meeting” place for researchers and data: to support the discovery of, and access to, research data held in Australian universities, publicly funded research agencies and government organisations for use in research, and to provide:
• A set of data collections that are shareable
• Descriptions of those collections
• An infrastructure that enables populating and exploiting the commons
• Connections between the data, researchers, research, instruments and institutions
This will bring information about Australian research data together and place it in context. It will
connect the data produced by research, as well as data needed to undertake research. The ARDC is
more than an index - it is a rich web of description. ANDS does not hold the actual data, but points to
the location where the data can be accessed. This combined information can then be used to help
people discover data in context. The intention is for the ARDC to be searched, viewed and accessed
in a number of ways.
Research Councils UK (RCUK)
RCUK is a strategic partnership of the UK Research Councils. RCUK was established in 2002 to enable
the Research Councils to work together more effectively to enhance the overall impact and
effectiveness of their research, training and innovation activities, contributing to the delivery of the
Government’s objectives for science and innovation.
RCUK are responsible for investing public money in research in the UK to advance knowledge and
generate new ideas which lead to a productive economy, healthy society and contribute to a
sustainable world.
The Research Councils are the public bodies charged with investing tax payers’ money in science and
research and, as such, they take very seriously their responsibilities in making the outputs from this
research publicly available – not just to other researchers, but also to potential users in business,
Government and the public sector, and also to the public. The Research Councils are committed to
the guiding principles that publicly funded research must be made available to the public and remain
accessible for future generations. Therefore, the Chief Executives of the Research Councils have
agreed that over time the UK Research Councils will support increased open access, by:
• building on their mandates on grant-holders to deposit research papers in suitable repositories within an agreed time period, and
• extending their support for publishing in open access journals, including through the pay-to-publish model
Thus a core part of the Research Councils’ remit is making research data available to users. RCUK are committed to a transparent and coherent approach, and their common principles on data policy provide an overarching framework for the individual Research Councils’ data policies. The information gathered is fundamental to the Research Councils strengthening their evidence base for strategy development, and crucial in demonstrating the benefits of Research Council funded research to society and the economy.
The Research Outcomes System (ROS) allows users to provide research outcomes to four of the
Research Councils – the Arts and Humanities Research Council (AHRC), the Biotechnology and
Biological Sciences Research Council (BBSRC), the Economic and Social Research Council (ESRC), and
the Engineering and Physical Sciences Research Council (EPSRC). The ROS can be used by these
Research Council grant holders to input outcomes information about their research. It can also be
used by Higher Education Institution (HEI) research offices to input research outcomes information
on behalf of grant holders and/or access the outcomes information of grant holders in their
institution.
Grant holders, research office managers and associated staff input outcomes information,
comprising both narrative and data, under nine different categories, or outcome types, as follows:
• Publications
• Other Research Outputs
• Collaboration/Partnership
• Dissemination/Communication
• IP and Exploitation
• Award/Recognition
• Staff Development
• Further Funding
• Impact
Each of these categories has various sub-sections, with explanatory guidance, to help users be as
specific as possible when inputting information. In addition, grant holders are asked to provide a
brief summary of the Key Findings arising from their research. This information can be made
available to research users to stimulate engagement and collaboration.
The ROS categories provide a comprehensive, hierarchical, set of elements for inputting information
with regard to outcomes from research. Within each of the top level outcome types (categories)
there are additional selection criteria or options available at a first tier and a second tier level. For
example, the top level “Other Research Outputs” outcome type has options, at the first tier, for
biological outputs, electronic outputs and physical outputs; for electronic outputs, at the second tier
level, there is the option of selecting ‘dataset’ which is described as “a structured record of the value
of variables that were measured as part of the research”.
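The tiered structure described above can be sketched as a simple nested mapping. This is an illustration only: just the branch quoted in the text is filled in, and the remaining eight top-level outcome types are elided.

```python
# Sketch of the tiered ROS outcome-type structure described above.
# Only the "Other Research Outputs" branch quoted in the text is
# populated; everything else is abbreviated.

ros = {
    "Other Research Outputs": {
        "biological outputs": [],
        "electronic outputs": ["dataset"],   # "a structured record of the
                                             #  value of variables measured
                                             #  as part of the research"
        "physical outputs": [],
    },
    # ... eight further top-level outcome types (Publications, Impact, ...)
}

def options(category, tier1):
    """Second-tier options under a top-level category and first-tier choice."""
    return ros.get(category, {}).get(tier1, [])

print(options("Other Research Outputs", "electronic outputs"))  # ['dataset']
```

A user interface walking grant holders through the hierarchy would simply present the keys of each level in turn.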
Conclusions:
Overall, the ROS categories present a good level of detail with regard to research outputs and could
provide the fine detail that will be needed for the C4D subject demonstrator. Ultimately, however,
these categories do not include all of the metadata elements that are covered by standards such as MEDIN, DIF and CSMD, nor do they cover any of the gaps that have been identified in other standards.
The goal of ANDS is to assist institutions and research communities to ensure they can operate
within this broader framework, so that the collective national investment in research data is made in
ways that ensure the data can be more widely discovered and reused. This goal is not unlike the
underlying rationale behind CERIF, which is the need to share research information across countries
or even between different funding bodies in the same country. The long-term data management
objective of ANDS is to deliver a discovery service that enables users to find data held in Australian
repositories by searching a national metadata store. And, whilst this does not involve the
development of a standard, there is again a similarity in this objective to the vision behind CERIF.
Whilst CERIF is a standard for modelling and managing research information its current focus is to
record information about people, projects, organisations, publications, patents, products, funding,
events, facilities, and equipment which are likely to be the same types of information recorded by
the ARDC.
C4D will continue to monitor these activities and take them into consideration when deriving its proposed solution concept. There are clear lessons to be learnt from them in terms of strategy; of particular note is the RCUK classification system, which will, for example, be incorporated into the C4D demonstrator.
7
ONTOLOGY REQUIREMENTS
The project will use CERIF to store and communicate the metadata of sample datasets, and integrate
this metadata with that held on research projects and research outputs available at the three
partner institutions. The partner institutions recognise that the next logical step is to develop the
infrastructure at each institution to integrate research data since all of the partners have very
limited capacity in this area at present; this project will extend the use of CERIF into the research
data management area.
The overall aim of the project is to develop a framework for incorporating metadata into CERIF such
that research organisations and researchers can better discover and make use of existing and future
research datasets, wherever they may be held. In order to demonstrate the facility of the approach,
a subject specific demonstrator will be built and this approach verified at the three partner HEIs,
each within their own research administration infrastructure.
As can be seen from the brief stakeholder analysis, datasets are an important kind of research
output and are also an important input for future research. Enabling better research and
accelerating the rate of discoveries and assessing and demonstrating impact are therefore key
considerations:
• Researchers would also like to be able to discover and access suitable datasets for their research activities
• Organisations and funders are also interested in having datasets properly published and in being able to explore the impact arising from other outputs and results related to datasets, and potentially the impact of the datasets themselves as they are subsequently used by other researchers and projects
Once datasets are fully supported in CERIF, and CERIF data stores – whether institutional and distributed, or centralised – are populated with the requisite research data about institutions, individuals, projects, publications and research results including datasets, these data stores will become a formidable information resource serving a variety of users and uses. As discussed in the stakeholder analysis above, there are three major perspectives that will need to be supported:
• discovery: the ability to reliably find resources based on typical retrieval criteria
• exploration: the ability to explore the context surrounding particular items of interest (people, projects, publications, results, datasets) to gain a better understanding of the related research activity
• linking: the ability to view this mesh of information as a whole even if parts of this information are located in different repositories
The aim is to develop an ontology, and an ontology-driven interface to CERIF data stores, providing the key features of import, discovery, exploration and linking. By modelling the key concepts in the CERIF data store, an interface can be built that enables the non-invasive retrieval of records, including across data stores, and the representation and navigation of links between entities – for example, to visualise the records surrounding a particular dataset, such as other datasets, researchers, publications and projects.
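The exploration feature can be sketched with plain link records. The snippet below is a conceptual illustration only, not the CERIF link-entity model; the entity names follow the style of Figure 4 below but the relations are invented.

```python
# Illustrative sketch: exploring the context surrounding a dataset via
# link records. Entity and relation names are invented examples.

links = [
    ("PersA", "creatorOf", "DS1"),
    ("Proj1", "produced",  "DS1"),
    ("Pub1",  "describes", "DS1"),
    ("PersB", "creatorOf", "DS2"),
    ("Pub2",  "cites",     "Pub1"),
]

def context(entity):
    """One-hop neighbourhood: every link record touching the entity."""
    return [(s, p, o) for s, p, o in links if entity in (s, o)]

for s, p, o in context("DS1"):
    print(f"{s} --{p}--> {o}")
```

Rendering such neighbourhoods graphically, and letting the user follow a link to recompute the neighbourhood of the clicked entity, gives exactly the navigable visualisation described above.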
In computer science an ontology formally represents concepts in a domain, and the relationships
between those concepts. It can be used to reason about the entities within that domain and may be
used to describe the domain. In theory, an ontology is a "formal, explicit specification of a shared
conceptualisation". An ontology renders shared vocabulary and taxonomy which models a domain
with the definition of objects and/or concepts and their properties and relations (e.g. we can define Felix as a cat, which is a subclass of the feline group of mammals, and John as a human being, also a subclass of mammals, and link Felix to John through the ownsPet relationship). Ontologies have a formal syntax (rules for how to compose them) and logical properties, which can be used to reason about them. They can be used to manage data or observations and to answer specific questions. They are a partial model that concentrates on specific competency questions determined by the application they are supposed to serve. The notion of shared conceptualisation refers to their being capable of shared development and use: ontologies are expected to be reused where possible, and new ontologies should reuse relevant parts of existing ones. They have been used in association with web pages, XML files and databases, which they can mark up so as to allow information to be extracted from them and combined from a variety of sources in a distributed environment. Ontologies are typically expressed in XML form as RDF or OWL, where OWL is the prevalent standard adopted by the W3C.
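The Felix/John example can be made concrete with a toy triple store. The sketch below is purely illustrative – plain Python tuples rather than real RDF/OWL tooling such as Jena or Protégé, with invented names – but it shows how reasoning over the subclass hierarchy lets further types of an individual be inferred:

```python
# Illustrative toy triple store for the Felix/John example in the text.
# Real systems would use RDF/OWL tooling; this merely shows the idea of
# inference over a subclass hierarchy.

triples = {
    ("Cat",    "subClassOf", "Feline"),
    ("Feline", "subClassOf", "Mammal"),
    ("Human",  "subClassOf", "Mammal"),
    ("Felix",  "type",       "Cat"),
    ("John",   "type",       "Human"),
    ("John",   "ownsPet",    "Felix"),
}

def superclasses(cls):
    """All classes reachable from cls via subClassOf (transitive closure)."""
    found, frontier = set(), {cls}
    while frontier:
        c = frontier.pop()
        for s, p, o in triples:
            if s == c and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.add(o)
    return found

def types_of(individual):
    """Asserted types plus those inferred through the class hierarchy."""
    asserted = {o for s, p, o in triples if s == individual and p == "type"}
    inferred = set()
    for t in asserted:
        inferred |= superclasses(t)
    return asserted | inferred

print(sorted(types_of("Felix")))   # ['Cat', 'Feline', 'Mammal']
```

Although only "Felix type Cat" is asserted, the reasoner also concludes that Felix is a Feline and a Mammal – the kind of logical consequence an OWL reasoner draws from subclass axioms.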
The benefit of using ontologies and semantic web techniques is the ability to express connections between entities even where these do not explicitly exist in the records, and to use these connections to link information together and potentially also derive missing information. In addition, when classifications are represented in the ontology in a taxonomic fashion, queries can be expanded or refined to yield better retrieval results.
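As a hedged illustration of the query-expansion idea, the sketch below uses an invented mini-taxonomy and invented dataset tags (and assumes the taxonomy is acyclic); a search on a broad classification term also retrieves resources filed under narrower terms:

```python
# Illustrative sketch of taxonomy-based query expansion. The taxonomy,
# dataset identifiers and tags are invented examples.

narrower = {  # broader term -> directly narrower terms (assumed acyclic)
    "Earth Sciences": ["Geophysics", "Oceanography"],
    "Geophysics":     ["Seismology"],
}

datasets = {  # dataset id -> classification tag
    "DS1": "Seismology",
    "DS2": "Oceanography",
    "DS3": "Chemistry",
}

def expand(term):
    """A term plus all transitively narrower terms."""
    terms = [term]
    for t in terms:                      # list grows as we walk down
        terms.extend(narrower.get(t, []))
    return set(terms)

def search(term):
    """Retrieve datasets tagged with the term or any narrower term."""
    wanted = expand(term)
    return sorted(ds for ds, tag in datasets.items() if tag in wanted)

print(search("Earth Sciences"))   # ['DS1', 'DS2']
```

A search for "Earth Sciences" thus finds the seismology and oceanography datasets even though neither carries that broad tag, which is precisely the improved recall that a taxonomic classification enables.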
[Figure 4 sketches an example context: projects (Proj1, Proj2), persons (PersA, PersB), datasets (DS1, DS2) and publications (Pub1, Pub2, Pub3) linked by arrows.]
Figure 4: Datasets and Related Information Context
Ontologies, as explicit, machine-processable models of a domain, have been in use for some time and have been formalised by the knowledge representation community. They are also part of the mainstream web, and several standards have been adopted by the W3C, including RDF and OWL. These are fully compatible with other web technologies, are typically serialised in XML, and can be applied in a variety of ways to introduce semantics and enable semantic applications to deliver a better and more context-sensitive response. They can either be used to further annotate existing XML resources or, in a completely non-invasive way, point at elements of XML resources or map onto a live relational database or web services to provide a semantic interface to them.
Figure 4 above shows how ontologies can be used to classify a variety of resources such as datasets,
publications, researchers and projects for instance and to search for them by resource type.
Ontologies also allow associations or relations to be expressed by which resources are linked
together as indicated in the diagram by arrows where researchers, projects and publications are
linked to datasets. By graphically displaying these as shown on the diagram the activity surrounding
datasets can be visualised and made navigable.
The current version of CERIF already supports semantics to some extent in terms of links between
entities using existing and extensible classification schemes through which entities such as person
can be linked to publications, projects, organisations and results to name but a few. This allows for
simple models to be generated including classes, class hierarchies, and properties. However, it does
not support the full range of constraints such as domain and range, quantification, disjoints,
functional properties, inverses and others that are part of the standard repertoire of ontology
constructs needed to provide more sophisticated services. This requires further exploration, and consideration of whether it is expedient to store the semantics with the data. The management of an ontology for the C4D project should therefore make use of existing entities and facilities, but will also need more dedicated tools – such as the generally accepted Protégé – to manage the generation and testing of a candidate C4D ontology.
To provide for a discovery, navigation and linking service it is therefore proposed to develop a
separate ontology that will be mapped to the data structure of the repository (and potentially to
other repositories elsewhere) and through which an interface is provided where users can search
the data store to retrieve resources and explore the links to related resources. This would facilitate
retrieval of resources from a range of data stores and allow a view of the combined information in a
uniform and transparent way. By augmenting and building upon the existing link entities, a more
comprehensive information retrieval and navigation service will be provided which will have
concrete benefits for the different users:
• ability to retrieve information from a variety of locations
• ability to constrain or expand retrieval according to user preference
• ability to navigate related information to explore activity and impact
The use of ontologies in conjunction with datasets, metadata about datasets and repositories is not
new and some current work has to be taken note of:
• the semantic layer in CERIF and the CERIF ontology for version 1.2
• the VIVO ontology for scientific networks (http://journal.webscience.org/316/), which is designed to create a scientific network
• DataStaR, a scientific data repository implementation driven by an ontology
• EPOS, the European Plate Observing System, the integrated solid Earth sciences research infrastructure
• CSMD, the Core Scientific Metadata Model
DataStaR is a science data “staging repository” developed by Albert R. Mann Library at Cornell
University that produces semantic metadata while enabling the publication of datasets and
accompanying metadata to discipline-specific data centers or to Cornell’s institutional repository.
DataStaR, which employs OWL and RDF in its metadata store, serves as a Web-based platform for
production and management of metadata and aims to reduce redundant manual input by reusing
named ontology individuals.
The VIVO project is creating an open, Semantic Web-based network of institutional ontology-driven
databases to enable national discovery, networking, and collaboration via information sharing about
researchers and their activities. The project has been funded by NIH to implement VIVO at the
University of Florida, Cornell University, and Indiana University Bloomington together with four
other partner institutions. Working with the Semantic Web/Linked Open Data community, the
project will pilot the development of common ontologies, integration with institutional information
sources and authentication, and national discovery and exploration of networks of researchers.
Building on technology developed over the last five years at Cornell University, VIVO supports the
flexible description and interrelation of people, organisations, activities, projects, publications,
affiliations, and other entities and properties. VIVO itself is an open source Java application built on
W3C Semantic Web standards, including RDF, OWL, and SPARQL.
The original CERIF ontology abstracted the CERIF logical model into OWL format, thus presenting a mirror image of the CERIF framework in OWL, and has been presented for discussion. A more up-to-date version is currently under development for CERIF 1.3, though it has not yet been released. CERIF itself already contains elements of ontologies in its semantic layer, in particular the link entities, classifications and classification schemes. CERIF is in principle capable of holding an ontology, though it does not in itself present a fully developed one.
The recent work of the VIVO project and DataStaR presents interesting avenues that will need to be explored to see how their ontologies can be adapted for the needs of C4D. What will be important is to develop an ontology that can represent datasets together with any associated researchers, organisations, projects, publications and any other form of result; in this way the impact of datasets can be explored. This will also need to be married with a suitable classification scheme to facilitate discovery. In line with current thinking on ontology development, ontologies should where possible reuse and adapt existing ontologies (see the NeOn methodology), and it will also be useful to study the application approach used in VIVO and DataStaR to learn from their experience. VIVO is also a member of EuroCRIS and thus related to the CERIF initiative.
Figure 5: The EPOS Infrastructure
The European Plate Observing System (EPOS) is the integrated solid Earth sciences research infrastructure approved by the European Strategy Forum on Research Infrastructures (ESFRI). EPOS faces a major data-handling challenge: the amount of primary data, and the demand for accessing it, are enormous and increasing. One main EPOS objective is the creation of a comprehensive, easily accessible geo-data volume for the entire European plate. Overall, EPOS will ensure secure storage
of geophysical and geological data providing the continued commitment needed for long-term
observation of the Earth.
EPOS aims to integrate data from permanent national and regional geophysical monitoring networks with observations from in-situ observatories, temporary monitoring campaigns and laboratory experiments. The EPOS infrastructure is based on existing, discipline-oriented, national data centres and data providers managed by national communities for data archiving and mining. Its cyber-infrastructure for data mining and processing (see Figure 5 above) also has facilities for data integration, archiving and exchange. The EPOS Core Services will provide the top-level service to users, including access to multidisciplinary data and metadata, virtual data from modelling and solid Earth simulations, data processing and visualisation tools, as well as access to high-performance computational facilities.
Core Scientific Metadata Model (CSMD)
The Core Scientific Metadata Model (CSMD) is a model for the representation of scientific study
metadata developed within the Science & Technology Facilities Council (STFC) to represent the data
generated from scientific facilities. The model has been developed to allow management of and
access to the data resources of the facilities in a uniform way.
The model defines a hierarchical model of the structure of scientific research around studies and
investigations, with their associated information, and also a generic model of the organisation of
datasets into collections and files (see figure 6 below). Specific datasets can be associated with the
appropriate experimental parameters, and details of the files holding the actual data, including their
location for linking. This provides a detailed description of the study, although not all information
captured in specific metadata schemas would be used to search for this data or distinguish one
dataset from another:
Figure 6: The CSMD Entities Infrastructure
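As an outline only, the hierarchy just described (study, investigation, dataset, file) might be modelled as below; the class and field names are simplified guesses for illustration, not the normative CSMD element names.

```python
from dataclasses import dataclass, field

# Outline sketch of the CSMD hierarchy described above (study ->
# investigation -> dataset -> file). Names are simplified guesses,
# not the normative CSMD element names.

@dataclass
class Datafile:
    name: str
    location: str            # where the actual data file can be fetched

@dataclass
class Dataset:
    name: str
    parameters: dict = field(default_factory=dict)   # experimental parameters
    files: list = field(default_factory=list)

@dataclass
class Investigation:
    title: str
    datasets: list = field(default_factory=list)

@dataclass
class Study:
    title: str
    investigations: list = field(default_factory=list)

run = Dataset("run-42", {"temperature_K": 293},
              [Datafile("run-42.dat", "https://data.example.org/run-42.dat")])
study = Study("Sample study", [Investigation("Beamline experiment", [run])])
print(study.investigations[0].datasets[0].files[0].name)   # run-42.dat
```

The point of the nesting is exactly what the text notes: datasets carry their experimental parameters and file locations, so the data itself can be linked from the study description.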
The models proposed by CSMD and EPOS are relevant to the efforts of C4D in that they suggest that, with respect to metadata coverage and focus, there needs to be provision for integrating with repositories that hold the actual datasets. Such repositories will cover the more low-level aspects of the metadata standards, which concern the format of the datasets, as opposed to the more administrative aspects concerned with context – and, in particular, with supporting discoverability if such collections of datasets are to be made effectively reusable, retrievable and browsable. The
CSMD entities structure is also a useful pointer in terms of relating key entities together and making these data collections more browsable and assessable, especially from the point of view of impact and impact analysis, be this for reporting purposes or for helping researchers understand the relationships between items of interest and the activities surrounding the datasets they may be looking at.
Another development that will need to be considered is the related ICAT project, which builds on the CSMD architecture and proposes a mechanism for seamless access to resources in a federated setting, linking together repositories of metadata or research information with resources such as the datasets themselves.
Figure 7: Architecture of ICAT
The ICAT architecture (shown above in Figure 7) is particularly noteworthy as it addresses federated systems that may hold parts of a much larger collection of information and data, where discoverability and the ability to merge data from different locations are key concerns. Through appropriate tagging and the use of keywords and taxonomies of scientific terms, seamless retrieval of resources can be enabled. This will then yield a more scalable and extensible architecture that is flexible for further extension in the future, with the ability to link repositories containing research metadata to the repositories that hold the actual artefacts that users may want to browse or retrieve.
Proposed C4D Discovery Architecture
In order to refine the proposed approach, and to make use of existing work to develop a comprehensive solution more quickly, we propose to base the application on R2O and ODEMapster.
R2O & ODEMapster is an integrated framework for the formal specification, evaluation, verification
and exploitation of the semantic mappings between ontologies and relational databases. This
integrated framework consists of:
• R2O, a declarative, XML-based language that allows the description of arbitrarily complex mapping expressions between ontology elements (concepts, attributes and relations) and relational elements (relations and attributes)
• the ODEMapster processor, which generates Semantic Web instances from relational instances based on the mapping description expressed in an R2O document. ODEMapster offers two modes of execution: query-driven upgrade (on-the-fly query translation) and a massive upgrade batch process that generates all possible Semantic Web individuals from the data repository
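The "massive upgrade" batch mode can be illustrated conceptually: each relational row becomes a Semantic Web individual, with triples derived from its columns. The sketch below is not R2O syntax or the ODEMapster API; the table, column and URI names are invented for illustration.

```python
# Conceptual sketch only: mapping relational rows to Semantic Web
# individuals, in the spirit of ODEMapster's batch "massive upgrade".
# Table, column and URI names are invented; this is not R2O syntax.

rows = [  # pretend result of a SELECT over a result-product table
    {"id": "rp1", "title": "Ocean temperature dataset"},
    {"id": "rp2", "title": "Seismic survey dataset"},
]

def upgrade(table_rows, cls):
    """Emit (subject, predicate, object) triples, one individual per row."""
    triples = []
    for r in table_rows:
        subj = "http://example.org/c4d#" + r["id"]   # invented base URI
        triples.append((subj, "rdf:type", cls))      # class membership
        triples.append((subj, "dc:title", r["title"]))  # column -> property
    return triples

for t in upgrade(rows, "c4d:Dataset"):
    print(t)
```

The query-driven mode differs only in that the translation happens on the fly, for the rows a particular query touches, rather than over the whole table in advance.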
Recently, in the context of the NeOn project, the ODEMapster plugin has been developed and included in the NeOn Toolkit. This plugin offers the user a GUI to create, execute and query R2O mappings. It has the advantage of providing a live connection to the database, as opposed to the D2RQ approach, which just offers a snapshot. The proposed architecture for the C4D project is shown in Figure 8 below:
[Figure 8 sketches the proposed architecture: a C4D–CERIF semantic browser and C4D Navigator (import, search and visualisation) built on a domain ontology, R2O mappings and an RDB ontology, with ODEMapster and Protégé connecting to the CERIF repository.]
Figure 8: Application of Metadata Ontology
The recommendation will be for:
• A suitable classification system to be built into the ontology for a classification taxonomy – it is likely that the classification already used by RCUK will be the key classification system to apply here, since it would make future integration easier between the metadata repositories and the repositories that administer/curate datasets, several of which already exist, such as within EPSRC and NERC.
• The remaining key entities to be specified together with their formal properties and relations, based on current CERIF and the new CERIF ontology as well as VIVO and DataStaR
• A suitable toolset to be chosen for the implementation of the application – currently the prevalent way of developing such applications is to use Protégé for editing purposes, together with the Jena toolkit for managing the ontology in a Java application to be developed; Jena also contains reasoning tools.
• The VIVO system architecture to be examined for reference purposes
• A database connector to be used for connecting the ontology application to data repositories such as databases. Currently there are two open-source solutions available for this, namely D2RQ and R2O/ODEMapster. The latter has the benefit of providing a live connection to a database, as opposed to creating a database image in the case of D2RQ, which is both cumbersome and impractical for use with large repositories/databases.
Figure 8 above shows how such a system architecture would work by linking to an existing metadata repository (CERIF) which holds the records on researchers, projects, publications and datasets. This repository would be accessed using ODEMapster. The application then uses the C4D ontology, maps it to the database using R2O mappings, and allows the database to be searched and data to be retrieved by interrogating the domain ontology. Depending on user preference, results can then be accessed through a traditional tabular search facility or through an interactive graphical representation that allows the collection of information to be navigated and browsed. A further benefit of this approach is that it could easily be extended to deal with a variety of repositories and allow the data to be combined for more unified and transparent navigation, although this would require additional effort to merge data from different sources. The VIVO project has not concentrated as much on datasets, but it is advisable to study the system they have developed, as it supports this inter-organisational perspective and provides an ontology-driven approach.
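Once such mappings are in place, discovery amounts to querying the domain ontology. A sketch of the kind of SPARQL query involved is given below; the c4d namespace and the assumption that datasets carry Dublin Core properties are illustrative, not part of the finalised C4D ontology.

```sparql
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX c4d: <http://example.org/c4d#>   # placeholder namespace

# Find datasets whose title mentions "climate", together with their creators
SELECT ?dataset ?title ?creator
WHERE {
  ?dataset a c4d:Dataset ;
           dc:title   ?title ;
           dc:creator ?creator .
  FILTER regex(?title, "climate", "i")
}
```

In the architecture described above, the mapping layer would rewrite such a query into SQL against the CERIF database, so the user never queries the relational schema directly.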
For the purpose of deriving a C4D solution, there are a number of services we will also examine to help with the ontology construction:
 http://ontologydesignpatterns.org/ provides a repository of shared ontologies where there
may be useful components that can be used in the construction of the C4D ontology
 http://semanticweb.org/wiki/GoodRelations is another frequently used repository of
ontology fragments to do with managing relations
 http://owl.cs.manchester.ac.uk/modularity/ is a service that allows modules to be easily extracted from existing ontologies for reuse
8
SUMMARY AND RECOMMENDATIONS
From the study of CERIF and the current developments touched upon in this report the following key
points should guide a future implementation of support for datasets in CERIF:
 Make use of the existing cfResultProduct entity in CERIF, extending it to support any additional elements for datasets and related metadata identified from the survey of metadata standards:
o Build a metadata profile for dataset support based on MEDIN, making the geospatial aspects optional
o Add further elements required in other scientific disciplines such as biology, physics and chemistry
o Make good use of Dublin Core where possible, as it enjoys widespread acceptance and its elements are already used both in several standards and in CERIF
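To illustrate the Dublin Core point, a minimal dataset metadata record expressed with DC elements might look as follows; all the values are invented for illustration only.

```xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Illustrative values only; not a real dataset record -->
  <dc:title>Example ocean temperature dataset</dc:title>
  <dc:creator>A. Researcher</dc:creator>
  <dc:date>2011-06-01</dc:date>
  <dc:type>Dataset</dc:type>
  <dc:subject>oceanography</dc:subject>
</metadata>
```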
 Develop a candidate ontology for use with the envisaged application, supporting entry of dataset metadata, bulk import of metadata, and discovery of datasets and dataset-related activity:
o based on a study and potential reuse of aspects of the VIVO and DataStaR
ontologies where these support datasets and key entities associated with datasets
o as well as making use of the newly proposed CERIF ontology once available, and maintaining compatibility with semantic aspects already implemented in CERIF as part of the Semantic Layer, the Link Entities, and the Classifications and Classification Schemes
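As a sketch of what a fragment of such a candidate ontology might look like in OWL, consider the following; the namespaces, class names and property names are illustrative assumptions pending the official CERIF ontology, not definitive C4D modelling decisions.

```turtle
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix cerif: <http://example.org/cerif#> .  # placeholder for the CERIF ontology namespace
@prefix c4d:   <http://example.org/c4d#> .    # placeholder C4D namespace

# A dataset modelled as a specialisation of the CERIF result product
c4d:Dataset a owl:Class ;
    rdfs:subClassOf cerif:ResultProduct .

# Link a dataset to the project in which it was generated
c4d:generatedBy a owl:ObjectProperty ;
    rdfs:domain c4d:Dataset ;
    rdfs:range  cerif:Project .
```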
 Study the VIVO ontology application to refine the approach for an ontology-enabled application that can be used for:
o Importing existing dataset metadata into CERIF
o Entering dataset metadata into CERIF by hand via data entry tools
o Searching for datasets through a discovery facility
o Visualising relations between datasets and key entities in a graphical and interactive fashion
o The resulting application should be compared with a working prototype already available at Sunderland through another project in order to decide on the best solution
 Select an existing classification scheme for classifying datasets and improving their discovery during retrieval or browsing:
o Use either the RCUK classification scheme or HESA's JACS codes
o There are currently attempts to merge JACS with RCUK, though this work is expected to complete after the end of the C4D project
o The RCUK classification scheme will gain greater exposure as part of the RCUK Gateways to Research project, announced in the BIS Research and Innovation Strategy in December 2011; it is therefore advisable to work with this scheme in the interim in C4D.
REFERENCES
Buil-Aranda, C. and Corcho, Óscar and Krause, Amy (2009) Robust service-based semantic querying
to distributed heterogeneous databases. In: 20th International Workshop on Database and
Expert Systems Application, DEXA2009, 31/08/2009 - 04/09/2009, Linz, Austria.
Connaway, Lynn Silipigni, Timothy J. Dickey, and Marie L. Radford. (2011) "'If It Is Too Inconvenient,
I'm Not Going After it:' Convenience as a Critical Factor in Information-Seeking Behaviors."
Library and Information Science Research, 33: 179-190. Pre-print available online at:
http://www.oclc.org/research/publications/library/2011/connaway-lisr.pdf
Connaway, Lynn Silipigni, and Timothy J. Dickey (2010) Towards a Profile of the Researcher of Today:
What Can We Learn from JISC Projects? Common Themes Identified in an Analysis of JISC
Virtual Environment and Digital Repository Projects. Available online at: http://ie-repository.jisc.ac.uk/418/2/VirtualScholar_themesFromProjects_revised.pdf
Gruber, Thomas R. (June 1993). "A translation approach to portable ontology specifications"
(PDF). Knowledge Acquisition 5 (2): 199–220.
INSPIRE Metadata Regulation (Commission Regulation (EC) No 1205/2008), 3 December 2008
(see http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32008R1205:EN:NOT)
INSPIRE Metadata Implementing Rules: Technical Guidelines based on EN ISO 19115 and EN ISO
19119 v1.2 2010-06-16
(http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_ISO_v1_2_20100616.pdf)
Khan, Huda, Brian Caruso, Jon Corson-Rikert, Dianne Dietrich, Brian Lowe, and Gail Steinhart (2010)
Using the Semantic Web Approach for Data Curation. Proceedings of the 6th International
Conference on Digital Curation. December 6-8, 2010. Chicago, IL
http://hdl.handle.net/1813/22945
Lowe, Brian (2009) DataStaR: Bridging XML and OWL in Science Metadata Management. Metadata
and Semantics Research 46: 141-150.
http://www.springerlink.com/content/q0825vj78ul38712/
Matthews, Brian, Shoaib Sufi, Damian Flannery, Laurent Lerusse, Tom Griffin, Michael Gleaves, and Kerstin Kleese (2010) Using a Core Scientific Metadata Model in Large-Scale Facilities. The International Journal of Digital Curation, Vol 5, Issue 1. ISSN: 1746-8256
Flannery, D.; Matthews, B.; Griffin, T.; Bicarregui, J.; Gleaves, M.; Lerusse, L.; Downing, R.; Ashton, A.;
Sufi, S.; Drinkwater, G.; Kleese, K.; , "ICAT: Integrating Data Infrastructure for Facilities Based
Science," e-Science, 2009. e-Science '09. Fifth IEEE International Conference on , vol., no.,
pp.201-207, 9-11 Dec. 2009
doi: 10.1109/e-Science.2009.36
Mitchell, J.S., & Vizine-Goetz, D. (2009) Dewey Decimal Classification. In: Encyclopedia of Library and
Information Science, Third Edition, Bates, M.J., & Niles, M. (eds.). Maack. Boca Raton, Fla.: CRC
Press. Pre-print available online at:
http://www.oclc.org/research/publications/library/2009/mitchell-dvg-elis.pdf.
Version 1.1 of 14 Jan 2012
D2.1 Metadata Ontology.doc
Page 36/x
C4D
D2.1 Metadata Ontology
Steinhart, Gail (2011) DataStaR: A Data Sharing and Publication Infrastructure to Support Research.
Agricultural Information Worldwide 4(1):16-20.
http://journals.sfu.ca/iaald/index.php/aginfo/article/view/199
Stendel, M., Schmith, T., Roeckner, E., & Cubasch, U. (2004) IPCC_ECHAM4OPYC_SRES_A2_MM. World Data Centre for Climate. doi:10.1594/WDCC/IPCC_EH4_OPYC_SRES_A2_MM
Suarez-Figueroa, Mari Carmen and Gomez-Perez, Asuncion (2009) NeOn Methodology for Building
Ontology Networks: a Scenario-based Methodology. Proceedings of the International
Conference on Software, Services & Semantic Technologies (S3T 2009)
SUMMARIES DDC Dewey Decimal Classification (2003) Dewey Decimal Classification and Relative
Index, Edition 22 (DDC 22), OCLC Online Computer Library Center, Inc. ISBN 0-910608-71-7
UDC Consortium, http://www.udcc.org/about.htm accessed: 18/01/2012
UK GEMINI2 Standard Version 2.1, AGI, August 2010. (see www.gigateway.org.uk)
Krafft, Dean B., Nicholas A. Cappadona, Brian Caruso, Jon Corson-Rikert, Medha Devare, and Brian J. Lowe. VIVO: Enabling National Networking of Scientists. Available at: http://journal.webscience.org/316/2/websci10_submission_82.pdf
Woodruff, S.D. (Ed) (2003) Archival of data other than in IMMT format. Proposal: International
Maritime Meteorological Archive (IMMA) Format. Update of JCOMM-SGMC-VIII/Doc. 17,
Asheville, NC, USA 10-14 April 2000. Available from http://www.cdc.noaa.gov/coads/edoc/imma/imma.pdf