CERIF for Datasets
D2.1 Metadata Ontology

Filename: D2.1 Metadata Ontology
Circulation: Restricted (PP)
Version: 2.0
Date: 21 February 2012
Stage: Final
Authors: Sheila Garfield and Albert Bokma
Partners: University of Sunderland, University of Glasgow, University of St Andrews, Engineering and Physical Sciences Research Council, Natural Environment Research Council, EuroCRIS

C4D is a project funded under JISC's Managing Research Data Programme

COPYRIGHT

© Copyright 2011 The C4D Consortium. All rights reserved. This document may not be copied, reproduced, or modified in whole or in part for any purpose without written permission from the C4D Consortium. In addition to such written permission to copy, reproduce, or modify this document in whole or in part, an acknowledgement of the authors of the document and all applicable portions of the copyright notice must be clearly referenced. This document may change without notice.

DOCUMENT HISTORY

Version  Issue Date   Stage  Content and changes
V1.0     14 Dec 2011  Draft  Table of Contents
V1.1     20 Jan 2012  Draft  Section revision following consultation with project partners
V2.0     20 Feb 2012  Final  Full revision following consultation with project partners
V2.1     29 Feb 2012         Updated version following inclusion of CSMD

Version 1.1 of 14 Jan 2012 D2.1 Metadata Ontology.doc Page 2/x

1 BACKGROUND AND CONTEXT

The goal of the CERIF for Datasets (C4D) project is to extend CERIF to deal effectively with research datasets. CERIF is a standard for modelling and managing research information, with a current focus on recording information about people, projects, organisations, publications, patents, products, funding, events, facilities, and equipment. One aspect of increasing interest is the research data that may be generated or used in the course of research activity, in the form of datasets.
While other forms of output and impact are already explicitly supported by CERIF, datasets are not yet supported in the same way. In some disciplines considerable quantities of datasets are generated and are an important output; they may also serve other researchers as useful input to their own research. This can significantly speed up the rate of discovery of new knowledge and results for future research, and is in itself a useful indicator of impact. Despite similarities and overlaps with other forms of research output, datasets are sufficiently different to require special treatment. In this report we examine current developments in standards for datasets and metadata standards in order to explore:

- what data needs to be stored about datasets in terms of metadata;
- how CERIF can be extended to support the storage of metadata about datasets;
- how this metadata can be made discoverable, that is, how the metadata describing a dataset can be created in a way that allows users to easily find out what datasets are available to meet their needs, based on typical retrieval criteria.

In the following sections we look in turn at a number of important aspects. In section 2 we briefly examine the nature of datasets, before considering in section 3 the stakeholders, their expectations and common needs. In section 4 we examine existing metadata standards, followed by an exploration of existing datasets in section 5. Section 6 looks at common classification schemes, before we discuss possible extensions of CERIF. In section 7 we look at aspects of a possible ontology to provide a discovery mechanism, before rounding up our discussion and recommendations in section 8.

2 WHAT ARE DATASETS?

The term dataset was frequently used in relation to mainframe computing, where it referred to the input or output data of some computation.
Datasets have since come to denote collections of data in other contexts as well, where the data is structured and can readily be put into tabular form: each row represents a record with a number of fields that hold data on various attributes. For the purpose of exchange, datasets can also be put into files or data streams using a mark-up language such as XML, or in a more basic form through formats such as CSV. In some cases a dataset may contain a single record, but in most cases there are at least several records.

One can distinguish different types of datasets. There are historic datasets derived from historic records (e.g. 17th-century ship logs). Alternatively, datasets can be based on recorded observations, whether through simple observation or as the result of some form of experiment (e.g. sea mammal studies using data recorders attached to the mammals). Datasets can also be generated as output from modelling and simulation, and there are datasets that are purely synthetic, generated by some automaton and/or the application of randomisation. Some datasets are complete, with no subsequent updates to be expected, while others grow continuously as new data becomes available. Some datasets are part of a family of related datasets, for example where staged processing of raw data yields successively more refined results. Finally, ownership and the associated rights and duties can be complex issues where datasets were not generated by researchers in the course of their own original research. If they were generated through collaborative efforts, there may be joint ownership or obligations and restrictions that apply (SMRU datasets are jointly owned in some cases). Datasets are very common in the sciences, but they also occur in a variety of other disciplines.
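The tabular structure and the CSV/XML exchange forms described above can be sketched briefly as follows; the records and field names below are invented purely for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A small, invented dataset: each row is a record, each field an attribute.
records = [
    {"id": "1", "date": "1751-03-02", "lat": "48.1", "lon": "-5.3", "wind": "NW"},
    {"id": "2", "date": "1751-03-03", "lat": "47.9", "lon": "-6.0", "wind": "W"},
]

def to_csv(rows):
    """Serialise the records in the basic row-oriented CSV exchange form."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_xml(rows):
    """Serialise the same records using XML mark-up for richer exchange."""
    root = ET.Element("dataset")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for name, value in row.items():
            ET.SubElement(rec, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Both serialisations carry the same records; XML makes the field names explicit in every record, while CSV states them once in a header row.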
The list below shows some of these subject areas but is by no means exhaustive:

- Physics
- Geography
- Biology
- Medicine
- Social Sciences
- History

It should also be noted that there are datasets from library sources, national archives, and government departments which did not arise from academic research activities but may nevertheless be used by researchers in their research (e.g. ancient ships' logs used for climatological research, or datasets in national archives such as demographic data going back as far as the Domesday Book). In summary, a vast variety of datasets exist and they are becoming increasingly available for further research: akin to the open source revolution in information technology with its public licences, similar efforts are becoming evident in the sciences, with public licences (such as Creative Commons) and the sharing of datasets in a variety of portals becoming more popular.

3 STAKEHOLDERS

If CERIF is to provide a comprehensive method of efficiently sharing research information incorporating a variety of research outputs, it ought to cater for datasets. At present it does not have specific support for this, even if many aspects of CERIF can be reused. CERIF does not contain the actual research outputs, such as publications, patents or products, but records the metadata to describe, identify and locate them; likewise it would not hold datasets themselves but rather the metadata to describe, identify and locate them. Increasingly, governments expect that datasets generated from publicly funded research are made available to the public via e-government open data portals or via the media, such as science journalists, effectively making citizens potential stakeholders.
In order to examine the implications and specific needs for supporting datasets in CERIF, we first ought to consider the stakeholders and their respective perspectives:

- Researchers who own datasets
- Researchers who want to access existing datasets for their work
- Research organisations that jointly own or are responsible for publishing and curating datasets
- Funding bodies who fund projects where datasets are generated/used

Figure 1: Datasets and their Stakeholders

3.1 Researchers who own datasets

In an age of increased information sharing, researchers are increasingly prepared to share the datasets they have used for their own research, for several reasons. They may hope that a readier sharing of these resources in the research community will enable them to use other researchers' datasets in return. They may also wish to share them in order to create publicity, show impact as part of esteem factors such as standing in their research community, and enhance their personal rating during national research assessment exercises. While there are motivations to share datasets, there are also possible reservations: sometimes making datasets available is potentially detrimental to the work of the researcher, as it gives competitors a chance to catch up, or putting them into the public domain could endanger the work itself or the subject of the work (e.g. marine mammal movement patterns not being disclosed to commercial concerns that might hunt them). Finally, making datasets available, and by implication generating the metadata to describe them, can be cumbersome, and thus researchers can be reluctant to share their data.
Another issue that can be difficult to decide is the possible need to select appropriate licensing arrangements under which a third party can make use of the datasets, whilst protecting the Intellectual Property Rights (IPR) of the researcher who generated them and ensuring that, at the very least, their authorship is duly acknowledged when the datasets are used.

3.2 Researchers who want to access datasets

Most researchers need access to data in order to progress their lines of research. Such data can be difficult to come by, and researchers would probably welcome access to data that would improve and speed up their research efforts. The main issue for them is to discover suitable datasets that they can make use of. Researchers are expected to know what is going on in their field, their key peers, and the work that has been done to date; with ever increasing amounts of research this can be difficult to keep track of. It might therefore also be useful to see the relations between researchers, institutions, publications and datasets. This could be especially useful when a researcher is branching into a new field. If all this information were available and navigable, it would be likely to attract interest from the research community. Given the fragmentation of information across different places, and the fact that researchers usually have to look in several places to piece together all the information they need, the comprehensiveness of such a collection and its resulting utility would be significant. Given that datasets of interest may sometimes have been generated in a different discipline from that of the researcher who wants to use them, it will be important that such information is properly classified to make discovery across disciplines possible.
3.3 Research organisations that are jointly responsible for datasets

While in many cases individual researchers will be responsible for their datasets, there is also an institutional role. The datasets in question may have been generated by researchers in an organisation's employ, and the organisation may feel duty bound, or be contractually obliged, to safeguard the datasets and make them available. From an institutional perspective, this requires that datasets are suitably stored and backed up for curation purposes; that researchers either take appropriate steps to implement their own curation or hand the datasets over to the institution for curation; and that the correct metadata has been generated and published in the appropriate way and form. It is also important that appropriate licensing is applied to ensure that the organisation's IPR is protected where appropriate, that work is suitably acknowledged, and that data is released in accordance with funder terms and conditions.

3.4 Funding bodies who fund projects where datasets are generated/used

Funding bodies include public, charitable and private bodies. They have to demonstrate the results and impact of their work through the results generated by the projects they fund. With increasing requirements for transparency on spending and impact, these bodies are interested in recording datasets alongside other forms of research outcome, and in showing the impact of the work they have funded to the public or their investors. Funders are increasingly taking note of datasets and may require the projects they fund to publish information about them, including the generation of metadata to record them in appropriate stores. It is therefore important that the correct metadata has been generated, that the impact of the work and the datasets can be assessed, and that the datasets, where appropriate, have been made available for further work or to allow others to reassess the work published about them.
The C4D project includes the EPSRC and NERC research councils, which are forerunners among the UK research councils in the management of datasets. EPSRC currently maintains two data centres to curate datasets collected from the projects that produced them; NERC currently has seven data centres for the same purpose. One key consideration, beyond wanting to show impact and make results available for further research, is legal requirements such as the EU INSPIRE Directive, which requires geospatial data to be curated and made available. Consequently, there are now contractual requirements on fund holders to curate datasets within their own organisations or to hand them over to the funding body for curation, as is currently the case with NERC.

3.5 Conclusions

Datasets are an important kind of research output and are also an important input. Significant numbers of datasets are being generated and need to be managed for a number of reasons: preserving important data for the future, potentially enabling better research and accelerating the rate of discoveries, and assessing and demonstrating impact. The researchers interviewed by the project from among its members were generally willing to share datasets, given suitable safeguards, but do not have a lot of time to generate quality metadata about them or to curate and administer them properly; they do want to be assured that appropriate use is made of them and that their use is acknowledged. Researchers would also like to be able to discover and access suitable datasets for their research activities. Consequently, there is a need for recording datasets with easy-to-use tools and publishing them in a way that facilitates their discovery. Organisations and funders are also interested in having datasets properly published and managed from the point of view of managing access and curation (e.g.
NERC and EPSRC). From the analysis of stakeholders a number of important considerations arise:

- the need for tools to make the generation of metadata easy;
- the need for data management plans so that access restrictions can be implemented for sensitive datasets;
- in some cases, the ability to publish only derivations that anonymise the data to protect the object of the dataset;
- the ability to deal with families of datasets where these represent successive processing of data;
- the ability to navigate the relationships between datasets, authors and publications, especially where datasets are evidence for claims made in publications;
- the ability to show the impact of datasets in terms of researchers unconnected to the datasets or the projects out of which they arose;
- the ability to discover datasets from within disciplines as well as across disciplines (e.g. marine mammals generating climatological data);
- the requirements of grant awarding bodies and their grant holders to publish outputs, and in particular datasets, effectively and to make them available and discoverable.

4 EXISTING METADATA STANDARDS

According to NISO, "Metadata is structured information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource" ("Understanding Metadata", NISO Press, ISBN 1-880124-62-9). A typical example of metadata is the catalogue found in any library, which records key information about the collection such as author, title and subject, as well as the location in the library.
Different types of metadata can be distinguished:

- Descriptive metadata, which describes the resource for identification and retrieval purposes:
  - description
  - topic
  - keywords
  - geospatial location
  - taxa
- Structural metadata, which describes the structure of the resource and potentially the relationships between elements of the resource:
  - record structure
  - fields
  - units
  - instrumentation used
- Administrative metadata, which helps to manage the resource:
  - version
  - when the resource was added to the archive
  - licensing
  - encoding (e.g. XML, CSV, flat file)
  - language
  - data management plan
  - conformity, standard name and version

Metadata elements grouped into sets are called metadata schemes; these elements are the descriptor types that are available to be applied to the data. For every element in the scheme the name and the semantics are specified; some schemes also specify the syntax in which the elements must be encoded, while others do not. The encoding allows the metadata to be processed by a computer program, and many current schemes use XML to specify their syntax. Metadata schemes are developed and maintained by standards organisations (such as ISO) or by organisations that have taken on such responsibility (such as the Dublin Core Metadata Initiative). Many different metadata schemes have been developed, some designed for use across disciplines and others for specific subject areas, dataset types or archival purposes:

- DDI (Social science): The Data Documentation Initiative is a standard for technical documentation describing social science data.
- EAD (Archiving): Encoded Archival Description, a standard for archives and manuscript repositories.
- CDWA (Arts): Categories for the Description of Works of Art is a framework for describing and accessing information about works of art, architecture, and cultural resources.
- VRA Core (Arts): The Visual Resources Association Core is a standard for the description of works of visual culture as well as the images that document them.
- Darwin Core (Biology): Darwin Core is a standard for the occurrence of species and the existence of specimens in collections.
- ONIX (Book industry): Online Information Exchange, an international standard for representing and communicating book industry product information in electronic form.
- EML (Ecology): Ecological Metadata Language is a standard developed for ecology.
- IEEE LOM (Education): Learning Object Metadata specifies the syntax and semantics of metadata for learning objects.
- CSDGM (Geographic data): The Content Standard for Digital Geospatial Metadata is maintained by the Federal Geographic Data Committee (FGDC).
- ISO 19115 (Geographic data): A geographic metadata standard that defines how to describe geographical information and associated services, including contents, spatial and temporal extents, data quality, and access and rights of use.
- e-GMS (Government): The e-Government Metadata Standard defines the metadata elements for information resources across public sector organisations in the UK.
- TEI (Humanities, social sciences and linguistics): The Text Encoding Initiative is a standard for the representation of texts in digital form, chiefly in the humanities, social sciences and linguistics.
- NISO MIX (Images): The NISO Metadata for Images in XML standard defines the technical data elements required to manage digital image collections.
- MARC (Librarianship): MAchine Readable Cataloging, a set of standards for the representation and communication of bibliographic and related information in machine-readable form.
- MEDIN (Marine data): MEDIN is a partnership of UK organisations committed to improving access to marine data; it is a marine-themed standard for recording information about datasets which is compliant with international standards.
- METS (Librarianship): The Metadata Encoding and Transmission Standard is an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
- MODS (Librarianship): The Metadata Object Description Schema is a schema for a bibliographic element set that may be used for a variety of purposes, particularly library applications.
- Dublin Core (Networked resources): An interoperable online metadata standard focused on networked resources.
- DOI (Networked resources): The Digital Object Identifier provides a system for the identification, and hence management, of information ("content") on digital networks, providing persistence and semantic interoperability.
- DIF (Scientific datasets): The Directory Interchange Format is a descriptive and standardised format for exchanging information about scientific datasets.

In the following we explore some of these standards in more detail.

Marine Environmental Data and Information Network (MEDIN) Discovery Metadata Standard

The MEDIN metadata schema is based on the ISO 19115 standard and includes all core INSPIRE metadata elements. It also complies with the UK GEMINI2.1 metadata standard. The MEDIN standard should therefore be considered an interpretation of GEMINI2.1 in which MEDIN has specified the use of certain term lists for specific elements (for example, to describe the spatial extent of the resource it is strongly recommended to use the SeaVox Sea Areas salt and freshwater body gazetteer) or, in some cases, changed the obligation of an element (for example, from Conditional to Mandatory). In all cases, however, a MEDIN discovery metadata record is compliant with the GEMINI2.1 standard.
The generated XML serialisations, that is, the conversion of an object into an XML stream that can be transported across processes and machines, conform to the ISO 19139 standard for XML implementations.

Dublin Core (DC)

The Dublin Core (DC) Metadata Element Set was initially developed to describe web-based resources sufficiently to allow their discovery by search engines; it is an element set for describing a wide range of networked resources, focusing on bibliographic needs. It is a basic standard which can be easily understood and implemented, and comprises 15 core elements encompassing the descriptive, administrative and technical elements required to uniquely identify a digital object. The elements are broad, however, and may therefore be refined by qualifiers to limit their semantic range and provide further levels of detail. There are no cataloguing rules, which means that the content of the elements may not be uniform across applications, potentially reducing the re-usability of the content and making automatic searching difficult. Additionally, DC is limited in its ability to handle the geospatial aspects of data, that is, data which has an explicit or implicit geographic area and is associated with or referenced to some position or location on the surface of the earth.

ISO 19115 Metadata Standard for Geographic Information

ISO 19115 defines the conceptual model required for describing geographic information and is widely used as the basis for geospatial metadata services. It provides a comprehensive set of metadata elements and describes all aspects of geospatial metadata, providing information about the identification, extent, quality, spatial and temporal elements, spatial reference, and distribution of digital geographic data, such as Geographic Information System (GIS) files, geospatial databases, earth imagery, and geospatial resources including data catalogues, mapping applications, data models and related websites.
It details which elements should be present, what information is required within each field, and whether the field is mandatory, optional or conditional. This metadata standard can be used for describing digital or physical objects or datasets which have a spatial dimension, that is, data directly or indirectly referenced to a location on the surface of the earth. However, because of the large number of metadata elements and the complexity of its data model, it can be difficult to implement.

UK GEMINI

The UK Geospatial Metadata Interoperability Initiative (GEMINI) was originally developed through collaboration between the Association for Geographic Information (AGI), the e-Government Unit (eGU) and the UK Data Archive. The UK GEMINI Discovery Metadata Standard is a defined element set for describing geospatial, discovery-level metadata within the UK, designed to be compatible with the core elements of ISO 19115; Version 2.1 complies with the INSPIRE metadata set.

UK GEMINI2

UK GEMINI2 is designed for use in a geospatial discovery metadata service. It specifies a core set of metadata elements for describing geospatial data resources for discovery. The data resources may be datasets, dataset series, services delivering geographic data, or any other information resource with geospatial content. Included in this are datasets that relate to a limited geographic area, which may be graphical or textual, hardcopy or digital. GEMINI2 is a revised version of GEMINI and is compatible with the requirements of the INSPIRE metadata Implementing Rules (IR), conforming to the international metadata standard for geographic information, ISO 19115.

Infrastructure for Spatial Information in the European Community (INSPIRE)

INSPIRE provides a general framework for a Spatial Data Infrastructure (SDI).
The INSPIRE metadata Implementing Rules define the minimum set of metadata elements necessary to comply with the INSPIRE Directive and to satisfy the need for the discovery of resources. These elements are: Resource title, Resource abstract, Resource type, Resource locator, Unique resource identifier, Resource language, Topic category, Keyword, Geographic bounding box, Temporal reference, Lineage, Spatial resolution, Conformity, Conditions for access and use, Limitations on public access, Responsible organisation, Metadata point of contact, Metadata date, and Metadata language. In essence it is a profile of ISO 19115 for discovery purposes, and it allows a variety of possible implementations.

Darwin Core

The Darwin Core is primarily based on taxa (an ordered system that indicates relationships among organisms), their occurrence in nature as documented by observations, specimens, and samples, and related information. It is a content specification designed for data about the geographical occurrence of species (i.e. the observation or collection of a particular species or other taxonomic group at a particular location). The Darwin Core is based on the standards developed by the Dublin Core Metadata Initiative (DCMI) and should be viewed as an extension of Dublin Core for biodiversity information.

The metadata standards listed above provide for the needs of a variety of disciplines. Some are more generic, such as Dublin Core, which originated in the domain of librarianship but has been widely used on its own or with some features absorbed into other metadata standards; it provides a number of useful basic facilities for describing resources. Other standards provide more focused support for specific disciplines, such as Darwin Core for biology or INSPIRE, GEMINI and MEDIN for geospatial applications.
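To make the discovery-level element sets discussed above concrete, the following sketch builds a flat metadata record using simplified element names loosely modelled on the INSPIRE discovery set. It is not a conformant ISO 19139 serialisation, and the element names and sample values are our own simplifications, purely for illustration:

```python
import xml.etree.ElementTree as ET

# Simplified, flattened element names loosely following the INSPIRE
# discovery set; NOT a conformant ISO 19139 serialisation.
record = {
    "resource_title": "CLIWOC ships' logbook climate observations",
    "resource_abstract": "Climatic observations from ships' logbooks, 1750-1850.",
    "resource_type": "dataset",
    "topic_category": "climatologyMeteorologyAtmosphere",
    "keyword": "marine climatology",
    "geographic_bounding_box": "-90 -180 90 180",
    "temporal_reference": "1750/1850",
    "metadata_language": "eng",
}

def build_record(fields):
    """Build a flat discovery-level metadata record as an XML tree."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return root

root = build_record(record)
```

Even this reduced form shows the pattern shared by the standards surveyed here: a handful of descriptive elements (title, abstract, keywords) alongside spatial and temporal references that support discovery queries.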
The diverging needs, even just among the science disciplines, present a dilemma: it will be difficult to find a single format that fits all these needs, and at the very least optionality with respect to different elements will be required to accommodate key aspects. There are various basic descriptions of a dataset that will be the same, such as name, description, authorship, location, language, topic and keywords, while other aspects are discipline-specific, such as geospatial bounding boxes for the geospatial sciences, taxa for biology, chemical structures, etc.; these will be important not just for correct classification but also for eventual discovery. It is also important to note that there are metadata standards at different levels of abstraction, some of which are designed more for the generation and management of dataset repositories. As repositories in some cases hold a variety of datasets from different disciplines, as in national archives and libraries, there are metadata standards focused more on archival and librarianship needs, such as EAD, MARC, METS and MODS, as well as Dublin Core. At the other end of the scale there are metadata standards for particular types of standardised datasets which focus more on the detailed content of those datasets, such as IMMA for ship records:

International Maritime Meteorological Archive (IMMA) Format

IMMA (Woodruff, 2003) is a flexible and extensible ASCII format designed for the archival and exchange of both historical and contemporary marine data. The fields presently composing IMMA are organised into two types of format component: the "Core" and "attachments" (attm). The Core contains the most universal and commonly used marine data elements (e.g. reported time and location, temperatures, wind, pressure, cloudiness, and sea state).
The Core is divided into two sections:

- a "location" section, for report time/space location and identification elements, and other key metadata;
- a "regular" section, for standardised data elements and types of data that are frequently used for climate and other research.

The Core forms the common front-end for all IMMA data and by itself forms a relatively concise "record type". The attachments contain additional, less universal data elements; further record types are constructed by appending one or more attachments to the Core.

Core Scientific Metadata Model (CSMD)

The Core Scientific Metadata Model is a model for the representation of scientific study metadata, developed within the Science and Technology Facilities Council (STFC) to represent the data generated from scientific facilities. The model captures high-level information about scientific studies and the data that they produce. It defines a hierarchical model of the structure of scientific research around studies and investigations, with their associated information, together with a generic model of the organisation of datasets into collections and files. Specific datasets can be associated with the appropriate experimental parameters, and with details of the files holding the actual data, including their location for linking. This provides a detailed description of the study. There are nine core entities defined for a study: Investigation, Investigator, Topic and Keyword, Publication, Sample, Dataset, Datafile, Parameter and Authorisation.

In conclusion, supporting a variety of disciplines, not limited to the sciences and perhaps extending to the humanities, will require a flexible approach that can capture the key aspects of different disciplines that are important for the correct classification, and thus discovery, of datasets along typical search criteria.
Thus geospatial aspects will be needed in some disciplines, as well as taxa, classifications of compounds/elements/molecules, or even extra-geospatial locations in others. The discipline-specific aspects are largely to do with what the dataset or resource is about and its context. At the same time there is a considerable number of common features to do with what the resource is, what it is called, where it is located, how it is encoded, who owns it, and what conditions are attached to it, all of which are fairly common across the disciplines. The metadata standard currently used at NERC is MEDIN, which complies with INSPIRE and GEMINI, and it would thus appear to be a logical starting point for a suitable standard for C4D. The caveat, however, is that not all disciplines will have geospatial references such as the bounding box specified in MEDIN, and they may have other requirements more specific to their respective disciplines. This would require that the geospatial references be optional rather than obligatory, and that other key aspects of other disciplines be introduced alongside, again as optional, to allow a more meaningful representation for a variety of disciplines:

- geographic/political regions, place names and localities (e.g. the occurrence of certain species);
- elevation (for some biological/geological datasets, for example);
- taxa for the biological sciences;
- extra-spatial references for astronomy.

MEDIN has considerable overlap with other standards in terms of the administrative aspects, for example with Dublin Core, and in fact many standards have virtually identical provision for administrative aspects but differ in the descriptive aspects. Consequently, a MEDIN-based standard would be a useful starting point, whilst needing to be extended to cover key discipline-specific aspects such as those described above.
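The approach argued for above, a common mandatory core with geospatial and other discipline-specific elements as optional additions, can be sketched as a simple validity check. The element names here are hypothetical and are not taken from MEDIN or any other standard:

```python
# Hypothetical element names: a mandatory common core plus optional
# discipline-specific elements, as proposed in the text above.
MANDATORY = {"title", "abstract", "originator", "language"}
OPTIONAL = {
    "bounding_box",       # geospatial sciences
    "place_name",         # regions and localities
    "elevation",          # some biological/geological datasets
    "taxa",               # biological sciences
    "extra_spatial_ref",  # e.g. astronomy
}

def validate(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing mandatory element: {e}"
                for e in sorted(MANDATORY - record.keys())]
    unknown = record.keys() - MANDATORY - OPTIONAL
    problems += [f"unknown element: {e}" for e in sorted(unknown)]
    return problems
```

A biological record carrying taxa but no bounding box would pass, while one missing a common element such as the abstract would be rejected, which is exactly the optionality argued for above.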
What can be concluded from these considerations of metadata standards is a number of lessons:

- Different metadata standards have evolved to serve the needs of their respective communities, and a way has to be found either to develop a more generic scheme that can work across a variety of disciplines, or to work with these existing standards and develop a higher-level metadata scheme that can subsume them. Given the sheer complexity of the available standards, it will be more pragmatic to start with a widely used standard such as MEDIN and assess the need to extend it to incorporate key criteria of other disciplines, such as taxa for biologists.
- Not all standards operate at the same level. Some focus more on the format and categories of data in the dataset, as in the case of IMMA; these contrast with higher-level standards that are less descriptive but represent more administrative aspects, such as MEDIN, and with still more general ones focused on the management of repositories, like METS and MODS, or on general-purpose discovery, such as Dublin Core. Consequently, it is important to recognise these differences and find a way of managing the tension between specific support for formats and disciplines and more general discoverability; in the end there may be a need to support a variety of low- and medium-level standards in repositories (because of legacy issues) while still facilitating their discovery. This cross-disciplinary discoverability can be important, as the case of SMRU shows, where biological datasets are used for climatological research.
- CSMD proposes a Core Scientific Metadata Model with a three-layer architecture: metadata at the data level, a contextual level above it, and a discovery level above that. We discuss this in more detail in section 7.
In conclusion, there is a need to support a variety of disciplines, to be mindful of the level of metadata, and to distinguish the three levels proposed in the CSMD initiative. This ensures a clear separation between levels, enables eventual integration between the repositories holding datasets and those holding research information for administrative purposes, and makes it possible to layer over these a discovery capability for the purposes of impact analysis and of making data retrievable and reusable.

5 EXISTING DATASETS OF C4D PARTNERS

To investigate the needs for metadata support for datasets in CERIF and to develop a concept demonstrator in the course of C4D, the project will consider a range of datasets already available from the project partners. The University of Sunderland currently co-owns or has access to several suitable research datasets: specifically, the Climatological Database for the World's Oceans (CLIWOC) and the JISC-funded UK Colonial Registers and Royal Navy Logbooks digitisation project (CORRAL), providing historical and marine climatology datasets. At the University of St Andrews, the Sea Mammal Research Unit (SMRU) is a NERC collaborative centre. At the University of Glasgow, the School of Geographical and Earth Sciences currently has access to marine climatological datasets, primarily for the North Atlantic, extending back up to 650 years.

Climatological Database for the World's Oceans (CLIWOC): 1750 to 1850

The European Union funded CLIWOC project was the first ever to examine the scientific utility of early, pre-instrumental ships' logbook information. The database contains climatic information from ships' logbooks for the period 1750 to 1850, and more specifically climatological information for the North and South Atlantic, the Indian and the Pacific Oceans.
Entries are chronological, day by day, and include identification by vessel and geographic location. Information on wind direction, wind force and other recorded weather elements, such as precipitation, fog, ice cover, and the state of sea and sky, is also included. Data were abstracted, verified, and calibrated before inclusion in the database. In total, 1,624 logbooks were digitized, comprising 273,269 observations from over 5,000 logbooks. The observations were put into the database in the IMMA standard format (Woodruff, 2003).

UK Colonial Registers and Royal Navy Logbooks (CORRAL)

The Joint Information Systems Committee (JISC) funded CORRAL project is an imaging and digitisation project to image ships' logbooks of particular historic and scientific value, and to digitise the meteorological observations in those logbooks. Digitising Royal Navy ships' logbooks (from ships on voyages of scientific discovery and those in the service of the Hydrographic Survey) and the coastal and island records contained in UK Colonial documents provides meteorological recordings from marine sites back to the 18th century. The instrumental data (mostly air pressure and temperature) are of singular importance in representing marine conditions.

The Sea Mammal Research Unit (SMRU), University of St Andrews

The Sea Mammal Research Unit (SMRU) carries out research on marine mammals using animal-borne instruments that have been designed and implemented to provide in situ hydrographic data from parts of the oceans where little or no other data are currently available. An important aspect of these deployments is the provision of unique, linked biological and physical datasets which can be used by the marine biologists who study these animals and also by biological oceanographers.
Incorporating animal-borne sensors into ocean observing systems provides information about global ocean circulation and enhances our understanding of climate and the corresponding heat and salt transports, while at the same time increasing our knowledge of the life history of the ocean's top predators and their sensitivity to climate change.

School of Geographical and Earth Sciences, University of Glasgow

The School of Geographical and Earth Sciences currently has access to marine climatological datasets. Modelling and measurements show that Atlantic marine temperatures are rising; however, the low temporal resolution of models and the restricted spatial resolution of measurements (i) mask regional details critical for determining the rate and extent of climate variability, and (ii) prevent robust determination of climatic impacts on marine ecosystems. Historic sea water temperatures were taken from algae-derived in situ temperature time-series developed during the current investigation. Projected Sea Surface Temperatures (SSTs) from 2000–2040 for 57°07'53''N 08°12'60''W were obtained from Stendel et al. (2004).

The datasets represented by the project partners, whilst not exhaustive, nevertheless represent a good cross-section of disciplines and variety:

- Biological datasets
- Climatological datasets
- Maritime vessel datasets

The datasets also involve cross-disciplinary use cases, as with the marine mammal datasets being of climatological interest. Supporting these in the C4D reference architecture should lead to results that are relevant for a variety of disciplines and use cases and that should in principle be extensible to the remaining specialist areas.

6 CLASSIFICATION SCHEMES

In any discipline, the organisation and provision of relevant information is a significant challenge.
It is essential that users are able to access and assimilate a variety of information resources in order to carry out their tasks. Classification is the placing of subjects into categories; it provides a system for organising and categorising knowledge. The purpose of classification is to bring together related items in a useful sequence from the general to the specific, enabling users to browse collections on a topic, either in person or online. Subject classification schemes describe resources by their subject and are a means of organising knowledge in libraries and other information environments. Classification schemes differ from other subject indexing systems, such as subject headings and thesauri, in trying to create collections of related resources in a hierarchical structure. They can thus aid information retrieval by providing browsing structures for subject-based information: a user-friendly, directory-type browsing structure makes finding and retrieving resources easier. However, the usefulness of any browsing structure depends on the accuracy of the classification. Additionally, updating classification schemes takes a long time, and because of their size they tend not to be updated frequently.
Classification schemes have been developed and used in a variety of contexts and vary in scope and methodology, but can be broadly divided into universal, national general, subject-specific and home-grown schemes:

- Universal schemes offer coverage of all subject areas and are designed for worldwide usage, e.g. the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and the Library of Congress Classification (LCC)
- National general schemes offer coverage of all subject areas but are generally not well known outside their place of origin, having usually been designed for use in a single country or language community, e.g. the Nederlandse Basisclassificatie (BC) - Dutch, and the Sveriges Allmänna Biblioteksförening (SAB) scheme - Swedish
- Subject-specific schemes have been devised with a particular (international or national) user group or subject community in mind, e.g. the National Library of Medicine (NLM) Classification and the Engineering Information (EI) Classification Codes
- Home-grown schemes are devised for use in a particular service, which organises knowledge by devising its own classification scheme, e.g. Yahoo

There are a number of classification schemes in use with a marked focus on library and information science, which are designed to organise and classify the world of knowledge and its contents. The most widely used universal classification schemes are the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and the Library of Congress Classification (LCC).

Dewey Decimal Classification (DDC)

The DDC system is a universal classification scheme for organising general knowledge that is continuously updated to keep pace with new developments and changes in knowledge, and it is the most widely used classification system in the world (Mitchell & Vizine-Goetz, 2009; OCLC, 2003).
The DDC is used by libraries in more than 135 countries to organise and provide access to their collections, and DDC numbers feature in the national bibliographies of more than 60 countries (Mitchell & Vizine-Goetz, 2009; OCLC, 2003). DDC is also used for other purposes, e.g. as a browsing mechanism for resources on the web (OCLC, 2003). The basic DDC classes are organised by academic disciplines or fields of study divided into 10 main classes (see Table 1 below), which together cover the entire world of knowledge. Each main class is further subdivided into 10 divisions, and each division is further subdivided into 10 sections (OCLC, 2003).

000 Computer science, information & general works
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography

Table 1: The 10 main classes (OCLC, 2003, p.7)

The DDC system distributes a subject according to context, referred to as "aspect classification". As a result the same subject may be classed in more than one place in the scheme, as shown by the example below. The DDC system is based on the ideas of relative location, decimal notation, a relative index and detailed classification, with its main principles being classification by discipline, a hierarchical structure and practicality.

Relative location

Subjects are ordered in a sequence and assigned a number; the books are marked with this number, not the shelves.

Relative index

The relative index shows the relationships between subjects and the disciplines, or in some cases the various aspects within disciplines, in which they appear (Mitchell & Vizine-Goetz, 2009).
For example, the relative index entries for Garlic are as follows:

Garlic                  641.3526
Garlic - botany         584.33
Garlic - cooking        641.6526
Garlic - food           641.3526
Garlic - garden crop    635.26
Garlic - pharmacology   615.32433

Within 641 food and drink, garlic appears in food (641.3526) and cooking (641.6526). Garlic also appears in botany (584); as a garden crop in agriculture (635.26); and in a subfield of medicine, pharmacology (615.32433) (Mitchell & Vizine-Goetz, 2009).

Decimal notation

Decimal notation refers to the principle of dividing each class into ten subdivisions, each of these subdivisions into another ten subdivisions, and so on. The first digit in each three-digit number represents the main class; for example, the 600 class represents technology. The second digit indicates the division: 600 is used for general works on technology, 610 for medicine and health, 620 for engineering, 630 for agriculture, and 640 for home economics and family living. The third digit indicates the section: thus 610 is used for general works on medicine and health, 611 for human anatomy, 612 for human physiology, and 613 for personal health and safety.

Library of Congress Classification (LCC)

The LCC system organises knowledge into 21 basic classes. Each class is identified by a single letter of the alphabet (A-Z except I, O, W, X, and Y). Most of these 21 classes are further divided into more specific subclasses by adding one or two additional letters and a set of numbers. Within each subclass, topics relevant to the subclass are arranged from the general to the more specific.
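The positional reading of a three-digit Dewey number described above can be sketched as a small decoder. Only the labels cited in the text are included; a real implementation would of course carry the full schedules.

```python
# Decode a three-digit Dewey number into main class, division and section,
# using only the labels cited in the text above (a tiny illustrative subset).

MAIN_CLASSES = {"6": "Technology"}
DIVISIONS = {
    "60": "General works on technology",
    "61": "Medicine and health",
    "62": "Engineering",
    "63": "Agriculture",
    "64": "Home economics and family living",
}
SECTIONS = {
    "610": "General works on medicine and health",
    "611": "Human anatomy",
    "612": "Human physiology",
    "613": "Personal health and safety",
}

def decode_ddc(number: str):
    """Split e.g. '612' into its main class, division and section labels."""
    assert len(number) >= 3 and number[:3].isdigit()
    return (MAIN_CLASSES.get(number[0], "?"),
            DIVISIONS.get(number[:2], "?"),
            SECTIONS.get(number[:3], "?"))

print(decode_ddc("612"))
# -> ('Technology', 'Medicine and health', 'Human physiology')
```

The same positional logic extends past the decimal point: each further digit in 641.3526 narrows the subject one more level.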
Relationships among topics are shown by indenting subtopics under the larger topics rather than by the numbers assigned to them; in this respect LCC differs from the more strictly hierarchical Dewey Decimal Classification system, where the hierarchical relationships among topics are shown by numbers that can be continuously subdivided.

Universal Decimal Classification (UDC)

The UDC system is adapted from the Decimal Classification of Melvil Dewey. It is used worldwide and is the world leader in multilingual classification schemes for all branches of human knowledge. It has undergone extensive revision and development, resulting in a flexible and efficient system for organising bibliographic records. The UDC system is suitable for use with all branches of human knowledge in any kind of medium and is well suited to multimedia information collections. UDC has a hierarchical structure which divides knowledge into 10 classes (see Table 2 below); each class is further subdivided into its relevant parts, each subdivision is further subdivided, and the system continues in this way. UDC uses decimal notation: the longer the number, the more detailed the subdivision represented by that number. Consequently UDC is able to represent not just straightforward subjects but also relations between subjects, as all recorded knowledge is treated as a coherent whole built of related parts.

0 Science and Knowledge. Organisation. Computer Science. Information. Documentation. Librarianship. Institutions. Publications
1 Philosophy. Psychology
2 Religion. Theology
3 Social Sciences
5 Mathematics. Natural Sciences
6 Applied Sciences. Medicine. Technology
7 The Arts. Recreation. Entertainment. Sport
8 Language. Linguistics. Literature
9 Geography. Biography. History

Table 2: The 10 classes (UDC Consortium)

UNESCO/SPINES thesaurus

The Science and Technology Policies Information Exchange System (SPINES) thesaurus was developed by the United Nations Educational, Scientific and Cultural Organisation (UNESCO). The SPINES thesaurus is a controlled and structured vocabulary for information processing in the field of science and technology for policy-making, management and development. It permits the indexing of:

- literature dealing with all aspects of science and technology policies
- documents dealing with research and experimental development projects
- literature and projects dealing with development in general, and particularly those which call heavily on the application of science and technology

Frascati

The Frascati Manual is not only a standard for Research and Development (R&D) surveys in the Organisation for Economic Co-operation and Development (OECD) member countries; it has become a standard for R&D surveys worldwide as a result of initiatives by the OECD, UNESCO, the European Union and various regional organisations. The Manual was first issued nearly 40 years ago and deals exclusively with the measurement of the human and financial resources devoted to research and experimental development (R&D). The Manual presents recommendations and guidelines on the collection and interpretation of established R&D data. Additionally, the Manual contains eleven annexes, which interpret and expand upon the basic principles outlined in the Manual in order to provide additional guidelines for R&D surveys or to deal with topics relevant to them.

For the purpose of C4D, it will be important to find a suitable classification scheme that will enhance the ability to correctly and reliably classify datasets to aid their discovery.
The Dewey Decimal Classification, Universal Decimal Classification and Library of Congress Classification schemes mentioned above, while universal, have a library bias, being concerned also with correctly shelving and finding items in a library, and they evolved in a period when some disciplines did not yet exist. This gives rise to strange phenomena, such as different aspects of computer science being organised into two locations at opposite ends of the Dewey classification system, for example. Another problem is insufficient specificity with respect to sub-areas of different disciplines. At the same time, no universal and detailed system exists that can immediately be applied, which may leave only the option of collecting subject-specific classification schemes for specific disciplines. There are more generic schemes with a universal focus that could be applied, such as PAIS, but these are again very coarse for each specific subject area. As far as the UK Research Councils are concerned, they already use classification systems for the purpose of managing their activities, and in particular awards. Thus NERC currently uses the scheme at http://www.nerc.ac.uk/funding/application/topics.asp. Among the prevailing schemes there are also the RCUK Subject Classification scheme and JACS from HESA (http://www.hesa.ac.uk/dox/jacs/JACS_complete.pdf).

6.1 Implications for CERIF

Metadata describes the "who, what, where, when, why and how" of a dataset's collection; it may also describe the "quality" of the data. Consistent use of standard terminology for metadata descriptors, identifiers or fields will help the understanding, integration, discovery and use of datasets. Among existing metadata content standards there is potentially some overlap in the terminology used.
Thus a mapping (see Table 3 below) between selected standards from different disciplines and potential CERIF entities was carried out in order to compare terminology and its usage. The selected standards are:

- MEDIN – Marine Environmental Data and Information Network Discovery Metadata Standard
- GEMINI2 – used in a geospatial discovery metadata service
- Dublin Core – developed initially to describe web-based resources
- Darwin Core – primarily based on taxa
- EML – Ecological Metadata Language, a metadata specification developed by, and for, the ecology discipline
- DIF – Directory Interchange Format, a metadata format used to create directory entries that describe scientific data
- DDI – Data Documentation Initiative, a descriptive metadata standard for social science data
- CSMD – the Core Scientific Metadata Model

The initial aim of the mapping is to identify the metadata content needs, as there may be content details and elements from multiple standards that can be included to help users understand the data. A secondary aim is to try to produce a minimum subset of terms for classification, administration and discovery. Several of the standards, e.g. Darwin Core and EML, have very detailed documentation, which for this first cut is too vast to study in great detail. These standards were chosen for comparison because they represent a wide range of disciplines and will thus allow us to develop a generic solution that can be used for at least the majority of the sciences. Some of them, such as Dublin Core, have already established wide acceptance. A solution that covers the core aspects of these standards and is able to map equivalent elements would also make it easier to import metadata already written in these standards into the independent CERIF format and into CERIF-compliant metadata repositories.
To take this further and create a generic framework in which to develop a CERIF mapping, there needs to be agreement on which core information data users require to discover, use and understand the data. Following on from this, it is essential that a formal definition for each term is agreed upon. It can be seen from the table below that, with the exception of Darwin Core, all the selected standards incorporate a term that relates to "why" the dataset was collected: e.g. MEDIN uses 'Resource abstract', Dublin Core uses 'Description' and DIF uses 'Summary'. Likewise, all the standards incorporate details about "who" – the person and/or organisation responsible for collecting and processing the data and to be contacted with questions about the data – using terms such as 'Responsible party/organisation' (MEDIN/GEMINI2), 'Personnel' and 'Originating Center' (DIF), and 'Publisher' (Dublin Core). Most of the selected standards also cover the "quality" aspect of the dataset as well as the "where" – the geographical location and/or a description of the spatial characteristics of the data. Based on this initial mapping it is therefore possible to identify metadata elements that are common among the chosen standards. These elements cover the descriptive and structural metadata elements required to aid the discoverability of datasets. Some of the administrative metadata elements, e.g. metadata name, metadata version and metadata language, are primarily incorporated only in the MEDIN, GEMINI2, Darwin Core, EML and DIF standards and, at this time, the CERIF result entity 'Product' is the most likely CERIF entity to which these can be mapped.
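The kind of element crosswalk the mapping exercise produces can be sketched as a lookup table. Only equivalences named in the text are included, and both the common slot names and the candidate CERIF targets are illustrative assumptions at this stage, not an agreed mapping.

```python
# A minimal crosswalk from per-standard element names to a common slot,
# using only equivalences mentioned in the text. Slot names and candidate
# CERIF targets are illustrative assumptions, not an agreed mapping.

CROSSWALK = {
    ("MEDIN", "Resource abstract"): "description",
    ("Dublin Core", "Description"): "description",
    ("DIF", "Summary"): "description",
    ("MEDIN", "Responsible party"): "responsible_party",
    ("GEMINI2", "Responsible organisation"): "responsible_party",
    ("DIF", "Personnel"): "responsible_party",
    ("Dublin Core", "Publisher"): "responsible_party",
}

# Candidate CERIF homes for each common slot (assumptions for illustration):
CERIF_CANDIDATE = {
    "description": "cfResultProduct",
    "responsible_party": "link to person/organisation entity",
}

def to_cerif(standard: str, element: str) -> str:
    """Map a (standard, element) pair onto its candidate CERIF target."""
    slot = CROSSWALK.get((standard, element))
    return CERIF_CANDIDATE.get(slot, "unmapped")

print(to_cerif("DIF", "Summary"))   # -> cfResultProduct
```

A two-step table like this (element to common slot, slot to CERIF entity) also makes it easy to add further standards later without revisiting the CERIF side of the mapping.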
[Table 3: mapping of element names and obligations across the MEDIN, GEMINI2, Dublin Core, Darwin Core, EML, DIF, DDI and CSMD standards to candidate CERIF entities and link entities (e.g. cfResultProduct, cfDublinCore, cfGeographicBoundingBox, cfLanguage, cfMedium), with comments from Keith Jeffery. Obligation keys: M – Mandatory, C – Conditional, O – Optional (MEDIN and GEMINI2); R – Required, O – Optional (EML); R – Required, HR – Highly Recommended, RC – Recommended (DIF); MR – Metadata Record. Elements only required when describing a service are marked as such. The table layout does not survive reproduction here.]

CERIF is already able to store and communicate some metadata about datasets through the existing cfResultProduct entity, as shown below, but this needs to be extended to cover all the necessary metadata elements representing the key administrative and descriptive aspects and, perhaps to a lesser degree, the structural aspects:

Figure 2: CERIF cfResultProduct Entity Relationships

Conclusion: based on the mapping there are two prime CERIF entities that are likely candidates for storing metadata – cfResultProduct and cfDublinCore – with additional entities for specific aspects of the metadata, e.g. cfGeographicBoundingBox, which appears ideally suited to providing data about the extent of the resource. The CERIF entities cfResultProduct and cfDublinCore provide good coverage of most of the descriptive and structural metadata elements required; either one or both of these entities would need to be extended to include the required administrative metadata elements or, alternatively, a new CERIF entity would need to be created. The C4D project will endeavour to build a solution that reuses existing CERIF entities where possible and adds new ones to support the metadata associated with datasets.
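As a sketch of how a dataset might be carried by the result-product entity, the following builds a small XML fragment. The element names (cfResProd, cfResProdId, cfName, cfDescr) follow the usual CERIF "cf" naming pattern but are assumptions here, to be checked against the actual CERIF XML schema; the identifier and text values are illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a dataset expressed as a CERIF-style result product
# record. Element names follow the typical CERIF "cf..." pattern but are
# assumptions, to be validated against the real CERIF XML schema.
prod = ET.Element("cfResProd")
ET.SubElement(prod, "cfResProdId").text = "cliwoc-1750-1850"  # hypothetical id
ET.SubElement(prod, "cfName").text = (
    "CLIWOC: Climatological Database for the World's Oceans")
ET.SubElement(prod, "cfDescr").text = (
    "Ships' logbook climate observations, 1750-1850, in IMMA format.")
print(ET.tostring(prod, encoding="unicode"))
```

Discipline-specific aspects such as a geographic bounding box would then hang off this record through the corresponding link entities rather than being embedded in the product element itself.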
The pilot implementation and the experience gained will then be fed back into the EuroCRIS standardisation process, which can then consider the proposed extensions to CERIF.

6.2 Related Activities

With respect to the aim of building an infrastructure for the management of research information covering researchers, publications, projects and outcomes such as datasets, there are two current activities that C4D should take note of:

- the Australian National Data Service (ANDS) and its Australian Research Data Commons (ARDC)
- the Research Councils UK (RCUK) and the Research Output System (ROS)

Australian National Data Service (ANDS)

The Australian government has already started to implement such an infrastructure, and it will be a useful model to learn from, given that it covers the metadata about these activities as well as access to resources such as datasets. The Australian initiative is known as the Australian National Data Service (ANDS), which is leading the creation of a cohesive national collection of research resources and a richer data environment that will:

- make better use of Australia's research outputs
- enable Australian researchers to easily publish, discover, access and use data
- enable new and more efficient research

ANDS is concerned with data that is produced by researchers as well as data that is used by and accessible to them. Thus, ANDS is enabling the transformation of data that are unmanaged, disconnected, invisible and single-use into structured collections that are managed, connected, discoverable and reusable, so that researchers can easily publish, discover, access and use research data.
To enable Australia's research data to be transformed, ANDS is:

- Creating partnerships with research and data producing organisations through funded projects and collaborative engagements
- Delivering national services such as Research Data Australia and Cite My Data
- Providing guides and advice on managing, producing and reusing data
- Building a research data community of practice
- Building the Australian Research Data Commons (ARDC), a cohesive collection of research resources, made available for community use, from all research institutions, to make better use of Australia's research outputs

Therefore the task of ANDS is to create the infrastructure to provide greater access to Australia's research data assets in forms that support easier and more effective data use and reuse, thereby enabling researchers to benefit from the Australian Research Data Commons. The infrastructure:

- makes available feeds of data collection descriptions from a range of public sector agencies
- federates and makes visible the Data Commons
- enables data/metadata management and sharing for research producing institutions
- enables capture of data and metadata from research instruments, and
- allows users to fully exploit the data held in the commons

The ANDS approach is to engage in partnerships with the research institutions to enable better local data management, through which structured collections can be created and published. ANDS then connects those collections so that they can be found and used.
These connected collections, together with the infrastructure, form the Australian Research Data Commons (see Figure 3 below):

Figure 3: Australian Research Data Commons (ARDC)

The Australian Research Data Commons (ARDC) is being created as a "meeting" place for researchers and data, to support the discovery of, and access to, research data held in Australian universities, publicly funded research agencies and government organisations for the use of research, and to provide:

- A set of data collections that are shareable
- Descriptions of those collections
- An infrastructure that enables populating and exploiting the commons
- Connections between the data, researchers, research, instruments and institutions

This will bring information about Australian research data together and place it in context. It will connect the data produced by research, as well as data needed to undertake research. The ARDC is more than an index: it is a rich web of description. ANDS does not hold the actual data, but points to the location where the data can be accessed. This combined information can then be used to help people discover data in context. The intention is for the ARDC to be searched, viewed and accessed in a number of ways.

Research Councils UK (RCUK)

RCUK is a strategic partnership of the UK Research Councils. RCUK was established in 2002 to enable the Research Councils to work together more effectively to enhance the overall impact and effectiveness of their research, training and innovation activities, contributing to the delivery of the Government's objectives for science and innovation. RCUK is responsible for investing public money in research in the UK to advance knowledge and generate new ideas which lead to a productive economy and a healthy society, and contribute to a sustainable world.
The Research Councils are the public bodies charged with investing taxpayers' money in science and research and, as such, they take very seriously their responsibilities in making the outputs from this research publicly available, not just to other researchers, but also to potential users in business, Government and the public sector, and to the public at large. The Research Councils are committed to the guiding principles that publicly funded research must be made available to the public and remain accessible for future generations. Therefore, the Chief Executives of the Research Councils have agreed that over time the UK Research Councils will support increased open access, by:

- building on their mandates on grant-holders to deposit research papers in suitable repositories within an agreed time period, and
- extending their support for publishing in open access journals, including through the pay-to-publish model

Thus a core part of the Research Councils' remit is making research data available to users. RCUK is committed to a transparent and coherent approach, and its common principles on data policy provide an overarching framework for the individual Research Councils' data policies. The information gathered is fundamental to the Research Councils strengthening their evidence base for strategy development, and crucial in demonstrating the benefits of Research Council funded research to society and the economy. The Research Outcomes System (ROS) allows users to provide research outcomes to four of the Research Councils: the Arts and Humanities Research Council (AHRC), the Biotechnology and Biological Sciences Research Council (BBSRC), the Economic and Social Research Council (ESRC), and the Engineering and Physical Sciences Research Council (EPSRC). The ROS can be used by these Research Councils' grant holders to input outcomes information about their research.
It can also be used by Higher Education Institution (HEI) research offices to input research outcomes information on behalf of grant holders and/or access the outcomes information of grant holders in their institution. Grant holders, research office managers and associated staff input outcomes information, comprising both narrative and data, under nine different categories, or outcome types, as follows:

- Publications
- Other Research Outputs
- Collaboration/Partnership
- Dissemination/Communication
- IP and Exploitation
- Award/Recognition
- Staff Development
- Further Funding
- Impact

Each of these categories has various sub-sections, with explanatory guidance, to help users be as specific as possible when inputting information. In addition, grant holders are asked to provide a brief summary of the Key Findings arising from their research. This information can be made available to research users to stimulate engagement and collaboration. The ROS categories provide a comprehensive, hierarchical set of elements for inputting information with regard to outcomes from research. Within each of the top-level outcome types (categories) there are additional selection criteria or options available at a first-tier and a second-tier level. For example, the top-level "Other Research Outputs" outcome type has options, at the first tier, for biological outputs, electronic outputs and physical outputs; for electronic outputs, at the second-tier level, there is the option of selecting 'dataset', which is described as "a structured record of the value of variables that were measured as part of the research".

Conclusions: Overall, the ROS categories present a good level of detail with regard to research outputs and could provide the fine detail that will be needed for the C4D subject demonstrator.
Ultimately, however, these categories do not include all of the metadata elements that are covered by standards such as MEDIN, DIF and CSMD, nor do they provide coverage for any of the gaps that have been identified in other standards.

The goal of ANDS is to assist institutions and research communities to ensure they can operate within this broader framework, so that the collective national investment in research data is made in ways that ensure the data can be more widely discovered and reused. This goal is not unlike the underlying rationale behind CERIF, which is the need to share research information across countries or even between different funding bodies in the same country. The long-term data management objective of ANDS is to deliver a discovery service that enables users to find data held in Australian repositories by searching a national metadata store. And, whilst this does not involve the development of a standard, there is again a similarity in this objective to the vision behind CERIF. Whilst CERIF is a standard for modelling and managing research information, its current focus is to record information about people, projects, organisations, publications, patents, products, funding, events, facilities and equipment, which are likely to be the same types of information recorded by the ARDC.

C4D will continue to monitor these activities and take them into consideration in the process of deriving its proposed solution concept. There are clear lessons to be learnt from them in terms of the strategy used; of particular note is the RCUK classification system, which will, for example, be incorporated into the C4D demonstrator.

7 ONTOLOGY REQUIREMENTS

The project will use CERIF to store and communicate the metadata of sample datasets, and integrate this metadata with that held on research projects and research outputs available at the three partner institutions.
The partner institutions recognise that the next logical step is to develop the infrastructure at each institution to integrate research data, since all of the partners have very limited capacity in this area at present; this project will extend the use of CERIF into the research data management area. The overall aim of the project is to develop a framework for incorporating metadata into CERIF such that research organisations and researchers can better discover and make use of existing and future research datasets, wherever they may be held. In order to demonstrate the feasibility of the approach, a subject-specific demonstrator will be built and the approach verified at the three partner HEIs, each within their own research administration infrastructure.

As can be seen from the brief stakeholder analysis, datasets are an important kind of research output and also an important input for future research. Enabling better research, accelerating the rate of discoveries, and assessing and demonstrating impact are therefore key considerations:

- Researchers would like to be able to discover and access suitable datasets for their research activities
- Organisations and funders are interested in having datasets properly published and in being able to explore the impact arising from other outputs and results related to datasets, and potentially the impact of the datasets themselves as they are subsequently used by other researchers and projects

Once datasets are fully supported in CERIF, and CERIF data stores, whether institutional and distributed or centralised, are populated with the requisite research data about institutions, individuals, projects, publications and research results including datasets, these data stores will become a formidable information resource that will serve a variety of users and uses.
As discussed in the stakeholder analysis above, there are three major perspectives that will need to be supported:

- discovery: the ability to reliably find resources based on typical retrieval criteria
- exploration: the ability to explore the context surrounding particular items of interest (people, projects, publications, results, datasets) to gain a better understanding of the related research activity
- linking: the ability to view this mesh of information as a whole even if parts of the information are located in different repositories

The aim is to develop an ontology and an ontology-driven interface to CERIF data stores which will provide the key features of import, discovery, exploration and linking. By modelling the key concepts in the CERIF data store, an interface can be built that has the desired functionality of enabling the non-invasive retrieval of records, also across data stores, and the representation and navigation of links between entities so as to, for example, visualise the records surrounding a particular dataset, such as other datasets, researchers, publications, projects, etc.

In computer science an ontology formally represents concepts in a domain and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain. In theory, an ontology is a "formal, explicit specification of a shared conceptualisation". An ontology renders a shared vocabulary and taxonomy which models a domain with the definition of objects and/or concepts and their properties and relations (e.g. we can define Felix as a cat, where cat is a subclass of the feline group of mammals, and John as a human being, also a subclass of mammals, and link Felix to John through the ownsPet relationship).
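The Felix/John example can be sketched as a toy triple store in plain Python. This is a minimal illustration of ontology-style reasoning over a subclass hierarchy; only the class names and the ownsPet relation come from the example above, the rest is illustrative:

```python
# A toy triple store illustrating the Felix/John example from the text.
# Triples are (subject, predicate, object); "subClassOf" is transitive.
triples = {
    ("Cat", "subClassOf", "Feline"),
    ("Feline", "subClassOf", "Mammal"),
    ("Human", "subClassOf", "Mammal"),
    ("Felix", "type", "Cat"),
    ("John", "type", "Human"),
    ("John", "ownsPet", "Felix"),
}

def superclasses(cls):
    """All classes reachable from cls via subClassOf (transitive closure)."""
    found, frontier = set(), {cls}
    while frontier:
        c = frontier.pop()
        for s, p, o in triples:
            if s == c and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.add(o)
    return found

def is_a(individual, cls):
    """Simple reasoning: does the individual belong to cls, directly or via the hierarchy?"""
    for s, p, o in triples:
        if s == individual and p == "type":
            if o == cls or cls in superclasses(o):
                return True
    return False

print(is_a("Felix", "Mammal"))   # True: Cat -> Feline -> Mammal
print(is_a("John", "Feline"))    # False
```

An OWL reasoner performs this kind of subsumption inference (and much more) automatically; the sketch only shows why a class hierarchy lets a system answer questions that are not stated explicitly in the data.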
Ontologies have a formal syntax (rules of how to compose them) and logical properties which can be used to reason about them. They can be used to manage data or observations and to answer specific questions. They are a partial model that concentrates on specific competency questions determined by the specific application they are supposed to serve. The notion of a shared conceptualisation refers to being capable of shared development and use: ontologies are expected to be reused where possible, and new ontologies should reuse relevant parts of other ontologies. They have been used in association with web pages, XML files and databases and can mark these up so as to allow the extraction of information from them and the combination of information from a variety of sources in a distributed environment. Ontologies are typically expressed in XML form as RDF and OWL, where OWL is the prevalent standard adopted by the W3C. The benefit of using ontologies and semantic web techniques is to express connections between entities even if these do not explicitly exist in the records, and to use these connections to link together information and potentially also derive missing information. In addition, when classifications are represented in the ontology in a taxonomic fashion, queries can be expanded or refined to yield better retrieval results.

Figure 4: Datasets and Related Information Context (a graph linking datasets DS1 and DS2 to persons PersA and PersB, projects Proj1 and Proj2, and publications Pub1, Pub2 and Pub3)

Ontologies, as explicit, machine-processable models of a domain, have been in use for some time and have been formalised by the knowledge representation community. They are also part of the mainstream web, and several standards have been adopted by the W3C, including RDF and OWL. These are fully compatible with other web technologies, are typically serialised in XML, and can be applied in a variety of ways to introduce semantics and enable semantic applications to deliver a better and more context-sensitive response.
They can either be used to further annotate existing XML resources or, in a completely non-invasive way, point at elements of XML resources or map onto a live relational database or web services to provide a semantic interface to them.

Figure 4 above shows how ontologies can be used to classify a variety of resources, such as datasets, publications, researchers and projects, and to search for them by resource type. Ontologies also allow associations or relations to be expressed by which resources are linked together, as indicated in the diagram by the arrows linking researchers, projects and publications to datasets. By graphically displaying these relations as shown in the diagram, the activity surrounding datasets can be visualised and made navigable.

The current version of CERIF already supports semantics to some extent in terms of links between entities, using existing and extensible classification schemes through which entities such as person can be linked to publications, projects, organisations and results, to name but a few. This allows simple models to be generated, including classes, class hierarchies and properties. However, it does not support the full range of constraints, such as domain and range, quantification, disjointness, functional properties, inverses and others, that are part of the standard repertoire of ontology constructs needed to provide more sophisticated services. This requires further exploration and consideration of whether it is expedient to put the semantics with the data. The management of an ontology for the C4D project should therefore make use of existing entities and facilities, but needs more dedicated tools to manage the generation and testing of a candidate C4D ontology, through the use of generally accepted tools such as Protégé.
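The kind of navigation described above, finding the records that surround a given dataset, amounts to a one-hop traversal of link records. A sketch, using the entity labels of Figure 4 (the link relation names themselves are illustrative, not CERIF link entities):

```python
# Illustrative link records in the spirit of Figure 4 (not real CERIF data).
links = [
    ("PersA", "authorOf", "DS1"),
    ("Proj1", "produced", "DS1"),
    ("Pub1", "cites", "DS1"),
    ("DS1", "relatedTo", "DS2"),
    ("PersB", "authorOf", "Pub3"),
]

def neighbourhood(entity):
    """All entities directly linked to `entity`, in either direction,
    mapped to the relation that connects them."""
    related = {}
    for s, p, o in links:
        if s == entity:
            related[o] = p
        elif o == entity:
            related[s] = p
    return related

print(neighbourhood("DS1"))
```

A graphical front end would render this neighbourhood as nodes and arrows and let the user click through to the next entity's neighbourhood, which is exactly the navigable view of activity surrounding a dataset that the text describes.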
To provide a discovery, navigation and linking service it is therefore proposed to develop a separate ontology that will be mapped to the data structure of the repository (and potentially to other repositories elsewhere), and through which an interface is provided where users can search the data store to retrieve resources and explore the links to related resources. This would facilitate retrieval of resources from a range of data stores and allow a view of the combined information in a uniform and transparent way. By augmenting and building upon the existing link entities, a more comprehensive information retrieval and navigation service will be provided, with concrete benefits for the different users:

- ability to retrieve information from a variety of locations
- ability to constrain or expand retrieval according to user preference
- ability to navigate related information to explore activity and impact

The use of ontologies in conjunction with datasets, metadata about datasets and repositories is not new, and some current work should be taken note of:

- the semantic layer in CERIF and the CERIF ontology for version 1.2
- the VIVO ontology for scientific networks (http://journal.webscience.org/316/), designed to create a scientific network
- DataStaR, a scientific data repository implementation driven by an ontology
- EPOS, the European Plate Observing System, the integrated solid Earth sciences research infrastructure
- CSMD, the Core Scientific Metadata Model

DataStaR is a science data "staging repository" developed by Albert R. Mann Library at Cornell University that produces semantic metadata while enabling the publication of datasets and accompanying metadata to discipline-specific data centres or to Cornell's institutional repository. DataStaR, which employs OWL and RDF in its metadata store, serves as a web-based platform for the production and management of metadata and aims to reduce redundant manual input by reusing named ontology individuals.
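The "constrain or expand retrieval" benefit listed above can be illustrated with a taxonomic query expansion sketch: a query for a broad term is expanded to all narrower terms before matching. The classification terms and dataset tags here are hypothetical, not a real scheme such as RCUK's:

```python
# Hypothetical classification taxonomy: broader term -> narrower terms.
taxonomy = {
    "Earth Sciences": ["Geophysics", "Geology"],
    "Geophysics": ["Seismology"],
}

# Hypothetical datasets, each tagged with a single classification term.
datasets = {
    "DS1": "Seismology",
    "DS2": "Geology",
    "DS3": "Chemistry",
}

def expand(term):
    """A term plus all narrower terms beneath it in the taxonomy."""
    terms = {term}
    for narrower in taxonomy.get(term, []):
        terms |= expand(narrower)
    return terms

def retrieve(term):
    """Expanded retrieval: match datasets tagged with the term or any narrower term."""
    wanted = expand(term)
    return sorted(ds for ds, tag in datasets.items() if tag in wanted)

print(retrieve("Earth Sciences"))  # finds both DS1 and DS2 via expansion
print(retrieve("Geology"))         # constrained query: DS2 only
```

The same mechanism run in reverse (matching only the exact term) gives the constrained mode, so a single taxonomy supports both ends of the user-preference slider.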
The VIVO project is creating an open, Semantic Web-based network of institutional ontology-driven databases to enable national discovery, networking, and collaboration via information sharing about researchers and their activities. The project has been funded by the NIH to implement VIVO at the University of Florida, Cornell University, and Indiana University Bloomington, together with four other partner institutions. Working with the Semantic Web/Linked Open Data community, the project will pilot the development of common ontologies, integration with institutional information sources and authentication, and national discovery and exploration of networks of researchers. Building on technology developed over the last five years at Cornell University, VIVO supports the flexible description and interrelation of people, organisations, activities, projects, publications, affiliations, and other entities and properties. VIVO itself is an open source Java application built on W3C Semantic Web standards, including RDF, OWL, and SPARQL.

The original CERIF ontology abstracted the CERIF logical model into OWL format, and thus presents a mirror image of the CERIF framework in OWL; it has been presented for discussion. A more up-to-date version is currently under development for CERIF 1.3, though it has not been released yet. CERIF itself already contains elements of ontologies in its semantic layer, in particular the link entities, classifications and classification scheme. CERIF is in principle capable of holding an ontology, though it does not in itself present a fully developed ontology. The recent work of the VIVO and DataStaR projects presents interesting avenues that will need to be explored to see how their ontologies can potentially be adapted for the needs of C4D.
What will be important is to develop an ontology that can represent datasets together with any associated researchers, organisations, projects, publications and any other form of result. In this way the impact of datasets can be explored. This will also need to be married with a suitable classification scheme to facilitate discovery. In line with current thinking on ontology development, ontologies should where possible reuse and adapt existing ontologies (see the NeOn methodology), and it will also be useful to study the application approach used in VIVO and DataStaR to learn from their experience. VIVO is also a member of EuroCRIS and thus related to the CERIF initiative.

Figure 5: The EPOS Infrastructure

The European Plate Observing System (EPOS) is the integrated solid Earth sciences research infrastructure approved by the European Strategy Forum on Research Infrastructures (ESFRI). EPOS faces a major data handling challenge: the amount of primary data, and the demand for accessing it, is enormous and increasing. One main EPOS objective is the creation of a comprehensive, easily accessible geo-data volume for the entire European plate. Overall, EPOS will ensure secure storage of geophysical and geological data, providing the continued commitment needed for long-term observation of the Earth. EPOS aims to integrate data from permanent national and regional geophysical monitoring networks with observations from in-situ observatories, temporary monitoring and laboratory experiments. The EPOS infrastructure is based on existing, discipline-oriented, national data centres and data providers managed by national communities for data archiving and mining. Its cyberinfrastructure (see Figure 5 above) for data mining and processing also has facilities for data integration, archiving and exchanging data.
The EPOS Core Services will provide the top-level service to users, including access to multidisciplinary data and metadata, virtual data from modelling and solid Earth simulations, data processing and visualisation tools, as well as access to high-performance computational facilities.

Core Scientific Metadata Model (CSMD)

The Core Scientific Metadata Model (CSMD) is a model for the representation of scientific study metadata developed within the Science & Technology Facilities Council (STFC) to represent the data generated from scientific facilities. The model has been developed to allow management of and access to the data resources of the facilities in a uniform way. The model defines a hierarchical model of the structure of scientific research around studies and investigations, with their associated information, and also a generic model of the organisation of datasets into collections and files (see Figure 6 below). Specific datasets can be associated with the appropriate experimental parameters and with details of the files holding the actual data, including their location for linking. This provides a detailed description of the study, although not all information captured in specific metadata schemas would be used to search for this data or to distinguish one dataset from another:

Figure 6: The CSMD Entities Infrastructure

The models proposed by CSMD and EPOS are relevant to the efforts of C4D: they suggest that, with respect to metadata coverage and focus, there needs to be provision for integrating with the repositories that hold the actual datasets. Those repositories will cover the lower-level metadata aspects concerned with the format of the datasets, as opposed to the more administrative aspects concerned with context and, in particular, with supporting discoverability, which is needed if such collections of datasets are to be made effectively reusable, retrievable and browsable.
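The hierarchical organisation CSMD describes, studies containing investigations whose datasets carry experimental parameters and point to files with their locations, can be sketched roughly as follows. The class and field names are illustrative only, not the actual CSMD schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a CSMD-style hierarchy; names are not the CSMD schema.
@dataclass
class DataFile:
    name: str
    location: str          # where the actual data is held, for linking

@dataclass
class Dataset:
    name: str
    parameters: dict = field(default_factory=dict)  # experimental parameters
    files: list = field(default_factory=list)

@dataclass
class Investigation:
    title: str
    datasets: list = field(default_factory=list)

@dataclass
class Study:
    title: str
    investigations: list = field(default_factory=list)

study = Study("Hypothetical beam-line study", [
    Investigation("Run 1", [
        Dataset("raw-scan", {"temperature_K": 293},
                [DataFile("scan001.dat", "https://example.org/data/scan001.dat")]),
    ]),
])

# Walk the hierarchy down to the file locations, as a linking service would.
locations = [f.location
             for inv in study.investigations
             for ds in inv.datasets
             for f in ds.files]
print(locations)
```

The point of the sketch is the separation CSMD makes: the upper levels (study, investigation) describe the research context used for discovery, while the leaves hold only pointers to the files, keeping the data itself in the facility repositories.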
The CSMD entities structure is also a useful pointer in terms of relating key entities together and making these data collections more browsable and accessible, especially from the point of view of impact and impact analysis, be this for reporting purposes or for helping researchers understand the relationships between items of interest so as to understand the activities surrounding datasets they may be looking at.

Another development that will need to be considered is the related ICAT project, which builds on the CSMD architecture and proposes a mechanism for seamless access to resources in a federated situation, linking together repositories of metadata or research information and resources such as datasets.

Figure 7: Architecture of ICAT

The ICAT architecture (shown above in Figure 7) is particularly noteworthy where federated systems that may hold parts of a much larger collection of information and data are concerned, and where discoverability and the ability to merge data from different locations matter. Through appropriate tagging and the use of keywords and taxonomies of scientific terms, seamless retrieval of resources can be enabled. This will then produce a more scalable and extensible architecture that is flexible for further extension in the future, and provide the ability to link repositories containing research metadata with the repositories that hold the actual artefacts that users may want to browse or retrieve.
Proposed C4D Discovery Architecture

In order to refine the proposed approach, and in order to make use of existing work to develop a comprehensive solution more quickly, we propose to base the application on R2O and ODEMapster. R2O and ODEMapster form an integrated framework for the formal specification, evaluation, verification and exploitation of semantic mappings between ontologies and relational databases. This integrated framework consists of:

- R2O, a declarative, XML-based language that allows the description of arbitrarily complex mapping expressions between ontology elements (concepts, attributes and relations) and relational elements (relations and attributes)
- the ODEMapster processor, which generates Semantic Web instances from relational instances based on the mapping description expressed in an R2O document. ODEMapster offers two modes of execution: query-driven upgrade (on-the-fly query translation) and a massive upgrade batch process that generates all possible Semantic Web individuals from the data repository

Recently, in the context of the NeOn project, the ODEMapster plugin has been developed; it is included in the NeOn Toolkit. This plugin offers the user a GUI to create, execute, or query the R2O mappings. It has the advantage of providing a live connection to the database, as opposed to the D2RQ approach, which just offers a snapshot.
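The general idea behind R2O/ODEMapster, a declarative mapping from relational rows to ontology instances, executed either on the fly or as a batch "massive upgrade", can be illustrated generically. This is ordinary Python over SQLite, not R2O syntax; the table, column and property names are hypothetical stand-ins, not the CERIF schema:

```python
import sqlite3

# Hypothetical relational store standing in for a CERIF repository.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE result_product (id TEXT, name TEXT, descr TEXT)")
db.execute("INSERT INTO result_product VALUES ('DS1', 'Ocean survey', 'CTD profiles')")

# Declarative mapping: relational columns -> ontology properties (illustrative names).
mapping = {
    "table": "result_product",
    "subject_column": "id",
    "type": "Dataset",
    "properties": {"name": "hasTitle", "descr": "hasDescription"},
}

def massive_upgrade(db, m):
    """Batch mode: materialise every row of the mapped table as triples."""
    cols = [m["subject_column"]] + list(m["properties"])
    rows = db.execute(f"SELECT {', '.join(cols)} FROM {m['table']}")
    triples = []
    for row in rows:
        subject, values = row[0], row[1:]
        triples.append((subject, "type", m["type"]))
        for col, value in zip(m["properties"], values):
            triples.append((subject, m["properties"][col], value))
    return triples

triples = massive_upgrade(db, mapping)
print(triples)
```

The query-driven mode would instead translate each incoming ontology query into SQL against the live database using the same mapping, which is why a live connection (as ODEMapster provides) is preferable to a snapshot for large repositories.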
The proposed architecture for the C4D project is shown in Figure 8 below:

Figure 8: Application of Metadata Ontology (the C4D-CERIF Semantic Browser: a C4D Navigator for import, search and visualisation, backed by a domain ontology edited in Protégé, R2O mappings between an RDB ontology and the CERIF repository, and ODEMapster)

The recommendations will be for:

- A suitable classification system to be built into the ontology as a classification taxonomy. It is likely that the classification already used by RCUK will be the key classification system to apply here, since it would then make future integration easier between the metadata repositories and the repositories that administer/curate datasets, several of which already exist, such as within EPSRC and NERC
- The remaining key entities to be specified, together with their formal properties and relations, based on current CERIF and the new CERIF ontology as well as VIVO and DataStaR
- A suitable toolset to be chosen for the implementation of the application. Currently the prevalent way of developing such applications is by using Protégé for editing purposes, together with the Jena toolkit for managing the ontology in the Java application to be developed; Jena also contains reasoning tools. The VIVO system architecture should be examined for reference purposes
- A database connector to be used for connecting the ontology application to data repositories such as databases. Currently there are two open-source solutions available for this, namely D2RQ and R2O/ODEMapster. The latter has the benefit of providing a live connection to a database, as opposed to creating a database image in the case of D2RQ, which is both cumbersome and impractical for use with large repositories/databases

Figure 8 above shows how such a system architecture would work by linking to an existing metadata repository (CERIF) which holds the records on researchers, projects, publications and datasets.
This would be accessed using ODEMapster. The application then uses the C4D ontology, maps it to the database using R2O mappings, and then allows the database to be searched and data to be retrieved by interrogating the domain ontology. Depending on user preference, this can then be accessed through a traditional tabular search facility or through an interactive graphical representation that allows the collection of information to be navigated and browsed. A further benefit of this approach is that it could easily be extended to deal with a variety of repositories and allow the data to be combined for more unified and transparent navigation; however, this would require additional effort to deal with the merging of data from different sources. The VIVO project has not concentrated so much on datasets, but it would be advisable to study the system they have developed, as it allows for this inter-organisational perspective and provides an ontology-driven approach.

For the purpose of deriving a C4D solution there are a number of services we will also look into to help with the ontology construction:

- http://ontologydesignpatterns.org/ provides a repository of shared ontologies where there may be useful components that can be used in the construction of the C4D ontology
- http://semanticweb.org/wiki/GoodRelations is another frequently used repository of ontology fragments to do with managing relations
- http://owl.cs.manchester.ac.uk/modularity/ is a service that allows modules to be easily extracted from existing ontologies for reuse

8 SUMMARY AND RECOMMENDATIONS

From the study of CERIF and the current developments touched upon in this report, the following key points should guide a future implementation of support for datasets in CERIF:

- Make use of the existing cfResultProduct entity in CERIF, extending it to support any additional elements for datasets and related metadata identified from the survey of metadata standards:
o Build metadata support for datasets based on MEDIN by
making geospatial aspects optional
o adding further elements required in other scientific disciplines such as biology, physics and chemistry
o making good use of Dublin Core where possible, as this enjoys widespread acceptance, elements of it are used in some standards, and it is already used in CERIF

- Develop a candidate ontology for use with the envisaged application, for entering dataset metadata, importing metadata in bulk, and discovering datasets and dataset-related activity:
o based on a study and potential reuse of aspects of the VIVO and DataStaR ontologies where these support datasets and the key entities associated with datasets
o as well as making use of the newly proposed CERIF ontology once available, and maintaining compatibility with the semantic aspects already implemented in CERIF as part of the semantic layer, the link entities, and the classifications and classification scheme

- Study the VIVO ontology application approach to refine the approach of an ontology-enabled application that can be used for:
o importing existing dataset metadata into CERIF
o data entry tools for generating dataset metadata in CERIF by hand
o a discovery facility for searching for datasets
o a navigation facility to visualise relations between datasets and key entities in a graphical and interactive fashion
o comparing this with a current working prototype already available at Sunderland through another project, to decide on the best solution

- Select an existing classification scheme for the purpose of classifying datasets and improving their discovery in the process of retrieval or browsing:
o using either the RCUK classification scheme
o or HESA's JACS codes
o at this point in time there are attempts to merge JACS with RCUK, though this is expected to complete after the end of the C4D project
o the RCUK classification scheme will get greater exposure as part of the RCUK
Gateways to Research Project – announced in the BIS Research and Innovation Strategy in December 2011 and it is therefore advisable to work with this in the interim in C4D. Version 1.1 of 14 Jan 2012 D2.1 Metadata Ontology.doc Page 35/x C4D D2.1 Metadata Ontology REFERENCES Buil-Aranda, C. and Corcho, Óscar and Krause, Amy (2009) Robust service-based semantic querying to distributed heterogeneous databases. In: 20th International Workshop on Database and Expert Systems Application, DEXA2009, 31/08/2009 - 04/09/2009, Linz, Austria. Connaway, Lynn Silipigni, Timothy J. Dickey, and Marie L. Radford. (2011) "'If It Is Too Inconvenient, I'm Not Going After it:' Convenience as a Critical Factor in Information-Seeking Behaviors." Library and Information Science Research, 33: 179-190. Pre-print available online at: http://www.oclc.org/research/publications/library/2011/connaway-lisr.pdf Connaway, Lynn Silipigni, and Timothy J. Dickey (2010) Towards a Profile of the Researcher of Today: What Can We Learn from JISC Projects? Common Themes Identified in an Analysis of JISC Virtual Environment and Digital Repository Projects. Available online at: http://ierepository.jisc.ac.uk/418/2/VirtualScholar_themesFromProjects_revised.pdf DataStaR: Bridging XML and OWL in Science Metadata Management, Brian Lowe in Communications in Computer and Information Science, 1, Volume 46, Metadata and Semantic Research, Part 2 Gruber, Thomas R. (June 1993). "A translation approach to portable ontology specifications" (PDF). Knowledge Acquisition 5 (2): 199–220. 
INSPIRE Metadata Regulation, 03.12.2008. (see http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32008R1205:EN:NOT)

INSPIRE Metadata Implementing Rules: Technical Guidelines based on EN ISO 19115 and EN ISO 19119, v1.2, 2010-06-16. (http://inspire.jrc.ec.europa.eu/documents/Metadata/INSPIRE_MD_IR_and_ISO_v1_2_20100616.pdf)

Khan, H., Caruso, B., Corson-Rikert, J., Dietrich, D., Lowe, B. and Steinhart, G. (2010) Using the Semantic Web Approach for Data Curation. Proceedings of the 6th International Conference on Digital Curation, 6-8 December 2010, Chicago, IL. http://hdl.handle.net/1813/22945

Lowe, B. (2009) DataStaR: Bridging XML and OWL in Science Metadata Management. Metadata and Semantics Research, 46: 141-150. http://www.springerlink.com/content/q0825vj78ul38712/

Matthews, B., Sufi, S., Flannery, D., Lerusse, L., Griffin, T., Gleaves, M. and Kleese, K. (2010) "Using a Core Scientific Metadata Model in Large-Scale Facilities." The International Journal of Digital Curation, 5(1). ISSN 1746-8256.

Flannery, D., Matthews, B., Griffin, T., Bicarregui, J., Gleaves, M., Lerusse, L., Downing, R., Ashton, A., Sufi, S., Drinkwater, G. and Kleese, K. (2009) "ICAT: Integrating Data Infrastructure for Facilities Based Science." Fifth IEEE International Conference on e-Science (e-Science '09), pp. 201-207, 9-11 December 2009. doi: 10.1109/e-Science.2009.36

Mitchell, J.S. and Vizine-Goetz, D. (2009) Dewey Decimal Classification. In: Encyclopedia of Library and Information Science, Third Edition, Bates, M.J. and Niles Maack, M. (eds.). Boca Raton, Fla.: CRC Press. Pre-print available online at: http://www.oclc.org/research/publications/library/2009/mitchell-dvg-elis.pdf

Steinhart, G. (2011) DataStaR: A Data Sharing and Publication Infrastructure to Support Research. Agricultural Information Worldwide, 4(1): 16-20.
http://journals.sfu.ca/iaald/index.php/aginfo/article/view/199

Stendel, M., Schmith, T., Roeckner, E. and Cubasch, U. (2004) IPCC_ECHAM4OPYC_SRES_A2_MM. World Data Centre for Climate. doi: 10.1594/WDCC/IPCC_EH4_OPYC_SRES_A2_MM

Suarez-Figueroa, M.C. and Gomez-Perez, A. (2009) NeOn Methodology for Building Ontology Networks: a Scenario-based Methodology. Proceedings of the International Conference on Software, Services & Semantic Technologies (S3T 2009).

Dewey Decimal Classification (2003) Dewey Decimal Classification and Relative Index, Edition 22 (DDC 22). OCLC Online Computer Library Center, Inc. ISBN 0-910608-71-7

UDC Consortium, http://www.udcc.org/about.htm (accessed 18/01/2012)

UK GEMINI2 Standard, Version 2.1. AGI, August 2010. (see www.gigateway.org.uk)

Krafft, D.B., Cappadona, N.A., Caruso, B., Corson-Rikert, J., Devare, M. and Lowe, B.J. (2010) VIVO: Enabling National Networking of Scientists. Available at: http://journal.webscience.org/316/2/websci10_submission_82.pdf

Woodruff, S.D. (ed.) (2003) Archival of data other than in IMMT format. Proposal: International Maritime Meteorological Archive (IMMA) Format. Update of JCOMM-SGMC-VIII/Doc. 17, Asheville, NC, USA, 10-14 April 2000. Available from http://www.cdc.noaa.gov/coads/edoc/imma/imma.pdf