CERIF for Datasets

D3.1 Ontology Upload Requirements and Process Definition

Filename:     D3.1_Ontology_Upload_Requirements
Circulation:  Restricted (PP)
Version:      2.2
Date:         9 May 2012
Stage:        Final

Authors: Albert Bokma and Sheila Garfield

Partners:
University of Sunderland
University of Glasgow
University of St Andrews
Engineering and Physical Sciences Research Council
Natural Environment Research Council
EuroCRIS

C4D is a project funded under JISC's Managing Research Data Programme

C4D D3.1 Ontology Upload Requirements

COPYRIGHT

© Copyright 2011 The C4D Consortium. All rights reserved.

This document may not be copied, reproduced, or modified in whole or in part for any purpose without written permission from the C4D Consortium. In addition to such written permission to copy, reproduce, or modify this document in whole or in part, an acknowledgement of the authors of the document and all applicable portions of the copyright notice must be clearly referenced.

This document may change without notice.

DOCUMENT HISTORY

Version  Issue Date   Stage  Content and changes
V1.0     4 May 2012   Draft  Draft
V2.0     9 May 2012   Draft  Comments and changes incorporated
V2.1     10 May 2012  Late
V2.2     10 May 2012  Final

Version 2.2 of 10 May 2012 D3.1 Ontology Upload Requirements.doc

1 INTRODUCTION

The goal of the CERIF for Datasets (C4D) project is to extend CERIF to deal effectively with research datasets. For the purposes of this paper, datasets are collections of data generated in the course of research activity, whether they record experimental results or original observations. They are usually structured and in tabular form, where each row represents a record with a number of fields that hold data on various attributes. The core entities of CERIF currently focus on recording information about people, projects, organisations, publications, patents, products, facilities and equipment.
One aspect that is not specifically covered, though it is of increasing interest, is research datasets as a form of research output. Some disciplines generate considerable numbers of datasets that are an important output. These may also be useful to other researchers as an input to their research, may help to improve research and speed up the rate of discovery, and may in themselves be a useful indicator of impact. Transparency and access are increasingly a feature of the public sector, and making results more readily available and discoverable is now the order of the day. Where results were generated with the assistance of public funds there is also a moral obligation to make them available for inspection and reuse. In addition to the moral argument, there is increasing legislation in this direction that places specific requirements on researchers and their host organisations.

Beyond the activity out of which they arose, these data collections can be very useful for future research, including research by others; they present a formidable resource if properly managed. With appropriate safeguards in place for the protection of the results, proper acknowledgement and, if necessary, licensing, they have the potential to propel research. However, because there is little concrete knowledge in the relevant research community as to who holds which datasets, their potential currently remains underutilised.

[Figure 1: C4D Ontology based navigation/retrieval. Elements shown: Dataset A Metadata, Dataset B Metadata, Dataset n Metadata; C4D enhanced CERIF Repository; JISC Advance/SSPS HE Cloud & ESB; C4D Navigator; Publications, Institutions, Datasets, Projects, Researchers.]

Access to these datasets would be useful for researchers and would have additional benefits for the sharing parties by demonstrating the impact they create.
Consequently, there is a need to record these datasets with easy-to-use tools and to publish them in a way that facilitates their discovery. Realising these benefits poses some specific problems:

• The generation of metadata is not always easy or intuitive and can become onerous once many datasets are generated on a regular basis. Tools are needed to make sure that correct and appropriate metadata can be generated by researchers or support staff. This may sound easy, but experience has shown that a considerable amount of checking and data cleaning can be required to achieve it in practice.
• In a distributed and networked world with increasing mobility of data and data holders, standards are needed so that data can be extracted, migrated, integrated or aggregated successfully without significant human intervention.
• Discovery is an important issue: making these resources findable and useful also requires the use of appropriate classification systems.
• Relevant datasets may exist in a variety of repositories and may have arisen in other disciplines. A pertinent example is the work of the Sea Mammal Research Unit (SMRU), where biological research creates records that are of specific interest for climatological research; such cross-disciplinary cases are not uncommon.

Figure 1 above presents the conceptual framework for C4D; it shows where C4D fits into the current research landscape and what it attempts to provide in practical terms.
Given the motivations stated earlier, C4D attempts to find a solution to this problem and consequently focuses on the following issues:

• Examining the support and extension requirements of CERIF to support datasets
• Proposing a flexible standard for the metadata for datasets, based on existing ones, to cover a variety of disciplines
• Choosing an appropriate classification system to facilitate storage and retrieval of datasets or the metadata about datasets
• Developing tools for the manual generation of metadata into a CERIF compliant metadata data store and their exchange with other CERIF compliant data stores
• Discovery and retrieval as well as exploration in relation to other key entities associated with these datasets, such as owners, publications, organisations and projects.

In the following we look at the classification scheme to be employed before considering the proposed system architecture and ontology support requirements.

2 CLASSIFICATION TAXONOMY REQUIREMENTS

The ROS categories provide a comprehensive, hierarchical set of elements for recording information about the outcomes of research. Within each of the top level outcome types (categories) there are additional selection criteria or options available at a first tier and a second tier level. For example, the top level “Other Research Outputs” outcome type has options, at the first tier, for biological outputs, electronic outputs and physical outputs; for electronic outputs, at the second tier level, there is the option of selecting ‘dataset’, which is described as “a structured record of the value of variables that were measured as part of the research”.

The RCUK classification scheme has 78 first level categories and then subcategories to level 3. The main categories and selected subcategories are shown in the list below.
Please note that resolution at level 3 can vary from a single subcategory to 30 or more:

• Agri-environmental science
   o Agricultural systems
   o Crop protection
   o Crop science
   o Earth and environmental
   o Soil science
• Animal science
   o 3Rs [Reduction, Refinement, Replacement]
   o Animal and human physiology
   o Animal behaviour
   o Animal diseases
   o Animal organisms
   o Animal reproduction
   o Animal welfare
   o Endocrinology
   o Immunology
   o Livestock production
   o Musculoskeletal system
   o Parasitology
   o Psychology
   o Systems neuroscience
• Archaeology
   o Archaeological Theory
   o Archaeology Of Human Origins
   o Archaeology Of Literate Societies
   o Industrial Archaeology
   o Landscape & Environmental Archaeology
   o Maritime Archaeology
   o Palaeobiology
   o Prehistoric Archaeology
   o Quaternary Science
   o Science-Based Archaeology
• Astronomy - observation
   o Astronomy & Space Science Technologies
   o Data Handling & Storage
   o Extra-Galactic Astronomy & Cosmology
   o Galactic & Interstellar Astronomy
   o Instrumentation for Particle Physics Or Astronomy
   o Stellar Astronomy
• Astronomy - theory
   o Computational Methods and Tools
   o Extra-Galactic Astronomy & Cosmology
   o Galactic & Interstellar Astronomy
   o Stellar Astronomy
• Atmospheric physics and chemistry
   o Atmospheric Kinetics
   o Boundary Layer Meteorology
   o Land - Atmosphere Interactions
   o Large Scale Dynamics/Transport
   o Ocean - Atmosphere Interactions
   o Radiative Processes & Effects
   o Stratospheric Processes
   o Tropospheric Processes
   o Upper Atmosphere Processes & Geospace
   o Water In The Atmosphere
• Atomic and molecular physics
   o Atoms and Ions
   o Cold Atomic Species
   o Fusion
   o Light-Matter Interactions
   o Scattering & Spectroscopy
• Bioengineering
   o Biochemical engineering
   o Bioenergy
   o Biomaterials
   o Bionanoscience
   o Bioreactors
   o Environmental biotechnology
   o Macro-molecular delivery
   o Metabolic engineering
   o Novel industrial products
   o Protein engineering
   o Tissue engineering
• Biomolecules and biochemistry
   o Ageing: chemistry/biochemistry
   o Biochemistry and physiology
   o Bioenergetics
   o Biological membranes
   o Biophysics
   o Carbohydrate Chemistry
   o Catalysis and enzymology
   o Chemical Biology
   o Multiprotein complexes
   o Protein chemistry
   o Protein expression
   o Protein folding / misfolding
   o Structural biology
• Catalysis and surfaces
   o Catalysis and Applied Catalysis
   o Complex fluids and soft solids
   o Electrochemical Science and Engineering
   o Surfaces and Interfaces
• Cell biology
   o Ageing: chemistry/biochemistry
   o Cell cycle
   o Cells
      ▪ Adipocytes
      ▪ B cells
      ▪ Basophils
      ▪ Blastocyst
      ▪ Cells
      ▪ Chondrocyte
      ▪ Collenchyma
      ▪ Dendritic cells
      ▪ Endothelial cells
      ▪ Eosinophils
      ▪ Epidermal cells
      ▪ Epithelial cells
      ▪ Erythrocytes
      ▪ Granulocytes
      ▪ Guard cells
      ▪ Haematopoietic stem cells
      ▪ Hepatocytes
      ▪ Keratinocytes
      ▪ Leukocytes
      ▪ Lymphocytes
      ▪ Macrophages
      ▪ Mast cells
      ▪ Melanocytes
      ▪ Myocytes
      ▪ Natural killer cells
      ▪ Neutrophils
      ▪ Oocytes
      ▪ Osteoblasts
      ▪ Osteoclasts
      ▪ Osteocytes
      ▪ Pancreatic alpha cells
      ▪ Pancreatic beta cell
      ▪ Parenchyma
      ▪ Photoreceptor cells
      ▪ Platelets
      ▪ Regulatory T cells
      ▪ Sclerenchyma
      ▪ Stem cells
      ▪ Stromal cells
      ▪ T Cells
      ▪ Trophoblast
   o Communication and signalling
   o Organelles and components
   o Receptors
   o Stem cell biology
• Chemical reaction dynamics and mechanisms
• Chemical measurement
• Chemical synthesis
• Civil engineering and built environment
• Classics
• Climate and climate change
• Complexity science
• Cultural and museum studies
• Dance
• Demography and human geography
• Design
• Development studies
• Drama and theatre studies
• Ecology, biodiversity and systematics
• Economics
• Education
• Electrical engineering
• Energy
• Environmental engineering
• Environmental planning
• Facility Development
• Food science and nutrition
• Genetics and development
• Geosciences
• History
• Information and communication technologies
• Instrumentation, sensors and detectors
• Languages and Literature
• Law and legal studies
• Library and information studies
• Linguistics
• Management and business studies
• Manufacturing
• Marine environments
• Materials processing
• Materials sciences
• Mathematical sciences
• Mechanical engineering
• Media
• Medical and health interface
• Microbial sciences
• Music
• Nuclear physics
• Omic sciences and technologies
• Optics, photonics and lasers
• Particle astrophysics
• Particle physics - experiment
• Particle physics - theory
• Philosophy
• Planetary science
• Plant and crop science
• Plasma physics
• Political science and international studies
• Pollution, waste and resources
• Process engineering
• Psychology
• Social anthropology
• Social policy
• Social work
• Sociology
• Solar and terrestrial physics
• Superconductivity, magnetism and quantum fluids
• Systems engineering
• Terrestrial and freshwater environments
• Theology, divinity and religion
• Tools, technologies and methods
• Visual arts

Table 1: The RCUK Subject Classification Scheme

Overall, the RCUK categories present a good level of detail with regard to research outputs and could provide the fine detail that will be needed for the C4D subject demonstrator. Given that C4D aims to manage metadata of datasets for a variety of disciplines, and in particular to support their discovery through search-based retrieval or explorative browsing, the RCUK category set seems to be a better fit, also with a view to generating a scalable solution. Furthermore, because the RCUK scheme is already in wide circulation in the research councils for use with grant applications and research outputs, it would provide a compatible solution that could allow data from different locations and of different types to be integrated at the point of browsing.
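To illustrate how a hierarchical scheme supports both search-based retrieval and explorative browsing, the following is a minimal sketch in Python. It is not part of the C4D design: the nested-dictionary representation, the function names and the dataset identifiers (ds-001 etc.) are assumptions made purely for illustration, and only a small excerpt of Table 1 is included.

```python
from typing import Dict, Iterator, List, Optional

# Illustrative excerpt of Table 1. Nested dicts stand in for the taxonomy;
# an empty dict marks a category with no listed subcategories.
Taxonomy = Dict[str, "Taxonomy"]

RCUK_EXCERPT: Taxonomy = {
    "Cell biology": {
        "Cell cycle": {},
        "Cells": {"B cells": {}, "Stem cells": {}, "T Cells": {}},
        "Receptors": {},
    },
    "Atomic and molecular physics": {
        "Cold Atomic Species": {},
        "Fusion": {},
    },
    "Classics": {},
}

# Hypothetical dataset records, each tagged with one category name.
DATASETS: Dict[str, str] = {
    "ds-001": "Stem cells",
    "ds-002": "Fusion",
    "ds-003": "Cell cycle",
}


def find_path(tree: Taxonomy, target: str, prefix: tuple = ()) -> Optional[List[str]]:
    """Return the path from a top level category down to `target`, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return list(path)
        found = find_path(children, target, path)
        if found is not None:
            return found
    return None


def subtree(tree: Taxonomy, target: str) -> Optional[Taxonomy]:
    """Return the children of `target` (possibly empty), or None if absent."""
    for name, children in tree.items():
        if name == target:
            return children
        found = subtree(children, target)
        if found is not None:
            return found
    return None


def walk(tree: Taxonomy) -> Iterator[str]:
    """Yield every category name in `tree`, depth first."""
    for name, children in tree.items():
        yield name
        yield from walk(children)


def datasets_under(tree: Taxonomy, category: str, tagged: Dict[str, str]) -> List[str]:
    """Return datasets tagged with `category` or any category below it."""
    children = subtree(tree, category)
    if children is None:
        return []
    wanted = {category, *walk(children)}
    return sorted(ds for ds, tag in tagged.items() if tag in wanted)


print(find_path(RCUK_EXCERPT, "Stem cells"))
print(datasets_under(RCUK_EXCERPT, "Cell biology", DATASETS))
```

A query for "Cell biology" also returns the dataset tagged at level 3 with "Stem cells", which a flat list of terms could not do without extra bookkeeping. Note that some subcategory names in Table 1 occur under more than one parent (for example "Stellar Astronomy"), so a real implementation would need to identify categories by their full path rather than by name alone; this sketch simply returns the first match.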
The ontology will need to cover, in its core concepts, the key entities in the CERIF data store, especially where these are related to datasets; for classification purposes it will also include a complete classification system in taxonomic form.

3 PROPOSED C4D DISCOVERY ARCHITECTURE

Once datasets are fully supported in CERIF and in CERIF data stores, whether distributed or centralised, these data stores will be a formidable information resource that will serve a variety of users and uses. To fully unlock this potential a number of issues have to be addressed:

• generation of correct metadata and integration with existing records where necessary (potentially also completing missing information in related records)
• facilitating the discovery of relevant records through appropriate retrieval mechanisms (even across distributed repositories)
• enabling the browsing and navigation of relevant datasets and their associated people, organisations, projects and publications

The generation of correct metadata and the correct insertion of records in a CERIF repository will require linking them correctly to already existing information, such as that about people, organisations and projects, or alternatively initiating the creation of additional records (e.g. a dataset needs a person who generated and owns it, and should be linked to the project through which it was created and the organisation that hosted the project). This will draw on some earlier work of the authors such as (Bokma 2004) and (Tektonidis & Bokma 2006). It might also require links to be made to data in other repositories in a distributed environment where data is to be subsequently searched and retrieved. Finally, the navigation of datasets and related entities will require connections to be made that might not be explicitly present in the existing record.
For this reason it is necessary to have a mechanism that can look at relationships and expectations with respect to correct records and reason about them, be it for the purpose of generating correct records, data cleaning or browsing. C4D will develop an ontology and an ontology-driven interface to CERIF data stores, which will provide the key features of import, discovery, exploration and linking. By modelling the key concepts of people, projects, publications, organisations and results, an interface can be built that enables the non-invasive retrieval of records, also across data stores, and the representation and navigation of links between entities, for example to visualise the records surrounding a particular dataset, such as other datasets, researchers, publications and projects. This will enable the vision presented in Figure 1 above and provide for powerful retrieval and navigation of these intrinsically related entities. For the thematic retrieval of datasets, and as a starting point for showing related entities, the use of a classification system will also be important; hence the integration of the RCUK classification system into an eventual ontology, as discussed in section 2. When dealing with distributed repositories, where there may be inconsistencies and no possibility of enforcing data cleaning, it is important to have a flexible mechanism that can resolve these in a non-invasive way at the point of combining and aggregating retrieval results.

In order to make use of existing work to develop a comprehensive solution, one possible approach could be to base the application on R2O and ODEMapster; see (Buil-Aranda, Corcho et al 2009). R2O & ODEMapster is an integrated framework for the formal specification, evaluation, verification and exploitation of the semantic mappings between ontologies and relational databases.
This integrated framework consists of:

• R2O, a declarative, XML-based language that allows the description of arbitrarily complex mapping expressions between ontology elements (concepts, attributes and relations) and relational elements (relations and attributes)
• ODEMapster processor, which generates Semantic Web instances from relational instances based on the mapping description expressed in an R2O document.

Figure 2 below shows how such a system architecture would work by linking to an existing metadata repository (CERIF), which holds the records on researchers, projects, publications and datasets; this would be accessed using ODEMapster. The application then uses the C4D ontology, maps it to the database using R2O mappings, and allows the database to be searched and data to be retrieved by interrogating the domain ontology. Depending on user preference, the results can then be accessed through a traditional tabular search facility or through an interactive graphical representation that allows the collection of information to be navigated and browsed. A further benefit of this approach is that it could easily be extended to deal with a variety of repositories and allow the data to be combined for more unified and transparent navigation. However, this would require additional effort to deal with the merging of data from different sources.

[Figure 2: Application of Metadata Ontology. Elements shown: C4D - CERIF Domain Ontology; Protégé; C4D Navigator (Import, Search and Visualisation); R2O Mappings; RDB; Ontology; ODEMapster; CERIF Repository.]

The use of a common classification scheme in combination with these key concepts will allow any of these categories and existing records to be managed for the purpose of discovery and integration.
Thus the ontology will also need to contain the classification scheme as a topic hierarchy, to be used to classify individual researchers, projects, datasets and publications, so that the activities in particular subject areas become extractable and so that the proposed vision of being able to explore activity surrounding particular datasets, as depicted in the graphical form reproduced above, is achieved. The fact that the RCUK classification scheme is also hierarchical is useful, as it allows a richer set of functionality to be generated than a flat dictionary would.

In the following we explore the key concepts of the ontology and their relationships. The proposed implementation will follow the architecture described in Figure 2 above and comprise:

• Dataset: with references to the entity owning it (person or orgunit) and to the repository in which it is stored; datasets are also to have one or more (ROS) classifications
• Project: with references to the relevant fundholder and orgunit, and to the datasets it generates
• Person: with ownership of datasets and authorship of publications
• Publications: with authorship of person, references to datasets and affiliation of authors to orgunits
• Orgunits: referring to persons as staff members, hosting projects and curating datasets
• Repositories: containing actual datasets and referring to owners and data management plan

[Figure 3: C4D Ontology Core Concepts. Concepts shown: Project, Orgunit, Repository, Dataset, Person, Classification, Publication. Relations shown: isHostedBy, isGenerating, hasStaff, isStoredIn, isOwnedBy, hasClassification, isReferringTo, isAuthorOf.]

The ontology will contain Classification as a main core concept, in which ROS will be represented in taxonomic form and used, through an object property of Dataset, to classify datasets. Figure 3 above shows the core concepts and their principal relations.
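As an illustration of how the core concepts and relations of Figure 3 could be used for navigation and for spotting incomplete records, the following is a minimal sketch in plain Python. It is not the C4D ontology and does not use R2O, ODEMapster or any OWL tooling. The relation names are taken from Figure 3; the domain and range assigned to each relation are inferred from the concept list above and should be read as assumptions, as should all record identifiers (ds1, p1 and so on) and the choice of which links are required for a dataset.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Relation name -> (subject concept, allowed object concepts).
# Names come from Figure 3; the domain/range pairs are assumptions.
RELATIONS: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "isOwnedBy": ("Dataset", ("Person", "Orgunit")),
    "isStoredIn": ("Dataset", ("Repository",)),
    "hasClassification": ("Dataset", ("Classification",)),
    "isGenerating": ("Project", ("Dataset",)),
    "isHostedBy": ("Project", ("Orgunit",)),
    "hasStaff": ("Orgunit", ("Person",)),
    "isAuthorOf": ("Person", ("Publication",)),
    "isReferringTo": ("Publication", ("Dataset",)),
}

# Links assumed to be required for a dataset record to count as complete.
REQUIRED_FOR_DATASET = ("isOwnedBy", "isStoredIn", "hasClassification")

# Hypothetical records: the concept of each entity, then the links.
TYPES: Dict[str, str] = {
    "ds1": "Dataset", "ds2": "Dataset", "p1": "Person",
    "proj1": "Project", "org1": "Orgunit", "pub1": "Publication",
    "repo1": "Repository", "Marine environments": "Classification",
}
TRIPLES: List[Triple] = [
    ("ds1", "isOwnedBy", "p1"),
    ("ds1", "isStoredIn", "repo1"),
    ("ds1", "hasClassification", "Marine environments"),
    ("proj1", "isGenerating", "ds1"),
    ("proj1", "isHostedBy", "org1"),
    ("org1", "hasStaff", "p1"),
    ("p1", "isAuthorOf", "pub1"),
    ("pub1", "isReferringTo", "ds1"),
    ("ds2", "isStoredIn", "repo1"),  # deliberately incomplete record
]


def type_errors(triples: List[Triple], types: Dict[str, str]) -> List[Triple]:
    """Return links whose subject or object has the wrong concept."""
    errors = []
    for subject, relation, obj in triples:
        domain, ranges = RELATIONS[relation]
        if types.get(subject) != domain or types.get(obj) not in ranges:
            errors.append((subject, relation, obj))
    return errors


def neighbours(triples: List[Triple], entity: str) -> Set[str]:
    """Return entities directly linked to `entity`, in either direction."""
    linked = set()
    for subject, _, obj in triples:
        if subject == entity:
            linked.add(obj)
        elif obj == entity:
            linked.add(subject)
    return linked


def missing_links(triples: List[Triple], types: Dict[str, str]) -> Dict[str, List[str]]:
    """Report, per dataset, the required relations it does not yet have."""
    present: Dict[str, Set[str]] = {}
    for subject, relation, _ in triples:
        present.setdefault(subject, set()).add(relation)
    report = {}
    for entity, concept in types.items():
        if concept != "Dataset":
            continue
        missing = [r for r in REQUIRED_FOR_DATASET if r not in present.get(entity, set())]
        if missing:
            report[entity] = missing
    return report


print(sorted(neighbours(TRIPLES, "ds1")))
print(missing_links(TRIPLES, TYPES))
```

Here `neighbours` corresponds to browsing the records surrounding a dataset, while `missing_links` flags ds2 as lacking an owner and a classification. In an ontology-based implementation such expectations would more naturally be expressed as restrictions on the Dataset concept and checked by a reasoner rather than by hand-written functions.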
By adding appropriate restrictions it will also be possible to reason about datasets and their relationships to other entities, to uncover new links, and to infer or alert to missing information that is crucial to a consistent dataset. The ontology will be mapped to the CERIF data store to point at the relevant CERIF entities and their related properties and entity relations. By linking the ontology to the CERIF data store, the relevant records can be retrieved for search and navigation purposes; the link also supports managing the correct generation and insertion of new records, to be discussed in Deliverable D3.2. The proposed architecture of separating the ontology and application from the CERIF data store itself also allows the approach to be used with few modifications in a distributed environment where records are located in separate CERIF data stores.

References

Bokma A., Solomon N., Harvey H. & Jones S., “NOPIK: Creating Shared Conceptualisations for Information and Knowledge Sharing in the Virtual Enterprise”, Proc. of the 10th International Conference on Concurrent Enterprising - ICE2004, Seville, June 14-16, 2004

Buil-Aranda C., Corcho O. & Krause A., “Robust service-based semantic querying to distributed heterogeneous databases”, 20th International Workshop on Database and Expert Systems Application, Linz, Austria, August 31 - September 04, 2009, ISBN: 978-0-7695-3763-4

Tektonidis D. & Bokma A., “Adapting ontologies for integrating e-Government Information Systems”, 2nd International Conference on Interoperability of eGovernment Services, 22-24 March 2006, Bordeaux, France, ISSN: 1773-0279

European Commission (2008) INSPIRE Metadata Regulation 03.12.2008. Available from: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32008R1205:EN:NOT [Acc. 6 December 2011]