CERIF-CRIS Implementation Advice

Keith G Jeffery, Anne Asserson
20120202
©Keith G Jeffery/Anne Asserson/euroCRIS

Introduction

The CERIF model is becoming widely accepted for CRIS. However, its very flexibility allows for multiple methods of implementation. While all are potentially valid, experience has demonstrated that some implementation styles or techniques are better than others along the dimensions of representativity, usability and efficiency.

First we explain how the philosophical concept behind CERIF differs from relational database implementations. We then continue with how to implement CERIF, using as examples the four base entities Project, Person, Organisational Unit and Result-Publication.

This document is aimed at persons who have some understanding or experience of how to implement relational databases. It should be read in conjunction with the specification of the CERIF model on the euroCRIS website (Brigitte Jörg et al).

The Philosophy behind the CERIF Model

Conventional relational database technology supports only hierarchic relationships using the foreign key as a linking mechanism. This mechanism tells us nothing more about the linkage than that it exists. This is analogous to a URL in HTML or XLink in XML.

CERIF goes beyond this. CERIF has entities that are static and relationships (linkages) that are dynamic. These relationships include a role, which is the purpose of the linkage, and the temporal duration of the relationship. The relationship cardinality can be 0, 1 or many instances of an entity at either or both ends of the link. CERIF thus represents a fully connected graph structure (many-to-many) as found in the world of research information and expresses the semantics (meaning) of the linkage through the role.

This rich structure leads to some simple rules:

• No dates in base entities (unless the date is needed for a particular application purpose, e.g. date of publication is part of a bibliographic reference) because the temporal information is expressed in relationships;
   o Example to avoid: project start date, event date
• No flags (representing a yes/no condition for state of processing) in base entities because the temporal (and therefore state) information is expressed in relationships;
   o Example to avoid: proposal received (Y/N) or project state (proposal, funded, ongoing, terminated)
• Implement the generic entity with specialisation via roles;
   o Implement an entity Person, NOT Author, since Author is a role of a Person related to a Research Publication
• Distinguish clearly classification of a base entity (a characteristic of that entity independent of any relationships) and classification (role) within a linking relation (a characteristic of the relationship, not of either entity involved in the relationship).
   o Example: classification of a publication as ‘book’ is a property of the base entity; classifying the book as ‘published’ is a property of the relationship between the book and the OrgUnit of type publisher.
• In general it is better to classify via linking relations:
   o specific expertise of a person is usually related to a role with respect to a particular OrgUnit or Project or service rather than being a native property of the person;
• Use the terms in the CERIF classification schemes; if they are not used, provide a crosswalk to CERIF canonical semantics from the equivalent terms in a particular implementation of a CERIF-CRIS;
   o If the local CERIF-CRIS describes the country France as ‘France’, provide a crosswalk to ‘FR’
• Do not subclass: in CERIF all entities are first class objects (since different users have a different view of the relative importance or ‘ownership’ of one entity by another). The OpenAIRE datamodel provides an example.
  The relationship between Persons and Projects is clearly n:m; however the ‘participants’ relation is not an implementation representing this relationship but a particular subclassed role ‘coordinating person’.

Kinds of CERIF Entities

CERIF includes many base entities to represent research information. However four such entities are found commonly in almost all CRIS: Project, Person, Organisational Unit and Result-Publication. The E-R model diagram below (from the CERIF model specification on the euroCRIS website) describes Project, Person, Organisational Unit (coloured green) with their linking relations to each other (coloured lilac) and their direct relational links for language-dependent extensions (coloured yellow). Now let us deal with these three aspects in turn:

Base entities (green)

The base entities have few attributes because the information is found in the linking relations and the language-dependent entities (lilac and yellow). One or more classifications can be applied to a base entity.

Linking Relations

Linking relations express a role (purpose) between two base entities and the temporal duration of that role. The role is not, however, expressed directly as an attribute in the linking relation. The role is part of a classification scheme (cfClassScheme) where the term (cfClass) for the role is given. This allows one stored role for many uses.

Classification Scheme: Project-Person
   o Project leader
   o Participant
   o Contact person

Note that Participant would be a valid role in the linking relation Person-Event and Contact Person a valid role in Person-Organisation, Person-Facility, Person-Equipment etc.

Language-Dependent Entities (yellow)

The language dependent entities contain textual information which may be in more than one language.
Thus they have a one-to-many relationship with the base entities and are represented using conventional relational primary/foreign key relationships. They do not need the expressive power of linking relations since they are only language variants of the same attribute(s) and, if monolingual, would be in the base entity as an attribute.

In addition to these basic entities there are two more types of entity that are used in CERIF:

Classification Entities (pink)

CERIF does not have attributes within base relations that are classification terms to characterise the instances of that entity. Instead, an external classification system points to the particular instance within the entity. Thus, to classify two instances of a publication respectively as ‘book’ and ‘journal article’, a Class Scheme (cfClassScheme) for Publications (cfResPubl) is set up within which these Classification Terms (cfClass) occur. Then a linking relation (cfResPubl_Class) relates the Classification Term (e.g. ‘book’) to the instance in the entity Publications and similarly ‘journal article’ to the other instance.

Similarly, to classify OrgUnits an appropriate Class Scheme and set of Class terms is set up so that ‘University’ can be distinguished from ‘SME’. cfOrgUnit_Class provides the link.

We also use this classification technique for the roles within a linking relation. Thus in cfOrgUnit_ResPubl we may wish to express that the OrgUnit is the publisher of the publication. The Class Scheme relating to cfOrgUnit_ResPubl would contain ‘publisher of’ but could also contain ‘rightsholder of’ or ‘funder of’.

The following diagram illustrates Base Entities, a language dependent entity, a linking relation and how classification is used.

[Diagram: cfClassScheme and cfClass (e.g. book | journal article; University | SME; publisher of | rightsholder of) applied through cfResPubl_Class and cfOrgUnit_Class to the base entities cfResPubl and cfOrgUnit (with cfOrgUnitName), and to the linking relation cfOrgUnit_ResPubl between them]

The Class Scheme / Class construct is the semantic layer of CERIF and allows:
1. All semantic information to be kept structured in one place, for ease of maintenance and to ensure consistency of use of terminology;
2. The same term with the same meaning within one Class Scheme to be used in multiple base relations or linking relations;
3. The same term with different meaning within different Class Schemes to be used in multiple base relations or linking relations.

An example of (2) above is ‘is part of’. This may be used in many linking relations between different base entities. Another example is a given country code e.g. ‘FR’. This may be used in several base relations. An example of (3) is ‘computing’. It may refer to a project, to an organisation, to a publication or product or to a service.

Name Variant Entities (grey)

Especially in the case of cfPers it may be necessary to record variants of the name. This is done by having a special base entity cfPersName connected to cfPers by a linking relation so that the temporal duration and role (classification) of each name variant instance can be recorded. See under Person for more detail.

The Major CERIF Base Entities

In addition to the standard green, lilac and yellow colours and meanings we now add a pink coloured entity for classification of a base entity and a grey coloured entity for names and name variants.

Person

cfPers

The Person entity includes only identification attributes: Id, BirthDate, Gender and URI. One might expect attributes for Family Name and Other names; in CERIF these are managed by a separate entity cfPersName (coloured grey) which is linked to cfPers through CFPers_Persname, allowing a role and dates to be assigned to each set of names. This allows a person to have multiple names simultaneously or serially.
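
As a rough illustration of the name-variant construct just described, the following Python sketch uses an in-memory SQLite database. The table and column names (cfPersId, cfPersNameId, cfClassId, cfStartDate, cfEndDate), the role terms ‘passport name’ and ‘authorship name’, the sample names and the use of 9999-12-31 as an open end date are all simplified assumptions made for this example; they are not taken from the CERIF specification, which should be consulted for the actual attributes.

```python
import sqlite3

# Simplified, hypothetical schema: names and columns are assumptions for
# illustration only and are not taken from the CERIF specification.
SCHEMA = """
CREATE TABLE cfPers (cfPersId TEXT PRIMARY KEY, cfBirthDate TEXT, cfGender TEXT, cfURI TEXT);
CREATE TABLE cfPersName (cfPersNameId TEXT PRIMARY KEY, cfFamilyNames TEXT, cfOtherNames TEXT);
CREATE TABLE cfClass (cfClassId TEXT PRIMARY KEY, cfTerm TEXT);
-- Linking relation: which name, for which person, in which role, and when.
CREATE TABLE cfPers_PersName (
    cfPersId TEXT REFERENCES cfPers(cfPersId),
    cfPersNameId TEXT REFERENCES cfPersName(cfPersNameId),
    cfClassId TEXT REFERENCES cfClass(cfClassId),
    cfStartDate TEXT,
    cfEndDate TEXT,
    PRIMARY KEY (cfPersId, cfPersNameId, cfClassId, cfStartDate)
);
"""


def build_example():
    """Create the sketch schema and load one person with made-up names."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO cfPers VALUES ('p1', NULL, NULL, NULL)")
    con.executemany(
        "INSERT INTO cfPersName VALUES (?, ?, ?)",
        [("n1", "Smith", "Jane"), ("n2", "Jones", "Jane")],  # invented sample data
    )
    con.executemany(
        "INSERT INTO cfClass VALUES (?, ?)",
        [("c1", "passport name"), ("c2", "authorship name")],  # invented role terms
    )
    con.executemany(
        "INSERT INTO cfPers_PersName VALUES (?, ?, ?, ?, ?)",
        [
            ("p1", "n1", "c1", "1990-01-01", "2005-06-30"),  # serial: earlier name
            ("p1", "n2", "c1", "2005-07-01", "9999-12-31"),  # serial: later name
            ("p1", "n1", "c2", "1990-01-01", "9999-12-31"),  # held simultaneously
        ],
    )
    return con


def names_on(con, pers_id, day):
    """Return (role, family name) pairs valid for a person on an ISO date."""
    rows = con.execute(
        """
        SELECT c.cfTerm, n.cfFamilyNames
        FROM cfPers_PersName AS l
        JOIN cfPersName AS n ON n.cfPersNameId = l.cfPersNameId
        JOIN cfClass AS c ON c.cfClassId = l.cfClassId
        WHERE l.cfPersId = ? AND l.cfStartDate <= ? AND ? <= l.cfEndDate
        ORDER BY c.cfTerm
        """,
        (pers_id, day, day),
    )
    return rows.fetchall()
```

Calling names_on(con, "p1", "2010-01-01") on the database returned by build_example() would list the later passport name alongside the unchanged authorship name. The point of the design is that the base entity cfPers stores no name at all: every name, its role and its period of validity live in the linking relation.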
The research interests of a person and keywords are in separate entities; in this case they are linked directly to cfPers because there is no need to record role and dates. Classifications may be applied to the person – and one or more schemes may be used.

Linking Relations

CfPers has many linking relations. cfPers_Pers allows a recursive relationship to be defined linking two persons. However, it should be used sparingly. For example, co-authors are best defined by the cfPers-ResPubl linking relation; co-workers are best defined by the cfPers_OrgUnit linking relation.

The other linking relations provide – with role and dates – the relationship between a person and the other entities. CfPers_OrgUnit may have a role employed; CfPers-ResPubl may have role author but also editor or reviewer; cfProj-Pers may have roles such as project leader or participant.

Project

cfProj

The project entity has Id, Acronym (which is not language dependent) and URI. The language-dependent and repeating Title, Abstract and Keywords of a project are in separate entities. They are linked directly to cfProj because there is no need to record role and dates. Classifications may be applied to the project – and one or more schemes may be used.

Linking Relations

cfProj-Proj provides for recursion; for example one project is a sub-project of another or a follow-on to a previous project. cfProj_OrgUnit may be interesting; the OrgUnit involved could be the organisation performing the project or the organisation funding the project. In each case the role would be different. Furthermore there may be more than one organisation in either role – CERIF allows for a fractional value to be recorded in the linking relation.

Organisational Unit

cfOrgUnit

cfOrgUnit has attributes: currency code, headcount and turnover (to give an idea of scale) in addition to the usual Id, acronym and URI.
The language-dependent name, research activity and keywords are linked directly to cfOrgUnit because there is no need to record role and date. Classifications may be applied to the organisational unit – and one or more schemes may be used.

Linking Relations

cfOrgUnit_OrgUnit allows for the structure of organisations to be represented. A university may be structured hierarchically with faculties, departments and groups but also may have research centres or institutes which may belong to more than one department. Some universities have a schools or college structure orthogonal to the academic structure. The role within the linking relations defines the dependency structure (is part of). The role of an organisational unit as e.g. a research institute is defined by the classification (pink coloured) applied to this organisational unit.

The other linking relations of OrgUnit may well have more than one OrgUnit related to the other entity: cfOrgUnit_ResPubl may have two OrgUnits with roles respectively of author’s institution and publisher. The example of the relationship to project is given above under Project. CERIF does not have separate entities for Funding Organisation or Research Institution. Similarly an Organisational Unit could fund or operate a research facility.

Result

CERIF does not have an entity Result but – because of their different attributes and requirements of processing – distinguishes publications, patents and products. The three Result entities (coloured orange) and their interlinks (lilac) are illustrated below.

How to handle original publication title – do we need a structure for Title like cfPersName
Note dates within cfPatent (OK in cfResPubl because part of bibliographic reference)

Result Publication

cfResPubl has – in addition to Id and URI – attributes required to construct a bibliographic reference (including publication date) except the Title and subtitle.
These are in separate entities (like abstract and keywords) to allow multilinguality. Classifications may be applied to the publication – and one or more schemes may be used. One may classify the publication by its type: book, book chapter, journal, journal article, conference proceedings, conference paper, technical report etc. For subject classification typically UDC (Universal Decimal Classification) or LOC (Library of Congress classification) may be used, although there are specialised classifications for medicine (MeSH) and some organisations use the same classification scheme as that for research activity, such as Frascati.

Linking Relations

cfResPubl_ResPubl allows several different kinds of relationship. It can record that a chapter is part of a book or that a journal article is part of a journal. It can also record that article B referenced or cited article A. Furthermore it can indicate version relationships, although these are perhaps better recorded by the cfResPubl_OrgUnit relationship (including dates) where the OrgUnit is the publisher.

Country and Geolocation

Contact Information

MAPPING TO CERIF

General Concept of the Activity

The euroCRIS approach comprises 3 activities:

1. Analyse the information requirements of the target CRIS to produce (a) a data model; (b) a mapping of that model to CERIF, including a definition of the representation of the target CRIS’ specific vocabularies in the CERIF semantic layer, to be validated by the euroCRIS core team;
2. Analyse the data model (explicit or implicit) in each non-CERIF source related to the target CRIS and for each produce (a) a data model; (b) a mapping of that model to CERIF, to be validated by the euroCRIS core team;
3. Summarise, across the core model from activity 1 and all the data sources of relevance to the target CRIS from activity 2, the commonality with, and the differences from, CERIF. For each of the differences (a) analyse whether it requires an extension to CERIF or not and (b) if so, produce a formal proposal to the CERIF Task Group for their approval and a final CERIF-compliant data model for the target CRIS.

Implementation of the Methodology

The detailed implementation is as follows:

1. The target CRIS information requirements generally are mapped to CERIF to produce a first-cut data model and documented information structures which do not map to CERIF. This involves analysis of the business objects (entities) and their attributes and – importantly – the role and temporal relationships between them;
2. A similar activity is undertaken for each of the non-CERIF sources to be integrated within the target CRIS. Here there is considerable effort (a) to understand the syntax (information structures) of the source, since commonly data models do not represent accurately the real world, forcing structures to be hierarchic rather than fully connected graphs, using sub-classing inappropriately etc; (b) to understand the semantics of the information from the source, since there is rarely a standardised vocabulary or even a consistent one within any one source;
3. The summarisation activity is crucial; it brings together all the inherent problems in the mappings categorised as syntactic and semantic and produces resolutions.
   The resolution steps are:
   (1) taking the problematic information structure and semantics, attempt to map to CERIF using
       (a) base entities,
       (b) plus language-dependent entities,
       (c) plus classification using existing schemes and terms,
       (d) plus linking relations with defined roles;
   (2) if any information structures are not resolved by (1), propose an appropriate information structure and semantics following the CERIF philosophy in the steps
       (a) add additional base entity linked directly to an existing base entity (to accommodate additional attributes);
       (b) add additional language-dependent entity to extend the scope of the base entity with textual terms that may be multilingual;
       (c) add additional terms to existing classification or to a new classification scheme;
       (d) add additional base entity with linking relation including role and temporal semantics.
   (3) Reconsider the proposed extensions to CERIF – it may be that with a ‘second glance’ one may find that the requirements can be mapped directly to CERIF, especially after the analysis in the steps of (2) are done;
   (4) Propose changes to the CERIF Task Group where a larger and more diverse group of experts can consider the proposals with a reconsideration like (3);
   (5) If the proposed changes are approved, implement in the CERIF data model either as central approved changes or as supplementary additions for this particular (Target CRIS) purpose depending on the breadth of applicability of the changes.

ANNEX

Relational theory depends on set theory with a set of operators (the relational algebra). Implementations usually involve a storage structure for those sets with conceptual, logical and physical levels of representation which must be implemented according to certain rules (through a process known as normalisation) in order for the operators (algebra) to work correctly. Each set is represented as a table.
Each row of the table (instance within the set or tuple) is identified uniquely by the value of one attribute designated the primary key. Other attributes may be designated as foreign keys if the same attribute occurs as primary key in a different table with the same meaning. There is a hierarchic relationship between the instance with its primary key in the first table and one or more instances with their primary key (equal to the foreign key in the first table) in the different table.
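
To make the primary/foreign key mechanism concrete, here is a minimal sketch in Python using an in-memory SQLite database. It models a base entity and its language-dependent names, the one case where this document says conventional key relationships are sufficient. The column names (cfOrgUnitId, cfAcro, cfLangCode, cfName) and the sample values are assumptions made for this example, not the CERIF specification. Note that the foreign key records only that a link exists; it carries no role and no dates.

```python
import sqlite3


def build_orgunit_names():
    """Create a parent table and a child table joined by a foreign key.

    Schema and data are hypothetical and simplified for illustration.
    """
    con = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript(
        """
        CREATE TABLE cfOrgUnit (
            cfOrgUnitId TEXT PRIMARY KEY,   -- primary key of the base entity
            cfAcro      TEXT
        );
        CREATE TABLE cfOrgUnitName (
            cfOrgUnitId TEXT NOT NULL REFERENCES cfOrgUnit(cfOrgUnitId),  -- foreign key
            cfLangCode  TEXT NOT NULL,
            cfName      TEXT NOT NULL,
            PRIMARY KEY (cfOrgUnitId, cfLangCode)  -- one name per language
        );
        """
    )
    con.execute("INSERT INTO cfOrgUnit VALUES ('ou1', 'EXU')")
    con.executemany(
        "INSERT INTO cfOrgUnitName VALUES (?, ?, ?)",
        [("ou1", "en", "Example University"), ("ou1", "fr", "Institut Exemple")],
    )
    return con


def names_of(con, orgunit_id):
    """Return the (language, name) pairs for one organisational unit."""
    rows = con.execute(
        "SELECT cfLangCode, cfName FROM cfOrgUnitName "
        "WHERE cfOrgUnitId = ? ORDER BY cfLangCode",
        (orgunit_id,),
    )
    return rows.fetchall()
```

With the pragma enabled, inserting a name row whose cfOrgUnitId has no matching row in cfOrgUnit raises sqlite3.IntegrityError. That existence check is all the foreign key provides, in contrast to a CERIF linking relation with its role and temporal duration.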