CERIFImplementationAdvice20120126

CERIF-CRIS Implementation Advice
Keith G Jeffery, Anne Asserson 20120202
Introduction
The CERIF model is becoming widely accepted for CRIS. However, its very flexibility allows
for multiple methods of implementation. While all are potentially valid, experience has
demonstrated that some implementation styles or techniques are better than others along
dimensions of representativity, usability and efficiency.
First we explain how the philosophical concept behind CERIF differs from relational
database implementations. We then continue with how to implement CERIF using as
examples the four base entities Project, Person, Organisational Unit and Result-Publication.
This document is aimed at persons who have some understanding or experience of how to
implement relational databases. It should be read in conjunction with the specification of the
CERIF model on the euroCRIS website (Brigitte Jörg et al).
The Philosophy behind the CERIF Model
Conventional relational database technology supports only hierarchic relationships using the
foreign key as a linking mechanism. This mechanism tells us nothing more about the
linkage than that it exists. This is analogous to a URL in HTML or XLINK in XML.
CERIF goes beyond this. CERIF has entities that are static and relationships (linkages) that
are dynamic. These relationships include a role which is the purpose of the linkage and the
temporal duration of the relationship. A relationship can connect 0, 1 or
many instances of an entity at either or both ends of the link. CERIF thus represents a fully
connected graph structure (many-to-many) as found in the world of research information and
expresses the semantics (meaning) of the linkage through the role.
This rich structure leads to some simple rules:



• No dates in base entities (unless the date is needed for a particular application
purpose, e.g. date of publication is part of a bibliographic reference) because the
temporal information is expressed in relationships;
   o Example to avoid: project start date, event date
• No flags (representing a yes/no condition for state of processing) in base entities
because the temporal (and therefore state) information is expressed in relationships;
   o Example to avoid: proposal received (Y/N) or project state (proposal, funded,
   ongoing, terminated)
• Implement the generic entity with specialisation via roles;
   o Example: implement an entity Person, NOT Author, since Author is a role of a Person
   related to a Research Publication.
©Keith G Jeffery/Anne Asserson/euroCRIS




• Distinguish clearly between classification of a base entity (a characteristic of that entity
independent of any relationships) and classification (role) within a linking relation (a
characteristic of the relationship, not of either entity involved in the relationship);
   o Example: classification of a publication as ‘book’ is a property of the base
   entity; classifying the book as ‘published’ is a property of the relationship
   between the book and the OrgUnit of type publisher.
• In general it is better to classify via linking relations;
   o Example: specific expertise of a person is usually related to a role with respect to a
   particular OrgUnit or Project or service rather than being a native property of
   the person.
• Use the terms in the CERIF classification schemes; if they are not used, provide a crosswalk
to CERIF canonical semantics from the equivalent terms in the particular
implementation of a CERIF-CRIS;
   o Example: if the local CERIF-CRIS describes the country France as ‘France’, provide a
   crosswalk to ‘FR’.
• Do not subclass: in CERIF all entities are first class objects, since different users
have a different view of the relative importance or ‘ownership’ of one entity by
another. The OpenAIRE data model provides an example: the relationship between
Persons and Projects is clearly n:m; however, the ‘participants’ relation does not
implement this general relationship but only a particular subclassed role,
‘coordinating person’.
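The rules on dates, flags and roles can be sketched in a few lines. This is a minimal illustration and not the normative CERIF schema. It assumes SQLite through Python's sqlite3 module. The column names (cfStartDate, cfEndDate, a plain-text role) are simplifications chosen for this example, and the identifiers pers-1, pers-2 and proj-1 are invented. The base entities carry no dates or state flags, and 'project leader' is a role on the linking relation with its own time span.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base entities: identification only, no dates and no state flags.
    CREATE TABLE cfPers (cfPersId TEXT PRIMARY KEY);
    CREATE TABLE cfProj (cfProjId TEXT PRIMARY KEY, cfAcro TEXT);

    -- Linking relation: the role and its temporal duration live here.
    CREATE TABLE cfProj_Pers (
        cfProjId    TEXT NOT NULL REFERENCES cfProj(cfProjId),
        cfPersId    TEXT NOT NULL REFERENCES cfPers(cfPersId),
        role        TEXT NOT NULL,  -- simplification: plain text, not a cfClass reference
        cfStartDate TEXT NOT NULL,  -- ISO 8601 text, so string comparison orders dates
        cfEndDate   TEXT NOT NULL,  -- treated as exclusive in this sketch
        PRIMARY KEY (cfProjId, cfPersId, role, cfStartDate)
    );
""")
conn.executemany("INSERT INTO cfPers VALUES (?)", [("pers-1",), ("pers-2",)])
conn.execute("INSERT INTO cfProj VALUES ('proj-1', 'DEMO')")
conn.executemany("INSERT INTO cfProj_Pers VALUES (?, ?, ?, ?, ?)", [
    ("proj-1", "pers-1", "project leader", "2010-01-01", "2011-01-01"),
    ("proj-1", "pers-2", "participant",    "2010-01-01", "2011-01-01"),
    ("proj-1", "pers-2", "project leader", "2011-01-01", "2013-01-01"),
])


def leaders_on(proj_id, day):
    """Return the persons holding the 'project leader' role on the given day."""
    rows = conn.execute(
        "SELECT cfPersId FROM cfProj_Pers "
        "WHERE cfProjId = ? AND role = 'project leader' "
        "AND cfStartDate <= ? AND ? < cfEndDate ORDER BY cfPersId",
        (proj_id, day, day),
    )
    return [row[0] for row in rows]
```

A change of project leader is recorded as a second row in cfProj_Pers rather than as an update to cfProj or cfPers, so the history stays queryable.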
Kinds of CERIF Entities
CERIF includes many base entities to represent research information. However four such
entities are found commonly in almost all CRIS: Project, Person, Organisational Unit and
Result-Publication.
The E-R model diagram below (from the CERIF model specification on the euroCRIS
website) describes Project, Person, Organisational Unit (coloured green) with their linking
relations to each other (coloured lilac) and their direct relational links for language-dependent extensions (coloured yellow).
Now let us deal with these three aspects in turn:
Base entities (green):
The base entities have few attributes because the information is found in the linking relations
and the language-dependent entities (lilac and yellow). One or more classifications can be
applied to a base entity.
Linking Relations
Linking relations express a role (purpose) between two base entities and the temporal
duration of that role.
The role is not, however, expressed directly as an attribute in the
linking relation. The role is part of a classification scheme (cfClassScheme) where the term
(cfClass) for the role is given. This allows one stored role for many uses.

Classification Scheme: Project-Person
o Project leader
o Participant
o Contact person
Note that Participant would be a valid role in the linking relation Person-Event and Contact
Person a valid role in Person-Organisation, Person-Facility, Person-Equipment etc.
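The idea of one stored role for many uses can be sketched as follows. This is an illustration under assumptions, not the CERIF schema. The tables are reduced to the columns needed here, base entity tables and date columns are omitted, and the identifiers (scheme-person-roles, class-participant and so on) are invented. Whether a single scheme should serve several linking relations is a design choice assumed for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cfClassScheme (cfClassSchemeId TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE cfClass (
        cfClassId       TEXT PRIMARY KEY,
        cfClassSchemeId TEXT NOT NULL REFERENCES cfClassScheme(cfClassSchemeId),
        term            TEXT NOT NULL
    );
    -- Linking relations hold a reference to the role term, not the term itself.
    CREATE TABLE cfProj_Pers (cfProjId TEXT, cfPersId TEXT,
                              cfClassId TEXT REFERENCES cfClass(cfClassId));
    CREATE TABLE cfPers_Event (cfPersId TEXT, cfEventId TEXT,
                               cfClassId TEXT REFERENCES cfClass(cfClassId));
""")
conn.execute("INSERT INTO cfClassScheme VALUES ('scheme-person-roles', 'Person roles')")
conn.executemany("INSERT INTO cfClass VALUES (?, 'scheme-person-roles', ?)", [
    ("class-leader", "Project leader"),
    ("class-participant", "Participant"),
    ("class-contact", "Contact person"),
])
conn.execute("INSERT INTO cfProj_Pers VALUES ('proj-1', 'pers-1', 'class-leader')")
conn.execute("INSERT INTO cfProj_Pers VALUES ('proj-1', 'pers-2', 'class-participant')")
# The same stored term 'Participant' is reused in a different linking relation.
conn.execute("INSERT INTO cfPers_Event VALUES ('pers-2', 'event-1', 'class-participant')")

project_roles = conn.execute("""
    SELECT pp.cfPersId, c.term
    FROM cfProj_Pers AS pp JOIN cfClass AS c ON c.cfClassId = pp.cfClassId
    WHERE pp.cfProjId = 'proj-1' ORDER BY pp.cfPersId
""").fetchall()
event_roles = conn.execute("""
    SELECT pe.cfPersId, c.term
    FROM cfPers_Event AS pe JOIN cfClass AS c ON c.cfClassId = pe.cfClassId
""").fetchall()
```

Renaming the term in cfClass changes how it is displayed in both linking relations at once, because neither of them stores the text.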
Language-Dependent Entities (yellow)
The language dependent entities contain textual information which may be in more than one
language. Thus they have a one to many relationship with the base entities and are
represented using conventional relational primary/foreign key relationships. They do not
need the expressive power of linking relations since they are only language variants of the
same attribute(s) and if monolingual would be in the base entity as an attribute.
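A language-dependent entity can be sketched as an ordinary child table keyed by the base entity and a language code. The table cfProjTitle and its columns are assumptions made for this example, the titles are invented, and the fallback-language behaviour of the helper is a convenience added here, not something CERIF prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cfProj (cfProjId TEXT PRIMARY KEY, cfAcro TEXT);
    -- One row per (project, language): a conventional one-to-many relationship.
    CREATE TABLE cfProjTitle (
        cfProjId   TEXT NOT NULL REFERENCES cfProj(cfProjId),
        cfLangCode TEXT NOT NULL,
        cfTitle    TEXT NOT NULL,
        PRIMARY KEY (cfProjId, cfLangCode)
    );
""")
conn.execute("INSERT INTO cfProj VALUES ('proj-1', 'DEMO')")
conn.executemany("INSERT INTO cfProjTitle VALUES ('proj-1', ?, ?)", [
    ("en", "A demonstration project"),
    ("fr", "Un projet de démonstration"),
])


def title(proj_id, lang, fallback="en"):
    """Return the title in the requested language, else in the fallback language."""
    row = conn.execute(
        "SELECT cfTitle FROM cfProjTitle "
        "WHERE cfProjId = ? AND cfLangCode IN (?, ?) "
        "ORDER BY cfLangCode = ? DESC LIMIT 1",  # exact language match sorts first
        (proj_id, lang, fallback, lang),
    ).fetchone()
    return row[0] if row else None
```

No role or date columns appear in cfProjTitle, which is the difference from a linking relation.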
In addition to these basic entities there are two more types of entity that are used in CERIF:
Classification Entities (pink)
CERIF does not have attributes within base relations that are classification terms to
characterise the instances of that entity. Instead, an external classification system points to
the particular instance within the entity.
Thus, to classify two instances of a publication respectively as ‘book’ and ‘journal article’ a
Class Scheme (cfClassScheme) for Publications (cfResPubl) is set up within which these
Classification Terms (cfClass) occur. Then a linking relation (cfResPubl_Class) relates the
Classification term (e.g. ‘book’) to the instance in the entity Publications and similarly
‘Journal Article’ to the other instance.
Similarly to classify OrgUnits an appropriate Class Scheme and set of Class terms is erected
so that ‘University’ can be distinguished from ‘SME’. cfOrgUnit_Class provides the link.
We also use this classification technique for the roles within a linking relation. Thus in
cfOrgUnit_ResPubl we may wish to express that the OrgUnit is the publisher of the
publication. The Class Scheme relating to cfOrgUnit_ResPubl would contain ‘publisher of’
but could also contain ‘rightsholder of’ or ‘funder of’.
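The publication example can be sketched as follows, again as a simplified illustration and not the normative schema. The column sets are reduced to the minimum, and the identifiers (scheme-pubtype, class-book, pub-1 and so on) are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cfClassScheme (cfClassSchemeId TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE cfClass (
        cfClassId       TEXT PRIMARY KEY,
        cfClassSchemeId TEXT NOT NULL REFERENCES cfClassScheme(cfClassSchemeId),
        term            TEXT NOT NULL
    );
    CREATE TABLE cfResPubl (cfResPublId TEXT PRIMARY KEY);
    -- The classification points at the instance; cfResPubl has no 'type' column.
    CREATE TABLE cfResPubl_Class (
        cfResPublId TEXT NOT NULL REFERENCES cfResPubl(cfResPublId),
        cfClassId   TEXT NOT NULL REFERENCES cfClass(cfClassId),
        PRIMARY KEY (cfResPublId, cfClassId)
    );
""")
conn.execute("INSERT INTO cfClassScheme VALUES ('scheme-pubtype', 'Publication types')")
conn.executemany("INSERT INTO cfClass VALUES (?, 'scheme-pubtype', ?)",
                 [("class-book", "book"), ("class-article", "journal article")])
conn.executemany("INSERT INTO cfResPubl VALUES (?)", [("pub-1",), ("pub-2",)])
conn.executemany("INSERT INTO cfResPubl_Class VALUES (?, ?)",
                 [("pub-1", "class-book"), ("pub-2", "class-article")])


def publications_of_type(term):
    """Return the ids of publications classified with the given term."""
    rows = conn.execute("""
        SELECT rc.cfResPublId
        FROM cfResPubl_Class AS rc
        JOIN cfClass AS c ON c.cfClassId = rc.cfClassId
        WHERE c.term = ? ORDER BY rc.cfResPublId
    """, (term,))
    return [row[0] for row in rows]
```

A publication can carry several classifications from several schemes simply by adding rows to cfResPubl_Class.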
The following diagram illustrates Base Entities, a language dependent entity, a linking
relation and how classification is used.
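[Diagram: cfClassScheme and cfClass hold the terms (e.g. book | journal article; University | SME; publisher of | rightsholder of). cfResPubl_Class and cfOrgUnit_Class classify the base entities cfResPubl and cfOrgUnit; the linking relation cfOrgUnit_ResPubl takes its role from cfClass; cfOrgUnitName is the language-dependent entity of cfOrgUnit.]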
The Class Scheme / Class construct is the semantic layer of CERIF and allows:
1. All semantic information to be kept structured in one place, for ease of maintenance and
to ensure consistent use of terminology;
2. The same term with the same meaning within one Class Scheme to be used in
multiple base relations or linking relations;
3. The same term with different meanings within different Class Schemes to be used in
multiple base relations or linking relations.
An example of (2) above is ‘is part of’. This may be used in many linking relations between
different base entities. Another example is a given country code e.g. ‘FR’. This may be
used in several base relations.
An example of (3) is ‘computing’. It may refer to a project, to an organisation, to a
publication or product or to a service.
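Cases (2) and (3) can be made concrete by treating a term as identified by its scheme together with its text. The sketch below is plain Python and is not CERIF's storage format. The scheme names, the dictionaries and the local vocabulary in the crosswalk are all invented for illustration.

```python
from typing import NamedTuple


class ClassTerm(NamedTuple):
    """A term is identified by its scheme together with the term itself."""
    scheme: str
    term: str


# Case (2): one stored term, referenced from several relations.
FR = ClassTerm("countries", "FR")
orgunit_country = {"org-1": FR}
proj_country = {"proj-1": FR}

# Case (3): the same word in different schemes is a different term.
computing_subject = ClassTerm("project subjects", "computing")
computing_service = ClassTerm("service types", "computing")

# Crosswalk from a hypothetical local vocabulary to the canonical terms.
CROSSWALK = {"France": FR, "Frankreich": FR}


def to_canonical(local_term):
    """Translate a local term, failing loudly when no crosswalk entry exists."""
    try:
        return CROSSWALK[local_term]
    except KeyError:
        raise KeyError(f"no crosswalk entry for local term {local_term!r}") from None
```

Failing on an unmapped term, rather than passing it through, keeps local vocabulary from leaking into data that is meant to be exchanged.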
Name Variant Entities (grey)
Especially in the case of cfPers it may be necessary to record variants of the name. This is
done by having a special base entity cfPersName connected to cfPers by a linking relation so
that the temporal duration and role (classification) of each name variant instance can be
recorded. See under Person for more detail.
The Major CERIF Base Entities
In addition to the standard green, lilac and yellow colours and meanings we now add a pink
coloured entity for classification of a base entity and a grey coloured entity for names and
name variants.
Person
cfPers
The Person entity includes only identification attributes: Id, BirthDate, Gender and URI. One
might expect attributes for Family Name and Other names; in CERIF these are managed by
a separate entity cfPersName (coloured grey) which is linked to cfPers through
cfPers_PersName, allowing a role and dates to be assigned to each set of names. This
allows a person to have multiple names simultaneously or serially.
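A sketch of simultaneous and serial names follows. For brevity it flattens the name entity and its linking relation into one record, which a real implementation would keep separate. The role terms 'official name' and 'publishing name', the person and the dates are all invented for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PersNameLink:
    """One name link, flattened together with the name it points to."""
    pers_id: str
    first_names: str
    family_names: str
    role: str    # hypothetical classification term
    start: date
    end: date    # exclusive


links = [
    PersNameLink("pers-1", "Maria", "Rossi", "official name",
                 date(1990, 1, 1), date(2005, 6, 1)),
    PersNameLink("pers-1", "Maria", "Rossi-Schmidt", "official name",
                 date(2005, 6, 1), date(9999, 12, 31)),
    PersNameLink("pers-1", "M.", "Rossi", "publishing name",
                 date(1990, 1, 1), date(9999, 12, 31)),
]


def names_on(pers_id, day):
    """Return (role, full name) pairs valid for the person on the given day."""
    return sorted(
        (link.role, f"{link.first_names} {link.family_names}")
        for link in links
        if link.pers_id == pers_id and link.start <= day < link.end
    )
```

The official name changes serially, while the publishing name runs in parallel throughout.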
The research interests of a person and keywords are in separate entities; in this case linked
directly to cfPers because there is no need to record role and dates.
Classifications may be applied to the person – and one or more schemes may be used.
Linking Relations
cfPers has many linking relations. cfPers_Pers allows a recursive relationship to be defined
linking two persons. However, it should be used sparingly. For example, co-authors are best
defined by the cfPers_ResPubl linking relation; co-workers are best defined by the cfPers_OrgUnit linking relation.
The other linking relations provide, with role and dates, the relationship between a person
and the other entities. cfPers_OrgUnit may have a role such as employed; cfPers_ResPubl may
have the role author but also editor or reviewer; cfProj_Pers may have roles such as project
leader or participant.
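To show why co-authorship need not be stored in cfPers_Pers, the sketch below derives co-author pairs from person-publication links alone. The tuples stand in for rows of the person-publication linking relation, with dates omitted, and all identifiers are invented.

```python
from collections import defaultdict
from itertools import combinations

# (person, publication, role) rows of the person-publication linking relation.
pers_respubl = [
    ("pers-1", "pub-1", "author"),
    ("pers-2", "pub-1", "author"),
    ("pers-3", "pub-1", "editor"),
    ("pers-1", "pub-2", "author"),
    ("pers-3", "pub-2", "author"),
]


def coauthor_pairs(links):
    """Return sorted pairs of persons who share the author role on a publication."""
    authors_by_pub = defaultdict(set)
    for pers_id, pub_id, role in links:
        if role == "author":
            authors_by_pub[pub_id].add(pers_id)
    pairs = set()
    for authors in authors_by_pub.values():
        pairs.update(combinations(sorted(authors), 2))
    return sorted(pairs)
```

Because the pairs are computed, they cannot drift out of step with the publication data, which is the risk of recording them separately.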
Project
cfProj
The project entity has Id, Acronym (which is not language dependent) and URI. The
language-dependent and repeating Title, Abstract and Keywords of a project are in separate
entities. They are linked directly to cfProj because there is no need to record role and dates.
Classifications may be applied to the project – and one or more schemes may be used.
Linking Relations
cfProj_Proj provides for recursion; for example one project is a sub-project of another or a
follow-on to a previous project.
cfProj_OrgUnit deserves attention: the OrgUnit involved could be the organisation
performing the project or the organisation funding the project. In each case the role would
be different. Furthermore there may be more than one organisation in either role – CERIF
allows for a fractional value to be recorded in the linking relation.
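The use of roles and fractions in cfProj_OrgUnit can be sketched with plain tuples. The role terms 'performer' and 'funder', the identifiers and the fractions are invented. The idea that fractions within one role should add up to 1 is an assumption of this example, not a stated CERIF rule.

```python
from collections import defaultdict

# (project, organisational unit, role, fraction) rows of cfProj_OrgUnit.
proj_orgunit = [
    ("proj-1", "org-A", "performer", 0.6),
    ("proj-1", "org-B", "performer", 0.4),
    ("proj-1", "org-C", "funder", 0.75),
    ("proj-1", "org-D", "funder", 0.25),
]


def shares_by_role(links, proj_id):
    """Group the fractional shares of one project by role."""
    shares = defaultdict(dict)
    for proj, org, role, fraction in links:
        if proj == proj_id:
            shares[role][org] = fraction
    return dict(shares)


def incomplete_roles(links, proj_id, tolerance=1e-9):
    """Return roles whose fractions do not add up to 1 (assumed convention)."""
    return sorted(
        role for role, by_org in shares_by_role(links, proj_id).items()
        if abs(sum(by_org.values()) - 1.0) > tolerance
    )
```

The same OrgUnit could appear under both roles without any change to the base entity, since the role belongs to the link.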
Organisational Unit
cfOrgUnit
cfOrgUnit has attributes: currency code, headcount and turnover (to give an idea of scale) in
addition to the usual Id, acronym and URI. The language-dependent name, research activity
and keywords are linked directly to cfOrgUnit because there is no need to record role and
date.
Classifications may be applied to the organisational unit – and one or more schemes may be
used.
Linking Relations
cfOrgUnit_OrgUnit allows for the structure of organisations to be represented. A university
may be structured hierarchically with faculties, departments and groups but also may have
research centres or institutes which may belong to more than one department. Some
universities have a schools or college structure orthogonal to the academic structure.
The role within the linking relations defines the dependency structure (is part of). The role of
an organisational unit as e.g. a research institute is defined by the classification (pink
coloured) applied to this organisational unit.
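Because a research centre may belong to more than one department, the 'is part of' links form a graph, not a strict tree. A minimal sketch of walking that graph, with invented unit identifiers and tuples standing in for cfOrgUnit_OrgUnit rows:

```python
from collections import defaultdict

# (child unit, parent unit, role) rows of cfOrgUnit_OrgUnit.
orgunit_orgunit = [
    ("dept-physics", "faculty-science", "is part of"),
    ("dept-chemistry", "faculty-science", "is part of"),
    ("centre-materials", "dept-physics", "is part of"),
    ("centre-materials", "dept-chemistry", "is part of"),
    ("faculty-science", "university", "is part of"),
]


def ancestors(unit, links):
    """Return every unit that the given unit is directly or indirectly part of."""
    parents = defaultdict(set)
    for child, parent, role in links:
        if role == "is part of":
            parents[child].add(parent)
    seen, stack = set(), [unit]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:  # a unit reached by two paths is visited once
                seen.add(parent)
                stack.append(parent)
    return seen
```

A single parent column on cfOrgUnit could not represent centre-materials, which is why the structure is held in the linking relation.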
The other linking relations of OrgUnit may well have more than one OrgUnit related to the
other entity: cfOrgUnit_ResPubl may have two OrgUnits with roles respectively of author’s
institution and publisher. The example of the relationship to project is given above under
Project. CERIF does not have separate entities for Funding Organisation or Research
Institution. Similarly an Organisational Unit could fund or operate a research facility.
Result
CERIF does not have an entity Result but – because of their different attributes and
requirements of processing – distinguishes publications, patents and products.
The three Result entities (coloured orange) and their interlinks (lilac) are illustrated below.
Open question: how should the original publication title be handled? Is a structure for Title like cfPersName needed?
Note: there are dates within cfPatent (acceptable in cfResPubl because they are part of the bibliographic reference).
Result Publication
cfResPubl has, in addition to Id and URI, the attributes required to construct a bibliographic
reference (including publication date), except the Title and Subtitle. These are in separate
entities (like abstract and keywords) to allow multilinguality.
Classifications may be applied to the publication – and one or more schemes may be used.
One may classify the publication by its type: book, book chapter, journal, journal article,
conference proceedings, conference paper, technical report etc. For subject classification
typically UDC (Universal Decimal Classification) or LOC (Library of Congress classification)
may be used although there are specialised classifications for medicine (MeSH) and some
organisations use the same classification scheme as that for research activity such as
Frascati.
Linking Relations:
cfResPubl_ResPubl allows several different kinds of relationship. It can record that a
chapter is part of a book or that a journal article is part of a journal. It can also record that
article B referenced or cited article A. Furthermore it can indicate version relationships
although these are perhaps better recorded by the cfResPubl_OrgUnit relationship (including
dates) where the OrgUnit is the publisher.
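The different uses of cfResPubl_ResPubl differ only in the role on the link. A minimal sketch, in which the role terms 'is part of' and 'cites' and all identifiers are invented for illustration:

```python
# (source publication, target publication, role) rows of cfResPubl_ResPubl.
respubl_respubl = [
    ("chapter-1", "book-1", "is part of"),
    ("chapter-2", "book-1", "is part of"),
    ("article-B", "article-A", "cites"),
]


def related_to(links, target, role):
    """Return the publications linked to the target with the given role."""
    return sorted(src for src, dst, link_role in links
                  if dst == target and link_role == role)
```

For example, related_to(respubl_respubl, "book-1", "is part of") lists the chapters of the book, and the same function with the role 'cites' lists the citing articles.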
Country and Geolocation
Contact Information
Mapping to CERIF
General Concept of the Activity
The euroCRIS approach comprises three activities:
1. Analyse the information requirements of the target CRIS to produce
(a) a data model; (b) a mapping of that model to CERIF, including a definition of the
representation of the target CRIS’ specific vocabularies in the CERIF semantic layer,
to be validated by the euroCRIS core team;
2. Analyse the data model (explicit or implicit) in each non-CERIF source related to the
target CRIS and for each produce: (a) a data model; (b) a mapping of that model to
CERIF, to be validated by the euroCRIS core team;
3. Summarise across the core model from activity 1 and all the data sources of
relevance to the target CRIS from activity 2 the commonality with, and the differences
from, CERIF. For each of the differences (a) analyse whether it requires an extension
to CERIF or not and (b) if so, produce a formal proposal to the CERIF Task Group for
their approval and a final CERIF-compliant data model for the target CRIS.
Implementation of the Methodology
The detailed implementation is as follows:
1. The target CRIS information requirements generally are mapped to CERIF to
produce a first-cut data model and documented information structures which do not
map to CERIF. This involves analysis of the business objects (entities) and their
attributes and – importantly – the role and temporal relationships between them;
2. A similar activity is undertaken for each of the non-CERIF sources to be integrated
within the target CRIS. Here there is considerable effort (a) to understand the syntax
(information structures) of the source since commonly data models do not represent
accurately the real world, forcing structures to be hierarchic rather than fully
connected graphs, using sub-classing inappropriately etc; (b) to understand the
semantics of the information from the source since there is rarely a standardised
vocabulary or even a consistent one within any one source;
3. The summarisation activity is crucial; it brings together all the inherent problems in
the mappings categorised as syntactic and semantic and produces resolutions. The
resolution steps are:
(1) Taking the problematic information structure and semantics, attempt to map them to
CERIF using (a) base entities, (b) plus language-dependent entities, (c) plus
classification using existing schemes and terms, (d) plus linking relations with
defined roles;
(2) If any information structures are not resolved by (1), propose an appropriate
information structure and semantics following the CERIF philosophy in the steps:
(a) add additional base entity linked directly to an existing base entity (to
accommodate additional attributes); (b) add additional language-dependent entity
to extend the scope of the base entity with textual terms that may be multilingual;
(c) add additional terms to existing classification or to a new classification
scheme; (d) add additional base entity with linking relation including role and
temporal semantics.
(3) Reconsider the proposed extensions to CERIF: it may be that at a ‘second
glance’ the requirements can be mapped directly to CERIF,
especially after the analysis in the steps of (2) is done;
(4) Propose changes to the CERIF Task Group where a larger and more diverse
group of experts can consider the proposals with a reconsideration like (3);
(5) If the proposed changes are approved, implement in the CERIF data model
either as central approved changes or as supplementary additions for this
particular (Target CRIS) purpose depending on the breadth of applicability of the
changes.
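As a small illustration of resolution step (1), the sketch below splits one flat record from a hypothetical non-CERIF source into base entities and a linking relation, and sets aside whatever does not map so that it can be considered under step (2). The source field names, the target key names and the sample values are all invented for this example.

```python
# Hypothetical flat row from a non-CERIF source system.
source_row = {
    "project_id": "P-42",
    "acronym": "DEMO",
    "leader_id": "E-7",
    "leader_from": "2010-01-01",
    "leader_to": "2012-12-31",
    "lab_room": "B-114",
}

MAPPED_FIELDS = {"project_id", "acronym", "leader_id", "leader_from", "leader_to"}


def map_row(row):
    """Split one flat row into base entities, a linking relation and leftovers."""
    cf_proj = {"cfProjId": row["project_id"], "cfAcro": row["acronym"]}
    cf_pers = {"cfPersId": row["leader_id"]}
    # The role and its dates move out of the record and into the link.
    cf_proj_pers = {
        "cfProjId": row["project_id"],
        "cfPersId": row["leader_id"],
        "role": "project leader",
        "cfStartDate": row["leader_from"],
        "cfEndDate": row["leader_to"],
    }
    # Anything left over is a candidate for resolution step (2) onwards.
    unmapped = {key: value for key, value in row.items() if key not in MAPPED_FIELDS}
    return cf_proj, cf_pers, cf_proj_pers, unmapped


cf_proj, cf_pers, cf_proj_pers, unmapped = map_row(source_row)
```

Keeping the unmapped fields explicit gives the summarisation activity a concrete list to work through instead of silently dropping data.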
ANNEX
Relational theory depends on set theory with a set of operators (the relational algebra).
Implementations usually involve a storage structure for those sets with conceptual, logical
and physical levels of representation which must be implemented according to certain rules
(through a process known as normalisation) in order for the operators (algebra) to work
correctly.
Each set is represented as a table. Each row of the table (instance within the set or tuple) is
identified uniquely by the value of one attribute designated the primary key. Other attributes
may be designated as foreign keys if the same attribute occurs as primary key in a different
table with the same meaning. There is a hierarchic relationship between the instance with
its primary key in the first table and one or more instances with their primary key (equal to the foreign
key in the first table) in the different table.
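A conventional primary/foreign key link can be sketched as follows, assuming SQLite through Python's sqlite3 module. The tables department and employee are invented for illustration and are not CERIF entities. The foreign key records that the link exists and nothing more: no role and no duration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.execute("CREATE TABLE department (dept_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  TEXT PRIMARY KEY,
        dept_id TEXT NOT NULL REFERENCES department(dept_id)  -- foreign key
    )
""")
conn.execute("INSERT INTO department VALUES ('d1', 'Physics')")
conn.execute("INSERT INTO employee VALUES ('e1', 'd1')")

# A row whose foreign key matches no primary key is rejected.
try:
    conn.execute("INSERT INTO employee VALUES ('e2', 'missing')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

employees_in_d1 = [row[0] for row in conn.execute(
    "SELECT emp_id FROM employee WHERE dept_id = 'd1' ORDER BY emp_id")]
```

Each employee row can name exactly one department, so the structure is hierarchic. Expressing several departments, a role or a time span would need the extra table that CERIF calls a linking relation.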