The Philosophy Behind CERIF
Keith G Jeffery 20120126
Introduction
Once people understand the philosophy behind CERIF, they are commonly able to make full
use of its power and flexibility. The difficulty lies in acquiring the mindset that appreciates the
CERIF model: why it is as it is and how it is put together. This short paper attempts to address
this issue.
Dublin Core
Let us start with a simple metadata model, related to CERIF since both can be used in the
domain of research information. DC describes resources (such as webpages, publications,
datasets) and consists of elements expressed as an XML schema with values for those
elements in instance records. The elements are:
Identifier, Title, Description, Subject, Language, Format, Type, Date, Coverage, Creator,
Contributor, Publisher, Rights, Source, Relation
The first problem is referential integrity (and specifically functional referential integrity). In
theory, the value of any attribute must be dependent exclusively on the primary key
(identifier). In DC this is clearly not the case; the Creator, Contributor and Publisher exist in
their own right and just happen to be related to this identifier (and hence title, description,
subject) because of the ROLE they are playing. In fact DC would be better represented as
follows:
1. Identifier, Title, Description, Subject, Language, Format, Type, Date
2. Creator, Contributor, Publisher
3. Rights
4. Source
5. Relation
Each of the elements in sets (2) to (5) is then linked to set (1).
(1) Describes the object of interest and how it is represented. Note there is a potential
problem with multiple representations or languages (how to describe a French-English dictionary);
(2) Describes persons or organisations that have a responsibility of some kind with
respect to the object of interest in creating it, contributing to it or publishing it;
(3) Describes the right to usage. Do the rights relate to creator, contributor or publisher?
Do the rights relate to reader / user? Typically the same rights will apply to many
instances of the object of interest e.g. all scholarly publications in a particular journal
so here it needs separating to avoid repetition in each record;
(4) Describes the source from which material came for the object of interest; again there
may be many sources for one instance of the object of interest and they, themselves
may have DC metadata records;
(5) Relation is a very vague element and describes some other object related to the
object of interest – again there may be many.
This simple example illustrates why a ‘flat’ metadata structure does not work well for
research information since the research information space in the real world consists of
complex entities (objects) interlinked in complex ways in space and time.
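The contrast can be made concrete with a minimal Python sketch. All identifiers, titles and organisation names here are invented for illustration and come from neither DC nor CERIF. In the flat form the publisher is a string copied into every record; in the linked form the organisation exists once and is attached to each object by a role.

```python
# Flat, DC-style records: the publisher is a plain string repeated per record.
flat_records = [
    {"identifier": "pub-1", "title": "Paper One", "publisher": "Example Press"},
    {"identifier": "pub-2", "title": "Paper Two", "publisher": "Example Press"},
]

# Linked form: the organisation exists once, in its own right, with its own ID.
organisations = {"org-1": {"name": "Example Press"}}
objects = {
    "pub-1": {"title": "Paper One"},
    "pub-2": {"title": "Paper Two"},
}
# Each link states which organisation plays which ROLE for which object.
links = [
    ("org-1", "pub-1", "publisher"),
    ("org-1", "pub-2", "publisher"),
]


def publishers_of(object_id):
    """Return the names of organisations linked to the object as publisher."""
    return [
        organisations[org_id]["name"]
        for org_id, obj_id, role in links
        if obj_id == object_id and role == "publisher"
    ]


# Renaming the publisher is a single update in the linked form ...
organisations["org-1"]["name"] = "Example University Press"
# ... but the flat records still carry the old string until each one is edited.
stale = [r["identifier"] for r in flat_records if r["publisher"] == "Example Press"]
```

After the rename, every linked object sees the new name, while both flat records remain stale.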
©Keith G Jeffery/euroCRIS
[Figure: DC restructured as linked entities. Entities: Object of Interest (ID, Title, Description,
Subject, Language, Format, Type, Date); Person or Organisation (<missing attributes in DC>,
<ID required>); Rights; Other Object (<missing attributes in DC>, <ID required>, <presumably
same attributes as DC including ID>). Link labels: creator, contributor, publisher, claimed by,
source, relation, related to.]
The key problems with ‘flat’ metadata are:
1. It violates functional dependency rules (and thus the database lacks integrity);
2. It does not handle well repeating values (or groups of values);
3. DC utilises elements such as ‘contributor’ which is in fact not a base entity
(something in the real world) but a ROLE of a person or an organisation (which are
base entities).
CERIF was designed to avoid these problems and to represent – as faithfully as possible –
the real world of research information. In fact the community concerned with DC is now
moving towards a more representative structure in the semantic web environment using
Linked Open Data and RDF which brings it closer to CERIF.
CERIF
CERIF91 had a structure something like DC with records based on projects, which had
attributes such as project leader (i.e. a person), funder (organisation) etc. Clearly this
violated functional referential dependency and this caused the evolution to CERIF2000.
CERIF2000 is constructed based on the following principles:
1. To represent as faithfully as possible the real world of research information;
2. To have a normalised structure so that repeating (groups of) attributes are handled
correctly;
3. To respect functional referential integrity;
4. To separate base entities (things in the real world such as person and scholarly
publication) from relationships between them (such as author);
5. To store date-time start and end in the relationships which provides the temporal
duration for which the relationship exists;
6. To avoid Yes/No flags as attributes; for example one can deduce that a project is
funded because linked to it is an organisation with the role funder and Date-Time
start, Date-Time end and so an attribute in Project named ‘Funded’ with possible
values ‘yes’ or ‘no’ is not necessary;
7. To have full multilinguality on all text fields and to represent any character set using
Unicode;
8. For attributes in base entities with a restricted set of permissible values (e.g. country)
to store those values in the semantic layer of CERIF (explained later);
9. For roles in relationship entities with a restricted set of permissible values (e.g.
author) to store those values in the semantic layer of CERIF (explained later);
10. To have a semantic layer of CERIF - with permissible attribute values as lexical terms
- which is a full domain ontology (allowing relationships between terms to be
expressed). This is achieved by having Class Schemes (which define a space of
terms usually related to one attribute) and classes which hold the valid terms for that
scheme.
We now take each principle in turn and discuss in some detail.
1. To represent as faithfully as possible the real world of research information;
The real world is complex and certainly not ‘flat’. Universities have faculties and
departments (hierarchic structure) but this is insufficient because a research centre may be
owned by two departments (fully connected graph structure). Thus CERIF provides the
ability to represent a fully connected graph structure which is beyond (but includes) the
hierarchic representation of XML or the ‘flat’ structure of DC.
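A small Python sketch of the difference, with invented unit names: a hierarchy that stores a single parent per unit cannot record shared ownership, whereas a list of (owner, owned) pairs can.

```python
# A hierarchy stores exactly one parent per unit, so a research centre
# owned by two departments has nowhere to record its second owner.
parent_of = {"Dept A": "Faculty X", "Dept B": "Faculty X"}

# A graph stores each relationship as an (owner, owned) pair, so a unit
# may have any number of owners. Unit names are invented for illustration.
part_of = [
    ("Faculty X", "Dept A"),
    ("Faculty X", "Dept B"),
    ("Dept A", "Research Centre"),
    ("Dept B", "Research Centre"),
]


def owners(unit):
    """Return every unit that directly owns the given unit."""
    return sorted(owner for owner, owned in part_of if owned == unit)
```

The pair list still represents the ordinary hierarchy (each department has one owner) while also allowing the shared research centre.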
2. To have a normalised structure so that repeating (groups of) attributes are
handled correctly;
Normalisation separates groups of attributes which can repeat from groups of attributes
which do not. For example, the ID and Title of a scholarly publication do not repeat, whereas
the attributes of an author will if it is a multi-author paper. Hence these attribute sets,
{PublicationID, Title} and {AuthorID, PublicationID, Family Name, First name(s)}, need to be
separated and linked. There are two methods of linking. If the relationship is strictly hierarchic
(i.e. 1:n) then the primary key of the non-repeating attribute set (ID) is taken as a foreign key
attribute in the repeating set of attributes:
{PublicationID, Title}
<123, CERIF Data Structure>
{AuthorID, PublicationID, Family Name, First name(s)}
<ABC, 123, Asserson, Anne>
<DEF, 123, Jeffery, Keith>
<GHI, 123, Joerg, Brigitte>
However, we do not know when the publication was authored, and if one of the authors was
also the illustrator or editor of the publication, how do we represent that given that we
are using the entity ‘author’? The answer of course is to revert to base entities (in this case
‘person’) and indicate authorship (or illustratorship or editorship) within the linking between
the publication and the person.
{PublicationID, Title}
<123, CERIF Data Structure>
{Person-Publication}
<ABC, 123, author>
<DEF, 123, author>
<GHI, 123, author>
<DEF, 123, editor>
<GHI, 123, illustrator>
{PersonID, Family Name, First name(s)}
<ABC, Asserson, Anne>
<DEF, Jeffery, Keith>
<GHI, Joerg, Brigitte>
This also illustrates the non-hierarchic, many-to-many (or n:m) relationship; in this case
persons with IDs DEF and GHI have an n:m relationship with publication with ID 123.
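The person, publication and link records above can be sketched as three relational tables. This uses Python's standard sqlite3 module. The table and column names are assumptions made for this example and are not the actual CERIF schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE publication (publication_id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE person (person_id TEXT PRIMARY KEY, family_name TEXT, first_names TEXT);
    -- Link entity: one row per (person, publication, role).
    CREATE TABLE person_publication (
        person_id      TEXT NOT NULL REFERENCES person(person_id),
        publication_id TEXT NOT NULL REFERENCES publication(publication_id),
        role           TEXT NOT NULL,
        PRIMARY KEY (person_id, publication_id, role)
    );
""")
conn.execute("INSERT INTO publication VALUES ('123', 'CERIF Data Structure')")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    ("ABC", "Asserson", "Anne"),
    ("DEF", "Jeffery", "Keith"),
    ("GHI", "Joerg", "Brigitte"),
])
conn.executemany("INSERT INTO person_publication VALUES (?, ?, ?)", [
    ("ABC", "123", "author"),
    ("DEF", "123", "author"),
    ("GHI", "123", "author"),
    ("DEF", "123", "editor"),
    ("GHI", "123", "illustrator"),
])


def roles_of(person_id, publication_id):
    """Return all roles a person holds with respect to one publication."""
    rows = conn.execute(
        "SELECT role FROM person_publication "
        "WHERE person_id = ? AND publication_id = ? ORDER BY role",
        (person_id, publication_id),
    )
    return [role for (role,) in rows]


def persons_with_role(publication_id, role):
    """Return family names of persons linked to a publication in a given role."""
    rows = conn.execute(
        "SELECT p.family_name FROM person p "
        "JOIN person_publication pp ON pp.person_id = p.person_id "
        "WHERE pp.publication_id = ? AND pp.role = ? ORDER BY p.family_name",
        (publication_id, role),
    )
    return [name for (name,) in rows]
```

Because the role lives in the link row rather than in an ‘author’ entity, the same person can hold several roles for the same publication without any person attribute being duplicated.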
3. To respect functional referential integrity;
Referential integrity requires normalisation to handle repeating groups as described above.
However, full referential integrity also requires functional integrity: all
attributes must be functionally dependent exclusively on the primary key (or Identifier) of the
record. That is, the attributes must describe or elaborate upon the object identified by the
identifier. For a publication the Title depends exclusively on the Identifier, but the author and
publisher do not (they exist independently as persons or organisations that just happen to be
associated with this publication). Their relationship to the publication is mapped as linking
relations as described above.
4. To separate base entities (things in the real world such as person and
scholarly publication) from relationships between them (such as author);
From the above it becomes clear that CERIF separates base entities (real things, commonly
changing little and infrequently) from linking relationships (which have a temporal duration
and may change more frequently).
Linking relationships describe minimally role (e.g.
author) and date-time start and end. They may have other attributes such as percentage or
fraction (e.g. the percentage authorship contribution of an author).
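A linking relationship of this kind could be sketched as a small record type. The field names, the dates, the 0.5 fraction and the use of `None` for an open-ended relationship are all assumptions of this sketch, not CERIF definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Link:
    """A linking relationship between two base entities (hypothetical field names)."""
    from_id: str
    to_id: str
    role: str
    start: datetime
    end: Optional[datetime] = None      # None is used here to mean "still ongoing"
    fraction: Optional[float] = None    # e.g. share of authorship, if recorded

    def active_on(self, when: datetime) -> bool:
        """True if the relationship exists at the given moment."""
        return self.start <= when and (self.end is None or when <= self.end)


# Illustrative values only: person DEF as author of publication 123 during 2011.
link = Link("DEF", "123", "author",
            datetime(2011, 1, 1), datetime(2011, 12, 31), fraction=0.5)
```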
Another advantage of this separation is that base entities may well be loaded with records
from other systems (e.g. Person from HR (Human Resources) database). This is
accomplished more easily with the separation described, and then the linkages of each
person to e.g. publications can be done separately. The linking together of pre-existing data
is analogous to the more recent concept of Linked Open Data in the semantic web domain.
5. To store date-time start and end in the relationships which provides the
temporal duration for which the relationship exists;
Storing temporal duration allows many useful functions to be performed using a CERIF-CRIS.
For example analysis of the links Person-Organisation with their date-time start and
end values allows reconstruction of the employment history of a person. Similarly with
Person-Publication a personal bibliography can be produced. Organisation-Organisation
linkages allow the structure of an organisation to be described and the changes in that
organisation over time as organisational units are created and terminated. Thus one can
compare the structure of a university today with 10 years ago.
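As a rough illustration, the following Python sketch reconstructs an employment history and a point-in-time staff list from Person-Organisation links. The persons, organisations, role name and dates are invented for the example.

```python
from datetime import date

# (person, organisation, role, start, end); all values are invented for illustration.
person_organisation = [
    ("P1", "Univ A", "employee", date(2001, 9, 1), date(2006, 8, 31)),
    ("P2", "Univ A", "employee", date(2003, 1, 1), date(2010, 12, 31)),
    ("P1", "Institute B", "employee", date(2006, 9, 1), date(2012, 1, 26)),
]


def employment_history(person_id):
    """Return (organisation, start, end) for one person, ordered by start date."""
    rows = [r for r in person_organisation
            if r[0] == person_id and r[2] == "employee"]
    rows.sort(key=lambda r: r[3])
    return [(org, start, end) for _, org, _, start, end in rows]


def staff_on(organisation, when):
    """Return the persons employed by an organisation on a given day."""
    return sorted(
        person
        for person, org, role, start, end in person_organisation
        if org == organisation and role == "employee" and start <= when <= end
    )
```

The same filter on start and end dates, applied to Organisation-Organisation links, would give the structure of an organisation as it stood on any chosen date.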
6. To avoid Yes/No flags as attributes; for example one can deduce that a project
is funded because linked to it is an organisation with the role funder and
Date-Time start, Date-Time end and so an attribute in Project named ‘Funded’ with
possible values ‘yes’ or ‘no’ is not necessary;
It is common for database systems to be designed with ‘flags’ indicating Yes/No, True/False
etc. However, the CERIF structure allows such values to be deduced from information in
other parts of the CERIF structure, as illustrated in the principle itself. The advantage is that
a Yes or No entered by a person records only that person’s opinion or belief, whereas if the raw
data is present in the database there is a greater degree of certainty and therefore integrity of the
information. Furthermore, deducing facts from existing facts ‘on the fly’ saves input effort
and expands the value of the existing information.
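A minimal sketch of such a deduction in Python, using invented organisations, projects and dates: no ‘Funded’ attribute is stored, and the answer is derived from the Organisation-Project links.

```python
from datetime import date

# (organisation, project, role, start, end); invented data for illustration.
organisation_project = [
    ("Funder X", "Project 1", "funder", date(2010, 1, 1), date(2012, 12, 31)),
    ("Univ A", "Project 1", "coordinator", date(2010, 1, 1), date(2012, 12, 31)),
    ("Univ A", "Project 2", "coordinator", date(2011, 1, 1), date(2013, 12, 31)),
]


def is_funded(project, when):
    """Deduce 'funded' from the links: some organisation is a funder at that time."""
    return any(
        proj == project and role == "funder" and start <= when <= end
        for _, proj, role, start, end in organisation_project
    )
```

Unlike a stored flag, the derived answer also depends on the date asked about, so it stays correct when the funding period ends.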
7. To have full multilinguality on all text fields and to represent any character set
using Unicode;
CERIF was developed in a multinational and multilingual European environment which
includes countries using alphabets other than Latin (or extended Latin) e.g. Greek or
Cyrillic.
CERIF provides a normalised facility to allow multiple representations
(language, within the Unicode character set) of any text field such as Title, Description.
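One way to picture this is a title store keyed by publication ID and language code, kept apart from the base entity. The sketch below is an assumption about layout, not the CERIF schema, and the French and Greek strings are illustrative translations rather than actual titles.

```python
# Titles keyed by (publication ID, language code); the base publication record
# itself holds no title text. The translations are illustrative only.
publication_title = {
    ("123", "en"): "CERIF Data Structure",
    ("123", "fr"): "Structure de données CERIF",
    ("123", "el"): "Δομή δεδομένων CERIF",
}


def title_in(publication_id, language, fallback="en"):
    """Return the title in the requested language, else in the fallback language."""
    key = (publication_id, language)
    if key in publication_title:
        return publication_title[key]
    return publication_title[(publication_id, fallback)]
```

Adding a further language is then a new row rather than a new attribute on the publication.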
8. For attributes in base entities with a restricted set of permissible values (e.g.
country) to store those values in the semantic layer of CERIF (explained later);
9. For roles in relationship entities with a restricted set of permissible values (e.g.
author) to store those values in the semantic layer of CERIF (explained later);
10. To have a semantic layer of CERIF - with permissible attribute values as lexical
terms - which is a full domain ontology (allowing relationships between terms
to be expressed). This is achieved by having Class Schemes (which define a
space of terms usually related to one attribute) and classes which hold the
valid terms for that scheme.
These three principles are best taken together. They concern the semantic meaning of
lexical terms and how those terms are stored, related and used. Principles 8 and 9
concern integrity of attribute values and ensure that the value of a given attribute is
selected from a list of terms. Typically at the user interface this would be implemented
as a pull-down list of possible values to be used for input. Examples are country code
(FR, DE, IT etc) or possible roles (author, editor, reviewer, illustrator...).
Principle 10 concerns how CERIF manages terms in the semantic layer. By using the
constructs of Class Scheme and Class (with appropriate linking relationships) it is
possible to explain the relationships between terms within and across Class Schemes.
The roles in the relationships allow the usual thesaurus relationships (such as synonym,
antonym, superterm, subterm etc) to be expressed. In fact the CERIF structure allows all the
functionality of a domain ontology.
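The ideas of Class Scheme, Class and relationships between classes can be sketched in a few lines of Python. The scheme names, the ‘contributor’ superterm and the relationship label ‘subterm of’ are invented for this example and are not taken from the CERIF vocabulary.

```python
# Class Schemes define a space of terms; the classes are the valid terms.
# Scheme names and the term hierarchy below are invented for illustration.
class_schemes = {
    "Role": {"contributor", "author", "editor", "illustrator", "reviewer"},
    "Country": {"FR", "DE", "IT"},
}

# Links between classes: ((scheme, term), relationship, (scheme, term)).
class_links = [
    (("Role", "author"), "subterm of", ("Role", "contributor")),
    (("Role", "editor"), "subterm of", ("Role", "contributor")),
    (("Role", "illustrator"), "subterm of", ("Role", "contributor")),
]


def is_valid(scheme, term):
    """Principles 8 and 9: a value is permissible only if its scheme holds it."""
    return term in class_schemes.get(scheme, set())


def subterms(scheme, term):
    """Principle 10: follow 'subterm of' links to find narrower terms."""
    return sorted(
        src_term
        for (src_scheme, src_term), rel, dst in class_links
        if rel == "subterm of" and dst == (scheme, term)
    )
```

A pull-down list at the user interface would simply display the classes of one scheme, while a query for a superterm could be widened to all of its subterms.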