The Philosophy Behind CERIF
Keith G Jeffery
20120126

Introduction

Commonly, once people understand the philosophy behind CERIF, they are able to really use its power and flexibility. The problem is gaining the mindset that appreciates the CERIF model: why it is as it is and how it is put together. This short paper attempts to address this issue.

Dublin Core

Let us start with a simple metadata model, Dublin Core (DC), which is related to CERIF since both can be used in the domain of research information. DC describes resources (such as webpages, publications, datasets) and consists of elements expressed as an XML schema, with values for those elements in instance records. The elements are:

Identifier, Title, Description, Subject, Language, Format, Type, Date, Coverage, Creator, Contributor, Publisher, Rights, Source, Relation

The first problem is referential integrity (and specifically functional referential integrity). In theory, the value of any attribute must be dependent exclusively on the primary key (identifier). In DC this is clearly not the case; the Creator, Contributor and Publisher exist in their own right and just happen to be related to this identifier (and hence title, description, subject) because of the ROLE they are playing. In fact DC would be better represented as follows:

1. Identifier, Title, Description, Subject, Language, Format, Type, Date
2. Creator, Contributor, Publisher
3. Rights
4. Source
5. Relation

with each of the numbered sets of elements linked to set (1).

(1) Describes the object of interest and how it is represented. Note there is a potential problem with multiple representations or languages (how to describe a French-English dictionary);
(2) Describes persons or organisations that have a responsibility of some kind with respect to the object of interest in creating it, contributing to it or publishing it;
(3) Describes the right to usage. Do the rights relate to creator, contributor or publisher? Do the rights relate to reader / user?
Typically the same rights will apply to many instances of the object of interest, e.g. all scholarly publications in a particular journal, so rights need to be separated out to avoid repetition in each record;
(4) Describes the source from which material came for the object of interest; again there may be many sources for one instance of the object of interest and they themselves may have DC metadata records;
(5) Relation is a very vague element and describes some other object related to the object of interest; again there may be many.

This simple example illustrates why a ‘flat’ metadata structure does not work well for research information, since the research information space in the real world consists of complex entities (objects) interlinked in complex ways in space and time.

©Keith G Jeffery/euroCRIS

[Figure: the DC elements regrouped as linked entities. Object of Interest (ID, Title, Description, Subject, Language, Format, Type, Date) is linked to Person or Organisation (<missing attributes in DC>, <ID required>) through the roles creator, contributor and publisher; to Rights (<missing attributes in DC>, <ID required>) through ‘claimed by’; and to Other Object (<presumably same attributes as DC including ID>) through source and relation (‘related to’).]

The key problems with ‘flat’ metadata are:

1. It violates functional dependency rules (and thus the database lacks integrity);
2. It does not handle repeating values (or groups of values) well;
3. DC utilises elements such as ‘contributor’, which is in fact not a base entity (something in the real world) but a ROLE of a person or an organisation (which are base entities).

CERIF was designed to avoid these problems and to represent, as faithfully as possible, the real world of research information. In fact the community concerned with DC is now moving towards a more representative structure in the semantic web environment, using Linked Open Data and RDF, which brings it closer to CERIF.

CERIF

CERIF91 had a structure something like DC, with records based on projects, which had attributes such as project leader (i.e. a person), funder (organisation) etc.
Clearly this violated functional referential integrity, and this caused the evolution to CERIF2000. CERIF2000 is constructed based on the following principles:

1. To represent as faithfully as possible the real world of research information;
2. To have a normalised structure so that repeating (groups of) attributes are handled correctly;
3. To respect functional referential integrity;
4. To separate base entities (things in the real world such as person and scholarly publication) from relationships between them (such as author);
5. To store date-time start and end in the relationships, which provides the temporal duration for which the relationship exists;
6. To avoid Yes/No flags as attributes; for example one can deduce that a project is funded because linked to it is an organisation with the role funder and Date-Time start, Date-Time end, and so an attribute in Project named ‘Funded’ with possible values ‘yes’ or ‘no’ is not necessary;
7. To have full multilinguality on all text fields and to represent any character set using Unicode;
8. For attributes in base entities with a restricted set of permissible values (e.g. country) to store those values in the semantic layer of CERIF (explained later);
9. For roles in relationship entities with a restricted set of permissible values (e.g. author) to store those values in the semantic layer of CERIF (explained later);
10. To have a semantic layer of CERIF - with permissible attribute values as lexical terms - which is a full domain ontology (allowing relationships between terms to be expressed). This is achieved by having Class Schemes (which define a space of terms usually related to one attribute) and Classes which hold the valid terms for that scheme.

We now take each principle in turn and discuss it in some detail.

1. To represent as faithfully as possible the real world of research information;

The real world is complex and certainly not ‘flat’.
Universities have faculties and departments (hierarchic structure) but this is insufficient because a research centre may be owned by two departments (fully connected graph structure). Thus CERIF provides the ability to represent a fully connected graph structure, which goes beyond (but includes) the hierarchic representation of XML or the ‘flat’ structure of DC.

2. To have a normalised structure so that repeating (groups of) attributes are handled correctly;

Normalisation separates groups of attributes which can repeat from groups of attributes which do not. For example, the ID and Title of a scholarly publication do not repeat, whereas the attributes of author will if it is a multi-author paper. Hence the attribute sets {PublicationID, Title} and {AuthorID, PublicationID, Family Name, First name(s)} need to be separated and linked. There are two methods of linking; if the relationship is strictly hierarchic (i.e. 1:n) then the primary key of the non-repeating attribute set (ID) is taken as a foreign key attribute in the repeating set of attributes.

{PublicationID, Title}
<123, CERIF Data Structure>

{AuthorID, PublicationID, Family Name, First name(s)}
<ABC, 123, Asserson, Anne>
<DEF, 123, Jeffery, Keith>
<GHI, 123, Joerg, Brigitte>

However, we do not know when the publication was authored, and if one of the authors was also the illustrator or editor of the publication, how do we represent that given that we are using the entity ‘author’? The answer of course is to revert to base entities (in this case ‘person’) and indicate authorship (or illustratorship or editorship) within the linking between the publication and the person.
{PublicationID, Title}
<123, CERIF Data Structure>

{Person-Publication}
<ABC, 123, author>
<DEF, 123, author>
<GHI, 123, author>
<DEF, 123, editor>
<GHI, 123, illustrator>

{PersonID, Family Name, First name(s)}
<ABC, Asserson, Anne>
<DEF, Jeffery, Keith>
<GHI, Joerg, Brigitte>

This also illustrates the non-hierarchic, many-to-many (or n:m) relationship; in this case the persons with IDs DEF and GHI have an n:m relationship with the publication with ID 123.

3. To respect functional referential integrity;

Referential integrity requires normalisation to handle repeating groups as described above. However, full referential integrity also requires functional integrity, which requires that all attributes be functionally dependent exclusively on the primary key (or Identifier) of the record. That is, the attributes must describe or elaborate upon the object identified by the identifier. For a publication the Title depends exclusively on the Identifier, but the author or publisher does not (they exist independently as persons or organisations that just happen to be associated with this publication). Their relationship to the publication is mapped as linking relations as described above.

4. To separate base entities (things in the real world such as person and scholarly publication) from relationships between them (such as author);

From the above it becomes clear that CERIF separates base entities (real things, commonly changing little and infrequently) from linking relationships (which have a temporal duration and may change more frequently). Linking relationships describe minimally the role (e.g. author) and the date-time start and end. They may have other attributes such as percentage or fraction (e.g. the percentage authorship contribution of an author). Another advantage of this separation is that base entities may well be loaded with records from other systems (e.g. Person from an HR (Human Resources) database).
This is accomplished more easily with the separation described, and then the linkages of each person to e.g. publications can be done separately. The linking together of pre-existing data is analogous to the more recent concept of Linked Open Data in the semantic web domain.

5. To store date-time start and end in the relationships, which provides the temporal duration for which the relationship exists;

Storing temporal duration allows many useful functions to be performed using a CERIF-CRIS. For example, analysis of the Person-Organisation links with their date-time start and end values allows reconstruction of the employment history of a person. Similarly, with Person-Publication a personal bibliography can be produced. Organisation-Organisation linkages allow the structure of an organisation to be described, together with the changes in that organisation over time as organisational units are created and terminated. Thus one can compare the structure of a university today with that of 10 years ago.

6. To avoid Yes/No flags as attributes; for example one can deduce that a project is funded because linked to it is an organisation with the role funder and Date-Time start, Date-Time end, and so an attribute in Project named ‘Funded’ with possible values ‘yes’ or ‘no’ is not necessary;

It is common for database systems to be designed with ‘flags’ indicating Yes/No, True/False etc. However, the CERIF structure allows such values to be deduced from information in other parts of the CERIF structure, as illustrated in the principle itself. The advantage is that a Yes or No entered by a person is that person’s opinion or belief, whereas if the raw data is present in the database there is a greater degree of certainty and therefore integrity of the information. Furthermore, deducing facts from existing facts ‘on the fly’ saves input effort and expands the value of the existing information.

7. To have full multilinguality on all text fields and to represent any character set using Unicode;

CERIF was developed in a multinational and multilingual European environment which includes countries using alphabets other than Latin (or extended Latin), e.g. Greek or Cyrillic. CERIF provides a normalised facility to allow multiple representations (by language, within the Unicode character set) of any text field such as Title or Description.

8. For attributes in base entities with a restricted set of permissible values (e.g. country) to store those values in the semantic layer of CERIF (explained later);
9. For roles in relationship entities with a restricted set of permissible values (e.g. author) to store those values in the semantic layer of CERIF (explained later);
10. To have a semantic layer of CERIF - with permissible attribute values as lexical terms - which is a full domain ontology (allowing relationships between terms to be expressed). This is achieved by having Class Schemes (which define a space of terms usually related to one attribute) and Classes which hold the valid terms for that scheme.

These three principles are best taken together. They concern the semantic meaning of lexical terms and how those terms are stored, related and used.

Principles 8 and 9 concern integrity of attribute values and ensure that the value of a given attribute is selected from a list of terms. Typically at the user interface this would be implemented as a pull-down list of possible values to be used for input. Examples are country code (FR, DE, IT etc.) or possible roles (author, editor, reviewer, illustrator...).

Principle 10 concerns how CERIF manages terms in the semantic layer. By using the constructs of Class Scheme and Class (with appropriate linking relationships) it is possible to explain the relationships between terms within and across Class Schemes.
The roles in the relationships allow the usual thesaurus relationships to be expressed (such as synonym, antonym, superterm, subterm etc.). In fact the CERIF structure allows all the functionality of a domain ontology.
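
To make principles 4, 5, 6 and 9 concrete, the following is a minimal Python sketch; it is an illustration only and not the CERIF data model or any CERIF software. The person and publication records reuse the example given under principle 2. Everything else is an assumption made for the sketch: the names Link, CLASS_SCHEMES, roles_of and is_funded, the project P1 and organisation ORG1, all dates, and the choice to allow an open-ended relationship by leaving the end date empty.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Base entities: attributes depend only on the identifier (principles 3 and 4).
PERSONS = {"ABC": ("Asserson", "Anne"), "DEF": ("Jeffery", "Keith"), "GHI": ("Joerg", "Brigitte")}
PUBLICATIONS = {"123": "CERIF Data Structure"}

# Hypothetical stand-in for the semantic layer: one scheme per role attribute,
# holding the permissible terms (principle 9). Scheme names are invented here.
CLASS_SCHEMES = {
    "person_publication_roles": {"author", "editor", "illustrator"},
    "organisation_project_roles": {"funder"},
}


@dataclass(frozen=True)
class Link:
    """A linking relationship: two identifiers, a role and a duration."""
    source_id: str
    target_id: str
    scheme: str
    role: str
    start: date
    end: Optional[date] = None  # assumption: None means "still running"

    def __post_init__(self):
        # Reject roles that are not terms of the named scheme.
        if self.role not in CLASS_SCHEMES[self.scheme]:
            raise ValueError(f"{self.role!r} is not a term in {self.scheme!r}")

    def active_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def roles_of(links, person_id, publication_id):
    """All roles one person plays for one publication, sorted by name."""
    return sorted(l.role for l in links
                  if l.source_id == person_id and l.target_id == publication_id)


def is_funded(links, project_id, day):
    """Deduce 'funded' from a funder link instead of storing a Yes/No flag."""
    return any(l.role == "funder" and l.target_id == project_id and l.active_on(day)
               for l in links)


# Placeholder dates, invented for the sketch.
D0, D1 = date(2011, 1, 1), date(2012, 12, 31)
PP = "person_publication_roles"
LINKS = [
    Link("ABC", "123", PP, "author", D0),
    Link("DEF", "123", PP, "author", D0),
    Link("GHI", "123", PP, "author", D0),
    Link("DEF", "123", PP, "editor", D0),
    Link("GHI", "123", PP, "illustrator", D0),
    Link("ORG1", "P1", "organisation_project_roles", "funder", D0, D1),
]

if __name__ == "__main__":
    print(roles_of(LINKS, "DEF", "123"))
    print(is_funded(LINKS, "P1", date(2011, 6, 1)))
```

In this sketch a person holding several roles on one publication is simply several link records, and whether the project is funded depends on the date asked about, which a stored Yes/No flag could not express.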