Sharing Resources in CLARIN-NL
Jan Odijk, Arjan van Hessen
LRTS Workshop, IJCNLP
Chiang Mai, Thailand, 12 Nov 2011

Overview
• Context
• Documentation
• Visibility
• Referability
• Accessibility
• Long Term Preservation
• Interoperability
• Conclusions

Context
• CLARIN-NL
• National project in the Netherlands
• 2009-2015
• Budget: 9.01 million euro
• Funding by NWO (National Roadmap Large Scale Infrastructures)
• Coordinated by Utrecht University
• 24 partners (universities, royal academy institutes, independent institutes, libraries, etc.)

Context
• Dutch national contribution to the Europe-wide CLARIN infrastructure
• Prepared by the CLARIN preparatory project (2008-2011)
  – Also coordinated by Utrecht University
• From Dec 2011 to be coordinated by the CLARIN-ERIC
  – ERIC: a legal entity at the European level specifically for research infrastructures

CLARIN infrastructure (NL)
• A technical research infrastructure in which a humanities researcher who works with language-related resources
  – Can find all data relevant for the research
  – Can find all tools relevant for the research
  – Can apply the tools to the data without any technical background or ad-hoc adaptations
  – Can store data resulting from the research
  – Can store tools resulting from the research
  via one portal

CLARIN infrastructure (NL)
• This requires systematic sharing of resources (= data, tools, web services, …)
• Systematic sharing requires
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
  of resources

CLARIN-NL subprojects
• Resource curation projects
  – Curate an existing resource
• Demonstrator projects
  – Curate an existing tool and supply a demonstration scenario
• Number of subprojects: 21 (12-14 in 2012)
• Data Curation Service
  – Offers the service of curating existing data
• Where curation includes
  – Documentation, Visibility, Referability, Accessibility, Long Term Preservation, Interoperability

CLARIN-NL Centres
• CLARIN infrastructure is virtual and distributed
  – CLARIN-Centres work together to implement the infrastructure
  – Each stores and makes available a part of the resources
  – Some also provide computational facilities
  – Centres must meet a list of requirements and be certified by CLARIN
• Candidate CLARIN Centres in NL
  – Institute for Dutch Lexicology (INL)
  – Max Planck Institute for Psycholinguistics (MPI)
  – Meertens Institute (MI)
  – Huygens ING Institute (HI)
  – Data Archiving and Networked Services (DANS)

Infrastructure Implementation
• Implementation of basic infrastructure functionality
  – Setting up authentication and authorization systems
  – Several registries (e.g. ISOCAT, RELCAT, Metadata Registry)
  – Various other infrastructure services
• Search Facilities
  – In resource descriptions ('metadata')
    • Centralized after metadata harvesting
  – In the data themselves
    • Via federated search
• Using Webservices in Workflow systems
  – Cooperation with Flanders
  – Based on work done in the STEVIN-programme
  – (as a severe test for interoperability)

Documentation
• Is always necessary, so hardly any additional effort
• Partly in natural language
• Partly formalized
  – Described under a particular formally identifiable attribute
  – With an explicit type for the value of the attribute
  – Possibly with further restrictions on the values (patterns, finite lists of values, constraints, etc.)
  – Represented formally and unambiguously
• Any piece of documentation that can be formalized must be formalized, and must be put in the resource description (metadata of the resource)

Documentation
• Resource Descriptions
  – Component-based MetaData Infrastructure (CMDI)
  – One can define resource profiles as collections of components (which can themselves contain components).
  – Many generally usable components are available
  – Resource profiles for most common resources are available
  – Component-based flexibility
  – Danger of flexibility: diversity, hence no interoperability
  – Controlled by semantic interoperability (see below)
  – Not yet available but needed: profile(s) for tools
• Supported by tools
  – Component and profile editors
  – Component and profile registries
  – Metadata editor

Visibility
• Each resource and its resource description must be stored at a CLARIN-centre
• CLARIN-centres make resource descriptions available for metadata harvesting (using OAI-PMH)
• Once harvested, the metadata become available in the CLARIN resource catalogue
  – Browsing via the Virtual Language Observatory (VLO) using faceted browsing
  – Search via a search interface (under development)
    • In the metadata and in the data
    • String search and structured search
    • Results, if desired, collected in a Virtual Collection

Referability
• Referring by name or title is not sufficient
  – All the problems that natural language poses for communication:
    • Not always unique (ambiguity)
    • Language-specific: Corpus Gesproken Nederlands
      – Variants in other languages: Spoken Dutch Corpus
      – Limited knowledge of the foreign-language variants: Corpus Spoken Dutch, Dutch Spoken Corpus
    • Long, too redundant
      – Abbreviations/acronyms: CGN
    • Invites errors
      – Spoken Dutch Cropus, Spken Dutch Corpus
• URLs
  – Still too long/redundant (unless one uses shortened URLs)
  – Unstable, volatile
• Persistent Identifiers (PIDs) are needed

Referability
• PIDs
• Each CLARIN-Centre
  – Must assign a PID to each resource (and/or to subresources)
  – Must keep the PID resolution registry up-to-date
• PID systems
  – Handle (preferred)
  – URN
  – Perhaps others (e.g. DOI)

Accessibility
• CLARIN infrastructure
  – Accessible at any time and from any place
• IPR
  – CLARIN-NL promotes maximal open access to resources
  – Is working on plans to implement policies and functionality to properly handle IPR and ethical restrictions
• Researchers' Mindset
  – Many researchers in the humanities are hesitant or even unwilling to share their resources with others
  – How to resolve this? With a carrot and a stick
    • CLARIN must accommodate reasonable wishes
    • CLARIN must prove benefits for researchers who put their resources there
    • Funding agencies must oblige researchers to do so (this is partly the case already)

Long Term Preservation
• Necessary to make sure the resources can be shared with future researchers (who may include the producer!)
• Each CLARIN-Centre is obliged to ensure long term preservation
• Usually outsourced to specialized centres
  – MI outsources to DANS
  – MPI outsources to an internal Max Planck Gesellschaft organisation

Interoperability
• Interoperability of resources is the ability of resources to seamlessly work together
  – No manual ad-hoc adaptations
  – Adaptations occur automatically 'behind the scenes'
• Need for interoperability is high
  – Humanities researchers typically lack the required technical background
• Interoperability
  – Syntactic interoperability and Semantic interoperability
• Each subproject must try to achieve interoperability
  – Report any problems and make suggestions for adaptations
  – So that the resources are adapted to the infrastructure (in some cases) and vice-versa (in other cases)
• Not easy, but the only way to get further is to actually try this and learn from it.
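
Illustration (not part of the CLARIN-NL infrastructure): one way to picture "adaptations occur automatically behind the scenes" is a registry of format converters from which a conversion chain is selected without user involvement. The sketch below assumes such a registry exists; the format names, the converter functions and the names `CONVERTERS`, `find_chain` and `adapt` are all hypothetical placeholders.

```python
from collections import deque

# Hypothetical converter registry: (source format, target format) -> function.
# Format names and converters are placeholders, not CLARIN components.
CONVERTERS = {
    ("format-a", "format-b"): lambda text: text.upper(),
    ("format-b", "format-c"): lambda text: text + "!",
}


def find_chain(source, target):
    """Return the shortest list of conversion steps from source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, steps = queue.popleft()
        if fmt == target:
            return steps
        for (src, dst) in CONVERTERS:
            if src == fmt and dst not in seen:
                seen.add(dst)
                queue.append((dst, steps + [(src, dst)]))
    return None


def adapt(data, source, target):
    """Convert data from source to target format without manual steps by the user."""
    chain = find_chain(source, target)
    if chain is None:
        raise ValueError(f"no conversion path from {source} to {target}")
    for step in chain:
        data = CONVERTERS[step](data)
    return data
```

A researcher would only ever call something like `adapt(data, "format-a", "format-c")`; the intermediate step through `format-b` stays hidden, which is the point of the slide.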

Syntactic Interoperability
• The formats of data are selected from a limited set of (de facto) standards or best practices supported by CLARIN
• Software tools and applications take input and yield output in these formats

Semantic Interoperability
• Focus on the semantics of Data Categories (DCs)
• A privileged data category registry (DCR) is set up containing DCs:
  – unique persistent identifiers for DCs (PIDs),
  – their semantics,
  – a definition,
  – examples,
  – lexicalizations in various languages.
• Each resource-specific DC is mapped to a DC from the privileged DCR.
• Every researcher can use his/her own DCs
• Different DCs from different resources can be interpreted as identical in meaning, via the DC of the privileged DCR
• In CLARIN-NL multiple (complementary) privileged DCRs are allowed. The primary one is ISOCAT

Semantic Interoperability
• Achieving semantic interoperability is very hard
  – Many DCs are almost identical (principled/pragmatic/arbitrary reasons)
  – Some DCs in ISOCAT are not defined clearly
  – There are many similar DCs in ISOCAT
  – Relevant DCs are not easy to find in ISOCAT
• Three actions taken
  – Held several workshops to discuss problems
  – Appointed a coordinator to deal with problems
  – Decided to implement the RELCAT registry to specify relations between DCs

Conclusions
• CLARIN-NL requires systematic sharing of resources
• Therefore requires researchers to work on
  – Documentation
  – Visibility
  – Referability
  – Accessibility
  – Long Term Preservation
  – Interoperability
  of resources
• For certain aspects this is relatively easy but it must be done
• For other aspects this is very hard but it must be done so that we can learn
• The approach described here may be a model for other countries working on the CLARIN-infrastructure
• It may be a model for other resource sharing facilities (e.g. METASHARE)

Thanks for your attention!
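
Backup illustration for the Semantic Interoperability slides: a minimal sketch of how resource-specific DCs can be treated as identical in meaning when they map to the same DC in a privileged DCR. The identifiers (`dcr:noun`, `dcr:verb`), the resource names and the tag labels are invented for illustration; they are not actual ISOCAT PIDs or CLARIN resources.

```python
# Hypothetical identifiers standing in for PIDs of DCs in a privileged DCR.
NOUN = "dcr:noun"
VERB = "dcr:verb"

# Each resource keeps its own DC labels and maps them to the privileged DCR.
RESOURCE_MAPPINGS = {
    "corpus-x": {"N": NOUN, "V": VERB},
    "lexicon-y": {"znw": NOUN, "ww": VERB},
}


def same_meaning(resource_a, dc_a, resource_b, dc_b):
    """True if both resource-specific DCs map to the same DC in the privileged DCR."""
    pid_a = RESOURCE_MAPPINGS[resource_a].get(dc_a)
    pid_b = RESOURCE_MAPPINGS[resource_b].get(dc_b)
    return pid_a is not None and pid_a == pid_b


def translate(dc, source, target):
    """Return the target resource's DC labels that share the source DC's registry entry."""
    pid = RESOURCE_MAPPINGS[source].get(dc)
    if pid is None:
        return []
    return [label for label, p in RESOURCE_MAPPINGS[target].items() if p == pid]
```

Neither resource has to change its own labels: a query phrased with `N` from `corpus-x` can be rewritten for `lexicon-y` via `translate("N", "corpus-x", "lexicon-y")`, with the registry entry acting as the pivot.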