Systematic Sharing

Sharing Resources in
CLARIN-NL
Jan Odijk, Arjan van Hessen
LRTS Workshop IJCNLP
Chiang Mai, Thailand,
12 Nov 2011
Overview
• Context
• Documentation
• Visibility
• Referability
• Accessibility
• Long Term Preservation
• Interoperability
• Conclusions
Context
• CLARIN-NL
• National project in the Netherlands
• 2009-2015
• Budget: 9.01 m euro
• Funding by NWO (National Roadmap Large Scale Infrastructures)
• Coordinated by Utrecht University
• 24 partners (universities, royal academy institutes, independent institutes, libraries, etc.)
Context
• Dutch National contribution to the Europe-wide
CLARIN infrastructure
• Prepared by CLARIN preparatory project (2008-2011)
– Also coordinated by Utrecht University
• From Dec 2011 to be coordinated by the
CLARIN-ERIC
– ERIC: a legal entity at the European level
specifically for research infrastructures
CLARIN infrastructure (NL)
• A technical research infrastructure in
which a humanities researcher who
works with language-related resources
– Can find all data relevant for the research
– Can find all tools relevant for the research
– Can apply the tools to the data without any
technical background or ad-hoc adaptations
– Can store data resulting from the research
– Can store tools resulting from the research
via one portal
CLARIN infrastructure (NL)
• This requires systematic sharing of
resources (=data, tools, web services, …)
• Systematic Sharing requires
– Documentation
– Visibility
– Referability
– Accessibility
– Long Term Preservation
– Interoperability
of resources
CLARIN-NL subprojects
• Resource curation projects
– Curate an existing resource
• Demonstrator projects
– Curate an existing tool and supply a
demonstration scenario
• Number of subprojects: 21 (12-14 in 2012)
• Data Curation Service
– Offers the service of curating existing data
• Where curation includes
– Documentation, Visibility, Referability, Accessibility,
Long Term Preservation, Interoperability
CLARIN-NL Centres
• CLARIN infrastructure is virtual and distributed
– CLARIN-Centres work together to implement the infrastructure
– Each stores and makes available a part of the resources
– Some also provide computational facilities
– Centres must meet a list of requirements and be certified by
CLARIN
• Candidate CLARIN Centres in NL
– Institute for Dutch Lexicology (INL)
– Max Planck Institute for Psycholinguistics (MPI)
– Meertens Institute (MI)
– Huygens ING Institute (HI)
– Data Archiving and Networked Services (DANS)
Infrastructure Implementation
• Implementation of basic infrastructure functionality
– setting up authentication and authorization systems
– several registries (e.g. ISOCAT, RELCAT, Metadata Registry)
– various other infrastructure services
• Search Facilities
– In resource descriptions ('metadata')
• Centralized after metadata harvesting
– In the data themselves
• Via federated search
• Using Webservices in Workflow systems
– Cooperation with Flanders
– Based on work done in the STEVIN-programme
– (as a severe test for interoperability)
Documentation
• Is always necessary, so hardly any additional effort
• Partly in natural language
• Partly formalized
– Described under a particular formally identifiable attribute
– With an explicit type for the value of the attribute
– Possibly with further restrictions on the values (patterns, finite
lists of values, constraints, etc.)
– Represented formally and unambiguously
• Any piece of documentation that can be formalized must
be formalized, and must be put in the resource
description (metadata of the resource)
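
The notion of a formalized piece of documentation (a named attribute, an explicit value type, optional restrictions) can be sketched in a few lines of Python. This is only an illustration: the class `Attribute`, the attribute names and the restrictions below are invented for the example and are not taken from any CLARIN schema.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attribute:
    """One formalized piece of documentation (hypothetical model)."""
    name: str                                   # formally identifiable attribute
    value_type: type                            # explicit type for the value
    allowed: Optional[Sequence[object]] = None  # finite list of values, if any
    pattern: Optional[str] = None               # pattern restriction, if any

    def validate(self, value: object) -> bool:
        """Return True if the value has the right type and meets every restriction."""
        if not isinstance(value, self.value_type):
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        if self.pattern is not None and not re.fullmatch(self.pattern, str(value)):
            return False
        return True


# Invented attributes for illustration only.
LANGUAGE = Attribute("languageCode", str, pattern=r"[a-z]{3}")
MODALITY = Attribute("modality", str, allowed=("spoken", "written"))
SIZE = Attribute("sizeInTokens", int)

schema = {attribute.name: attribute for attribute in (LANGUAGE, MODALITY, SIZE)}
description = {"languageCode": "nld", "modality": "spoken", "sizeInTokens": 1000}

# A resource description is acceptable when every value passes its attribute's checks.
is_valid = all(schema[key].validate(value) for key, value in description.items())
```

Because type and restrictions are explicit, a value such as "Dutch" for `languageCode` can be rejected automatically instead of being discovered by a human reader.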
Documentation
• Resource Descriptions
– Component-based MetaData Infrastructure (CMDI)
– One can define resource profiles as collections of components
(which can contain components).
– Many generally useable components are available
– Resource profiles for most common resources are available
– Component-based, hence flexible
– Danger of flexibility: diversity, no interoperability
– Controlled by semantic interoperability (see below)
– Not yet available but needed: profile(s) for tools
• Supported by tools
– Component and profile editors
– Component and profile registries
– Metadata editor
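
The idea that a profile is a collection of components, which may themselves contain components, can be pictured with a small recursive data structure. The sketch below is not the CMDI format itself; the class `Component` and all component and element names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A reusable metadata component; it may contain further components."""
    name: str
    elements: List[str] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def all_elements(self, prefix: str = "") -> List[str]:
        """List every element as a slash-separated path, own elements first."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        paths = [f"{path}/{element}" for element in self.elements]
        for child in self.components:
            paths.extend(child.all_elements(path))
        return paths


# Hypothetical generally usable components.
contact = Component("Contact", elements=["name", "email"])
licence = Component("Licence", elements=["licenceName"])
access = Component("Access", elements=["availability"], components=[contact, licence])

# A hypothetical profile for a text corpus, built from the components above.
corpus_profile = Component(
    "TextCorpusProfile", elements=["title", "language"], components=[access]
)

paths = corpus_profile.all_elements()
```

Reusing `contact` or `licence` in another profile costs nothing, which is the flexibility the slide refers to; the same freedom is also what lets two profiles drift apart.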
Visibility
• Each resource and its resource description must be
stored at a CLARIN-centre
• CLARIN-centres make resource descriptions available
for metadata harvesting (using OAI-PMH)
• Via harvesting the metadata, the metadata become
available in the CLARIN resource catalogue
– browsing via the Virtual Language Observatory (VLO) using
faceted browsing
– Search via a search interface (under development)
• In the metadata and in the data
• String search and structured search
• Results if desired collected in a Virtual Collection
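
Harvesting with OAI-PMH comes down to sending a request with a `verb` and a `metadataPrefix` to a centre's endpoint and reading the XML that comes back. The sketch below builds such a request URL and extracts record identifiers from a response. The endpoint `centre.example.org`, the prefix value `cmdi` and the sample response are assumptions made for the example; no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml: str) -> list:
    """Return the identifier found in the header of every harvested record."""
    root = ET.fromstring(response_xml)
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:record/oai:header", OAI_NS)
    ]


# Invented, heavily shortened response; a real one also carries the metadata itself.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:centre.example.org:corpus-1</identifier></header></record>
    <record><header><identifier>oai:centre.example.org:lexicon-7</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://centre.example.org/oai", "cmdi")
identifiers = record_identifiers(SAMPLE_RESPONSE)
```

A central catalogue would repeat this for every centre and index the harvested descriptions, which is what makes centralized metadata search possible.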
Referability
• Reference by name or title is not sufficient
– All the problems that natural language poses for communication:
• Not always unique (ambiguity)
• Language-specific: Corpus Gesproken Nederlands
– Variants in other languages: Spoken Dutch Corpus
– Limited knowledge of the foreign language leads to variants: Corpus Spoken Dutch, Dutch Spoken Corpus
• Long, too redundant
– Leads to abbreviations/acronyms: CGN
• Invites errors
– Spoken Dutch Cropus, Spken Dutch Corpus
• URLs
– Still too long/redundant (unless one uses shortened URLs)
– Unstable, volatile
• Persistent Identifiers (PIDs) are needed
Referability
• PIDs
• Each CLARIN-Centre
– must assign a PID to each resource (and/or to
subresources)
– must keep the PID resolution registry up-to-date
• PID systems
– Handle (preferred)
– URN
– Perhaps others (e.g. DOI)
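
Why a PID is more stable than a URL can be shown with a toy resolution registry: the identifier never changes, only the location it resolves to. The class `PidRegistry`, the prefix `example-prefix` and the URLs are hypothetical; the prefix/suffix shape merely imitates Handle-style identifiers and the sketch does not talk to any real PID service.

```python
class PidRegistry:
    """Toy PID resolution registry kept by one centre (illustrative only)."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix      # hypothetical prefix identifying the centre
        self._locations = {}      # PID -> current URL of the resource
        self._next_suffix = 1

    def assign(self, url: str) -> str:
        """Assign a new PID to a resource stored at the given URL."""
        pid = f"{self.prefix}/{self._next_suffix:06d}"
        self._next_suffix += 1
        self._locations[pid] = url
        return pid

    def update(self, pid: str, url: str) -> None:
        """Record that the resource moved; the PID itself stays the same."""
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = url

    def resolve(self, pid: str) -> str:
        """Return the current location for a PID."""
        return self._locations[pid]


registry = PidRegistry("example-prefix")
pid = registry.assign("https://old.example.org/cgn")

# The resource moves to a new server: citations that use the PID keep working.
registry.update(pid, "https://new.example.org/corpora/cgn")
current_location = registry.resolve(pid)
```

Keeping the registry up to date is exactly the `update` step: if a centre skips it, the PID degrades into one more broken link.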
Accessibility
• CLARIN infrastructure
– Accessible at any time and from any place
• IPR
– CLARIN-NL promotes maximal open access of resources
– is working on plans to implement policies and functionality to
properly handle IPR and ethical restrictions
• Researchers’ Mindset
– Many researchers in the humanities are hesitant or even
unwilling to share their resources with others
– How to resolve this? With a carrot and a stick
• CLARIN must accommodate reasonable wishes
• CLARIN must prove benefits for researchers who put their resources there
• Funding agencies must oblige researchers to do so (this is partially the case already)
Long Term Preservation
• Necessary to make sure the resources can be shared
with future researchers (who may be the producer!)
• Each CLARIN-Centre is obliged to ensure long term
preservation
• Usually outsourced to specialized centres
– MI outsources to DANS
– MPI outsources to internal Max Planck Gesellschaft organisation
Interoperability
• Interoperability of resources is the ability of resources to
seamlessly work together
– No manual ad-hoc adaptations
– Adaptations occur automatically 'behind the scenes'
• Need for interoperability is high
– Humanities researchers do not have the required technical background
• Interoperability
– Syntactic interoperability and Semantic interoperability
• Each subproject must try to achieve interoperability
– Report any problems and make suggestions for adaptations
– So that the resources are adapted to the infrastructure (in some
cases) and vice-versa (in other cases)
• Not easy, but the only way to get further is to actually try
this and learn from it.
Syntactic Interoperability
• the formats of data are selected from a limited set of (de
facto) standards or best practices supported by CLARIN
• software tools and applications take input and yield
output in these formats
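
With a limited set of formats, checking whether tools can be chained without manual adaptation becomes a mechanical test. The sketch below assumes an invented format list and invented tools; `SUPPORTED_FORMATS`, `Tool` and `first_mismatch` are hypothetical names, not part of any CLARIN service.

```python
from dataclasses import dataclass
from typing import List, Optional

# Invented stand-in for the limited set of formats supported by the infrastructure.
SUPPORTED_FORMATS = {"plain-text", "tokenized-xml", "annotated-xml"}


@dataclass(frozen=True)
class Tool:
    """A tool that declares the format it reads and the format it writes."""
    name: str
    input_format: str
    output_format: str


def first_mismatch(data_format: str, chain: List[Tool]) -> Optional[str]:
    """Return a description of the first format problem in the chain, or None."""
    current = data_format
    for tool in chain:
        for fmt in (tool.input_format, tool.output_format):
            if fmt not in SUPPORTED_FORMATS:
                return f"{tool.name}: unsupported format {fmt}"
        if tool.input_format != current:
            return f"{tool.name}: expects {tool.input_format}, got {current}"
        current = tool.output_format
    return None


tokenizer = Tool("tokenizer", "plain-text", "tokenized-xml")
tagger = Tool("tagger", "tokenized-xml", "annotated-xml")

ok = first_mismatch("plain-text", [tokenizer, tagger])   # None: the chain fits
problem = first_mismatch("plain-text", [tagger])         # the tagger cannot read plain text
```

A workflow system can run this kind of check before execution, so that a researcher is never asked to convert files by hand.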
Semantic Interoperability
• Focus on the semantics of Data Categories (DCs)
• a privileged data category registry (DCR) is set up containing DCs:
– unique persistent identifiers for DCs (PIDs),
– their semantics,
– a definition,
– examples,
– lexicalizations in various languages.
• Each resource-specific DC is mapped to a DC from the
privileged DCR.
• every researcher can use his/her own DCs
• different DCs from different resources can be
interpreted as identical in meaning, via the DC of the
privileged DCR
• In CLARIN-NL multiple (complementary) privileged
DCRs are allowed. The primary is ISOCAT
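
The mapping mechanism can be illustrated as follows: every resource keeps its own labels, and two labels count as identical in meaning when they map to the same DC in the privileged registry. All identifiers, labels and resource names in this sketch are invented; real ISOCAT PIDs look different.

```python
from typing import Optional

# Invented excerpt of a privileged DCR: PID -> short description.
PRIVILEGED_DCR = {
    "DC-0001": "noun",
    "DC-0002": "verb",
}

# Each resource maps its own DCs to PIDs in the privileged DCR.
RESOURCE_MAPPINGS = {
    "corpus_A": {"N": "DC-0001", "WW": "DC-0002"},
    "lexicon_B": {"noun": "DC-0001", "V": "DC-0002"},
}


def same_meaning(resource_a: str, dc_a: str, resource_b: str, dc_b: str) -> bool:
    """Two resource-specific DCs mean the same if they share a privileged DC."""
    pid_a = RESOURCE_MAPPINGS[resource_a].get(dc_a)
    pid_b = RESOURCE_MAPPINGS[resource_b].get(dc_b)
    return pid_a is not None and pid_a == pid_b


def translate(resource_from: str, dc: str, resource_to: str) -> Optional[str]:
    """Find the DC in the target resource that maps to the same privileged DC."""
    pid = RESOURCE_MAPPINGS[resource_from].get(dc)
    if pid is None:
        return None
    for local_dc, target_pid in RESOURCE_MAPPINGS[resource_to].items():
        if target_pid == pid:
            return local_dc
    return None


verbs_match = same_meaning("corpus_A", "WW", "lexicon_B", "V")
noun_in_b = translate("corpus_A", "N", "lexicon_B")
```

Neither resource had to rename anything: the privileged DC acts as a pivot, so N resources need N mappings rather than a mapping for every pair.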
Semantic Interoperability
• Achieving semantic interoperability is very hard
– Many DCs are almost identical
(principled/pragmatic/arbitrary reasons)
– Some DCs in ISOCAT are not defined clearly
– There are many similar DCs in ISOCAT
– Relevant DCs are not easy to find in ISOCAT
• Three actions taken
– Held several workshops to discuss problems
– Appointed a coordinator to deal with problems
– Decided to implement RELCAT registry to
specify relations between DCs
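
One way to picture what a registry of relations between DCs could offer: if near-identical DCs are linked explicitly, a tool can treat them as one group. The sketch below is an assumption about how such relations might be used, not a description of RELCAT; the relation names `sameAs` and `subClassOf` and all identifiers are invented.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

# Invented relation triples: (DC, relation, DC).
RELATIONS = [
    ("DC-0001", "sameAs", "DC-0107"),
    ("DC-0107", "sameAs", "DC-0342"),
    ("DC-0002", "subClassOf", "DC-0001"),
]


def equivalents(dc: str, relations: Iterable[Tuple[str, str, str]]) -> Set[str]:
    """Collect every DC reachable from dc through sameAs links, in either direction."""
    neighbours = defaultdict(set)
    for source, relation, target in relations:
        if relation == "sameAs":
            neighbours[source].add(target)
            neighbours[target].add(source)

    seen = {dc}
    stack = [dc]
    while stack:
        current = stack.pop()
        for neighbour in neighbours[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


group = equivalents("DC-0342", RELATIONS)
alone = equivalents("DC-0002", RELATIONS)
```

Here `DC-0342` ends up in one group with `DC-0001` and `DC-0107`, while `DC-0002` stays on its own because `subClassOf` is deliberately not treated as equivalence.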
Conclusions
• CLARIN-NL requires systematic sharing of resources
• Therefore requires researchers to work on
– Documentation
– Visibility
– Referability
– Accessibility
– Long Term Preservation
– Interoperability
of resources
• For certain aspects this is relatively easy but it must be done
• For other aspects this is very hard but it must be done so that we
can learn
• The approach described here may be a model for other countries
working on the CLARIN-infrastructure
• It may be a model for other resource sharing facilities (e.g. META-SHARE)
Thanks for your attention!