Minutes of DFT Group Meeting at P6 draft final


Data Foundation and Terminology Interest Group Meeting

RDA Plenary 6, Paris, France

Friday, September 25, 2015

Minutes/Notes

Notes by Mary Vardigan (U. of Michigan)

Overview

Gary Berg-Cross presented an overview of the Data Foundation and Terminology Interest Group and reviewed the Working Group’s Case Statement. See https://rd-alliance.org/introduction-dft-p6.html for a copy of his slides.

He began by describing the objective of the group, which is to support RDA efforts to encourage data sharing and make it easier. Because ideas about terminology can differ across individuals and communities, misunderstandings can arise that cost resources and time and hinder cooperation. Thus, this group is attempting to clarify the language we use when speaking with each other, especially as part of RDA.

A question was raised regarding whether the group is building a library itself or building in cooperation with others and in a stepwise fashion. The response was that when RDA began, there were some special needs in the area of terminology, so in that sense the group is doing something new itself, but at the same time it wants to reach out to others to collaborate.

An example is the Research Data Canada and CASRAI effort, which is developing a new trans-disciplinary glossary for research data management in parallel. The groups have interacted and cooperated to adopt terms from each other’s vocabularies.

See http://dictionary.casrai.org/Category:Research_Data_Domain

For example, Active Data’s definition in their Dictionary is taken from ours - http://dictionary.casrai.org/Active_data


With respect to the Case Statement, some comments were made by the TAB. Representation and international buy-in are key issues for this group to address. With new members, the IG can now modify and update the Case Statement to be more inclusive.

Gary reported that the group had drafted four related model documents on core work: an overview, an analysis and synthesis, a term snapshot, and a set of use cases. See https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html

• DFT IG: https://rd-alliance.org/groups/data-foundations-and-terminology-ig.html

• DFT IG Case Statement: https://www.rd-alliance.org/group/data-foundations-and-terminology-ig/casestatement/case-statement.html

• TeD-T Term Definition Tool: http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page

The objectives for P6 were to continue the IG discussion, address new requirements coming out of other groups, and focus on facilitating community discussions on core concepts.

TeD-T Tool Summary – Rafael Ritz

Rafael demonstrated the Term Editing Tool (aka TeD-T), which currently includes 184 terms. The tool is available at http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page .

The tool and site are run by the Rechenzentrum Garching (RZG), a joint computing centre of the Max Planck Institute for Plasma Physics and the Max Planck Society.

The goal of the tool is to support a community process leading to agreed-upon definitions. To that end, the group decided to build the tool on Semantic MediaWiki, an extension of MediaWiki, the platform that drives Wikipedia. This platform is considered the working environment and not a publication platform. Prospective contributors need to sign up for an account to work in the tool.

Scope is defined as those terms that are relevant for the RDA IGs and WGs. It was pointed out that some of the terms are so generic that qualifying them is often helpful.

In the second phase, as the group moves on, it is broadening the scope to allow all sorts of discussions. Domain communities have asked whether they can work with the vocabulary, and the IGs and WGs in RDA should have access to this resource in thinking about and clarifying their terms. The group is flexible about adding new terms and wants to support the groups to the extent that they themselves are interested in terminology.

The tool currently lacks a good capability to export in a satisfactory format. It was put to the audience that the group would be eager to collaborate with anyone who has the knowledge and skills (PHP) to enhance the output of Semantic MediaWiki.

Thomas Zastrow (RZG) provided additional details about the tool. All content in the tool is automatically versioned and each term has its own URL. The license is CC-BY, but it is possible that CC0 might be a better choice.

In response to a question about possibly using the Wikipedia platform itself, the answer was that the group decided against using Wikipedia for the vocabulary because of the lack of semantic capabilities (using Semantic forms, for instance).

Also, the group was aiming at a high degree of accessibility in terms of lowering the barrier for people to contribute, and it was thought that the barrier is already high and would be even higher if Wikipedia were used. In addition, there was a desire to reference RDA groups that the terms come from, which was another reason to create a new tool. A drawback of this approach is that visibility is lower than with Wikipedia. This RDA resource is considered more of a discussion that may lead to a result, which could then be distributed to other channels, including Wikipedia.

It was pointed out that the National Library of Finland maintains several ontologies, and the software is available as open source, which is preferred. Also, there is a need to form a proper thesaurus that shows relationships among terms. These are both desirable goals. The plan is to publish the terminology so others can build infrastructure from it. Having a plan and policies for how to grow and sustain the terminology is important.

The group wants to facilitate points of communication. For example, are the Metadata Groups and the Data Fabric Group in RDA communicating effectively or is there confusion on terms? The Metadata Catalog Group is talking about elements representing different metadata types and elements, but these terms are not all represented in the tool yet. (See the briefing for a slide showing the MIG proposed “elements.”)

In the tool, there is a log to see the discussion that has taken place before a decision is made, and the log can be exposed to multiple groups. As an example, terms like “collection” have multiple decisions related to them, which is to be expected. We shouldn’t reinvent things in this community.

The tool supports the functionality to make the terms reflect established sources but the editing of content needs to come from the (RDA) community.


A question was posed regarding whether there should be DOIs for each term, which is the practice of Wikipedia. This will be a good enhancement once things are published and is currently a priority item for tool improvement. (In related discussions the new Vocabulary Services IG may be able to help with this.)

Practical Policy Briefing – Reagan Moore

The Practical Policy Working Group created a policy-based data management concept graph with 130 policies, over 200 operations, and over 300 persistent state descriptions. These should be described by names we all understand. Also, there is a need to add terms related to manipulating items in the environment, that is, pieces of information that are needed to consistently manage data in a distributed system. If we can find others doing the same operations and determine the names being used, we might be able to define these terms, and this would expand the current terminology greatly. The Practical Policy Group has already leveraged the DFT vocabulary at a high level and vice versa.

Science Europe – Peter Doorn

Science Europe is an association of European research funding organizations that have come together to promote their collective interests and to bring together experts on research data management. It is focused on the funding of research data management and infrastructure, copyright, terminology, and trusted environments for access to data.

Science Europe’s Task Group on Taxonomy/Terminology had a project to develop a glossary of terms as an internal task group document, which would serve as a sounding board allowing consistency checks throughout future Task Group activities.

In terms of scope, it is a practical, limited list of common terms related to research data and their definitions, with references to where these definitions are drawn from. The majority of the terms relate to the policy area, more so than those of RDA DFT, which is focused more on technical concepts. The approach taken to develop the list was to extract data terms from relevant texts by browsing existing lists using a snowball method. The group sought to enlist professional knowledge and to interact with other initiatives.

At this point, the group is wondering if it is useful to integrate their work with that of RDA. The lists do overlap, but the Science Europe list is more policy-oriented, while the DFT is more technical and larger. There are currently about 70 terms and 6000 words in the Science Europe vocabulary.

The group is trying to determine how public the Science Europe vocabulary should be and how they should proceed now to maintain and approve the list. An Open Science Stack Exchange subsite was started, but Stack Exchange has high standards for amount of traffic and questions, so the subsite will probably not leave the beta stage. Another option would be to use an existing site like Academia Stack Exchange. A final question is whether the vocabulary should be put somewhere specifically for funders, the target audience.

INDIGO DataCloud

INDIGO is an EU project to develop a prototype cloud platform for scientists. One project task focuses on quality of service for storage and on data management over the data lifecycle. The group has started a discussion group on this at RDA, as they want to define terms for quality of service for storage. They intend to make use of the Practical Policy work already done and to reuse terms that have already been defined.

ISO 5127 Foundations and Vocabulary – Juha Hakala

Juha Hakala of the National Library of Finland spoke about his experience with the ISO TC 46 group on Terminology and Metadata Standards. The TC is currently in the process of standardizing a repository access protocol that covers normal publications but can be applied to data as well.

ISO 5127 Foundations and Vocabulary has its basis in the library world, but recent modifications have brought it closer to electronic resources including data. It is outdated as the current version is from 2001, and it suffers from the fact that ISO standards are supported by volunteers. While the current list has little description of digital resources, it is being modernized. The group has been very active and has made a thorough revision of the standard. The revision process is approaching its final phase and voting for approval of the draft is now under way with a deadline to approve or turn it down by 15 October.

If the outcome is yes, then this new standard will be published in a few months’ time. This version is much larger, with 1800 terms, roughly 600 of which are entirely new. The working group has been using other vocabularies extensively, and one was the RDA vocabulary, with about 40 terms drawn from it. There are 25 terms in the revised standard about data, 14 about metadata, and many that have to do with data operations. This could be a good starting point for further RDA work. The DFT group should be aware of this so they are not reinventing the wheel. Normally ISO standards are revised every five years, but five years for data management is far too long. The group wants to make maintenance of the standard a continuous activity.

The entire vocabulary will be made available in SKOS format for free as this is the only way to guarantee that it will be widely used. The ISO vocabulary definitions are shorter than those in the DFT vocabulary but could be supplemented for RDA purposes.


Discussion

The Finnish National Library has a national ontology of 23,000 terms that are machine-understandable and organized as a tree. If the RDA vocabulary gets bigger, we may want to provide it with SKOS relations in a thesaurus-like view, which is difficult because Semantic MediaWiki is not good for ontology building. Protégé would be a better tool for this.
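As a minimal sketch of what a thesaurus-like view with SKOS relations could look like, the Python below serializes a small term list as SKOS concepts in Turtle and looks up narrower terms. The terms, the `broader` links, and the `http://example.org/dft/` namespace are hypothetical placeholders and are not taken from the TeD-T tool; only the SKOS property names (`skos:Concept`, `skos:prefLabel`, `skos:broader`) come from the SKOS vocabulary itself.

```python
# Hypothetical terms and relations, for illustration only (not from TeD-T).
TERMS = {
    "metadata": {"label": "metadata", "broader": None},
    "descriptive-metadata": {"label": "descriptive metadata", "broader": "metadata"},
    "provenance-metadata": {"label": "provenance metadata", "broader": "metadata"},
}

BASE = "http://example.org/dft/"  # placeholder namespace (an assumption)


def to_skos_turtle(terms, base=BASE):
    """Serialize a term dictionary as SKOS concepts in Turtle syntax."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix dft: <{base}> .",
        "",
    ]
    for term_id, info in sorted(terms.items()):
        props = ["a skos:Concept", f'skos:prefLabel "{info["label"]}"@en']
        if info["broader"]:
            # Link the concept to its parent in the hierarchy.
            props.append(f"skos:broader dft:{info['broader']}")
        lines.append(f"dft:{term_id} " + " ;\n    ".join(props) + " .")
    return "\n".join(lines)


def narrower(terms, parent_id):
    """Return the sorted ids of terms whose broader term is parent_id."""
    return sorted(t for t, info in terms.items() if info["broader"] == parent_id)


print(to_skos_turtle(TERMS))
```

Calling `narrower(TERMS, "metadata")` returns the two child terms, which is the kind of relationship view a flat term list does not provide on its own.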

This discussion raises the provenance issue. A lot of work has been done with others defining terms, and we want to keep the provenance of where terms came from. The way to keep this manageable is to adopt specifications like PROV that let you track provenance.
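A rough sketch of how term provenance might be recorded follows, using the Active Data example mentioned earlier in these notes (the CASRAI Dictionary definition being taken from the DFT one). The `TermRecord` class and `lineage` function are hypothetical and are not part of any existing tool; the field names only loosely mirror the PROV relations `wasAttributedTo` and `wasDerivedFrom`, and definition texts are deliberately left as placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TermRecord:
    """One definition of a term, with where it came from (hypothetical model)."""
    term: str
    definition: str
    attributed_to: str                           # cf. prov:wasAttributedTo
    derived_from: Optional["TermRecord"] = None  # cf. prov:wasDerivedFrom


def lineage(record: Optional[TermRecord]) -> List[str]:
    """Walk the derivation chain and list the sources, newest first."""
    chain = []
    while record is not None:
        chain.append(record.attributed_to)
        record = record.derived_from
    return chain


# Definition texts are placeholders; only the derivation link is from the notes.
dft_active = TermRecord("active data", "(definition omitted)", "RDA DFT")
casrai_active = TermRecord(
    "active data", "(definition omitted)", "CASRAI Dictionary",
    derived_from=dft_active,
)

print(lineage(casrai_active))
```

Keeping the link on each record means the origin of a definition can still be traced after terms have been exchanged between vocabularies.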

The DFT IG wants to continue its liaison work related to other RDA IGs and WGs and to continue to solicit ideas for additional use cases and candidate vocabulary items.

It has already worked extensively with the Practical Policy WG and now will be working with the group looking at defining terms describing the expected quality of service and data lifecycle of storage infrastructure (INDIGO DataCloud) and with Science Europe.

Gary closed with a comment that there has been a notable lack of agreement on what a dataset is. We need an agreed-upon definition!
