Data Foundation Interest Group Meeting Notes

Gary Berg-Cross kicked off the meeting, with Co-Chair Raphael Ritz participating remotely. This was the first meeting of the group, which was formerly a working group (WG) whose products have been delivered. Gary provided slides on this and on the nature of the WG effort. See https://www.rd-alliance.org/briefing-slides-introducing-new-dft-ig.html

As one of the initial working groups, the WG produced a document describing our vocabulary and a term definition tool called “TeD-T”, which has 100+ terms. Gary provided an overview of term development, showing a graph of terms and relationships from early tool development. Definitions moved from general to more specific based on discussions. There are graphical models of relationships for things like data and digital objects/entities, but more work is needed on these and other items.

The TeD-T tool allows you to view a list of all terms (alphabetical or hierarchical):
• The Processes view includes many practical policy terms
• Most terms are in the Data Organization area
• One can add one's own term, including fields such as name, definition, explanation, example, etc.
• The tool also includes a group discussion page for each term, so we can capture what people think about terms

There were several lessons learned from the work and things to follow up on. We need, for example, virtual sessions in between plenaries to inform discussion. Going forward, the tool and the term development will be useful for capturing information and defining new terms. This will be done in coordination with several RDA groups, including:
• data fabric
• metadata
• practical policy
• and others

Objectives for P5 include:
• leverage the existing work and approach, but improve both
• facilitate community discussion on core concepts

This reflects the fact that the WG completed preliminary work, but the need for common vocabularies in discussing things continues.
One of the goals of this meeting was to assess the community's interest.

Briefings from other Groups

Keith Jeffery briefing on “Metadata Data Foundation and Terminology”
See https://www.rdalliance.org/metadata%0Bdata-group-briefing-data-foundation-and-terminology.html for slides

There are multiple data working groups, but the only difference between metadata and data is the mode of use, or the role to which the data is put. Some data is descriptive of other data. But metadata is not just for data; it is also for users, software services, and computing resources. (See also Mark Gehagan's talk at P5, “Just how open are we, really?”) Metadata (MD) is not just for description and discovery. It is also needed for context. In summary, we need metadata that has:
• formal syntax (structure of metadata)
• declared semantics (terms in ontological structure) (see footnote 1)

Metadata Plan
• Use cases (collecting use cases to improve the template for collecting). The plan is to pull these into a repository and move from a Directory to a Catalog. In a catalog we make the MD machine readable. This requires formalization in processable/logical languages.
• Recommended Metadata Packages for Purposes (canonical package good for discovery, archiving, etc.)

For open data we want to conceptualize MD as relationships, not elements:
• Unique Identifier (for later use including citation)
• Location (URL)
• Description
• Keywords (terms)
• Temporal coordinates
• Geospatial coordinates
• Originator (organisation(s) / person(s))
• Project
• Facility / equipment
• Quality
• Availability (licence, persistence)
• Provenance
• Citations
• Related publications (white or grey)
• Related software
• Schema
• Medium / format

These relationships (provenance, for example) are mapped to processes (discovery, context, detail). Note that we want to support e-research through metadata. Four models (user, processing, data and resource) are used to discuss data from the researcher to the Information and Communication Technology (ICT) environment for research.

Footnote 1: Adding semantics to metadata and definitions, such as in DFT, is a long-term goal, and perhaps the MD and DFT groups can cooperate on this. Some effort to put definitions in an RDF form with links to appropriate ontologies may be tried within DFT. The current TeD-T has some potential here that will be explored.

Request from MD groups. Please:
• complete use case profiles that you come across
• document directory

Question: Are you using existing ontologies for packages?
A. We will collect all standards within 3 months and present at P6. Packages provide: 1. syntactic; 2. semantic: describe elements

Question: How are packages delivered?
Frequency abuse in standards and schemes to develop common understanding for processes

Question: Is this like building a UMLS for an interdisciplinary area?
A. Yes

Reagan Moore presentation on “Working Group Practical Policy” (see https://www.rd-alliance.org/practical-policy-wg-slides-dft.html for slides)

In this PP WG, computer-actionable policies are used to enforce data management and automate administrative tasks. Practical policy means an assertion or assurance that is enforced about a (data) collection. Example properties can be preservation assertions such as authenticity, integrity, chain of custody, and original arrangement, or they can be based on digital collection assertions such as description and arrangement by subject. We have examples of 11 types of policies and an implementation framework for policies. A visualization of our policy components (the Policy-based data management Concept Graph) was developed with Gary Berg-Cross for DFT and PP use.

[Figure: Policy-based data management Concept Graph]

Reagan walked through the diagram, noting areas of community consensus on policy and how assertions are represented.
• Community consensus: must define a purpose; must define the properties they want their policy to have
• Computer Actionable Implementation: each property created in community has a policy created which includes various
• Procedure: created from policy – build procedures by linking together functions

Identifiers are defined by the operations that their resolvers support. For example:
• GUID – unique identifier
• Handle – add location information
• Ticket – add access controls
• Data grid logical name – add arrangement and metadata
• Workflow – add parsing and subset extraction

There is a challenge with identifiers across the large variety of objects, based on someone else's control of what should happen. This can be ephemeral. We associate metadata with the procedures themselves and distinguish, as do Keith and the MD group, several types of metadata:
• Provenance
• Structural
• Description
• Internal features

We do feature-based indexing, which extracts all words from text and all degrees of freedom from a data set. We automate metadata extraction.

Comment:
Q. Do you cover preservation terminology and policy?
A. Yes, about 70 policies, but there is not yet a lot of foundation vocabulary, especially where there are different types of preservation mechanisms.

Adoption of RDA-DFT Terminology and Data Model to the Description and Structuring of Atmospheric Data (Aaron Addison, Rudolf Husar, Cynthia Hudson-Vitale), presented by Cynthia.
See https://www.rd-alliance.org/adoption-rda-dft-terminology-and-datamodel-description-and-structuring-atmospheric-data.html

Overview of DataFed & the Air Quality: This effort involves a collection of collections and includes a data model to encourage interoperability. On the “back end” represented in the diagram we can cite the catalog effectively.
DataFed developed an RDA Data Foundation and Terminology (DFT) Adoption plan
● Map DFT model to DataFed/AQ Com Cat data model
● Assess potential RDA/DFT compliance
● This is an effort to be consistent with (if not compliant with) the DFT model
● Real-world evaluation of outcome

Work started in February and will complete in August. Comments and suggestions from the DFT IG are welcome and encouraged by the IG chair. We have noted some gaps in the DFT model, such as:
• Where does the user fit?
• What is the granularity of PIDs?
• For use we need some best practices, such as what workflow is needed to be DFT compliant.
• How does this work for a system of aggregated datasets, or a data mediator?
• Where does the non-domain user fit into the DFT data model?

Legal Interoperability

Paul Uhlir commented that he needs our method for developing vocabularies so he could apply it to his domain, and perhaps use our tool to define some terms from a legal perspective.

Linking to definitions

There is an issue of linking to a specific definition from our tool. Cyndy Chandler would like to do this as part of her work. Citing the page is not really the granularity you might want, so we will have to think about how to get to what you want. Gary has reached out to the developers (e.g. Thomas Zastrow at Max Planck Institute RZG), who noted:
• It's possible to link to individual wiki pages, for example: http://smw-rda.esc.rzg.mpg.de/index.php/Access
• But of course, such a page may contain more than one description.

There is work to improve the tool products, and the WordCloud (http://smwrda.esc.rzg.mpg.de/index.php/File:Dftwordcloud.png) is one of the "intermediate" results.

Charles Vardeman noted that, interestingly, you can get to the RDF version of a page. For example:
• Human readable:
  o http://smw-rda.esc.rzg.mpg.de/index.php/Digital_Collection
• Machine version:
  o http://smwrda.esc.rzg.mpg.de/index.php/Special:ExportRDF/Digital_Collection

But Semantic MediaWiki is throwing up some errors about the ExportRDF extension.

Discussion of Terminology evolution

It was noted by Keith that you need a simple relationship between concept and role, plus a temporal component, which makes things complicated. Gary replied that the same issue applies to MD, so we are in the same boat and need a common solution to definitions and MD. This is not surprising since definitions are MD, but it is perhaps easier to see.

Next steps were discussed, covering getting input and requirements from:
• Metadata
• Practical Policy and perhaps Data Fabric

As part of tool development we are interested in collecting additional requirements, for example the synonym idea and formal taxonomic structure. Some interest groups and domain groups will push the IG to support term development. An open question for term development is: “Can it be usefully extended to the domain?” Dimitris Koureas (Natural History Museum London, UK) from the IG Biodiversity Data Integration was very much interested in trying to use the tool and the IG for their vocabulary. EnVIVO(sp) ontology for biodiversity; ratifies standards for biodiversity work; mission is to broaden views and perspectives from other models.

Further Discussion:
• Need to have some adopters to demonstrate impact of the activities

Q: What does it mean to be DFT compliant?
A: The goal is not to be a validator of compliance.

Q: How will you handle the overlapping terms?
A. Usually we form distinctions and formalize these.
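
Returning to the "Linking to definitions" item: in the two example URLs quoted from Charles Vardeman, the human-readable and RDF addresses of a term differ only by a `Special:ExportRDF/` path prefix. The following is a minimal sketch of that mapping, not a description of how TeD-T itself works. It assumes the host and `index.php` path pattern seen in the human-readable examples in these notes (the notes spell the host inconsistently, so `BASE_URL` is an assumption to verify), and the function names are hypothetical.

```python
from urllib.parse import quote

# Assumption: base path copied from the human-readable example URLs in
# these notes; verify the host spelling before relying on it.
BASE_URL = "http://smw-rda.esc.rzg.mpg.de/index.php"


def page_title(term: str) -> str:
    """Turn a term label into a wiki page title (spaces become underscores)."""
    return quote(term.strip().replace(" ", "_"), safe="_:/")


def human_url(term: str, base_url: str = BASE_URL) -> str:
    """Hypothetical helper: address of the human-readable term page."""
    return f"{base_url}/{page_title(term)}"


def rdf_url(term: str, base_url: str = BASE_URL) -> str:
    """Hypothetical helper: address of the RDF export of the same page."""
    return f"{base_url}/Special:ExportRDF/{page_title(term)}"


if __name__ == "__main__":
    print(human_url("Digital Collection"))
    print(rdf_url("Digital Collection"))
```

Note that this still addresses a whole page. It does not solve the granularity problem raised above, where one page may hold more than one description.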