RDA’s Recently Endorsed Outputs
September 16, 2015

Agenda
- Introduction
- Data Foundation and Terminology
- Data Type Registries
- PID Information Types
- Practical Policy
- Questions

Data Foundation and Terminology: Talking the Same Language
Peter Wittenburg, Gary Berg-Cross, Raphael Ritz

Summary of the Problem
- What is the problem?
  - Data organizations (DOrg) and ideas about it are all different
  - We are all speaking different languages, wasting time and misunderstanding each other in any project involving data
  - Different DOrgs make data discovery and integration very time consuming, inefficient and thus expensive
  - Different DOrgs prevent us developing maintainable support software
- Who is impacted?
  - All efforts to integrate data (Federations, BDA projects, etc.)
- What are the ramifications of not having the problem resolved?
  - Combining data of all sorts across different origins (projects, repositories, disciplines, etc.) is a nightmare and requires a lot of curation and transformation before the actual scientific analysis can start

Highlights of Data Foundation and Terminology Working Group
- Structure
  - 60 members
  - Almost all regions
  - Different types of institutions and disciplines
  - Skillsets ranged from relative newcomers up to members with much experience from data intensive projects
- Outputs
  - List of core terms essential to harmonize conceptualization of data organizations
  - Graphical model relating the terms
  - Set of auxiliary documents including many use cases to demonstrate the bottom-up approach and research of the WG
  - Term Tool (using Semantic Media Wiki) to store definitions and allow editing, classification and discussion of terms (which is also open for other groups)

Active Contributors to the Work

Institute/Project   Country/Region   Domain
CNRI                US               IT Research and Systems
U Cardiff           UK               IT Research and Systems
AWI                 DE               Oceanography & Environment
MPG                 DE               Research Organisation
EUDAT               EU               Data Infrastructure
CLARIN              EU               Linguistic Research Infrastructure
EPOS                EU               Earth Observation Res. Infrastructure
ENES                Int              World Climate Res. Infrastructure
ENVRI               EU               Environmental Res. Infrastructure
DataOne             US               Environmental Infrastructure
ESSD/RENCI          US               Earth Science System Data
NCGEN/RENCI         US               Clinical Genomics
Europeana           EU               Humanities Infrastructure
DataCite/EPIC       Int              PID Infrastructures
DICE                US               IT Research and Systems
CAS                 CN               Earth Science Model
ADCIRC/RENCI        US               Ocean and Storm modeling

Impact of Outputs
- The European data infrastructure, EUDAT
  - Federating data from many discipline repositories where each data collection has a different data organization. If integration is not simply done at physical level (file structures), this heterogeneity makes it very costly to integrate all data to enable repurposing and to make it accessible at different repositories.
- The International CLARIN Project
  - According to the Technology Director: Very handy to have a lingua franca when discussing research infrastructure architectures. It was good to be involved as adopting community from the start of the work.
- Similar experiences from international colleagues who work on large scale data integration
- Harmonization greatly reduces integration time

Endorsements/Adopters
- EUDAT, CLARIN and others with dramatic problems in data integration
  - Approach aligned with the progress of the DFT Working Group discussion
  - Their repository setups now adhere to the DFT model, and interaction with different communities is based on it
  - The Digital Object is described by metadata, is associated with a Persistent ID, and has instances stored in trustworthy repositories (see simplified diagram)
  [Diagram labels: persistent ID, digital object, bitstream, repository, metadata]
- Other projects (humanities, health, bioinformatics, neuroinformatics and atmosphere research) adopted these models and the terminology

Endorsements/Adopters

Institute/Project     Country/Region   Domain
CNRI                  US               IT Research and Systems
U Cardiff             UK               IT Research and Systems
MPG                   DE               Research Organisation
EUDAT                 EU               Data Infrastructure
CLARIN                EU               Linguistic Research Infrastructure
EPOS                  EU               Earth Observation Res. Infrastructure
ENES                  Int              World Climate Res. Infrastructure
ENVRI                 EU               Environmental Res. Infrastructure
ESSD/RENCI            US               Earth Science System Data
NCGEN/RENCI           US               Clinical Genomics
DICE                  US               IT Research and Systems
ADCIRC/RENCI          US               Ocean and Storm modeling
Deep Carbon Project   US               Environmental/Atmospheric Research

Note: There may be more projects/institutes that have endorsed or adopted the DFT model without notifying us.
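The simplified diagram above (persistent ID, digital object, bitstream, repository, metadata) can be read as a small data model. The following is a minimal sketch of that reading, not code from the DFT Working Group: the class names, the field names, and the identifier `hdl:0000/example-1` are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    pid: str          # persistent identifier; the value used below is made up
    metadata: dict    # descriptive metadata about the object
    bitstream: bytes  # the stored content itself

    def checksum(self) -> str:
        # Lets a client verify that two stored instances are the same content.
        return hashlib.sha256(self.bitstream).hexdigest()


@dataclass
class Repository:
    name: str
    objects: dict = field(default_factory=dict)  # maps PID -> DigitalObject

    def deposit(self, obj: DigitalObject) -> None:
        self.objects[obj.pid] = obj

    def resolve(self, pid: str) -> DigitalObject:
        return self.objects[pid]


# One digital object, with instances stored in two repositories.
obj = DigitalObject(
    pid="hdl:0000/example-1",  # hypothetical identifier
    metadata={"title": "Sea surface temperature sample", "format": "text/csv"},
    bitstream=b"time,temp\n0,14.2\n",
)
primary = Repository("primary")
replica = Repository("replica")
for repo in (primary, replica):
    repo.deposit(obj)
```

Because every component hangs off the persistent ID, a client can ask either repository for the same object and compare checksums without knowing how each repository organizes its files.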
How You Can Endorse
Outputs are openly available to:
- Anyone who wants to run a project, including those with large data collections
  - Organizations should be strictly compliant with the basic model to guarantee independence and thus easy re-purposing of all components
- Anyone who is working in a data federation project, integrating data from different sources, or wants to re-purpose data for data intensive science
  - Projects could use the model as a common reference model to design transformations
  - Projects could use the suggested terminology to achieve quick, mutual understanding
- Software developers, who can adopt the basic model to ensure their software can be used by almost everyone adhering to state of the art principles

How to Access and Use Outputs
- “Core Terms and Model” document available on website
  - Provides the final model and corresponding terms that can be applied to your project
- Additional Resources
  - Supplementary documents providing information on conceptualization and background for choices
  - Contact the Working Group co-chairs via email or at upcoming plenary
  - Contribute to the now functioning DFT Interest Group via email, wiki, Term Tool
  - Send a request to the RDA Europe support team

Next Steps
- Since the Working Group focused only on the basic set of core terms, work needs to be continued
  - Much more is out there, in particular in other RDA groups, where terminology harmonization would help substantially
  - We also see the need to consider the dynamics of the field and to be ready to adapt current definitions and perhaps even the model
- A follow-up Data Foundation and Terminology Interest Group has been established and will meet at RDA’s 6th Plenary in Paris next week
- A larger scope of integrated work is being discussed as part of the Data Fabric IG

Contact Information
DFT WG: https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
DFT IG: https://rd-alliance.org/groups/data-foundations-and-terminology-ig.html
TeD-T Term Definition Tool: http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page
RDA EU Support Team: dmp@europe.rd-alliance.org

Data Type Registries
Larry Lannom, CNRI
Daan Broeder, Meertens Institute, KNAW

Summary of the Problem
- Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
- How do we do this now?
  - For documents, formats are enough, e.g., PDF, and then the document explains itself to humans
  - This doesn’t work well with data: numbers are not self-explanatory
    - What does the number 7 mean in cell B27?
- Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
- Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in the data’s creation
- Affects all data producers and consumers

Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries
- Evaluate and identify a few assumptions in data that can be codified and shared in order to…
- Produce a functioning Registry system that can easily be evaluated by organizations before adoption
  - Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
  - Supports several Type record dissemination variations
  - Designed to allow federation between multiple Registry instances
- The emphasis is not on
  - Identifying every possible assumption and data characteristic applicable for all domains
  - Technology

Highlights of the Output
- Confirmation that detailed and precise data typing is a key consideration in data sharing and reuse and that a federated registry system for such types is highly desirable and needs to accommodate each community’s own requirements
- Deployment of a prototype registry implementing one potential data model, against which various use cases can be tested
- Involvement of multiple ongoing scientific data management efforts, across a variety of domains, in actively planning for and testing the use of data types and associated registries in their data management efforts
- Integration with one additional RDA WG (Persistent Identifier Types) and at least one Interest Group (RDA/CODATA Materials Data, Infrastructure & Interoperability IG)
- Development of a set of questions that require further consideration before a detailed recommendation on data typing can be issued

Impact of Use Case: Process Use Case
[Figure labels: Users; Federated Set of Type Registries; Typed Data (ID, Type, Payload); Services: Visualization, Data Processing, Data Set Dissemination, Rights (Terms, "I Agree"); arrows numbered 1 to 4]
1. Client (process or people) encounters unknown data type.
2. Resolved to Type Registry.
3. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally,
4. typed data or reference to typed data can be sent to service provider.

Endorsements/Adopters
- Materials Science Adoption Project
  - Demo at RDA’s 6th Plenary in Paris
  - X-ray diffraction use case
  - Normalize data sets resulting from multiple proprietary instruments
  - Enable a homogeneous analysis platform for data consumers to perform their analyses
- Deep Carbon Observatory
  - Goal: given a dataset identifier, discover detailed information about the structure(s) within that dataset, and act accordingly
  - DTR is a registry used for explicating structures in the form of type records
  - Facilitate norms of behavior relevant to data curation and re-use
- Digital Object Identifier
  - Given a DOI, what services are relevant and applicable?
  - Having chosen a service, how can a client invoke that service?
  - Having invoked a service, how can a client process the returned data?
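Steps 1 to 3 of the process use case above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the API of the DTR prototype: the `TypeRegistry` and `Federation` classes, the record layout, and the type identifier `example/type/1234` are all hypothetical.

```python
from typing import Optional


class TypeRegistry:
    """In-memory stand-in for one registry instance (hypothetical)."""

    def __init__(self, name: str, records: dict):
        self.name = name
        self.records = records

    def lookup(self, type_id: str) -> Optional[dict]:
        return self.records.get(type_id)


class Federation:
    """Asks each registry in turn until one knows the type."""

    def __init__(self, registries: list):
        self.registries = registries

    def resolve(self, type_id: str) -> dict:
        for registry in self.registries:
            record = registry.lookup(type_id)
            if record is not None:
                return record
        raise KeyError(type_id)


def interpret(typed_data: dict, federation: Federation) -> str:
    # Step 1: the client holds a payload whose type it does not know.
    # Step 2: the type identifier is resolved against the registries.
    record = federation.resolve(typed_data["type"])
    # Step 3: the type record supplies the assumptions the payload lacks.
    unit = record["properties"]["unit"]
    return f"{record['name']}: {typed_data['payload']} {unit}"


registry_a = TypeRegistry("a", {})
registry_b = TypeRegistry(
    "b",
    {"example/type/1234": {"name": "air temperature", "properties": {"unit": "degC"}}},
)
federation = Federation([registry_a, registry_b])
typed_data = {"id": "example/object/1", "type": "example/type/1234", "payload": 7}
print(interpret(typed_data, federation))  # prints "air temperature: 7 degC"
```

The bare payload `7` only becomes meaningful once the type record is attached, which is the point of registering types rather than relying on file formats alone.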
How You Can Endorse
- Start a new prototype effort
- Follow existing prototype efforts
- Attend the BoF at P6
- Join the Data Typing WG when it starts
- Try the public prototype at typeregistry.org

Next Steps and Contact Information
- A follow-up Working Group (WG) is planned: Data Typing
  - Leverage results of Data Type Registries Working Group
  - Collect results from multiple prototypes
  - Best practices for federation
- Birds of a Feather session on Data Typing at RDA’s 6th Plenary in Paris (24 Sept., Breakout #6)
- Proposed Chairs of Data Typing WG
  - Giridhar Manepalli, CNRI
  - Simon Cox, CSIRO
  - Tobias Weigel, DKRZ
- Larry and Daan are still around

PID Information Types: Towards PID Interoperability
Tobias Weigel (DKRZ / University of Hamburg)
Tim DiLauro (Data Conservancy / Johns Hopkins University)

Summary of the Problem
- Move from management of files towards management of objects
  - How does object management scale with increasing numbers?
  - How do we further automate our processes?
- Issues independent from particular disciplines, repositories, management approaches
- Understanding the most elemental characteristics of digital objects, for machine agents and human users
- Facilitate interoperability across PID systems and simplify PID record usage
- Avoid insular solutions and reiteration of efforts (open licenses)

Highlights of the Outputs
- More than 50 group members from EU/US/AU
- A lot of technical expertise and community experience
- Key Outputs (cf. summary report):
  - Conceptual insights on types and their possible structures
  - Practical type examples geared towards diverse use cases
  - Openly licensed API specification and Java-based prototype
[Figure labels: identifier; verification service; properties: size, checksum, timestamps, aggregation, version, license, format]

Impact of the Outputs
- Some initial types were registered in the TR prototype, making it possible to explore further applications
  - Information on how to register new types is available in the report
- Prompted plans in communities and projects about concrete applications
- PIDs and typing increasingly seen as a crucial component to decouple management of objects from contents
  - Simplify client access to data across domains, implementations and changes in information models
  - More lightweight access to information on less accessible objects

Endorsements/Adopters
Adopters can be:
- Communities who can use existing types and share custom types, as well as build tools and services that exploit them
- PID service providers who can offer a typing service as added value beyond registration and resolution, increasing PID interoperability

Adopter      Category                     Country   Scope / Goal
ENES/ESGF    Community                    Int.      Climate data management (CMIP6)
DCO-DS/RPI   Community                    US        Enhancing existing PID usage
EUDAT        Community/Service provider   EU        Added-value service to various disciplinary communities
MGI/NIST     Community                    US        Automation of data type conversions
EPIC         Service provider             EU
CNRI         Service provider             US
DONA         Service provider             Int.      Generic added-value service

How You Can Endorse
- Make use of existing type examples, invent your own types and please tell us about it!
- Follow-up RDA WGs on Collections and Data Typing will continue the work on concrete types.
- The PID Interest Group is also a good place to provide general feedback.
- Specification and prototype source code are openly available
- Possible development by EUDAT, DCO, ENES and others as interested adopters
- Offer by PID service providers as a service beyond registration and resolution
- Contribution to a unified type registry is encouraged

Next Steps and Contact Information
PID Information Types WG: https://rd-alliance.org/groups/pid-information-types-wg.html
PID Interest Group: https://rd-alliance.org/groups/pid-interest-group.html
PID Collections candidate WG: https://rd-alliance.org/groups/pid-collections-wg.html
  https://rd-alliance.org/pid-collections-p6-bof-session.html
Data Typing BoF: https://rd-alliance.org/data-typing-p6-bof-session.html

Practical Policy
Reagan Moore, Rainer Stotzka

Summary of the Problem
- Computer actionable policies are used to
  - enforce data management
  - automate administrative tasks
  - validate compliance with assessment criteria
  - automate scientific data processing and analyses
- Practical Policy: Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection
- Users motivated by issues related to scale, distribution

Policy Templates
- Practical Policy members represented
  - 11 types of data management systems
  - 30 institutions
- 2 testbeds
  - iRODS: Renaissance Computing Institute; DataNet Federation Consortium (DFC)
  - GPFS: Institute of Physics of the Academy of Sciences; CESNET; Garching Computing Centre (RZG)
- Published two documents
  - Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466B3E5775121CC.
  - Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466B3E5775121CC.
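To make "computer actionable policy" concrete, the sketch below expresses one assertion about a collection (every file still matches its recorded checksum) as a function, and an assessment step that validates compliance across the collection. It is a hedged illustration only: it does not use the iRODS rule language or any syntax from the published templates, and `StoredFile`, `integrity_policy`, and `assess` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class StoredFile:
    name: str
    content: bytes
    recorded_checksum: str  # state information written when the file was ingested


def integrity_policy(f: StoredFile) -> bool:
    # The assertion being enforced: content has not changed since ingest.
    return sha256(f.content) == f.recorded_checksum


def assess(collection: list, policy) -> dict:
    # Assessment: apply the policy to every file and record the outcome.
    return {f.name: policy(f) for f in collection}


good = StoredFile("a.csv", b"1,2,3\n", sha256(b"1,2,3\n"))
damaged = StoredFile("b.csv", b"4,5,6\n", sha256(b"4,5,9\n"))  # simulated corruption
report = assess([good, damaged], integrity_policy)
print(report)  # prints "{'a.csv': True, 'b.csv': False}"
```

Keeping the policy as a plain function separate from the assessment loop means the same loop can run other assertions (for example a replication count or a required metadata field) without change.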
Production Environments
Computer actionable rules to enforce:
- Preservation standards
  - Authenticity, integrity, chain of custody, arrangement
- Data management plans
  - Collection creation, product generation, publication, storage, archives
- Data distribution
  - Replication, content distribution network
- Publication
  - Descriptive metadata, time dependent access controls
- Processing pipelines
  - Workflow execution

Endorsements/Adopters
Distributed data management environments
- EUDAT Data Policy Manager, B2SAFE use case
- International Neuroinformatics Coordinating Facility
- Institut national de physique nucléaire et de physique des particules
- New Zealand BESTGRID
- DataNet Federation Consortium
- NSF data management plans
- Odum Institute preservation archive
- The iPlant Collaborative genomics data grid
- Science Observatory Network digital library
- SILS LifeTime Library
- HydroShare
- NOAA National Climatic Data Center
- NASA Center for Climate Simulations

Applications
- Policy-based collection management
  - Purpose for assembling the collection
  - Properties required to support the purpose
  - Policies that control when and where the properties are enforced
  - Procedures that execute operations controlled by the policies
  - Persistent state information that is generated by the procedures
  - Periodic assessment criteria that verify compliance
- RDA Publications
  - Policy templates: constraints, operations, required state information
  - Policy implementations: computer actionable rules to automate policy enforcement

Next Steps and Contact Information
- Data Fabric Interest Group
  - Policies to support Federation Interoperability
- Data Foundations and Terminology Interest Group
  - Vocabulary for policy management
- Interoperability testbeds
  - EUDAT: http://eudat.eu/data-access-and-reuse-policies-darup
  - National Data Service: http://www.nationaldataservice.org
  - DataNet Federation Consortium: http://datafed.org

Thank you. Questions?