Impact of Outputs - Research Data Alliance

RDA’s Recently Endorsed Outputs
September 16, 2015
Agenda
 Introduction
 Data Foundation and Terminology
 Data Type Registries
 PID Information Types
 Practical Policy
 Questions
Data Foundation and Terminology
- Talking the Same Language –
Peter Wittenburg, Gary Berg-Cross, Raphael Ritz
Summary of the Problem
 What is the problem?
 Data organizations (DOrgs) and the ideas about them all differ
 We all speak different languages, wasting time and
misunderstanding each other in any project involving data
 Different DOrgs make data discovery and integration very time
consuming, inefficient and thus expensive
 Different DOrgs prevent us developing maintainable support software
 Who is impacted?
 All efforts to integrate data (Federations, BDA projects, etc.)
 What are the ramifications of not having the problem
resolved?
 Combining data of all sorts across different origins (projects, repositories,
disciplines, etc.) is a nightmare and requires a lot of curation and
transformation before the actual scientific analysis can start
Highlights of Data Foundation and Terminology
Working Group
 Structure
 60 members
 Almost all regions
 Different types of institutions and disciplines
 Skillsets ranged from relative newcomers up to members with much
experience from data intensive projects
 Outputs
 List of core terms essential to harmonize conceptualization of data
organizations
 Graphical model relating the terms
 Set of auxiliary documents including many use cases to demonstrate
the bottom-up approach and research of the WG
 Term Tool (using Semantic Media Wiki) to store definitions and allow
editing, classification and discussion of terms (which is also open for
other groups)
Active Contributors to the Work
Institute/Project   Country/Region   Domain
CNRI                US               IT Research and Systems
U Cardiff           UK               IT Research and Systems
AWI                 DE               Oceanography & Environment
MPG                 DE               Research Organisation
EUDAT               EU               Data Infrastructure
CLARIN              EU               Linguistic Research Infrastructure
EPOS                EU               Earth Observation Res. Infrastructure
ENES                Int              World Climate Res. Infrastructure
ENVRI               EU               Environmental Res. Infrastructure
DataOne             US               Environmental Infrastructure
ESSD/RENCI          US               Earth Science System Data
NCGEN/RENCI         US               Clinical Genomics
Europeana           EU               Humanities Infrastructure
DataCite/EPIC       Int              PID Infrastructures
DICE                US               IT Research and Systems
CAS                 CN               Earth Science Model
ADCIRC/RENCI        US               Ocean and Storm modeling
Impact of Outputs
 The European data infrastructure, EUDAT
 Federating data from many discipline repositories where each data
collection has a different data organization.
 If integration is not simply done at physical level (file structures), this
heterogeneity makes it very costly to integrate all data to enable repurposing and to make it accessible at different repositories.
 The International CLARIN Project:
 According to the Technology Director: Very handy to have a lingua
franca when discussing research infrastructure architectures. It was
good to be involved as adopting community from the start of the work.
 Similar experiences from international colleagues who
work on large scale data integration
 Harmonization greatly reduces integration time
Endorsements/Adopters
 EUDAT, CLARIN and others with dramatic problems in
data integration
 Approach aligned with the progress of the DFT Working Group
discussion
 Their repository setups now adhere to the DFT model, and their interaction
with different communities is based on it
 The Digital Object is described by metadata, is associated with a
Persistent ID, and has instances stored in trustworthy repositories (see
simplified diagram)
[Diagram: a digital object linked to its persistent ID, its metadata, and its bitstream held in a repository]
 Other projects (humanities, health, bioinformatics, neuroinformatics and
atmosphere research) adopted these models and the terminology
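
The Digital Object model described above can be sketched as a small data structure. This is only an illustration, not part of the DFT output: the class names, field names, the example PID and the repository names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Bitstream:
    """One stored instance (copy) of a digital object's content."""
    repository: str  # hypothetical repository name
    location: str    # where the bits live inside that repository


@dataclass
class DigitalObject:
    """A digital object: persistent ID, metadata, and stored instances."""
    pid: str
    metadata: dict[str, str] = field(default_factory=dict)
    instances: list[Bitstream] = field(default_factory=list)

    def repositories(self) -> set[str]:
        """Repositories that hold at least one instance of this object."""
        return {b.repository for b in self.instances}


# Made-up example values, for illustration only.
obj = DigitalObject(
    pid="hdl:12345/example-0001",
    metadata={"title": "Example corpus", "discipline": "linguistics"},
)
obj.instances.append(Bitstream("repo-a.example.org", "/data/0001.bin"))
obj.instances.append(Bitstream("repo-b.example.org", "/mirror/0001.bin"))
print(sorted(obj.repositories()))
```

Keeping the PID, the metadata and the stored instances as separate parts reflects the independence of components that the model aims for: a copy can move to another repository without changing the identifier or the description.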
Endorsements/Adopters
Institute/Project     Country/Region   Domain
CNRI                  US               IT Research and Systems
U Cardiff             UK               IT Research and Systems
MPG                   DE               Research Organisation
EUDAT                 EU               Data Infrastructure
CLARIN                EU               Linguistic Research Infrastructure
EPOS                  EU               Earth Observation Res. Infrastructure
ENES                  Int              World Climate Res. Infrastructure
ENVRI                 EU               Environmental Res. Infrastructure
ESSD/RENCI            US               Earth Science System Data
NCGEN/RENCI           US               Clinical Genomics
DICE                  US               IT Research and Systems
ADCIRC/RENCI          US               Ocean and Storm modeling
Deep Carbon Project   US               Environmental/Atmospheric Research
Note: There may be more projects/institutes that have endorsed
or adopted the DFT model without notifying us.
How You Can Endorse
 Outputs are openly available to:
 Anyone who wants to run a project, including those with large data
collections
 Organizations should be strictly compliant with the basic model to guarantee
independence and thus easy re-purposing of all components
 Anyone who is working in a data federation project, integrating data from
different sources, or wants to re-purpose data for data intensive science
 Projects could use the model as a common reference model to design
transformations
 Projects could use the suggested terminology to achieve quick, mutual
understanding
 Software developers, who can adopt the basic model to ensure their
software can be used by almost everyone adhering to state of the art
principles
How to Access and Use Outputs
 “Core Terms and Model” document available on
website
 Provides the final model and corresponding terms that can be
applied to your project
 Additional Resources
 Supplementary documents providing information on
conceptualization and background for choices
 Contact the Working Group co-chairs via email or at upcoming
plenary
 Contribute to the now functioning DFT Interest Group via email,
wiki, Term Tool
 Send a request to the RDA Europe support team
Next Steps
 Since the Working Group focused only on the basic set of
core terms, the work needs to continue
 Much more out there, in particular also in other RDA groups, where
terminology harmonization would help substantially
 We also see the need to consider the dynamics of the field and to be
ready to adapt current definitions and perhaps even the model
 A follow-up Data Foundation and Terminology Interest
Group has been established
 Group is meeting at RDA’s 6th Plenary in Paris next week
 A larger scope of integrated work is being discussed as part of the
Data Fabric IG
Contact Information
 DFT WG:
https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
 DFT IG:
https://rd-alliance.org/groups/data-foundations-and-terminology-ig.html
 TeD-T Term Definition Tool:
http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page
 RDA EU Support Team:
dmp@europe.rd-alliance.org
Data Type Registries
Larry Lannom, CNRI
Daan Broeder, Meertens Institute, KNAW
Summary of the Problem
 Data sharing requires that data can be parsed, understood, and reused by
people and applications other than those that created the data
 How do we do this now?
 For documents, formats are enough, e.g., PDF, and then the
document explains itself to humans
 This doesn’t work well with data: numbers are not self-explanatory
 What does the number 7 mean in cell B27?
 Data producers may not have explicitly specified certain details in the data:
measurement units, coordinate systems, variable names, etc.
 Need a way to precisely characterize those assumptions such that they
can be identified by humans and machines that were not closely involved
in the data’s creation
 Affects all data producers and consumers
Goal of the DTR Effort: Explicate and Share
Assumptions using Types and Type Registries
 Evaluate and identify a few assumptions in data that can be
codified and shared in order to…
 Produce a functioning Registry system that can easily be evaluated
by organizations before adoption
 Highly configurable for changing scope of captured and shared
assumptions depending on the domain or organization
 Supports several Type record dissemination variations
 Design for allowing federation between multiple Registry instances
 The emphasis is not on
 Identifying every possible assumption and data characteristic
applicable for all domains
 Technology
Highlights of the Output
 Confirmation that detailed and precise data typing is a key consideration in
data sharing and reuse and that a federated registry system for such types
is highly desirable and needs to accommodate each community’s own
requirements
 Deployment of a prototype registry implementing one potential data model,
against which various use cases can be tested
 Involvement of multiple ongoing scientific data management efforts, across
a variety of domains, in actively planning for and testing the use of data
types and associated registries in their data management efforts
 Integration with one additional RDA WG (Persistent Identifier Types) and at
least one Interest Group (RDA/CODATA Materials Data, Infrastructure &
Interoperability IG)
 Development of a set of questions that require further consideration before
a detailed recommendation on data typing can be issued
Impact of Use Case: Process Use Case
[Diagram: users (1) encounter typed data (IDs, type IDs, payloads), resolve the type (2) against a federated set of type registries, receive a response (3), and may pass typed data (4) to services; labels shown include Visualization, Rights ("Terms:… I Agree"), Data Set, Data Processing, and Dissemination]
1 A client (process or person) encounters an unknown data type.
2 The type is resolved at a Type Registry.
3 The response includes type definitions, relationships, properties, and possibly service pointers. The response can be used locally for processing or, optionally,
4 the typed data, or a reference to it, can be sent to a service provider.
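
The four steps above can be sketched with an in-memory stand-in for a federated set of registries. This is a hedged illustration only: `TypeRecord`, `TypeRegistry`, `resolve_federated`, the type identifiers and the service URL are hypothetical names, and the sketch does not reproduce the API of the prototype registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TypeRecord:
    """What a registry returns for a type ID (step 3)."""
    type_id: str
    description: str
    properties: dict[str, str] = field(default_factory=dict)
    services: list[str] = field(default_factory=list)  # optional service pointers


class TypeRegistry:
    """A single registry instance holding type records."""

    def __init__(self, records: list[TypeRecord]) -> None:
        self._records = {r.type_id: r for r in records}

    def resolve(self, type_id: str) -> Optional[TypeRecord]:
        return self._records.get(type_id)


def resolve_federated(registries: list[TypeRegistry], type_id: str) -> Optional[TypeRecord]:
    """Step 2: ask each registry in turn until one knows the type."""
    for registry in registries:
        record = registry.resolve(type_id)
        if record is not None:
            return record
    return None


# Two hypothetical registries forming a small federation.
registry_a = TypeRegistry([
    TypeRecord("example/temperature-celsius", "Air temperature",
               {"unit": "degree Celsius"}, ["https://viz.example.org/plot"]),
])
registry_b = TypeRegistry([
    TypeRecord("example/xrd-pattern", "X-ray diffraction pattern",
               {"x_axis": "2-theta angle", "y_axis": "intensity"}),
])

# Step 1: a client meets a payload whose type it does not know.
typed_data = {"id": "example/obj-1", "type_id": "example/temperature-celsius", "payload": 7}

record = resolve_federated([registry_a, registry_b], typed_data["type_id"])
if record is not None:
    # Step 3: use the response locally; step 4 would hand the typed data
    # (or a reference to it) to one of record.services instead.
    print(typed_data["payload"], record.properties["unit"])
```

With the type record in hand, the bare number 7 becomes interpretable: the assumption about its unit is looked up rather than guessed.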
Endorsements/Adopters
 Materials Science Adoption Project
 Demo at RDA’s 6th Plenary in Paris
 X-ray diffraction use case
 Normalize data sets resulting from multiple proprietary instruments
 Enable a homogeneous analysis platform for data consumers to perform their
analyses
 Deep Carbon Observatory
 Goal: given a dataset identifier, discover detailed information about the structure(s)
within that dataset, and act accordingly
 DTR is a registry used for explicating structures in the form of type records
 Facilitate norms of behavior relevant to data curation and re-use
 Digital Object Identifier
 Given a DOI, what services are relevant and applicable?
 Having chosen a service, how can a client invoke that service?
 Having invoked a service, how can a client process the returned data?
How You Can Endorse
 Start a new prototype effort
 Follow existing prototype efforts
 Attend the BOF at P6
 Join the Data Typing WG when it starts
 Try the public prototype at typeregistry.org
Next Steps and Contact Information
 A follow-up Working Group (WG) is planned:
Data Typing
 Leverage results of Data Type Registries Working Group
 Collect results from multiple prototypes
 Best practices for federation
 Birds of a Feather session on Data Typing at RDA’s 6th
Plenary in Paris (24 Sept., Breakout #6)
 Proposed Chairs of Data Typing WG
 Giridhar Manepalli, CNRI
 Simon Cox, CSIRO
 Tobias Weigel, DKRZ
 Larry and Daan are still around
PID Information Types:
Towards PID Interoperability
Tobias Weigel (DKRZ / University of Hamburg)
Tim DiLauro (Data Conservancy / Johns Hopkins University)
Summary of the Problem
 Move from management of files
towards management of objects
 How does object management scale with increasing numbers?
 How do we further automate our processes?
 Issues independent from particular disciplines, repositories,
management approaches
 Understanding the most elemental characteristics of
digital objects – for machine agents and human users
 Facilitate interoperability across PID systems and
simplify PID record usage
 Avoid insular solutions and reiteration of efforts – open
licenses
Highlights of the Outputs
 More than 50 group members from EU/US/AU
 A lot of technical expertise and community experience
 Key Outputs (cf. summary report):
 Conceptual insights on types and their possible structures
 Practical type examples geared towards diverse use cases
 Openly licensed API specification and Java-based prototype
[Diagram: an identifier whose record carries typed properties (size, checksum, timestamps, aggregation, version, license, format), read by clients such as a verification service]
Impact of the Outputs
 Some initial types were registered in the TR prototype,
making it possible to explore further applications
 Information on how to register new types available in the report
 Prompted plans in communities and projects for
concrete applications
 PIDs and typing increasingly seen as a crucial
component to decouple management of objects from
contents
 Simplify client access to data across domains, implementations and
changes in information models
 More lightweight access to information on less accessible objects
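
As a rough sketch of how typed PID record properties let a client work with an object without knowing its contents, the example below checks a bitstream against the size and checksum stored in a record. The record layout, the `sha256:` prefix convention, the example PID and the `verify` function are assumptions made for illustration; they are not taken from the Working Group's API specification.

```python
import hashlib

# Made-up object content and PID record, for illustration only.
content = b"temperature,7\n"

pid_record = {
    "pid": "hdl:12345/example-0002",
    "size": str(len(content)),
    "checksum": "sha256:" + hashlib.sha256(content).hexdigest(),
    "format": "text/csv",
}


def verify(record: dict[str, str], data: bytes) -> bool:
    """Check data against the size and checksum properties of a PID record."""
    if len(data) != int(record["size"]):
        return False
    algorithm, _, expected = record["checksum"].partition(":")
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected


print(verify(pid_record, content))
print(verify(pid_record, content + b"tampered"))
```

The verification service only reads agreed property types from the record, so it works the same way whichever repository or discipline the object comes from.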
Endorsements/Adopters
 Adopters can be:
 Communities who can use existing types and share custom types, as
well as build tools and services that exploit them
 PID service providers who can offer a typing service as added value
beyond registration and resolution, increasing PID interoperability
Adopter      Category                     Country   Scope / Goal
ENES/ESGF    Community                    Int.      Climate data management (CMIP6)
DCO-DS/RPI   Community                    US        Enhancing existing PID usage
EUDAT        Community/Service provider   EU        Added-value service to various disciplinary communities
MGI/NIST     Community                    US        Automation of data type conversions
EPIC         Service provider             EU
CNRI         Service provider             US
DONA         Service provider             Int.      Generic added-value service
How You Can Endorse
 Make use of existing type examples, invent your own
types and please tell us about it!
 Follow-up RDA WGs on Collections and Data Typing will continue the
work on concrete types. The PID Interest Group is also a good place to
provide general feedback.
 Specification and prototype source code are openly
available
 Possible development by EUDAT, DCO, ENES and others as
interested adopters
 Offer by PID service providers as a service beyond
registration and resolution
 Contribution to a unified type registry is encouraged
Next Steps and Contact Information
 PID Information Types WG
 https://rd-alliance.org/groups/pid-information-types-wg.html
 PID Interest Group
 https://rd-alliance.org/groups/pid-interest-group.html
 PID Collections candidate WG
 https://rd-alliance.org/groups/pid-collections-wg.html
 https://rd-alliance.org/pid-collections-p6-bof-session.html
 Data Typing BoF
 https://rd-alliance.org/data-typing-p6-bof-session.html
Practical Policy
Reagan Moore, Rainer Stotzka
Summary of the Problem
Computer actionable policies are used to
 enforce data management
 automate administrative tasks
 validate compliance with assessment criteria
 automate scientific data processing and analyses
Practical Policy:
Assertion or assurance that is enforced about a (data) collection
(data set, digital object, file) by the creators of the collection
Users motivated by issues related to scale, distribution
Policy Templates
 Practical Policy members represented
 11 types of data management systems
 30 institutions
 2 testbeds
 iRODS: Renaissance Computing Institute; DataNet Federation Consortium (DFC)
 GPFS: Institute of Physics of the Academy of Sciences, CESNET; Garching Computing Centre (RZG)
 Published two documents
 Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”,
February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466B3E5775121CC.
 Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”,
February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466B3E5775121CC.
Production Environments
 Computer actionable rules to enforce:
 Preservation standards
 Authenticity, integrity, chain of custody, arrangement
 Data management plans
 Collection creation, product generation, publication, storage,
archives
 Data distribution
 Replication, content distribution network
 Publication
 Descriptive metadata, time dependent access controls
 Processing pipelines
 Workflow execution
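
To make "computer actionable rule" concrete, here is a minimal sketch of a replication rule from the data distribution category. It is written in plain Python rather than in the rule language of iRODS or any other system named in these slides, and `StoredObject`, `replication_rule`, `enforce` and the site names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    name: str
    replicas: list[str]  # sites that currently hold a copy


def replication_rule(obj: StoredObject, min_replicas: int, sites: list[str]) -> list[str]:
    """Return the sites to copy to so that obj reaches min_replicas."""
    missing = min_replicas - len(obj.replicas)
    if missing <= 0:
        return []
    candidates = [s for s in sites if s not in obj.replicas]
    return candidates[:missing]


def enforce(collection: list[StoredObject], min_replicas: int, sites: list[str]) -> list[tuple[str, str]]:
    """Apply the rule to every object and log each replication performed."""
    log = []
    for obj in collection:
        for site in replication_rule(obj, min_replicas, sites):
            obj.replicas.append(site)  # stand-in for a real copy operation
            log.append((obj.name, site))
    return log


collection = [
    StoredObject("a.dat", ["site-1"]),
    StoredObject("b.dat", ["site-1", "site-2"]),
]
log = enforce(collection, min_replicas=2, sites=["site-1", "site-2", "site-3"])
print(log)
```

The rule itself only decides what must happen; the enforcement loop carries it out and records what was done, which is the kind of state information later compliance checks can rely on.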
Endorsements/Adopters
 Distributed data management environments
 EUDAT Data Policy Manager
 B2SAFE use case
 International Neuroinformatics Coordinating Facility
 Institut national de physique nucléaire et de physique des particules
 New Zealand BESTGRID
 DataNet Federation Consortium
 NSF data management plans
 Odum Institute preservation archive
 The iPlant Collaborative genomics data grid
 Science Observatory Network digital library
 SILS LifeTime Library
 HydroShare
 NOAA National Climatic Data Center
 NASA Center for Climate Simulations
Applications
 Policy-based collection management
 Purpose for assembling the collection
 Properties required to support the purpose
 Policies that control when and where the properties are enforced
 Procedures that execute operations controlled by the policies
 Persistent state information that is generated by the procedures
 Periodic assessment criteria that verify compliance
 RDA Publications
 Policy templates
 Constraints, operations, required state information
 Policy implementations
 Computer actionable rules to automate policy enforcement
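
The chain from property to policy, procedure, persistent state and periodic assessment can be illustrated with an integrity check. This is a simplified sketch under stated assumptions: the function names, the in-memory `state` dictionary and the choice of SHA-256 are hypothetical and are not taken from the published policy templates or implementations.

```python
import hashlib

# Persistent state information generated by the procedure (in memory here).
state: dict[str, str] = {}


def checksum_procedure(data: bytes) -> str:
    """Procedure: the operation that a policy triggers."""
    return hashlib.sha256(data).hexdigest()


def on_ingest_policy(name: str, data: bytes) -> None:
    """Policy: on ingest, record a checksum to support the integrity property."""
    state[name] = checksum_procedure(data)


def assess(collection: dict[str, bytes]) -> list[str]:
    """Periodic assessment: list objects whose content no longer matches the state."""
    return [
        name
        for name, data in collection.items()
        if state.get(name) != checksum_procedure(data)
    ]


collection = {"a.dat": b"alpha", "b.dat": b"beta"}
for name, data in collection.items():
    on_ingest_policy(name, data)

collection["b.dat"] = b"corrupted"  # simulate silent data corruption
violations = assess(collection)
print(violations)
```

Separating the policy (when to act) from the procedure (what to do) and from the assessment (how to verify) mirrors the list above, and lets each part be replaced independently.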
Next Steps and Contact Information
 Data Fabric Interest Group
 Policies to support
 Federation
 Interoperability
 Data Foundations and Terminology Interest Group
 Vocabulary for policy management
 Interoperability testbeds
 EUDAT
 http://eudat.eu/data-access-and-reuse-policies-darup
 National Data Service
 http://www.nationaldataservice.org
 DataNet Federation Consortium
 http://datafed.org
Thank you.
Questions?