Position Paper for Data Fabric IG: Interoperability, Infrastructures and Virtuality
https://rd-alliance.org/group/data-fabric-ig.html
Gary Berg-Cross, Keith Jeffery, Reagan Moore

What managed processes and services? What basic and flexible infrastructure machinery? What principles and methods are needed to guide the interaction between services (interface, protocol)? See https://rd-alliance.org/group/data-fabric-ig/post/re-rdadatafabric-ig-data-fabric-position-paper-broadendiscussion-berg-1 for discussion and the attached file.

Key Points
1. Much of the current DF discussion focuses on a data-management and lifecycle view. It lacks a focus on other important topics, such as the standards and federation mechanisms needed to assemble collaborations spanning institutions and data-management environments.
2. Interoperability needs to be a first-class concept in the DF conversation; it is fundamentally important for federation and for overcoming the problems generated by data silos.
3. There are multiple benefits to developing federated and virtualized mechanisms, and mathematical descriptions, to assist the sharing of digital objects (DOs) and (digital) knowledge procedures.

Initial Ideas for the DF IG
The implied framework is a data-lifecycle view with a data-management focus, which emerged from discussions amongst various RDA WG chairs. Where is interoperability in a raw-to-citable-data view? New groups emerge, so publications are part of the Data Fabric and its related analysis.

Data Fabric Analysis: Components and Services in the Lifecycle via Use Cases
How do we arrive at the essential components and services? (Guided by use scenarios, which need to include collaboration.) This scenario does not show analysis or data sharing via DO manipulation!
Make Interoperability a First-Class Concept
Concept of interoperability: the extent to which systems and devices can routinely exchange data and services, and interpret that shared data through the shared services. A stronger type of data exchange can include knowledge of the meaning of the data content, the usage constraints, and the underlying assumptions. Bring interoperability (within and across domains) into the DF discussions as a 'first-class citizen' alongside all the other aspects of the research data lifecycle in a domain.

Multiple Types of DF Infrastructure
When an enterprise implements a data-management solution, one of multiple types of DF infrastructure is typically chosen:
- Data management: build a data repository, manage an information catalog, and enforce management and curation policies.
- Data analysis: process a data collection, apply analysis and visualization tools, and automate a processing pipeline.
- Data preservation: build the reference collections and knowledge bases that comprise the enterprise's intellectual capital, while managing technology evolution.
- Data publication: provide descriptive information and arrangement for the discovery of and access to data collections.
- Data sharing: controlled sharing of a data collection, shared analysis workflows, and information catalogs; this is where interoperability enters.

Interoperability mechanisms are required for sharing data, information, and knowledge. Composition concerns how the components, developed separately, can be made to work together. A minimal set of infrastructure mechanisms and service requirements is needed. Different suites of components will have different data fabrics.
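The stronger type of data exchange described above, in which data travels together with its meaning, usage constraints, and underlying assumptions, can be illustrated with a minimal digital-object envelope. This is a sketch under assumptions, not a mechanism defined in this paper; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a digital object that carries not just a data
# payload but the knowledge needed to interpret it (meaning of the
# content, usage constraints, underlying assumptions). All names are
# illustrative only.

@dataclass
class DigitalObject:
    payload: bytes                                     # the data itself
    media_type: str                                    # formal syntax of the payload
    meaning: dict = field(default_factory=dict)        # semantics of each field
    usage_constraints: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)

    def is_self_describing(self):
        """Weak exchange ships only the payload; strong exchange also
        declares its meaning."""
        return bool(self.meaning)

do = DigitalObject(
    payload=b"12.3,45.6",
    media_type="text/csv",
    meaning={"col1": "latitude (deg)", "col2": "longitude (deg)"},
    usage_constraints=["attribution required"],
    assumptions=["WGS84 datum"],
)

print(do.is_self_describing())   # True: the meaning travels with the data
```

A receiving system can then interpret the payload without out-of-band agreements, which is the sense in which the paper calls interoperability a first-class concept.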
Brokers can bridge the gaps, obstacles, and possible incompatibilities between components, and enable reproducible research.

Data Sharing Use Cases
The EUDAT and DataNet Federation Consortium use cases provide some views that help: interoperability mechanisms for sharing DOs and (digital) knowledge procedures. An implication is that a researcher can re-execute trusted procedures to obtain identical results, making reproducible data-driven research possible. Community-driven research collaborations include:
- Seismology: share seismic data and tsunami-prediction workflows between research groups.
- Climate change: share oceanographic environmental data, coastal storm-surge analyses, hydrology flood analyses, and satellite environmental data.
- Genomics: build a cohort of genomes and predictive models for humans, plants, animals, and diseases.

Expanded Ideas of Federation: Three Versions of Federated Systems
1. Shared name spaces for users, files, and services. Besides single sign-on, a shared name space for users provides federated services; we also want to support virtual collections that span administrative domains. A shared name space for services enables the re-use of procedures across research resources.
2. Shared services for manipulating digital objects. For example, a service shared through a broker and accessed through its access protocol, or a service encapsulated in a virtual-machine environment and moved to the local research resources for execution.
3. Third-party (service) access. Requests are posted to a third party, such as a message queue, eliminating direct communication between the federated system components.

Enhanced Use of Virtualization
Virtual machines, such as in a cloud or grid environment, are required to manage dynamic resource allocation, scalability, distributed parallelism, energy efficiency, and other aspects. Virtual collections are required to build research-collaboration environments. We need the appropriate level of abstraction for optimum computing-environment/middleware behavior.
Too low or prescriptive a level constrains the environment; too high or abstract a level does not indicate the user's requirement clearly. See Triple-I Computing as a concept (the Information-Intention-Incentive model proposed by [Schubert and Jeffery, 2014]); research projects are already addressing the challenges therein.

Analysis and Preliminary Conclusions
We have made a useful start, but the DF vision needs to be expanded (and also focused, for maximum benefit) beyond a domain of registered DOs stored in well-managed repositories. Frame the DF and its services broadly as data use and applications, taking into account the available context or environment. This does not minimize good data-management practices and services, which are necessary and deserve support, but they are not sufficient to address the challenge of interoperability.

For enhanced, semi-automated interoperability we need to consider:
- Improved metadata for data in context, with enhanced semantics.
- Leveraging the emergence of a mathematical foundation for the federation of data-management systems (e.g. work by Hao Xu).

So the DF and VREs must be integrated:
1. Including datasets, software services, resources (computers, detectors, ...), and users;
2. Composed as workflows, documented mathematically and ideally created autonomically;
3. Achieved through metadata describing the elements of point 1, covering:
- Discovery;
- Contextualisation (relevance, quality, ... through relations to organisations, persons, projects, publications, etc., and through provenance and rights);
- Detailed, application-specific description (to connect software to data at a resource for a user), i.e. at schema level.

The key technologies to achieve interoperability, as recognised by researchers, are:
- AAAI
- PIDs
- Metadata with formal syntax and declared semantics
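As a rough illustration of the last point, metadata with formal syntax and declared semantics, consider a record whose syntax is plain JSON and whose descriptive keys are mapped to vocabulary URIs in a JSON-LD-style @context. The specific PID, vocabulary terms, and helper function below are assumptions made for this sketch, not part of the paper.

```python
import json

# Hypothetical sketch of a metadata record combining a PID, formal
# syntax (serialisable JSON), and declared semantics (each descriptive
# key mapped to a vocabulary URI). The PID and all values are invented.

record = {
    "@context": {  # declared semantics: key -> vocabulary term (URI)
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
        "license": "http://purl.org/dc/terms/license",
    },
    "pid": "hdl:11304/0000-example",   # invented persistent identifier
    "title": "Coastal storm surge analyses",
    "creator": "Example Research Group",
    "license": "http://creativecommons.org/licenses/by/4.0/",
}

REQUIRED = {"pid", "title", "creator"}

def is_interoperable(rec):
    """Minimal check: required fields are present, and every
    descriptive key is grounded in the declared @context."""
    descriptive = set(rec) - {"@context", "pid"}
    grounded = descriptive <= set(rec.get("@context", {}))
    return REQUIRED <= set(rec) and grounded

serialized = json.dumps(record)        # formal syntax: plain JSON
print(is_interoperable(record))
```

A record that fails such a check (a missing PID, or a key with no declared semantics) can still be stored, but it cannot be interpreted automatically by a federated partner, which is the gap the paper argues these key technologies close.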