Position Paper for Data Fabric IG: Interoperability, Infrastructures and Virtuality
https://rd-alliance.org/group/data-fabric-ig.html
Gary Berg-Cross, Keith Jeffery, Reagan Moore

What managed processes and services? What basic and flexible infrastructure machinery? What principles and methods are needed to guide the interaction between services (interface, protocol)? See https://rd-alliance.org/group/data-fabric-ig/post/re-rdadatafabric-ig-data-fabric-position-paper-broadendiscussion-berg-1 for discussion and the attached file.

Key Points
1. Much of the current DF discussion focuses on a data-management and lifecycle view. It lacks a focus on other important topics, such as the standards and federation mechanisms needed to assemble collaborations spanning institutions and data-management environments.
2. Interoperability needs to be a first-class concept in the DF conversation; it is fundamentally important for federation and for overcoming the problems generated by data silos.
3. There are multiple benefits to developing federated and virtualized mechanisms, and mathematical descriptions, to assist the sharing of digital objects (DOs) and (digital) knowledge procedures.

Initial Ideas for the DF IG
The implied framework is a data-lifecycle view with a data-management focus, which emerged from discussions amongst various RDA WG chairs. Where is interoperability in a raw-to-citable-data view? New groups emerge, so publications are part of the Data Fabric and its related analysis.

Data Fabric Analysis: Components and Services in the Lifecycle via Use Cases
How do we arrive at the essential components and services? (Guided by use scenarios, which need to include collaboration.) This scenario does not show analysis or data sharing via DO manipulation!
Make Interoperability a First-Class Concept
Concept of interoperability: the extent to which systems and devices can routinely exchange data and services, and interpret that shared data through the shared services. A stronger type of data exchange can include knowledge of the meaning of the data content, the usage constraints, and the underlying assumptions. Bring interoperability (within and across domains) into the DF discussions as a 'first-class citizen' alongside all the other aspects of the research data lifecycle in a domain.

Multiple Types of DF Infrastructure
When an enterprise implements a data-management solution, one of multiple types of DF infrastructure is typically chosen:
- Data management: build a data repository, manage an information catalog, and enforce management and curation policies.
- Data analysis: process a data collection, apply analysis and visualization tools, and automate a processing pipeline.
- Data preservation: build the reference collections and knowledge bases that comprise the enterprise's intellectual capital, while managing technology evolution.
- Data publication: provide descriptive information and arrangement for the discovery of and access to data collections.
- Data sharing: controlled sharing of a data collection, shared analysis workflows, and information catalogs; this is where interoperability enters.

Interoperability mechanisms are required for sharing data, information, and knowledge. Composition concerns how the components, developed separately, can be made to work together. A minimal set of infrastructure mechanisms and service requirements is needed. Different suites of components will have different data fabrics.
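The stronger type of data exchange described above, in which data travels together with its meaning, usage constraints, and underlying assumptions, can be illustrated with a minimal digital-object envelope. This is a sketch under assumptions, not a mechanism defined in this paper; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a digital object that carries not just a data
# payload but the knowledge needed to interpret it (meaning of the
# content, usage constraints, underlying assumptions). All names are
# illustrative only.

@dataclass
class DigitalObject:
    payload: bytes                                     # the data itself
    media_type: str                                    # formal syntax of the payload
    meaning: dict = field(default_factory=dict)        # semantics of each field
    usage_constraints: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)

    def is_self_describing(self):
        """Weak exchange ships only the payload; strong exchange also
        declares its meaning."""
        return bool(self.meaning)

do = DigitalObject(
    payload=b"12.3,45.6",
    media_type="text/csv",
    meaning={"col1": "latitude (deg)", "col2": "longitude (deg)"},
    usage_constraints=["attribution required"],
    assumptions=["WGS84 datum"],
)

print(do.is_self_describing())   # True: the meaning travels with the data
```

A receiving system can then interpret the payload without out-of-band agreements, which is the sense in which the paper calls interoperability a first-class concept.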
Brokers can bridge the gaps, obstacles, and possible incompatibilities between components, and enable reproducible research.

Data Sharing Use Cases
The EUDAT and DataNet Federation Consortium use cases provide some views that help: interoperability mechanisms for sharing DOs and (digital) knowledge procedures. An implication is that a researcher can re-execute trusted procedures to obtain identical results, making reproducible data-driven research possible. Community-driven research collaborations include:
- Seismology: share seismic data and tsunami-prediction workflows between research groups.
- Climate change: share oceanographic environmental data, coastal storm-surge analyses, hydrology flood analyses, and satellite environmental data.
- Genomics: build a cohort of genomes and predictive models for humans, plants, animals, and diseases.

Expanded Ideas of Federation: Three Versions of Federated Systems
1. Shared name spaces for users, files, and services. Besides single sign-on, a shared name space for users provides federated services; we also want to support virtual collections that span administrative domains. A shared name space for services enables the re-use of procedures across research resources.
2. Shared services for manipulating digital objects. For example, a service shared through a broker and accessed through its access protocol, or a service encapsulated in a virtual-machine environment and moved to the local research resources for execution.
3. Third-party (service) access. Requests are posted to a third party, such as a message queue, eliminating direct communication between the federated system components.

Enhanced Use of Virtualization
Virtual machines, such as in a cloud or grid environment, are required to manage dynamic resource allocation, scalability, distributed parallelism, energy efficiency, and other aspects. Virtual collections are required to build research-collaboration environments. We need the appropriate level of abstraction for optimum computing-environment/middleware behavior.
Too low or prescriptive a level constrains the environment; too high or abstract a level does not indicate the user's requirement clearly. See Triple-I Computing as a concept (the Information-Intention-Incentive model proposed by [Schubert and Jeffery, 2014]); research projects are already addressing the challenges therein.

Analysis and Preliminary Conclusions
We have made a useful start, but the DF vision needs to be expanded (and also focused, for maximum benefit) beyond a domain of registered DOs stored in well-managed repositories. Frame the DF and its services broadly as data use and applications, taking into account the available context or environment. This does not minimize good data-management practices and services, which are necessary and deserve support, but they are not sufficient to address the challenge of interoperability.

For enhanced, semi-automated interoperability we need to consider:
- Improved metadata for data in context, with enhanced semantics.
- Leveraging the emergence of a mathematical foundation for the federation of data-management systems (e.g. work by Hao Xu).

So the DF and VREs must be integrated:
1. Including datasets, software services, resources (computers, detectors, ...), and users;
2. Composed as workflows, documented mathematically and ideally created autonomically;
3. Achieved through metadata describing the elements of point 1, covering:
- Discovery;
- Contextualisation (relevance, quality, ... through relations to organisations, persons, projects, publications, etc., and through provenance and rights);
- Detailed, application-specific description (to connect software to data at a resource for a user), i.e. at schema level.

The key technologies to achieve interoperability, as recognised by researchers, are:
- AAAI
- PIDs
- Metadata with formal syntax and declared semantics
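As a rough illustration of the last point, metadata with formal syntax and declared semantics, consider a record whose syntax is plain JSON and whose descriptive keys are mapped to vocabulary URIs in a JSON-LD-style @context. The specific PID, vocabulary terms, and helper function below are assumptions made for this sketch, not part of the paper.

```python
import json

# Hypothetical sketch of a metadata record combining a PID, formal
# syntax (serialisable JSON), and declared semantics (each descriptive
# key mapped to a vocabulary URI). The PID and all values are invented.

record = {
    "@context": {  # declared semantics: key -> vocabulary term (URI)
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
        "license": "http://purl.org/dc/terms/license",
    },
    "pid": "hdl:11304/0000-example",   # invented persistent identifier
    "title": "Coastal storm surge analyses",
    "creator": "Example Research Group",
    "license": "http://creativecommons.org/licenses/by/4.0/",
}

REQUIRED = {"pid", "title", "creator"}

def is_interoperable(rec):
    """Minimal check: required fields are present, and every
    descriptive key is grounded in the declared @context."""
    descriptive = set(rec) - {"@context", "pid"}
    grounded = descriptive <= set(rec.get("@context", {}))
    return REQUIRED <= set(rec) and grounded

serialized = json.dumps(record)        # formal syntax: plain JSON
print(is_interoperable(record))
```

A record that fails such a check (a missing PID, or a key with no declared semantics) can still be stored, but it cannot be interpreted automatically by a federated partner, which is the gap the paper argues these key technologies close.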