
Position Paper for Data Fabric IG
Interoperability, Infrastructures and Virtuality
https://rd-alliance.org/group/data-fabric-ig.html
Gary Berg-Cross, Keith Jeffery, Reagan Moore
What managed processes & services?
What basic & flexible infrastructure machinery?
See https://rd-alliance.org/group/data-fabric-ig/post/re-rdadatafabric-ig-data-fabric-position-paper-broadendiscussion-berg-1 for discussion & attached file.
What principles & methods are needed to guide the interaction between services (interface, protocol)?
Key Points
1. Much of the current DF discussion focuses on a data management & lifecycle view.
 It lacks a focus on other important topics:
 the standards & federation mechanisms needed to assemble collaborations spanning institutions and data management environments.
2. Interoperability needs to be a first-class concept in the DF conversation:
 it is fundamentally important for federation & for overcoming the problems generated by data silos.
3. There are multiple benefits in developing federated & virtualized mechanisms & mathematical descriptions to assist in sharing DOs & (digital) knowledge procedures.
Initial ideas for the DF IG: the implied framework is a Data Lifecycle view with a data management focus, which emerged from discussions amongst various RDA WG chairs.
Where is interoperability in a raw-to-citable data view?
New groups emerge, so publications are part of the Data Fabric & related analysis.
Data Fabric analysis, e.g. components & services in the lifecycle, via use cases.
How do we arrive at the essential components & services?
(guided by use scenarios that need to include collaboration & sharing)
This scenario doesn’t show analysis or data sharing via DO manipulation!
Make Interoperability a First-Class Concept
 Concept of interoperability:
 the extent to which systems and devices can routinely exchange data and services, and interpret that shared data through the shared services.
 A stronger form of data exchange can include knowledge of the meaning of the data content, usage constraints, and the underlying assumptions.
 Bring interoperability aspects (within and across domains) into the DF discussions as a ‘first-class citizen’ alongside all the other aspects of the research data lifecycle in a domain.
Multiple types of DF infrastructure
 When an enterprise implements a data management solution, one of multiple types of DF infrastructure is typically chosen:
 Data management – the enterprise builds a data repository, manages an information catalog, & enforces management & curation policies.
 Data analysis – the enterprise processes a data collection, applies analysis & visualization tools, and automates a processing pipeline.
 Data preservation – the enterprise builds reference collections and knowledge bases that comprise its intellectual capital, while managing technology evolution.
 Data publication – the enterprise provides descriptive information and arrangement for discovery and access of data collections.
 Data sharing – controlled sharing of a data collection, shared analysis workflows, and information catalogs – interoperability.
Interoperability mechanisms are required for sharing data, information, & knowledge:
 Composition – how the separate components, developed separately, can be made to work together.
 A minimal set of infrastructure mechanisms & service requirements.
 Different suites of components will have different data fabrics.
 Brokers.
 Gaps, obstacles and possible incompatibilities.
 Enable reproducible research.
Data Sharing Use Cases
 EUDAT & the DataNet Federation Consortium use cases provide some views to help:
 interoperability mechanisms for sharing DOs & (digital) knowledge procedures.
 An implication is that a researcher can re-execute trusted procedures to obtain identical results, making reproducible data-driven research possible.
 Community-driven research collaborations:
 Seismology – share seismic data and tsunami prediction workflows between research groups
 Climate change – share oceanography environmental data, coastal storm surge analyses, hydrology flood analyses, satellite environmental data
 Genomics – build a cohort of genomes and predictive models for humans, plants, animals, diseases
Expanded Ideas of Federation:
3 versions of federated systems
1. Shared name spaces for users, files, and services.
 Besides single sign-on and a shared name space providing users with federated services, we want to support virtual collections that span administrative domains.
 A shared name space for services enables re-use of procedures across research resources.
2. Shared services for manipulating digital objects.
 For example, a shared service reached through a broker, accessed through its own protocol, or encapsulated in a virtual machine environment and moved to the local research resources for execution.
3. Third-party (service) access.
 Requests are posted to a third party, such as a message queue, eliminating direct communication between the federated system components.
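The third federation pattern above can be sketched in a few lines. This is a minimal in-process illustration, not taken from EUDAT or the DataNet Federation Consortium; all names (`broker`, `post_request`, `serve_one`, the `checksum` service) are hypothetical, and a real deployment would use a networked message queue rather than a local one.

```python
import queue

# Hypothetical sketch: a third-party message queue that decouples federated
# components. The client never calls the remote service directly; it posts a
# request to the queue, and the service side takes requests from the queue.
broker = queue.Queue()

def post_request(operation, payload):
    """Client side: post a service request to the third party (the queue)."""
    broker.put({"operation": operation, "payload": payload})

def serve_one(handlers):
    """Service side: take one request from the queue and dispatch it."""
    request = broker.get()
    handler = handlers[request["operation"]]
    return handler(request["payload"])

# Example: a shared "checksum" service exposed only through the broker.
handlers = {"checksum": lambda data: sum(data) % 256}

post_request("checksum", [10, 20, 30])
print(serve_one(handlers))  # 60
```

Because the two sides share only the queue and a request format, either side can be replaced or relocated without the other noticing, which is the point of third-party access.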
Enhanced Use of Virtualization
 Virtual machines, such as in a cloud or Grid environment:
 required to manage dynamic resource allocation, scalability, distributed parallelism, energy efficiency and other aspects.
 Virtual collections:
 required to build research collaboration environments.
 We need the appropriate level of abstraction for optimum computing environment/middleware behavior:
 too low or prescriptive a level constrains the environment;
 too high or abstract a level does not indicate clearly the requirements of the user.
 See Triple-I Computing as a concept (the Information-Intention-Incentive model proposed by [Schubert and Jeffery, 2014]); research projects are already addressing the challenges therein.
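A virtual collection can be sketched as a mapping from logical object names to physical replicas held in different administrative domains, so collaborators address data by logical name only. This is a hypothetical illustration of the abstraction, not the implementation of any particular middleware; the collection entries, URLs and `resolve` helper are invented.

```python
# Hypothetical sketch: a virtual collection maps logical object names to
# physical replica locations spread across administrative domains.
virtual_collection = {
    "seismic/2014/event-041": [
        "https://repo.site-a.example/objects/9f31",
        "https://repo.site-b.example/objects/77c2",
    ],
}

def resolve(logical_name):
    """Return the candidate physical locations for a logical name."""
    replicas = virtual_collection.get(logical_name)
    if not replicas:
        raise KeyError(f"no replica registered for {logical_name}")
    return replicas

# A collaborator uses the logical name; the fabric picks a physical copy.
print(resolve("seismic/2014/event-041")[0])
```

The abstraction keeps the level of indirection the slide argues for: researchers work at the logical level, while the middleware is free to add, move, or drop replicas underneath.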
Analysis and Preliminary Conclusions
 We have made a useful start, but the DF vision needs to be expanded (and also focused, for maximum benefit) to:
 more than a domain of registered DOs stored in well-managed repositories;
 framing DF & its services broadly as data use & applications, taking into account the available context or environment.
 This doesn’t minimize good data management practices and services, which are necessary and deserve support, but they are not sufficient to address the challenge of interoperability.
 For enhanced, semi-automated interoperability we need to consider:
 improved metadata for data in context, with enhanced semantics;
 leveraging the emergence of a mathematical foundation for federation of data management systems (e.g. work by Hao Xu).
So… DF & VRE must be integrated:
 including datasets, SW services, resources (computers, detectors…), users;
 composed as workflows, documented mathematically and ideally created autonomically;
 achieved through metadata describing the elements of the first bullet:
 discovery;
 contextualisation (relevance, quality, … through relations to organisations, persons, projects, publications etc., and provenance, rights);
 detailed application-specific metadata (to connect software to data at a resource for a user), i.e. schema-level.
 The key technologies to achieve interoperability, as recognised by researchers, are:
 AAAI
 PIDs
 metadata with formal syntax and declared semantics
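A minimal sketch of what "metadata with formal syntax and declared semantics" can look like, using a JSON-LD-style record: JSON supplies the formal syntax, the `@context` declares the semantics of each key by mapping it to a shared vocabulary term, and the record is keyed by a PID. The handle value and record contents are hypothetical; the vocabulary URIs are Dublin Core terms.

```python
import json

# Hypothetical metadata record. Formal syntax: JSON. Declared semantics:
# the @context maps each local key to a shared vocabulary term, so another
# system can interpret the fields without prior agreement on key names.
record = {
    "@context": {
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
        "license": "http://purl.org/dc/terms/license",
    },
    "@id": "hdl:21.T11998/0000-001A-3905-F",  # PID (hypothetical handle)
    "title": "Coastal storm surge analysis, run 12",
    "creator": "Hydrology Group",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Formal syntax means any party can serialize, parse, and validate the record.
serialized = json.dumps(record, indent=2)
parsed = json.loads(serialized)
print(parsed["@id"])
```

Combining a PID with declared semantics is what makes the record usable across a federation: the PID says which object is described, and the context says what each descriptive field means.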