RDA Data Fabric (DF) Interest Group
Peter Wittenburg & Gary Berg-Cross

DF IG Goals

Introduction to Group

Achieving reproducible data science is a high priority. Only highly automated, self-documenting procedures that respect proper data organization principles will overcome current barriers. The Data Fabric group needs to work out directions in the belief that integrated data fabrics are a critical component of infrastructures paving the way to reproducible science.

The goal is to develop alternative views, components and aspects of the DF concept and related infrastructure. Conceptualizations will be discussed to come to an agreed RDA view on how the evolving DF landscape can be productively described.

The idea for this IG emerged from discussions amongst the chairs of various RDA WGs.

Characteristics of a Data Fabric

We are just beginning to scout out the landscape of the data fabric. In one view it is a minimalistic set of infrastructure and service requirements by which services can plug into (belong to) the defined fabric. In a data fabric we see how components that were developed separately can be made to work together. This means that the data fabric will be different for different sets of components. We note, strongly, that it is meant as a descriptive/conceptual way to deal with the interrelation between many components, rather than a prescriptive one (as you would have with an architecture).

As part of this, essential DF components and their interrelations need to be identified and defined. The diagram provides a high-level view of possible actions within a Data Fabric, running from raw data to increasingly documented data that has been enriched and analyzed, creating referable and citable data. As shown, publications are part of this Data Fabric since they are often used for data mining and other analysis.

Some of the existing RDA groups, including metadata WGs and IGs, are working on DF components and need to be positioned in such a landscape.
New working groups need to be defined to work on identified components and interfaces.

Data Fabric IG (DF) Organizing Chairs: Gary Berg-Cross & Peter Wittenburg
Data Foundation and Terminology (DFT) Chairs: Gary Berg-Cross, Raphael Ritz, Peter Wittenburg

Some early statements on Data Fabric & Components

Infrastructure Component View (after Reagan Moore)

A data fabric is the set of software and hardware infrastructure components that are used to manage data, information, and knowledge. When an enterprise implements a data management solution, one of multiple types of DF infrastructure is typically chosen. Each type enables the enterprise to do one of the following:
• Data management: build a data repository, manage an information catalog, and enforce management policies.
• Data analysis: process a data collection, apply analysis tools, and automate a processing pipeline.
• Data preservation: build reference collections and knowledge bases that comprise the intellectual capital, while managing technology evolution.
• Data publication: support discovery of and access to data collections.
• Data sharing: provide controlled sharing of a data collection, shared analysis workflows, and information catalogs.

Data Object View (after Peter Wittenburg)

• The data fabric covers a domain of registered digital objects (DOs) that are stored in well-managed repositories.
• DOs are associated with metadata describing their creation context and history (provenance).
• The data fabric covers a domain of registered software components (workflows, services) that are in fact a special class of DOs.
• Actions on DOs may be guided by abstract policies that are explicit and thus auditable.
• There can be multiple data fabric implementations, which should be highly interoperable.

Data Fabric Service View (after Beth Plale)

A DF should:
• Be self-documenting: a service contributes to the lifecycle of the data objects it handles and must keep track of the scientifically relevant actions it performs on those data objects. The resulting log files are periodically sent to a provenance consolidator.
• Track data objects through its service processing using one of the well-known object identifier schemes.
• Identify itself as one type of service, drawn from an RDA-agreed list of service types.
• Implement an interface to a publish-subscribe system, which serves as the Data Fabric control mechanism.

People and Plans

Who

The DF group is a joint initiative of the first five working groups, which understood that everything they started needs to be seen as part of a bigger, dynamic plan. They also realized that the work they started needs to be maintained to meet new requirements. Others interested in the goal of improving our data creation machine are welcome to participate in the DF group.

Initiating Members

Rebecca Koskela, Keith Jefferey, Jane Greenberg, Reagan Moore, Rainer Stotzka, Tim Delauro, Tobias Weigel, Raphael Ritz, Gary Berg-Cross, Peter Wittenburg, Daan Broeder, Larry Lannom, Beth Plale, Juan Bicarregui, Herman Stehouwer

Status & Planned Outcomes

The DF group will first work on an initial position paper to characterize the landscape of components and interfaces that have the potential to realize a reproducible data science. This landscape will change over the years as components are replaced by others, so the position paper will need to be amended frequently. The DF group will be a continuous source for identifying new requirements and barriers that need to be addressed by new RDA groups, in particular working groups. Updating the results will be an interesting but worthwhile challenge.

The suggested Data Fabric IG is planned as a forum to discuss these alternative views, components and aspects of the DF concept.

To be discussed:
• What the agreed RDA view on a Data Fabric is.
• How the outputs from the RDA working groups fit into the DF concept, and how they relate to each other and to various related WGs and IGs within the RDA.
• Which further activities are required to push the data fabric concept ahead.
• Continuation and initialization of working group activities related to the DF.
• Improving the uptake of the WG outputs by communicating them as a coherent whole within the DF concept.

Prepared for the 4th RDA Plenary in Amsterdam, Sept. 2014
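
Illustrative sketch of the Data Fabric Service View

As one possible reading of the Data Fabric Service View described earlier, the four requirements (self-documentation, identifier-based tracking, a declared service type, and a publish-subscribe interface) might be sketched as follows. Everything here is an assumption made for illustration: the names Bus, FabricService and SERVICE_TYPES, the "provenance" topic, the record fields, and the placeholder identifier are hypothetical and are not RDA outputs or agreed specifications.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical stand-in for an RDA-agreed list of service types.
SERVICE_TYPES = {"repository", "analysis", "preservation"}


class Bus:
    """Toy in-memory publish-subscribe channel (assumed control mechanism)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


@dataclass
class FabricService:
    name: str
    service_type: str
    bus: Bus
    log: list = field(default_factory=list)

    def __post_init__(self):
        # The service must identify itself as one of the agreed types.
        if self.service_type not in SERVICE_TYPES:
            raise ValueError(f"unknown service type: {self.service_type}")

    def record(self, pid, action):
        """Log a scientifically relevant action on the object named by pid."""
        self.log.append({
            "pid": pid,
            "action": action,
            "service": self.name,
            "service_type": self.service_type,
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def flush(self):
        """Publish all logged records on the bus, clear the log, return the count."""
        sent = len(self.log)
        for entry in self.log:
            self.bus.publish("provenance", entry)
        self.log.clear()
        return sent


bus = Bus()
consolidated = []  # stands in for the provenance consolidator
bus.subscribe("provenance", consolidated.append)

service = FabricService("normalizer", "analysis", bus)
service.record("hdl:0000/example-1", "normalized units")  # placeholder identifier
service.flush()
```

In this sketch the consolidator is just a list subscribed to the bus; a real fabric would presumably use a messaging system and a persistent identifier service, neither of which the poster specifies.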