RDA Data Fabric IG: Use Case Humanities Data Centre

Stefan Buddenbohm (Max Planck Institute for the Study of Religious and Ethnic Diversity Göttingen), Claudia Engelhardt (Göttingen State and University Library), Daniel Kurzawe (Göttingen State and University Library)

1. Scientific Motivation and Outcomes (max. 0.5 pages)
Provide a short summary of the scientific or technical motivation for the use case. What would be the best possible outcome and why?

The quantity of digital research data has been growing significantly, and the same applies to the use of digital tools and methods. In the humanities, however, digital infrastructures for the long-term storage and provision of research data are not as developed as in other domains such as astrophysics or climate research. The Humanities Data Centre (HDC) design project (May 2014 to April 2016) aims to establish a data centre for research data from the humanities. Whereas some of the challenges of this undertaking apply to other disciplines as well, others have humanities-specific aspects. Typically, research data in the humanities are:

beyond objects: Besides file-based research data such as documents, images and videos, more complex types of data such as databases and visualisations of data are increasingly being used. Common repositories are not able to handle complex dynamic visualisations, as they are focused on managing objects such as file-based data.

non-generic: The re-usability of data depends on proper interpretation, which is strongly discipline- and case-specific. It requires thorough documentation of the context and structure of a research project as well as of the appearance and behaviour of the data, going far beyond the actual content of the data.

hardly scalable: Technical guidelines and workflows are not sufficient for capturing and managing research data and all relevant context information. For a sustainable handling of research data, human interaction is needed as well as good documentation.
The consultation with researchers should cover the whole research process, from the design of the project to the long-term management of the research data.

The desired outcome of the HDC project is a research data centre for the humanities that meets the requirements resulting from the situation described above.

2. Functional Description (max. 1 page)
Give at least one diagram that indicates the overall structure/architecture of the data creation and consumption machinery that is being used in the lab/infrastructure. Describe in simple words the functioning of the machinery.

The basic service levels of the HDC have been derived from a mapping of the main significant property categories (reflecting the needs of the scholarly community) onto service levels (see Figure 1). We have identified three service levels that should cover most of the functionalities demanded by researchers: sustainability, presentation and integration.

Figure 1: Mapping of significant properties on service levels

Figure 2: Service levels of the HDC

Sustainability is the basis for both presentation and integration. It covers bitstream and content preservation and the associated functions. Digital research data often depend on their environment with regard to a) specific presentation (software) environments, sometimes even employing external services, that make them visible and perceivable for humans, and b) terminologies, scholarly theories etc. that determine their interpretation. The service level of presentation ensures that the data are accessible in the context of these environments. Integration, finally, brings resources from various sources together (e.g. the interconnection or merging of several databases) and thereby generates added value.

All service components can be assigned to one of these three levels. We expect that a typical research data deposit will be composed of service components from at least two, and in many cases all three, levels.
The service levels and the components required to achieve them are described in more detail in the next section.

3. Describe essential Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer. These descriptions don't have to be comprehensive.

The HDC service infrastructure will be composed of three levels that were derived from a mapping of functions requested by researchers.

A sustainability layer will provide the ground for all further curation and re-use aspects. The main services of this layer are (1) bitstream preservation and (2) content preservation. This is a rather conventional solution already in use by most research data centres, and hence an established, scalable and calculable service. Nevertheless, some additional functions bring this service layer into contact with end users: it will allow citation and referencing of research data by PIDs in order to enhance publications and make research more transparent. This layer will also take into account basic documentation requirements, whether they stem from research funders or from legal obligations.

The other two service levels, presentation and integration, are not mutually exclusive but can be seen as complementary services, depending on the needs of a specific research project. They both build on the sustainability layer.

Presentation, as a general demand category, will be the first point of access for someone looking at research data in the HDC. This layer will provide all the information required to evaluate the value of the research data for individual purposes. It will reach beyond a conventional repository solution and can include visualisations of research data or interactive applications demonstrating the use case and the results of the research. This layer can be implemented for nearly all types of research data. However, depending on the type and complexity of the data, the technical solutions employed will be different.
The time span for which preservation and access can be guaranteed may also differ between individual solutions. For instance, interactive applications or visualisations of research data are often too complex to be preserved for an indefinite time span, at least at this point in time. But they can be stored in a secured environment (“frozen” in a Virtual Machine) and made accessible to users for an uncertain period of time (as long as they function in the given environment).

Integration, as the third service layer, aims at the re-usability of the research data, including the generation of scholarly added value. Therefore, standardisation and interconnection of data are the focus here. This will be the most useful aspect of a research data centre, but also the most costly one. Different levels of integration are possible, ranging from direct export functionalities for raw data via APIs (if the data are standardised and freely available) to the more sophisticated provision of applications (e.g. a generic viewer for digital editions) or merged data sets (e.g. various merged name authority files).

Unlike the basic service level “sustainability”, the more sophisticated service levels “presentation” and “integration” may be seen as continua whose endpoints represent the ideal condition of fully achieved presentation or integration.

4. Describe optional/discipline specific Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer. These descriptions don't have to be comprehensive.

Aside from the humanities-specific types of research data and the service structure derived from them (described in detail in sections 3 and 5), no optional or discipline-specific components can be described at this point.

5. Describe essentials of the underlying Data Organization (max. 1 page)
Describe the most important aspects of the underlying data organization and compare it with the model outlined by DFT.

The HDC decided to focus initially on a small number of research data types that are seen as characteristic for the discipline. Furthermore, these research data types had to be more complex and challenging than merely file-based data objects, so that they represent the innovative aspects of the HDC and demonstrate the modularity and expandability of the data centre. The research data types include:

Digital edition
Interactive visualisation of research data / application
Database
Interview data

Two of these research data types are illustrated in the figures below, which outline their provenance and technical requirements from the perspective of an infrastructure provider.

Figure 3: Archiving of interactive visualisations

Figure 4: Long-term preservation of digital editions

6. Indicate the type of APIs being used (max. 1 page)
Describe the most relevant APIs and whether they are open for being used.

Currently, the technical implementation of the concept is still under development, so no concrete technical specifications can be given yet. In terms of ingest, one option will likely be a simple Dropbox-like interface for the easy self-deposit of data. However, this will only be a solution for less complex (file-based) data. In the case of more complex data, advice and support by data curators, and probably guided ingest processes, may be necessary.

Support and training by data curators will be an essential component of the data centre. We envision a decentralised network of agents at, for example, university libraries or research institutions, who enter into dialogue with researchers at the earliest possible moment in the research process and accompany them from there to the end of the project, ensuring that the data are managed in a way that prepares them ideally for deposit and re-use.
Standardisation and a proper distribution of resources are important issues to tackle in this respect.

7. Achieved Results (max. 0.5 pages)
Describe the results (if applicable) that have been achieved compared to the original motivation.

In 2016, the HDC will enter the construction phase. To ensure interoperability, the HDC will be in permanent exchange with researchers and will also communicate and cooperate with other infrastructure providers, data centres, libraries and related humanities data projects. In the end, usability will derive to a large extent from the interoperability, visibility and ease of use of the archived research data.