rda_use case Humanities_Data_Centre_v04

advertisement
RDA Data Fabric IG: Use Case
Humanities Data Centre
Stefan Buddenbohm (Max Planck Institute for the Study of Religious and Ethnic Diversity Göttingen),
Claudia Engelhardt (Göttingen State and University Library)
Daniel Kurzawe (Göttingen State and University Library)
1. Scientific Motivation and Outcomes (max. 0.5 pages)
Provide a short summary of the scientific or technical motivation for the use case. What would be the best
possible outcome and why?
The quantity of digital research data has been growing significantly. The same applies to the use of
digital tools and methods. Digital infrastructures for the long-term storage of digital research data
and provision are not as developed as in other domains such as in astrophysics or climate research.
The Humanities Data Centre (HDC) design project (May 2014 – April 2016), aims to establish a data
centre for research data from the humanities.
Whereas some of the challenges of this undertaking apply to other disciplines as well, some have
humanities specific aspects. Typically, research data in the humanities are:
 beyond objects: Besides file based research data such as documents, images and videos,
more complex types of data such as databases and visualisations of data are increasingly
being used. Common repositories are not able to handle complex dynamic visualisationsm as
they are focused on managing objects, like file based data.
 non-generic: The re-usability of data depends on proper interpretation – which is strongly
discipline- and case-specific. It requires thorough documentation of context and structure of
a research project as well as appearance and behaviour of the data – exceeding by far the
actual content of the data.
 hardly scalable: Technical guidelines and workflows are not sufficient for capturing and
managing research data and all relevant context information. For a sustainable handling of
research data, human interaction is needed as good documentation well. The consultation
with researchers should cover the whole research process, from the design of the project to
the long-term management of the research data.
The desired outcome of the HDC project is a research data centre for the humanities meeting the
requirements resulting from the situation described above.
2. Functional Description (max. 1 page)
Give at least one diagram that indicates the overall structure/architecture of the data creation and
consumption machinery that is being used in the lab/infrastructure. Describe in simple words the functioning of
the machinery.
The basic service levels of the HDC have been derived
from a mapping of the main significant property
categories (reflecting the needs of the scholarly
community) on service levels (see Figure 1). We have
identified three service levels that should cover most
of the functionalities demanded by researchers:
sustainability, presentation and integration.
Figure 1: Mapping of significant properties on service levels
Sustainability is the basis for both presentation and
Figure 2: Service levels of the HDC
integration. It covers bit stream and content
preservation and the associated functions.
Digital research data are often dependent on their
environment with regard to a) specific presentation
(software) environments (sometimes even employing
external services) that makes them visible and
perceivable for humans, and b) from terminologies,
scholarly theories etc. which are determinant for
their interpretation. The service level of presentation
ensures that the data are accessible in the context of
these environments.
Integration, finally, brings resources from various
sources together (e.g. the interconnection/merging of several data bases) and thereby generates
added value.
All service components can be assigned to one of these three levels. We expect that the common
use case of a research data deposit will be composed of various service components from at least
two, in many cases all three levels. The service levels and the components required to achieve them
are described in more detail in the next section.
3. Describe essential Components and their Services (max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer.
These descriptions don't have to be comprehensive.
The HDC service infrastructure will be composed of three levels that were derived from a mapping of
functions (requested by researchers).
A sustainability layer will provide the ground for all further curation and re-use aspects. The main
services of this layer are 1. bitstream preservation and 2. content preservation. It is a rather
conventional solution already in use by most research data centres, hence an established, scalable
and calculable service offer. Nevertheless some additional functions bring this service layer in
contact with end users: it will allow citation and referencing of research data by PIDs to enhance
publications and provide transparency of research. This layer will also take into account basic
documentation demands, be they from research funders or by legal demands.
The other two service levels, presentation and integration, are not mutually exclusive but can be
seen as complementary services depending on the needs of a specific research project. They both
build on the sustainability layer.
Presentation as a general demand category will be the first point of access for someone looking at
research data in the HDC. This layer will provide all necessary information required to evaluate the
worth of the research data for individual demands. This layer will reach beyond a conventional
repository solution and can include visualisations of research data or interactive applications
demonstrating the use case and the results of the research. This layer can be implemented for nearly
all types of research data. However, depending on the type and complexity of the data, the technical
solutions employed will be different. The time span for which preservation and access can be
guaranteed may also be different for individual solutions. For instance, interactive applications or
visualisations of research data are often too complex to preserve them for an indefinite time-span,
at least at this point of time. But they can be stored in a secured environment (“frozen” in a Virtual
Machine) and made accessible for users for an uncertain period of time (as long as they are function
in the given environment).
Integration as the third service layer is aiming at the re-usability of the research data, including the
generation of scholarly added value. Therefore, standardisation and interconnection of data is the
focus here. This will be the most useful aspect of a research data centre – but also the most costly
one. Different levels of integration are possible, ranging from direct export functionalities for raw
data via APIs (if the data is standardised and freely available) to more sophisticated provision of
applications (generic viewer for digital editions) or merged data sets (i.e. various merged name
authority files).
Except for the basic service level “sustainability”, the more sophisticated service levels
“presentation” and “integration” may be seen as continua, at their endpoints representing the ideal
condition of fully achieved presentation or integration.
4. Describe optional/discipline specific Components and their Services
(max. 1 page)
Describe the most essential infrastructural components of the machinery and the kind of services they offer.
These descriptions don't have to be comprehensive.
Aside from the humanities-specific types of research data and the service structure derived from this
(described in detail in sections 3 and 5) no optional or discipline specific components can be
described yet at this point.
5. Describe essentials of the underlying Data Organization (max. 1 page)
Describe the most important aspects of the underlying data organization and compare it with the model
outlined by DFT.
The HDC decided to focus initially on a small number of research data types that are seen as
characteristic for the discipline. Furthermore these research data types had to be more complex and
challenging than merely file-based data objects and therefore represent the innovative aspects of
the HDC and demonstrate the modularity and expandability of data centre. The research data types
include:
 Digital edition
 Interactive visualisation of research data / application
 Database
 Interview data
Two of these research data types are illustrated in the figures below, outlining their provenance and
technical requirements with look from the perspective of an infrastructure provider.
Figure 3: Archiving of interactive visualisations
Figure 4: Long-term preservation of digital editions
6. Indicate the type of APIs being used (max. 1 page)
Describe the most relevant APIs and whether they are open for being used.
Currently, the technical implementation of the concept is still under development, so no concrete
technical specifications can be given yet. In terms of ingest, one option will likely be a simple
Dropbox-like interface for the easy self-deposit of data. However, this will only be a solution for less
complex (file-based) data. In case of more complex data, advice and support by data curators and
probably guided ingest processes may be necessary.
Support and training by data curators will be an essential component of the data centre. We
envision a decentralised network of agents at, for example, university libraries or research
institutions, who are entering the dialogue with researchers at the earliest possible moment in the
research process, accompanying them from there to the end of the project, ensuring that the data
are managed in a way that prepares them ideally for deposit and re-use. Standardisation and a
proper distribution of resources are important issues to tackle in this respect.
7. Achieved Results (max. 0.5 pages)
Describe the results (if applicable) that have been achieved compared to the original motivation.
In 2016, the HDC will enter the construction phase. To ensure interoperability, the HDC will in
permanent exchange with researchers and also communicate and cooperate with other
infrastructure providers, data centres, libraries and other related humanities data projects. It is
obvious that in the end usability will derive to a large extent from interoperability, visibility and ease
of use of the archived research data.
Download