rep-reg-comparison - Research Data Alliance

Comparison between Re3data, Databib, and GOCDB
Goal: To provide a quick overview of Re3data and GOCDB, and of how suitable each
might be for the data fabric group
Author: Herman Stehouwer
The data fabric group has an inherent need for a registry, or set of registries, of
services (including data processing, repositories, etc.). To meet the needs of the
data fabric, such registries must list the available services explicitly and in great
detail, together with some information on the service providers. In any case it must
be possible to chain services [1] together into a single workflow and to let the
general machinery use them. To achieve this, the input, output and processing
provided by these services must be specified formally (in a machine-actionable
way).
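To make the idea of a formal, machine-actionable description concrete, here is a minimal sketch in Python. It is only an illustration: the `ServiceDescription` fields, the type strings and the `can_chain` helper are assumptions made for this example and are not taken from any existing registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical formal description of one service."""
    name: str
    input_type: str    # what the service accepts
    output_type: str   # what the service produces
    operation: str     # the processing it performs


def can_chain(services):
    """Return True if each service's output type matches the next one's input type."""
    return all(a.output_type == b.input_type
               for a, b in zip(services, services[1:]))


# Illustrative services; names and types are made up for this sketch.
repository = ServiceDescription("example-repository", "collection-id", "text/csv", "retrieve")
converter = ServiceDescription("example-converter", "text/csv", "application/json", "convert")
store = ServiceDescription("example-store", "application/json", "collection-id", "deposit")

workflow = [repository, converter, store]
print(can_chain(workflow))             # True: every output feeds the next input
print(can_chain([repository, store]))  # False: text/csv does not match application/json
```

A real description would need richer type information than plain strings, but even this level of detail lets software decide whether a proposed chain is valid without human intervention.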
At the time of writing, two possible options have been discussed within the
context of the data fabric group. To me it seems that neither of these will work as
an off-the-shelf solution for our needs. However, GOCDB is probably the easiest to
bring to what we need, especially considering that we could run instance(s)
specific to the data fabric needs.
I do wish to stress that choosing a prototype running a certain registry does not
mean we should lock ourselves into that registry. If we define clearly which
information we need, registries should be fairly swappable.
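One way to picture this swappability is a small interface that client code depends on, with each registry hidden behind it. The sketch below is an assumption-laden illustration: `ServiceRegistry`, `find_services`, the two stand-in classes and the example records are all hypothetical and do not correspond to the actual Re3data or GOCDB APIs.

```python
from typing import Dict, List, Protocol


class ServiceRegistry(Protocol):
    """Hypothetical minimal interface the data fabric would rely on."""

    def find_services(self, operation: str) -> List[Dict[str, str]]:
        ...


class ListBackedRegistry:
    """Stand-in for one registry: keeps records in a flat list."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def find_services(self, operation: str) -> List[Dict[str, str]]:
        return [r for r in self._records if r["operation"] == operation]


class IndexedRegistry:
    """Stand-in for another registry: indexes records by operation."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._by_operation: Dict[str, List[Dict[str, str]]] = {}
        for record in records:
            self._by_operation.setdefault(record["operation"], []).append(record)

    def find_services(self, operation: str) -> List[Dict[str, str]]:
        return list(self._by_operation.get(operation, []))


def endpoints_for(registry: ServiceRegistry, operation: str) -> List[str]:
    """Client code depends only on the interface, not on the registry behind it."""
    return sorted(r["endpoint"] for r in registry.find_services(operation))


# Made-up records for illustration.
records = [
    {"operation": "convert", "endpoint": "https://example.org/convert"},
    {"operation": "retrieve", "endpoint": "https://example.org/retrieve"},
]

print(endpoints_for(ListBackedRegistry(records), "convert"))
print(endpoints_for(IndexedRegistry(records), "convert"))
```

Both calls give the same answer, which is the point: once the required information is pinned down, the registry behind the interface can be exchanged without touching the client.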
Re3data & Databib
In March 2014 it was announced that, by the end of 2015, the Re3data and
Databib services will be merged and joined under the auspices of DataCite.
The Re3data service consists of a conveniently searchable and browsable website
that has a list of repositories. These repositories are classified by the applicable
subject fields and the content types they provide.
Furthermore, under the “Standards” area of the repository description, APIs and
other services can be listed with a short description of the API type.
Conceivably the system could be extended relatively easily to contain the
required information. This would, however, affect the entire registry: a redesign
of the database that stores the information, among other things, would be needed.
In addition, it is unclear how to keep the stored information up to date, which is
essential in a fast-growing and changing landscape.
[1] Create data collections (offered by repositories), execute chains of operations
on them, and create and add new collections to the store to be managed.
Information in the registry is suggested by external parties and entered into the
system immediately; the Re3data team reviews it at some later point.
Decentralized processes would be required to let trusted repositories change
relevant information in a data fabric (DF) situation.
GOCDB
The GOCDB software has been designed to help run grid-type infrastructures and
is now also in use in the backbone of the EUDAT project. It is supported by STFC.
The registry can support multiple projects, topologies, administrative domains,
sites, services, groups of services, etc.
There is a strong definition of responsibilities with specific users having
responsibilities for specific sites in terms of maintaining the information. A
flexible authentication layer enables this. This is very attractive for maintaining a
data-fabric type infrastructure, i.e. a designated (group of) person(s) from each
site can be made responsible for the records describing the capabilities and
services.
There is also a clear API available, and a GOCDB instance for the data fabric could
be deployed as a separate index. It is very easy to expand the information in a
GOCDB record. Specifically, there is the option to add (agreed upon) key-value pairs.
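The following Python sketch illustrates what extending a record with agreed-upon key-value pairs could look like. It does not use the GOCDB API: the record layout, the `AGREED_KEYS` set and the `df.*` key names are assumptions chosen only to show the idea of restricting extensions to a shared vocabulary.

```python
# Hypothetical set of keys the data fabric group has agreed on.
AGREED_KEYS = frozenset({"df.input_type", "df.output_type", "df.operation"})


def add_extension(record, key, value):
    """Return a copy of `record` with one agreed-upon key-value pair added."""
    if key not in AGREED_KEYS:
        raise KeyError(f"{key!r} is not an agreed-upon extension key")
    extensions = dict(record.get("extensions", {}))
    extensions[key] = value
    return {**record, "extensions": extensions}


# A made-up service record, extended step by step.
record = {"service": "example-converter", "site": "example-site"}
record = add_extension(record, "df.input_type", "text/csv")
record = add_extension(record, "df.output_type", "application/json")
print(record["extensions"])
```

Rejecting unknown keys is a design choice in this sketch: it keeps the extensions comparable across sites, which matters as soon as software rather than people reads them.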
The main disadvantage is that GOCDB is more complex than we need (it
contains large areas of information that are needed for running a grid) and
needs expertise and commitment to run and keep up to date. However, so does a
reliable service or repository. Furthermore, it would be fairly trivial to query
multiple instances as long as they all extend the formal description of services in
the same way. A short overview and description can be found at
https://wiki.egi.eu/wiki/GOCDB_Documentation_Index.
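As a rough illustration of why a shared extension scheme makes querying several instances straightforward, consider the sketch below. The instances are plain in-memory lists rather than real GOCDB deployments, and the record layout, the `df.operation` key and the `query_instances` helper are assumptions made for this example.

```python
def query_instances(instances, key, value):
    """Collect services from every instance whose extension `key` equals `value`."""
    matches = []
    for instance_name, records in instances.items():
        for record in records:
            if record.get("extensions", {}).get(key) == value:
                matches.append((instance_name, record["service"]))
    return matches


# Two made-up instances that extend their records in the same way.
instances = {
    "instance-a": [
        {"service": "repo-1", "extensions": {"df.operation": "retrieve"}},
        {"service": "conv-1", "extensions": {"df.operation": "convert"}},
    ],
    "instance-b": [
        {"service": "repo-2", "extensions": {"df.operation": "retrieve"}},
        {"service": "legacy", "extensions": {}},
    ],
}

print(query_instances(instances, "df.operation", "retrieve"))
```

The same query runs unchanged against every instance only because both use the same key for the same meaning; an instance with a different convention would silently return nothing.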
Summary
The DF discussion needs to lead to requirements for such a registry of
repositories and, based on those, to a specification. A critical comparison would
then be required to see whether an existing solution can be adopted and adapted.
It seems to me that the Re3data approach is very well suited to its purpose of
providing a website with a list of trusted repositories, but that this solution is
too simple for the type of automatic processing environment intended for DF.
GOCDB seems very suitable and adaptable, but there might also be other options
out there.