Comparison between Re3data, Databib, and GOCDB

Goal: To provide a quick overview of Re3data on the one hand and GOCDB on the other, and of how suitable each might be for the data fabric group

Author: Herman Stehouwer

The data fabric group has an inherent need to use a registry, or a set of registries, of services (including data processing, repositories, etc.). To meet the needs of the data fabric, such registries must be able to explicitly list, in great detail, the available services and some information on the service providers. In any case it must be possible to chain services [1] together into a single workflow and to allow the general machinery to use them. To achieve this, the input, output and processing provided by these services must be specified formally (in a machine-actionable way).

At the time of writing, two possible options have been discussed within the context of the data fabric group. To me it seems that neither of these will work as an off-the-shelf solution for our needs. However, GOCDB is probably the easiest to adapt to what we need, especially considering that we could run one or more instances specific to the data fabric needs.

I do wish to stress that choosing a certain registry for a prototype does not mean we should lock ourselves in to that registry. If we define clearly which information we need, registries should be fairly swappable.

Re3data & Databib

In March 2014 it was announced that, by the end of 2015, the Re3data and Databib services will be merged and joined under the auspices of DataCite.

The Re3data service consists of a conveniently searchable and browsable website that lists repositories. These repositories are classified by the applicable subject fields and the content types they provide. Furthermore, under the “Standards” area of the repository description, APIs and other services can be listed with a short description of the API type.
Conceivably the system could be extended relatively easily to contain the required information. This would, however, affect the entire registry, i.e. a redesign of the database that stores the information etc. would be needed. In addition, it is unclear how the stored information would be kept up to date, which is essential in a fast-growing and changing landscape.

Information in the registry is suggested by external parties and entered in the system immediately; it is reviewed by the re3data team at some later point. Decentralized processes would be required to let trusted repositories change relevant information in a DF situation.

[1] Create data collections (offered by repositories), execute chains of operations on them, and create and add new collections to the store to be managed.

GOCDB

The GOCDB software was designed to help run grid-type infrastructures and is now also in use in the backbone of the EUDAT project. It is supported by STFC. The registry can support multiple projects, topologies, administrative domains, sites, services, groups of services, etc.

Responsibilities are strongly defined: specific users are responsible for maintaining the information on specific sites, and a flexible authentication layer enables this. This is very attractive for maintaining a data-fabric type infrastructure, i.e. a designated person or group of persons from each site can be made responsible for the records describing the capabilities and services.

There is also a clear API available, and a GOCDB for the data fabric could be deployed as a separate index. It is very easy to expand the information in a GOCDB record; specifically, there is the option to add (agreed-upon) key-value pairs.

The main disadvantage is that GOCDB is more complex than our needs require (it contains large areas of information that are needed for running a grid) and that it needs expertise and commitment to run and keep up to date. However, so does a reliable service or repository.
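
To illustrate how agreed-upon key-value pairs might carry the formal input/output description that chaining requires, here is a minimal sketch in Python. It does not use the GOCDB API or schema: the ServiceRecord class, the keys df.input_type and df.output_type, and the example services are all hypothetical and only stand in for whatever the data fabric group would actually agree on.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, agreed-upon keys; not part of any existing registry schema.
INPUT_KEY = "df.input_type"
OUTPUT_KEY = "df.output_type"


@dataclass
class ServiceRecord:
    """A registry record reduced to a name plus free-form key-value pairs."""
    name: str
    properties: Dict[str, str] = field(default_factory=dict)


def can_chain(services: List[ServiceRecord]) -> bool:
    """Return True if every service's output type equals the next one's input type."""
    for upstream, downstream in zip(services, services[1:]):
        produced = upstream.properties.get(OUTPUT_KEY)
        consumed = downstream.properties.get(INPUT_KEY)
        # A missing key means the record is not formally described: refuse to chain.
        if produced is None or consumed is None or produced != consumed:
            return False
    return True


# Made-up example records, as a site administrator might enter them.
repository = ServiceRecord("example-repository", {OUTPUT_KEY: "text/csv"})
converter = ServiceRecord(
    "example-converter",
    {INPUT_KEY: "text/csv", OUTPUT_KEY: "application/json"},
)
indexer = ServiceRecord("example-indexer", {INPUT_KEY: "application/json"})

print(can_chain([repository, converter, indexer]))  # True
print(can_chain([repository, indexer]))             # False
```

The point of the sketch is only that the check depends on the agreed keys, not on the registry that stores them, which is why a clear definition of the required information should keep registries swappable.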

Furthermore, it would be fairly trivial to query multiple GOCDB instances, as long as they all extend the formal description of services in the same way. A short overview and description can be found at https://wiki.egi.eu/wiki/GOCDB_Documentation_Index.

Summary

The DF discussion needs to lead to requirements for such a registry of repositories and, based on those requirements, to a specification. Then a critical comparison would be required to see whether an existing solution can be adopted and adapted. It seems to me that the re3data approach is very well suited to its purpose of providing a website with a list of trusted repositories, but that this solution is too simple for the type of automatic processing environment intended for DF. The GOCDB software seems very suitable and adaptable, but there might also be other options out there.