FAIMS Repository Storage Land Scape (‘Use Case’) Revision Chart Version 0.1 - Draft Internal 0.2 – Draft External 0.3 – Draft External 1.0 – Final Primary Author(s) Martin Paulo Martin Paulo Martin Paulo Martin Paulo Description of Version Initial draft for internal review Draft for external review Draft for external review, adds Lucene indexes For release Date Completed 7th January, 2013 8th January, 2013 9th January, 2013 15th January, 2013 Introduction The FAIMS repository will be build by scaling up the existing AHAD repository hosted at La Trobe University. The AHAD repository is in turn powered by tDAR, an open source digital archive and repository. In effect, an existing large body of working Java software will have new features added to it. By this means the repository development is intended to leverage the cooperative nature of open source software, giving a low risk approach to delivering a repository that will allow Australian archaeological researchers to deposit and access data. Page 1 of 4 tDAR, A Very High Level View tDAR is first and foremost a repository. In that role, users upload digital artifacts into the repository. These artifacts are stored in their original format, and also, if need be according to their type, copied into an associated preservation format. In the process of uploading digital artifacts, metadata is entered by the user and associated with the digital artifacts. This metadata facilitates the organization of, the discovery of, and the definition of relationships between the uploaded artifacts. Certain types of digital artifacts, containing tabular data, such as Excel spreadsheets and MS-Access databases, also have their contents extracted and copied into a ‘read only’ relational database. tDAR provides toolsets that allow users to easily share this extracted data with other users, and also to integrate the extracted data to form new data sets offering greater insight and knowledge. tDAR makes use of Lucene to index the digital artifacts on ingest. tDAR Storage, The Main Parts They say a picture is worth a thousand words. So: tDAR file store personal file store tDAR metadata tDAR data Lucene indexes File Store The file store is the repository where the uploaded digital artifacts are stored. It is implemented by means of a PairTree: a file system hierarchy. Page 2 of 4 Personal File Store The personal file store acts as user specific digital artifact holding pen: one in which files uploaded asynchronously are assembled before being worked with, or where digital artifacts generated by the user in working with datasets are stored. It is implemented by means of BagIt: a file system based assembly designed to facilitate the exchange of digital content. tDAR Metadata tDAR Metadata is a relational PostgreSQL database in which the metadata entered by the user is stored. tDAR Data tDAR Data is a relational PostgreSQL database into which uploaded tabular data is copied. This data is not editable by the user: it there to simply to facilitate the sharing and integration of tabular data. The ultimate source of truth will always be the original uploaded file. Lucene Indexes Lucene expects to find its indexes on local storage. It also expects to have only one Lucene instance writing updates to the indexes. Thoughts As written, tDAR expects to find a file system available to store uploaded files on. tDAR manages the files within that file system and does not expect to find any other system, including other copies of tDAR, updating or modifying files on that file system. This is especially true of the Lucene index files. This makes moving tDAR to the cloud a somewhat interesting exercise. Ephemeral storage is simply not a viable option for the repository. A repository with a file system that can be arbitrarily reset to its initial state is not a repository. Similarly, the two databases require non-ephemeral storage. At the very simplest, persistent block storage that can be mounted and written to is required. But even with mountable block storage available, the number of instances of tDAR that can be run, as written, is limited to a single instance. Thus moving tDAR to the NeCTAR cloud will result in an application that is less reliable and Page 3 of 4 less performant than if tDAR were installed on dedicated hardware. This is not a desired outcome! Ideally the PairTree and BagIt file stores should be replaced with implementations that make use of the available redundant object storage. However, this will involve some significant amount of effort, as the underlying internal semantics of the application will be changed. This change will also impact the original tDAR development team, as it will mean making changes to an important and stable part of their application! Perhaps a better approach would be to create a virtual file system that maps ‘files’ to objects in the object store. This could possibly be done by means of an implementation of the FUSE package, for example, CloudFuse. Regardless of approach, the sharing of the Lucene indexes amoungst instances is going to be an interesting challenge to tackle. Conclusion As it currently stands tDAR is not a good fit for the NeCTAR cloud. The availability of persistent block storage is a necessary prerequisite before this migration can be considered. Only once persistent block storage becomes available can the migration can be started. That migration has two possible paths: one is via a rewrite of some core components, the other is by means of presenting the cloud components through a façade that preserves the semantics expected by tDAR. Further work is required to establish the optimum path. Page 4 of 4