PERICLES - Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital Preservation]

DELIVERABLE 7.3.1
Initial Version of Test Bed Implementation

GRANT AGREEMENT: 601138
SCHEME: FP7 ICT 2011.4.3
Start date of project: 1 February 2013
Duration: 48 months

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Dissemination level
PU  Public  X
PP  Restricted to other programme participants (including the Commission Services)
RE  Restricted to a group specified by the consortium (including the Commission Services)
CO  Confidential, only for members of the consortium (including the Commission Services)

Revision History
V#    Date      Description / Reason of change   Author
V0.1  05/05/14  Initial content table            Sven Bingert
V0.2  27/05/14  Basic content available          Sven Bingert
V0.3  23/06/14  Last version for review          Sven Bingert
V1.0  27/06/14  Final version                    Sven Bingert

Authors
Partner    Name
UGOE       Sven Bingert, Philipp Wieder, Noa Campos López
DOT        George Antoniadis
UEDIN      Alistair Grant
ULIV       John Harrison
SpaceApps  Emanuele Milani

Contributors
Partner    Name
UEDIN      Rob Baxter, Amy Krause
SpaceApps  Rani Pinchuk
KCL        Simon Waddington, Mark Hedges, Christine Sauter (reviewers)
UGOE       Jens Ludwig
ULIV       Adil Hasan (reviewer)
DOT        Stavros Tekes (reviewer)

© PERICLES Consortium

Table of Contents
1 Executive Summary ..... 7
2 Introduction & Rationale ..... 8
  2.1 Context of this deliverable ..... 8
    2.1.1 PERICLES project ..... 8
  2.2 What to expect from this document ..... 8
  2.3 Document structure ..... 9
3 User scenarios ..... 10
  3.1 Arts & Media domain ..... 10
  3.2 Science domain ..... 10
  3.3 Component mapping ..... 11
4 Integration framework ..... 13
  4.1 Integration framework architecture ..... 13
    4.1.1 Workflow engine ..... 14
    4.1.2 Ingest ..... 16
    4.1.3 Archival storage ..... 16
    4.1.4 Data management ..... 16
    4.1.5 Access ..... 16
  4.2 Connections and connection types ..... 16
  4.3 Handlers ..... 17
    4.3.1 What are they? ..... 17
    4.3.2 Why do we need them? ..... 17
    4.3.3 What they have to do ..... 17
    4.3.4 Anatomy of a handler and a component ..... 18
    4.3.5 Handlers in a chain ..... 19
    4.3.6 Status responses ..... 21
    4.3.7 Payloads and workflows ..... 22
  4.4 Data object and data management ..... 23
    4.4.1 Information package versioning ..... 23
    4.4.2 Local storage and version management for workflow ..... 24
  4.5 Integration and testing ..... 25
5 Test beds ..... 27
  5.1 Common technologies ..... 27
    5.1.1 Test management: Jenkins ..... 27
    5.1.2 iRods ..... 27
    5.1.3 Maven ..... 28
    5.1.4 BagIt ..... 28
    5.1.5 Topic Maps engine ..... 28
    5.1.6 Web service containers ..... 28
    5.1.7 Vagrant ..... 28
    5.1.8 Workflow engine and management ..... 29
    5.1.9 Development languages ..... 29
  5.2 Description of the arts & media test bed ..... 29
    5.2.1 Structure of the test bed ..... 29
    5.2.2 Domain specific technologies ..... 30
    5.2.3 Current status and tests ..... 31
  5.3 Description of the science test bed ..... 31
    5.3.1 Structure ..... 31
    5.3.2 Domain specific technologies ..... 33
    5.3.3 Status and current tests ..... 33
6 Roadmap ..... 35
  6.1 Media test bed ..... 35
  6.2 Science test bed ..... 36
  6.3 Integration of tools and components ..... 36
7 Conclusion ..... 38
List of Figures and Tables ..... 39
List of Figures ..... 39
List of Tables ..... 39
Bibliography ..... 40
Appendix ..... 42

Glossary

AIP: The Archival Information Package is the information stored by the archive.

API: Application Programming Interface.

Architecture: The architecture is an abstraction of a complex software system or program aiming to describe how the system will behave. It provides a high-level design of the system, the logic of the software elements, the connectors and the relations between them, rather than a description of the implementation and technical details.

Component: A component is a functional unit designed and developed to provide specific sets of operations and behaviors. Functional units can be models, workflows or software.

DIP: The Dissemination Information Package is the information sent to the user when requested.

Framework: A software framework provides generic functionality that can be augmented by user-written code to create an application for a specific purpose. A software framework is a reusable software platform to develop software applications, products and solutions. Software frameworks include support programs, compilers, code libraries, tool sets, and APIs (application programming interfaces) that bring together all the different components to enable development of a project or solution.

JSON: The JavaScript Object Notation is a standard and simple text-based message format designed to be human-readable. It uses text to transmit data objects consisting of attribute-value pairs. It is commonly used to transmit data between a server and a web application.

LRM: LRM is an operational OWL ontology to be used to model the dependencies between digital resources handled by the PERICLES preservation ecosystems.

LTDP: Long-Term Data Preservation.

OAIS model: The Open Archival Information System model is a standard for digital repositories. The OAIS model specifies how digital assets should be preserved for a community of users from the moment digital material is ingested into the storage area, through subsequent preservation strategies, to the creation of the package containing the information required for the end user. The OAIS reference model is a high-level reference model, which means it is flexible enough to use in a wide variety of environments.

REST: The REpresentational State Transfer describes a way to create, read, update or delete information on a server using simple HTTP calls. RESTful web services keep interactions between components simple. Clients only refer to the target resource and the actions; each request contains all the information required for execution. No client state is stored on the server between requests.

SIP: The Submission Information Package is the information sent from the producer to the archive.

Test bed: A test bed is a project development platform used for replicable and transparent testing of new tools and technologies. A test bed consists of real hardware and provides an environment for testing new software.

Test scenario: A test scenario is a subset of a user scenario that allows enactable testing to be carried out. A test scenario should have clear criteria for testing - these can include pass/fail, performance and quality criteria. A test scenario can encompass an entire user scenario or a minor part of the user scenario. A test scenario can enable the testing of requirements against the product.

UML: The Unified Modeling Language provides a standard way to visualize the design of a software system. UML diagrams represent in an easy and understandable way the behavior, actions and components of the system.

Use case: A use case is the list of interactions between actors to achieve a discrete and distinct goal. Actors can be human, internal or external systems and agents. A use case is a formalization of a path in a user scenario.

User scenario: A user scenario describes foreseeable sets of interactions between user roles and system agents. A user scenario describes the process by which a goal will be achieved by an initiating entity in plain language. The user scenario should define the interacting agents in the process and the time frame for the process to be enacted across. User scenarios should include scope definitions both at the equipment and organizational level.

VM: Virtual Machine.

Workflow: A workflow is a sequence of operations, a step-by-step description of real work.

xIP: Information Package of any type (SIP, AIP or DIP).
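As a small illustration of the JSON attribute-value format described in the glossary, the following sketch serialises and parses a hypothetical metadata record. All field names are illustrative assumptions, not part of any PERICLES specification.

```python
import json

# Hypothetical attribute-value record, e.g. minimal metadata a producer
# might send alongside a Submission Information Package (SIP).
# Field names here are illustrative, not a PERICLES schema.
sip_metadata = {
    "identifier": "sip-0001",
    "title": "Example submission",
    "attributes": {"format": "video/mp4", "size_bytes": 1048576},
}

encoded = json.dumps(sip_metadata)   # serialise to a JSON text message
decoded = json.loads(encoded)        # parse it back into a data object

print(decoded["attributes"]["format"])   # -> video/mp4
```

Because the encoded form is plain text, such messages can be exchanged between any two components regardless of their implementation language.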
1 Executive Summary

The PERICLES project addresses the aspect of change within long-term preservation and its impact on the reuse of digital data over a prolonged period of time. The research focuses on developing models which will guide the development of services, guidelines, tools, applications and interfaces in support of managing change over time. Two case studies from outside the traditional archives, one from the media and art domain and one from the science domain, have been chosen to validate the outcome of the project and to direct the research into practice domains.

For the Arts & Media domain, several user scenarios that describe the interactions between the user and the system were created, whereas for the Science domain workflow descriptions that capture the relevant process information were defined. Combining the information provided by the user scenarios, workflow descriptions and the user requirements of both domains, we extract the system requirements, as well as common and specific components and tools that cover the functionalities required.

The current version of the integration framework, developed so far and described in the deliverable 7.1.1 First version of integration framework and API implementation, allows integrating and testing the extracted components and tools. Its architecture is based on the OAIS model, so we find the common high-level components Ingest, Data Management, Archival Storage and Access, but also a Workflow Engine that is responsible for orchestrating the actions of preservation components. Another new concept was introduced: the handler.
Handlers are the basis of the integration framework: they act as the communication points for each entity within the system, and it is the handlers (not the components) that deal with the rest of the PERICLES system and the outside world.

The functionality and behavior of the components mapped from the user scenarios and workflow descriptions need to be tested. To this end we developed test beds for the Arts & Media and Science domains. Both test beds are based on the integration framework and are coordinated and managed by Jenkins. They share some common components and tools, but they use different technologies. Tests were run successfully on the first version of the test beds and are described in this document.

2 Introduction & Rationale

2.1 Context of this deliverable

This document is the second deliverable from the PERICLES WP7 Integration and Test Beds. The description for this deliverable says: “Describes and documents the first iteration of the test beds for media and science”. The deliverable describes the first version of the test bed, which is based on the initial version of the integration framework as described in D7.1.1 First version of integration framework and API implementation. It also marks the milestone MS3 Initial version of prototypes, which relates to the first iteration of the evaluation of the integrated prototype.

2.1.1 PERICLES project

Modern concepts of long-term archiving need to incorporate continually evolving environments. This includes not only the changes in technology but also evolving semantics.
To validate the approach and the proposed solutions, the research will apply the outcome to two distinct domains, both of which differ from the traditional library archive:

• Arts & Media domain, with a variety of complex, large-scale and dynamic media data from TATE, such as digital images and videos, born-digital data, and software-based art installations;
• Science domain, with experimental and contextual scientific, operational, and engineering data from the European Space Agency (ESA) and International Space Station (ISS), like raw data, operation commands, calibration curves, etc.

Not only does the data of each case study differ, but also their respective stakeholders: artists, archivists, historians, researchers, scientists and engineers. The interest in the data depends on the stakeholder, and it will evolve over time, as will the technologies and the semantics. The PERICLES project seeks to assure that the data generated today will be available and useful for the next generations of users. These challenges are germane to both cases as shown in the following examples:

• In the media case: in order to ensure that a digital video artwork remains playable over the next hundred years without losing fidelity with its original, it is necessary to record different types of information: how the data was produced, its color, format and other properties, but also how to represent the format changes.
• In the science case: in order to preserve experimental data in a way that would allow future researchers to replicate or continue the original work, it would be necessary to record important experimental data like operation commands, ground and orbital conditions, etc., but also contextual data like algorithms, configurations and operation activities.

2.2 What to expect from this document

Deliverable 7.3.1 covers the first version of the test bed integration.
The description of the current test beds includes the common technologies, domain specific technologies and the technical infrastructure. This document also provides information about tests performed on the test beds and the relations between the components and the functionality of the test beds. Furthermore, the basis of the test beds, the test scenarios, and their relation to the user scenarios and requirements is demonstrated. This document also gives an update of the integration framework defined in deliverable 7.1.1. This update includes new concepts for the communication between components, workflow management and data management. The document also describes the next steps as well as the roadmap of the whole cycle of iterative test bed implementations.

2.3 Document structure

In Chapter 3 a short summary of the user scenarios and the cross-reference to work package WP2 is presented. It also describes the concept of the component mapping applied for the first versions of the test beds. The integration framework used for the test beds is described in Chapter 4. This chapter gives an update to deliverable D7.1.1 and introduces a new concept. The current status of the test beds is explained in detail in Chapter 5. This chapter is divided into two parts for the Arts & Media and the Science case. In Chapter 6 we present a roadmap for the development of the test beds and the integration of tools and components over the complete course of the project. We conclude in Chapter 7 and give the sources in the Bibliography and an example test plan in the Appendix.
3 User scenarios

As mentioned above (see Section 2.1.1), PERICLES distinguishes two domains for which the project addresses the long-term preservation of digital data in evolving ecosystems: the arts & media domain and the science domain. User scenarios are stories describing the interactions between the user and the system for reaching a goal, and workflow descriptions capture relevant process information. User scenarios and workflow descriptions are the source of user requirements, which guide the architecture development, and will act as the basis for tests and evaluations. User scenarios from the arts & media domain and workflow descriptions from the science domain are used as input to WP6 to derive common components and tools through the methodology explained in Section 3.3. During the project, scenarios and workflow descriptions will evolve through the addition of components and results from the PERICLES research.

3.1 Arts & Media domain

Within the arts & media domain, working groups specified and analysed four different sub-domains (as reported in D2.1 Requirements gathering plan and methodology):

● Born digital archive collections
● Digital video artworks
● Software-based artworks
● Media productions (Tate Media)

These sub-domains cover a wide range of long-term digital preservation challenges and each of them represents specific issues regarding semantic change. Several scenarios, user roles and requirements have already been identified for these four sub-domains (D2.3.1 Media and science case study functional requirements and user descriptions).
The main workflow areas of those scenarios are:

● Preservation planning & policies
● Appraisal & ingest
● Archive & data management
● Access & re-use

3.2 Science domain

Due to the different remits of the domains, the Space Science stakeholders are less engaged with effective and pervasive preservation in comparison to the Arts & Media stakeholders. Therefore, rapid case studies, which are very high-level user scenarios, were defined to bootstrap interviews with different types of stakeholders, in order to extract detailed data, process and workflow information. From this, user requirements could be extracted for the different user categories, in particular by analyzing their workflows with respect to dataset production and utilization. The initial set of components reflects the gathered requirements and focuses on:

• continual ingest of data;
• relations and semantic extraction from data;
• information accessibility and presentation;
• creation of meaningful archival packages for atomic pieces of data.

3.3 Component mapping

A key aspect in developing the test bed is to match parts of the user scenarios and workflow descriptions to components and technologies. Some components had already been pointed out in the user requirements by the pilot users (Tate and B.USOC), and other components need to be developed in order to resolve technology gaps. We established a methodology¹ to allow results from WP2 and WP6 to inform WPs 3-5. The objective is to track technological gaps detected with the help of the requirements. The solutions might either be provided by those components anticipated in the DoW, or encourage partners to develop specific components not anticipated in the DoW.
We applied the following methods:

a) Building a step-by-step visual flow of the user scenarios for discussions with WPs 3-5. We decided to use process flows as UML diagrams (see Figure 1: Example of a process diagram with components mapping), which are straightforward diagrams showing how steps in each process fit together. They facilitate the communication of how processes work and clearly document how a particular job gets done. Within those diagrams, we tried to match components against most of the UML symbols (elongated circles). Partners from WPs 3-5 then tried to realize extended uses for the components they had already been building (described in the DoW), in order to fill the orphan component boxes in the diagrams. Moreover, partners sought to uncover areas that were missing a component allocation, which might prove useful for other potential Long-Term Digital Preservation (LTDP) cases. By following the above process we ended up with UML diagrams that had most of the symbols matched to components which are either already available on the market, currently under development, or anticipated to be created as part of the PERICLES project.

b) After this process had been carried out, some parts of the diagrams were not matched with any component. The corresponding activities in the workflow cannot be automated and need to be performed manually to continue the process. In the future such "gaps" may be filled by components built or made available by other projects.

c) Filling in component information in the registry document. In month 7 the partners started to provide information about components they were building or intended to build. These were referenced to very early versions of the user scenarios and could only provide very vague descriptions of the functionality that they planned to cover. Once the scenarios evolved and reached a more mature stage, the component list was re-activated and updated.
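The registry entries described in method c) can be pictured as simple structured records, queried per scenario step to expose unmatched "gaps". The following sketch is a hypothetical illustration only: the field names and the lookup helper are assumptions, not the actual PERICLES registry schema.

```python
# Hypothetical structure for one entry in the component registry document.
# Field names ("name", "provider", "scenario_steps", ...) are illustrative
# assumptions, not the real registry format.
component_registry = [
    {
        "name": "MetadataExtractor",
        "provider": "WP4",
        "status": "under development",     # e.g. "available", "planned"
        "scenario_steps": ["appraisal", "ingest"],
        "description": "Extracts technical metadata from submitted files.",
    },
]

def components_for_step(registry, step):
    """Return the components mapped to a given scenario step, so that
    steps with no matching component (gaps) can be spotted."""
    return [c["name"] for c in registry if step in c["scenario_steps"]]

print(components_for_step(component_registry, "ingest"))   # -> ['MetadataExtractor']
print(components_for_step(component_registry, "access"))   # -> [] (a gap)
```

An empty result for a step corresponds to the unmatched diagram symbols of method b): an activity that must be performed manually until a component fills the gap.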
¹ Described in more detail in the deliverable D6.1 Specification of architecture, component and design characteristics (M20).

Figure 1: Example of a process diagram with components mapping

4 Integration framework

The framework for integrating and testing the components and principles of the PERICLES approach allows a tester or developer to monitor and initiate a test run at different levels. It was defined as a set of concepts and structures for testing combinations of software modules in a controlled and systematic manner, and provides the infrastructure and procedures to test large concepts and operations that span multiple systems.

4.1 Integration framework architecture

The basis of the implementation of the integration framework is the architecture (see Figure 2) as described in the upcoming deliverable D6.1 Specification of architecture, components and design characteristics from WP6. Since the architecture is still being developed and refined, the following definitions serve only to provide a better understanding of the building blocks that form the integration framework and are subject to change. The integration framework architecture is based on the OAIS model and is comprised of the following high-level architectural blocks:

1. Workflow Engine
2. Ingest
3. Archival Storage
4. Data Management
5. Access

Figure 2: Integration Framework Architecture

4.1.1 Workflow engine

At the center of the architecture is the workflow engine. It is responsible for orchestrating the actions of preservation components.
A workflow is a sequence of actions that components need to take in order to ensure the preservation of digital objects that are ingested into the digital preservation system. The workflow engine achieves this by providing the necessary functionality to define, store, retrieve, update and enforce the workflows that govern the system. The Workflow Engine is composed of a number of smaller functional blocks (see Table 1).

Table 1: Functional blocks of the Workflow Engine.

Access Control List (ACL) Management: Managing access rights for users and components to the digital material stored within the digital preservation system.
Workflow Management: Enabling creation, update and validation of the workflows that govern the digital preservation system.
Workflow Repository: Storing the workflows that govern the digital preservation system and delivering them to other functional blocks when requested.
Decision Point: Determining when preservation actions are required, creating the necessary action chains (workflows) and triggering the Enforcement Point.
Enforcement Point: Executing generated action chains (workflows), including managing errors etc. Execution happens by triggering component handlers.
Component Registry: Identifying, describing, and providing the location of available preservation component handlers.

The workflow engine as described by the integration framework is, like most parts of the PERICLES architecture, modular, and allows blocks to be exchanged for other components/tools that perform the same functions; workflow management, the workflow repository and ACL management are no exception. The use of a third-party workflow system is, and should remain, an option, with the addition of a facade or other intermediate layer on the third-party system's API to conform with the interface of the integration framework's workflow engine.
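As an illustration only, the interplay between the Decision Point, Component Registry and Enforcement Point of Table 1 could be sketched as follows. All names and data shapes here are assumptions; the architecture does not prescribe this API.

```python
# Hypothetical sketch: the decision point builds an action chain from
# a registry of components, and the enforcement point completes each
# step before moving to the next, threading the data package through.

def build_action_chain(event, registry):
    """Decision point: map a preservation event to an ordered list of
    component callables, looked up in the registry."""
    return [registry[action] for action in event["required_actions"]]

def enforce(chain, data):
    """Enforcement point: run each component to completion before
    forwarding the result to the next one in the chain."""
    for component in chain:
        data = component(data)
    return data

# Toy registry mapping action names to 'dumb' components.
registry = {
    "checksum": lambda d: {**d, "checksum": "ok"},
    "store":    lambda d: {**d, "stored": True},
}

event = {"required_actions": ["checksum", "store"]}
result = enforce(build_action_chain(event, registry), {"id": "sip-1"})
```

In a real deployment the components would be remote handlers called over HTTP rather than local callables, but the completion-before-forwarding discipline is the same.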
As a result of preservation planning, workflows and policies are established to govern the digital preservation system in order to ensure continued access to the stored digital objects over time. Preservation events, such as the deposit of new material or an alteration of preservation workflows, are detected by the decision point, which will determine whether the current workflow mandates any further actions. If so, the decision point will attempt to determine the action or sequence of actions that needs to be taken. It will then use the component registry to determine the appropriate components for the required actions, create an action chain (a workflow of the necessary components), and forward this to the enforcement point. The enforcement point is responsible for ensuring that the work of each component in the action chain is completed before forwarding the task to the next component in the chain, and so on. The way in which it does this is discussed in section 5.3. Once the action chain is complete, the enforcement point will inform the decision point in order to determine whether further steps are required.

4.1.2 Ingest
The ingest block is responsible for accepting a Submission Information Package (SIP) from a producer, and carrying out the steps necessary to create an Archival Information Package (AIP) suitable for storing in archival storage. This process is likely to be very use-case specific, involving a number of sub-components acting in concert. The way in which such components are chained together is discussed in section 4.3.

4.1.3 Archival storage
The archival storage block is responsible for the secure storage of the bit streams (the sequences of 1s and 0s) that encode the digital content of each object.
It will need to have some form of internal cataloguing (that is, a mapping of the path or identifier of an object to the physical location of the data in the underlying storage device(s)) in order to provide ongoing access to the object, and will also be responsible on some level for allowing or denying access to the various digital objects through enforcement of the ACLs that are set and managed through the policy engine. The archival storage block may be responsible for versioning of the digital objects stored within it. It may achieve secure storage through a combination of strategies, the simplest being replication, but alternative or complementary approaches may also involve RAID, erasure coding etc.

4.1.4 Data management
The data management block is responsible for ensuring the preservation of a digital object's function.

4.1.5 Access
The access block is responsible for handling a request for an object from a consumer. When such a request is made, the access block carries out the necessary steps to convert the AIP into a Dissemination Information Package (DIP) that is appropriate to the consumer who made the request. This process is also highly use-case specific, potentially involving activities such as format migration, information redaction etc.

4.2 Connections and connection types
Communication between components is performed through a communication layer in the form of an HTTP API. Components registered in the system need to conform to a specific interface that allows other components and the workflow engine to exchange information and data with them. Such an architectural approach allows for a separation of concerns, meaning the functional API can be layered with other services, and keeps interactions between components simple. For the design of the API the RESTful architectural style has been adopted, using JSON as the method of transporting information.
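Under this JSON/REST approach, the description a component publishes on registration might look like the sketch below. The field names are assumptions for illustration, not a normative PERICLES schema.

```python
# Hypothetical registration record: a namespace/identifier plus a
# list of actions/endpoints with input and output types, and a
# minimal structural check before accepting it.

REQUIRED_FIELDS = {"namespace", "actions"}

description = {
    "namespace": "eu.pericles.example.migrator",   # hypothetical component
    "actions": [
        {"endpoint": "/migrate",
         "inputs": ["image/tiff"],
         "outputs": ["image/png"],
         "params": ["--quality"]},
    ],
}

def is_valid_description(desc):
    """Reject registrations missing the minimum descriptive fields."""
    return REQUIRED_FIELDS.issubset(desc) and all(
        {"endpoint", "inputs", "outputs"} <= set(action)
        for action in desc["actions"])
```

A registry built on such records can answer the question the workflow engine needs answered: which handlers offer a given action, and what do they consume and produce.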
The actual data packages are provided either encoded or as URIs, depending on their size and use. Each component is able to provide a description of the actions it can perform, together with its inputs, outputs and parameters. The API allows the components to accept action chains with the work they must perform, together with two additional actions or chains: one for when their work is completed successfully, and one for error handling. All components need to be registered with the workflow engine and provide a namespace/identifier, as well as a list of actions/endpoints and input/output types.

4.3 Handlers
The following sections describe the newly introduced concept of the handlers used for the communication of the components within the system.

4.3.1 What are they?
The handlers are the basis of the integration framework for PERICLES. The handlers function as the communication points for each major entity within the system. The handlers deal with validating incoming requests, exercising the functionality of developed components, storing and transferring the results of component functions, and initiating the communications with further components to be used. The handlers do not perform any operations which change or alter the data contained in xIPs; only components can alter and change data, but a component should not have knowledge outside itself. The components should in essence be highly functional 'dumb' pieces of software. This means they do their function and only their function; the handlers deal with the rest of a PERICLES system and the outside world. This simplifies the job of the component developers: they only have to provide a method, be it a shell script, executable or callable function, to the handler, which will be used to exercise the component functionality.
Additional functions can be added to the handler based on the component's needs, but these are designed to be plugins to the handler and are not mandatory. These functions could include data validation, or checking whether the right data format was passed to the handler for the component to operate on.

4.3.2 Why do we need them?
In order to facilitate the creation and operation of situation-aware workflows within a system, the individual end-point handlers for components must have the facilities described, in order to allow an auditable and recoverable processing workflow trail to be created. The handlers described here would allow the overall system to track and adapt to changing circumstances, while providing the means to avoid repeating process steps in the event of a component failure. The local storage of the handler would store processing workloads for a given period of time as defined by a workflow coordinator; these are likely to be deleted when final storage is achieved. In concert with this, the prospect of component failure across any complex and multi-node system is an ever-present risk. The logging, and the ability to adapt the targets for next steps in the handler, allow this to be mitigated, while providing a monitoring mechanism for the overall processing workload. Providing the means to adapt workflows on the fly, and to keep an audit trail with local backups of current processing, will allow the overall architecture to respond better to individual component failure.
4.3.3 What they have to do
Table 2 summarises the operations that a handler must be able to perform:

Table 2: Operations of a handler
• Get Incoming Request
• Validate and Respond to Request
• Pull and Validate Data
• Exercise Component
• Store to Local Storage
• Get Next Target Handler
• Perform Request
• Accept Response
• Error Handling: Local Response / Central Response
• Garbage Collect Local Data Store on Request
• Provide Status of Handler and Component

4.3.4 Anatomy of a handler and a component
Table 3 gives a view of the different parts of a handler and its relationship to a functional component. The handlers are to provide the functions listed previously, and the anatomy below should be used as a template, so that the handlers can have a plugin basis for adaptation to changing technologies and targets.

Table 3: Handler and Component Anatomy (Green are framework developed, Purple are component developed)

Incoming Request   | Workflow Coordination | Outgoing Request
Communications     | Communications        | Communications
Validation Adaptor | Local Storage         | Validation Adaptor
                   | Component Adaptor     |
                   | Component             |

4.3.5 Handlers in a chain
To function as part of a PERICLES system the handlers must be connected in a chain (see Figure 3). These chains form a workflow within the overall archive system. The workflows need to be stored at central and/or local (distributed) levels; this requires a workflow repository. This would form part of the preservation and data management layers in the OAIS model. The workflows will be based upon the policies within the system, as determined by the archive operators, and where possible translatable into machine-executable workflows.
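The handler operations of Table 2 can be sketched as a minimal skeleton class. This is an illustration only: the class, its method names, and the in-process chaining are assumptions; a real handler would call the next handler over HTTP rather than directly.

```python
# Skeleton handler: validate the incoming request, exercise the
# wrapped 'dumb' component, store the result locally, then ask the
# workflow manager for the next target and forward the workload.

class _ChainManager:
    """Minimal stand-in for the workflow manager of section 4.3.5."""
    def __init__(self, chain=None):
        self.chain = chain if chain is not None else {}

    def next_handler(self, wid):
        return self.chain.get(wid)

class Handler:
    def __init__(self, name, component, workflow_manager):
        self.name = name
        self.component = component          # shell script / callable etc.
        self.workflow_manager = workflow_manager
        self.local_storage = {}             # workload instance id -> result

    def receive(self, payload, data):
        if not self.validate(payload):
            return 400                      # invalid request
        result = self.component(data)       # Exercise Component
        self.local_storage[payload["wiid"]] = result   # Store to Local Storage
        nxt = self.workflow_manager.next_handler(payload["wid"])
        if nxt is not None:                 # Get Next Target Handler
            nxt.receive(self.advance(payload), result)
        return 201

    def validate(self, payload):
        return {"xid", "wid", "wiid"}.issubset(payload)

    def advance(self, payload):
        wf_id, step = payload["wid"].split("/")
        return {**payload, "wid": f"{wf_id}/{int(step) + 1}"}

    def status(self, wiid):
        return "Finished" if wiid in self.local_storage else "Pending"

wm = _ChainManager()
h2 = Handler("y", lambda d: d + "!", wm)
h1 = Handler("x", str.upper, wm)
wm.chain["1/0"] = h2                        # after step 1/0, target h2
```

Error handling, garbage collection of local storage, and data pull are omitted here; they would hang off the same skeleton as further methods or plugins.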
The need for a contactable mechanism for determining the next step in a workflow requires the use of a workflow manager. At present, the development of the test bed uses an incrementally developed approach for a prototype, though in future versions this could conceivably be replaced by a full workflow system such as Taverna.

Figure 3: Chained Handlers

Figure 3 shows the concept of chaining the handlers together into a workflow. Workflow management is where a handler can determine its next step in a workflow, i.e. where the data has to go. As for the actual mechanics of workflow management: it could be a centralized component, though this would introduce a single point of failure and possibly performance issues, or it could be distributed, where locally cached copies of the part of a workflow relevant to a handler are stored. This would introduce the need for a refresh mechanism, but would remove the single point of failure. So far, this introduces three major functional aspects to the framework:
• Components: These are the functional unit workers. These should only be focused on the task they were designed for and should not be required to communicate beyond themselves.
• Handlers: These are the wrappers and communication hubs that allow a workflow to be followed through the system. These should follow the required behaviors listed and have the ability to respond to status inquiries.
• Workflow Manager: This should store the standard workflows as templates, which can be instantiated with a workflow instance identifier; handlers can communicate with the manager to determine the next endpoint to target, including the parameters that are required to call that endpoint.

Communication in the system operates over HTTP using the CRUD actions GET, POST, PUT and DELETE.
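The workflow-manager lookup a handler performs, i.e. finding the next endpoint and its required parameters, could be sketched as below. The workflow structure follows the JSON of section 4.3.7; the function name and return shape are assumptions.

```python
# Sketch: given a workflow (as in Code 1 of section 4.3.7) and the
# index of the current step, return the next endpoint and parameters,
# or None when the workflow is complete.

def next_step(workflow, current_index):
    steps = workflow["wf"]
    nxt = current_index + 1
    if nxt >= len(steps):
        return None
    return steps[nxt]["url"], steps[nxt]["params"]

workflow = {
    "id": "1",
    "wf": [
        {"id": "componentX", "url": "http://127.0.0.1:7000/handlers/x",
         "params": ["--in INFILE", "--out OUTFILE"]},
        {"id": "componentY", "url": "http://127.0.0.1:7000/handlers/y",
         "params": ["--validate", "--in INFILE"]},
    ],
}
```

In the distributed variant discussed above, the same function would run against a locally cached slice of the workflow rather than the central copy.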
The standard method of calling another handler is via the POST method with a JSON payload. The handlers must provide, at a minimum, the GET and POST methods. The GET method (see Table 4) should provide, at a minimum, a status method that will return the current status of all local workloads at a given component. To extend this, the GET method should provide a method for getting the status of a specific workload.

Table 4: GET methods for getting the status of workloads

/status: Returns a list of all current workloads and their status.
/status/<ID>: Returns the status of the workload identified by the ID.

The POST method (see Table 5) should allow for the transmission, reception and validation of a JSON payload containing the information required for the handler to exercise the component that it represents, and then move the workload to the next handler.

Table 5: POST method provided by a handler

/<HANDLER> with JSON payload: Returns a validation and response code:
• Valid Payload: HTTP 201 Created
• Invalid Payload: HTTP 400 Bad Request

The handlers should be able to query the workflow manager server to get a workflow response, allowing them to determine the next handler to be contacted. This facility should be exposed by the server via a GET operation (see Table 6).
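The GET status endpoints of Table 4 can be sketched as a pure routing function, so the status logic can be exercised without a running HTTP server. The function name and workload-store layout are assumptions.

```python
# Sketch: route GET /status and /status/<ID> over a dict mapping
# workload ids to their current state, returning (HTTP code, body).

def handle_get(path, workloads):
    if path == "/status":
        return 200, [{"id": wid, "status": s} for wid, s in workloads.items()]
    if path.startswith("/status/"):
        wid = path[len("/status/"):]
        if wid in workloads:
            return 200, {"id": wid, "status": workloads[wid]}
        return 404, None
    return 400, None
```

Fronting this with an actual HTTP server is then a thin wrapper, which keeps the testable logic separate from the transport.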
Table 6: GET methods for retrieving workflows

/workflows: Returns a list of current workflows.
/workflows/<ID>: Returns the workflow identified by the ID.
/workflows/<ID>/<STEP>: Returns the step <STEP> of workflow <ID>.

4.3.6 Status responses
In order to coordinate workflows by monitoring the readiness and activity of components in an archive system, the handlers will be required, as previously shown, to respond to status queries from another entity. To codify the possible responses and levels of response, the following types of status are proposed as an initial version. The handlers should respond to a status request with a reply composed of the elements detailed in Table 7.

Table 7: Status response of a handler

Response Level One:
• Alive: Handler and component are active and will respond with further status information levels.
• Dead: Handler is not able to process requests.
Response Level Two:
• Idle: Handler has no active workloads.
• Queue: Handler has a queue of active workloads and continues with further status information.
Response Level Three:
• Pending: Workload is in queue awaiting processing.
• Working: Workload is actively being processed.
• Blocked: Component is not responding to handler.
• Error: An error has occurred during the processing of a workload.
• Finished: Workload has finished.

4.3.7 Payloads and workflows
Workflows (Code 1 and Code 2) and payloads (Code 3 and Code 4) are described by JSON structures; these are lightweight, descriptive, and readily interpreted by many different technologies.
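One possible composition of a status reply from the Table 7 vocabulary is sketched below; only the vocabulary comes from Table 7, while the reply shape and the grouping of Alive/Dead as the first level are assumptions.

```python
# Sketch: compose a three-level status reply for a handler. Level one
# says whether the handler is responsive at all, level two whether it
# is idle or has queued work, level three gives per-workload states.

WORKLOAD_STATES = {"Pending", "Working", "Blocked", "Error", "Finished"}

def status_reply(alive, workloads):
    """workloads: dict of workload id -> level-three state."""
    if not alive:
        return {"level_one": "Dead"}
    reply = {"level_one": "Alive",
             "level_two": "Queue" if workloads else "Idle"}
    if workloads:
        assert set(workloads.values()) <= WORKLOAD_STATES
        reply["level_three"] = dict(workloads)
    return reply
```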
4.3.7.1 WORKFLOWS

{ "id": "1",
  "wf": [
    { "id": "componentX",
      "url": "http://127.0.0.1:7000/handlers/x",
      "params": ["--in INFILE", "--out OUTFILE"] },
    { "id": "componentY",
      "url": "http://127.0.0.1:7000/handlers/y",
      "params": ["--validate", "--in INFILE"] }
  ]
}

Code 1: Example Workflow

A workflow, such as that in the above example, is composed of an identifier and a set of steps, which identify component handlers and their inputs. Below is the skeleton JSON:

{ "id": <Unique Workflow Identifier>,
  "wf": [                                  <Set of Steps in the Workflow>
    { "id": <Component Identifier>,
      "url": <Component Endpoint>,
      "params": [<Parameter Array>] }
  ]
}

Code 2: Workflow JSON Skeleton

4.3.7.2 PAYLOADS

Each POST operation to a handler must include, as part of the call, a JSON payload which identifies the data source, the xIP or workload identifier, the workflow instance identifier, the workflow template identifier and step, and the parameters for the handler to use with the underlying components. The payload should be well-formed JSON, and the handler can apply further validation to the parameters section and possibly the data source identifier.

{ "payload": {
    "xid": "1",
    "xuri": "https://github.com/pericles-project/tests.git",
    "wid": "1/0",
    "wiid": "389982-1293081",
    "params": []
  }
}

Code 3: Example Payload

{ "payload": {
    "xid": <Workload/xIP Identifier>,
    "xuri": <URI for Data Source>,
    "wid": <Workflow ID/Step>,
    "wiid": <Workflow Instance>,
    "params": <Parameters>
  }
}

Code 4: Payload Structure

The payload consists of five elements:
• xid: the identifier for the current xIP
• xuri: URI for the data source from which to obtain the xIP content to be operated on
• wid: workflow ID and the step the component is at
• wiid: unique operating instance ID of the workflow
• params: array of parameters for the underlying component
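The validation a handler could perform on such a POST payload, returning the HTTP codes of Table 5, might be sketched as follows; the function name is an assumption.

```python
# Sketch: check the five payload elements listed above and return
# 201 (Created) for a well-formed payload, 400 (Bad Request) otherwise.

import json

REQUIRED_KEYS = {"xid", "xuri", "wid", "wiid", "params"}

def validate_payload(raw):
    try:
        body = json.loads(raw)
    except ValueError:
        return 400                      # not well-formed JSON
    payload = body.get("payload", {})
    if not REQUIRED_KEYS.issubset(payload):
        return 400
    if not isinstance(payload["params"], list):
        return 400
    return 201

good = json.dumps({"payload": {"xid": "1",
                               "xuri": "https://github.com/pericles-project/tests.git",
                               "wid": "1/0", "wiid": "389982-1293081",
                               "params": []}})
```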
4.4 Data object and data management
The basic unit handled by the major systems in an archive, as described by deliverable D7.1.1, is the transmitted data package. These are bags as defined by the BagIt library; this is a defined format that includes manifests and fixity-value validation. The intention is to use this as the basic interchange and storage format: individual components within the major system operate on the contents of bags, but output a set of bags for handling by the integration framework of an archive. Having a defined interchange format allows the communications of the archive to be removed from individual components to another abstraction layer, thus allowing components to be tightly focused on a single purpose. As part of using bags as the basic transfer and storage unit, a unique identifier for each bag stored in the overall archive is generated. This identifier must be unique only within a given archive and is intended for internal usage only. This identifier allows artworks and materials to be associated with one another through the data management layer and via any storage metadata. The use of the identifiers allows networks of AIPs to be followed, so that new information can be added to existing artworks and documentation. The intention is to allow any AIP to be related to another. It is important to keep the identifiers in the metadata in the storage, to allow reconstruction of AIP networks in the case of catastrophic failure of the data management layer. This provides a method to follow the evolution of an object within the archive. To enable this functionality to be consistent, it is important that individual implementations of the storage and data management layers conform to consistent and uniform interfaces designed to expose the required functionality.
4.4.1 Information package versioning
One part of the data object management design has to cope with different versions of the xIP structures being transferred within the system. It can be envisioned that the requirements for the information and its structure will vary over time, and component needs may change over time as well. This could result in updates to how a xIP is structured and defined. This is to be expected in a long-lived system. To help accommodate this, the components and handlers must be able to identify the xIP versions which can be received as input and transmitted as output. This information will be required by the workflow coordination and policy management components, such that alternate paths for workflows can be defined. On the input side it is important to maintain, at a minimum, backwards compatibility for xIP versions.

4.4.2 Local storage and version management for workflow
A potential solution to the transfer and management of changes to data flowing through the archive system process is to give the handlers local storage under version management, which is used to store the data, changes and audit logs.

Figure 4: Data Pull

Figure 4 shows the path of data being pulled along a chain of handlers between the local data stores (yellow circles). The diagram shows a normal, fully functional case: once a handler receives a request (POST) from another handler or client, the targeted handler will respond with an okay (HTTP 200 OK) or resource created (HTTP 201 Created) response. Once the request is being processed, the new handler will pull data from the previous handler; this will include a change log and a file set in the form of a repository. This repository could be implemented via Git or another source control system.
The logic behind this is to create a series of auditable workloads, which will allow processed work to be retained in local storage at components in the event of a failure in a workflow. This need not be linked to any specific technology; the main criteria are:
• Pull and Push Mechanism for Data Transfer: this will allow data to be transferred between components.
• Delta Storage Log: the change log for the digital objects; this should allow the changes and processing of the objects to be tracked.
• Audit Log Creation: a record of who made changes and when.

There are potentially a few issues with this approach:
• Changes and Tracking on Binary Files: Current methods for tracking changes in binary files are not efficient and could cause performance or storage problems when binary files undergo many changes.
• Scheduled Cleanups: How are the local repositories removed when no longer required by the system, and how do the handlers get informed to remove data?
• Workflow Mapping: At what level should a component and its handler know about their place in a workflow? Can a balance be found such that a component can intelligently balance its workloads and be able to adapt to failures in upstream or downstream processing elements?
• What is being stored: What should be stored in the local storage? Should it be an entire repository, or a subset of the repository that the component operates on?
• When and how transfers are effected: Are the transfers synchronous with the requests, or queued, and what will be used to effect the transfer of content?
• What is being transferred: Are changes and deltas for files being exchanged, or are entire repositories being transferred?
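The three criteria above could be sketched, technology-neutrally, as a minimal handler-local store; plain dicts and lists stand in for a Git repository, and all names are illustrative.

```python
# Sketch: a local store recording current content, a delta log of
# changes (old value -> new value per object), and an audit log of
# who performed which action on which object, and when.

import time

class LocalStore:
    def __init__(self):
        self.objects = {}        # object id -> current content
        self.delta_log = []      # (object id, old content, new content)
        self.audit_log = []      # (timestamp, actor, object id, action)

    def push(self, actor, obj_id, content):
        old = self.objects.get(obj_id)
        self.objects[obj_id] = content
        self.delta_log.append((obj_id, old, content))
        self.audit_log.append((time.time(), actor, obj_id, "push"))

    def pull(self, actor, obj_id):
        self.audit_log.append((time.time(), actor, obj_id, "pull"))
        return self.objects[obj_id]
```

A Git-backed variant would get the delta log for free from commits and the audit log from commit authorship, at the cost of the binary-file tracking issues noted above.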
Version management is a common and standard practice in software development and, with some investigation, could be adapted for use in the handler layer of the framework. This introduces an element of redundancy to the storage of processing data: if a component or chain fails, the data processed up to that point is still stored by other components, and possibly by a centralized backup (depending on policy). Due to delta storage, only specific artifacts need to be marked as origin or important copies to be stored in their entirety; other parts can be regenerated from the delta logs. This mirrors the steps that the user scenarios are exploring to record information about the pre-ingest processes used to create the data to be stored in an archive. Using version management to form an AIP in the storage layer means that the original data, the changes and additions to the data and, importantly, the record of how these changes were made and in what order, can be stored as an atomic unit.

4.5 Integration and testing
The approach being taken, as laid out in deliverable D7.1.1, is a test-driven approach which will allow for unit, integration and system testing by developers of components in the test bed, while allowing scenario and disaster testing to be implemented as higher-level automated testing. It is important to note that system testing and scenario testing are not the same: scenario testing is against specific criteria laid out by the stakeholder for a given set of circumstances, while system testing is more general operation and behaviour testing. To accommodate the wide range of components and target parts of an archive system, the major infrastructure components in the test bed will be continuously operational. This reduces the potential scope for error and complications if these components were to be redeployed for each test run.
To try to ensure no carry-over between test runs, reset scripts will be employed to delete and recreate a basic data set in the components for the tests to operate on. These tests will be coordinated and managed by the Jenkins continuous integration framework. Jenkins will obtain the source code from the repositories identified by the developers, test the structure of the component and the compilation of the component, and then start on the different levels of testing. Initially, most components should be provided with a set of test cases and skeleton code; these test cases should fail if the code is incomplete. Note that non-compiling code will automatically fail all testing. Part of this will include the generation of documentation for a component; this includes source code documentation and user manuals where appropriate. An example test plan is located in the appendix. Prior to code submission, the developers should make sure that the test bed has the compilers and system libraries required by their code. Should third-party libraries be required, a dependency management solution must be employed. Where such a system is used, the developers should contact the test bed administrator to ensure that it is installed and available to Jenkins. The tests will report back on the success of compilation and the success/failure of the tests implemented at that point in time. The reporting mechanism will be the dashboard on Jenkins (http://129.215.213.251/jenkins/). This will include a history of recent builds, execution and compilation times, and a link to a generated website if the build script and Jenkins provide support. It is the developer's responsibility to check the status of the components and the tests.
In D7.1.1 the main types of test were described; current and future work builds upon the definitions of unit, integration, system and scenario testing. Unit testing of components is the responsibility of the component developers; these tests are about the functionality of the component. The framework components for communications and coordination will be unit tested by the framework developers. Both sets of unit tests should be available to the continuous integration testing management system as precursors to further tests. Integration testing will fall into four main categories of inter-component testing within the framework. These categories are:
• Status Tests - These are common tests for ensuring the availability and monitoring capabilities of the handlers and infrastructure coordination.
• Communication Tests - Tests of the ability to move data and commands between the handlers and components, in small-scale tests between isolated component groups.
• Behavior Tests - Tests of the functionality and consistency of the component groups under test conditions, which should include normal and error cases.
• Versioning Tests - Tests of the ability to cope with variability in the versions of components and transmitted data packet structures.

System testing will be focused around end-to-end general operation of components within the integrated system. These follow the pattern of the integration tests and add two more test categories:
• Replacement Testing - Tests to evaluate how the replacement of components can be accomplished and, where required, indicate the need for migration planning.
• Disaster Testing - Tests of archive system resilience to induced major component failure and data loss.

The last category of tests is scenario tests, as determined by the two case studies. These tests will follow similar patterns to the previous types, but will be centred around specific actions and responses.
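A hypothetical example of a Status Test in the first integration category above is sketched below, written with Python's unittest so that it could run under Jenkins. The queried handler is a stub; only the endpoint name follows Table 4, everything else is an assumption.

```python
# Sketch: a status test probing a handler's GET /status endpoint.
# StubHandler stands in for a deployed handler during an isolated test.

import unittest

class StubHandler:
    def get(self, path):
        if path == "/status":
            return 200, {"level_one": "Alive", "level_two": "Idle"}
        return 404, None

class StatusTest(unittest.TestCase):
    def test_handler_reports_alive(self):
        code, body = StubHandler().get("/status")
        self.assertEqual(code, 200)
        self.assertEqual(body["level_one"], "Alive")

    def test_unknown_path_is_404(self):
        code, body = StubHandler().get("/nope")
        self.assertEqual(code, 404)
        self.assertIsNone(body)
```

Replacing StubHandler with an HTTP call against a deployed handler turns the same test into a communication test between isolated component groups.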
5 Test beds
The functionality and behavior of the components and tools mapped from the user scenarios and workflow descriptions, and their requirements, should be tested. To this end we developed test beds for both domains, Arts & Media and Science. A test bed is a project development platform that provides an environment for rigorous, replicable and transparent testing of new tools and technologies. It includes software, hardware and networking elements. The test beds are based on the integration framework previously detailed, allowing developers to test a particular component or tool in an isolated fashion, as the framework is implemented around the component. In that way, the component behaves as if included in the program/system to which it will later be added.

5.1 Common technologies
5.1.1 Test management: Jenkins
Jenkins (Jenkins, 2013) is an open source fork of the Hudson project and a continuous integration system for software testing. The system can execute local and remote test suites, monitoring the progress and status of execution, with reports being generated to a web interface. The system can be administered through a web interface. The test system can interact with remote virtual machines, code repositories and build systems. The flexibility of the system, in how reports are generated and how tests are developed and added to the execution plan, is well documented and explained. Key factors that have helped Jenkins to become an accepted system are its easy and flexible configuration, the plugin model for new tools and processes, the reporting mechanisms, and the distributed build and execution system. For PERICLES, Jenkins offers a good continuous integration tool which can evolve as the project matures, removing concerns about the work expanding beyond the testing capabilities.
The Jenkins dashboard for the test status is accessible via the URL (A&M): http://129.215.213.251/jenkins
The current tests being hosted by Jenkins are build and unit testing for the framework components that form the connections between the ingest subsystem and the storage layers.

5.1.2 iRods
To quote the online documentation (iRods, 2013): "iRODS is the integrated Rule-Oriented Data-management System, a community-driven, open source, data grid software solution." iRods provides the functionality to manage large file collections across disparate file systems and storage technologies, providing facilities for data replication, integrity checks and policy-based file migrations. Similar to Archivematica, iRods uses a micro-service architecture monitored and controlled by a configurable and extendable rules system. The micro-service architecture is well defined and has a clearly defined set of interfaces for micro-service components. This could be of use in PERICLES at multiple levels: iRods could function in the ingest, access or storage layers, or a combination of layers. iRods has a large user and development community, with interfaces to multiple languages and different approaches to interaction, including command line scripting and web services. iRods appears to be well placed for the near future in terms of stability, and any cessation of activity in the software will likely be well notified.
(A&M) iRods is hosted on an internal virtual machine. It acts as the storage layer in the current test bed.
(Sc) An iRODS installation, finally, provides the archival storage component for archival packages.

5.1.3 Maven
Maven is a build automation tool used to manage the software build, packaging and deployment process, initially for Java projects but since expanded to support other languages.
This framework supports dependency management and a plugin mechanism for including new and different functions. The plugin nature of Maven follows an exception model, where the default Maven option is used unless an explicit setting is described.

5.1.4 BagIt

BagIt is a file packaging structure for storing and transferring digital content. The structure (or bag) mandates a payload (the content to be stored) and tags (metadata about the file structure). The tags are important as they contribute to the validation and verification of bag contents through checksums and file listings. BagIt is intended to be cross-platform and is supported via libraries in several programming languages. The format has wide adoption in digital libraries.

5.1.5 Topic Maps engine

The Topic Maps engine provides the ability to store and work with both data and metadata (relationships among pieces of data); that is, it is suitable for implementing the data management component. Once an ontology has been defined, data and metadata can be organized in a semantic model by means of the Topic Maps engine. The functionality to query and edit the semantic model is exposed by implementing TMAPI, a standard Java interface.

5.1.6 Web service containers

Where required, the virtual machines run a web service container which hosts the web services fronting each major component.

5.1.7 Vagrant

Vagrant is free and open-source software that provides a reproducible, easy to configure virtual development environment. It works with several virtualization products, such as VirtualBox and VMware.

Vagrant allows all members of a team to run code in the same environment, against the same dependencies, all configured the same way, independently of the platform (Linux, Mac OS X or Windows).
Vagrant makes code development, testing, versioning and revision control easier, as well as the management of virtual machines via a simple to use command line interface.

5.1.8 Workflow engine and management

The basic concept in the test bed is the use of workflows, and to this end the handlers are intended to facilitate communication on this basis. It has been noted that, if this is the intended approach, existing workflow engines and management systems may provide much if not all of the required functionality. To this end, existing workflow engines will be examined for either replacing the handlers or augmenting them. Two of the initial engines to be examined are Taverna and Kepler.

Taverna Workbench is to be examined for this purpose. Developed by myGrid, actively developed with an extensible design philosophy, and used in commercial and academic organisations, it is a prime candidate for workflow design and execution.

Kepler is another workflow design and execution engine that, like Taverna, presents workflows as directed graphs, with the nodes being the execution and decision points and the edges being communication channels. Kepler was initially developed in 2002 and has been used in many scientific fields for workflow development, including hydrology, physics and computer science.

5.1.9 Development languages

The following languages have been identified as being useful in the development of the test bed and integration framework components:

• Python - a general purpose, high level programming language designed to emphasize code readability and brevity. Python supports multiple styles of programming and includes dynamic typing and automatic memory management.
• Java - a general purpose, object oriented, high level programming language intended to support multiple platforms in a "write once run anywhere" model.
This is accomplished via compilation to byte-code, which Java virtual machines for the different platforms then interpret.
• Javascript - a prototype-based scripting language using dynamic typing with first-class functions. Originally intended for client-side scripting in web browsers, Javascript and frameworks based on it have been adopted for other purposes, such as GNOME Shell using it as a default programming language.
• Coffeescript - a language that compiles to Javascript but improves the development process by adopting ideas from other languages into its syntax. This improves the readability and maintainability of programs developed using it.

It should be noted that this list is not exhaustive, only reflects current work in this area, and will be added to as required over the lifetime of the project.

5.2 Description of the arts & media test bed

5.2.1 Structure of the test bed

The media test bed is based on the application framework presented in deliverable D7.1.1. The current test bed implementation is based on the main requirements for components and functionality required in the ingest and storage processes from the Arts & Media user scenarios. The test bed is intended to have the main infrastructure components (including the ingest engine and data management) set up on virtual machines as required by the test scenarios. Upon these virtual machines, instantiations of the components under test in the current test scenario will be installed and operated.

Figure 5: Conceptual view of the Arts & Media Test Bed Instantiation

Figure 5 shows the current test bed approach in terms of an OAIS structure.
In this form a test bed instantiation is composed of five distinct entities:
• Ingest Engine - Archivematica
• Data Management - Mock-Up TMS
• Archival/Storage - iRods
• Access - Web Portal
• Test Management - Jenkins (not shown)

An instance of Jenkins is used to manage and coordinate the tests enacted upon the test bed. Jenkins is configured to pull source code from Git and SVN, and can use Maven 2/3 or other build systems for the build stage of tests and then to run the associated tests.

5.2.2 Domain specific technologies

The following software components are hosted across the virtual machine network:
• Archivematica: a software component required by the arts and media domain. This is hosted on the externally visible system and serves as an entry point to the ingest process.
• MySQL: this currently serves as the data management layer in place of proprietary systems like TMS (The Museum System, Tate's collection management system). The current instance mimics the data structures exposed via an example data set provided by Tate.

5.2.3 Current status and tests

The current test bed implementation is hosted by EPCC at the University of Edinburgh. This test bed is hosted as a series of Linux (Ubuntu) virtual machines (VMs). One virtual machine serves as an external entry point to the test bed system. This is an externally accessible system, with the other components of the test bed being hosted on internal virtual machines. The access layer is currently undergoing design and development and is not part of the test bed. The main functionality currently under test is the use of BagIt and compression file conversions to move between the ingest, archival and data management layers.
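To make the bag structure from section 5.1.4 concrete, the following minimal sketch builds and verifies a bag by hand: a data/ payload, a bagit.txt declaration and an md5 manifest. The helper names are ours; a production system would use one of the existing BagIt libraries rather than this sketch.

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir, files):
    """Create a minimal BagIt bag: a data/ payload plus the tag files
    (bagit.txt and manifest-md5.txt) that allow later verification."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in files.items():
        (data / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    (bag / "manifest-md5.txt").write_text("\n".join(manifest_lines) + "\n")

def verify_bag(bag_dir):
    """Re-compute each payload checksum and compare it to the manifest."""
    bag = Path(bag_dir)
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        digest, rel = line.split(None, 1)
        if hashlib.md5((bag / rel).read_bytes()).hexdigest() != digest:
            return False
    return True

# Illustrative payload only.
bag_dir = tempfile.mkdtemp()
make_bag(bag_dir, {"audio.wav": b"RIFF....", "notes.txt": b"ingest notes"})
print(verify_bag(bag_dir))  # True
```

Because the manifest travels with the payload, the same check can be repeated after every transfer between the ingest, archival and data management layers.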
No research components are under test on the test bed at this stage; the only tests are of integration technologies. Future tests will include tests for integrating research components, third party software and robustness testing.

The current system uses preinstalled components on a number of VMs in Edinburgh. There will be an initialisation phase to set up the test bed before running the tests. It is not very feasible to store VM images in a revision control repository, so we are proposing to use Vagrant (http://www.vagrantup.com/) to create the VMs for the test bed components. The configuration files ("Vagrant-files") contain descriptions of machine types and bootstrapping configurations for the VMs. They are text files that can easily be change-tracked in revision control. The time it takes to create the VMs will have to be determined, in order to decide how often these can realistically be executed. The cloud at GWDG supports Vagrant.

5.3 Description of the science test bed

5.3.1 Structure

The Space Science test bed includes clones of tools in use at B.USOC as well as additional software and models that implement the different components in the OAIS Model and unit tests. In this form a test bed instantiation is composed of four distinct entities:
• Ingest Engine - YAMCS
• Data Management - Web Portal
• Archival/Storage - iRods
• Access - Web Portal

The diagram in Figure 6 illustrates these tools and components and their major interactions. Additionally, a separate infrastructure (hosted by SpaceApps) is in place to support software development, providing feature/bug tracking, planning and unit testing.
Figure 6: Conceptual view of a Science Test Bed Instantiation

5.3.2 Domain specific technologies

The cloned tools in use at B.USOC are:
• YAMCS, an open source mission control system; it manages the data coming from payloads on the ISS, including both experiment and "housekeeping" data;
• Predictor, an operation planning and reporting tool; operators use it to have a centralized overview of all the relevant information that affects operations;
• Alfresco, an open source document management system; at B.USOC it is used to store, catalogue and search different kinds of digital documents (reports, manuals, procedures…).

The components developed within PERICLES are:
• Semantic model, a Topic Maps representation of the data and metadata in the Space Science case; the metadata, in particular, includes the relations among different pieces of data.
• PERICLES Portal, the web front end and main user interface of the preservation system; it is meant to provide control of the system itself and access to its content. The presentation of information is focused on the relations modelled by the semantic model and is customizable by users.
• Packager, a tool to build archival packages out of selections of the information managed by the system (including relations).

5.3.3 Status and current tests

Each of the cloned tools in the Space Science test bed is hosted on Linux virtual machines within the B.USOC premises, and is populated with a representative sample of the real data stored at B.USOC. In this context, besides recreating the currently existing environment, they play the role of data sources, that is, the locations data is pulled from during ingest.
The semantic model currently in use (for the very initial test bed implementation) is very limited, as shown in Figure 7. However, a much deeper and richer one is under development: a semantic model that already contains hundreds of types and relationships and covers the whole SOLAR experiment.

Figure 7: Example of the semantic model provided by the Topic Maps Engine

The portal is at the moment focused on data access, as it presents the content of the Topic Maps engine via web pages, one for each atomic piece of information. Data presentation is customizable through templates. The portal also provides automatically generated templates, which take advantage of the relations stored in Topic Maps. The first prototype of the portal has been implemented and presented to the consortium. A screenshot is provided below.

The packager is currently a stand-alone tool, which interfaces with the semantic model through the Topic Maps engine. It offers:
• preliminary support for different packaging standards (SAFE, BagIt, custom descriptors);
• configurable data selection based on the semantic model and the original data sources;
• the ability to upload packages to, and browse the content of, the archival storage component (currently iRODS).

There is currently no complete component for automatic ingest from data sources. Instead, this operation is still carried out through a semi-automated procedure, which consists of:
• data source scanning and scraping;
• ad-hoc semantic extraction and relation inferring (e.g. based on time information, or on well-known identifiers);
• Topic Map population.

The unit test and development infrastructure is hosted and managed by SpaceApps.
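The time-based relation inferring mentioned in the procedure above can be sketched as matching records whose timestamps fall close together. The function, record names and the five-minute window below are our illustrative assumptions, not the actual B.USOC procedure.

```python
from datetime import datetime, timedelta

def infer_relations(reports, telemetry, window=timedelta(minutes=5)):
    """Link each report to the telemetry records whose timestamps fall
    within `window` of the report's timestamp - a stand-in for the
    time-based relation inference used during semi-automated ingest."""
    relations = []
    for report_id, report_time in reports:
        for tm_id, tm_time in telemetry:
            if abs(report_time - tm_time) <= window:
                relations.append((report_id, tm_id))
    return relations

# Illustrative data only: one report and two telemetry records.
reports = [("report-1", datetime(2014, 5, 5, 12, 0))]
telemetry = [("tm-a", datetime(2014, 5, 5, 12, 3)),
             ("tm-b", datetime(2014, 5, 5, 13, 0))]
print(infer_relations(reports, telemetry))  # [('report-1', 'tm-a')]
```

The inferred pairs would then be written into the Topic Map as associations during the population step.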
A Jenkins installation is configured to pull sources from the code repository, build the projects and run unit tests. For building and unit testing, Jenkins relies on Maven and sbt, depending on the programming language used for each piece of software.

Integration tests are run on installations within the B.USOC premises and are carried out manually. They cover:
• relation extraction during ingestion (e.g. checking that the relations between reports, documents and experiment data are identified and correct);
• browsability of relations in the PERICLES Portal (e.g. checking that all relations of a certain piece of data are shown, and that such links can actually be used to reach its neighbouring pieces of data);
• readability of automatically rendered pages in the PERICLES Portal (that is, an assessment of whether a page makes sense to humans);
• archival package creation and upload to the storage component (checking that the content of the packages matches the configuration and that the storage component does receive them).

6 Roadmap

The development of the test beds for both the science and media domains is a continuous process throughout the whole period of the PERICLES project. This deliverable describes the first implementations, in the awareness that the full functionality and documentation are not yet available. The next important steps identified to increase the functionality of the test beds are:
• Refactoring and Extension of the Framework: the current framework needs to be refactored to better reflect the understanding of the requirements for the test cases. Additional functionality will be required to accommodate the range of tests and components under examination.
• Integration of Component Testing: the methods for testing need to be published to component developers.
These need to state how the source should be exposed, how test data is used, and what types of tests should be carried out at which stage.
• Test Data Repository: a test data repository should be created to avoid the need for replication of test data across the set of scenarios. This would allow better sharing of resources and a common understanding of what is part of the test data body and why.
• Automated Scenario Testing: the current method of testing a scenario is manual execution. For further and repeatable testing this has to be replaced with scripted automatic tests.
• Document Generation: the intention is to provide a set of living documentation to reflect the current status of components, testing and future plans. This will be accomplished via the automated document generation facilities available in tools like Maven and Jenkins, given the appropriate source materials.

NOTE: Components should be put into the testing/build system as they come online. This means that as tests are developed for them they should be added, even when the tests fail.
NOTE: The timeframes below are for the components/tools coming online for further development; where appropriate, tests should be continually updated, for example as new information becomes available.

6.1 Media test bed

Table 8: Arts & Media test bed roadmap
• July onwards - Additional Technology: an on-going task to be updated with the further development of the user scenarios. Identify all current additional target technologies for use in the test bed which are not provided by project partners, based on the use case scenarios. Once identified, set up instances for use in future test beds.
• August - Initial Test Scenarios and Data: identify a core set of test scenarios for use in the test bed and create the test plan. Collect and ensure test data is available.
• September-December - Test Scenario Implementation and Maintenance: implement the identified test scenarios, monitoring for additional required tests; document errors and failures and fix components accordingly. Begin recording of success/failure metrics and of process and software quality metrics.
• January onwards - Continuous Test Scenario Development: improve, augment and expand the test scenarios based on the user scenarios and the software development process.

6.2 Science test bed

Table 9: Science test bed roadmap
• July - Automatic test bed deployment and integration test: automate the deployment of the test bed components within the B.USOC infrastructure. Automate the current manual integration test.
• August-September - Integration of shared test bed architecture: implement the specifications, formats and interfaces of the shared test bed architecture in the current science data test bed.
• October - Test case definition: preparation of test plans based on test scenarios and requirements. Preparation of datasets for automatic ingestion.
• November onwards - Test case implementation: test case implementations based on the test plans. Improvement of test bed components, datasets, test plans and tests. Document test failures with per-failure dedicated tests. Begin recording of success/failure metrics and of process and software quality metrics.
• January onwards - Continuous Test Scenario Development: improve, augment and expand the test scenarios based on the user scenarios and the software development process.

6.3 Integration of tools and components

Table 10: Integration of tools and components roadmap
• July-June - Testing Tools Agreement: ensure all partners agree on testing tools, build strategy, code repositories, and associated topics.
• July - Test Bed Analysis: analysis of the existing test beds to look for commonalities, and report back.
• July - Test Planning: draft to the whole project about how testing is and will be done. This includes types of test, success/failure/error criteria, test layout and recording of rationale.
• August - Component Interface: draft proposal for the interfaces for different types of components and their target locations in OAIS. As a starting point, the OAIS diagram from ULIV can be used as a basis.
• September - Handler Syntax: draft for how Handler actions will happen.
• September - Test Bed Commonality: publish the test bed analysis. Set up common scripts if possible. Unified testing view over the test beds.
• October - Common Tests Online: have heartbeat and common scenario testing online, adding tests for new AIPs and component changes.
• October-November - First Version of Handler and Component Interface: first full versions of these definitions.
• November-January - Automated Build Test Harness of Components: initial build test harness for components from WP3-5. Should report on compilation status and whether documentation is available. NOTE: this is the harness being ready; components may be unavailable.
• December-February - Automated Unit and Integration Test Harness: components from WP3-5. Should report on failures and errors. NOTE: this is the harness being ready; components may be unavailable.
• November onwards - Feedback and Update on Test Systems: improvements to processes and underlying structures.

7 Conclusion

This deliverable describes the first version of the test bed implementations for the two domains of the PERICLES project.
The test beds use technologies and tools which are on the one hand found to be common components and on the other identified as domain specific. For both test beds the first tests have been performed successfully. The document describes the applied technologies in detail but also points out the current limitations. To overcome the latter, a clear roadmap with short- and long-term goals was defined and presented in this document. The current test bed implementations show a successful collaboration between the different working groups of the project towards that goal.

List of Figures

Figure 1: Example of a process diagram with components mapping ... 12
Figure 2: Integration Framework Architecture ... 14
Figure 3: Chained Handlers ... 19
Figure 4: Data Pull ... 24
Figure 5: Conceptual view of the Arts & Media Test Bed Instantiation ... 30
Figure 7: Example of the semantic model provided by Topic Maps Engine ... 33

List of Tables

Table 1: Functional blocks of the Workflow Engine ... 15
Table 2: Operations of a handler ... 18
Table 3: Handler and Component Anatomy (Green are framework) ... 18
Table 4: GET method for getting the status of a specific workload ... 20
Table 5: POST method provided for a handler ... 20
Table 6: GET method for getting the workflow ... 21
Table 7: Status response of a handler ... 21
Table 8: Arts & Media test bed roadmap ... 35
Table 9: Science test bed roadmap ... 36
Table 10: Integration of tools and components roadmap ... 36
Table 11: Example test plan ... 42

Bibliography

Alfresco - Open Source document management: http://www.alfresco.com
Archivematica (2013) - Archivematica Website, 2013, Tested Live 16/01/2014: https://www.archivematica.org/wiki/Main_Page
CoffeeScript (2014) - Documentation, 2014, Tested Live 29/01/2014: http://coffeescript.org/
Git (2014) - Git Documentation and Product, 2014, Tested Live 23/05/2014: http://git-scm.com/
IETF (2013) - The BagIt File Packaging Format, IETF, 2013, Tested Live 26/01/2014: http://tools.ietf.org/html/draft-kunze-bagit-09
iRODS (2013) - iRODS Documentation and Wiki, 2013, Tested Live 16/01/2014: https://www.irods.org/index.php#
Jenkins (2013) - Jenkins Continuous Integration Website, 2013, Tested Live 16/01/2014: https://wiki.jenkins-ci.org/display/JENKINS/Home
JSON - JavaScript Object Notation: http://en.wikipedia.org/wiki/JSON
Kepler - Scientific workflow system: https://kepler-project.org
Maven (2014) - Maven Documentation and Product, 2014, Tested Live 23/05/2014: http://maven.apache.org/
MySQL - Open-source relational database management system: http://www.mysql.com
SVN: Subversion (2014) - Subversion Documentation and Product, 2014, Tested Live 23/05/2014: http://subversion.apache.org/
OAIS - Open Archival Information System: http://en.wikipedia.org/wiki/OAIS
REST - Representational State Transfer: http://en.wikipedia.org/wiki/REST
SAFE - Standard Archive Format for Europe: http://earth.esa.int/SAFE/
Taverna - An open source and domain-independent Workflow Management System: http://www.taverna.org.uk
Topic Maps - A standard for the representation and interchange of knowledge, with an emphasis on the findability of information.
http://en.wikipedia.org/wiki/Topic_Maps
YAMCS - A Mission Control System: http://www.busoc.be

Appendix

Table 11: Example test plan

Test Name: Handler for Conversion
Test Purpose: Confirm that the Handler will call the component for audio conversion, and confirm output/workflow/payloads.
Test Components: Handler; Conversion Component
Test Data: Source Audio File; Expected Audio File; Test Workflows
Pre-Conditions: Initialised Git repository with the source audio file; mock workflow server active; dummy check handler active.
Post-Conditions: Git repository removed; mock workflow server removed; dummy check handler removed.

Tests:
1 Normal Behaviour - conversion worked, workflow valid, target contactable
2 Unavailable Payload - handler active, payload non-existent
3 Invalid Params - handler works and handles parameter errors
4 Component Failure - handler operational, conversion component fails
5 Next Handler Down - cope with failure of the target handler
6 Mock Server Unavailable - handle failure of the server

Success Criteria:
1 Pass - dummy handler receives the converted file with the correct/expected parameters. Fail - anything else.
2 Pass - handler produces an error that the payload is unavailable for processing and terminates the chain. Fail - handler tries to pass the invalid payload to the component, or tries to follow the normal workflow.
3 Pass - handler produces an error that the parameters have been mishandled (malformed etc.). Fail - handler validates the parameters and attempts to run the conversion.
4 Pass - the component failure is logged as an error and the chain terminates at this point. Fail - the next handler is contacted.
5 Pass - handler retries X times, produces an error and parks the payload to the repository. Fail - tries indefinitely, or tries and ignores the failure.
6 Pass - handler stops and logs an error. Fail - anything else.

What is Not Tested: the full behaviour of the conversion component - only its interactions with the handler.

Test steps:
1 Kick off the process - contact the handler under test with valid parameters and a valid payload. Monitor the handler status - check working/finished status. Monitor the dummy status. The dummy, when activated, should check: the converted file against the expected file; that the workflow is valid; that the parameters are valid.
2 Kick off the process - contact the handler under test with valid parameters and an invalid payload. Monitor the handler status - check error status. Assert that the error log has a valid error message.
3 Kick off the process - contact the handler under test with invalid parameters and a valid payload. Monitor the handler status - check error status. Assert that the error log has a valid error message.
4 Kick off the process - contact the handler under test with valid parameters and a valid payload. Induce a conversion component failure. Monitor the handler status - check that the status becomes blocked. Assert that the error log has a valid error message.
5 Kick off the process - contact the handler under test with valid parameters and a valid payload. Disable the dummy handler. Monitor the handler status - check that the status becomes error. Assert that the error log has a valid error message. Check the repository.
6 Kick off the process - contact the handler under test with valid parameters and a valid payload. Disable the server. Monitor the handler status - check that the status becomes error. Assert that the error log has a valid error message. Check the repository.
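Test 5 of the plan expects the handler to retry a bounded number of times before logging an error and parking the payload, rather than retrying indefinitely. The following is a minimal sketch of that retry-then-park behaviour; all names are illustrative and not the actual handler implementation.

```python
def forward_with_retries(send, payload, max_retries=3):
    """Try to pass the payload to the next handler up to `max_retries`
    times; on persistent failure, log an error and park the payload
    instead of retrying indefinitely (the behaviour test 5 checks)."""
    parked, log = [], []
    for attempt in range(1, max_retries + 1):
        try:
            send(payload)
            return {"status": "finished", "log": log, "parked": parked}
        except ConnectionError as exc:
            log.append(f"attempt {attempt} failed: {exc}")
    parked.append(payload)
    log.append("target handler unreachable; payload parked to repository")
    return {"status": "error", "log": log, "parked": parked}

def always_down(_payload):
    # Stand-in for an unreachable target handler.
    raise ConnectionError("target handler down")

result = forward_with_retries(always_down, "converted-audio.wav")
print(result["status"], len(result["parked"]))  # error 1
```

A test harness would assert on the returned status, the log entries and the parked payload, mirroring the pass/fail criteria above.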