Publications Office

Production and dissemination of the Supplement to the Official Journal of the European Union: TED website, OJS DVD-ROM and related offline and online media

Software Architecture Document

Subject: Software Architecture Document
Version / Status: 1.00
Release Date: 28/04/2010
Filename: TED-SAD-v1.00.doc
Document Reference: TED-SAD

Table Of Contents

1 Introduction
  1.1 Purpose of the Document
  1.2 Scope of the Document
  1.3 Intended Audience
2 Reference and Applicable Documents
3 Acronyms and Abbreviations
4 Architectural Representation
5 Logical View
  5.1 TED Website
    5.1.1 Overview
    5.1.2 Web Layer Design Package
    5.1.3 Service Layer Design Package
    5.1.4 Domain layer
    5.1.5 Data access layer
    5.1.6 General Principles
  5.2 Monitoring data-warehouse
    5.2.1 BIRT
    5.2.2 Cacti
    5.2.3 Webalizer
  5.3 License Holder environment
    5.3.1 Authentication and logging
  5.4 Email analysis and notifications
  5.5 Workflow engine
    5.5.1 Validation and files transformation
    5.5.2 PDF generation and time-stamping
    5.5.3 Indexing
    5.5.4 DVD image creation
    5.5.5 Contracting authority notification
  5.6 Notice viewer
6 Implementation View
  6.1 TED Website
    6.1.1 Overview
    6.1.2 TED XSL transformation
  6.2 Email analysis and notifications
  6.3 Workflow engine
    6.3.1 The workflow engine package
    6.3.2 The workflow engine Implementation
    6.3.3 The Indexing Implementation
    6.3.4 The workflow Transformation
    6.3.5 The workflow management tool
  6.4 Ted System i18n support
  6.5 Notice viewer
  6.6 Reference data
    6.6.1 Reference data: Deletion
    6.6.2 Reference data: Addition
    6.6.3 Reference data: Modification
  6.7 Content modification
    6.7.1 Addition of a new form
    6.7.2 Modification of reference data
  6.8 Application Dependencies
  6.9 Backup procedure
    6.9.1 Daily Back-end backup procedure
    6.9.2 Daily Front-end backup procedure
    6.9.3 Daily Data warehouse backup procedure
    6.9.4 Daily Common backup procedure
7 Data View
  7.1 MySQL cluster
  7.2 Technical Columns
    7.2.1 Audit segment
8 Deployment view
  8.1 Network File System server
    8.1.1 TED repository file system
    8.1.2 TED temporary backup file system
    8.1.3 TED mirror backup file system
    8.1.4 Windows XP via VMWare
  8.2 James email Servers
    8.2.1 DNS Configuration
    8.2.2 Spam folders
  8.3 Database Organisation

LIST OF TABLES

Table 1: Reference Documents
Table 2: Applicable Documents
Table 3: TED XSL Transformation
Table 4: Indexed fields
Table 5: Modified On and Version columns
Table 6: Modified By column

LIST OF FIGURES

Figure 1 TED system modules
Figure 2 TED website responsibility based layers
Figure 3 Integration of Spring MVC with other layers
Figure 4 Flow of a Request through Spring Security Filters
Figure 5 CACTI Active Session diagram
Figure 6 Webalizer Traffic Analysis diagram
Figure 7 Webalizer Summary by Month diagram
Figure 8 Production management steps
Figure 9 Web application file structure
Figure 10 Workflow engine package Structure
Figure 11 Flow Service class diagram
Figure 12 Indexing Flow
Figure 13 Notice Viewer structure
Figure 14 Reference Data Model
Figure 15 Database replication with MySQL Cluster
Figure 16 TED Deployment diagram
Figure 17: Deployment diagram

1 INTRODUCTION

1.1 PURPOSE OF THE DOCUMENT

The aim of this document is to provide a comprehensive architectural overview of the TED system. It describes how the development team has translated the functional analysis and use cases into the structure of the architecture.

1.2 SCOPE OF THE DOCUMENT

This document presents the technical architecture of the TED system and focuses on the choices made for it. Readers will find information about the frameworks, tools and technologies used by the TED system.

1.3 INTENDED AUDIENCE

The present document is intended to be read by the following people:
- Publishing operation team;
- Publications Office Project Team;
- Developments Project Team.

2 REFERENCE AND APPLICABLE DOCUMENTS

This section lists all reference and applicable documents. When referring to any of the documents below, the bracketed reference will be used in the text, such as [R01].

REFERENCE DOCUMENTS

Ref. | Title                             | Reference | Version | Date
R01  | TED-FSP-Functional Specifications | TED-FSP   | 1.00    | 07/09/2009
R02  | TED-DML-Data Model                | TED-DML   | 1.00    | 28/04/2010

Table 1: Reference Documents

APPLICABLE DOCUMENTS

Ref. | Title | Reference | Version | Date
A01  | General Invitation to Tender: Production and dissemination of the Supplement to the Official Journal of the European Union: TED website, OJS DVD-ROM and related offline and on line media. Specifications | N° 10186 | N/A | 06/01/2009
A02  | Hybrid service contract: Production and dissemination of the supplement to the Official Journal of the European Union: TED Website, OJS DVD-ROM and related offline and on line media | N° 10186 | N/A | 06/01/2009
A03  | Project Quality Plan | TED-PQP | 1.01 | 08/09/2010

Table 2: Applicable Documents

3 ACRONYMS AND ABBREVIATIONS

ABBREVIATIONS AND ACRONYMS

Abbreviation | Meaning
AOP          | Aspect Oriented Programming
API          | Application Programming Interface
CPV          | Common Procurement Vocabulary
CRUD         | Create Retrieve Update Delete
DAO          | Data Access Object
ECMT         | European Commission Machine Translation
FTP          | File Transfer Protocol
HTTP         | HyperText Transfer Protocol
IoC          | Inversion of Control
JAR          | Java Archive
JDK          | Java Development Kit
JEE          | Java Enterprise Edition
JSP          | Java Server Page
JTA          | Java Transaction API
LGPL         | GNU Library or Lesser General Public License
MVC          | Model View Controller
NUTS         | Nomenclature des Unités Territoriales et Statistiques
OJS          | Official Journal Supplement
OOD          | Object-Oriented Design
OPOCE        | Office des Publications Officielles des Communautés Européennes
POJO         | Plain-Old Java Object
TED          | Tenders Electronic Daily
UDF          | Universal Disk Format
URL          | Uniform Resource Locator
WAR          | Java Web Archive

4 ARCHITECTURAL REPRESENTATION

This document is a part of the Technical Specification of the TED System, the result of the design phase. It presents the views needed to represent the software architecture:
- The Logical View presents the decomposition of the software architecture into subsystems and packages;
- The Implementation View describes the overall structure of the implementation model, the decomposition of the software into layers and subsystems;
- The Data View describes the persistent data storage perspective of the system;
- The Deployment View describes the physical infrastructure on which the TED software is deployed and run. It specifies the physical nodes and network configuration that execute the software, and also maps the processes defined in the Process View onto physical nodes.

5 LOGICAL VIEW

The Logical View presents an overview of the architecture and then provides the decomposition of the software into design packages and sub-systems.
The TED system has been decomposed into six distinct modules, represented in the next figure:

[Figure 1 TED system modules. Modules shown: TED website; Monitoring Datawarehouse; License Holder environment; Email analysis and notifications; Notice viewer; Content management (Workflow engine)]

- TED website: This module represents all the components needed for the public web interface of the TED system. It is described in detail in the TED Website section.
- Monitoring Data-warehouse: This module contains the data-warehouse user interface. It is described in detail in the Monitoring data-warehouse section.
- License Holder environment: The environment available to subscribers who have privileged access to the contents of TED. It is described in detail in the License Holder environment section.
- Email analysis and notifications: This module is responsible for mailing notifications and analysing received emails. It is described in detail in the Email analysis and notifications section.
- Workflow engine: This module contains all the components used for the production management of the documents on the TED system. This includes indexing, creation of DVD images, file transformations and the production dashboard. It is described in detail in the Workflow engine section.
- Notice viewer: The notice viewer is a simple stand-alone Java application that executes the transformations on the given XML files. It is described in detail in the Notice viewer section.

5.1 TED WEBSITE

5.1.1 OVERVIEW

The TED system architecture is based on the J2EE application architecture. This architecture is decomposed into ‘tiers’ and ‘layers’ as recommended by the J2EE specification. The Layering design pattern, when applied to a system, breaks down the complexity of the system as a whole by identifying its different parts and reducing the coupling between them. Layering reduces the impact of a change in one layer on the rest of the system.

Multi-dimensional layering combines two strategies:
- Responsibility-based layering, which associates each layer with a specific responsibility (presentation, business and integration);
- Reuse-based layering, which identifies components that have a high potential for reuse, possibly across different projects.

The application is made of several responsibility-based layers:

[Figure 2 TED website responsibility based layers. Web: eu.europa.opoce.ted.controller; WebContent: eu.europa.opoce.ted.view; Service: eu.europa.opoce.ted.service; Domain: eu.europa.opoce.ted.model; Integration: eu.europa.opoce.ted.dao]

- The Web Layer contains the logic that handles the interaction between the user and the system via a Web Browser. To achieve this interaction, the Web Layer is allowed to call the high-level functions provided by the Service Layer and to manipulate the domain models of the Domain Layer that are exposed to the presentation;
- The Service Layer contains the business logic, structured in high-level methods oriented around use cases that may result in CRUD operations on entities of the Domain Layer. These CRUD operations are carried out through the Data Access Layer;
- The Domain Layer contains the data, common rules and logic of the model: business or technical, persistent or transient;
- The Data Access Layer acts as a medium between the entities of the Domain Layer and the technical solutions ensuring their durability.
  The Data Access Layer knows how and where the persistent entities are stored. Typically, an entity of the Domain Layer has a corresponding Data Access Object (DAO) in this layer that exposes methods to manage the persistence of the object.

5.1.2 WEB LAYER DESIGN PACKAGE

The Web Layer is built on top of the Spring MVC framework. This framework implements and makes intensive use of the following design patterns:
- Model-View-Controller;
- Front Controller;
- Command Object.

The purpose here is not to give a complete explanation of how Spring MVC works, but rather to describe its philosophy and how, in practice, the TED system uses and extends it.

The Model-View-Controller pattern is the separation of concerns applied to the presentation tier: it separates the view from the business data and processes, the controller being responsible for handling requests and acting as a medium between the model and the view. With Spring MVC, business objects can be reused as they are (no class extension or interface implementation is required). The following figure shows how Spring MVC components interact with the Application and Data Access layers.

[Figure 3 Integration of Spring MVC with other layers. Labels shown: Presentation, Application, Data Access; request, response; Spring Security, DispatcherServlet, Controller, Model, View, Service, Domain, Integration]

In TED, the Model-View-Controller pattern is implemented as Java classes extending Spring Framework classes (such as SimpleFormController).

The Front Controller design pattern avoids the need for a separate servlet for each controller. Instead, Spring MVC provides a generic servlet, the DispatcherServlet, which dispatches each request to a specific controller. In TED, the Front Controller role is handled by the DispatcherServlet class of the Spring Framework.

The Command Object design pattern is used to map the HTTP request and its parameters to a Java object holding all the information.
In TED, the Command Objects are implemented as simple Java classes, which all extend the same parent class: TedDefaultPageCommand.

5.1.2.1 Access Control

When a request is submitted to a web application protected by Spring Security, the request is guaranteed to be processed by Spring Security through standard Java Servlet filters. Spring Security provides a framework and a set of components to build the whole security processing chain. A request to a protected resource therefore passes through each of Spring Security’s filters, as depicted in the figure below:

[Figure 4 Flow of a Request through Spring Security Filters. Request -> 1. Channel-Processing Filter (optional) -> 2. Authentication-Processing Filter -> 3. Integration Filter -> 4. Security Enforcement Filter -> 5. Secured Web Resource]

Each of these filters plays a specific role in providing the level of security desired to protect a web resource. The first filter, for example, can enforce that web requests use a given channel, such as HTTPS. The second filter, the Authentication-Processing Filter, is in charge of redirecting and authenticating the user if the web resource is protected. The fourth filter, the Security Enforcement Filter, checks that the logged-in user has been given the access rights required for that web resource. This check is modular and may combine different rules, allowing complex Access Control Lists (ACL).

This simple but powerful chaining mechanism ensures that all requests made by a web browser comply with the security constraints imposed. These constraints can be set in configuration files, so that they remain external to the source code.
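
The chaining idea described in this section can be illustrated with a minimal sketch in plain Java. It does not use the Spring Security API: the names SketchRequest, SketchFilter and SecurityChainSketch, the outcome strings and the sample access rights are all hypothetical and chosen for illustration only. The Integration Filter (step 3) is left out because its role is not detailed here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical request: channel used, logged-in user (null if anonymous), target path. */
final class SketchRequest {
    final String scheme;
    final String user;
    final String path;

    SketchRequest(String scheme, String user, String path) {
        this.scheme = scheme;
        this.user = user;
        this.path = path;
    }
}

/** One link of the chain: returns null to let the request pass, or the reason it stops. */
interface SketchFilter {
    String check(SketchRequest request);
}

final class SecurityChainSketch {
    /** Illustrative access rights: protected path -> users allowed to reach it. */
    static final Map<String, Set<String>> ACL = new HashMap<>();
    static {
        ACL.put("/admin", new HashSet<>(Arrays.asList("alice")));
    }

    /** The filters, in the order in which every request passes through them. */
    static final List<SketchFilter> CHAIN = Arrays.asList(
        // 1. Channel processing: the request must use HTTPS.
        r -> "https".equals(r.scheme) ? null : "REDIRECT_TO_HTTPS",
        // 2. Authentication processing: a protected path needs a logged-in user.
        r -> (ACL.containsKey(r.path) && r.user == null) ? "REDIRECT_TO_LOGIN" : null,
        // 4. Security enforcement: the logged-in user must hold the access right.
        r -> (ACL.containsKey(r.path) && !ACL.get(r.path).contains(r.user)) ? "ACCESS_DENIED" : null
    );

    /** Runs the filters in order; the resource is served only if no filter objects. */
    static String handle(SketchRequest request) {
        for (SketchFilter filter : CHAIN) {
            String outcome = filter.check(request);
            if (outcome != null) {
                return outcome;
            }
        }
        return "OK";
    }
}
```

In this sketch, a call such as SecurityChainSketch.handle(new SketchRequest("https", "bob", "/admin")) stops at the enforcement step, while the same request from "alice" reaches the resource. In the real system the equivalent rules live in Spring Security configuration rather than in code.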

5.1.2.2 Rollover menus with CSS

All rollover menus of the website use the “:hover” CSS pseudo-class. IE6 does not support this pseudo-class on every HTML element. To make the menus work on IE6, we use a JavaScript function and a supplementary class in the CSS files (see below).

To inform users of IE6 with JavaScript disabled, we use the conditional comment <!--[if lt IE 7]> on our HTML pages. Inside this condition, a <noscript> tag carries a warning message telling users that the navigation is unusable with IE 6 when JavaScript is disabled.

JavaScript function:

    <!--[if lt IE 7]>
    <script type="text/javascript">
    // Function intended to replace "LI:hover" for IE 6
    sfHover = function() {
        var sfEls = document.getElementsByTagName("li");
        for (var i = 0; i < sfEls.length; i++) {
            sfEls[i].onmouseover = function() {
                // Remove any previous marker first so the class is never duplicated
                this.className = this.className.replace(new RegExp(" ?sfhover\\b", "g"), "");
                this.className += " sfhover";
            };
            sfEls[i].onmouseout = function() {
                this.className = this.className.replace(new RegExp(" ?sfhover\\b", "g"), "");
            };
        }
    };
    if (window.attachEvent) window.attachEvent("onload", sfHover);
    </script>
    <![endif]-->

Warning message:

    <!--[if lt IE 7]>
    <noscript>
        <span class="red">
            Warning: you are using an old version of Internet Explorer without JavaScript ...
        </span>
    </noscript>
    <![endif]-->

CSS class: every <tag>:hover rule must have an equivalent <tag>.sfhover rule.

5.1.3 SERVICE LAYER DESIGN PACKAGE

Transaction demarcations are managed declaratively using the Spring Framework. The underlying transactions are handled by the Spring DataSourceTransactionManager. The transactions are defined at the service level. Service class methods represent use cases that are usually considered atomic from a transactional point of view.
The service level is therefore a good place to manage the transaction. Following Spring’s philosophy, the transactions are configurable in all aspects (isolation level, timeout …) through annotations. The definition of the transactional boundaries has no impact on the Service classes.

5.1.3.1 Search service

One of the major characteristics of the TED website is its search capability. All the search functionalities are implemented on top of the Lucene library. Apache Lucene is a high-performance, full-featured text search engine library written in Java. It is suitable for any application that requires full-text indexing and searching capability. Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. The Lucene API is also known for its flexibility, which makes it independent of the format of the files to index.

All the search capabilities needed by the TED website are encapsulated within this search service. The index grows incrementally as the information obtained by parsing and indexing new documents is added to it. This indexing process is handled by the content management module, which performs the operation for each new OJS release.

5.1.4 DOMAIN LAYER

The Domain Layer contains the data, common rules and logic of the model. This layer contains the identified business entities. It is unaware of how the persistence of domain objects is managed; that is the responsibility of the Data Access Layer.

5.1.5 DATA ACCESS LAYER

This section describes the approach used for JDBC database access with Spring. The JdbcTemplate class is the central class in the Spring JDBC core package that is used by TED.
It simplifies the use of JDBC since it handles the creation and release of resources. This helps to avoid common errors such as forgetting to close the connection. It executes the core JDBC workflow, such as statement creation and execution, leaving application code to provide the SQL and extract the results. This class executes SQL queries, update statements or stored procedure calls, initiating iteration over ResultSets and extraction of returned parameter values. It also catches JDBC exceptions and translates them into a more informative exception hierarchy.

The system makes use of the SimpleJdbcTemplate class, a wrapper around the classic JdbcTemplate that takes advantage of Java 5 language features such as variable arguments and auto-boxing.

In order to work with data from a database, one needs to obtain a connection to the database. Spring does this through a DataSource. A DataSource is part of the JDBC specification and can be seen as a generalized connection factory. It allows a container or a framework to hide connection pooling and transaction management issues from the application code.

Spring provides other utilities such as the RowMapper interface. A RowMapper implementation maps each row obtained from iterating over the ResultSet created during the execution of the query to one object.

5.1.6 GENERAL PRINCIPLES

This section contains the general principles underlying the system and promoted by the architecture. These principles are too general to be exposed as a specific design package, but they are important enough to be mentioned. This section provides a short description of each of them.

5.1.6.1 Programming to Interfaces

This principle is also known in its longer version, ‘Programming to Interfaces, not implementations’. When a piece of software is developed, an implementation class must not depend directly on other implementation classes but rather on the interfaces they implement.
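
The principle can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the TED system: NoticeDaoSketch, NoticeServiceSketch and the method names are hypothetical. The service depends only on an interface, so a test can hand it a stand-in implementation instead of a database-backed one.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical DAO interface: the service depends on this, not on a concrete class. */
interface NoticeDaoSketch {
    List<String> findTitlesByCountry(String countryCode);
}

/** Hypothetical service; its dependency is handed in through the constructor. */
class NoticeServiceSketch {
    private final NoticeDaoSketch dao;

    NoticeServiceSketch(NoticeDaoSketch dao) {
        this.dao = dao;
    }

    /** Use-case level method: counts the notices available for one country. */
    int countNotices(String countryCode) {
        return dao.findTitlesByCountry(countryCode).size();
    }
}

class InterfaceSketchDemo {
    public static void main(String[] args) {
        // A mock implementation stands in for the database-backed DAO.
        NoticeDaoSketch mock = countryCode ->
            "BE".equals(countryCode)
                ? Arrays.asList("Notice A", "Notice B")
                : Collections.<String>emptyList();

        NoticeServiceSketch service = new NoticeServiceSketch(mock);
        System.out.println(service.countNotices("BE"));
    }
}
```

Because NoticeServiceSketch never names a concrete DAO class, swapping the mock for a real implementation requires no change to the service itself.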
This improves the maintainability and extensibility of the software, as other implementations of the interfaces can be substituted for the current one with little impact on the dependent modules. Its use is facilitated by the ‘Dependency Injection’ principle.

The ‘Programming to Interfaces’ principle also eases the test strategy of the software, especially unit testing. Classes are tested in isolation, as the test provides mock implementations of the interfaces on which the tested object depends.

5.1.6.2 Dependency Injection

The ‘Dependency Injection’ principle greatly facilitates the previous design principle, ‘Programming to Interfaces’. It removes the need for each object to declare explicitly, in the Java code, its dependencies on implementation classes. Configuration files do the job instead. Each object is created by a container that populates the object with its dependencies. Thus, the object no longer knows the implementation classes, only the interfaces. In this project, this container is the one shipped with the Spring Framework. Dependency injection, a form of Inversion of Control (IoC), is the base principle of Spring.

5.1.6.3 Aspect Oriented Programming

Aspect-Oriented Programming (AOP) complements Object-Oriented Programming (OOP) by providing another way of thinking about program structure. In addition to classes, AOP gives you aspects. Aspects enable the modularization of concerns, such as transaction management, that cut across multiple types and objects. (Such concerns are often termed crosscutting concerns.) One of the key components of Spring is the AOP framework.
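
The notion of a crosscutting concern can be sketched with a JDK dynamic proxy (java.lang.reflect.Proxy), a standard mechanism for wrapping interface-based objects. The sketch below is an assumption-laden illustration, not the Spring AOP API: PublicationServiceSketch, AspectSketch and the log entries are hypothetical, and the "transaction" is only recorded in a list rather than opened on a database.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical service interface whose methods should run inside a transaction. */
interface PublicationServiceSketch {
    void publish(String noticeId);
}

class AspectSketch {
    /** Records what the aspect did; stands in for a real transaction manager. */
    static final List<String> LOG = new ArrayList<>();

    /** Wraps a target so that every call is surrounded by begin and commit (or rollback). */
    static PublicationServiceSketch transactional(PublicationServiceSketch target) {
        InvocationHandler handler = (proxy, method, args) -> {
            LOG.add("begin");
            try {
                Object result = method.invoke(target, args);
                LOG.add("commit");
                return result;
            } catch (InvocationTargetException e) {
                // The target method failed: undo, then rethrow the original exception.
                LOG.add("rollback");
                throw e.getCause();
            }
        };
        return (PublicationServiceSketch) Proxy.newProxyInstance(
            PublicationServiceSketch.class.getClassLoader(),
            new Class<?>[] { PublicationServiceSketch.class },
            handler);
    }
}
```

The business class contains no transaction code at all; the concern is applied from the outside, which is the property that declarative transaction demarcation relies on.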
While the Spring IoC container does not depend on AOP, meaning that the use of AOP is optional, AOP complements Spring IoC to provide a very capable middleware solution.

5.2 MONITORING DATA-WAREHOUSE

The data warehouse information is made available to administrators through the web interface. Its content is built using several tools, which are described in this section.

The Layering design pattern is also applied to the monitoring data-warehouse to break down the complexity of the system as a whole by identifying the different parts of the system and reducing coupling between them.

The following sections give an overview of the components that are used to combine the required information and present it as web reports.

5.2.1 BIRT

BIRT (Business Intelligence and Reporting Tools) is a reporting system for web applications. BIRT has two main components: a report designer based on Eclipse, and a runtime component. BIRT also offers a charting engine that allows charts to be added to an application.

Within the TED project, BIRT is intended to address a wide range of reporting needs including:
- Lists - The simplest reports are lists of data. As the lists get longer, BIRT supports grouping to organize related data together, as well as totals, averages and other summaries.
- Charts - For some reports numeric data are presented as a chart. BIRT provides pie charts, line charts, bar charts and many more. BIRT charts can be rendered in several formats.
- Crosstabs - Crosstabs (also called cross-tabulations or matrices) are used to display reports that need to represent data in two dimensions.
- Compound Reports - This kind of report is used to display the previously described elements side by side in a single document.

BIRT reports consist of four main parts: data, data transformations, business logic and presentation.
- Data - Several kinds of data sources may be used simultaneously with BIRT.
For the TED project the main data source is the data warehouse databases. JDBC is used as connector between the database and BIRT.
- Data Transformations - Reports present data sorted, summarized, filtered and grouped to fit the user's needs. While the database can do some of this work, BIRT is used to perform sophisticated operations such as grouping on sums, percentages of overall totals and more.
- Business Logic - Since data is seldom structured exactly as it is needed, some reports require business-specific logic to convert raw data into information useful for the user.
- Presentation - Once the data is ready, a wide range of display options may be used: tables, charts, text and more.

5.2.2 CACTI

Cacti is a complete network graphing solution designed to harness the power of RRDTool's data storage and graphing functionality. Cacti provides a fast poller, advanced graph templating, multiple data acquisition methods, and user management features out of the box.

Figure 5 CACTI Active Session diagram

5.2.3 WEBALIZER

Website traffic analysis is produced by grouping and aggregating the data items that the web server records in its log files while visitors browse the website.

Figure 6 Webalizer Traffic Analysis diagram

Figure 7 Webalizer Summary by Month diagram

5.3 LICENSE HOLDER ENVIRONMENT

The License Holder environment module is limited to the ProFTPD server and its modules. The content of the environment is generated by the content management module.
Then, a symbolic link used by the ProFTPD server is updated to put the new files at the disposal of the License Holders.

5.3.1 AUTHENTICATION AND LOGGING

The ProFTPD server for the License Holders makes use of a specific module to extend its functionality. The required functions are:
- Authentication using the user information contained in the MySQL database.
- Logging of the License Holder environment usage statistics. The log files will be parsed and the extracted information will be stored in the data warehouse database.

The mod_sql module is installed to add these two functions to ProFTPD. It comprises a front end module (mod_sql) and backend database-specific modules (mod_sql_mysql). The front end module leaves the specifics of handling database connections to the backend modules.

5.4 EMAIL ANALYSIS AND NOTIFICATIONS

The email analysis and notifications module is in charge of the analysis of received emails and the mailing of notifications and reminders to Contracting Authorities or web site users.

The email analysis and notifications module is implemented as an email processing agent built on top of the Apache James Mailet API. A mailet is a mail processing component which is executed within a mailet container. The Mailet API defines interfaces for both Matchers and Mailets:
- Matchers are used to match mail messages against certain conditions. They return some subset (possibly the entire set) of the original recipients of the message if there is a match. An inherent part of the Matcher contract is that a Matcher should not induce any changes in a message under evaluation.
- Mailets are responsible for actually processing the message.
They may alter the message in any fashion, or pass the message to an external API or component. This can include delivering a message to its destination repository or SMTP server.

In the TED project, Matchers are used to analyse the emails and detect spam. An internet blacklist is used to detect undesirable emails (any mail whose sender matches an entry in this blacklist is automatically forwarded to the spam folder). The “out of office” replies are also managed in a special way: the subject and content of all incoming mails are searched for a given pattern (for instance “*out of office*”). If the pattern matches, the mail is automatically flagged as out-of-office and forwarded to the out-of-office folder. The subject of these mails is prefixed by “Out of office”. Mailets, on the other hand, are used to send the notifications.

For the TED project, the Apache JAMES server is used as container and is responsible for the assembly and configuration of the deployed Mailets and Matchers.

5.5 WORKFLOW ENGINE

The workflow engine is responsible for the processing of the document files received by the Publications Office and the creation of the file system used for the creation of the DVD images (daily, weekly and monthly images). The operations performed are mainly transformations and indexing of the received files. The content management module also contains the production management dashboard, which is used to monitor and control the steps depicted in the figure below:

Figure 8 Production management steps

5.5.1 VALIDATION AND FILES TRANSFORMATION

The purpose of the Validation and files transformation step is to create the formatted content to be published from the received XML notices.
The prepared content is then stored in the content library. The different transformations may be executed at different times. Technically, the transformations are performed through XSLT for all the formats to be supported.

RSS feeds are generated for the publication day by querying the corresponding notices, formatting the RSS feed and storing the result in the content library. RSS feeds are generated after the creation of the index, as described in section 5.5.3.

Notice family changes are propagated to existing notices.

5.5.2 PDF GENERATION AND TIME-STAMPING

For the generation of PDFs, XSL-FO is used as an intermediate format and a custom version of Apache FOP as the composition engine. Apache FOP was customized in order to add support for PDF/A-1a. Standard compression and file organisation techniques are used to compile the results per publication channel.

The time stamping of the PDF/A-1a notices is performed by the PDF Time Stamping tool using open source PDF and cryptography libraries. Once the PDFs are time stamped, they are stored in the content library. (1)

(1) Note that PDF time stamping is currently not activated on the TED web site: a flag allows the time-stamping service to be put in a degraded mode.

5.5.3 INDEXING

All the files are indexed after validation and transformation using Apache Lucene. Documents are parsed to extract the elements needed for the search on specific fields as well as for the free text search. The indexing process is split into three distinct steps: creation of the five days index, creation of the active index and finally update of the archive index.
5.5.4 DVD IMAGE CREATION

Three distinct DVD images related to the dissemination of the Supplement to the Official Journal are created by the system during the production process. These image files contain different file types, which are mainly related to the PDF format:
- PDF/A-1a: the file format for the long-term archiving of electronic documents. It is based on the PDF Reference Version 1.4 from Adobe Systems Inc.
- PDF/A-1a time stamped: the time stamped version of the PDF/A-1a document file.
- PDX: the Acrobat Catalogue Index file contains the index of all the documents of an OJS issue. Acrobat Reader is able to use this kind of file directly to perform searches on the content of the documents. This file is built for the weekly DVD-ROM.

The PDF/A-1a and PDF/A-1a time stamped files of the documents are generated by the system from the XML documents. These transformations are explained in Validation and files transformation.

In the weekly DVD-ROM image, a PDX or PDF index file is created manually to index all the documents of the current OJS issue. Adobe Acrobat Professional is used for this purpose by the publishing operations team. For performance reasons, this tool is installed in the production environment to ensure direct access to the files to be indexed. Therefore, remote desktop access to Acrobat Professional is put at the disposal of the publishing team.

The table of contents PDF file is generated automatically by the system using iText. iText is a library available under the LGPL license for dynamic PDF document generation and manipulation.

The creation of the image file is an automated process triggered by the publishing operations team. This is achieved using the mkisofs tool, with support for the UDF format.

5.5.5 CONTRACTING AUTHORITY NOTIFICATION

Contracting Authorities are notified by the TED system about the publication of their notices.
A notification is sent to each Contracting Authority to inform it that its notices have been published in the OJS. The email contains a URL link to the notice of the corresponding Contracting Authority and the time-stamped PDF/A-1a.

The TED system also sends a reminder to the Contracting Authority for each contract notice that does not have a corresponding award notice.

In order to send reminder and notification emails, the TED system needs to retrieve the email address of the Contracting Authority for each specific notice. There is no “standard” way to extract this email address: the information does not exist in the common notice XML header, and a different extraction method exists for each type of form. The table named DOCUMENT_XML_INFO contains the XPath to the Contracting Authority email for the different types of forms. This implementation choice avoids hardcoding the extraction rules in the code and provides a much more flexible way to support new forms in the system.

5.6 NOTICE VIEWER

The notice viewer module is a stand-alone Java application responsible for the transformation of XML files into a well formatted version suited to be displayed directly. This application is installed on an Office server and is used directly from the command line. The notice viewer includes neither a user interface nor persistence capabilities.

The notice viewer transformation supports two output formats: HTML and PDF. It first validates the input notice against the XML schema, then performs a UTF-8 validation of the notice.
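The two validation steps mentioned above can be illustrated with the standard Java APIs (javax.xml.validation and java.nio.charset). The following is a minimal sketch only: the class name NoticeValidationSketch, its method names and the idea of passing the schema as a string are assumptions made for the example and do not describe the actual notice viewer code.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

/** Hypothetical helper: schema validation followed by a strict UTF-8 check. */
public class NoticeValidationSketch {

    /** Returns true when the notice validates against the given XML schema text. */
    public static boolean isSchemaValid(byte[] notice, String xsd) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as exceptions.
            validator.validate(new StreamSource(new ByteArrayInputStream(notice)));
            return true;
        } catch (Exception e) { // SAXException (invalid) or IOException (unreadable)
            return false;
        }
    }

    /** Returns true when the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Applies the two checks in the order given in the text: schema, then UTF-8. */
    public static boolean isAcceptable(byte[] notice, String xsd) {
        return isSchemaValid(notice, xsd) && isValidUtf8(notice);
    }
}
```

In this sketch a notice is rejected as soon as one of the two checks fails; how the real application reports validation errors is not specified here.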
Technically, the transformations are performed through XSLT for the generation of HTML files. For the generation of PDFs, XSL-FO is used as an intermediate format and Apache FOP as the composition engine. A file-based (read-only) HSQL database is used to retrieve the translations of the different reference data.

6 IMPLEMENTATION VIEW

The implementation view describes the overall structure of the implementation model and the decomposition of the software into modules and specific components.

6.1 TED WEBSITE

6.1.1 OVERVIEW

The TED application is packaged as two separate Web Archive (WAR) files that represent the TED website and the data-warehouse. This separation allows each of these applications to be deployed separately on different servers.

The following figure shows the physical contents of these web applications. Note that the two applications share the same file structure; they differ in their specific JSP pages and Java classes (along with their dependent Java libraries).

Figure 9 Web application file structure

6.1.2 TED XSL TRANSFORMATION

The TED Website relies on XSL transformations to transform the input TED_EXPORT XML file into the different presentation views: HTML or PDF.

During the daily processing (see Workflow engine) the export XML file is transformed to an internal XML (TED_INTERNAL) for each supported language. The TED internal XML format contains all the information needed for the presentation layer.
For instance, the reference data are translated in the internal XML file for each supported language, and the internal format is enriched with formatting information such as paragraphs, URLs and email addresses.

Only documents transformed during the daily processing are persisted on the file system. Historical documents are transformed into the TED internal format on the fly (from OJS 2005/206 to OJS 2010/041).

The following table shows the list of XSL transformations (input/output) for the TED system.

TED TRANSFORMATION
XSL | Input | Output
InternalOJS-ToInternalTed.xsl, InternalOJS-ToInternalTed_<<FORM>>.xsl | “2.0.5 DTD” xml or “TED_EXPORT 2.0.7” xml | TED_INTERNAL XML
InternalTed-To-HtmlTed.xsl | TED_INTERNAL xml | Notice HTML
InternalTed-ToHtmlDataViewTed.xsl | TED_INTERNAL xml | Notice data view HTML
InternalTed-ToXmlFOTed.xsl | TED_INTERNAL xml | Notice PDF
InternalTed-ToLicenseHolderMETA.xsl | INTERNAL_OJS xml | Notice Meta License holder
InternalTed-ToLicenseHolderUTF-8.xsl | INTERNAL_OJS xml | Notice UTF-8 License holder

Table 3: TED XSL Transformation

6.2 EMAIL ANALYSIS AND NOTIFICATIONS

The email analysis and notifications module is packaged in a JAR file that contains the classes developed for the handling and filtering of emails. This JAR is deployed on the James email server.

The email module is implemented as an email processing agent built on top of the Apache James Mailet API using the Matcher and Mailet interfaces.

6.3 WORKFLOW ENGINE

6.3.1 THE WORKFLOW ENGINE PACKAGE

The workflow engine is packaged as an executable JAR file that contains the classes developed for the file system creation, XML transformation, file indexing, and DVD generation.
The workflow engine is responsible for the instantiation and processing of a new flow for each publication date.

Figure 10 Workflow engine package Structure

6.3.2 THE WORKFLOW ENGINE IMPLEMENTATION

The workflow engine is composed of multiple flow definitions:
- The daily OJS flow: is responsible for the processing of the data for the next publication date.
- The User management flow: is responsible for the workflow management.
- The Contracting Authority reminder flow: is responsible for sending the notice reminders.
- The reporting flow: is responsible for the processing of the reports for Cacti and the data warehouse.
- The cleanup flow: is responsible for cleaning the file system of all temporary files.

Remark: there is also a specific flow that is used only once, for the historical data processing.
- The take up archive flow: is responsible for the processing of the full historical data already in production (5 years of publication).

Figure 11 Flow Service class diagram

Each workflow definition is defined in a Spring configuration. The flow definition must implement the ProdFlowService interface. It contains the list of steps that compose the flow.

The flow definition is composed of several steps, each responsible for the execution of a specific task according to the step specification. Those steps must implement the ProdStepService interface and specify the dependencies between the steps (waitingSteps).

6.3.3 THE INDEXING IMPLEMENTATION

During the Daily OJS flow several indexes are generated to provide a fast search engine to the TED Website. The indexing process is based on the Lucene framework. The TED application receives a TED_EXPORT XML file.
The input XML file is converted into a Lucene Document object where each value to be indexed is mapped using a key/value pair. Each field’s value is analysed using a standard Lucene Analyzer and is then indexed into the appropriate folder. A full description of the indexed fields is available in Table 4: Indexed fields.

Figure 12 Indexing Flow

Search field | Code
Awarding authority search fields
Country of the awarding authority | CY
Name of the awarding authority | AU
Place | TW
Type of awarding authority | AA
Internet address (URL) of the awarding authority | IA
Date search fields
Date document sent to the Publications Office | DS
Deadline for request of documents | DD
Deadline for receipt of tenders | DT
Publication date | PD
Reference search fields
Original language | OL
Number of reference document | RN
Document number | ND
Edition number of Supplement to the Official Journal | OJ
Codification search fields
Type of document | TD
Type of contract | NC
Type of procedure | PR
Origin (applicable regulation of procurement) | RP
Type of tender, division into lots | TY
Criteria for award of contract | AC
Title of document | TI
Main activity | MA
Title of the main activity | MN
Classification search fields
Original CPV code (until 16 September 2008) | OC
Original title of the CPV code (until 16 September 2008) | ON
Current CPV code (from 17 September 2008) | PC
Current title of the CPV code (from 17 September 2008) | PN
NUTS code | RC
Title of the NUTS code | RG
TED specific fields
Full text | FT

Table 4: Indexed fields

6.3.4 THE WORKFLOW TRANSFORMATION

Several steps of the daily processing use XSL transformations to convert the TED_EXPORT input files into other file formats such as license holder files or PDF notices.
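As an illustration of how such a transformation step can be implemented, the following minimal sketch applies a stylesheet with the standard JAXP API (javax.xml.transform). The class name XslTransformSketch, the sample stylesheet and the sample notice are hypothetical; they do not reflect the actual TED stylesheets or the TED_EXPORT format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper showing a single XSL transformation step. */
public class XslTransformSketch {

    /** Applies the given stylesheet to the given XML document and returns the result. */
    public static String transform(String xml, String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            // Compile the stylesheet, then run it against the input document.
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Illustrative stylesheet: extracts the title of a made-up notice as plain text.
        String xsl = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/notice\">"
                + "<xsl:value-of select=\"title\"/>"
                + "</xsl:template></xsl:stylesheet>";
        String xml = "<notice><title>Works</title></notice>";
        System.out.println(transform(xml, xsl));
    }
}
```

In a production flow the compiled stylesheet would typically be reused across many notices rather than recompiled for each document; whether the TED workflow engine does so is not stated in this document.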
Table 3: TED XSL Transformation shows the list of XSL transformations (input/output) used by the TED Workflow engine.

6.3.5 THE WORKFLOW MANAGEMENT TOOL

The workflow management tool (Dashboard or workflow management interface) application is packaged as a Web Archive (WAR). The workflow management tool allows the control of the production lines within a single user interface. It is implemented using standard Java Servlets and JSP. The communication between the management tool and the workflow engine is built over the socket API.

6.4 TED SYSTEM I18N SUPPORT

Two mechanisms are used to support multilingualism in the TED System (TED Website and TED Workflow engine). The translations of the business data (such as CPV, NUTS) are stored in the database and the interface messages are stored in XML files.

The reference data are all translated in the database in the table <code>_Description, which contains the translations in the 23 languages supported by the TED Website.

The Spring framework offers a simple mechanism to support i18n: ReloadableResourceBundleMessageSource. The labels in the TED Website are all translated using this Spring mechanism.

The files messages_<<language code>>.xml contain the labels and messages displayed to the users by the TED Website. The errors_<<language code>>.xml files contain the error messages shown to the users.

6.5 NOTICE VIEWER

The Notice viewer is packaged as a tar.gz archive that contains the classes developed for XML transformations. This archive contains all the dependencies necessary for the XML file transformation and production management.

The notice viewer implementation uses an embedded HSQL database to hold the reference data and associated translations for multilingual support.
The transformations are performed through XSLT for the generation of HTML and PDF files.

Figure 13 Notice Viewer structure

6.6 REFERENCE DATA

The reference data are the business code data. Each reference data item is composed of a code and 23 translations. All reference data are versionable; some of these data are also hierarchical.

The reference data are:
- Heading
- Country
- Country groups
- Type of authority (sector or awarding authority)
- Contract type (market code)
- Procedure type
- Document type
- Regulation type
- Type of bid
- Award criteria
- CPV code
- Business Sectors
- Main activity
- NUTS code
- Languages
- Extended CPV code (Additional vocabulary)

Reference data stored in the TED system may change over time, and the modifications must be taken into account, including the relationship between the codes of the new version and those of the previous one.

Modifications of reference data such as CPV codes have an important impact on the whole TED system, and especially on the search features.

To handle changes to reference data, the TED system uses a versioning algorithm that translates codes of an old version to those of the new one, in order to adapt the search features as much as possible. The content of a document is never modified: the search index will generally be modified after a code change, but the document itself will not change. Thus, it is possible that a free text search will find documents that do not have the searched text in their content.
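The translation of a code from version N to version N+1, whose cases are detailed in the subsections of this chapter (deletion, replacement by a single code, replacement by multiple codes), can be sketched as follows. This is an illustrative sketch under stated assumptions and not the actual TED versioning algorithm: the class name CodeVersionMappingSketch, the method resolve and the in-memory maps are hypothetical, and the move of a code within the hierarchy is not modelled.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the code translation applied when re-indexing documents. */
public class CodeVersionMappingSketch {

    /**
     * Resolves the code under which a document should be indexed in version N+1.
     *
     * @param oldCode    code used by the document in version N
     * @param mapping    old code to its replacement codes in version N+1;
     *                   a missing or empty entry means the code was deleted
     * @param oldParents old code to its parent in version N; a missing entry means
     *                   that there is no parent or that the data is not hierarchical
     * @return the code to index, or empty when the document can no longer be
     *         found through this criterion
     */
    public static Optional<String> resolve(String oldCode,
                                           Map<String, List<String>> mapping,
                                           Map<String, String> oldParents) {
        List<String> targets = mapping.getOrDefault(oldCode, Collections.emptyList());
        if (targets.isEmpty()) {
            // Deletion: the code is no longer available in the search interface.
            return Optional.empty();
        }
        if (targets.size() == 1) {
            // Replaced by a single code: re-index under the new code.
            return Optional.of(targets.get(0));
        }
        // Replaced by multiple codes: fall back to the parent of the old code,
        // or handle as a deletion when no parent exists.
        return Optional.ofNullable(oldParents.get(oldCode));
    }
}
```

With this sketch, a code mapped to exactly one new code resolves to that code, a code split into several new codes resolves to its former parent when one exists, and any other code resolves to an empty result, which corresponds to the deletion case.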
6.6.1 REFERENCE DATA: DELETION

Version N of the reference data contains codes that have been deleted in version N+1. When a code is deleted, it is no longer available in the search interface. Documents using the deleted code can no longer be found using the related criteria.

6.6.2 REFERENCE DATA: ADDITION

A new code has been added to version N+1. The new code is added to the search interface. There is no impact on the previous documents.

6.6.3 REFERENCE DATA: MODIFICATION

Several modification types can be foreseen, especially in the case of hierarchical data.

Case 1: A code in version N is replaced by a single code in version N+1. Documents that use the previous version of the code will be re-indexed in order to be found using the associated new version of the code.

Case 2: A code in version N is replaced by multiple codes in version N+1. In the case of hierarchical data, documents that use the previous version of the code will be re-indexed in order to be found using the parent of the code in the old version. If no parent exists or if the data is not hierarchical, this modification will be handled like a deletion.

Case 3: A code in version N moves in the hierarchy in version N+1. Documents that use the previous version of the code will be re-indexed in order to be found using its new parents. Searching for this code using the old parent codes will no longer be possible. An exception to this rule is made for Countries: when a country changes group, the documents are not re-indexed. In such a case, only new documents will be found using the new parent code.

Figure 14 Reference Data Model

6.7 CONTENT MODIFICATION

This section describes the procedure that must be followed to add a new form or to update the reference data.
6.7.1 ADDITION OF A NEW FORM

6.7.1.1 Prerequisite

The following information must be known before the addition of a new form to the TED system:
- Does the form contain specific business functionalities that must be reflected on the TED website? For instance, for a cancellation document a “cancelled” indicator is shown on the impacted document.
- Does the new form contain contracting authority email addresses?

In addition, the labels needed for the document view must be requested, translated into all languages.

6.7.1.2 Tasks

If the document contains contracting authority email addresses, then the XPath to the tag containing these addresses must be added to the table DOCUMENT_XML_INFO.

The new XSLT transformations must be implemented to generate the internal format, the HTML view, the PDF and the License Holder specific formats.

6.7.2 MODIFICATION OF REFERENCE DATA

Several modification types are foreseen regarding the reference data. First, all the information needed is described. Then the actions required for each type of modification are explained.

6.7.2.1 Prerequisite

- In case of the addition of a new code, all the labels in every language must be requested.
- In case of the addition of a new code in a hierarchical reference data, the place of the code in the hierarchy must be known.
- If the creation of a new version is foreseen, all the mappings between the current and the next version must be clearly identified.

6.7.2.2 Modification of an existing reference data version

A change is considered a modification of an existing reference data version in the following cases:
- Only labels of existing codes in the current version must be changed.
- New codes are added and all the codes in the current version must be kept.
In these cases the reference data tables must be updated with the required modifications. Then, if existing codes are modified, a full re-indexation for the reference data must be performed.

If the procedure type or document type reference data are impacted, please also refer to section 6.7.2.4.

6.7.2.3 Creation of a new reference data version

A new version of the impacted reference data is created in the following cases:
- Some of the codes in the current version are no longer used and must be removed from the website interface (search mask, browse, etc.).
- The code or the meaning of a reference data item changes.

A new version of the reference data must be created in the database:
- CODE_XXX table: mandatory
- CODE_XXX_VERSION table: mandatory
- CODE_XXX_MAPPING table: mandatory
- CODE_XXX_HIERARCHY table: mandatory if the reference data is hierarchical.

When the new version of the reference data becomes valid (not before!):
- All the documents must have been re-indexed using the new version of the code (with the help of the new mapping).
- In the DOCUMENT table, the column XXX_CURRENT_VERSION must have been updated with the id of the reference data in the latest version.

If the procedure type or document type reference data are impacted, please also refer to the next section.

6.7.2.4 Modification of procedure (PR) and document type (TD)

If the procedure type or document type reference data are modified, additional actions must be performed.

6.7.2.4.1 Procedure (PR) and document type (TD)

The combination of PR and TD indicates whether a document needs to be followed by an award. Therefore, if a new PR or TD code must be added, it is necessary to know whether the new code relates to documents that need a reminder or not.
This information is stored in the XX_CONTRACT_AWARD_NEEDED column of the main reference data table.

6.7.2.4.2 Document type (TD)

Some document types are used to indicate that a notice is a contract award. Therefore, if a new TD code is added, it is necessary to know whether the new code should be considered as an awarding type. If an awarding type must be added or removed, the column TD_CONTRACT_AWARD_NOTICE of table CODE_DOCUMENT_TYPE must be updated accordingly.

The same procedure must be followed for document types identified as corrigenda. The column TD_CORRIGENDA of table CODE_DOCUMENT_TYPE must be updated accordingly.

6.8 APPLICATION DEPENDENCIES

APPLICATION DEPENDENCIES
Application | Layer | External Systems / Dependency
Ted Application | Web Layer | Spring MVC, Spring Security
Ted Application | Integration Layer | Spring Integration, File System, MySQL Database, lucene
TED Workflow Application | Web Layer | JSP/Java Servlet
TED Workflow Application | Integration Layer | Spring Integration, File System, XSL, XSL-FO, FOP, iText, James Mail, MySQL Database, Mkisofs, lucene
Notice Viewer | | Spring Integration, XSL, XSL-FO, FOP, HSQL

6.9 BACKUP PROCEDURE

The daily backup is implemented using a full database backup and an incremental file system repository backup. These backups are configured to run overnight on each production lane.

The MySQL databases are additionally backed up using an export script. This script runs before the scheduled disk backup in order to ensure that the exports are also included in the backup. It produces a standard MySQL database export, which can be used for easy recovery into another database instance. The script adds another level of fault tolerance for the stored data on top of the replication mechanism.
Every day, the back-end, front-end, data warehouse and common backups are executed, and a temporary folder is created on the NFS to hold the different backups. A cron script is responsible for transferring the result to an external backup unit server.

6.9.1 DAILY BACK-END BACKUP PROCEDURE

Non-cluster database backup
To back up the back-end non-cluster databases, a dump is made for each back-end server. All non-cluster tables and views are dumped. Finally, a restore script is created for each dump.

Cluster database backup
The cluster database is spread over the two production lanes, so the backup dumps the entire database. To back up the cluster database, the MySQL Node manager is used. It takes a snapshot of each node of the cluster, and an archive of each snapshot is then made. Finally, a single restore script is created to restore each node of the cluster.

Repository file system backup
In order to reduce the duration of the file system backup, inotify (2) is used. It makes it possible to log all modifications made to a set of folders. Inotify is used on the repository folder to produce a file listing all files modified since the last backup. Using this file makes a faster rsync possible. Finally, the inotify file is also used to create an archive of the set of files modified since the last backup.

(2) inotify is a file change notification system, a kernel feature that allows applications to request the monitoring of a set of files against a list of events. When an event occurs, the application is notified.

Backup synchronisation between databases and file system
To back up a back-end server, its databases and file system must be synchronised. To achieve this, the following steps are performed:
1. a lock is put on the databases to prevent any modification;
2. a snapshot of the inotify file is made;
3. the locks are released.

TED data file system
First, only the indexes and the RSS files of one of the two servers are backed up. Second, all logs of both servers are backed up.

6.9.2 DAILY FRONT-END BACKUP PROCEDURE

Database backup
The same backup procedure as for the back-end non-cluster database backup is used.

TED data file system
The same backup procedure as for the back-end TED data file system backup is used. In this case, the logs and the license holder environment are backed up.

6.9.3 DAILY DATA WAREHOUSE BACKUP PROCEDURE

Database backup
The same backup procedure as for the back-end non-cluster database backup is used.

6.9.4 DAILY COMMON BACKUP PROCEDURE

The common backup is executed on all servers and is common to all of them.

Configuration file system
A snapshot of the configuration of each server is backed up.

TED data file system
For each backup, a snapshot of the logs is made.

7 DATA VIEW

This chapter describes the persistent data view of the system. More specifically, it explains the technical database columns and the functions required to implement version and session management, optimistic locking and user contexts.

The object-relational mapping of the persistence layer is implemented using Spring JDBC. Full information about the TED data model is available in [R02].

7.1 MYSQL CLUSTER

MySQL Cluster is used for the TED databases. MySQL Cluster is a high-availability, high-redundancy database adapted for the distributed computing environment. It uses the NDBCLUSTER storage engine to be able to run in a cluster.
A MySQL Cluster consists of a set of computers, each running one or more processes, which may include a MySQL server, a data node and a management server. MySQL Cluster is used within the TED system for documents and TED website data (also called volatile data).

The relationship of these components in a cluster is shown here:

[Figure 15: Database replication with MySQL Cluster. Components shown: MySQL clients, management client, SQL nodes, data nodes (four NDB nodes), NDB management server]

All these elements work together to form a MySQL Cluster. When data is stored in the NDBCLUSTER storage engine, the tables are stored in the data nodes. Such tables are directly accessible from all other MySQL servers in the cluster. The data stored in the data nodes for MySQL Cluster is mirrored; the cluster handles failures of individual data nodes.

The two major types of nodes are described below:
- Data node: This type of node stores cluster data. There are as many data nodes as there are replicas, times the number of fragments. A fragment is a portion of a database table; a table is broken up into and stored as a number of fragments. Under the NDB storage engine, each table fragment has a number of replicas stored on other data nodes in order to provide redundancy. The TED MySQL Cluster is configured with one fragment and four replicas, giving a total of four data nodes.
- SQL node: This is a node that accesses the cluster data. In the case of MySQL Cluster, an SQL node is a traditional MySQL server that uses the NDBCLUSTER storage engine.

The TED system uses one NDB node and one SQL node per back-end server, which makes a total of four NDB nodes and four SQL nodes. Each production lane has one NDB management server.
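The node counts above follow from simple arithmetic, which the short sketch below reproduces. It is an illustration only: the class `ClusterLayout` and its field names are hypothetical, and the figures (one fragment, four replicas, four back-end servers, two production lanes) are the ones stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Hypothetical summary of the cluster sizing described in section 7.1."""
    fragments: int
    replicas: int
    backend_servers: int
    lanes: int

    @property
    def data_nodes(self) -> int:
        # "As many data nodes as there are replicas, times the number of fragments."
        return self.replicas * self.fragments

    @property
    def ndb_nodes(self) -> int:
        # One NDB node per back-end server.
        return self.backend_servers

    @property
    def sql_nodes(self) -> int:
        # One SQL node per back-end server.
        return self.backend_servers

    @property
    def management_servers(self) -> int:
        # One NDB management server per production lane.
        return self.lanes


TED_LAYOUT = ClusterLayout(fragments=1, replicas=4, backend_servers=4, lanes=2)

if __name__ == "__main__":
    print("data nodes:", TED_LAYOUT.data_nodes)
    print("NDB nodes:", TED_LAYOUT.ndb_nodes)
    print("SQL nodes:", TED_LAYOUT.sql_nodes)
    print("management servers:", TED_LAYOUT.management_servers)
```

The two ways of counting agree: replicas times fragments gives four data nodes, and one NDB node on each of the four back-end servers gives the same four.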
7.2 TECHNICAL COLUMNS

Some database tables used to store business entities in the TED system have columns that do not hold business data but are used only to implement specific functionalities.

7.2.1 AUDIT SEGMENT

Each table of the MySQL TED databases contains a MODIFIED_ON column and a VERSION column, used for optimistic locking and for versioning:

Column       Data Type  Description
MODIFIED_ON  TIMESTAMP  The last update date of the record
VERSION      INT        The version number of the record
Table 5: Modified On and Version columns

Volatile data tables also contain the MODIFIED_BY column, which gives the identifier of the user who modified or created the entry.

Column       Data Type  Description
MODIFIED_BY  VARCHAR    The username of the user who has modified or created the entry
Table 6: Modified By column

The data is persisted into the database when the Spring JDBC ‘persist’ method is called. At that time, a trigger checks that the VERSION field of the updated entity is the same as the one stored in the database. This check lets the system detect whether another update has overwritten the data since it was last loaded during the session. If the version of the record to be updated matches the version of the record stored in the database, the record is updated; a trigger is then called, which updates MODIFIED_ON and increments the VERSION field of the record. Otherwise, an exception is raised, which prevents concurrent modification of the same entity.

8 DEPLOYMENT VIEW

The TED system runs on two distinct production lanes.
A front-end load balancer is in charge of routing every request to one of the production lanes, as shown in the figure below:

[Figure 16: TED Deployment diagram. Components shown: TED System, main load balancer, Production Lane 1, Production Lane 2]

The TED modules are deployed on different physical components. The following list contains all the servers required for one production lane and the association between the modules described previously and the servers on which they are deployed.

The front-end Web Server of each production lane is composed of:
- An Apache Web Server with a Tomcat load-balancing module, in charge of routing the HTTP requests to a server and serving the static resources (such as pictures);
- A ProFTPD Server, which is used for the License Holder environment module;
- A James email server, on which the Email analysis and notifications module is deployed;
- A MySQL Database for data warehouse information.

The two back-end servers of each production lane are composed of:
- A Tomcat Application Server, which hosts the TED website, the data warehouse and the TED Workflow management tool website;
- The workflow engine modules, which run on a specific JVM;
- A copy of the indexes used for the searches performed by the Web application;
- A copy of the public RSS feeds used by the Web application;
- A MySQL Database for document and volatile data.

The NFS server is mainly in charge of hosting the content library. It uses RAID level 5 to provide a high level of fault tolerance combined with high performance. The content library is sized to contain all the documents received from the Office and all the subsequent documents obtained by transformation.

The Network File System (NFS) server is composed of:
- A Windows XP OS via VMware for the Adobe PDF indexes of the weekly DVD (using Adobe Professional);
- A MySQL cluster manager node, in charge of managing the cluster replication data between the four instances (two on each production lane) of MySQL Cluster node.
The following schema gives an overview of the deployment of the different modules of the TED system on the different servers present in one production lane.

Figure 17: Deployment diagram

The production of the content runs in parallel on all production lanes. The objective is to have the information available on all production lanes, so that if one fails, the other one can continue to serve the content. The full content library is duplicated on the NFS of each production lane. The daily switch of the publication day is synchronized on each production lane to ensure that they serve the same content.

Initially, the entire TED system is composed of two production lanes. Some processing steps are executed on only one production lane (e.g. sending of emails). If one production lane breaks down, the operator executes these processes on the other lane.

The load balancers dispatch the incoming requests based on the load of each server. Session forwarding is used to send all requests from one user to the same server.

Some document-related information is replicated on the local MySQL instance of each back-end server. This task is performed by the production workflow; synchronization steps are used to update the databases of each back-end server to ensure data coherence.

Volatile database information (e.g. registered user information and document metadata) is inserted in all database servers simultaneously using the replication mechanisms offered by MySQL Cluster.

The data warehouse information is duplicated.
Each data warehouse database instance contains the information of all production lanes and is filled with the logs of all the servers and processes.

8.1 NETWORK FILE SYSTEM SERVER

Two network file system servers are configured, one on each production lane. Each NFS server holds the content library and the TED backup, and hosts a Windows XP virtual machine (via VMware) and a MySQL cluster manager node.

The result of the backup procedure is built and stored on each NFS; all this information is then copied to the backup site.

8.1.1 TED REPOSITORY FILE SYSTEM

The TED repository file system has the following structure:
/data/ted-[A|B]/ted-data/input
/data/ted-[A|B]/ted-data/repository
/data/ted-[A|B]/ted-data/dvd
/data/ted-[A|B]/ted-data/license-holder
where [A|B] means production lane A or production lane B.

TED REPOSITORY
Path                                     Information
/data/ted-[A|B]/ted-data/input           contains the original compressed input file
/data/ted-[A|B]/ted-data/repository      contains the processed files (XML, PDF, PDF time stamped)
/data/ted-[A|B]/ted-data/dvd             contains the latest daily, weekly and monthly DVD
/data/ted-[A|B]/ted-data/license-holder  contains the license holder files (UTF-8 and META-XML formats)

8.1.2 TED TEMPORARY BACKUP FILE SYSTEM

Each NFS holds a temporary backup file system, which is built during each backup process. This is temporary storage for backed-up data; the whole content of this folder is copied to the backup site in a second phase.
The TED backup file system has the following structure:

TED BACKUP
Path                                 Information
/data/ted-[A|B]/backup/daily-backup  contains the daily backup files
/data/ted-[A|B]/backup/repository    contains a full repository backup of /data/ted-[A|B]/ted-data/repository
/home/backup                         contains backup scripts
/home/backup/log                     contains the backup of processing logs

8.1.3 TED MIRROR BACKUP FILE SYSTEM

This section describes the file system present on the backup machines at the backup site. The TED mirror backup file system has the following structure:

TED BACKUP
Path                                         Information
/data/backup/Prodlane-[A|B]/daily-backup     contains the mirror of the daily backup files folder
/data/backup/Prodlane-[A|B]/repository       contains the mirror of the repository backup folder
/data/backup/Prodlane-[A|B]/repository-delta contains delta repository files
/home/backup                                 contains synchronization backup scripts
/home/backup/log                             contains synchronization backup processing log

8.1.4 WINDOWS XP VIA VMWARE

The sole purpose of the Windows XP virtual machine (via VMware) is to host Adobe Professional, which is used to generate the PDX (Acrobat Catalogue Index) file to be included in the weekly DVD. The Samba protocol is used between the virtual Windows machine and its host to share the file system.

8.2 JAMES EMAIL SERVERS

Two James email servers are installed, one on the front-end of each production lane. This implies a specific DNS configuration, explained in the following section.

8.2.1 DNS CONFIGURATION

Two MX (Mail Exchanger) entries are registered on the DNS server of “ted.europa.eu”: mail1.ted.europa.eu and mail2.ted.europa.eu.
These entries ensure that any SMTP request reaches the requested mail server (James of production lane 1 for mail1.ted.europa.eu and James of production lane 2 for mail2.ted.europa.eu). This means that the load balancing process is not triggered when accessing these mail servers.

8.2.2 SPAM FOLDERS

The spam folders are accessible over SSH for browsing, downloading and burning the suspected spam content to CD.

8.3 DATABASE ORGANISATION

A TED production lane is organised in five databases:
- On each front-end server, one TED_DATAWAREHOUSE database: this is the database that holds the data related to the data warehouse and monitoring. There is no replication between the two production lanes: each TED_DATAWAREHOUSE database contains the full TED data warehouse data;
- On each back-end server, a single MySQL instance runs. On this instance, tables are created with two distinct engines: the first for tables that must be created using the MySQL cluster and the second for tables that must be created locally. The TED schema contains local tables and the TED cluster schema contains tables shared over the cluster.

In summary, as there are two production lanes, and considering that the cluster database is shared among all the back-end servers in all production lanes, the total number of MySQL databases is seven:
- One instance per front-end server, for a total of 2 instances.
- One instance per back-end server, for a total of 4 instances.
- One cluster instance shared among the back-end servers.
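The two counts in this section (five databases per production lane, seven MySQL databases overall) can be checked by enumeration, as in the sketch below. It is a reading of the text rather than a definitive inventory: it assumes one front-end server and two back-end servers per lane, and counts the TED schema and the TED cluster schema of each back-end server as two databases. All function names and labels are hypothetical.

```python
LANES = ("A", "B")
BACK_END_SERVERS_PER_LANE = 2  # assumption: two back-end servers per lane


def databases_per_lane():
    """List the databases visible on one production lane."""
    databases = [("front-end", "TED_DATAWAREHOUSE")]
    for number in range(1, BACK_END_SERVERS_PER_LANE + 1):
        server = f"back-end-{number}"
        databases.append((server, "TED schema (local tables)"))
        databases.append((server, "TED cluster schema (shared tables)"))
    return databases


def mysql_instances():
    """List the instances counted in the summary of section 8.3."""
    instances = []
    for lane in LANES:
        instances.append(f"lane {lane}: front-end instance")
        for number in range(1, BACK_END_SERVERS_PER_LANE + 1):
            instances.append(f"lane {lane}: back-end-{number} instance")
    # The cluster database is shared by all back-end servers, so it is
    # counted once for the whole system.
    instances.append("shared cluster instance")
    return instances


if __name__ == "__main__":
    print(len(databases_per_lane()), "databases per lane")
    print(len(mysql_instances()), "MySQL databases in total")
```

Counting the shared cluster once, rather than once per back-end server, is what separates the per-lane figure from the system-wide total.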