Project Acronym: FIRST
Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making
Project Number: 257928
Instrument: STREP
Thematic Priority: ICT-2009-4.3 Information and Communication Technology

D2.2 Conceptual and technical integrated architecture design

Work Package: WP2 – Technical analysis, scaling strategy and architecture
Due Date: 30/09/2011
Submission Date: 30/09/2011
Start Date of Project: 01/10/2010
Duration of Project: 36 Months
Organisation Responsible for Deliverable: ATOS
Version: 1.0
Status: Final
Author Name(s): Mateusz Radzimski, Murat Kalender (ATOS), Miha Grcar (JSI), Achim Klein, Tobias Haeusser (UHOH), Markus Gsell (IDMS), Jan Muntermann (UGOE), Michael Siering (GUF)
Reviewer(s): Achim Klein (UHOH), Irina Alic (UGOE)
Nature: R – Report (other options: P – Prototype, D – Demonstrator, O – Other)
Dissemination level: PU – Public (other options: CO – Confidential, only for members of the consortium, including the Commission; RE – Restricted to a group specified by the consortium, including the Commission Services)

Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Revision history

Version 0.1 (18/04/2011), Mateusz Radzimski (ATOS): Early version of ToC provided.
Version 0.2 (13/06/2011), Mateusz Radzimski (ATOS): Primary content for "Requirements Analysis" section.
Version 0.3 (04/07/2011), Murat Kalender (ATOS): Primary content for "Integration approach" section.
Version 0.4 (11/07/2011), Mateusz Radzimski (ATOS): Primary contribution to "Architectural perspective" section.
Version 0.5 (26/07/2011), Mateusz Radzimski (ATOS), Murat Kalender (ATOS): Added content to "Architectural perspective" section, added Annex 1. Further contribution to "Integration approach" section.
Version 0.6 (9/08/2011), Markus Gsell (IDMS): Added contribution to "Data storage design principles".
Version 0.7 (10/08/2011), Mateusz Radzimski (ATOS): Document refactoring and minor corrections according to teleconference discussions.
Version 0.8 (18/08/2011), Murat Kalender (ATOS): Added "Integrated GUI" subsection.
Version 0.8.5 (29/08/2011), Murat Kalender (ATOS), Mateusz Radzimski (ATOS), Miha Grcar (JSI): Changes to "Integrated GUI" chapter. Added "Hosting platform" chapter.
Version 0.9 (14/09/2011), Miha Grcar (JSI), Achim Klein, Tobias Haeusser (UHOH), Markus Gsell (IDMS), Jan Muntermann (UGOE), Michael Siering (GUF), Mateusz Radzimski (ATOS): Added chapter "Example of the FIRST process".
Version 0.91 (27/09/2011), Mateusz Radzimski (ATOS), Murat Kalender (ATOS), Markus Reinhardt (NEXT): Addressed reviewers' comments, added "Industrial perspective" subchapter, contributions to "Design" chapter.
Version 0.95 (29/09/2011), Mateusz Radzimski (ATOS): Final version.
Version 1.0 (30/09/2011), Tomás Pariente (ATOS): Final QA and preparation for submission.

Copyright © 2011, FIRST Consortium

The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Executive Summary

This document provides a comprehensive architectural analysis of the FIRST system. Based on both technical and functional requirements, it defines a candidate architecture and technical design that will ensure the goals of the FIRST large-scale financial information system are met. The analysis also investigates the system integration approach and selects the most suitable architectural patterns, mechanisms and technologies. It covers the high-level organisation of subsystems, their interactions, and specific technical aspects of individual components, with an emphasis on performance and high scalability.

Table of Contents

Executive Summary
Abbreviations and acronyms
1. Introduction
2. Requirement analysis
   2.1. Relation of technical requirements and user requirements
   2.2. Architecturally significant requirements
3. FIRST architectural perspective
   3.1. Overall project goals and architectural considerations
   3.2. High level FIRST architectural view
4. Integration approach
   4.1. State of the art on integration approach
      4.1.1 Application Integration
      4.1.2 Enterprise Application Integration
   4.2. Pipeline processing
   4.3. GUI integration
      4.3.1 Introduction
      4.3.2 Web Application Frameworks
   4.4. Data storage design principles
      4.4.1 Storage paradigms
      4.4.2 Access layer
      4.4.3 Mediation layer
5. Design
   5.1. Detailed components interaction perspective
   5.2. Sample FIRST process
      5.2.1 Sample scenarios
      5.2.2 Data acquisition and preprocessing pipeline
      5.2.3 Information extraction pipeline
      5.2.4 Decision support models
      5.2.5 Data exchange between pipeline components
      5.2.6 Role of the integration layer
      5.2.7 GUI integration
      5.2.8 Role of the storage components
6. Deployment
   6.1. Hosting platform
   6.2. Deployment scenarios
   6.3. Industrial perspective of the FIRST system
7. Conclusion
References
Annex 1. Requirements groups

Index of Figures

Figure 1: Integrated architecture in the context of other workpackages
Figure 2: Architectural mechanisms applied to requirements (Eeles, 2005)
Figure 3: FIRST High-level architecture – logical view
Figure 4: Remote Procedure Invocation Architecture (Hohpe & Woolf, 2003)
Figure 5: Communication model of a broker based messaging system (Zeromq, 2010)
Figure 6: Communication model of a brokerless messaging system (Zeromq, 2010)
Figure 7: Communication models of the Enterprise Application Integration topologies (Kusak, 2010)
Figure 8: Messaging systems performance comparison results
Figure 9: ZeroMQ messaging patterns (Piël, 2010)
Figure 10: ZeroMQ messaging patterns performance comparisons
Figure 11: Data Acquisition and Information Extraction pipeline integration architecture
Figure 12: Client/server application and web application architectures (Howitz, 2010)
Figure 13: Detailed component view (overview)
Figure 14: Detailed component view (with highlighted interactions and data flow)
Figure 15: An example of the topic trend visualization
Figure 16: DJIA vs. smoothed time series of the sentiment index (see (Klein, Altuntas, Kessler, & Häusser, 2011))
Figure 17: The current topology of the data acquisition and preprocessing pipeline
Figure 18: Canyon Flow: a "view" of a hierarchy of document clusters
Figure 19: The Canyon Flow pipeline and the corresponding part of the Web-based integrated GUI
Figure 20: Annotated document corpus serialized into XML
Figure 21: Annotated document serialized into HTML and displayed in a Web browser
Figure 22: Integration between data acquisition and information extraction components with load balancing
Figure 23: Example of synchronous high-level services invocation
Figure 24: Data-push mechanism for notification services
Figure 25: Instrument Cockpit of MACOC application augmented with FIRST data (mockup)

Index of Tables

Table 1: Use case vs. technical requirements cross matrix
Table 2: Architecturally significant requirements analysis
Table 3: GWT advantages and disadvantages
Table 4: JSF advantages and disadvantages
Table 5: Spring MVC advantages and disadvantages
Table 6: A comparison of notable Java web application frameworks (Raible, Comparing JVM Web Frameworks, 2010)
Table 7: Main characteristics of relational and non-relational storage paradigms
Table 8: Portfolio selection scenario based on DJIA stocks and sentiment extracted from blog posts (Klein, Altuntas, Kessler, & Häusser, 2011)

Abbreviations and acronyms

DOW      Description of Work
WP       Workpackage
TBD      To be defined
SOA      Service Oriented Architecture
API      Application Programming Interface
ESB      Enterprise Service Bus
UC       Use Case
PUB/SUB  Publish/Subscribe
REQ/REP  Request/Reply
JVM      Java Virtual Machine
CLR      Common Language Runtime
BOW      Bag of words
MVC      Model View Controller
UI       User Interface
JSF      Java Server Faces
GWT      Google Web Toolkit
1. Introduction

The main purpose of this document is to establish a shared architectural and technical understanding within the project. The design affects all technical components developed in other workpackages by outlining possible interaction patterns, dependencies and data flow. The architecture is also heavily influenced by the articulated requirements, both technical and use-case driven. For example, the processing methods envisaged for analysing huge data streams provide an important input that constrains the architectural techniques and determines further technological choices. The idea is therefore to encompass all such requirements and constraints in a coherent design that will enable seamless future development of the FIRST system.

It is also important to keep the description at a proper level of abstraction to avoid over-engineering the design. FIRST, being a research project, is driven by experiments and improvements over current state-of-the-art techniques; defining all details at this stage of the project would therefore be infeasible. Instead, those details will be presented along with the prototype release milestones throughout the project lifetime in the corresponding deliverables. The relation of this document to other deliverables and technical workpackages is presented in Figure 1.

Figure 1: Integrated architecture in the context of other workpackages

2. Requirement analysis

This section recaps and analyses the technical requirements described in D2.1 that concern the architecture and behaviour of the overall FIRST system. To ensure that the requirement analysis is sound and provides firm ground for the technical architecture design, we study and evaluate the requirements collected in D1.2 (use-case perspective) and D2.1 (technical perspective) in order to assess how the business use cases defined in WP1 are satisfied by the technical provisioning captured in the D2.1 technical requirements analysis.

It is also important to analyse the requirements with regard to the benefit they bring to the overall system and their viability within the limits of the project's resources and of technical and scientific feasibility. This allows requirement priorities to be assigned accordingly, deciding which functionalities are essential for satisfying the use cases and project goals and which should be treated as supplemental.

A significant part of this chapter is devoted to the analysis of architecturally significant requirements, i.e. requirements that have a clear technical impact on the project and influence the architecture and design of the system. We proceed by choosing the relevant ones and, where necessary, extending them with proposed design and technological details. They will serve as enablers of the architectural analysis.

2.1. Relation of technical requirements and user requirements

The use case requirement specification described in D1.2 forms a comprehensive overview of the system as seen by the use case stakeholders, through the description of system functionalities, non-functional attributes, actors, and context. A similar list has been provided for each of the three use cases. By analysing all requirements together from a functional point of view, eight main category groups can be identified:

1. Data feeding and acquisition
2. Retrieval of topics and features
3. Retrieval and classification of sentiments
4. User interface and service delivery
5. Access control and security
6. Configuration and maintenance
7. Storage, persistence and data access
8. Decision support and visualisation
The requirements falling into each category are closely related and define a common fragment of system functionality. Note that non-functional requirements are orthogonal to the aforementioned groups and are thus not listed here. The assignment of each requirement to the groups above is presented in Annex 1.

By clustering use case requirements into functional categories, we reduce the number of items to analyse, making it viable to perform a requirement coverage breakdown using a requirements traceability matrix. Table 1 shows the coverage of each group (horizontal axis) by the relevant technical requirements (vertical axis). The analysis simplifies the multidimensional nature of the system requirements into a two-dimensional matrix. By identifying cross-reference relationships we obtain a quantitative result indicating how well each functionality has been described in terms of technological provisioning. The table should be read as: technical requirement R1.1 covers some aspects of global functionality no. 1. The numbers in the row titled "Quantitative technical coverage" denote how many technical requirements are related to a certain functionality.

Table 1 cross-tabulates the technical requirements R1.1–R8.4 (covering Internet connection bandwidth; concurrent execution of processes for the hardware and software infrastructure; memory and persistent storage; API for external access; flexibility of the infrastructure; logging and monitoring; stability; pipeline latency and throughput; document format and encoding; interchange data format; data formats; unified access; ontology format, availability, purpose and evolution; data acquisition pipeline functionality and supported Web content formats; information extraction components; sentiment analysis; decision support model features and streams; and the programming/runtime environments for the data acquisition pipeline, the information extraction pipeline, the knowledge base and the decision support components) against the eight functional categories. The resulting quantitative technical coverage per category is: 1. Data feeding and acquisition: 13 (High); 2. Retrieval of topics and features: 14 (High); 3. Retrieval and classification of sentiments: 12 (High); 4. User interface and service delivery: 2 (Low); 5. Access control and security: 1 (Low); 6. Maintenance and configuration: 2 (Low); 7. Storage, persistence and data access: 7 (Medium); 8. Decision support and visualisation: 6 (Medium).

Table 1: Use case vs. technical requirements cross matrix

It can be observed from the cross-reference analysis that the coverage is lower for the storage and decision-support components (Categories 7 and 8).
This is due to the fact that, at this stage of the project, the data acquisition and preprocessing pipeline, as well as the information retrieval framework, are relatively well defined and early implementations have already started, while the decision-support models are yet to be fully defined. For this reason, fewer constraints have been put on this part of the system to ensure enough flexibility when pursuing the use-case goals at a later time; these components will receive more technical coverage in the scope of the WP6 deliverables, according to the work plan. Consequently, the data storage components will need to adapt to the system at that time in order to properly store the models, predictions, and potentially other relevant data and metadata.

Categories 4, 5 and 6 have relatively low coverage. Category 4 ("User interface and service delivery") will be described further in the context of integration and GUI provisioning (see chapter 4.3) for building the Integrated Financial Market Information System (WP7). Categories 5 and 6 are not considered main research and technical challenges within the project, and their priorities are of least importance in comparison with the others.

2.2. Architecturally significant requirements

From an architectural perspective, some requirements are more important than others. Those which contain attributes or constraints related to architecture and design are called architecturally significant requirements, and the requirements already defined in D1.2 and D2.1 contain such items. The FURPS+ classification model (Grady, 1992) divides requirements along the following characteristics (FURPS): Functionality, Usability, Reliability, Performance, and Supportability. Those requirements have been defined mostly in (FIRST D1.2 Use case requirements specification, 2011). The "+" sign in the FURPS+ model adds further types: Design, Implementation, Interface and Physical requirements. These are covered by the additional technical requirements specified in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). While the latter group usually defines concrete design or implementation features, thus having an implicit influence on the architecture, the former group may need further analysis and elaboration (see Figure 2) in order to define its effect on the architecture.

Figure 2: Architectural mechanisms applied to requirements (Eeles, 2005)

Table 2 presents the relevant architecturally significant requirements chosen from (FIRST D1.2 Use case requirements specification, 2011) and (FIRST D2.1 Technical requirements and state-of-the-art, 2011), together with a short analysis of their technical impact on the overall architecture. These aspects will be further considered when defining the FIRST system design.

Requirement: R4.1 "Pipeline latency" (also in the use case non-functional requirements, e.g. UC3.RP1) and R4.2 "Pipeline throughput"
Analysis & design mechanisms: The huge data streams in FIRST will be processed in a pipeline fashion, which requires a different integration approach than typical SOA-based systems. Since data arrives in a constant stream, explicit requests might increase latency and lower overall throughput by inducing extra traffic (R4.2).
Influence on architecture: The most important aspect is to analyse approaches other than Web Services (WS) and SOA, especially those suitable for stream processing, and to compare them on performance and throughput. Some components may end up being integrated using such more performant techniques, while others (less performance-constrained) follow traditional approaches.

Requirement: UC1.R-EU1 "Alert Reports and Cockpit"
Analysis & design mechanisms: The alert and notification features of the use cases carry an implicit assumption that messages are delivered as soon as they appear in the data stream.
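To make the "push"-based delivery concrete, the sketch below shows how an alert could be published to all interested subscribers the moment it is detected, instead of having clients poll. This is an illustration only, not FIRST code: it assumes the Java binding of ZeroMQ (a messaging library evaluated in chapter 4), and the endpoint address, topic name and alert text are hypothetical.

```java
import org.zeromq.ZMQ;

// Illustrative publish-subscribe sketch of push-based alert delivery.
// The ZeroMQ Java binding is assumed; endpoint and topic are hypothetical.
public class AlertPublisher {
    public static void main(String[] args) {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket pub = ctx.socket(ZMQ.PUB);
        pub.bind("tcp://*:5556");

        // The alert is pushed to every connected subscriber as soon as it
        // occurs; no client has to keep asking "did alert X appear in the data?".
        pub.send("alerts Sentiment for instrument X dropped below threshold".getBytes(), 0);

        pub.close();
        ctx.term();
    }
}

class AlertSubscriber {
    public static void main(String[] args) {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket sub = ctx.socket(ZMQ.SUB);
        sub.connect("tcp://localhost:5556");
        sub.subscribe("alerts".getBytes()); // receive only messages on the "alerts" topic

        String alert = new String(sub.recv(0)); // blocks until an alert is pushed
        System.out.println("Received: " + alert);

        sub.close();
        ctx.term();
    }
}
```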
Constant polling for messages ("did alert X appear in the data?") might quickly become unscalable with a growing amount of data and registered alerts.
Influence on architecture: The system should deliver alerts and notifications in a "push"-based manner, directly to the subscribed parties. This approach should be preferred over typical polling for data, which might become inefficient in the future.

Requirement: R8.1, R8.2, R8.3
Analysis & design mechanisms: The system comprises components written using different technologies, so direct programmatic integration is not possible. A common integration mechanism that ensures robust exchange of data between different execution environments should be provided.
Influence on architecture: An integration middleware connecting such components should be able to communicate in a technology-independent manner.

Requirement: R7.1-6 and R-E3.1, R-E3.2
Analysis & design mechanisms: The technical components involved in data processing provide functionalities used by end-user GUIs or for integration with other applications (e.g. the use case implementations). This requires exposing them in a standard way, as APIs abstracted from the underlying technical components, accessible to other interested components.
Influence on architecture: Exposed functionalities should be reflected in the architectural overview as a high-level service layer, comprising APIs for developing the Integrated Financial Market Information System and the use case implementations.

Requirement: R2.2 "Flexibility of the infrastructure"
Analysis & design mechanisms: Architectural flexibility requires a certain level of component decoupling, and the integration middleware should possess such characteristics. This will also support the development and testing process.
Influence on architecture: FIRST will follow component-based design techniques; the integration middleware will allow clear system decomposition and will support deploying system components independently.

Requirement: R2.3 "Concurrent execution of processes – software infrastructure"
Analysis & design mechanisms: To support concurrent processing, the architecture should allow techniques such as parallelisation from the very beginning. This affects the middleware layer as well as individual components.
Influence on architecture: Data distribution and collection of results across different components should be possible and easily supported by the architecture, as should techniques such as load balancing and distributed processing.

Note: Requirement IDs correspond to their counterparts defined in (FIRST D1.2 Use case requirements specification, 2011) and (FIRST D2.1 Technical requirements and state-of-the-art, 2011).

Table 2: Architecturally significant requirements analysis

3. FIRST architectural perspective

This chapter provides a high-level overview of the FIRST architecture in support of the FIRST project goals. It also depicts a coarse-grained decomposition of the main FIRST building blocks and explains how the different use cases are built on top of the generic FIRST architecture.

3.1. Overall project goals and architectural considerations
The role of defining the architecture is to translate the project vision and the defined requirements into a common technical platform that will successfully fulfil the project's goals. Architectural analysis focuses on defining a candidate architecture and constraining the architectural techniques to be used in the system (IBM, 2009). The input for this task comes from the project goals, the use case analysis and the requirements definition, so as to provide a suitable architecture that ensures the overall project meets its objectives. It especially accommodates the non-functional and technical requirements that strongly influence the final design and implementation. The result is a technical overview of the organisation and structure of components (IBM, 2009).

From the global project perspective, the objective of the system is to improve the decision making process (e.g. reduce investment risk, increase return; see (FIRST D1.1 Definition of market surveillance, risk management and retail brokerage use cases, 2011)) based on unstructured data sources, such as news articles or blogs. For that reason, the system is required to process large amounts of data and perform automatic analysis that would otherwise be impossible. The architecture should therefore enable the system to:

- process, analyse and make sense of huge amounts of unstructured data,
- provide the results of financial news analysis (articles, blogs, etc.) and relevant information extraction in reasonably short time (near real-time),
- integrate components and algorithms in order to automatically process financial data and allow sophisticated decision models to be applied,
- offer a graphical interface to present and visualise data relevant for decision making.

Furthermore, such an architecture should (among other characteristics) ensure maintainability, extendibility and flexibility. Flexibility, for example, may enable further exploitation possibilities, while maintainability and extendibility allow for a seamless development process that may include introducing changes and technical modifications during the development and later stages of the project.

The architecture should follow a component-based structure. While the most important functionality comes from the data processing pipeline (FIRST D2.1 Technical requirements and state-of-the-art, 2011), individual parts of the system might also be subject to separate exploitation in the future. However, technologies supporting modular architecture and decomposition, by introducing a new layer of abstraction, usually impose some performance limitations or extra message overhead; in the case of SOA integration approaches this may be a central broker or an application server for deploying services. In FIRST we may divide the architecture integration into two parts:

Lightweight integration for pipeline processing
Integration of the components directly involved in processing unstructured documents (pipeline processing components) and exchanging huge amounts of data; the focus is mostly on lightweight and performance-oriented approaches that will ensure the project goals.
Classical SOA-based system integration
Integration of the other Integrated Financial Market Information System components, such as the services for constructing the user interface, the use case implementations and access to stored data; this integration should follow the best known integration approaches. Some features of components involved in pipeline processing might also be exposed as non-strictly performance-oriented services; i.e., offering fragments of the pipeline as separate services might provide added value for further project exploitation or for the individual exploitation of components. Reusing components as services could also be viable: for example, isolated sentiment analysis functionality can be wrapped as a service and offered separately. This part also contains the GUI implementation, which is analysed later in this document.

The overall FIRST architecture will accommodate both aforementioned techniques in a coherent way in order to satisfy the project goals.
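As a sketch of what such service wrapping could look like, the following hypothetical Java interface exposes isolated sentiment analysis functionality as a separately offered service. The actual FIRST APIs are defined in the corresponding technical deliverables; the method name, signature and score range here are assumptions for illustration.

```java
// Hypothetical sketch of wrapping isolated sentiment analysis
// functionality as a standalone service; not the actual FIRST API.
public interface SentimentAnalysisService {

    /**
     * Analyses one unstructured document (e.g. a news article or blog post)
     * and returns a sentiment score, assumed here to lie in [-1.0, 1.0],
     * from strongly negative to strongly positive.
     */
    double analyzeSentiment(String documentText);
}
```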
3.2. High level FIRST architectural view

This section presents the architectural, high-level perspective of the FIRST system. It provides a structural view of the components and explains their relationships with the other parts of the system. The static analysis is covered in this chapter, while a more detailed analysis, including the dynamic view, is presented in chapter 5. We loosely follow the approach of presenting the architecture as a set of different but coherent views of the system and its environment (IEEE Std 1471-2000, Recommended Practice for Architectural Description, 2000).

From the software engineering perspective, the very heart of the project is the sentiment analysis analytical pipeline for information extraction from unstructured sources (FIRST D2.1 Technical requirements and state-of-the-art, 2011). It provides the core functionality and serves as the common part supplying the necessary data to all use cases. To allow flexible implementation of, e.g., use case functionalities, a standard API will be defined that exposes high-level services to the use case providers and to the FIRST GUI demonstrating the system's capabilities. This API will further allow other parties to develop their own applications or integrate it with their own systems in order to offer new services and added value on top of the FIRST system. Based on that description, a multi-tier architecture is a suitable choice for describing the system: from the analytical point of view, a multi-tiered architectural pattern allows the different layers (tiers) in the system to be clearly distinguished and separated (Fowler, 2002). In the FIRST system, the logical, high-level view consists of the following layers (as depicted in Figure 3).

Figure 3: FIRST High-level architecture – logical view

The system, as envisaged from the architectural point of view, consists of the following parts, depicted as layers. Starting from the top, those are:

Use case and GUI implementation layer – consists of the implementations of the three FIRST use cases (market surveillance, risk management, retail brokerage) plus the FIRST Integrated GUI (Integrated Financial Market Information System). It includes the end-user interfaces and the provisioning of graphical widgets for displaying computation results, including sentiments, alerts, event predictions, decision support and data stream visualisations. Moreover, the FIRST Integrated GUI provides an "entry point" for showcasing FIRST functionalities. All four parts are implemented against the FIRST APIs and include the necessary technological provisioning (e.g. a web application deployment server).

High-level FIRST services layer (FIRST APIs) – a set of higher-level services running on top of the FIRST Analytical Pipeline and providing the necessary access to its computation results. They provide all concrete functionalities offered by the underlying technical components, wrapping them as services and therefore forming a logical abstraction over the high-performance lower-level components. These services are delivered by technical components implemented within workpackages WP3, WP4, WP5, WP6 and WP7.

Middleware/integration layer – provides a fast, robust and lightweight infrastructure for integrating the different components of the FIRST Analytical Pipeline while supporting low latency, high performance and throughput. It also offers advanced techniques for general pipeline scaling, as explained in (FIRST D2.3 Scaling Strategy, 2011). This layer integrates the components developed within workpackages WP3, WP4, WP6 and WP7 that take part in pipeline processing.

Pipeline components layer – the FIRST Analytical Pipeline is a set of components (data acquisition, ontology learning, information extraction, sentiment analysis, decision support) that process the stream of documents in a sequential manner. While in principle we use the term "pipeline" in the singular, it may consist of several parallel "pipelines" balanced for handling bigger data streams. The integration of these components is provided by the middleware/integration layer.
Design has been carried out within WP5’s (FIRST D5.1 Specification of the information-integration model, 2011). © FIRST consortium Page 17 of 53 D2.2 4. Integration approach In the following sections we analyse common integration approaches and choose most suitable for the FIRST architecture. 4.1. State of the art on integration approach The term integration, in computer science domain, expresses the process of making disparate software modules work together by exchanging information in order to build a complete application or fulfil a task. Integration can be categorized into three types according to the application areas (Westermann, 2009): Application Integration (AI), Enterprise Application Integration (EAI) and Business Integration (BI). In general, Application Integration makes applications to exchange information. Messaging is one of the commonly used approaches in Application Integration. Enterprise Application Integration builds on application integration methodologies by dealing with integration and orchestration of enterprise applications and services. Enterprise Message Bus and Enterprise Service Bus are commonly used technologies in EAI. Business Integration builds on EAI methodologies which deal with technical infrastructure of an organization such as exposing parts of business’ operations on the public internet for use of costumers (Westermann, 2009). A scalable and efficient integration platform is required in order to communicate components of the FIRST analytical pipeline and build a complete system. In the following sections, application integration and Enterprise Application integration approaches are presented in details, and then their suitability’s for the FIRST pipeline is discussed. Business Integration approaches are not analysed because they are applied in business organization level for integrating several complex systems. For this reason these approaches are not suitable for the integration of FIRST components to build FIRST system. 4.1.1 Application Integration There are mainly four Application Integration approaches, which are: File Transfer, Shared Database, Remote Procedure Invocation and Messaging (Hohpe & Woolf, 2003). In File Transfer approach applications communicate via files that can be accessible with integrated applications. One application writes files and another application reads later on. An agreement is required on the filenames, locations, formats and maintenance of files between applications. The following figure shows integration of two applications using the File Transfer approach. Figure 1: File Transfer Architecture (Hohpe & Woolf, 2003) Shared Database approach is integration of applications using a single shared database. In this approach, applications are able to access same database and information. Therefore, there is no need to transfer information between applications directly. One of the biggest difficulties with Shared Database is design of the database schema. The following figure shows integration of three applications using the Shared Database approach. © FIRST consortium Page 18 of 53 D2.2 Figure 2: Shared Database Architecture (Hohpe & Woolf, 2003) Remote Procedure Invocation approach integrates applications by exposing functionalities of applications, which can be called remotely by other applications. Exposed functionalities can be used for data transfer between applications or modification of the data by external applications. 
Web Services are examples of the Remote Procedure Invocation, which use standards such as SOAP and XML .The following figure shows integration of two applications using the Remote Procedure Invocation approach. Figure 4: Remote Procedure Invocation Architecture (Hohpe & Woolf, 2003) Messaging is exchange of messages between applications in a form of loosely coupled distributed manner (Java Message Service, 2011). Communication between applications can be established using TCP network sockets, HTTP, etc... Messaging channels are opened and applications transfer messages by sending and receiving messages through the channel. The applications must agree on channel and message format for the integration. There are two different models of how messaging can be done, which are broker and brokerless. In a broker based messaging system, there is a messaging server in the middle. Every application is connected to the central broker. No application is speaking directly to the other application. All the communication is passed through the broker. Figure 5 shows communication model of a broker based messaging system. Figure 5: Communication model of a broker based messaging system (Zeromq, 2010) © FIRST consortium Page 19 of 53 D2.2 The advantages and disadvantages of broker based messaging systems are: Advantages Applications don't have to have any idea about location of other applications. Message sender and message receiver lifetimes don't have to overlap. Resistant to the application failure. Disadvantages Excessive amount of network communication causes performance decrease. Broker is the bottleneck of the whole system. When broker fails, whole system would stop working. In a brokerless based messaging system, clients interact directly with each other. There is no central messaging server. Figure 6 shows communication model of a brokerless messaging system. Figure 6: Communication model of a brokerless messaging system (Zeromq, 2010) The advantages and disadvantages of brokerless messaging systems are: Advantages No single bottleneck. High performance with less network communication. Disadvantages Each application has to connect to the applications it communicates with and thus it has to know the network address of each such application. Application failures cause persistence and stability problems. File Transfer, Shared Database and Messaging are data based integration approaches, which enable applications to share their data but not their functionality. Remote Procedure Invocation enables applications to share their functionality, which makes them tightly coupled (dependent) to each other. Remote calls are also much slower, and they are much more likely to fail compared to local procedure calls that may cause performance and reliability problems. File Transfer and Shared Database allow keeping the applications well decoupled and therefore different technologies can be used in the applications. However, these approaches require synchronization mechanism in order to inform integrated applications, when data is shared for consumption. Moreover, these approaches require disk access to store and retrieve data, which increases cost of communication between applications. © FIRST consortium Page 20 of 53 D2.2 To integrate applications in a more loosely coupled, reliable way with high performance, Messaging would be the most suitable approach as an application integration approach to integrate the FIRST pipeline. 
Messaging is reliable, since it provides a retry mechanism to make sure that message transfer succeeds. Applications are synchronised with each other through automatic notification of message transfer, which increases the performance of the system.

4.1.2 Enterprise Application Integration

The term Enterprise Application Integration denotes the usage of tools to integrate enterprise applications and services. EAI tools typically act as a middleware between applications, and communication between the EAI tools and the applications is handled internally by a messaging middleware. There are four main Enterprise Application Integration types: Point-to-Point topology, Hub-and-Spoke topology, Enterprise Message Bus integration and Enterprise Service Bus integration (Kusak, 2010). Figure 7 shows the architectures of these integration approaches.

Figure 7: Communication models of the Enterprise Application Integration topologies (Kusak, 2010)

In the Point-to-Point approach, a point-to-point topology is formed by the direct interaction of applications, which creates tight coupling between them. This approach is generally used when there are few applications and few interactions between them. In this type of integration, the integration logic is embedded in the applications, which removes central control, administration and management.

The Hub-and-Spoke topology is formed by the interaction of applications via a central hub. The main advantage of this topology is that fewer connections are required than in the point-to-point topology, and interactions can be managed centrally. However, the hub becomes a single point of failure: when the hub fails, the whole integration topology fails.

The Enterprise Message Bus integration approach is an evolved version of the Hub-and-Spoke approach. A message bus forms the central part of the topology, and applications communicate with it via message brokers or adapters. The main advantage of this topology is the messaging underlying the communication: messaging offers performance, persistence and reliability.

The Enterprise Service Bus (ESB) is a software infrastructure used as a backbone for SOA implementations. The Gartner Group defines the ESB as "a new architecture that exploits Web services, messaging middleware, intelligent routing, and transformation. ESBs act as a lightweight, ubiquitous integration backbone through which software services and application components flow." The ESB promotes lower complexity, flexibility and low cost by combining the benefits of standards-based Web Services with other EAI approaches (Jude Pereira, 2006).

Among the EAI approaches, the ESB would be the most suitable one for integrating the FIRST pipeline. However, it is focused on issues not present in the FIRST system (such as the integration of a large number of components, or content-based message routing), which makes it overly complex and adds a possible performance overhead to information exchange. The FIRST pipeline is composed of a few applications, each with a clearly defined communication schema (passing data from one to the next in a component chain). For this reason, Application Integration approaches, specifically messaging, are more suitable because of their simplicity and better performance in comparison to the more complex ESB middleware designed for much larger-scale integration.
4.2. Pipeline processing

The FIRST pipeline has three major modules to be integrated: the data acquisition and ontology infrastructure (WP3), the semantic information extraction system (WP4) and the decision support infrastructure (WP6). For the first semantic information extraction prototype, the WP3 and WP4 modules are integrated using the messaging approach. This section presents the design and implementation of the messaging integration solution in detail.

High performance and support for the integration of components written in different technologies (.NET, Java) are the most important requirements for the FIRST integration approach (FIRST D2.1 Technical requirements and state-of-the-art, 2011). For programming language independence, messaging systems that support multiple platforms and programming languages were investigated as potential solutions to the integration problem; messaging systems supporting only a specific programming language were eliminated during our survey. After this elimination, the following popular messaging systems were chosen for further investigation: ZeroMQ (http://www.zeromq.org/), ActiveMQ (http://activemq.apache.org/), RabbitMQ (http://www.rabbitmq.com/), OpenJMS (http://openjms.sourceforge.net/) and Apache Qpid (http://qpid.apache.org/).

Performance is the most important criterion when selecting the messaging platform for the pipeline integration. For this purpose, performance tests were run using randomly generated data with fixed message sizes. In the experiments, 1000 messages were transferred from one application to another running on the same machine, and the total transfer duration was used to calculate the throughput of each messaging system. Figure 8 shows the performance of these messaging systems.

Figure 8: Messaging systems performance comparison results (throughput in MB/second for RabbitMQ, ActiveMQ, OpenJMS, ZeroMQ and Qpid)

The experiment results showed that ZeroMQ performed better than the other messaging platforms; being a brokerless messaging system, it requires less network communication. Among the broker-based messaging systems, RabbitMQ performed best. Based on the experiment results, ZeroMQ was selected as the messaging platform for the FIRST pipeline integration because of its performance and its numerous bindings for most programming environments.
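The sketch below indicates the kind of throughput harness used for such a comparison: send a fixed number of fixed-size messages and divide the transferred volume by the elapsed time. It is an illustration, not the actual test code: it uses the ZeroMQ Java binding (org.zeromq) with a push-pull socket pair (introduced below), runs both endpoints in one process for brevity (the described experiments used two applications on the same machine), and the message size is an assumption.

```java
import org.zeromq.ZMQ;

// Illustrative throughput harness, assuming the ZeroMQ Java binding.
// The real experiments transferred 1000 randomly generated, fixed-size
// messages between two applications; the size used here is assumed.
public class ThroughputTest {
    public static void main(String[] args) {
        final int MESSAGES = 1000;
        final int SIZE = 1024 * 1024; // 1 MB per message (assumed)

        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket pull = ctx.socket(ZMQ.PULL);
        pull.bind("tcp://*:5557");
        ZMQ.Socket push = ctx.socket(ZMQ.PUSH);
        push.connect("tcp://localhost:5557");

        byte[] payload = new byte[SIZE]; // randomly generated in the real tests

        long start = System.nanoTime();
        for (int i = 0; i < MESSAGES; i++) {
            push.send(payload, 0); // queued asynchronously by ZeroMQ
            pull.recv(0);          // blocks until the message arrives
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        double megabytes = MESSAGES * (double) SIZE / (1024 * 1024);
        System.out.printf("Throughput: %.2f MB/s%n", megabytes / seconds);

        push.close();
        pull.close();
        ctx.term();
    }
}
```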
ZeroMQ (ØMQ) is a high-performance asynchronous messaging library with which communication between applications can be established without much effort. ZeroMQ offers performance, simplicity and scalability without a dedicated message broker (Piël, 2010). The library provides sockets to connect applications and exchange information, and supports several communication patterns:

Request-reply connects a set of applications. First, the message consumer (client) requests a message transfer, and the message producer (server) sends a message. The request-reply pattern supports multiple message producers and consumers, with load balancing between them, and each message is consumed exactly once. The advantage of this approach is the synchronisation between producer and consumer: each message is consumed one by one. On the other hand, two messages are sent for each packet of the message producer, which increases network traffic and decreases performance.

Publish-subscribe connects a set of publishers to a set of subscribers. Publishers publish messages and subscribers receive them; each message is delivered to all subscribers.

Push-pull (pipeline) connects applications in a pipeline pattern. The message producer pushes messages into the pipe without a request from the message consumer, and the messages are kept in a queue until they are consumed by the receiver. The pipeline approach performs faster than request-reply; however, the high performance comes with the risks of queue overflow and synchronisation problems.

Exclusive pair connects two applications in an exclusive way; both applications can send and consume messages.

Figure 9: ZeroMQ messaging patterns (Piël, 2010)

The messaging system is responsible for the exchange of information from WP3 to WP4 in an efficient, stable and reliable way; the pipeline and request-reply message patterns are thus the most suitable for our needs. To observe the performance and suitability of these patterns, experiments were run on a test dataset collected by WP3. The test dataset consists of 2,378 files with a total size of 5.26 GB; each file was transferred as one message from one application to another using both message patterns.

Figure 10: ZeroMQ messaging patterns performance comparison (throughput in MB/second for the pipeline and request-reply patterns)

In the experiments we observed that the pipeline approach (data push) performs approximately three times better than the request-reply pattern (Figure 10), due to the absence of the extra communication overhead of a data request sent every time by the client; instead, the data is pushed to the client as soon as it is available. For these reasons, the pipeline pattern was selected for transferring messages between the pipeline components (WP4 and WP6). Figure 11 shows the architecture of the pipeline between the systems.

Figure 11: Data Acquisition and Information Extraction pipeline integration architecture
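As an illustration of the selected push-pull integration style, the following minimal sketch connects a producing component (e.g. data acquisition) to a consuming component (e.g. information extraction) with the ZeroMQ Java binding. The endpoint address and message payload are hypothetical; the actual FIRST components and their serialisation format are described elsewhere in this document.

```java
import org.zeromq.ZMQ;

// Upstream pipeline component: pushes documents downstream as soon as
// they are available, with no per-message request from the consumer.
public class DocumentProducer {
    public static void main(String[] args) {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket push = ctx.socket(ZMQ.PUSH);
        push.bind("tcp://*:5558"); // hypothetical pipeline endpoint

        push.send("<document>acquired content...</document>".getBytes(), 0);

        push.close();
        ctx.term();
    }
}

// Downstream pipeline component: pulls documents off the queue in arrival order.
class DocumentConsumer {
    public static void main(String[] args) {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket pull = ctx.socket(ZMQ.PULL);
        pull.connect("tcp://localhost:5558");

        byte[] document = pull.recv(0); // blocks until a document is pushed
        System.out.println("Received " + document.length + " bytes");

        pull.close();
        ctx.term();
    }
}
```

When several consumer instances connect to the same PUSH socket, ZeroMQ distributes the messages among them, which is the property that the load-balanced pipeline configurations discussed in chapter 5 (Figure 22) rely on.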
4.3. GUI integration

4.3.1 Introduction

A common front-end Graphical User Interface (GUI) is required to present the Integrated Financial Market Information System to the end users. This section discusses the technical requirements of the GUI and presents state-of-the-art technologies that fit those requirements. Two types of applications are commonly used to implement an integrated GUI: client/server applications and web applications. A client/server application follows a two-tier architecture: it runs on the client side and accesses information on a remote server. A web application follows an n-tier architecture: it is accessed via web browsers and all application logic is handled on the server side (see Figure 12). Web applications are generally designed with a 3-tier architecture:

Presentation tier: the front-end content rendered by the browser.
Application tier: controls the application's functionality.
Data tier: stores and retrieves information.

Figure 12: Client/server application and web application architectures (Howitz, 2010)

A client/server application must be installed on each client's computer. To avoid the burden of deploying to and maintaining every user machine, the integrated GUI will be implemented as a web application. The data tier, in the case of the FIRST system, is provided in the form of the Data Storage APIs offered by the FIRST high-level services layer.

4.3.2 Web Application Frameworks

Web application frameworks (WAF) are widely used for developing web applications because of their benefits (simplicity, consistency, efficiency, reusability). There is a large number of available web application frameworks, each specific to a programming language. For platform independence and performance, the integrated GUI will be implemented in Java, so we focus on Java-based web application frameworks; from the client perspective, only a web browser should be needed. Java web application frameworks can be categorised into five categories (Shan & Hua, 2006):

Request-based frameworks use controllers and actions that handle incoming requests from users. The user session is kept on the server side. Struts (http://struts.apache.org/), WebWork (http://www.opensymphony.com/webwork/) and Stripes (http://www.stripesframework.org/display/stripes/Home) are examples of request-based frameworks.

Component-based frameworks abstract the request handling mechanism and encapsulate the logic into reusable components; events are triggered automatically to handle incoming user requests. The development model of this type of framework is similar to desktop GUI framework models. JSF (http://javaserverfaces.java.net/) and Apache Wicket (http://wicket.apache.org/) are examples of component-based frameworks widely used for web application development.

Hybrid frameworks combine the request-based and component-based approaches; the entire data and logic flow of components is handled as in a request-based model. RIFE (http://rifers.org/), a full-stack web application framework, falls into this category.

Meta frameworks provide a set of core interfaces for integrating components and services; a meta framework can be considered a framework of frameworks. Spring (http://www.springsource.org/) and Keel (http://www.keelframework.org/) are examples of meta frameworks.

Rich Internet frameworks use a client-side container model in which requests are handled on the client side, so the amount of server communication and server load decreases. Google Web Toolkit (GWT, http://code.google.com/webtoolkit/) and Echo2 (http://echo.nextapp.com/site/echo2) are popular Rich Internet frameworks.

The main purpose of using software frameworks is to reduce the amount of time, effort and resources required to develop and maintain web applications. The performance of the framework is also a very important factor in the choice. From this perspective, three popular frameworks are analysed: GWT, JSF and Spring MVC.

GWT is a component-based Rich Internet framework that allows web developers to create and maintain complex JavaScript front-end applications. GWT allows AJAX applications to be written in Java and then compiled to highly optimised JavaScript that runs across all browsers, including mobile browsers for Android and the iPhone.
The advantages and disadvantages of GWT are listed below:

Advantages:
- Simplicity: no need to learn or use the JavaScript language (a reliable, strongly-typed language, Java, is used for development and debugging); the various tools of the Java ecosystem can be leveraged for writing, debugging and testing.
- Performance: generates optimised JavaScript code; complex Java logic can run on the client.
- Scalability: stateful client, stateless server.
- AJAX support.
- Compatibility: no need to handle browser incompatibilities and quirks.

Disadvantages:
- Steep learning curve.
- Heavy dependence on generated JavaScript: not search-engine friendly; can result in client-side web applications that consume much of the browser's memory.
- Need for additional components: GWT does not come out of the box with all possible widgets, so extra components must be used.

Table 3: GWT advantages and disadvantages

Java Server Faces (JSF) is a component-oriented and event-driven framework based on the Model View Controller (MVC) pattern. The view layer is separated from the controller and the model. Event-driven User Interface (UI) components are provided by the JSF API; the UI components and their state are represented on the server, with a defined life-cycle for the UI components. The advantages and disadvantages of JSF are listed below:

Advantages:
- Simplicity: easy to learn for existing Java web developers; enables the use of IDEs for Rapid Application Development (NetBeans, JDeveloper, Eclipse, etc.); follows the MVC design pattern.
- Compatibility: no need to handle browser incompatibilities and quirks.

Disadvantages:
- Performance: every button or link clicked results in a form post, which might result in a bad user experience from the end user's point of view.
- Scalability: the state of the components is stored in session objects, which makes it difficult to run in distributed mode.

Table 4: JSF advantages and disadvantages

Spring MVC is the request-based framework of the Spring Framework for developing web applications. The framework defines strategy interfaces for all of its responsibilities, which are tightly coupled to the Servlet API. The following are Spring MVC advantages and disadvantages:

Advantages:
- Simplicity; easy to test.
- Follows the MVC design pattern; cleaner code.
- Integrates seamlessly with many view options: JSP/JSTL, Tiles, Velocity, FreeMarker, Excel, XSL, PDF.

Disadvantages:
- Configuration intensive.
- No common parent controller, resulting in the need to handle many issues individually.
- No built-in Ajax support.

Table 5: Spring MVC advantages and disadvantages

The following is a comparison of notable Java web application frameworks by feature. The frameworks are rated between 0 and 1 per feature; the rating logic is described in (Raible, JVM Web Frameworks Rating Logic, 2010). Table 6 shows the comparison results: Spring MVC and GWT are the highest-rated frameworks according to the author's evaluation (a higher score is better).
Criteria (Struts 2 / Spring MVC / Wicket / JSF 2 / Tapestry / Stripes / GWT / Vaadin):
Developer Productivity: 0.5 / 0.5 / 0.5 / 0.5 / 1.0 / 0.5 / 1.0 / 1.0
Developer Perception: 0.5 / 1.0 / 1.0 / 0.0 / 0.5 / 1.0 / 1.0 / 1.0
Learning Curve: 1.0 / 1.0 / 0.5 / 0.5 / 0.5 / 1.0 / 1.0 / 1.0
Project Health: 0.5 / 1.0 / 1.0 / 1.0 / 0.5 / 0.5 / 1.0 / 1.0
Developer Availability: 0.5 / 1.0 / 0.5 / 1.0 / 1.0 / 0.5 / 1.0 / 0.5
Job Trends: 1.0 / 1.0 / 0.5 / 1.0 / 0.5 / 0.0 / 1.0 / 0.0
Templating: 1.0 / 1.0 / 1.0 / 0.5 / 1.0 / 1.0 / 0.5 / 0.5
Components: 0.0 / 0.0 / 1.0 / 1.0 / 1.0 / 0.0 / 0.5 / 1.0
Ajax: 0.5 / 1.0 / 0.5 / 0.5 / 0.5 / 0.5 / 1.0 / 1.0
Plugins or Add-Ons: 0.5 / 0.0 / 1.0 / 1.0 / 0.5 / 0.0 / 1.0 / 1.0
Scalability: 1.0 / 1.0 / 0.5 / 0.5 / 0.5 / 1.0 / 1.0 / 0.5
Testing: 1.0 / 1.0 / 0.5 / 0.5 / 1.0 / 1.0 / 0.5 / 0.5
i18n and l10n: 1.0 / 1.0 / 1.0 / 0.5 / 1.0 / 1.0 / 1.0 / 1.0
Validation: 1.0 / 1.0 / 1.0 / 0.5 / 1.0 / 1.0 / 1.0 / 1.0
Multi-language Support (Groovy / Scala): 0.5 / 0.5 / 1.0 / 1.0 / 1.0 / 1.0 / 0.0 / 1.0
Quality of Documentation/Tutorials: 0.5 / 1.0 / 0.5 / 0.5 / 0.5 / 1.0 / 1.0 / 1.0
Books Published: 1.0 / 1.0 / 0.5 / 1.0 / 0.5 / 0.5 / 1.0 / 0.5
REST Support (client and server): 0.5 / 1.0 / 0.5 / 0.0 / 0.5 / 0.5 / 0.5 / 0.5
Mobile / iPhone Support: 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 1.0
Degree of Risk: 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 1.0 / 0.5
Totals: 14.5 / 17 / 15 / 13.5 / 15 / 14 / 17 / 15.5

Table 6: A comparison of notable Java web application frameworks (Raible, Comparing JVM Web Frameworks, 2010)

Performance and scalability are very important factors for the success of the FIRST project. For this reason, after analyzing the advantages and disadvantages of the candidate frameworks, GWT is selected as the web application framework for developing the integrated GUI. GWT is the most promising solution owing to its simplicity, scalability and performance. GWT enables the development of web applications without writing JavaScript code on the client side. In some special cases (e.g. integration with a non-GWT application), it may be necessary to develop custom JavaScript code. jQuery (http://jquery.com/), the most popular JavaScript library, would be used for this purpose. jQuery is a cross-browser JavaScript library designed to simplify client-side scripting. There is a plug-in called GwtQuery that can be used like jQuery within the GWT framework.
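As an illustration of the jQuery-like syntax available through GwtQuery, consider the following minimal sketch inside a GWT module; the element id and the displayed text are illustrative assumptions.

    import static com.google.gwt.query.client.GQuery.$;
    import com.google.gwt.core.client.EntryPoint;

    // Minimal GwtQuery usage: jQuery-style chained DOM manipulation in Java.
    public class AlertHighlighter implements EntryPoint {
        public void onModuleLoad() {
            $("#alerts").css("color", "red")         // element id "alerts" is hypothetical
                        .text("New event received");
        }
    }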
4.4. Data storage design principles

The design of the knowledge base is conducted with the following prerequisites in mind:
- Choice of paradigm(s) for the physical storage system
- Provision of a stable access interface to hide the underlying complexity
- Encapsulation of business logic in a mediation layer

4.4.1 Storage paradigms

The most crucial and fundamental decision regarding data storage is the choice of paradigm for the physical storage system. Besides the prevalent relational database systems, there is a variety of alternatives, each with its characteristic advantages and disadvantages, which are commonly referred to under the umbrella term NoSQL (commonly spelled out as "Not only SQL", although the term lacks a formal definition). Many of these non-relational storage alternatives provide better performance, but in most cases this comes at the cost of a relaxation of the ACID guarantees, which refer to the atomicity of transactions, the consistency of the data store before and after a transaction, the isolation of transactions, and the durability of completed transactions. As (Härder & Reuter, 1983, p. 291) noted: "These four properties, atomicity, consistency, isolation, and durability (ACID), describe the major highlights of the transaction paradigm, which has influenced many aspects of development in database systems." This was true when the statement was made nearly three decades ago, and it has remained the main guideline for advancements in database technology since then.

Not until the advent of internet technology and strongly increasing read and write accesses to (potentially distributed) database systems was it questioned whether this underlying paradigm fits all use cases best. As (Brewer, 2000) pointed out in his CAP theorem, only two of the three desirable characteristics of distributed databases (consistency, availability and partition tolerance) can be fulfilled at a time; (Gilbert & Lynch, 2002) provide a more formal proof of the theorem. Therefore, in many high-load real-world use cases (see, e.g., (Vogels, 2008) and (Pritchett, 2008) for Amazon and eBay, respectively), the requirements towards the ACID guarantees are relaxed in favour of heightened availability, i.e. performance, of the system. Non-relational database paradigms usually do not fulfil the ACID guarantees to the full extent, but adhere to the so-called BASE paradigm, where BASE stands for Basically available, soft state, eventual consistency (Brewer, 2000). Within the BASE paradigm, temporary data inconsistencies are tolerated in order to improve read and write performance. Depending on the requirements of the particular use case, it has to be decided whether such a relaxation is tolerable or not. NoSQL alternatives can roughly be separated into the following groups according to their data model (see e.g. (Cattell, 2010)):

Key-value stores: provide efficient storage and retrieval of values (objects) of any kind based on a programmer-defined key. Typically, the values (objects) are not interpreted by the system but stored and retrieved as-is, which limits searching capabilities to the associated key. Examples: Project Voldemort, Tokyo Cabinet, Memcached.

Document stores: provide efficient storage and retrieval of documents (objects) of any kind. In contrast to key-value stores, the content and structure of the documents, i.e. attribute names, scalar values, but also nested lists or other documents, are interpreted by the system and made available for queries. Examples: MongoDB, CouchDB.

Extensible record stores: organize data in a way similar to relational tables, but enable a dynamic number of attributes. Furthermore, tables can be partitioned both vertically and horizontally at the same time. Examples: Google BigTable, Apache HBase.

Relational databases | NoSQL data stores
Difficult schema evolution/adaptation | Schemaless
ACID guarantees, strong consistency | BASE paradigm, weak consistency
Data typically normalized | Data typically de-normalized
Standardized, expressive query language | Individual query capabilities
(Mainly) vertical scaling | Horizontal scaling

Table 7: Main characteristics of relational and non-relational storage paradigms

The decision whether to use the relational database approach or NoSQL data stores is mainly a trade-off between ACID compliance and query versatility on the one hand, and improvements in terms of performance and scalability on the other. This decision has to be made considering the requirements of the particular use case.
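The difference in query capabilities can be illustrated with a small sketch against MongoDB, one of the document stores named above, using the classic MongoDB Java driver; the database and collection names are illustrative. A key-value store could retrieve the stored object only by its key, whereas a document store can be queried on the document's content:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.Mongo;

    public class DocumentStoreExample {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost");      // connection parameters are illustrative
            DB db = mongo.getDB("first");
            DBCollection articles = db.getCollection("articles");

            BasicDBObject article = new BasicDBObject();
            article.put("title", "Steve Jobs quits as Apple CEO");
            article.put("stockSymbol", "AAPL");        // attributes are interpreted and indexable

            articles.insert(article);

            // Query on the document's content, not on an external key.
            DBObject hit = articles.findOne(new BasicDBObject("stockSymbol", "AAPL"));
            System.out.println(hit);
            mongo.close();
        }
    }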
4.4.2 Access layer

The FIRST knowledge base is faced with the requirement to provide high-performance storage for data of different modalities (some data items are more structured, some less), with different expected insert and retrieval frequencies and patterns. The knowledge base addresses these heterogeneous requirements by using, for each data modality individually, the data storage paradigm that suits it best, rather than a common paradigm for all kinds of data. This implies that a variety of paradigms will be applied on the actual storage layer (regarding the specific use cases addressed by the knowledge base and the actually chosen storage paradigms, see deliverable D5.1). This diversity on the storage layer calls for an abstract access layer that hides the complexity of the storage layer from knowledge base clients, i.e. the other technical work packages. Besides being in line with the DoW, this layer of indirection provides further advantages, such as the exchangeability of storage components. The parallel usage of different paradigms makes it possible to compare their suitability for different large-scale requirements. As the choice of an approach shall not create a technological lock-in in the long run, the option must remain to switch the underlying storage paradigm in case one of the chosen approaches proves not to scale as expected. Without an abstract access layer, exchanging storage components would be hardly achievable, as it would impact all clients and require them to adapt their way of accessing the data accordingly. With an abstraction layer in place, components can be exchanged seamlessly, as clients do not even notice that the backend storage structure has changed.

4.4.3 Mediation layer

To enable the seamless exchange of a storage solution while providing a stable access interface, the actual business logic is encapsulated in a mediation layer, which covers various functions (see the sketch after this list):
- Transformation of requests accepted from the access layer into transactions on the storage layer, utilizing knowledge about its actual structure
- Provision of services to foster performance, e.g. caching, if not provided by the underlying storage solution
- Provision of services to handle exceptional load, e.g. queuing of requests, if not provided by the underlying storage solution
- Provision of services to avoid resource bottlenecks, e.g. maintaining connection pools
- Maintenance of thread pools to cater for the parallel processing of independent requests
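The following minimal sketch illustrates how the access layer and the mediation layer described above could be shaped; all interface, method and type names are hypothetical and do not prescribe the actual WP5 design. Clients program against the interface, so the backing store (relational, document store, etc.) can be exchanged without affecting them.

    // Stable access interface exposed to knowledge base clients.
    public interface KnowledgeBaseAccess {
        void storeSentiment(String instrument, String day, double score);
        double retrieveSentiment(String instrument, String day);
    }

    // Mediation layer: adds caching on top of an arbitrary backing store.
    class MediatedKnowledgeBase implements KnowledgeBaseAccess {
        private final java.util.Map<String, Double> cache =
                new java.util.concurrent.ConcurrentHashMap<String, Double>();

        public void storeSentiment(String instrument, String day, double score) {
            cache.put(instrument + "/" + day, score);
            // ... translate the request into a transaction on the actual storage layer ...
        }

        public double retrieveSentiment(String instrument, String day) {
            Double cached = cache.get(instrument + "/" + day);
            if (cached != null) return cached;   // served from the mediation-layer cache
            // ... otherwise query the underlying store via its connection pool ...
            return 0.0;
        }
    }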
5. Design

This chapter describes the details of the system design and the interaction of the main components. It also depicts a sample FIRST scenario to illustrate the data flow through the pipeline components and the role of the storage components and the integration layer in bringing the results of pipeline computations to the end user. The design of FIRST is based on the state of the art in system integration as well as on the technical requirements and the architectural analysis. Moreover, internal details of the FIRST process are provided as part of section 5.2.

5.1. Detailed components interaction perspective

Figure 13 presents a detailed component view divided into layers (light blue boxes) that correspond to the tiers described in chapter 3.2. Components have been surrounded with red boxes that clearly indicate the workpackage they belong to; the remaining details have been omitted for brevity. The FIRST high-level services layer has not been marked with a red box, as it exposes the services of all technical components (of workpackages WP3, WP4, WP5, WP6 and WP7), each of which is developed within its own workpackage.

Figure 13: Detailed component view (overview)

Data Cleaning API: provides services based on the data acquisition components (WP3) that prepare data for further analysis by the WP4 information extraction components (such as language detection, boilerplate removal, near-duplicate removal, etc.).

Visualisation API: services providing data for the visualisation user interfaces (WP3). As visualisation is connected to the data stream going through the pipeline, the Visualisation API provides a stream of data pushed to the registered components.

Sentiment Analysis API: a set of services for performing sentiment analysis of a news article or a chosen text, employing WP4 components.

Decision Support API: a set of services exposing the decision-making infrastructure (e.g. event detection, predictions) implemented within the WP6 components.

Alerts API: PUB/SUB services used to register for specific events. The receiver is notified as soon as the event occurs in the pipeline. Events are detected by WP6 components by analyzing the data stream, while the WP7 collector component handles the dispatch job.

Storage Access API: provides unified access to the multiple WP5 data storage solutions.

These services provide building blocks for implementing the Integrated Financial Market Information System end-user GUIs. The list of services is not definitive and may be expanded according to future use case and GUI requirements. Depending on the usage scenario, the services above can be either request-reply (REQ/REP) services or publish-subscribe (PUB/SUB) stream-based services, except for the Visualisation API and Alerts API, which are inherently PUB/SUB and stream-based. Details of using the high-level services are described in section 5.2. Figure 14 marks, with red arrows, the components that expose high-level services, and highlights the interaction between components. The main part in the centre (the irregular dark-blue shape) denotes the data flow through the pipeline components. Data is passed between components that use different technologies through lightweight pipeline integration components (high-performance asynchronous messaging).
These integration components interconnect the data acquisition components (WP3) with information extraction (WP4), and later the sentiment analysis component (WP4) with decision support (WP6). The results of decision support (e.g. events) are received by the data collecting component of WP7. The pipeline integration components work by pushing the data as it arrives, without the need for acknowledgements or replies. Details of this approach are outlined in section 5.2.6.1.

Figure 14: Detailed component view (with highlighted interactions and data flow)

Internally, the WP3 and WP4 components may be closely integrated within their own execution environments for performance reasons (.NET CLR for the WP3 components and JVM for the WP4 components). This is depicted by gray arrows connecting data acquisition with ontology learning, and information extraction with sentiment analysis. The direction of the gray and thick blue arrows denotes the data flow. The Information Integration layer (presented at the bottom) exposes its API both (i) for high-level services that foster use case and GUI implementations accessing the data gathered during pipeline processing, and (ii) for individual components, mostly for storing processing results. Each storage facility is dedicated to a different kind of data; therefore each technical component connects to a different store, depending on the data type. Internal details of the Information Integration layer are presented in (FIRST D5.1 Specification of the information-integration model, 2011). Structured data is delivered to the system through the Universal Data Adapter, which connects external data providers (such as the IDMS Financial Data API) with the components that internally use the data for processing. According to the current analysis, only the Decision Support System is connected so far. The ontology learning component also takes part in pipeline processing; however, its outputs are not streamed back into the pipeline. The processing result of this component is an updated ontology that is shared with the information extraction component. Since ontology snapshot updates are not very frequent (daily), the sharing mechanism is not required to be robust, and simple file sharing is considered.
5.2. Sample FIRST process

The purpose of this section is to illustrate how the FIRST integrated system is employed in two concrete sample scenarios: (1) a topic trend visualization use case and (2) a portfolio selection use case. In the context of these two use cases, we demonstrate the FIRST analytical pipeline (WP3–WP6), point out how the chosen inter-component messaging technology (i.e., ZeroMQ) is employed, and show how the Web-based integrated GUI (WP7) is built on top of the pipeline.

5.2.1 Sample scenarios

5.2.1.1 Topic trend visualization

Topic trend visualisation provides valuable insights into how topics evolve through time. One such visualisation is called ThemeRiver (Havre, Hetzler, & Nowell, 2000). It visualises thematic variations over time for a given time period. The visualisation resembles a river of coloured "currents" representing different topics. A current narrows or widens to indicate a decrease or increase in the strength of the corresponding topic. The topics of interest are predefined by the analyst as a set of keywords. The strength of a topic is computed as the number of documents containing the corresponding keyword. Shaparenko et al. build on the ThemeRiver idea to analyse a dataset of scientific publications (Shaparenko, Caruana, Gehrke, & Joachims, 2005); they identify topics automatically by employing the k-means clustering algorithm. Figure 15 shows an example of the topic trend visualization (taken from (FIRST D2.1 Technical requirements and state-of-the-art, 2011) Section 2.5.4).

Figure 15: An example of the topic trend visualization

The topic trend visualization is required by Use Case 3 (retail brokerage), Application Scenario 3 (visualize topic trends and their impact), as specified by the requirements UC3.R-EU8.x (see (FIRST D1.2 Usecase requirements specification, 2011) Section 4.4.3 for more details). The main idea of this application scenario is to enable the user to visually detect newly emerging or diminishing topics and to assess the relevance of the topics currently being discussed in news and/or blogs. This use case demonstrates the entire FIRST process, with the exception of the information extraction pipeline, which is bypassed in this particular case. The acquired data is first sent through the preprocessing pipeline and then sent (through the ZeroMQ messaging technology) directly to the Canyon Flow pipeline. The Canyon Flow model (i.e., a cluster hierarchy) is sent to a Web server which forwards it to its clients. A JavaScript component, part of the FIRST Web-based integrated GUI, visualizes the model for the user.

5.2.1.2 Portfolio selection use case

The portfolio selection use case demonstrates the economic utility of the results of Work Package 4, "Semantic Information Extraction System". The work package's model extracts investor sentiment with respect to objects and features that are specific to the three use cases of FIRST. The portfolio selection use case specifically targets extraction results for the use case "Investment Management and Retail Brokerage". In this use case, the observed sentiment objects are financial instruments and the feature of the object is the "expected future price change". In the portfolio selection case, sentiment is extracted from blog posts that refer to stocks of the Dow Jones Industrial Average (DJIA). The extraction starts on the sentence level, yielding a crisp classification of sentiment polarity with respect to a stock. The sentiments that refer to the same instrument are then aggregated to the document level by means of a real-valued score ratio that accounts for the direction and intensity of the sentiment. It is normalized to the interval [–1, 1]: scores > 0 are interpreted as positive, scores <= 0 as negative. As the score is normalized, we are able to average the scores of several documents from the same day that refer to the same instrument.
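One simple instantiation of such a document-level score and its daily aggregation is sketched below; the concrete formula (positive minus negative sentence counts, divided by their sum) is an assumption for illustration and not necessarily the exact WP4 scoring.

    import java.util.List;

    public class SentimentAggregation {
        // Document-level score in [-1, 1]: sign gives direction, magnitude gives intensity.
        // NOTE: this ratio is an illustrative assumption, not the actual WP4 formula.
        static double documentScore(int positiveSentences, int negativeSentences) {
            int total = positiveSentences + negativeSentences;
            return total == 0 ? 0.0 : (positiveSentences - negativeSentences) / (double) total;
        }

        // Daily sentiment index for one instrument: mean of that day's document scores.
        static double dailyIndex(List<Double> documentScores) {
            if (documentScores.isEmpty()) return 0.0;
            double sum = 0.0;
            for (double s : documentScores) sum += s;
            return sum / documentScores.size();
        }
    }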
An example of a time series of the resulting sentiment index with respect to the DJIA is displayed in Figure 16.

Figure 16: DJIA vs. smoothed time series of the sentiment index (see (Klein, Altuntas, Kessler, & Häusser, 2011))

Figure 16 displays a smoothed sentiment index (a simple moving average of length 20 days). We hypothesize that the sentiment index can be beneficially exploited in the portfolio selection use case, i.e. that a selection strategy involving the sentiment would provide excess returns over a buy-and-hold strategy. We simply enter a long position at the next day's open price once the sentiment index satisfies si(day) >= th, or a short position if si(day) < th, with th being a threshold on the interval [0, 1] with a default value of 0. The position is closed at the close price of day+n, with n >= 1, if si(day+n) changes its direction (indicated by a change of the algebraic sign). To test this strategy, we specify a historic back-testing simulation. The simulation scenario consists of daily price time series for 26 DJIA stocks and blog posts retrieved from the blogger.com platform in the period 2007–2010; see Table 8 for details.

Stock symbol: number of documents / number of days with sentiment
AA: 691 / 450
AXP: 871 / 600
BA: 1718 / 965
BAC: 1748 / 672
CAT: 1110 / 730
CSCO: 2756 / 1111
CVX: 400 / 322
DIS: 1922 / 974
GE: 1617 / 901
HD: 1337 / 829
HPQ: 1206 / 721
IBM: 2900 / 1220
INTC: 1006 / 623
JNJ: 241 / 193
JPM: 1061 / 652
KFT: 231 / 202
KO: 2735 / 1190
MRK: 515 / 417
MSFT: 3948 / 1332
PFE: 331 / 284
PG: 373 / 293
T: 2173 / 924
TRV: 40 / 39
UTX: 79 / 70
WMT: 4180 / 1382
XOM: 844 / 571

Table 8: Portfolio selection scenario based on DJIA stocks and sentiment extracted from blog posts (Klein, Altuntas, Kessler, & Häusser, 2011)

For the results of the portfolio selection test in the historic simulation, please refer to (Klein, Altuntas, Kessler, & Häusser, 2011).

5.2.2 Data acquisition and preprocessing pipeline

The data acquisition and preprocessing pipeline is common to both use cases. It consists of (i) data acquisition components, (ii) data cleaning components, (iii) natural-language preprocessing components, (iv) semantic annotation components, and (v) ZeroMQ emitter components. The current workflow topology is shown in Figure 17.

Figure 17: The current topology of the data acquisition and preprocessing pipeline (one RSS reader per site, 80 readers in total, feeding load-balanced processing pipelines)

The data acquisition components are mainly RSS readers that poll for data in parallel. One RSS reader is instantiated for each Web site of interest.
The RSS sources corresponding to a particular Web site are polled one after another by the same RSS reader, to prevent the servers from rejecting requests due to concurrent access. After an RSS reader has collected a new set of documents from an RSS source, it dispatches the data to one of the 20 available processing pipelines. The pipeline is chosen according to its current load (load balancing), as shown in the sketch at the end of this section. A processing pipeline consists of a boilerplate remover, language detector, duplicate detector, sentence splitter, tokenizer, part-of-speech tagger, semantic annotator, and ZeroMQ emitter. The majority of these components were already discussed in FIRST D2.1 (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011) Section 2.1). The natural-language processing stages (i.e., sentence splitter, tokenizer, and part-of-speech tagger) were added because they are a prerequisite for the semantic annotation component and for the information extraction tasks. Finally, the ZeroMQ emitters were added to establish a "messaging channel" between the data acquisition and preprocessing components (WP3) and the information extraction components (WP4). This enables us to run the two sets of components in two different processes (i.e., runtime environments) or even on two different machines. The information extraction components are discussed in the following section.
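The load-balancing dispatch step mentioned above could, in its simplest form, look as follows; representing a pipeline by its input queue and measuring load by queue size are simplifying assumptions for illustration.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    // Dispatches a freshly collected document batch to the least-loaded pipeline.
    public class PipelineDispatcher {
        private final List<BlockingQueue<Object>> pipelines;

        public PipelineDispatcher(List<BlockingQueue<Object>> pipelines) {
            this.pipelines = pipelines;   // e.g. the 20 processing pipelines
        }

        public void dispatch(Object documentBatch) throws InterruptedException {
            BlockingQueue<Object> leastLoaded = pipelines.get(0);
            for (BlockingQueue<Object> p : pipelines) {
                if (p.size() < leastLoaded.size()) leastLoaded = p;  // load = queue size
            }
            leastLoaded.put(documentBatch);   // hand the batch to the chosen pipeline
        }
    }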
5.2.3 Information extraction pipeline

The information extraction pipeline receives the acquired, pre-processed, and annotated documents delivered by the previous stages of the pipeline described above. The main purpose of this part of the pipeline is to extract, classify, and aggregate sentiment with respect to use-case-specific financial objects and features, such as the price or volatility of a financial instrument. These sentiments can then be used as semantic features in subsequent decision models. To enable this, the final stage of this pipeline segment stores all extracted sentiments, with their respective attributes, in the knowledge base. For the extraction of sentiments, we employ an ontology-guided and rule-based information extraction approach. In contrast to pure machine learning approaches that rely on statistics, this allows for a deeper analysis that is specific with respect to certain objects and features. We integrate financial knowledge by modelling and conceptualizing the relevant parts of the domain in the ontology. Linguistic knowledge (independent of a specific domain) is brought to bear to enable a formulation of rules that is as generic as possible. These rules define sentiment-extraction patterns. Inherent parts of the definition of these patterns are the annotations of the text created by previous parts of the information processing pipeline in FIRST (e.g., part-of-speech tagging, lemmatization, and named entity extraction). The analysis takes place on several levels of a document, starting at the word and phrase level. Based on this, the sentiment is aggregated to the document level. For this purpose, simple scoring (a quantification of sentiment that represents sentiment direction and intensity) is employed.

5.2.4 Decision support models

5.2.4.1 Topic trend visualisation models

In FIRST, topic trends will be visualized with a ThemeRiver-like algorithm called Canyon Flow. Technically speaking, the algorithm provides a "view" of a hierarchy of document clusters, as illustrated in Figure 18. The underlying algorithm is essentially a hierarchical bisecting clustering algorithm employed on the bag-of-words representation of the documents. In Figure 18, c(C, t) denotes the number of documents with time stamp t in cluster C. This implies that the "width of a current" at time t is proportional to the number of documents with time stamp t in the corresponding cluster. Note that this algorithm is originally not employed on document streams; it merely visualizes a dataset of documents with time stamps. Note also that this visualization is interactive: the user is able to select a different view of the cluster hierarchy (e.g., a more fine-grained view of a certain topic).

Figure 18: Canyon Flow: a "view" of a hierarchy of document clusters

The approach to employing the algorithm on streams and to scaling it up will be very similar to the pipelining approach presented in (FIRST D2.1 Technical requirements and state-of-the-art, 2011) Section 4.2.1. The algorithm will be decomposed into pipeline stages as shown in Figure 19. When a new data instance, i.e., an annotated document corpus, enters the Canyon Flow pipeline, the documents are first transformed into their bag-of-words representation. The BOW vectors are then sent to the hierarchical clustering stage. The usual bisecting (k-means) hierarchical clustering algorithm will be replaced with an efficient online (i.e., stream-based) variant (see (FIRST D2.3 Scaling Strategy, 2011) Annex 1). The obtained cluster hierarchy (more accurately, the changes to the current cluster hierarchy) will be pushed, through the ZeroMQ messaging technology, to a Web server that forwards it to the clients (i.e., Web browsers). On the client side, a JavaScript component will be responsible for updating the visualization (the currents will move with time from right to left) and for interacting with the user.

Figure 19: The Canyon Flow pipeline and the corresponding part of the Web-based integrated GUI

It is also important to note that the user will be able to specify a set of keywords to express his interests (as discussed in (Havre, Hetzler, & Nowell, 2000)). The Canyon Flow pipeline will not be altered in any way for this purpose; the required keyword filter will be applied by the Web server prior to pushing the data to the client.

5.2.4.2 Portfolio selection models

As described above, the portfolio selection use case aims at building a portfolio of financial instruments, whereby the instruments are selected according to the sentiment expressed within the information sources. This use case builds upon the sentiment measure provided through the analytical pipeline. In a first step, the portfolio selection process will only be based on the sentiment measure. As previous research has shown, buying or selling stocks according to the value of a daily sentiment measure can be profitable (Klein, Altuntas, Kessler, & Häusser, 2011). For that purpose, a daily aggregation of the sentiment measure will be retrieved from the analytical pipeline. In this context, we will investigate which threshold of the sentiment measure is appropriate for portfolio selection decisions.
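The decision logic of the threshold strategy defined in section 5.2.1.2 can be sketched as follows; class and method names are illustrative, and order execution as well as back-testing bookkeeping are omitted.

    public class SentimentStrategy {
        private final double threshold;   // th in [0, 1], default 0

        public SentimentStrategy(double threshold) {
            this.threshold = threshold;
        }

        // +1 = enter a long position at the next day's open, -1 = enter a short position.
        public int entrySignal(double sentimentIndex) {
            return sentimentIndex >= threshold ? +1 : -1;
        }

        // True if an open position should be closed at the close of the current day,
        // i.e. the sentiment index has changed its algebraic sign relative to the position.
        public boolean closePosition(int position, double currentSentimentIndex) {
            return Math.signum(currentSentimentIndex) != Math.signum(position);
        }
    }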
In a second step, we will investigate whether the sentiment measure can be used as an input to more sophisticated approaches. For that purpose, different methods will be considered, including machine learning techniques (explained in detail in (FIRST D6.1 Models and visualisation for financial decision making, 2011)), qualitative modelling, and optimization techniques. Apart from the sentiment measure, possible input variables received from the previous steps are technical and fundamental indicators or bag-of-words representations of texts. Furthermore, historical price series will be requested through the knowledge base developed in WP5. Taking these inputs into account, we aim at detecting patterns which can serve as a basis for developing portfolio selection models. Once developed, the portfolio selection models will be stored in the knowledge base and updated frequently. Portfolio selection decision support will be calculated on the server side and provided on request of the user (pull-based), e.g., via a Web browser.

5.2.5 Data exchange between pipeline components

A batch of documents (either news or blog posts), passed between components (stages) in the data acquisition pipeline (WP3), is internally stored as an annotated document corpus (ADC) object. The ADC data structure is very similar to the one used by GATE, a Java suite of tools developed at the University of Sheffield for all sorts of natural language processing tasks, including information extraction (freely available at http://gate.ac.uk/), and is best described in the GATE user's guide (available at http://gate.ac.uk/sale/tao/split.html; see http://gate.ac.uk/sale/tao/splitch5.html#x8-910005.4.2 for simple examples of annotated documents in GATE). An ADC can be serialized either into XML or into a set of HTML files. Figure 20 shows a toy example of an ADC serialized into XML. In short, a document corpus normally contains one or more documents and is described with features (i.e., a set of key-value pairs). A document is also described with features and in addition contains annotations. An annotation gives a special meaning to a text segment (e.g., token, sentence, named entity); note that an annotation can also be described with features.

    <DocumentCorpus xmlns="http://freekoders.org/latino">
      <Features>
        <Feature>
          <Name>source</Name>
          <Value>smh.com.au/technology</Value>
        </Feature>
      </Features>
      <Documents>
        <Document>
          <Name>Steve Jobs quits as Apple CEO</Name>
          <Text>Tech industry legend and one of the finest creative minds of a generation, Steve Jobs, has resigned as chief executive of Apple.</Text>
          <Annotations>
            <Annotation>
              <SpanStart>75</SpanStart>
              <SpanEnd>84</SpanEnd>
              <Type>named entity/person</Type>
              <Features />
            </Annotation>
            <Annotation>
              <SpanStart>122</SpanStart>
              <SpanEnd>126</SpanEnd>
              <Type>named entity/company</Type>
              <Features>
                <Feature>
                  <Name>stockSymbol</Name>
                  <Value>AAPL</Value>
                </Feature>
              </Features>
            </Annotation>
          </Annotations>
          <Features>
            <Feature>
              <Name>URL</Name>
              <Value>http://www.smh.com.au/technology/technology-news/steve-jobs-quits-as-apple-ceo20110825-1jat8.html</Value>
            </Feature>
          </Features>
        </Document>
      </Documents>
    </DocumentCorpus>

Figure 20: Annotated document corpus serialized into XML

The annotated document contained in the XML in Figure 20, serialized into HTML and displayed in a Web browser, is shown in Figure 21.

Figure 21: Annotated document serialized into HTML and displayed in a Web browser
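For illustration, a downstream component could read the annotations of an ADC serialized as in Figure 20 with a few lines of standard Java XML processing; the file name is an assumption.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Prints the type and character span of every annotation in a serialized ADC.
    public class AdcReader {
        public static void main(String[] args) throws Exception {
            Document xml = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("corpus.xml"));  // illustrative file name
            NodeList annotations = xml.getElementsByTagName("Annotation");
            for (int i = 0; i < annotations.getLength(); i++) {
                Element a = (Element) annotations.item(i);
                String type = a.getElementsByTagName("Type").item(0).getTextContent();
                int start = Integer.parseInt(
                        a.getElementsByTagName("SpanStart").item(0).getTextContent());
                int end = Integer.parseInt(
                        a.getElementsByTagName("SpanEnd").item(0).getTextContent());
                System.out.println(type + " [" + start + ", " + end + "]");
            }
        }
    }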
The data acquisition pipeline ends with a ZeroMQ emitter that sends an annotated document corpus into the information extraction pipeline (WP4). The information extraction pipeline is based on GATE, and since GATE also uses annotated document corpora internally, a relatively simple transformation (more accurately, a serialization) is applied to transform the XML received by the ZeroMQ receiver into a GATE document corpus object. For performance reasons, however, the data acquisition pipeline should eventually serialize documents straight into the GATE format to avoid the extra XML manipulation. Note that the decision support components (WP6) also form pipelines. The data passed between the decision support pipeline stages is not necessarily in the form of document corpora; at this stage it is not possible to define exactly what kind of data will be passed between those components. The main data structure will undoubtedly be a sparse matrix, used to describe graphs and feature vectors.

5.2.6 Role of the integration layer

From the dataflow perspective, the integration layer has a three-fold role in the project: first, it enables robust data communication between the different component groups that form the analytical pipeline; second, it provides a set of high-level services for accessing concrete features of FIRST for implementing the use case scenarios; and third, it supports the GUI services by providing publish-subscribe mechanisms to deal with event notifications coming from continuous document stream processing. We present all three integration facets in the following sections.

5.2.6.1 Message passing in the pipeline integration layer

The analytical pipeline is composed of groups of components performing specialized tasks (data acquisition, information extraction and sentiment analysis, decision support and visualisation); each group of components is developed by its respective partner and written in a different technology, as described in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). Internally, these components can be integrated using each language's most suitable mechanism (method invocation, threads, events, internal queues, etc.), which does not impose any extra overhead: the document stream can be handled within each execution environment (Java Virtual Machine for Java components, Common Language Runtime for .NET components) in a zero-copy manner, i.e. without the necessity to copy memory or data from one component to another. On the other hand, integration middleware is crucial where components written in different technologies need to pass data between each other. Message passing using lightweight communication over network sockets (e.g. ZeroMQ) allows for minimal-overhead integration both on a local machine and in a high-speed local network environment, enabling pipeline decoupling and overall system distribution, due to the lack of a central broker. While the pipeline components are designed to process data in a pipeline fashion, the data passing, load balancing and queue buffering are handled by the middleware layer. The data in the pipeline is always pushed forward to the next component. Each emitter component passes data only to its receiving counterpart, without the mediation or routing that is characteristic for large, distributed bus-based systems (e.g. ESB-based systems).
In FIRST the number of components is limited and fixed; although traffic load balancing is taken into account, the set of components remains static. Technically, the analytical pipeline integration layer connects the data acquisition, information extraction and decision support systems and the data collecting components for GUI support (as depicted in the previous chapter). There is also the possibility of further pipeline manipulation techniques, e.g. splitting the pipeline into more parts in order to maximize the usage of computing resources, but the communication mechanism remains the same. These techniques support the system scalability goals and are further described in (FIRST D2.3 Scaling Strategy, 2011) Section 2.

Figure 22: Integration between data acquisition and information extraction components with load balancing

Figure 22 illustrates a sample integration using ZeroMQ messaging. The emitter component (in blue) is integrated within the data acquisition components and shares a common heap space with them. The data (single documents) is fed asynchronously to the sending emitter through a helper buffer; the role of this buffer is explained in (FIRST D2.3 Scaling Strategy, 2011), and it is used to support synchronous operations and data-peak scenarios. Once sent through the socket, the data is received by the ZeroMQ receiver and made available to the Java component integrated with the information extraction component (in green). The receiving queue buffers the documents and compensates for delays in data processing.

5.2.6.2 High-level services data flow

High-level system services are the conceptual building blocks for GUI and use case implementation. They form a common system API for any envisaged integration by exposing high-level, concrete functionalities of the FIRST system. As opposed to the push-based pipeline dataflow, the high-level services are mostly request-response based: they respond to on-demand queries over the gathered data. These services are exposed using the classical SOA approach and can be invoked synchronously according to the use case application's internal business logic (see Figure 23).

Figure 23: Example of synchronous high-level services invocation
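From the use case application's perspective, such a synchronous invocation might look as sketched below; the service interface and its signature are hypothetical placeholders for the contracts that the owning work packages will define.

    // Hypothetical client-side view of one high-level service.
    public interface SentimentAnalysisService {
        // Returns a sentiment score in [-1, 1] for the given text.
        double analyseSentiment(String text);
    }

    class UseCaseApplication {
        private final SentimentAnalysisService sentimentService;

        UseCaseApplication(SentimentAnalysisService sentimentService) {
            this.sentimentService = sentimentService;
        }

        void handleUserRequest(String newsArticle) {
            // Request-response: a blocking call made according to the use case's business logic.
            double score = sentimentService.analyseSentiment(newsArticle);
            System.out.println("Sentiment score: " + score);
        }
    }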
5.2.6.3 Push-based GUI services

Communication services are needed to deliver incoming data from the high-level services to the integrated GUI. Web applications communicate using the HTTP protocol, which has no support for a server notifying a client: a strict request-response model is used, in which the client (the web browser running the integrated GUI) makes a request to the server (calls the high-level system services), which must then respond with the requested data. In this protocol, the server cannot send a notification to the client on its own initiative. Therefore, a polling technique can be used to retrieve data from the server: the client sends requests to the server at fixed intervals (e.g. every one or two seconds) to fetch new data, if there is any. This approach is, however, inefficient. Server push (Comet) is used as an alternative to polling in order to overcome these inefficiency and performance problems. Server push is an approach in which a long-held HTTP request allows a web server to push data to a browser without the browser explicitly requesting it; the client therefore does not have to keep asking for updates. There are several server push frameworks (e.g. GWT Comet and Atmosphere) that can be integrated into GWT-based web applications. These frameworks will be analyzed and the one most suitable for the FIRST project will be selected as the communication framework between the integrated GUI and the high-level services.

The key component for the push-based GUI services is the WP7 data collector component attached to the final stage of the pipeline. It does not process data, but only listens for specific (subscribed) events. Once an event occurs, it is sent instantly to the subscribed party; in the case of the GUI implementation, this is the application server (e.g. a Java EE server or servlet container). If the client is connected, the event is sent through an always-open HTTP channel in long-polling mode. It is the client's responsibility (through client-side JavaScript logic) to reopen the connection once a timeout occurs, and thus to maintain the open channel throughout the whole user session. The whole process is depicted in Figure 24.

Figure 24: Data-push mechanism for notification services
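A minimal long-polling sketch using the standard Servlet API is shown below; it illustrates the Comet principle described above, is not the actual WP7 implementation, and simplifies the event source to an in-memory queue.

    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Holds the HTTP request open until an event arrives or a timeout expires;
    // the client reopens the connection after every response.
    public class EventPushServlet extends HttpServlet {
        // Filled by the event-dispatching side when a subscribed event occurs.
        static final BlockingQueue<String> events = new LinkedBlockingQueue<String>();

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            try {
                // Block for up to 30 seconds waiting for the next event.
                String event = events.poll(30, TimeUnit.SECONDS);
                response.setContentType("text/plain");
                response.getWriter().write(event != null ? event : "timeout");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }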
5.2.7 GUI integration

The FIRST system will comprise several demonstrator applications (e.g. sentiment extraction, document stream visualisation, etc.) that will be included in the FIRST Integrated Financial Market Information System. This will be the one common GUI, providing a common "entry point" for showcasing FIRST functionalities. All applications will be exposed as web applications; therefore, extra integration will be required. The GUI integration will follow a decoupled, widget-based approach, where the web application is composed of many smaller widgets (small atomic applications) that may be deployed on different servers and display data provided by multiple sources (e.g. pipeline components). From the end user's point of view, however, they will remain one consistent application. Technologically, the GUI integration will follow the approach presented in section 5.2.6.3, using the technologies analyzed in section 4.3. Details will be further elaborated within the work of WP7 and the associated deliverables.

5.2.8 Role of the storage components

As outlined before, the different steps of the processing pipeline communicate directly with each other by using messaging technology to forward data items. This achieves several advantages: subsequent processing steps are provided with new data as soon as it becomes available, and the potential bottleneck of using a central storage facility for massive data exchange is bypassed. Using a central storage facility would cause database overhead due to inserting and querying data, and would also require notifying the downstream components in some way to make them aware that new data is available for further processing. As near-real-time processing of huge amounts of data is in the scope of FIRST, it has been decided to circumvent this time-consuming procedure and to forward data directly to the downstream components in the processing pipeline. Nevertheless, at each point where data is handed over from one work package to the responsibility of another, it is also written to the central storage facility. This means that, in parallel to the emission of messages along the processing pipeline, the passed data is also inserted into the storage facility. Thereby, the storage component is not directly involved in the processing pipeline, which avoids the aforementioned drawbacks, but nevertheless holds all relevant data for potential back-testing or analysis purposes. While it is solely a data drain from the processing pipeline perspective, it serves as a rich data source for back-testing and analysis. Besides its passive role along the processing pipeline, the knowledge base also holds periodically archived versions of the evolved ontology, as well as the results of the sentiment analysis, which are to be used both by the decision models and by the integrated GUIs built on top of the FIRST system.

6. Deployment

6.1. Hosting platform

The FIRST pipeline is currently running on a single dedicated server machine. The machine has 25 terabytes of disk space plus 200 gigabytes of solid-state disk. It has 256 gigabytes of memory and is able to run up to 48 concurrent processes (4 processors with 12 cores each), running a 64-bit Windows Server 2008 operating system. At this time, it is expected that this machine will be sufficient to analyze the data being acquired in FIRST in real time. However, if these capacities prove insufficient, the integration middleware (the ZeroMQ messaging technology) will be used to distribute the data-analysis pipeline across several machines. Technically, this means that the pipeline will end with a ZeroMQ emitter at some point and continue with a ZeroMQ receiver on a different machine. In this setting, the data will be processed to a certain stage on one machine and then sent via the messaging "channel" to another machine in the network for further processing. In addition, it would also be possible to apply scaling techniques (e.g. load balancing) across different machines, as pointed out in (FIRST D2.3 Scaling Strategy, 2011). Other aspects, such as instantiating processing components dynamically (on demand), are conceivable as well, but we currently do not plan to implement these complex scaling mechanisms in FIRST. Note that in reality we do not run one single pipeline but rather a workflow consisting of several pipelines; even so, the intuitions and principles for the distributed processing of FIRST data streams remain the same.

6.2. Deployment scenarios

As pointed out in the previous chapter, messaging middleware provides opportunities for flexible deployment, taking into account aspects such as scalability and hardware availability.
In the primary deployment scenario, the FIRST system will run on a single dedicated machine, as pointed out in section 6.1. However, for the sake of further system exploitation, additional deployment scenarios are considered. As the architecture supports system distribution (through the ZeroMQ messaging middleware), and given a reliable and fast network connection between the nodes, further deployment possibilities may be considered. Apart from the standalone deployment, these are:

Deploying the FIRST system on multiple commodity machines. By distributing the FIRST processing pipeline across more machines, a larger number of less powerful (and cheaper) servers can provide the processing power of better-performing, more expensive servers. Such a scenario, based on distributing the pipeline components (pipeline splitting and parallelization), has been described from a technical point of view in D2.3 Section 2.2. Distributing the FIRST system across many nodes may affect the timeliness of processing, but it may still be desirable from an economic point of view.

Deploying FIRST in a cloud-based environment. This scenario is suitable where pay-as-you-go cloud services are considered for deployment. In such a case, FIRST might be deployed on a number of virtual machines, sized according to the target resource usage demand. It is, however, not in the scope of the project to provide dynamically controlled deployment for the cloud; rather, a static deployment is considered, based on the required throughput estimations.

It must be stated that these scenarios, though conceivable and viable within the architectural scope, do not have to be supported in the further development of the project if they prove unnecessary from an exploitation or business point of view.

6.3. Industrial perspective of the FIRST system

The FIRST system and its computation results provide an important data source to complement existing financial industry applications. In this section we provide the perspective of Use Case 1 (UC1) with regard to the FIRST architecture and its context within the financial industry. The area of capital markets compliance involves the processing of large amounts of information regarding the trading activities of financial institutions, the customers and employees effecting these trades, reference market data, and, new and importantly, announcements and other textual information which may affect the prices of securities. These data are delivered into the existing compliance architecture by way of interfaces from various trading and settlement systems, market data providers such as Thomson Reuters and Bloomberg, and other sources of static data. There are also large flows of unstructured but highly relevant information, such as news releases, earnings announcements, and other textual information from these and many other sources, which until now have not been amenable to automated processing. This data in structured form, as soon as it becomes available as a result of the FIRST project, may be incorporated into the stream of financial data in order to provide additional analysis scenarios. The FIRST architecture provides the necessary connectivity to facilitate integration with other financial systems. Such integration might be realized (i) in a data-oriented way, by accessing the databases with the results of unstructured data processing, or (ii) in a service-oriented way, by invoking high-level FIRST services to request certain functionality (e.g. alert reports).
Such data may later be used to augment existing financial systems with new analytical perspectives, e.g. by being incorporated into existing analysts' cockpits (see Figure 25) by their system providers.

Figure 25: Instrument Cockpit of the MACOC application augmented with FIRST data (mockup): the sentiment index of a chosen suspicious instrument is analysed alongside alerts coming from the BNEXT MACOC platform, enriched with data coming from FIRST

The overall result will enable a significant increase in scope for the existing automated architectures. In the aforementioned UC1 example, financial institutions and regulatory authorities will be given new tools to better ensure compliance in the trading environment. By integrating unstructured information into these architectures, it will become possible to automate trading surveillance to detect entirely new kinds of scenarios involving the misuse of such unstructured information.

7. Conclusion

This document presents a coherent integrated architecture as a baseline for the development of the FIRST system. By juxtaposing the requirements and the available state-of-the-art technologies, we chose appropriate techniques in order to deliver a suitable design and technical description that will ensure that the project criteria are met. This document also establishes the technical communication between the technical partners. It is an important input for future development in the following months, especially in the scope of WP7 system integration, but also for the other technical workpackages.

References

Bohanec, M. (2008). DEXi: Program for multi-attribute decision making, user's manual, version 3.00. IJS Report DP-9989. Ljubljana: Jožef Stefan Institute.
Bondi, A. B. (2000). Characteristics of scalability and their impact on performance. Proceedings of the 2nd International Workshop on Software and Performance (pp. 195-203). ACM.
Brewer, E. A. (2000, July). Towards robust distributed systems. Keynote at Principles of Distributed Computing. Portland, Oregon, US.
Cattell, R. (2010, December). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), pp. 12-27.
Eeles, P. (2005, November 15). Capturing architectural requirements. Retrieved July 1, 2011, from the IBM developerWorks Rational library: http://www.ibm.com/developerworks/rational/library/4706.html
FIRST D1.1 Definition of market surveillance, risk management and retail brokerage usecases. (2011).
FIRST D1.2 Usecase requirements specification. (2011).
FIRST D2.1 Technical requirements and state-of-the-art. (2011).
FIRST D2.3 Scaling Strategy. (2011).
FIRST D5.1 Specification of the information-integration model. (2011).
FIRST D6.1 Models and visualisation for financial decision making. (2011).
Fowler, M. (2002). Patterns of Enterprise Application Architecture. Addison Wesley.
Gilbert, S., & Lynch, N. (2002). Brewer's conjecture and the feasibility of consistent, available, and partition-tolerant web services. ACM SIGACT News, 33(2), pp. 51-59.
Gómez-Pérez, A., & Manzano-Maho, D. (2003). OntoWeb Deliverable 1.5: A survey of ontology learning methods and techniques. OntoWeb (IST-2000-29243).
Grady, R. (1992). Practical Software Metrics for Project Management and Process Improvement. Prentice-Hall.
Härder, T., & Reuter, A. (1983, December). Principles of transaction-oriented database recovery. ACM Computing Surveys, 15(4), pp. 287-317.
Har-Peled, S., Roth, D., & Zimak, D. (2003). Constraint classification for multiclass classification and ranking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference (pp. 809-816). British Columbia, Canada: MIT Press.
Obermayer (Ed.), Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference (pp. 809816). British Columbia, Canada: MIT Press. Hartigan, J. A., & Wong, M. A. (1979). Algorithm 136: A k-Means Clustering Algorithm. Applied Statistics , 28, 100–108. Hatzivassiloglou, V., & McKeown, K. (1997). Predicting the Semantic Orientation of Adjectives. pp. 174-181. Havre, S., Hetzler, B., & Nowell, L. (2000). ThemeRiver: Visualising Theme Changes over Time. Proceedings of InfoVis 2000, (pp. 115–123). Hohpe, G., & Woolf, B. (2003). Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison Wesley. Howitz, C. (2010, May 03). What Is 3-Tier(Multi-Tier) Architecture And Why Do You Need It? Retrieved August 16, 2011, from SimCrest ERP Round Table Blog: http://blog.simcrest.com/what-is-3-tier-architecture-and-why-do-you-need-it/ IBM. (2009). Rational Unified Process. IEEE Std 1471-2000, Recommended Practice for Architectural Description. (2000). Jude Pereira. (2006). Enterprise Application Integration: Approaches to Integration. MindTree Consulting. © FIRST consortium Page 50 of 53 D2.2 Klein, A., Altuntas, O., Kessler, W., & Häusser, T. (2011). Extracting Investor Sentiment from Weblog Texts. Proceedings of the 13th IEEE Conference on Commerce and Enterprise Computing (CEC). Luxembourg. Kusak, D. (2010). Comparison of Enterprise Application Integration Platforms. Piël, N. (2010, 06 23). ZeroMQ an introduction. Retrieved 06 30, 2011, from Nicholas Piël: http://nichol.as/zeromq-an-introduction Pritchett, D. (2008, June). BASE: An ACID Alternative. ACM Queue , 6 (3), pp. 48-55. Raible, M. (2010, November 18). Comparing JVM Web Frameworks. Retrieved August 17, 2011, from Raible Designs: http://raibledesigns.com/rd/entry/my_comparing_jvm_web_frameworks Raible, M. (2010, December 6). JVM Web Frameworks Rating Logic. Retrieved September 28, 2011, from https://docs.google.com/document/pub?id=1X_XvpJd6TgEAMe4a6xxzJ38yzmthvrA6wD 7zGy2Igog Shan, T. C., & Hua, W. W. (2006). Taxonomy of Java Web Application Frameworks. Proceedings of the IEEE International Conference on e-Business Engineering (pp. 378-385). Washington, DC, USA: IEEE Computer Society. Shaparenko, B., Caruana, R., Gehrke, J., & Joachims, T. (2005). Identifying Temporal Patterns and Key Players in Document Collections. Proceedings of TDM 2005, (pp. 165–174). Vogels, W. (2008, October). Eventually Consistent. ACM Queue , 6 (6), pp. 14-19. Westermann, E. (2009, April 7). Why EAI? Retrieved 06 29, 2011, from The Code Project: http://www.codeproject.com/Articles/35151/Why-EAI.aspx Zeromq. (2010, 08 04). Broker vs. Brokerless. Retrieved 06 30, 2011, from Zeromq: http://www.zeromq.org/whitepapers:brokerless ZeroMQ. (2011). ØMQ - The Guide. Retrieved July 28, 2011, from ØMQ: http://zguide.zeromq.org/page:all © FIRST consortium Page 51 of 53 D2.2 Annex 1. 
The following table presents the system requirements grouped into 8 functional categories as an input for analyzing project coverage from a technical point of view. See Section 2 for details. The categories are:

1. Data Feeding and acquisition
2. Retrieval of topics and features
3. Retrieval and Classification of Sentiments
4. User Interface & Service delivery
5. Access control and security
6. Maintenance and configuration
7. Storage, persistence and data access
8. Decision Support system and visualisations

[Table: the individual requirements of the three use cases (UC1.R-*, UC2.R-*, UC3.R-*) assigned to the eight functional categories above.]