Project Acronym:
FIRST
Project Title:
Large scale information extraction and
integration infrastructure for supporting
financial decision making
Project Number:
257928
Instrument:
STREP
Thematic Priority:
ICT-2009-4.3 Information and Communication Technology
D2.2 Conceptual and technical integrated architecture
design
Work Package:
WP2 – Technical analysis, scaling strategy and
architecture
Due Date:
30/09/2011
Submission Date:
30/09/2011
Start Date of Project:
01/10/2010
Duration of Project:
36 Months
Organisation Responsible for Deliverable:
ATOS
Version:
1.0
Status:
Final
Author Name(s):
Mateusz Radzimski, Murat
Kalender (ATOS), Miha Grcar
(JSI), Achim Klein, Tobias
Haeusser (UHOH), Markus Gsell
(IDMS), Jan Muntermann
(UGOE), Michael Siering (GUF)
Reviewer(s):
Achim Klein (UHOH), Irina Alic (UGOE)
Nature: R – Report | P – Prototype | D – Demonstrator | O – Other
Dissemination level: PU – Public | CO – Confidential, only for members of the consortium (including the Commission) | RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history
Version | Date | Modified by | Comments
0.1 | 18/04/2011 | Mateusz Radzimski (ATOS) | Early version of ToC provided
0.2 | 13/06/2011 | Mateusz Radzimski (ATOS) | Primary content for "Requirements Analysis" section
0.3 | 04/07/2011 | Murat Kalender (ATOS) | Primary content for "Integration approach" section
0.4 | 11/07/2011 | Mateusz Radzimski (ATOS) | Primary contribution to "Architectural perspective" section
0.5 | 26/07/2011 | Mateusz Radzimski (ATOS), Murat Kalender (ATOS) | Added content to "Architectural perspective" section, added Annex 1. Further contribution to "Integration Approach" section.
0.6 | 9/08/2011 | Markus Gsell (IDMS) | Added contribution to "Data storage design principles"
0.7 | 10/08/2011 | Mateusz Radzimski (ATOS) | Document refactoring and minor corrections according to teleconference discussions.
0.8 | 18/08/2011 | Murat Kalender (ATOS) | Added "Integrated GUI" subsection.
0.8.5 | 29/08/2011 | Murat Kalender (ATOS), Mateusz Radzimski (ATOS), Miha Grcar (JSI) | Changes to "Integrated GUI" chapter. Added "Hosting platform" chapter.
0.9 | 14/09/2011 | Miha Grcar (JSI), Achim Klein, Tobias Haeusser (UHOH), Markus Gsell (IDMS), Jan Muntermann (UGOE), Michael Siering (GUF), Mateusz Radzimski (ATOS) | Added chapter "Example of the FIRST process".
0.91 | 27/09/2011 | Mateusz Radzimski (ATOS), Murat Kalender (ATOS), Markus Reinhardt (NEXT) | Addressed reviewers' comments, added "Industrial perspective" subchapter. Contributions to "Design" chapter.
0.95 | 29/09/2011 | Mateusz Radzimski (ATOS) | Final version.
1.0 | 30/09/2011 | Tomás Pariente (ATOS) | Final QA and preparation for submission
Copyright © 2011, FIRST Consortium
The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute
all or parts of this document, provided that the FIRST project and the document are properly
referenced.
THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Executive Summary
This document performs a comprehensive architectural analysis of the FIRST system. Based on both technical and functional requirements, it defines a candidate architecture and technical design that will ensure the goals of the FIRST large-scale financial information system are met.
The analysis also investigates the system integration approach and selects the most suitable architectural patterns, mechanisms and technologies. It covers technical details concerning the high-level organisation of subsystems, their interactions, and specific aspects of technical components, with an emphasis on performance and high scalability.
Table of Contents
Executive Summary ...................................................................................................... 4
Abbreviations and acronyms ....................................................................................... 7
1. Introduction ............................................................................................................ 8
2. Requirement analysis ............................................................................................ 9
2.1. Relation of technical requirements and user requirements ............................... 9
2.2. Architecturally significant requirements ........................................................... 11
3. FIRST architectural perspective .......................................................................... 14
3.1. Overall project goals and architectural considerations .................................... 14
3.2. High level FIRST architectural view................................................................. 15
4. Integration approach ............................................................................................ 18
4.1. State of the art on integration approach .......................................................... 18
4.1.1 Application Integration .............................................................................. 18
4.1.2 Enterprise Application Integration ............................................................ 21
4.2. Pipeline processing ......................................................................................... 22
4.3. GUI integration ................................................................................................ 25
4.3.1 Introduction ............................................................................................... 25
4.3.2 Web Application Frameworks.................................................................... 26
4.4. Data storage design principles ........................................................................ 29
4.4.1 Storage paradigms .................................................................................... 29
4.4.2 Access layer .............................................................................................. 30
4.4.3 Mediation layer .......................................................................................... 31
5. Design ................................................................................................................... 32
5.1. Detailed components interaction perspective .................................................. 32
5.2. Sample FIRST process ................................................................................... 35
5.2.1 Sample scenarios...................................................................................... 35
5.2.2 Data acquisition and preprocessing pipeline ............................................. 37
5.2.3 Information extraction pipeline .................................................................. 38
5.2.4 Decision support models ........................................................................... 39
5.2.5 Data exchange between pipeline components .......................................... 41
5.2.6 Role of the integration layer ...................................................................... 42
5.2.7 GUI integration .......................................................................................... 45
5.2.8 Role of the storage components ............................................................... 45
6. Deployment ........................................................................................................... 47
6.1. Hosting platform .............................................................................................. 47
6.2. Deployment scenarios ..................................................................................... 47
6.3. Industrial perspective of the FIRST system ..................................................... 48
7. Conclusion ............................................................................................................ 49
References ................................................................................................................... 50
Annex 1. Requirements groups ............................................................................ 52
Index of Figures
Figure 1: Integrated architecture in the context of other workpackages ......................................... 8
Figure 2: Architectural mechanisms applied to requirements (Eeles, 2005) ................................ 11
Figure 3: FIRST High-level architecture – logical view ............................................................... 16
Figure 4: Remote Procedure Invocation Architecture (Hohpe & Woolf, 2003) ........................... 19
Figure 5: Communication model of a broker based messaging system (Zeromq, 2010).............. 19
Figure 6: Communication model of a brokerless messaging system (Zeromq, 2010) .................. 20
Figure 7: Architectures of the Enterprise Application Integration approaches (Kusak, 2010) ........ 21
Figure 8: Messaging systems performance comparison results .................................................... 23
Figure 9: ZeroMQ messaging patterns (Piël, 2010) ...................................................................... 24
Figure 10: ZeroMQ messaging patterns performance comparisons. ............................................ 25
Figure 11: Data Acquisition and Information Extraction pipeline integration architecture ......... 25
Figure 12: Client/server application and web application architectures (Howitz, 2010) .............. 25
Figure 13: Detailed component view (overview) .......................................................................... 32
Figure 14: Detailed component view (with highlighted interactions and data flow) .................... 34
Figure 15: An example of the topic trend visualization ................................................................ 35
Figure 16: DJIA vs. smoothed time series of the sentiment index (see (Klein, Altuntas, Kessler,
& Häusser, 2011)) ......................................................................................................................... 36
Figure 17: The current topology of the data acquisition and preprocessing pipeline. .................. 38
Figure 18: Canyon Flow: a “view” of a hierarchy of document clusters ...................................... 39
Figure 19: The Canyon Flow pipeline and the corresponding part of the Web-based integrated
GUI. ............................................................................................................................................... 40
Figure 20: Annotated document corpus serialized into XML. ...................................................... 41
Figure 21: Annotated document serialized into HTML and displayed in a Web browser ............ 42
Figure 22: Integration between data acquisition and information extraction components w/load
balancing. ...................................................................................................................................... 43
Figure 23: Example of synchronous high-level services invocation............................................. 44
Figure 24: Data-push mechanism for notification services ........................................................... 45
Figure 25: Instrument Cockpit of MACOC application augmented with FIRST data (mockup) . 48
Index of Tables
Table 1: Use case vs. Technical requirements cross matrix .......................................................... 10
Table 2: Architecturally significant requirements analysis ........................................................... 13
Table 3: GWT advantages and disadvantages............................................................................... 27
Table 4: JSF advantages and disadvantages .................................................................................. 27
Table 5: Spring MVC advantages and disadvantages ................................................................... 28
Table 6: A comparison of notable Java web application frameworks (Raible, Comparing JVM
Web Frameworks, 2010) ............................................................................................................... 29
Table 7: Main characteristics of relational and non-relational storage paradigms ....................... 30
Table 8: Portfolio selection scenario based on DJIA stocks and sentiment extracted from blog
posts (Klein, Altuntas, Kessler, & Häusser, 2011)........................................................................ 37
Abbreviations and acronyms
DOW – Description of Work
WP – Workpackage
TBD – To be defined
SOA – Service Oriented Architecture
API – Application Programming Interface
ESB – Enterprise Service Bus
UC – Use Case
PUB/SUB – Publish/Subscribe
REQ/REP – Request/Reply
JVM – Java Virtual Machine
CLR – Common Language Runtime
BOW – Bag of words
MVC – Model View Controller
UI – User Interface
JSF – Java Server Faces
GWT – Google Web Toolkit
1. Introduction
The main purpose of this document is to communicate the architectural and technical viewpoint within the project. The design will affect all technical components developed in other workpackages by outlining possible interaction patterns, dependencies and data flow. The architecture is also heavily influenced by the articulated requirements, both technical and use-case ones. For example, the processing methods envisaged for analysing huge data streams provide an important input that constrains architectural techniques and determines further technological choices. The idea is therefore to encompass all such requirements and constraints in a coherent design that will enable seamless future development of the FIRST system. It is also important to keep the description at a proper level of abstraction to avoid over-engineering the design. FIRST, being a research project, is driven by experiments and improvements over the current state of the art; defining all details at this stage of the project would therefore be infeasible. Instead, those details will be presented along with the prototype release milestones throughout the project lifetime in the corresponding deliverables.
The relation of this document to the other deliverables and technical workpackages is presented in Figure 1.
Figure 1: Integrated architecture in the context of other workpackages
2. Requirement analysis
This section recaps and analyses the technical requirements described in D2.1 that concern the architecture and behaviour of the overall FIRST system. We ensure that the requirement analysis is sound and provides firm ground for the subsequent technical architecture design. We therefore study and evaluate the requirements collected in D1.2 (use-case perspective) and D2.1 (technical perspective) in order to assess how the business use cases defined in WP1 are satisfied by the envisaged technical provisioning captured in the D2.1 technical requirements analysis. It is also important to analyse the requirements with regard to the benefit they bring to the overall system and their viability within the limits of the project's resources and of technical and scientific feasibility. This allows requirement priorities to be assigned accordingly, deciding which functionalities are essential for satisfying the use cases and project goals and which should be treated as supplemental.
A significant part of this chapter is devoted to the analysis of architecturally significant requirements, i.e. requirements that have a clear technical impact on the project and influence the architecture and design of the system. We proceed with this task by choosing the relevant ones and extending them where necessary with proposed design and technological details. They will serve as enablers of the architectural analysis.
2.1. Relation of technical requirements and user requirements
The use case requirement specification described in D1.2 forms a comprehensive overview of the system as seen by the use case stakeholders, through the description of system functionalities, non-functional attributes, actors, and context. A similar list has been provided for each of the three use cases. By analysing all requirements together from a functional point of view, eight main category groups can be identified:
1. Data feeding and acquisition
2. Retrieval of topics and features
3. Retrieval and classification of sentiments
4. User Interface and Service delivery
5. Access control and security
6. Configuration and maintenance
7. Storage, persistence and data access
8. Decision support and visualisation
Requirements falling into each category are closely related and define a common fragment of system functionality. Note that non-functional requirements are orthogonal to the aforementioned groups and are thus not listed here. The assignment of each requirement to the groups above is presented in Annex 1.
By clustering the use case requirements into functional categories, we reduce the number of items to analyse, which makes it viable to perform a requirement coverage breakdown using a requirements traceability matrix. Table 1 shows the coverage of each group (horizontal axis) by the relevant technical requirements (vertical axis). The analysis simplifies the multidimensional nature of the system requirements into a two-dimensional matrix. By identifying cross-reference relationships we obtain a quantitative result indicating how well each functionality has been described in terms of technological provisioning. The table should be read as: technical requirement R1.1 covers some aspects of global functionality no. 1. The numbers in the row titled "Quantitative technical coverage" denote how many technical requirements are related to a given functionality.
[The cross matrix of Table 1 could not be recovered from the source layout. Its columns are the eight functional categories listed above; its rows are the technical requirements R1.1–R8.4 from D2.1, covering: Internet connection bandwidth; concurrent execution of processes (hardware and software infrastructure); memory and persistent storage; API for external access; flexibility of the infrastructure; logging & monitoring; stability; pipeline latency and throughput; document format and encoding; interchange data format; data formats; unified access; ontology format, availability, purpose and evolution; data acquisition pipeline functionality and supported Web content formats; information extraction components; sentiment analysis; decision support model features and streams; and the programming/runtime environments of the data acquisition pipeline, information extraction pipeline, knowledge base and decision support components. Crosses mark which technical requirement covers which functionality, and a closing row rates the quantitative technical coverage of each category from Low to High.]
Table 1: Use case vs. Technical requirements cross matrix
The cross-reference analysis shows that coverage is lower for the storage and decision-support components (category 8). This is because, at this stage of the project, the data acquisition and preprocessing pipeline as well as the information retrieval framework are relatively well defined and early implementations have already started, while the decision-support models are yet to be fully defined. For this reason, fewer constraints have been put on this part of the system, to ensure enough flexibility when pursuing the use-case goals at a later time; these components will receive more technical coverage in the scope of the WP6 deliverables, according to the work plan. Consequently, the data storage components will need to adapt to the system at that time in order to properly store the models, predictions, and potentially other relevant data and metadata.
Categories 4, 5 and 6 have relatively low coverage. Category 4 ("User interface and service delivery") is described further in the context of integration and GUI provisioning (see chapter 4.3) for building the Integrated Financial Market Information System (WP7). Categories 5 and 6 are not considered main research or technical challenges within the project, and their priority is the lowest in comparison with the others.
2.2. Architecturally significant requirements
From an architectural perspective, some requirements are more important than others. Those which contain attributes or constraints related to architecture and design are called architecturally significant requirements, and the requirements already defined in D1.2 and D2.1 contain such items. The FURPS+ classification model (Grady, 1992) divides requirements according to the following characteristics (FURPS): Functionality, Usability, Reliability, Performance, and Supportability. Such requirements have been defined mostly in (FIRST D1.2 Usecase requirements specification, 2011). The "+" in the FURPS+ model adds further types: design, implementation, interface and physical requirements; these are covered by the additional technical requirements specified in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). While the latter group usually defines concrete design or implementation features, thereby influencing the architecture implicitly, the former group may need some further analysis and elaboration (see Figure 2) in order to define its effect on the architecture.
Figure 2: Architectural mechanisms applied to requirements (Eeles, 2005)
Table 2 presents the relevant architecturally significant requirements chosen from (FIRST D1.2 Usecase requirements specification, 2011) and (FIRST D2.1 Technical requirements and state-of-the-art, 2011) and gives a short analysis of their technical impact on the overall architecture. These aspects will be considered further when defining the FIRST system design.
Requirement ID¹: R4.1 "Pipeline latency" (also in use case non-functional requirements, e.g. UC3.RP1), R4.2 "Pipeline throughput"
Analysis & design mechanisms: The huge data stream processed in FIRST will be handled in a pipeline fashion, which requires a different integration approach from the one typical SOA-based systems represent. For instance, since data is coming in a constant stream, explicit requests might increase latency and lower overall throughput by inducing extra traffic (R4.2).
Influence on architecture: The most important aspect is to analyse approaches other than Web Services (WS) and SOA, especially those suitable for stream processing, and to compare them based on their performance and throughput capabilities. It may also result in some components being integrated following different, more robust techniques, while others (less performance-constrained) follow traditional approaches.

Requirement ID: UC1.R-EU1 "Alert Reports and Cockpit"
Analysis & design mechanisms: The alert and notification features of the use cases carry an implicit assumption that messages are delivered as soon as they appear in the data stream. Typical constant polling for messages ("did alert X appear in the data?") might quickly become unscalable with a growing amount of data and registered alerts.
Influence on architecture: The system should deliver alerts and notifications in a "push"-based manner, directly to the subscribed parties. This approach should be preferred to the typical requesting of data (polling), which might become inefficient in the future.

Requirement ID: R8.1, R8.2, R8.3
Analysis & design mechanisms: The system comprises components written using different technologies, so typical programmatic integration is not possible. A common integration mechanism that ensures robust exchange of data between different execution environments should be provided.
Influence on architecture: An integration middleware that connects such components should be able to communicate in a technology-independent manner.

Requirement ID: R7.1–6 and R-E3.1, R-E3.2
Analysis & design mechanisms: The technical components involved in data processing provide certain functionalities used for end-user GUIs or for integration with other applications (e.g. the use case implementations). That requires exposing them in a standard way, in the form of APIs (abstracted from the underlying technical components) that are accessible to other interested components.
Influence on architecture: The exposed functionalities should be reflected in the architectural overview in the form of a high-level service layer comprising the APIs for developing the Integrated Financial Market Information System and the use case implementations.

Requirement ID: R2.2 "Flexibility of the infrastructure"
Analysis & design mechanisms: Architectural flexibility requires a certain level of component decoupling, and the integration middleware should possess such characteristics. This will also support the development and testing process.
Influence on architecture: FIRST will follow component-based design techniques; the integration middleware will allow for clear system decomposition and will support deploying system components independently.

Requirement ID: R2.3 "Concurrent execution of processes – software infrastructure"
Analysis & design mechanisms: To support concurrent processing, the architecture should easily allow techniques such as parallelisation from the very beginning. This affects the middleware layer as well as individual components.
Influence on architecture: Data distribution and result collection across different components should be possible and easily supported by the architecture, as should techniques such as load balancing and distributed processing.

¹ Requirement IDs correspond to their counterparts defined in (FIRST D1.2 Usecase requirements specification, 2011) and (FIRST D2.1 Technical requirements and state-of-the-art, 2011).

Table 2: Architecturally significant requirements analysis
3. FIRST architectural perspective
This chapter gives a high-level overview of the FIRST architecture in support of the FIRST project goals. It also depicts a coarse-grained decomposition of the main FIRST building blocks and explains how the different use cases are built on top of the generic FIRST architecture.
3.1. Overall project goals and architectural considerations
The role of defining the architecture is to translate the project vision and the defined requirements into a common technical platform that will successfully fulfil the project goals. Architectural analysis focuses on defining a candidate architecture and constraining the architectural techniques to be used in the system (IBM, 2009). The input for this task comes from the project goals, the use case analysis and the requirements definition, so as to provide a suitable architecture ensuring that the overall project meets its objectives. It especially accommodates the non-functional and technical requirements that strongly influence the final design and implementation. The result is a technical overview of the organization and structure of components (IBM, 2009).
From the global project perspective, the objective of the system is to improve the decision-making process (e.g. reduce investment risk, increase return; see (FIRST D1.1 Definition of market surveillance, risk management and retail brokerage usecases, 2011)) based on unstructured data sources such as news articles or blogs. For that reason, the system is required to process large amounts of data and perform automatic analysis that would otherwise be impossible. The architecture should therefore enable the system to:
- process, analyse and make sense of huge amounts of unstructured data,
- provide the results of financial news (articles, blogs, etc.) analysis and relevant information extraction in reasonably short time (near real-time),
- integrate components and algorithms in order to automatically process financial data and allow sophisticated decision models to be applied,
- provide a graphical interface to present and visualize data relevant for decision making.
Furthermore, the architecture should (among other characteristics) ensure maintainability, extendibility and flexibility. Flexibility may enable further exploitation possibilities, while maintainability and extendibility allow for a seamless development process that may include introducing changes and technical modifications during development and the further stages of the project.
The architecture should follow a component-based structure. While the most important functionality comes from the data processing pipeline (FIRST D2.1 Technical requirements and state-of-the-art, 2011), individual parts of the system might also be subject to separate exploitation in the future. However, technologies supporting modular architecture and decomposition, by introducing a new layer of abstraction, usually impose some performance limitations or extra message overhead; in SOA integration approaches this may be a central broker or an application server for deploying services.
In FIRST we may divide the architecture integration into two parts:
• Lightweight integration for pipeline processing: integration of the components directly involved in processing unstructured documents (pipeline processing components) and exchanging huge amounts of data. The focus is mostly on lightweight, performance-wise approaches that ensure the project goals.
• Classical SOA-based system integration: integration of the other Integrated Financial Market Information System components, such as services for constructing the user interface, the use case implementations and access to stored data, following the best known integration approaches. Some features of components involved in pipeline processing might also be exposed as non-strictly performance-oriented services: offering fragments of the pipeline as separate services may provide added value for further project exploitation or for the individual exploitation of components. Reusing components as services could also be viable; for example, isolated sentiment analysis functionality can be wrapped as a service and offered separately. This part also contains the GUI implementation, which is analysed later in this document.
The overall FIRST architecture will accommodate both aforementioned techniques in a coherent way in order to satisfy the project goals.
3.2. High level FIRST architectural view
This section depicts the architectural, high-level perspective of the FIRST system. It provides a structural view of the components and explains their relationships with the other parts of the system. Static analysis is covered in this chapter, while a more detailed analysis, including a dynamic view, is presented in chapter 5. We loosely adopt the approach of presenting the architecture as a set of different but coherent views of the system and its environment (IEEE Std 1471-2000, Recommended Practice for Architectural Description, 2000).
From the software engineering perspective, the very heart of the project is the sentiment analysis analytical pipeline for information extraction from unstructured sources (FIRST D2.1 Technical requirements and state-of-the-art, 2011). It provides the core functionality and serves as the common part providing the necessary data to all use cases. To allow flexible implementation of, e.g., use case functionalities, a standard API will be defined that exposes high-level services to the use case providers and to the FIRST GUI demonstrating the system's capabilities. This API will further allow other parties to develop their own applications, or to integrate it with their own systems in order to offer new services and added value on top of the FIRST system.
Based on this description, a multi-tier architecture is a suitable choice for describing the system. From the analytical point of view, a multi-tiered architectural pattern allows different layers (tiers) in the system to be clearly distinguished and separated (Fowler, 2002). In the FIRST system, the logical, high-level view consists of the following layers (as depicted in Figure 3).
Figure 3: FIRST High-level architecture – logical view
The system, as envisaged from the architectural point of view, consists of the following parts, depicted as layers. Starting from the top, these are:
• Use case and GUI implementation layer – consists of the implementations of the three FIRST use cases and of the FIRST Integrated GUI (Integrated Financial Market Information System). It includes the end-user interfaces and the provisioning of graphical widgets for displaying computation results, including sentiments, alerts, event predictions, decision support and data stream visualisations. Moreover, the FIRST Integrated GUI provides an "entry point" for showcasing FIRST functionalities. All four parts are implemented against the FIRST APIs and include the necessary technological provisioning (e.g. a web application deployment server).
• High-level FIRST services layer (FIRST APIs) – a set of higher-level services running on top of the FIRST Analytical Pipeline and providing the necessary access to its computation results (a hypothetical sketch of such an API follows this list). They provide all concrete functionalities offered by the underlying technical components, wrapping them as services and therefore forming a logical abstraction over the high-performance lower-level components. These services are delivered by technical components implemented within workpackages WP3, WP4, WP5, WP6 and WP7.
• Middleware/integration layer – provides a fast, robust and lightweight infrastructure for integrating the different components of the FIRST Analytical Pipeline while supporting low latency, high performance and throughput. It also offers advanced techniques for general pipeline scaling, as explained in (FIRST D2.3 Scaling Strategy, 2011). This layer integrates the components developed within workpackages WP3, WP4, WP6 and WP7 that take part in pipeline processing.
• Pipeline layer – the FIRST Analytical Pipeline is a set of components that process the stream of documents in a sequential manner. While in principle we use the term "pipeline" in the singular, it may consist of a number of parallel "pipelines" balanced for handling bigger data streams. The integration of these components is provided by the middleware/integration layer.
• Data storage layer – an underlying set of storage services providing unified data access to support the pipeline operations, such as storing intermediate documents, decision models, ontology versioning, or archiving computation results. Its design has been carried out within WP5 (FIRST D5.1 Specification of the information-integration model, 2011).
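To make the role of the high-level FIRST services layer more concrete, the sketch below shows how such an API might be exposed to use case implementations. It is an illustration only: the interface and type names (SentimentService, SentimentScore) are hypothetical and do not denote the project's actual API.

```java
import java.util.Date;
import java.util.List;

// Hypothetical facade over the FIRST sentiment analysis services; the
// pipeline components behind it remain hidden from API clients.
interface SentimentService {

    // Aggregated sentiment for one financial instrument over a time window.
    SentimentScore getSentiment(String instrumentId, Date from, Date to);

    // Identifiers of the most recent annotated documents mentioning the instrument.
    List<String> getRecentDocumentIds(String instrumentId, int limit);
}

// Simple value object returned by the service layer.
class SentimentScore {
    private final double polarity;   // -1.0 (negative) .. +1.0 (positive)
    private final int documentCount; // number of documents aggregated

    SentimentScore(double polarity, int documentCount) {
        this.polarity = polarity;
        this.documentCount = documentCount;
    }

    public double getPolarity() { return polarity; }
    public int getDocumentCount() { return documentCount; }
}
```

A use case implementation or the Integrated GUI would depend only on such interfaces, leaving the middleware and pipeline layers free to change underneath.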
4. Integration approach
In the following sections we analyse common integration approaches and choose the most suitable one for the FIRST architecture.
4.1. State of the art on integration approach
In the computer science domain, the term integration denotes the process of making disparate software modules work together by exchanging information, in order to build a complete application or fulfil a task. Integration can be categorized into three types according to the application area (Westermann, 2009): Application Integration (AI), Enterprise Application Integration (EAI) and Business Integration (BI).
In general, Application Integration enables applications to exchange information; messaging is one of the commonly used approaches. Enterprise Application Integration builds on application integration methodologies and deals with the integration and orchestration of enterprise applications and services; the Enterprise Message Bus and the Enterprise Service Bus are commonly used EAI technologies. Business Integration builds on EAI methodologies and deals with the technical infrastructure of an organization, such as exposing parts of a business's operations on the public internet for use by customers (Westermann, 2009).
A scalable and efficient integration platform is required in order to connect the components of the FIRST analytical pipeline and build a complete system. In the following sections, Application Integration and Enterprise Application Integration approaches are presented in detail, and their suitability for the FIRST pipeline is then discussed. Business Integration approaches are not analysed because they are applied at the level of business organizations for integrating several complex systems; for this reason they are not suitable for integrating the FIRST components into the FIRST system.
4.1.1 Application Integration
There are four main Application Integration approaches: File Transfer, Shared Database, Remote Procedure Invocation and Messaging (Hohpe & Woolf, 2003).
In the File Transfer approach, applications communicate via files accessible to the integrated applications: one application writes a file and another reads it later. The applications must agree on the filenames, locations, formats and maintenance of the files. The following figure shows the integration of two applications using the File Transfer approach.
Figure 1: File Transfer Architecture (Hohpe & Woolf, 2003)
The Shared Database approach integrates applications through a single shared database. The applications can access the same database and information, so there is no need to transfer information between them directly. One of the biggest difficulties with Shared Database is the design of the database schema. The following figure shows the integration of three applications using the Shared Database approach.
Figure 2: Shared Database Architecture (Hohpe & Woolf, 2003)
The Remote Procedure Invocation approach integrates applications by exposing application functionality that can be called remotely by other applications. The exposed functionality can be used to transfer data between applications, or to let external applications modify the data. Web Services, which use standards such as SOAP and XML, are an example of Remote Procedure Invocation. The following figure shows the integration of two applications using the Remote Procedure Invocation approach.
Figure 4: Remote Procedure Invocation Architecture (Hohpe & Woolf, 2003)
Messaging is the exchange of messages between applications in a loosely coupled, distributed manner (Java Message Service, 2011). Communication between applications can be established using TCP network sockets, HTTP, etc. Messaging channels are opened, and applications transfer data by sending and receiving messages through a channel; the applications must agree on the channel and the message format. There are two models of how messaging can be done: broker-based and brokerless. In a broker-based messaging system there is a messaging server in the middle: every application is connected to the central broker, no application speaks directly to another, and all communication passes through the broker. Figure 5 shows the communication model of a broker-based messaging system.
Figure 5: Communication model of a broker based messaging system (Zeromq, 2010)
The advantages and disadvantages of broker-based messaging systems are:
Advantages
• Applications do not need to know the location of other applications.
• Message sender and receiver lifetimes do not have to overlap.
• Resistant to application failure.
Disadvantages
• The excessive amount of network communication decreases performance.
• The broker is the bottleneck of the whole system; when the broker fails, the whole system stops working.
In a brokerless messaging system, clients interact directly with each other; there is no central messaging server. Figure 6 shows the communication model of a brokerless messaging system.
Figure 6: Communication model of a brokerless messaging system (Zeromq, 2010)
The advantages and disadvantages of brokerless messaging systems are:
Advantages
• No single bottleneck.
• High performance, with less network communication.
Disadvantages
• Each application has to connect to the applications it communicates with, and thus has to know the network address of each such application.
• Application failures cause persistence and stability problems.
File Transfer, Shared Database and Messaging are data-based integration approaches: they enable applications to share their data but not their functionality. Remote Procedure Invocation enables applications to share functionality, which makes them tightly coupled to (dependent on) each other; remote calls are also much slower and much more likely to fail than local procedure calls, which may cause performance and reliability problems.
File Transfer and Shared Database keep the applications well decoupled, so different technologies can be used in the applications. However, these approaches require a synchronization mechanism to inform the integrated applications when data is ready for consumption. Moreover, they require disk access to store and retrieve data, which increases the cost of communication between applications.
To integrate applications in a loosely coupled, reliable way with high performance, Messaging is the most suitable Application Integration approach for the FIRST pipeline. Messaging is reliable because it provides a retry mechanism to make sure message transfer succeeds, and applications are synchronized with each other through automatic notification of message transfer, which increases the performance of the system.
4.1.2 Enterprise Application Integration
The term Enterprise Application Integration denotes the use of tools to integrate enterprise applications and services. EAI tools typically act as middleware between applications, and communication between the EAI tools and the applications is handled by a messaging middleware inside. There are four main Enterprise Application Integration types (Kusak, 2010): Point-to-Point topology, Hub-and-Spoke topology, Enterprise Message Bus integration and Enterprise Service Bus integration. Figure 7 shows the architectures of these integration approaches.
Figure 7: Architectures of the Enterprise Application Integration approaches (Kusak, 2010)
In the Point-to-Point approach, a point-to-point topology is formed by the direct interaction of applications, which creates tight coupling between them. This approach is generally used when there are few applications and few interactions between them. In this type of integration, the integration logic is embedded into the applications, which removes central control, administration and management.
The Hub-and-Spoke topology is formed by the interaction of applications via a central hub. The main advantage of this topology is that fewer connections are required than in the point-to-point topology, and interactions can be managed centrally. However, the hub becomes a single point of failure: when the hub fails, the whole integration topology fails.
The Enterprise Message Bus integration approach is an evolved version of the Hub-and-Spoke approach. A message bus forms the central part of the topology, and applications communicate with it via message brokers or adapters. The main advantage of this topology is the messaging underlying the communication, which offers performance, persistence and reliability.
The Enterprise Service Bus (ESB) is a software infrastructure used as a backbone for SOA implementations. The Gartner Group defines an ESB as "a new architecture that exploits Web services, messaging middleware, intelligent routing, and transformation. ESBs act as a lightweight, ubiquitous integration backbone through which software services and application components flow." An ESB promotes lower complexity, flexibility and low cost by combining the benefits of standards-based Web Services with other EAI approaches (Jude Pereira, 2006).
Judged on these advantages alone, an ESB would be the most suitable EAI approach for integrating the FIRST pipeline. However, it focuses on issues not present in the FIRST system (such as the integration of a large number of components, or content-based message routing), which makes it too complex and adds possible performance overhead to information exchange. The FIRST pipeline consists of a few applications, each with a clearly defined communication schema (passing data from one to another in a component chain). For this reason, Application Integration approaches, specifically messaging, are more suitable because of their simplicity and better performance in comparison with the more complex ESB middleware designed for much larger-scale integration.
4.2. Pipeline processing
The FIRST pipeline has three major modules to be integrated: the data acquisition and ontology infrastructure (WP3), the semantic information extraction system (WP4) and the decision support infrastructure (WP6). For the first semantic information extraction prototype, the WP3 and WP4 modules are integrated using the messaging approach. This section presents the design and implementation of the messaging integration solution in detail.
High performance and support for integrating components written in different technologies (.NET, Java) are the most important requirements for the FIRST integration approach (FIRST D2.1 Technical requirements and state-of-the-art, 2011). For programming-language independence, messaging systems that support multiple platforms and programming languages were investigated as potential solutions to the integration problem; messaging systems that support only a specific programming language were eliminated during our survey. After this elimination, the following popular messaging systems were chosen for further investigation: ZeroMQ¹, ActiveMQ², RabbitMQ³, OpenJMS⁴ and Apache Qpid⁵.
Performance is the most important criterion when selecting the messaging platform for the pipeline integration. For this purpose, performance tests were carried out using randomly generated data with fixed message sizes. In the experiments, 1000 messages were transferred between two applications running on the same machine, and the total transfer duration was used to calculate the throughput of each messaging system. Figure 8 shows the performance of these messaging systems:
¹ http://www.zeromq.org/
² http://activemq.apache.org/
³ http://www.rabbitmq.com/
⁴ http://openjms.sourceforge.net/
⁵ http://qpid.apache.org/
Figure 8: Messaging systems performance comparison results (throughput in MB/s for RabbitMQ, ActiveMQ, OpenJMS, ZeroMQ and Qpid)
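For illustration, the measurement described above can be structured as in the following sketch. It is a skeleton only: the MessageSink interface stands in for a binding to whichever messaging system is under test, and the in-memory implementation merely makes the example runnable.

```java
// Skeleton of the throughput test: send a fixed number of fixed-size
// messages and derive MB/s from the elapsed wall-clock time.
public class ThroughputTest {

    // Placeholder for the messaging system under test (ZeroMQ, ActiveMQ, ...).
    interface MessageSink {
        void send(byte[] message);
    }

    public static void main(String[] args) {
        final int messageCount = 1000;
        final byte[] payload = new byte[1024 * 1024]; // 1 MB dummy message

        // Trivial in-memory sink; a real test would wrap an actual transport.
        final long[] bytesReceived = new long[1];
        MessageSink sink = new MessageSink() {
            public void send(byte[] message) {
                bytesReceived[0] += message.length;
            }
        };

        long start = System.nanoTime();
        for (int i = 0; i < messageCount; i++) {
            sink.send(payload);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double megabytes = bytesReceived[0] / (1024.0 * 1024.0);
        System.out.printf("Throughput: %.2f MB/s%n", megabytes / seconds);
    }
}
```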
The experiments showed that ZeroMQ performed better than the other messaging platforms; being a brokerless messaging system, it requires less network communication. Among the broker-based systems, RabbitMQ performed best. Based on these results, ZeroMQ was chosen as the messaging platform for the FIRST pipeline integration because of its performance and its numerous bindings for most programming environments.
ZeroMQ (ØMQ) is a high-performance asynchronous messaging library with which communication between applications can be established without much effort. ZeroMQ offers performance, simplicity and scalability without a dedicated message broker (Piël, 2010). The library provides sockets to connect applications and exchange information, and supports several communication patterns (a brief sketch of one pattern follows the list):
• Request-reply connects a set of applications. First, the message consumer (client) requests a message transfer, and the message producer (server) sends a message. The request-reply pattern supports multiple message producers and consumers; there is load balancing between producers and consumers, and each message is consumed exactly once. The advantage of this approach is the synchronization between producer and consumer, with messages consumed one by one. On the other hand, two messages are exchanged for each packet of the message producer, which increases network traffic and decreases performance.
• Publish-subscribe connects a set of publishers to a set of subscribers. Publishers publish messages and subscribers receive them; each message is delivered to all subscribers.
• Push-pull (pipeline) connects applications in a pipeline pattern. The message producer pushes messages to the pipe without a request from the message consumer, and the messages are kept in a queue until they are consumed by the receiver. The pipeline approach performs faster than request-reply; however, the high performance comes with the risk of queue overflow and synchronization problems.
• Exclusive pair connects two applications in an exclusive way. Both applications can send and consume messages.
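As a sketch of one of these patterns, the fragment below uses the ZeroMQ Java binding to set up a publish-subscribe pair in a single process; the endpoint address, topic prefix and message content are arbitrary examples. This is also the pattern relevant to the push-based alert delivery discussed in chapter 2.2.

```java
import org.zeromq.ZMQ;

// Minimal ZeroMQ publish-subscribe sketch using the classic Java binding.
public class PubSubSketch {
    public static void main(String[] args) throws InterruptedException {
        ZMQ.Context context = ZMQ.context(1);

        // Publisher binds and broadcasts messages to all subscribers.
        ZMQ.Socket publisher = context.socket(ZMQ.PUB);
        publisher.bind("tcp://*:5556"); // example endpoint

        // Subscriber connects and filters on a topic prefix.
        ZMQ.Socket subscriber = context.socket(ZMQ.SUB);
        subscriber.connect("tcp://localhost:5556");
        subscriber.subscribe("ALERT".getBytes());

        Thread.sleep(100); // give the subscription time to propagate

        publisher.send("ALERT instrument=ACME sentiment=-0.8".getBytes(), 0);

        byte[] message = subscriber.recv(0); // blocks until a message arrives
        System.out.println(new String(message));

        subscriber.close();
        publisher.close();
        context.term();
    }
}
```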
Figure 9: ZeroMQ messaging patterns (Piël, 2010)
The messaging system is responsible for the exchange of information from WP3 to WP4 in an efficient, stable and reliable way; the pipeline and request-reply message patterns are thus the most suitable for our needs. To observe the performance and suitability of these patterns, experiments were carried out on a test dataset collected by WP3. The test dataset consists of 2,378 files with a total size of 5.26 GB; each file was transferred as one message from one application to the other using both message patterns.
Figure 10: ZeroMQ messaging patterns performance comparison (throughput in MB/s, pipeline vs. request-reply)
In the experiments we observed that the pipeline approach (data push) performs approximately three times better than the request-reply pattern (Figure 10), owing to the absence of the extra communication overhead of the data requests sent each time by the client; instead, the data is pushed to the client as soon as it is available. For these reasons, the pipeline pattern was selected for transferring messages between the pipeline components (WP4 and WP6). Figure 11 shows the architecture of the pipeline between the systems.
Figure 11: Data Acquisition and Information Extraction pipeline integration architecture
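A corresponding sketch of the selected push-pull integration, again using the ZeroMQ Java binding, is given below. The endpoint and message content are illustrative; in the real pipeline the sender and receiver would live in separate components (e.g. data acquisition and information extraction).

```java
import org.zeromq.ZMQ;

// Minimal ZeroMQ push-pull (pipeline) sketch: the upstream component
// pushes documents, the downstream component pulls them as they arrive.
public class PipelineSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);

        // Upstream side binds a PUSH socket.
        ZMQ.Socket sender = context.socket(ZMQ.PUSH);
        sender.bind("tcp://*:5557"); // example endpoint

        // Downstream side connects a PULL socket; several PULL workers on
        // the same endpoint would be load-balanced automatically.
        ZMQ.Socket receiver = context.socket(ZMQ.PULL);
        receiver.connect("tcp://localhost:5557");

        sender.send("<document>...</document>".getBytes(), 0);

        byte[] document = receiver.recv(0); // blocks until a message is queued
        System.out.println("Received " + document.length + " bytes");

        receiver.close();
        sender.close();
        context.term();
    }
}
```

Unlike request-reply, no request message precedes each transfer, which is exactly the saving observed in the experiment above.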
4.3. GUI integration
4.3.1 Introduction
A common front-end Graphical User Interface (GUI) is required to present the Integrated Financial Market Information System to end users. This section discusses the technical requirements of the GUI and presents state-of-the-art technologies that fit these requirements.
Two types of applications are commonly used to implement an integrated GUI: client/server applications and web applications. A client/server application follows a two-tier architecture: it runs on the client side and accesses information on a remote server. A web application follows an n-tier architecture: it is accessed via a web browser and all application logic is handled on the server side (see Figure 12). Web applications are generally designed with a 3-tier architecture:
• Presentation tier: the front-end content rendered by the browser.
• Application tier: controls the application's functionality.
• Data tier: stores and retrieves information.
Figure 12: Client/server application and web application architectures (Howitz, 2010)
In a client/server-based application, the application must be installed on each client's computer. To avoid the burden of deploying to and maintaining every user machine, the integrated GUI will be implemented as a web application. In the case of the FIRST system, the data tier is provided in the form of the Data Storage APIs of the FIRST high-level services layer.
4.3.2 Web Application Frameworks
Web application frameworks (WAF) are widely used for developing web applications because of their benefits (simplicity, consistency, efficiency, reusability). A large number of programming-language-specific web application frameworks are available. For platform independence and performance, the integrated GUI will be implemented in Java, so we focus on Java-based web application frameworks; from the client perspective, only a web browser should be needed.
Java web application frameworks can be categorized into five categories (Shan & Hua, 2006):
• Request-based frameworks use controllers and actions that handle incoming requests from users; the user session is kept on the server side. Struts¹, WebWork² and Stripes³ are examples of request-based frameworks.
• Component-based frameworks abstract the request-handling mechanism and encapsulate the logic into reusable components; events are triggered automatically for handling incoming user requests. The development model of this type of framework is similar to desktop GUI framework models. JSF⁴ and Apache Wicket⁵, both widely used for web application development, are examples of component-based frameworks.
• Hybrid frameworks combine the request-based and component-based approaches; the entire data and logic flow of components is handled as in a request-based model. RIFE⁶, a full-stack web application framework, falls into this category.
• Meta frameworks provide a set of core interfaces for integrating components and services; a meta framework can be considered a framework of frameworks. Spring⁷ and Keel⁸ are examples of meta frameworks.
• Rich Internet frameworks use a client-side container model in which requests are handled on the client side, which decreases the amount of server communication and load. Google Web Toolkit⁹ (GWT) and Echo2¹⁰ are popular Rich Internet frameworks.
The main purpose of using software frameworks is to reduce the amount of time, effort and resources required to develop and maintain web applications. The performance of the framework is also a very important factor when choosing a web application framework. From this perspective, three popular frameworks are analysed: GWT, JSF and Spring MVC.
GWT is a component-based Rich Internet framework that allows web developers to create and maintain complex JavaScript front-end applications. GWT allows AJAX applications to be written in Java and then compiled into highly optimized JavaScript that runs across all browsers, including mobile browsers for Android and the iPhone.
¹ http://struts.apache.org/
² http://www.opensymphony.com/webwork/
³ http://www.stripesframework.org/display/stripes/Home
⁴ http://javaserverfaces.java.net/
⁵ http://wicket.apache.org/
⁶ http://rifers.org/
⁷ http://www.springsource.org/
⁸ http://www.keelframework.org/
⁹ http://code.google.com/webtoolkit/
¹⁰ http://echo.nextapp.com/site/echo2
Advantages and disadvantages of GWT are listed below:
Advantages
- Simplicity
  o No need to learn/use the JavaScript language (a reliable, strongly-typed language, Java, is used for development and debugging)
  o Leverages the various tools of the Java programming language for writing/debugging/testing
- Performance
  o Generates optimised JavaScript code
  o Can use complex Java on the client
- Scalability
  o Stateful client, stateless server
- AJAX support
- Compatibility
  o No need to handle browser incompatibilities and quirks
Disadvantages
- Steep learning curve
- Heavy dependence on JavaScript
  o Not search engine friendly
  o Results in client web browser applications that consume much of the memory
- Need for more components
  o GWT doesn't come out of the box with all possible widgets; there is a need to use extra components
Table 3: GWT advantages and disadvantages
Java Server Faces (JSF) is a component-oriented and event-driven framework based on the Model View Controller (MVC) pattern. The view layer is separated from the controller and the model. Event-driven User Interface (UI) components are provided by the JSF API. The UI components and their state are represented on the server, with a defined life-cycle of the UI components. Advantages and disadvantages of JSF are listed below:
Advantages
- Simplicity
  o Easy to learn for existing Java web developers
  o Enables the use of IDEs for Rapid Application Development (NetBeans, JDeveloper, Eclipse, etc.)
  o Follows the MVC design pattern
- Compatibility
  o No need to handle browser incompatibilities and quirks
Disadvantages
- Performance
  o Every button or link clicked results in a form post, which might result in a bad experience from the end user's point of view
- Scalability
  o The state of the components is stored in session objects, which makes it difficult to run in distributed mode
Table 4: JSF advantages and disadvantages
Spring MVC is the request-based web framework of the Spring Framework. The framework defines strategy interfaces for all of its responsibilities, which are tightly coupled to the Servlet API. The following are Spring MVC advantages and disadvantages:
Advantages
- Simplicity
  o Easy to test
  o Cleaner code
  o Follows the MVC design pattern
  o Integrates with many view options seamlessly: JSP/JSTL, Tiles, Velocity, FreeMarker, Excel, XSL, PDF
Disadvantages
- Configuration intensive
- No common parent controller, resulting in the need to handle many issues individually
- No built-in Ajax support
Table 5: Spring MVC advantages and disadvantages
Table 6 presents a comparison of notable Java web application frameworks. The frameworks are rated between 0 and 1 per criterion; the rating logic for the individual features is described in (Raible, JVM Web Frameworks Rating Logic, 2010). Based on the author's evaluations, Spring MVC and GWT are the highest-rated frameworks (a higher score is better).
Criteria | Struts 2 | Spring MVC | Wicket | JSF 2 | Tapestry | Stripes | GWT | Vaadin
Developer Productivity | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 1.0 | 1.0
Developer Perception | 0.5 | 1.0 | 1.0 | 0.0 | 0.5 | 1.0 | 1.0 | 1.0
Learning Curve | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0
Project Health | 0.5 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 | 1.0 | 1.0
Developer Availability | 0.5 | 1.0 | 0.5 | 1.0 | 1.0 | 0.5 | 1.0 | 0.5
Job Trends | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 | 0.0 | 1.0 | 0.0
Templating | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 | 0.5 | 0.5
Components | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.5 | 1.0
Ajax | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0
Plugins or Add-Ons | 0.5 | 0.0 | 1.0 | 1.0 | 0.5 | 0.0 | 1.0 | 1.0
Scalability | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 0.5
Testing | 1.0 | 1.0 | 0.5 | 0.5 | 1.0 | 1.0 | 0.5 | 0.5
i18n and l10n | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0
Validation | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0
Multi-language Support (Groovy / Scala) | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0
Quality of Documentation/Tutorials | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0
Books Published | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 | 0.5 | 1.0 | 0.5
REST Support (client and server) | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.5
Mobile / iPhone Support | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
Degree of Risk | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5
Totals | 14.5 | 17 | 15 | 13.5 | 15 | 14 | 17 | 15.5
Table 6: A comparison of notable Java web application frameworks (Raible, Comparing JVM Web Frameworks, 2010)
Performance and scalability are very important factors for the success of the FIRST project. For this reason, after analyzing the advantages and disadvantages of the frameworks above, GWT has been selected as the web application framework for developing the integrated GUI. GWT is the most promising solution given its simplicity, scalability and performance.
GWT enables the development of web applications without writing JavaScript code on the client side. In some special cases (e.g. integration with a non-GWT application), it may still be necessary to develop custom JavaScript code. jQuery (http://jquery.com/), the most popular JavaScript library, would be used for this purpose. jQuery is a cross-browser JavaScript library designed to simplify client-side scripting. There is a plug-in called GwtQuery which can be used like jQuery within the GWT framework.
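To illustrate the development model that motivated this choice, the sketch below shows a minimal GWT entry point written entirely in Java. It is an illustrative example only: the package, class and widget wiring are assumptions, not part of the actual FIRST code base.

package eu.first.gui.client;

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.event.dom.client.ClickEvent;
import com.google.gwt.event.dom.client.ClickHandler;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.Label;
import com.google.gwt.user.client.ui.RootPanel;

/**
 * Minimal GWT entry point: plain Java that the GWT compiler turns into
 * cross-browser JavaScript. Names are illustrative only.
 */
public class SentimentDashboard implements EntryPoint {

  @Override
  public void onModuleLoad() {
    final Label status = new Label("No analysis requested yet.");
    Button analyse = new Button("Analyse sentiment");

    // Event handling is written in Java; no hand-written JavaScript is needed.
    analyse.addClickHandler(new ClickHandler() {
      @Override
      public void onClick(ClickEvent event) {
        // Here the widget would invoke the Sentiment Analysis API
        // asynchronously (e.g. via GWT-RPC or a REST call).
        status.setText("Sentiment analysis request sent...");
      }
    });

    RootPanel.get().add(analyse);
    RootPanel.get().add(status);
  }
}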
4.4. Data storage design principles
The design of the knowledge base is conducted with the following prerequisites in mind:
 Choice of paradigm(s) for the physical storage system
 Provision of a stable access interface to hide the underlying complexity
 Encapsulation of business logic in a mediation layer
4.4.1 Storage paradigms
The most crucial and fundamental decision regarding data storage is the choice of the paradigm of the physical storage system. Besides the prevalent relational database systems, there is a variety of alternatives, each with its characteristic advantages and disadvantages, which are commonly referred to under the umbrella term NoSQL (commonly spelled out as "Not only SQL", although the buzzword lacks a formal definition). Many of these non-relational storage alternatives provide better performance, but in most cases this comes at the cost of a relaxation of the ACID guarantees, which refer to the atomicity of transactions, the consistency of the data store before and after a transaction, the isolation of transactions, and the durability of completed transactions. As (Härder & Reuter, 1983, p. 291) noted, "These four properties, atomicity, consistency, isolation, and durability (ACID), describe the major highlights of the transaction paradigm, which has influenced many aspects of development in database systems." This was true when the statement was made nearly three decades ago, and it has also been the main guideline for the advancements in database technology made since then.
Only with the advent of internet technology and strongly increasing read and write accesses to (potentially distributed) database systems was it questioned whether this underlying paradigm fits all use cases best. As (Brewer, 2000) pointed out in his CAP theorem, only two of the three desirable characteristics of distributed databases (consistency, availability and partition tolerance) can be fulfilled at a time; (Gilbert & Lynch, 2002) provide a more formal proof of the theorem. Therefore, in many high-load real-world use cases (see e.g. (Vogels, 2008) or (Pritchett, 2008) for Amazon and eBay respectively), the requirements toward the ACID guarantees are relaxed in favour of heightened availability, i.e. performance, of the system.
Non-relational database paradigms usually do not fulfil the ACID guarantees to the full extent, but adhere to the so-called BASE paradigm, where BASE stands for Basically Available, Soft state, Eventual consistency (Brewer, 2000). Within the BASE paradigm, temporary data inconsistencies are tolerated in order to improve read and write performance. Depending on the requirements of the particular use case, it has to be decided whether such a relaxation is tolerable or not.
NoSQL alternatives can roughly be separated into the following groups according to their data model (see e.g. (Cattell, 2010)):
 Key-value stores provide efficient storage and retrieval for values (objects) of any kind based on a programmer-defined key. Typically, the values (objects) are not interpreted by the system but stored and retrieved as-is, which limits searching capabilities to the associated key. Examples: Project Voldemort, Tokyo Cabinet, Memcached.
 Document stores provide efficient storage and retrieval for documents (objects) of any kind. In contrast to key-value stores, the content and structure of the documents, i.e. attribute names, scalar values, but also nested lists or other documents, are interpreted by the system and made available for queries. Examples: MongoDB, CouchDB.
 Extensible record stores organize data in a way similar to relational tables, but enable a dynamic number of attributes. Furthermore, tables can be partitioned both vertically and horizontally at the same time. Examples: Google BigTable, Apache HBase.
Relational databases | NoSQL data stores
Difficult schema evolution/adaptation | Schemaless
ACID guarantees, strong consistency | BASE paradigm, weak consistency
Data typically normalized | Data typically de-normalized
Standardized, expressive query language | Individual query capabilities
(Mainly) vertical scaling | Horizontal scaling
Table 7: Main characteristics of relational and non-relational storage paradigms
The decision whether to use the relational database approach or NoSQL data stores is mainly a trade-off between ACID compliance and query versatility on the one hand and improvements in terms of performance and scalability on the other. This decision has to be made considering the requirements of the particular use case.
4.4.2 Access layer
As the FIRST knowledge base is faced with the requirement to provide high-performance storage for data of different modalities (some data items are more structured, some are less structured), with different expected insert and retrieval frequencies and patterns, the knowledge base will address these heterogeneous requirements by using the data storage paradigm that best suits each data modality individually, rather than using a common paradigm for all kinds of data. This implies that a variety of paradigms will be applied on the actual storage layer (regarding the specific use cases addressed by the knowledge base and the storage paradigms actually chosen to cope with the particular requirements, see deliverable D5.1).
This diversity on the storage layer calls for an abstract access layer in order to hide the complexity of the storage layer from knowledge base clients, i.e. the other technical work packages.
Besides being in line with the DoW, this layer of indirection provides further advantages, such as the exchangeability of storage components. The parallel usage of different paradigms makes it possible to compare their suitability for different large-scale requirements. As the choice of an approach shall not create a technological lock-in in the long run, there shall remain the option to switch the underlying storage paradigm in case one of the chosen approaches proves not to scale as expected. Without an abstract access layer, exchanging storage components would be hardly achievable, as it would impact all clients and require them to adapt their way of accessing the data accordingly. With an abstraction layer in place, exchanging components can be conducted seamlessly, as clients do not even notice that the backend storage structure has changed.
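As a minimal sketch of such an access layer (with illustrative names; the actual FIRST interfaces are defined in WP5), clients could program against a Java contract like the following, while the backing implementation maps calls onto whichever storage paradigm is chosen:

import java.util.List;

/**
 * Sketch of the stable access-layer contract. Clients program against this
 * interface only; the backing store (relational, key-value, document, ...)
 * can be exchanged without any client-side changes.
 */
interface DocumentStore {

  /** Persists an annotated document and returns its storage key. */
  String store(AnnotatedDocument document);

  /** Retrieves a single document by its storage key. */
  AnnotatedDocument retrieve(String key);

  /** Retrieves all documents mentioning the given stock symbol. */
  List<AnnotatedDocument> findByStockSymbol(String stockSymbol);
}

/** Placeholder for the annotated-document structure (see Section 5.2.5). */
class AnnotatedDocument {
  String name;
  String text;
}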
4.4.3 Mediation layer
To enable the seamless exchange of a storage solution while providing a stable access interface, the actual business logic is encapsulated in a mediation layer, which covers various functions (a minimal sketch follows the list):
 Transformation of requests accepted from the access layer into transactions on the storage layer, utilizing knowledge about its actual structure
 Provision of services to foster performance, e.g. caching, if not provided by the underlying storage solution
 Provision of services to handle exceptional load, e.g. queuing of requests, if not provided by the underlying storage solution
 Provision of services to avoid resource bottlenecks, e.g. maintaining connection pools
 Maintenance of thread pools to cater for parallel processing of independent requests
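The following Java sketch illustrates how such a mediation layer could wrap a concrete store behind the access-layer interface sketched in the previous section; the caching and thread-pool details are assumptions for illustration, not the actual FIRST implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of a mediation layer wrapping a concrete DocumentStore (interface
 * from the access-layer sketch above). It adds a read-through cache and a
 * thread pool for parallel, independent requests; all names are illustrative.
 */
class MediatingDocumentStore {

  private final DocumentStore backend;   // performs the actual storage transactions
  private final Map<String, AnnotatedDocument> cache = new ConcurrentHashMap<>();
  private final ExecutorService workers = Executors.newFixedThreadPool(16);

  MediatingDocumentStore(DocumentStore backend) {
    this.backend = backend;
  }

  /** Read-through cache: only hit the storage layer on a cache miss. */
  AnnotatedDocument retrieve(String key) {
    return cache.computeIfAbsent(key, backend::retrieve);
  }

  /** Queue the write; the thread pool absorbs short load peaks. */
  Future<String> storeAsync(AnnotatedDocument document) {
    return workers.submit(() -> backend.store(document));
  }
}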
5. Design
This chapter describes the details of the system design and the interaction of the main components. It also depicts a sample FIRST scenario to illustrate the data flow through the pipeline components and the role of the storage components and integration layer in bringing the results of pipeline computations to the end user. The design of FIRST is based on the state of the art in system integration as well as on the technical requirements and architectural analysis. Moreover, internal details of the FIRST process are provided as part of subsection 5.2.
5.1. Detailed components interaction perspective
Figure 13 presents the detailed component view divided into layers (light blue boxes) that correspond to the tiers described in chapter 3.2. Components have been surrounded with red boxes that clearly indicate which workpackage they belong to. The remaining details have been omitted for brevity. The FIRST high-level services layer has not been marked with a red box, as it exposes services of all technical components (of workpackages WP3, WP4, WP5, WP6 and WP7), each of which is developed within its own workpackage:
Figure 13: Detailed component view (overview)
 Data Cleaning API: provides services based on the data acquisition components (WP3) that deal with preparing data to be further analyzed by the WP4 information extraction components (such as language detection, boilerplate removal, near-duplicate removal, etc.).
 Visualisation API: services for providing data to the visualisation user interface (WP3). As the visualisation is connected to the data stream going through the pipeline, the Visualisation API provides a stream of data pushed to the registered components.
 Sentiment Analysis API: a set of services for performing sentiment analysis of a news article or a chosen text employing WP4 components.
 Decision Support API: a set of services exposing the decision-making infrastructure (e.g. event detection, predictions) implemented within WP6 components.
 Alerts API: PUB/SUB services used to register for specific events. The receiver is notified as soon as an event occurs in the pipeline. Events are detected by WP6 components by analyzing the data stream, while the WP7 collector component handles the dispatch job.
 Storage Access API: provides unified access to the multiple WP5 data storage solutions.
These services provide the building blocks for implementing the Integrated Financial Market Information System end user GUIs. The list of services is not definitive and may be expanded according to future use case and GUI requirements. Depending on the usage scenario, the services above can be either request-reply (REQ/REP) services or publish-subscribe (PUB/SUB) stream-based services, except for the Visualisation API and Alerts API, which have a PUB/SUB, stream-based nature. Details of using the high-level services are described in section 5.2. Figure 14 marks the components that expose high-level services with red arrows.
The interaction between components is highlighted in Figure 14. The main part in the centre (the irregular dark-blue shape) denotes the data flow through the pipeline components. Data is passed between components that use different technologies through lightweight pipeline integration components (high-performance asynchronous messaging). They interconnect the data acquisition component (WP3) with information extraction (WP4), and later the sentiment analysis component (WP4) with decision support (WP6). Results of decision support (e.g. events) are received by the data collecting component of WP7. The pipeline integration components work by pushing the data as it comes, without the need for acknowledgements or replies. Details of this approach are outlined in section 5.2.6.1.
Figure 14: Detailed component view (with highlighted interactions and data flow)
Internally, WP3 and WP4 components might be closely integrated within their own execution environments for performance reasons (.NET CLR for WP3 components and JVM for WP4 components). This is depicted with gray arrows connecting data acquisition with ontology learning and information extraction with sentiment analysis. The direction of the gray and thick blue arrows denotes the data flow.
The Information Integration layer (presented at the bottom) exposes its API both (i) to high-level services, to foster use case and GUI implementations that access the data gathered during pipeline processing, and (ii) to individual components, mostly for storing processing results. Each storage facility is dedicated to storing a different kind of data; therefore, each technical component connects to a different store, depending on the data type. Internal details of the Information Integration layer are presented in (FIRST D5.1 Specification of the information-integration model, 2011).
Structured data is delivered to the system through the Universal Data Adapter, which connects external data providers (such as the IDMS Financial Data API) with components that internally use the data for processing. According to the current analysis, only the Decision Support System is connected so far.
The ontology learning component also takes part in pipeline processing; however, its outputs are not streamed back into the pipeline. The processing result of this component is an updated ontology that is shared with the information extraction component. Because ontology snapshot updates are not very frequent (daily), the sharing mechanism is not required to be robust, and simple file sharing is considered.
5.2. Sample FIRST process
The purpose of this section is mainly to illustrate how the FIRST integrated system is employed in two concrete sample scenarios: (1) the topic trend visualization use case and (2) the portfolio selection use case. In the context of these two use cases, we demonstrate the FIRST analytical pipeline (WP3–WP6), point out how the chosen inter-component messaging technology (i.e., ZeroMQ) is employed, and show how the Web-based integrated GUI (WP7) is built on top of the pipeline.
5.2.1 Sample scenarios
5.2.1.1 Topic trend visualization
The topic trend visualisation provides valuable insights into how topics evolve through time. One such visualisation is called ThemeRiver (Havre, Hetzler, & Nowell, 2000). It visualises thematic variations over time for a given time period. The visualisation resembles a river of coloured "currents" representing different topics. A current narrows or widens to indicate a decrease or increase in the strength of the corresponding topic. The topics of interest are predefined by the analyst as a set of keywords. The strength of a topic is computed as the number of documents containing the corresponding keyword. Shaparenko et al. build on the ThemeRiver idea to analyse a dataset of scientific publications (Shaparenko, Caruana, Gehrke, & Joachims, 2005). They identify topics automatically by employing the k-means clustering algorithm. Figure 15 shows an example of the topic trend visualization (taken from (FIRST D2.1 Technical requirements and state-of-the-art, 2011) Section 2.5.4).
Figure 15: An example of the topic trend visualization
The topic trend visualization is required by Use Case 3 (Retail brokerage use case), Application Scenario 3 (Visualize topic trends and their impact), as specified by the requirements UC3.R-EU8.x (see (FIRST D1.2 Usecase requirements specification, 2011) Section 4.4.3 for more details). The main idea of this application scenario is to enable the user to visually detect newly emerging or diminishing topics and to assess the relevance of topics currently being discussed in news and/or blogs.
This use case demonstrates the entire FIRST process, with the exception of the information extraction pipeline, which is bypassed in this particular case. The acquired data is first sent through the preprocessing pipeline and then sent (through the ZeroMQ messaging technology) directly to the Canyon Flow pipeline. The Canyon Flow model (i.e., a cluster hierarchy) is sent to a Web server which forwards it to its clients. A JavaScript component, part of the FIRST Web-based integrated GUI, visualizes the model to the user.
5.2.1.2 Portfolio selection use case
The portfolio selection use case is to demonstrate the economic utility of the results of Work Package 4 "Semantic Information Extraction System". The Work Package's model extracts investor sentiment with respect to objects and features that are specific to the three use cases of FIRST. The portfolio selection use case specifically targets extraction results for the use case "Investment Management and Retail Brokerage". In this use case, the observed sentiment objects are financial instruments and the feature of the object is "expected future price change".
In the portfolio selection case, sentiment is extracted from blog posts that refer to stocks of the Dow Jones Industrial Average (DJIA). The extraction starts on the sentence level, yielding a crisp classification of sentiment polarity with respect to a stock. The sentiments that refer to the same instrument are then aggregated to the document level by means of a real-valued score ratio that accounts for the direction and intensity of the sentiment. It is normalized to the interval [-1, 1]. Scores > 0 are interpreted as positive, scores <= 0 as negative. As the score is normalized, we are able to average the scores of several documents from the same day that refer to the same instrument. An example of a time series of the resulting sentiment index with respect to the DJIA is displayed in the following figure.
Figure 16: DJIA vs. smoothed time series of the sentiment index (see (Klein, Altuntas, Kessler,
& Häusser, 2011))
Figure 16 displays a smoothed sentiment index (a simple moving average of length 20 days). We hypothesize that the sentiment index can be beneficially exploited in the portfolio selection use case, i.e., that a selection strategy that involves the sentiment would provide excess returns over a buy-and-hold strategy. We simply enter a long position on the next day's open price once the sentiment index si(day) >= th, or a short position if si(day) < th, with th being a threshold on the interval [0, 1] with a default value of 0. The position is closed on the close price of day+n, with n >= 1, if si(day+n) changes its direction (indicated by a change of the algebraic sign). To test this strategy, we specify a historic back-testing simulation. The simulation scenario consists of daily price time series for 26 DJIA stocks and blog posts retrieved from the blogger.com platform in the period 2007–2010. See Table 8 for details.
Stock symbol | Number of documents | Number of days with sentiment
AA | 691 | 450
AXP | 871 | 600
BA | 1718 | 965
BAC | 1748 | 672
CAT | 1110 | 730
CSCO | 2756 | 1111
CVX | 400 | 322
DIS | 1922 | 974
GE | 1617 | 901
HD | 1337 | 829
HPQ | 1206 | 721
IBM | 2900 | 1220
INTC | 1006 | 623
JNJ | 241 | 193
JPM | 1061 | 652
KFT | 231 | 202
KO | 2735 | 1190
MRK | 515 | 417
MSFT | 3948 | 1332
PFE | 331 | 284
PG | 373 | 293
T | 2173 | 924
TRV | 40 | 39
UTX | 79 | 70
WMT | 4180 | 1382
XOM | 844 | 571
Table 8: Portfolio selection scenario based on DJIA stocks and sentiment extracted from blog posts (Klein, Altuntas, Kessler, & Häusser, 2011)
For the results of the portfolio selection test in the historic simulation, please refer to (Klein,
Altuntas, Kessler, & Häusser, 2011).
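For illustration, the entry and exit rule described above can be expressed compactly. The following Java sketch makes obvious simplifications (no data handling, transaction costs or position sizing) and uses illustrative names; it is not the actual simulation code:

/**
 * Minimal sketch of the back-testing rule described above: enter long at the
 * next day's open if si(day) >= th, short otherwise (th in [0, 1], default 0);
 * close the position once the index flips its algebraic sign.
 */
class SentimentStrategy {

  private final double th;

  SentimentStrategy(double threshold) {
    this.th = threshold;
  }

  /** +1 = enter long at the next day's open price, -1 = enter short. */
  int signalForDay(double sentimentIndex) {
    return sentimentIndex >= th ? +1 : -1;
  }

  /** Close the position at the close of day+n once the sign changes. */
  boolean shouldClose(double indexAtEntry, double indexAtDayN) {
    return Math.signum(indexAtEntry) != Math.signum(indexAtDayN);
  }
}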
5.2.2 Data acquisition and preprocessing pipeline
The data acquisition and preprocessing pipeline is common for both use cases. It consists of (i)
data acquisition components, (ii) data cleaning components, (iii) natural-language
preprocessing components, (iv) semantic annotation components, and (v) ZeroMQ emitter
components. The current workflow topology is shown in Figure 17.
Figure 17: The current topology of the data acquisition and preprocessing pipeline (one RSS reader per site, 80 readers in total, feeding 20 load-balanced processing pipelines).
The data acquisition components are mainly RSS readers that poll for data in parallel. One RSS
reader is instantiated for each Web site of interest. The RSS sources, corresponding to a
particular Web site, are polled one after another by the same RSS reader to prevent the servers
from rejecting requests due to concurrency. An RSS reader, after it has collected a new set of
documents from an RSS source, dispatches the data to one of the 20 available processing
pipelines. The pipeline is chosen according to its current load size (load balancing). A
processing pipeline consists of a boilerplate remover, language detector, duplicate detector,
sentence splitter, tokenizer, part-of-speech tagger, semantic annotator, and ZeroMQ emitter.
The majority of these components were already discussed in FIRST D2.1 (see (FIRST D2.1
Technical requirements and state-of-the-art, 2011) Section 2.1). The natural-language
processing stages (i.e., sentence splitter, tokenizer, and part-of-speech tagger) were added
because they are a prerequisite for the semantic annotation component and also for the
information extraction tasks. Finally, the ZeroMQ emitters were added to establish a “messaging
channel” between the data acquisition and preprocessing components (WP3) and the
information extraction components (WP4). This enables us to run the two sets of components in
two different processes (i.e., runtime environments) or even on two different machines. The
information extraction components are discussed in the following section.
5.2.3 Information extraction pipeline
The information extraction pipeline receives the acquired, pre-processed, and annotated documents that the previous stages of the pipeline, described above, deliver. The main purpose of this part of the pipeline is to extract, classify, and aggregate sentiment with respect to use-case-specific financial objects and features, such as the price or volatility of a financial instrument. These sentiments can then be used as semantic features in subsequent decision models. To enable this, the final stage of this pipeline segment stores all extracted sentiments with their respective attributes in the knowledge base.
For realizing the extraction of sentiments, we employ an ontology-guided and rule-based information extraction approach. In contrast to pure machine learning approaches that leverage statistics, this allows for a deeper analysis that is specific with respect to certain objects and features. We integrate financial knowledge by modelling and conceptualizing relevant parts of the domain in the ontology. Linguistic knowledge (independent of a specific domain) is brought to work to enable as generic a formulation of rules as possible. These rules define sentiment-extraction patterns. Inherent parts of the definition of these patterns are the annotations of the text created by previous parts of the information processing pipeline in FIRST (e.g., part-of-speech tagging, lemmatization, and named entity extraction). The analysis takes place on several levels of a document, starting on the word and phrase level. Based on this, the sentiment is aggregated to the document level. For this purpose, simple scoring (a quantification of sentiment that represents sentiment direction and intensity) is employed.
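A minimal sketch of the document-level aggregation idea is given below; the exact scoring formula used by the WP4 components may differ, so the ratio used here is illustrative only:

import java.util.List;

/**
 * Sketch of the document-level aggregation: crisp sentence-level polarities
 * (+1 / -1) for one instrument are combined into a score ratio on [-1, 1]
 * that captures both direction (sign) and intensity (magnitude).
 */
class SentimentAggregator {

  /** polarities: crisp sentence-level classifications, +1 or -1. */
  static double documentScore(List<Integer> polarities) {
    if (polarities.isEmpty()) {
      return 0.0;
    }
    int positive = 0, negative = 0;
    for (int p : polarities) {
      if (p > 0) positive++; else negative++;
    }
    // Normalized ratio in [-1, 1]; scores of several documents from the
    // same day can then be averaged, as described above.
    return (positive - negative) / (double) (positive + negative);
  }
}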
5.2.4 Decision support models
5.2.4.1 Topic trend visualisation models
In FIRST, topic trends will be visualized with a ThemeRiver-like algorithm called Canyon Flow. Technically speaking, the algorithm provides a "view" of a hierarchy of document clusters, as illustrated in Figure 18. The underlying algorithm is essentially a hierarchical bisecting clustering algorithm employed on the bag-of-words representation of documents. In Figure 18, c(C, t) denotes the number of documents with time stamp t in cluster C. This implies that the "width of a current" at time t is proportional to the number of documents with time stamp t in the corresponding cluster. Note that this algorithm is originally not employed on document streams; it merely visualizes a dataset of documents with time stamps. Note also that the visualization is interactive: the user is able to select a different view of the cluster hierarchy (e.g., a more fine-grained view of a certain topic).
Figure 18: Canyon Flow: a “view” of a hierarchy of document clusters
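The quantity c(C, t) defined above can be maintained incrementally while documents are assigned to clusters, as in the following illustrative Java sketch (cluster identifiers and time bucketing are assumptions):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the quantity c(C, t) behind Figure 18: for each cluster C in the
 * current view, count the documents whose time stamp falls into bucket t.
 * The current's width at t is then proportional to this count.
 */
class CurrentWidths {

  /** counts[cluster][timeBucket] -> number of documents. */
  private final Map<String, Map<Long, Integer>> counts = new HashMap<>();

  /** Called for every document as it is assigned to a cluster. */
  void add(String clusterId, long timeBucket) {
    counts.computeIfAbsent(clusterId, c -> new HashMap<>())
          .merge(timeBucket, 1, Integer::sum);
  }

  /** c(C, t): width of cluster C's current at time bucket t. */
  int c(String clusterId, long timeBucket) {
    return counts.getOrDefault(clusterId, Collections.emptyMap())
                 .getOrDefault(timeBucket, 0);
  }
}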
The approach to employing the algorithm on streams and scaling it up will be very similar to the pipelining approach presented in (FIRST D2.1 Technical requirements and state-of-the-art, 2011) Section 4.2.1. The algorithm will be decomposed into pipeline stages as shown in Figure 19. When a new data instance, i.e., an annotated document corpus, enters the Canyon Flow pipeline, the documents are first transformed into their bag-of-words (BOW) representation. Then, the BOW vectors are sent to the hierarchical clustering stage. The usual bisecting (k-means) hierarchical clustering algorithm will be replaced with an efficient online (i.e., stream-based) variant (see (FIRST D2.3 Scaling Strategy, 2011) Annex 1). The obtained cluster hierarchy (more accurately, the changes to the current cluster hierarchy) will be pushed, through the ZeroMQ messaging technology, to a Web server that forwards it to the clients (i.e., Web browsers). On the client side, a JavaScript component will be responsible for updating the visualization (the currents will move with time from right to left) and for interacting with the user.
Figure 19: The Canyon Flow pipeline and the corresponding part of the Web-based integrated
GUI.
It is also important to note that the user will be able to specify a set of keywords to express his interests (as discussed in (Havre, Hetzler, & Nowell, 2000)). The Canyon Flow pipeline will not be altered in any way for this purpose; the required keyword filter will be applied by the Web server prior to pushing the data to the client.
5.2.4.2 Portfolio selection models
As described above, the portfolio selection use case aims at building a portfolio of financial instruments, where the instruments are selected according to the sentiment expressed in the information sources. This use case builds upon the sentiment measure provided through the analytical pipeline. In a first step, the portfolio selection process will be based solely on the sentiment measure. As previous research has shown, buying or selling stocks according to the value of a daily sentiment measure can be profitable (Klein, Altuntas, Kessler, & Häusser, 2011). For that purpose, a daily aggregation of the sentiment measure will be retrieved from the analytical pipeline. In this context, we will investigate which threshold of the sentiment measure is appropriate for portfolio selection decisions.
In a second step, we will investigate whether the sentiment measure can be used as an input
into more sophisticated approaches. For that purpose, different methods including machine
learning techniques (explained in detail in (FIRST D6.1 Models and visualisation for financial
decision making, 2011)), qualitative modelling, and optimization techniques will be considered.
Apart from the sentiment measure, possible input variables which will be received from the
previous steps are technical and fundamental indicators or bag-of-words representations of
texts. Furthermore, historical price series will be requested through the knowledge base
developed in WP5. Taking into account these inputs, we aim at detecting patterns which can
serve as a basis for developing portfolio selection models.
Once developed, the portfolio selection models will be stored in the knowledge base and will be
updated frequently. Portfolio selection decision support will be calculated on the server-side and
will be provided on request of the user (pull-based), e.g., using a Web browser.
5.2.5 Data exchange between pipeline components
A batch of documents (either news or blog posts), passed between components (stages) in the data acquisition pipeline (WP3), is internally stored as an annotated document corpus object. The annotated document corpus (ADC) data structure is very similar to the one that GATE [1] uses and is best described in the GATE user's guide [2]. An ADC can be serialized either into XML or into a set of HTML files. Figure 20 shows a toy example of an ADC serialized into XML. In short, a document corpus normally contains one or more documents and is described with features (i.e., a set of key-value pairs). A document is also described with features and in addition contains annotations. An annotation gives a special meaning to a text segment (e.g., token, sentence, named entity). Note that an annotation can also be described with features.
<DocumentCorpus xmlns="http://freekoders.org/latino">
  <Features>
    <Feature>
      <Name>source</Name>
      <Value>smh.com.au/technology</Value>
    </Feature>
  </Features>
  <Documents>
    <Document>
      <Name>Steve Jobs quits as Apple CEO</Name>
      <Text>Tech industry legend and one of the finest creative minds of a generation, Steve Jobs, has resigned as chief executive of Apple.</Text>
      <Annotations>
        <Annotation>
          <SpanStart>75</SpanStart>
          <SpanEnd>84</SpanEnd>
          <Type>named entity/person</Type>
          <Features />
        </Annotation>
        <Annotation>
          <SpanStart>122</SpanStart>
          <SpanEnd>126</SpanEnd>
          <Type>named entity/company</Type>
          <Features>
            <Feature>
              <Name>stockSymbol</Name>
              <Value>AAPL</Value>
            </Feature>
          </Features>
        </Annotation>
      </Annotations>
      <Features>
        <Feature>
          <Name>URL</Name>
          <Value>http://www.smh.com.au/technology/technology-news/steve-jobs-quits-as-apple-ceo-20110825-1jat8.html</Value>
        </Feature>
      </Features>
    </Document>
  </Documents>
</DocumentCorpus>
Figure 20: Annotated document corpus serialized into XML.
[1] GATE is a Java suite of tools developed at the University of Sheffield, used for all sorts of natural language processing tasks, including information extraction. It is freely available at http://gate.ac.uk/
[2] Available at http://gate.ac.uk/sale/tao/split.html (see http://gate.ac.uk/sale/tao/splitch5.html#x8-910005.4.2 for some simple examples of annotated documents in GATE).
The annotated document contained in the XML in Figure 20, serialized into HTML and displayed in a Web browser, is shown in Figure 21.
Figure 21: Annotated document serialized into HTML and displayed in a Web browser
The data acquisition pipeline ends with a ZeroMQ emitter that sends an annotated document corpus into the information extraction pipeline (WP4). The information extraction pipeline is based on GATE, and since GATE also uses annotated document corpora internally, a relatively simple transformation (more accurately: serialization) is applied to transform the XML, received by the ZeroMQ receiver, into a GATE document corpus object. For performance reasons, however, the data acquisition pipeline should eventually serialize documents straight into the GATE format to avoid extra XML manipulations.
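For illustration, the following sketch reads the annotations out of the ADC XML of Figure 20 on the receiving side using only the JDK's DOM parser; in the actual pipeline this role is played by the transformation into a GATE corpus object:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch: extracts the annotations from an ADC XML document (Figure 20)
 * with the standard JDK DOM parser; error handling is simplified.
 */
class AdcReader {

  static void printAnnotations(String adcXml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(adcXml.getBytes(StandardCharsets.UTF_8)));

    NodeList annotations = doc.getElementsByTagName("Annotation");
    for (int i = 0; i < annotations.getLength(); i++) {
      Element a = (Element) annotations.item(i);
      String type = a.getElementsByTagName("Type").item(0).getTextContent();
      String start = a.getElementsByTagName("SpanStart").item(0).getTextContent();
      String end = a.getElementsByTagName("SpanEnd").item(0).getTextContent();
      // E.g. "named entity/person [75..84]" for the toy corpus above.
      System.out.println(type + " [" + start + ".." + end + "]");
    }
  }
}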
Note that the decision support components (WP6) also form pipelines. The data passed between the decision support pipeline stages is not necessarily in the form of document corpora. At this stage it is not possible to define exactly what kind of data will be passed between those components; the main data structure will most likely be a sparse matrix, used to describe graphs and feature vectors.
5.2.6 Role of the integration layer
From the dataflow perspective, the integration layer has a three-fold role in the project: first, it
enables robust data communication between different component groups that form the
analytical pipeline; second, it provides a set of high-level services for accessing concrete
features of FIRST for implementing use case scenarios; and third, it supports GUI services by
providing publish-subscribe mechanisms to deal with event notifications coming from constant
document stream processing. We present the role of all three integration facets in the following
sections.
5.2.6.1 Message passing in the pipeline integration layer
The analytical pipeline is composed of groups of components performing specialized tasks (data acquisition, information extraction and sentiment analysis, decision support and visualisation); every group of components is developed by its respective partner and written in a different technology, as described in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). Internally, those components can be integrated using each language's most suitable mechanism (method invocation, threads, events, internal queues, etc.), which does not impose any extra overhead: the document stream can be handled within each group's own execution environment (Java Virtual Machine for Java components or Common Language Runtime for .NET components) in a zero-copy manner, i.e. without the necessity to copy memory or data from one component to another. On the other hand, integration middleware is crucial where components written in different technologies need to pass data to each other. Message passing using lightweight communication over network sockets (e.g. ZeroMQ) allows for minimal-overhead integration both on the local machine and in a high-speed local network environment, enabling pipeline decoupling and overall system distribution, due to the lack of a central broker.
While the pipeline components are designed to process data in a pipeline approach, data passing, load balancing and queue buffering are handled by the middleware layer. The data in the pipeline is always pushed forward to the next component. Each emitter component passes data only to its receiving counterpart, without the mediation or routing that is characteristic of large, distributed bus-based systems (e.g. ESB-based systems). In FIRST the number of components is limited and fixed; although traffic load balancing is taken into account, the number of components remains static.
Technically, the analytical pipeline integration layer connects the data acquisition, information extraction and decision support systems and the data collecting components for GUI support (as depicted in the previous chapter). There is also the possibility of further pipeline manipulation techniques, e.g. splitting the pipeline into more parts in order to maximize the usage of computing resources, but the mechanism of communication remains the same. This and other techniques support the system scalability goals and are further described in (FIRST D2.3 Scaling Strategy, 2011) Section 2.
Figure 22: Integration between data acquisition and information extraction components with load balancing
Figure 22 illustrates a sample integration using ZeroMQ messaging. The whole emitter component (in blue) is integrated within the data acquisition components and shares a common heap space with them. The data (single documents) is asynchronously fed to the sending emitter through a helper buffer. The role of the buffer is explained in (FIRST D2.3 Scaling Strategy, 2011); it is used for supporting synchronous operations and in data-peak scenarios. Once sent through the socket, the data is received by the ZeroMQ receiver and made available to the Java component integrated with the information extraction component (in green). The receiving queue buffers the documents and compensates for delays in data processing.
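A minimal sketch of this PUSH/PULL channel is shown below, using the JeroMQ Java bindings (the actual WP3 emitter runs on the CLR, and the endpoint and payload here are illustrative). PUSH distributes messages across connected PULL sockets, which is what enables the load balancing described above:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

/**
 * Sketch of the PUSH/PULL pattern of Figure 22. Both ends are shown in one
 * process for brevity; normally the receiver runs in a separate runtime or
 * on a separate machine.
 */
public class PipelineChannel {

  public static void main(String[] args) {
    try (ZContext ctx = new ZContext()) {
      // Emitter side: binds and pushes serialized document corpora.
      ZMQ.Socket emitter = ctx.createSocket(SocketType.PUSH);
      emitter.bind("tcp://*:5555");

      // Receiver side: pulls data and hands it to the next pipeline stage.
      ZMQ.Socket receiver = ctx.createSocket(SocketType.PULL);
      receiver.connect("tcp://localhost:5555");

      emitter.send("<DocumentCorpus>...</DocumentCorpus>");
      String corpusXml = receiver.recvStr();   // blocks until data arrives
      System.out.println("Received " + corpusXml.length() + " characters");
    }
  }
}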
5.2.6.2 High-level services data flow
High-level system services are the conceptual building blocks for GUI and use case implementation. They form a common system API for any envisaged integration by exposing high-level, concrete functionalities of the FIRST system. As opposed to the push-based pipeline dataflow, the high-level services are mostly request-response based: they respond to on-demand queries over the gathered data. These services are exposed using a classical SOA approach and can be invoked synchronously according to the use case application's internal business logic (see Figure 23).
Figure 23: Example of synchronous high-level services invocation
5.2.6.3 Push-based GUI services
Communication services are needed to get incoming data from the high-level services to the Integrated GUI. Web applications communicate using the HTTP protocol. HTTP has no support for allowing a server to notify a client; a strict request-response model is used, where the client (the web browser running the Integrated GUI) makes a request to the server (calls the high-level system services), which must then respond with the requested data. In this protocol, sending a notification to the client is not possible. Therefore, a polling technique can be used to retrieve data from the server: the client sends requests to the server at fixed time intervals (every one or two seconds) to fetch data, if there is any. This approach, however, is inefficient.
Server Push (Comet) is used as an alternative to the polling technique to overcome these inefficiency and performance problems. Server Push is an approach in which a long-held HTTP request allows a web server to push data to a browser without the browser explicitly requesting it. Therefore, the client doesn't have to keep asking for updates.
There are several Server Push frameworks (e.g. GWT Comet and Atmosphere) that can be integrated into GWT-based web applications. These frameworks will be analyzed, and the one most suitable for the FIRST project will be selected as the communication framework between the Integrated GUI and the high-level services.
The key component for push-based GUI services is the WP7 data collector component attached to the final stage of the pipeline. It doesn't process data, but only listens for specific (subscribed) events. Once an event occurs, it is sent instantly to the subscribed party; in the case of the GUI implementation, this is the application server (e.g. a Java EE or servlet container). If the client is connected, the event is sent through an always-open HTTP channel in long-polling mode. It is the client's responsibility (through client-side JavaScript logic) to reopen the connection once a timeout occurs, and thus to maintain the open channel constantly throughout the whole user session. The whole process is depicted in Figure 24.
Figure 24: Data-push mechanism for notification services
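To make the server side of this mechanism concrete, the following sketch holds client requests open using the standard Servlet 3.0 asynchronous API, which is the pattern that the Comet frameworks mentioned above wrap; all names and payloads are illustrative:

import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Sketch of the long-polling channel of Figure 24 (steps 3 and 4): the
 * browser's request is parked until the pipeline signals an event.
 */
@WebServlet(urlPatterns = "/events", asyncSupported = true)
public class EventPushServlet extends HttpServlet {

  // Browsers currently waiting for an event.
  private final Queue<AsyncContext> waiting = new ConcurrentLinkedQueue<>();

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
    // Hold the request open instead of answering immediately.
    AsyncContext ctx = req.startAsync();
    ctx.setTimeout(30_000);          // client-side JavaScript reopens on timeout
    waiting.add(ctx);
  }

  /** Called by the event subscriber when the data collector signals an event. */
  public void pushEvent(String eventJson) throws IOException {
    AsyncContext ctx;
    while ((ctx = waiting.poll()) != null) {
      ctx.getResponse().setContentType("application/json");
      ctx.getResponse().getWriter().write(eventJson);
      ctx.complete();                // completes the held HTTP request
    }
  }
}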
5.2.7 GUI integration
The FIRST system will comprise several demonstrator applications (e.g. sentiment extraction, document stream visualisation, etc.) that will be included in the FIRST Integrated Financial Market Information System. This will be the one common GUI, providing a common "entry point" for showcasing FIRST functionalities. All applications will be exposed as web applications; therefore, extra integration will be required. GUI integration will follow a decoupled, widget-based approach, where the web application is composed of many smaller widgets (small atomic applications) that may be deployed on different servers and display data provided by multiple sources (e.g. pipeline components). From the end user's point of view, however, they will remain one consistent application. Technologically, GUI integration will follow the approach presented in section 5.2.6.3, using the technologies analyzed in section 4.3. Details will be further elaborated within the work of WP7 and the associated deliverables.
5.2.8 Role of the storage components
As outlined before, the different steps of the processing pipeline communicate directly with each other by using messaging technology to forward data items. Thereby, several advantages are achieved, as subsequent processing steps are provided with new data as it becomes available and the potential bottleneck of using a central storage facility for massive data exchange is bypassed. Using a central storage facility would cause database overhead due to inserting and querying data and would also require notifying the downstream components in some way to make them aware that new data is available for further processing. As near-real-time processing of huge amounts of data is in the scope of FIRST, it has been decided to circumvent this time-consuming procedure and to forward data directly to downstream components in the processing pipeline.
Nevertheless, at each point where data is handed over from one work package to the responsibility of another work package, it is also written to the central storage facility. This means that, in parallel to the emission of messages along the processing pipeline, the passed data is also inserted into the storage facility. Thereby, the storage component is not directly involved in the processing pipeline, which avoids the aforementioned drawbacks, but it nevertheless holds all relevant data for potential back-testing or analysis purposes. While it is solely a data drain from the processing pipeline perspective, it serves as a rich data source for back-testing and analysis.
Besides its passive role along the processing pipeline, the knowledge base also holds periodically archived versions of the evolved ontology as well as the results of the sentiment analysis, which are to be used both by the decision models and by the integrated GUIs built on top of the FIRST system.
6. Deployment
6.1. Hosting platform
The FIRST pipeline is currently running on a single dedicated server machine. The machine has 25 terabytes of disk space plus 200 gigabytes of solid-state disk. It has 256 gigabytes of memory and is able to run up to 48 concurrent processes (4 processors, 12 cores each), running a 64-bit Windows Server 2008 operating system. At this time, it is expected that this machine will be sufficient to analyze the data being acquired in FIRST in real time. However, if these capacities prove insufficient, the integration middleware (ZeroMQ messaging technology) will be used to distribute the data-analysis pipeline across several machines. Technically, this means that the pipeline will end with a ZeroMQ emitter at some point and continue with a ZeroMQ receiver on a different machine. In this setting, the data will be processed to a certain stage on one machine and then sent via the messaging "channel" to another machine in the network for further processing. In addition, it would also be possible to apply scaling techniques (e.g. load balancing) across different machines, as pointed out in (FIRST D2.3 Scaling Strategy, 2011). Other aspects, such as instantiating processing components dynamically (on demand), are also conceivable, but we currently do not plan to implement these complex scaling mechanisms in FIRST. Note that in reality we do not run one single pipeline but rather a workflow consisting of several pipelines. Even so, the intuitions and principles for the distributed processing of FIRST data streams remain the same.
6.2. Deployment scenarios
As pointed out in the previous chapter, messaging middleware provides opportunities for flexible deployment, taking into account aspects such as scalability and hardware availability. In the primary deployment scenario, the FIRST system will run on a single, dedicated machine, as pointed out in 6.1. However, for the sake of further system exploitation, more deployment scenarios are considered. As the architecture supports system distribution (through the ZeroMQ messaging middleware), and given that the network connection between the nodes is reliable and fast, further deployment possibilities may be considered. Apart from the standalone deployment, those are:
 Deploying the FIRST system on multiple commodity machines. By distributing the FIRST processing pipeline over more machines, we can use a larger number of less powerful (and cheaper) servers to approximate the processing power of better-performing, expensive servers. Such a scenario, based on distributing pipeline components (pipeline splitting and parallelization), has been described from a technical point of view in D2.3 Section 2.2. Distributing the FIRST system across many nodes may obviously affect the timeliness of processing, but it may still be desirable from an economic point of view.
 Deploying FIRST in a cloud-based environment. This scenario might be suitable where pay-as-you-go cloud services are considered for deployment. In such a case, FIRST might be deployed on a number of virtual machines, based on the target resource usage demand. However, it is not in the scope of the project to provide dynamically controlled deployment for the cloud; rather, a static deployment is considered, based on the required throughput estimations.
It must be stated that those scenarios, though conceivable and viable within the architectural scope, do not have to be supported in the further development of the project if they prove needless from an exploitation or business point of view.
© FIRST consortium
Page 47 of 53
D2.2
6.3. Industrial perspective of the FIRST system
The FIRST system and its computation results provide an important data source to complement existing financial industry applications. In this section we provide the perspective of Use Case 1 (UC1) with regard to the FIRST architecture and its context within the financial industry.
The area of capital markets compliance involves the processing of large amounts of information regarding the trading activities of financial institutions, the customers and employees effecting these trades, reference market data, and, new and importantly, announcements and other textual information which may affect the prices of securities. These data are delivered into the existing compliance architecture by way of interfaces from various trading and settlement systems, market data providers such as Thomson Reuters and Bloomberg, and other sources of static data. There are also large flows of unstructured but highly relevant information, such as news releases, earnings announcements, and other textual information from these and many other sources, which until now have not been amenable to automated processing. This data in structured form, soon available as a result of the FIRST project, may be incorporated into the stream of financial data in order to provide additional analysis scenarios.
The FIRST architecture provides the necessary connectivity means to facilitate integration with other financial systems. Such integration might be realized (i) in a data-oriented way, by accessing databases with the results of unstructured data processing, or (ii) in a service-oriented way, by invoking high-level FIRST services to request certain functionality (e.g. alert reports). Such data may later be used by system providers to augment existing financial systems with new analytical perspectives, e.g. by integrating it into existing analysts' cockpits (see Figure 25).
Figure 25: Instrument Cockpit of the MACOC application augmented with FIRST data (mockup); the analyst can inspect the sentiment index of a chosen suspicious instrument next to alerts coming from the BNEXT MACOC platform, enriched with extra data from FIRST
The overall result will enable a significant increase in the scope of the existing automated architectures. In the aforementioned UC1 example, financial institutions and regulatory authorities will be given new tools to better ensure compliance in the trading environment. By integrating unstructured information into these architectures, it will become possible to automate trading surveillance to detect entirely new kinds of scenarios involving the misuse of such unstructured information.
7. Conclusion
This document presents a coherent integrated architecture as a baseline for the development of the FIRST system. By juxtaposing the requirements and the available state-of-the-art technologies, we chose appropriate techniques in order to deliver a suitable design and technical description that will ensure that the project criteria are met. This document also establishes technical communication between the technical partners. It is an important input for future development in the following months, especially in the scope of WP7 system integration, but also for the other technical workpackages.
Annex 1. Requirements groups
The following table presents the system requirements grouped into 8 functional categories as an input for analyzing project coverage from a technical point of view. See Section 2 for details. The categories are:
1. Data Feeding and acquisition
2. Retrieval of topics and features
3. Retrieval and Classification of Sentiments
4. User Interface & Service delivery
5. Access control and security
6. Maintenance and configuration
7. Storage, persistence and data access
8. Decision Support system and visualisations
[Full-page table mapping the individual UC1, UC2 and UC3 requirements (e.g. UC1.R-E1.1, UC2.R-F3, UC3.R-EU7.1) to the eight categories above.]